
Velocity v8
Data Warehousing Methodology

Data Warehousing

Executive Summary

Data Warehousing, once dedicated to business intelligence and reporting, and usually at the departmental or business unit level, is today becoming a strategic corporate initiative supporting an entire enterprise across a multitude of business applications. This brisk pace of change, coupled with industry consolidation and regulatory requirements, demands that data warehouses step into a mission-critical, operational role.

Information Technology (IT) plays a crucial role in delivering the data foundation for key
performance indicators such as revenue growth, margin improvement and asset
efficiency at the corporate, business unit and departmental levels. And IT now has the
tools and methods to succeed at any of these levels. An enterprise-wide, integrated
hub is the most effective approach to track and improve fundamental business
measures. It is not only desirable; it is necessary and feasible. Here are the reasons
why:

● The traditional approach of managing information across divisions, geographies, and segments through manual consolidation and reconciliation is error-prone and cannot keep pace with the rapid changes and stricter mandates in the business.
● The data must be trustworthy. Executive officers are responsible for the
accuracy of the data used to make management decisions, as well as for
financial and regulatory reporting.
● Technologies have matured to the point where industry leaders are reaping
the benefits of enterprise-wide data solutions, increasing their understanding
of the market, and improving their agility.

Organizations may choose to implement different levels of Data Warehouses, from line-of-business implementations to Enterprise Data Warehouses. As the size and scope of a Warehouse increase, so do the complexity, risk, and effort. For those that
achieve an Enterprise Data Warehouse, the benefits are often the greatest. However,
an organization must be committed to delivering an Enterprise Data Warehouse and
must ensure the resources, budget and timeline are sufficient to overcome the
organizational hurdles to having a single repository of corporate data assets.



Business Drivers

The primary business drivers behind a data warehouse project vary and can be very organization-specific. However, a few general trends can be seen across most organizations. Below are some of the key drivers that typically motivate data warehouse projects:

Desire for a ‘360 degree view’ around customers, products, or other subject areas

In order to make effective business decisions and have meaningful interactions with customers, suppliers, and other partners, it is important to gather information from a variety of systems to provide a ‘360 degree view’ of the entity. For example, consider a software company looking to provide a 360 degree view of its customers. Providing this view may require gathering and relating sales orders, prospective sales interactions, maintenance payments, support calls, and services engagements. These
items merged together paint a more complete picture of a particular customer’s value
and interaction with the organization. The challenge is that in any organization this data
might reside in numerous systems with different customer codes and structures across
different technologies, making it nearly impossible to produce a single report programmatically. Thus a need arises for a centralized location to merge and rationalize this data for easy reporting, such as a Data Warehouse.
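
To make this concrete, the following sketch (in Python, with invented system names, codes, and fields) illustrates the kind of consolidation a warehouse performs: records keyed by different customer codes in two hypothetical source extracts are related through a cross-reference and merged into a single customer view.

```python
# Hypothetical extracts from two systems that identify the same customer
# with different codes; all names and values are illustrative only.
crm_orders = [
    {"crm_id": "C-1001", "customer": "Acme Corp", "order_total": 25000.0},
    {"crm_id": "C-1001", "customer": "Acme Corp", "order_total": 7500.0},
]
support_calls = [
    {"support_acct": "ACME01", "open_tickets": 3},
]

# A cross-reference mapping each source-system code to one warehouse-level
# customer key is what makes the single view possible.
customer_xref = {
    ("CRM", "C-1001"): "CUST-42",
    ("SUPPORT", "ACME01"): "CUST-42",
}

def build_360_view(orders, calls, xref):
    """Merge per-system records into one row per warehouse customer key."""
    view = {}
    for row in orders:
        key = xref[("CRM", row["crm_id"])]
        cust = view.setdefault(key, {"name": row["customer"],
                                     "total_sales": 0.0, "open_tickets": 0})
        cust["total_sales"] += row["order_total"]
    for row in calls:
        key = xref[("SUPPORT", row["support_acct"])]
        cust = view.setdefault(key, {"name": None,
                                     "total_sales": 0.0, "open_tickets": 0})
        cust["open_tickets"] += row["open_tickets"]
    return view

print(build_360_view(crm_orders, support_calls, customer_xref))
# {'CUST-42': {'name': 'Acme Corp', 'total_sales': 32500.0, 'open_tickets': 3}}
```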

Desire to provide intensive analytics reporting without impacting operational systems

Operational systems are built and tuned for the best operational performance possible.
A slowdown in an order entry system may cost a business lost sales and decreased
customer satisfaction. Given that analytic reporting often requires summarizing and
gathering large amounts of information, queries against operational systems for
analytic purposes are usually discouraged and even outright prohibited for fear of
impacting system performance. One key value of a data warehouse is the ability to
access large data sets for analytic purposes while remaining physically separated from
operational systems. This ensures that operational system performance is not
adversely affected by analytic work and that business users are free to crunch large
data sets and metrics without impacting daily operations.

Maintaining or generating historical records

In most cases, operational systems only store current state information on orders,
transactions, customers, products and other data. Historical information has little use in
the operational world. Point-of-sale transactions, for example, may be purged from operational systems after 30 days, when the return policy expires. When organizations
have a need for historical reporting it is often difficult or impossible to gather historical
values from operational systems due to their very nature. By implementing a Data
Warehouse where data is pulled in on a specified interval, historical values and
information can be retained in the warehouse for any length of time an organization
determines necessary. Data can also be stored and organized more efficiently for easy
retrieval for analytical purposes.
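
One common pattern for retaining history that source systems discard is a type-2 ‘slowly changing dimension’, in which each scheduled load closes out rows whose values have changed and inserts a new, dated version. The sketch below is a simplified illustration of that pattern under assumed data structures, not a prescribed implementation.

```python
from datetime import date

def apply_scd2(warehouse_rows, source_rows, load_date):
    """Type-2 history: expire changed rows and append new versions.

    warehouse_rows: list of dicts with keys
        key, attributes, valid_from, valid_to (None means current)
    source_rows: dict of key -> attributes as of load_date
    """
    current = {r["key"]: r for r in warehouse_rows if r["valid_to"] is None}
    for key, attrs in source_rows.items():
        existing = current.get(key)
        if existing is not None and existing["attributes"] == attrs:
            continue  # unchanged, keep the current version
        if existing is not None:
            existing["valid_to"] = load_date  # close out the old version
        warehouse_rows.append({"key": key, "attributes": attrs,
                               "valid_from": load_date, "valid_to": None})
    return warehouse_rows

history = [{"key": "PROD-7", "attributes": {"price": 9.99},
            "valid_from": date(2008, 1, 1), "valid_to": None}]
today_source = {"PROD-7": {"price": 10.49}}
for row in apply_scd2(history, today_source, date(2008, 2, 1)):
    print(row)
# The 9.99 row is closed out on 2008-02-01 and a 10.49 row becomes current,
# so both the old and the new price remain queryable.
```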

Standardizing on common definitions of corporate metrics across organizational boundaries

As organizations grow, different areas may develop their own interpretations of business definitions and objects. To one group, a customer might be anyone who has purchased something from the web site, while another group considers any business or individual that has received services to be a customer. In order to
standardize reporting and consolidation of these areas, organizations will embark on a
data warehouse project to define and calculate these metrics in a common fashion
across the organization.
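
A practical way to enforce such a standard is to centralize the calculation so that every report applies the same rule. The fragment below sketches the idea with an invented ‘customer’ definition; in practice the rule would be agreed with the business and implemented once in the warehouse's data integration or semantic layer.

```python
from datetime import date, timedelta

def is_customer(record, as_of=None):
    """Single, shared definition of 'customer' used by every report.

    Illustrative rule only: anyone with a purchase or a service engagement
    in the last 24 months counts as a customer.
    """
    as_of = as_of or date.today()
    cutoff = as_of - timedelta(days=730)
    return (record.get("last_purchase") or date.min) >= cutoff or \
           (record.get("last_service") or date.min) >= cutoff

print(is_customer({"last_purchase": date(2008, 1, 15)}, as_of=date(2008, 5, 27)))  # True
print(is_customer({"last_service": date(2004, 3, 1)}, as_of=date(2008, 5, 27)))    # False
```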

There are many other specific business drivers that can spur the need for a Data
Warehouse; those above, however, are among the most commonly seen across organizations.

Key Success Factors

To ensure success for a Data Warehouse implementation, there are key success
factors that must be kept in mind throughout the project. Data warehouses are often built by IT staff who have been pulled or moved from other efforts, such as system implementations and upgrades. In these cases, the process for
implementing a Data Warehouse can be quite a change from past IT work. These Key
Success Factors point out important topics to consider as you begin project planning.

Understanding Key Characteristics of a Data Warehouse

When embarking on a Data Warehouse project, it is important to recognize and keep in mind the key differentiators between a Data Warehouse project and a typical system implementation. Some of these differences are:

● Data Sources are from many disparate systems, internal and external to the
organization
● Data models are used to understand relationships and business rules within the data
● Data volumes for both Data Integration and analytic reporting are high
● Historical data is maintained, often for periods of years
● Data is often stored as both detailed level data and summarized or aggregated
data
● The underlying database system is tuned for querying large volumes of data
rather than for inserting single transaction data
● Data Warehouse data supports tactical and strategic decision making, rather
than operational processing
● A successful Data Warehouse is business driven. The goal of any Data
Warehouse is to identify the business information needs
● Data Warehouse applications can have enterprise-wide impact and visibility
and often enable reporting at the highest levels within an organization
● As an enterprise level application, executive level support is vital to success

Typically data must be modeled, structured and populated in a relational database for it
to be available for a Data Warehouse reporting project. Data Integration is designed
based on available Operational Application sources to pull, cleanse, transform and
populate an Enterprise Subject Area Database. Once the data is present in the Subject
Area Database, projects can fulfill their requirements to provide Business Intelligence
reporting to the Data Warehouse end users. This is done by identifying detailed
reporting requirements and designing corresponding Business Intelligence Data Marts
that capture, at the appropriate grain, all of the facts and dimensions needed for reporting.
These Data Marts are then populated using a Data Integration process and coupled to
the Reporting components developed in the Business Intelligence tool.
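
As a minimal, hypothetical sketch of the final step in that flow, the script below builds one dimension and one fact table at a daily grain and reports from them using SQLite; the table and column names are invented for illustration and do not represent any particular data model.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# A simple star schema: one dimension plus a fact table whose grain is
# one row per customer per day (the fact at its appropriate grain).
cur.executescript("""
CREATE TABLE dim_customer (
    customer_key INTEGER PRIMARY KEY,
    customer_name TEXT,
    region TEXT
);
CREATE TABLE fact_sales (
    customer_key INTEGER REFERENCES dim_customer(customer_key),
    sale_date TEXT,
    sales_amount REAL
);
""")

# Rows that would normally arrive from the subject-area database via a
# data integration process; the values here are illustrative only.
cur.execute("INSERT INTO dim_customer VALUES (1, 'Acme Corp', 'West')")
cur.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)",
                [(1, "2008-05-01", 1200.0), (1, "2008-05-02", 800.0)])

# The Business Intelligence layer then reports off the star, e.g. sales by region.
for row in cur.execute("""
    SELECT d.region, SUM(f.sales_amount)
    FROM fact_sales f JOIN dim_customer d USING (customer_key)
    GROUP BY d.region"""):
    print(row)   # ('West', 2000.0)
```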

Understanding Common Data Warehouse Project Types

Not every Data Warehouse project is a brand new implementation of a Data Warehouse. Often Warehouses are deployed in phases where subsequent
implementations are simply adding new subject areas, new data sources or enhanced
reporting to the existing solution. General categories of Data Warehouse projects have
been defined below along with key considerations for each.

New Business Data Project

This type of project addresses the need to gather data from an area of the enterprise
where no prior familiarity with the business data or requirements exists. All project
components are required.



Logical data modeling is a crucial step in this type of project, as there is a need to
thoroughly understand and model the data requirements from a business perspective.
The logical data model serves as the foundation and blueprint upon which all follow-on project work will be based. If it is not properly and sufficiently addressed, success for this type of project will be difficult to achieve, and after-the-fact rework will be costly in both time and money.

Physical data modeling and data discovery components drive the identification and design of the new database requirements and the new data sources. New Data
Integration processes must be created to bring new data into new Data Warehouse
database structures. A set of history loads may be required to backload the data and
bring it up to the current timeline. New Dimensional Data Mart and BI Reporting
offerings must be modeled, designed and implemented to satisfy the user information
and access needs.

Enhanced Data Source Project

This type of project addresses the need to add a new data source or to alter an existing
data source, but always within the context of already established logical data structures
and definitions. No logical data modeling is needed because no new business data
requirements are being entertained. Minor adjustments to the physical model and database may be needed to accommodate changes in volume due to the new source, or new or altered views may be needed to report on the data instances that are now available to the users.

Data discovery analysis comprises a key portion of this type of project, as does the
corresponding new or altered data integration processes that move the data to the
database. Business intelligence reports and queries may need to change to incorporate
new views or expanded drill-downs and data value relationships. Back loading
historical data may also be required. When enhancing existing data, Metadata
management efforts to track data from the physical data sources through the data
integration process and to business intelligence data marts and reports can assist with
impact analysis and scoping efforts.

Enhanced Business Intelligence Requirements-Only Project

This type of project is focused solely on the expansion or alteration of the business
intelligence reporting and query capability using existing subject area data. This type of
project does not entertain the introduction of any new or altered data (in structure or
content) to the warehouse subject area database. New or altered dimensional data
mart tables/views may be required to support the business intelligence enhancements;
otherwise the majority, if not all, of the work is within the business intelligence development component.

Executive Support and Organizational Buy-In

Successful Data Warehouse projects are usually characterized by strong organizational commitment to the delivery of enterprise analytics. The value and return on investment
(ROI) must be clearly articulated early in the project and an acknowledgement of the
cost and time to achieve results needs to be fully explored and understood.

Typically, Data Warehouse project efforts involve a steering committee of executives and business leads that drives the priority and overall vision for the organization.
Executive commitment is not only needed for assigning appropriate resources and
budget, but also to assist the data warehouse team in breaking down organizational
barriers.

It is not uncommon for a data warehouse team to encounter challenges in getting access to the data and systems necessary to build the data warehouse. Operational
system owners are focused on their system and its primary function and have little
interest in making their data available for warehousing efforts. At these times, the Data
Warehouse steering committee can step in or rally executive support to break down
these barriers in the organization.

It is important to assess the business case and executive sponsorship early on in the
Data Warehouse project. The project is at risk if the business value of the warehouse
cannot be articulated at the executive level and on down through the organization. If
executives do not have a clear picture of how the data warehouse will impact their
business and the value it will provide, it won't be long before a decision is made to
reduce or stop funding the effort.

Enterprise Vision and Milestone Delivery

The Data Warehouse team should always keep the end goal in mind for an enterprise-wide data warehouse. Often an enterprise data warehouse will strive to achieve a
‘single source of truth’ across the entire enterprise and across all data stores.
Delivering this in a ‘big bang’ approach nearly always fails. By the time all of the
enterprise modeling, data rationalization and data integration have taken place across
all facets of the organization, the value of the project is called into question, and the
project is either delayed or cancelled.

In order to keep an Enterprise Data Warehouse on track, phases of deployment should be scheduled to provide value quickly and continuously throughout the lifecycle of the project. It is important for project teams to find areas of high business value that can be
delivered quickly and then build upon that success as the enterprise vision is realized.
The string of regular success milestones and business value keeps the executive
sponsorship engaged and proves the value of the Data Warehouse to the organization
early and often.

The key to remember is that while these short term milestones are delivered, the Data
Warehouse Team should not lose sight of the end goal of the enterprise vision. For
example, when implementing customer retention metrics for two key systems as an early ‘win’, be sure to consider the five other systems in the organization and try to ensure that the model and process are flexible enough that the current work will not need to be re-architected when this data is added in a later phase. Keep the final goal
in mind when designing and building the incremental milestones.

Flexible and Adaptable Reporting

End-user reporting must provide flexibility, offering straightforward reports for basic users and, for analytic users, drill-downs and roll-ups, views of both summary and detailed data, and ad-hoc reporting. Report design that is too rigid may lead to clutter
(as multiple structures are developed for reports that are very similar to each other) in
the Business Intelligence Application and in the Data Integration and Data Warehouse
Database contents. Providing flexible structures and reports allows data to be queried
from the same reports and database structures without redundancy or the time required
to develop new objects and Data Integration processes. Users can create reports from
the common structures, thus removing the bottleneck of IT activities and the need to
wait for development. Data modeling and physical database structures that reflect the
business model (rather than the requirements for a single report) enable flexibility as a
by-product.

Summary and Detail Data

Often reporting requirements are defined for summary data. While summary data may
be available from transaction and operational systems, it is best to bring the detailed
data into the Data Warehouse and summarize based on that detail. This avoids
potential problems due to different calculation methods, aggregating on different criteria
and other ways in which the summary data brought in as a source might differ from roll-
ups that begin with the raw detailed records. Because the summary offers smaller
database table sizes, it may be tempting to bring this data in first, and then bring in the
detailed data at a later stage in order to drill down to the details. Having standard
sources of the raw data and using the same source for various summaries increases
the quality of the data and avoids ending up with multiple versions of the truth.
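
The sketch below illustrates the point with invented data: a monthly summary is derived from the raw detail rows rather than loaded from a pre-computed source, so every roll-up starts from the same detail and the figures reconcile by construction.

```python
from collections import defaultdict

# Detailed transactions as they would be loaded into the warehouse.
detail = [
    {"region": "West", "date": "2008-05-03", "amount": 120.0},
    {"region": "West", "date": "2008-05-17", "amount": 80.0},
    {"region": "East", "date": "2008-05-09", "amount": 200.0},
]

def monthly_summary(rows):
    """Roll detail up to (region, month); any other summary would be
    built from the same detail, so totals always tie out."""
    totals = defaultdict(float)
    for r in rows:
        month = r["date"][:7]          # 'YYYY-MM'
        totals[(r["region"], month)] += r["amount"]
    return dict(totals)

print(monthly_summary(detail))
# {('West', '2008-05'): 200.0, ('East', '2008-05'): 200.0}
```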



Engage Business Users Early

Business Users must be engaged throughout the entire development process. The resulting reports and supporting data from the Data Warehouse project should answer business questions. If users are not involved, it is probable that the end result will not meet their needs for business data, and the overall success of the project is diminished. The more the business users feel that the solution is focused on solving their analytic needs, the more likely it is to be adopted.

Thorough Data Validation and Monitoring

Once lost, trust is difficult to regain. As a Data Warehouse is rolled out (and throughout its existence), it is important to thoroughly validate the data it contains in order to maintain the end users' trust in the data warehouse analytics. If a key metric is incorrect (e.g., the gross sales amount for a region in a particular month), end users may lose confidence in the system and all of its reports and metrics. If users lose faith in the
analytics, this can hamper enterprise adoption and even spell the end of a data
warehouse.

Not only is thorough testing and validation required to ensure that data is loaded
completely and accurately into the warehouse, but organizations will often create ongoing balancing and auditing procedures. These procedures are run on a regular basis
to ensure metrics are accurate and that they ‘tie out’ with source systems. Sometimes
these procedures are manual and sometimes they are automated. If the warehouse is suspected to be inaccurate, or a daily load fails to run, communications are initiated with end users to alert them to the problem. It is better to limit user reporting for a
morning until the issues are addressed, than to risk that an executive makes a critical
business decision with incorrect data.
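
A balancing procedure can be as simple as comparing a control total from the source extract against the corresponding total in the warehouse and alerting users when the two fail to tie out. The sketch below outlines that idea; the tolerance, metric name, and notification step are assumptions for illustration.

```python
def balance_check(source_total, warehouse_total, metric_name, tolerance=0.01):
    """Compare a source control total against the warehouse total.

    Returns True when the load 'ties out'; otherwise an error is raised
    so users can be alerted before they report off suspect data.
    """
    difference = abs(source_total - warehouse_total)
    if difference <= tolerance:
        return True
    # In practice this would notify the support team and flag the reports;
    # here we simply surface the discrepancy.
    raise RuntimeError(
        f"{metric_name} does not tie out: source={source_total}, "
        f"warehouse={warehouse_total}, difference={difference}")

balance_check(1_250_000.00, 1_250_000.00, "gross sales, May 2008")    # passes
# balance_check(1_250_000.00, 1_249_100.00, "gross sales, May 2008")  # raises
```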

Last updated: 27-May-08 23:05



Roles

● Velocity Roles and Responsibilities
● Application Specialist
● Business Analyst
● Business Project Manager
● Data Architect
● Data Integration Developer
● Data Quality Developer
● Data Steward/Data Quality Steward
● Data Warehouse Administrator
● Database Administrator (DBA)
● End User
● Metadata Manager
● PowerCenter Domain Administrator
● Presentation Layer Developer
● Production Supervisor
● Project Sponsor
● Quality Assurance Manager
● Repository Administrator
● Technical Architect
● Technical Project Manager
● Test Engineer
● Test Manager
● Training Coordinator
● User Acceptance Test Lead



Velocity Roles and Responsibilities

The following pages describe the roles used throughout this Guide, along with the responsibilities typically
associated with each. Please note that the concept of a role is distinct from that of an employee or full time
equivalent (FTE). A role encapsulates a set of responsibilities that may be fulfilled by a single person in a part-time or full-
time capacity, or may be accomplished by a number of people working together. The Velocity Guide refers to roles with an
implicit assumption that there is a corresponding person in that role. For example, a task description may discuss the involvement
of "the DBA" on a particular project; however, there may be one or more DBAs, or a person whose part-time responsibility is
database administration.

In addition, note that there is no assumption of staffing level for each role -- that is, a small project may have one individual filling
the role of Data Integration Developer, Data Architect, and Database Administrator, while large projects may have multiple
individuals assigned to each role. In cases where multiple people represent a given role, the singular role name is used, and
project planners can specify the actual allocation of work among all relevant parties. For example, the methodology always refers
to the Technical Architect, when in fact, there may be a team of two or more people developing the Technical Architecture for a
very large development effort.

Data Integration Project - Sample Organization Chart

Last updated: 20-May-08 18:51



Application Specialist

Successful data integration projects are built on a foundation of thorough understanding of the source and target applications. The Application Specialist is responsible for providing
detailed information on data models, metadata, audit controls and processing controls
to Business Analysts, Technical Architects and others regarding the source and/or
target system. This role is normally filled by someone from a technical background who
is able to query/analyze the data ‘hands-on’. The person filling this role should have a
good business understanding of how the data is generated and maintained and good
relationships with the Data Steward and the users of the data.

Reports to:

● Technical Project Manager

Responsibilities:

● Authority on application system data and process models


● Advises on known and anticipated data quality issues
● Supports the construction of representative test data sets

Qualifications/Certifications

● Possesses excellent communication skills, both written and verbal


● Must be able to work effectively with both business and technical stakeholders
● Works independently with minimal supervision

Recommended Training

● Informatica Data Explorer

Last updated: 09-Apr-07 15:38



Business Analyst

The primary role of the Business Analyst (sometimes known as the Functional Analyst) is to represent the interests of the business in the development of the data integration solution. The
secondary role is to function as an interpreter for business and technical staff,
translating concepts and terminology and generally bridging gaps in understanding.

Under normal circumstances, someone from the business community fills this role,
since deep knowledge of the business requirement is indispensable. Ideally, familiarity
with the technology and the development life-cycle allows the individual to function as
the communications channel between technical and business users.

Reports to:

● Business Project Manager

Responsibilities:

● Ensures that the delivered solution fulfills the needs of the business (should be
involved in decisions related to the business requirements)
● Assists in determining the data integration system project scope, time and
required resources
● Provides support and analysis of data collection, mapping, aggregation and
balancing functions
● Performs requirements analysis, documentation, testing, ad-hoc reporting,
user support and project leadership
● Produces detailed business process flows, functional requirements
specifications and data models and communicates these requirements to the
design and build teams
● Conducts cost/benefit assessments of the functionality requested by end-users
● Prioritizes and balances competing priorities
● Plans and authors the user documentation set

Qualifications/Certifications

● Possesses excellent communication skills, both written and verbal



● Must be able to work effectively with both business and technical stakeholders
● Works independently with minimal supervision
● Has knowledge of the tools and technologies used in the data integration
solution
● Holds certification in industry vertical knowledge (if applicable)

Recommended Training

● Interview/workshop techniques
● Project Management
● Data Analysis
● Structured analysis
● UML or other business design methodology
● Data Warehouse Development

Last updated: 09-Apr-07 15:20



Business Project
Manager

The Business Project Manager has overall responsibility for the delivery of the data integration solution. As such, the Business Project Manager works
with the project sponsor, technical project manager, user community, and development
team to strike an appropriate balance of business needs, resource availability, project
scope, schedule, and budget to deliver specified requirements and meet customer
satisfaction.

Reports to:

● Project Sponsor

Responsibilities:

● Develops and manages the project work plan


● Manages project scope, time-line and budget
● Resolves budget issues
● Works with the Technical Project Manager to procure and assign the
appropriate resources for the project
● Communicates project progress to Project Sponsor(s)
● Is responsible for ensuring delivery on commitments and ensuring that the
delivered solution fulfills the needs of the business
● Performs requirements analysis, documentation, ad-hoc reporting and project
leadership

Qualifications/Certifications

● Translates strategies into deliverables


● Prioritizes and balances competing priorities
● Possesses excellent communication skills, both written and verbal
● Results oriented team player
● Must be able to work effectively with both business and technical stakeholders
● Works independently with minimal supervision
● Has knowledge of the tools and technologies used in the data integration solution
● Holds certification in industry vertical knowledge (if applicable)

Recommended Training

● Project Management

Last updated: 06-Apr-07 17:55



Data Architect

The Data Architect is responsible for the delivery of a robust, scalable data architecture that meets the business goals of the organization. The Data Architect develops the logical
data models, and documents the models in Entity-Relationship Diagrams (ERD). The
Data Architect must work with the Business Analysts and Data Integration Developers
to translate the business requirements into a logical model. The logical model is
captured in the ERD, which then feeds the work of the Database Administrator, who
designs and implements the physical database.

Depending on the specific structure of the development organization, the Data Architect
may also be considered a Data Warehouse Architect, in cooperation with the Technical
Architect. This role involves developing the overall Data Warehouse logical
architecture, specifically the configuration of the data warehouse, data marts, and an
operational data store or staging area if necessary. The physical implementation of the
architecture is the responsibility of the Database Administrator.

Reports to:

● Technical Project Manager

Responsibilities:

● Designs an information strategy that maximizes the value of data as an


enterprise asset
● Maintains logical/physical data models
● Coordinates the metadata associated with the application
● Develops technical design documents
● Develops and communicates data standards
● Maintains Data Quality metrics
● Plans architectures and infrastructures in support of data management
processes and procedures
● Supports the build out of the Data Warehouse, Data Marts and operational
data store
● Effectively communicates with other technology and product team members

Qualifications/Certifications



● Strong understanding of data integration concepts
● Understanding of multiple data architectures that can support a Data
Warehouse
● Ability to translate functional requirements into technical design specifications
● Ability to develop technical design documents and test case documents
● Experience in optimizing data loads and data transformations
● Industry vertical experience is essential
● Project Solution experience is desired
● Has had some exposure to Project Management
● Has worked with Modeling Packages
● Has experience with at least one RDBMS
● Strong Business Analysis and problem solving skills
● Familiarity with Enterprise Architecture Structures (Zachman/TOGAF)

Recommended Training

● Modeling Packages
● Data Warehouse Development

Last updated: 01-Feb-07 18:51



Data Integration
Developer

The Data Integration Developer is responsible for the design, build, and deployment of the project's data integration component. A typical data integration
effort usually involves multiple Data Integration Developers developing the Informatica
mappings, executing sessions, and validating the results.

Reports to:

● Technical Project Manager

Responsibilities:

● Uses the Informatica Data Integration platform to extract, transform, and load
data
● Develops Informatica mapping designs
● Develops Data Integration Workflows and load processes
● Ensures adherence to locally defined standards for all developed components
● Performs data analysis for both Source and Target tables/columns
● Provides technical documentation of Source and Target mappings
● Supports the development and design of the internal data integration
framework
● Participates in design and development reviews
● Works with System owners to resolve source data issues and refine
transformation rules
● Ensures performance metrics are met and tracked
● Writes and maintains unit tests
● Conducts QA reviews
● Performs production migrations

Qualifications/Certifications

● Understands data integration processes and how to tune for performance


● Has SQL experience



● Possesses excellent communications skills
● Has the ability to develop work plans and follow through on assignments with
minimal guidance
● Has Informatica Data Integration Platform experience
● Is an Informatica Certified Designer
● Has RDBMS experience
● Has the ability to work with business and system owners to obtain
requirements and manage expectations

Recommended Training

● Data Modeling
● PowerCenter – Level I & II Developer
● PowerCenter - Performance Tuning
● PowerCenter - Team Based Development
● PowerCenter - Advanced Mapping Techniques
● PowerCenter - Advanced Workflow Techniques
● PowerCenter - XML Support
● PowerCenter - Data Profiling
● PowerExchange

Last updated: 01-Feb-07 18:51



Data Quality Developer

The Data Quality Developer (DQ Developer) is responsible for designing, testing, deploying, and documenting the project's data quality procedures and their outputs. The DQ Developer
provides the Data Integration Developer with all relevant outputs and results from the
data quality procedures, including any ongoing procedures that will run in the Operate
phase or after project-end. The DQ Developer must provide the Business Analyst with
the summary results of data quality analysis as needed during the project. The DQ
Developer must also document at a functional level how the procedures work within the
data quality applications. The primary tasks associated with this role are to use
Informatica Data Quality and Informatica Data Explorer to profile the project source
data, define or confirm the definition of the metadata, cleanse and accuracy-check the
project data, check for duplicate or redundant records, and provide the Data Integration
Developer with concrete proposals on how to proceed with the ETL processes.
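
Purely as a generic illustration of what a basic column profile and duplicate check capture (not the output of Informatica Data Quality or Data Explorer), the sketch below computes row counts, null counts, distinct counts, and duplicate keys for a small, invented data set.

```python
from collections import Counter

rows = [
    {"cust_id": "C1", "email": "a@example.com"},
    {"cust_id": "C2", "email": None},
    {"cust_id": "C1", "email": "a@example.com"},   # duplicate key
]

def profile(rows, columns):
    """Basic column profile: row count, nulls, and distinct values."""
    report = {"row_count": len(rows)}
    for col in columns:
        values = [r.get(col) for r in rows]
        report[col] = {
            "nulls": sum(v is None for v in values),
            "distinct": len(set(v for v in values if v is not None)),
        }
    return report

def duplicate_keys(rows, key):
    """Values of `key` that occur more than once."""
    counts = Counter(r[key] for r in rows)
    return [k for k, n in counts.items() if n > 1]

print(profile(rows, ["cust_id", "email"]))
print(duplicate_keys(rows, "cust_id"))   # ['C1']
```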

Reports to:

● Technical Project Manager

Responsibilities:

● Profile source data and determine all source data and metadata characteristics
● Design and execute Data Quality Audit
● Present profiling/audit results, in summary and in detail, to the business
analyst, the project manager, and the data steward
● Assist the business analyst/project manager/data steward in defining or
modifying the project plan based on these results
● Assist the Data Integration Developer in designing source-to-target mappings
● Design and execute the data quality plans that will cleanse, de-duplicate, and
otherwise prepare the project data for the Build phase
● Test Data Quality plans for accuracy and completeness
● Assist in deploying plans that will run in a scheduled or batch environment
● Document all plans in detail and hand-over documentation to the customer
● Assist in any other areas relating to the use of data quality processes, such as
unit testing

Qualifications/Certifications



● Has knowledge of the tools and technologies used in the data quality solution
● Results oriented team player
● Possesses excellent communication skills, both written and verbal
● Must be able to work effectively with both business and technical stakeholders

Recommended Training

● Data Quality Workbench I & II


● Data Explorer Level I
● PowerCenter Level I Developer
● Basic RDBMS Training
● Data Warehouse Development

Last updated: 15-Feb-07 17:34



Data Steward/Data
Quality Steward

The Data Steward owns the data and associated business and technical rules on behalf of the Project Sponsor. This role has responsibility for defining
and maintaining business and technical rules, liaising with the business and technical
communities, and resolving issues relating to the data. The Data Steward will be the
primary contact for all questions relating to the data, its use, processing and quality. In
essence, this role formalizes the accountability for the management of organizational
data.

Typically the Data Steward is a key member of a Data Stewardship Committee put into
place by the Project Sponsor. This committee will include business users and technical
staff such as Application Experts. There is often an arbitration element to the role
where data is put to different uses by separate groups of users whose requirements
have to be reconciled.

Reports to:

● Business Project Manager

Responsibilities:

● Records the business use for defined data


● Identifies opportunities to share and re-use data
● Decides upon the target data quality metrics
● Monitors the progress towards, and tuning of, data quality target metrics
● Oversees data quality strategy and remedial measures
● Participates in the enforcement of data quality standards
● Enters, maintains and verifies data changes
● Ensures the quality, completeness and accuracy of data definitions
● Communicates concerns, issues and problems with data to the individuals that
can influence change
● Researches and resolves data issues

Qualifications/Certifications



● Possesses strong analytical and problem solving skills
● Has experience in managing data standardization in a large organization,
including setting and executing strategy
● Previous industry vertical experience is essential
● Possesses excellent communication skills, both written and verbal
● Exhibits effective negotiating skills
● Displays meticulous attention to detail
● Must be able to work effectively with both business and technical stakeholders
● Works independently with minimal supervision
● Project solution experience is desirable

Recommended Training

● Data Quality Workbench Level I


● Data Explorer Level I

Last updated: 15-Feb-07 17:34



Data Warehouse
Administrator

The scope of the Data Warehouse Administrator role is similar to that of the DBA. A typical data integration solution, however, involves more than a single
target database and the Data Warehouse Administrator is responsible for coordinating
the many facets of the solution, including operational considerations of the data
warehouse, security, job scheduling and submission, and resolution of production
failures.

Reports to:

● Technical Project Manager

Responsibilities:

● Monitors and supports the Enterprise Data Warehouse environment


● Manages the data extraction, transformation, movement, loading, cleansing
and updating processes into the DW environment
● Maintains the DW repository
● Implements database security
● Sets standards and procedures for the DW environment
● Implements technology improvements
● Works to resolve technical issues
● Contributes to technical and system architectural planning
● Tests and implements new technical solutions

Qualifications/Certifications

● Experience in supporting Data Warehouse environments


● Familiarity with database, integration and presentation technology
● Experience in developing and supporting real-time and batch-driven data
movements
● Solid understanding of relational database models and dimensional data
models
● Strategic planning and system analysis



● Able to work effectively with both business and technical stakeholders
● Works independently with minimal supervision

Recommended Training

● DBMS Administration
● Data Warehouse Development
● PowerCenter Administrator Level I & II
● PowerCenter Security and Migration
● PowerCenter Metadata Manager

Last updated: 01-Feb-07 18:51



Database
Administrator (DBA)

The Database Administrator (DBA) in a Data Integration Solution is typically responsible for translating the logical model (i.e., the ERD) into a
physical model for implementation in the chosen DBMS, implementing the model,
developing volume and capacity estimates, performance tuning, and general
administration of the DBMS. In many cases, the project DBA also has useful
knowledge of existing source database systems. In most cases, a DBA's skills are tied
to a particular DBMS, such as Oracle or Sybase. As a result, an analytic solution with
heterogeneous sources/targets may require the involvement of several DBAs. The
Project Manager and Data Warehouse Administrator are responsible for ensuring that
the DBAs are working in concert toward a common solution.

Reports to:

● Technical Project Manager

Responsibilities:

● Plans, implements and supports enterprise databases


● Establishes and maintains database security and integrity controls
● Delivers database services while managing to policies, procedures and
standards
● Tests and implements new technical solutions
● Monitors and supports the database infrastructure (including clients)
● Develops volume and capacity estimates
● Proposes and implements enhancements to improve performance and
reliability
● Provides operational support of databases, including backup and recovery
● Develops programs to migrate data between systems
● Works to resolve technical issues
● Contributes to technical and system architectural planning
● Supports data integration developers in troubleshooting performance issues
● Collaborates with other Departments (i.e., Network Administrators) to identify
and resolve performance issues



Qualifications/Certifications

● Experience in database administration, backup and recovery


● Expertise in database configuration and tuning
● Appreciation of DI tool-set and associated tools
● Experience in developing and supporting ETL real-time and batch processes
● Strategic planning and system analysis
● Strong analytical and communication skills
● Able to work effectively with both business and technical stakeholders
● Ability to work independently with minimal supervision

Recommended Training

● DBMS Administration

Last updated: 01-Feb-07 18:51



End User

The End User is the ultimate "consumer" of the data in the data warehouse and/or data marts. As such, the end user represents a key customer constituent (management is another), and must therefore be heavily involved in the development of a data
integration solution. Specifically, a representative of the End User community must be
involved in gathering and clarifying the business requirements, developing the solution
and User Acceptance Testing (if applicable).

Reports to:

● Business Project Manager

Responsibilities:

● Gathers and clarifies business requirements


● Reviews technical design proposals
● Participates in User Acceptance testing
● Provides feedback on the user experience

Qualifications/Certifications

● Strong understanding of the business' processes


● Good communication skills

Recommended Training

● Data Analyzer - Quickstart


● Data Analyzer - Report Development

Last updated: 01-Feb-07 18:51



Metadata Manager

The Metadata Manager's primary role is to serve as the central point of contact for all corporate metadata management. This role involves setting the company's metadata
strategy, developing standards with the data administration group, determining
metadata points of integration between disparate systems, and ensuring the ability to
deliver metadata to business and technical users. The Metadata Manager is required to
work across business and technical groups to ensure that consistent metadata
standards are followed in all existing applications as well as in new development. The
Metadata Manager also monitors PowerCenter repositories for accuracy and metadata
consistency.

Reports to:

● Business Project Manager

Responsibilities:

● Formulates and implements the metadata strategy


● Captures and integrates metadata from heterogeneous metadata sources
● Implements and governs best practices relating to enterprise metadata
management standards
● Determines metadata points of integration between disparate systems
● Ensures the ability to deliver metadata to business and technical users
● Monitors development repositories for accuracy and metadata consistency
● Identifies and profiles data sources to populate the metadata repository
● Designs metadata repository models

Qualifications/Certifications

● Business sector experience is essential


● Experience in implementing and managing a repository environment
● Experience in data modeling (relational and dimensional)
● Experience in using repository tools
● Solid knowledge of general data architecture concepts, standards and best practices
● Strong analytical skills
● Excellent communication skills, both written and verbal
● Proven ability to work effectively with both business users and technical
stakeholders

Recommended Training

● DBMS Basics
● Data Modeling
● PowerCenter - Metadata Manager

Last updated: 01-Feb-07 18:51



PowerCenter Domain
Administrator

The PowerCenter Domain Administrator is responsible for administering the Informatica Data Integration environment. This involves the
management and administration of all components in the PowerCenter domain. The
PowerCenter Domain Administrator works closely with the Technical Architect and
other project personnel during the Architect, Build and Deploy phases to plan,
configure, support and maintain the desired PowerCenter configuration. The
PowerCenter Domain Administrator is responsible for the domain security configuration, licensing, and the physical installation and location of the services and nodes that compose
the domain.

Reports to:

● Technical Project Manager

Responsibilities:

● Manages the PowerCenter Domain, Nodes, Service Manager and Application


Services
● Develops Disaster recovery and failover strategies for the Data Integration
Environment
● Responsible for High Availability and PowerCenter Grid configuration
● Creates new services and nodes as needed
● Ensures proper configuration of the PowerCenter Domain components
● Ensures proper application of the licensing files to nodes and services
● Manages user and user group access to the domain components
● Manages backup and recovery of the domain metadata and appropriate
shared file directories
● Monitors domain services and troubleshoots any errors
● Applies software updates as required
● Tests and implements new technical solutions

Qualifications/Certifications



● Informatica Certified Administrator
● Experience in supporting Data Warehouse environments
● Experience in developing and supporting ETL real-time and batch processes
● Solid understanding of relational database models and dimensional data
models

Recommended Training

● PowerCenter Administrator Level I and Level II

Last updated: 01-Feb-07 18:51



Presentation Layer
Developer

The Presentation Layer Developer is responsible for the design, build, and deployment of the presentation layer component of the data integration
solution. This component provides the user interface to the data warehouses, data
marts and other products of the data integration effort. As the interface is highly visible
to the enterprise, a person in this role must work closely with end users to gain a full
understanding of their needs.

The Presentation Layer Developer designs the application, ensuring that the end-user
requirements gathered during the requirements definition phase are accurately met by
the final build of the application. In most cases, the developer works with front-end
Business Intelligence tools, such as Cognos, Business Objects and others. To be most
effective, the Presentation Layer Developer should be familiar with metadata concepts
and the Data Warehouse/Data Mart data model.

Reports to:

● Technical Project Manager

Responsibilities:

● Collaborates with end users and other stakeholders to define detailed requirements
● Designs business intelligence solutions that meet user requirements for
accessing and analyzing data
● Works with front-end business intelligence tools to design the reporting
environment
● Works with the DBA and Data Architect to optimize reporting performance
● Develops supporting documentation for the application
● Participates in the full testing cycle

Qualifications/Certifications

● Solid understanding of metadata concepts and the Data Warehouse/Data Mart model


● Aptitude with front-end business intelligence tools (i.e., Cognos, Business
Objects, Informatica Data Analyzer)
● Excellent problem solving and trouble-shooting skills
● Solid interpersonal skills and ability to work with business and system owners
to obtain requirements and manage expectations
● Capable of expressing technical concepts in business terms

Recommended Training

● Informatica Data Analyzer


● Data Warehouse Development

Last updated: 01-Feb-07 18:51



Production Supervisor

The Production Supervisor has operational oversight for the production environment and the daily execution of workflows, sessions and other data integration processes.
Responsibilities include, but are not limited to, training and supervision of system operators, review of execution statistics, and managing the scheduling of upgrades to the system and application software as well as the release of data integration processes.

Reports to:

● Information Technology Lead

Responsibilities:

● Manages the daily execution of workflows and sessions in the production


environment
● Trains and supervises the work of system operators
● Reviews and audits execution logs and statistics and escalates issues
appropriately
● Schedules the release of new sessions or workflows
● Schedules upgrades to the system and application software
● Ensures that work instructions are followed
● Monitors data integration processes for performance
● Monitors data integration components to ensure appropriate storage and
capacity for daily volumes

Qualifications/Certifications

● Production supervisory experience


● Effective leadership skills
● Strong problem solving skills
● Excellent organizational and follow-up skills

Recommended Training

● PowerCenter Level I Developer



● PowerCenter Team Based Development
● PowerCenter Advanced Workflow Techniques
● PowerCenter Security and Migration

Last updated: 01-Feb-07 18:51



Project Sponsor

The Project Sponsor is typically a member of the business community rather than an IT/IS resource. This is important because the lack of business sponsorship is often a
contributing cause of systems implementation failure. The Project Sponsor often
initiates the effort, serves as project champion, guides the Project Managers in
understanding business priorities, and reports status of the implementation to executive
leadership. Once an implementation is complete, the Project Sponsor may also serve
as "chief evangelist", bringing word of the successful implementation to other areas
within the organization.

Reports to:

● Executive Leadership

Responsibilities:

● Provides the business sponsorship for the project


● Champions the project within the business
● Initiates the project effort
● Guides the Project Managers in understanding business requirements and
priorities
● Assists in determining the data integration system project scope, time,
budget and required resources
● Reports status of the implementation to executive leadership

Qualifications/Certifications

● Has industry vertical knowledge

Recommended Training

● N/A

Last updated: 01-Feb-07 18:51



Quality Assurance
Manager

The Quality Assurance (QA) Manager ensures that the original intent of the business case is achieved in the actual implementation of the analytic
solution. This involves leading the efforts to validate the integrity of the data throughout
the data integration processes, and ensuring that the ultimate data target has been
accurately derived from the source data. The QA Manager can be a member of the IT
organization, but serve as a liaison to the business community (i.e., the Business
Analysts and End Users). In situations where issues arise with regard to the quality of
the solution, the QA Manager works with project management and the development
team to resolve them. Depending upon the test approach taken by the project team, the
QA Manager may also serve as the Test Manager.

Reports to:

● Technical Project Manager

Responsibilities:

● Leads the effort to validate the integrity of the data through the data integration
processes
● Ensures that the data contained in the data integration solution has been
accurately derived from the source data
● Develops and maintains quality assurance plans and test requirements
documentation
● Verifies compliance to commitments contained in quality plans
● Works with the project management and development teams to resolve issues
● Participates in the enforcement of data quality standards
● Communicates concerns, issues and problems with data
● Participates in the testing and post-production verification
● Together with the Technical Lead and the Repository Administrator, articulates
the development standards
● Advises on the development methods to ensure that quality is built in
● Designs the QA and standards enforcement strategy
● Together with the Test Manager, coordinates the QA and Test strategies



● Manages the implementation of the QA strategy

Qualifications/Certifications

● Industry vertical knowledge


● Solid understanding of the Software Development Life Cycle
● Experience in quality assurance performance, auditing processes, best
practices and procedures
● Experience with automated testing tools
● Knowledge of Data Warehouse and Data Integration enterprise environments
● Able to work effectively with both business and technical stakeholders

Recommended Training

● PowerCenter Level I Developer


● Informatica Data Explorer
● Informatica Data Quality Workbench
● Project Management

Last updated: 01-Feb-07 18:51



Repository
Administrator

The Repository Administrator is responsible for administering a PowerCenter or Data Analyzer Repository. This requires maintaining the organization
and security of the objects contained in the repository. It entails developing and maintaining the folder and schema structures; managing users, groups, roles, and global/local repository relationships; and handling backup and recovery. During the development
effort, the Repository Administrator is responsible for coordinating migrations,
maintaining database connections, establishing and promoting naming conventions
and development standards, and developing back-up and restore procedures for the
repositories. The Repository Administrator works closely with the Technical Architect
and other project personnel during the Architect, Build and Deploy phases to plan,
configure, support and maintain the desired PowerCenter and Data Analyzer
configuration.

Reports to:

● Technical Project Manager

Responsibilities:

● Develops and maintains the repository folder structure


● Manages user and user group access to objects in the repository
● Manages PowerCenter global/local repository relationships and security levels
● Coordinates migration of data during the development effort
● Establishes and promotes naming conventions and development standards
● Develops back-up and restore procedures for the repository
● Works to resolve technical issues
● Contributes to technical and system architectural planning
● Tests and implements new technical solutions

Qualifications/Certifications

● Informatica Certified Administrator


● Experience in supporting Data Warehouse environments



● Experience in developing and supporting ETL real-time and batch processes
● Solid understanding of relational database models and dimensional data
models

Recommended Training

● PowerCenter Administrator Level I and Level II


● Data Analyzer Introduction

Last updated: 01-Feb-07 18:51



Technical Architect

The Technical Architect is responsible for the conceptualization, design, and implementation of a sound technical architecture, which includes both hardware and
software components. The Architect interacts with the Project Management and design
teams early in the development effort in order to understand the scope of the business
problem and its solution. The Technical Architect must always consider both current
(stated) requirements and future (unstated) directions. Having this perspective helps to
ensure that the architecture can expand to correspond with the growth of the data
integration solution. This is particularly critical given the highly iterative nature of data
integration solution development.

Reports to:

● Technical Project Manager

Responsibilities:

● Develops the architectural design for a highly scalable, large volume enterprise solution
● Performs high-level architectural planning, proof-of-concept and software
design
● Defines and implements standards, shared components and approaches
● Functions as the Design Authority in technical design reviews
● Contributes to development project estimates, scheduling and development
reviews
● Approves code reviews and technical deliverables
● Assures architectural integrity
● Maintains compliance with change control, SDLC and development standards
● Develops and reviews implementation plans and contingency plans

Qualifications/Certifications

● Software development expertise (previous development experience of the application type)
● Deep understanding of all technical components of the application solution



● Understanding of industry standard data integration architectures
● Ability to translate functional requirements into technical design specifications
● Ability to develop technical design documents
● Strong Business Analysis and problem solving skills
● Familiarity with Enterprise Architecture Structures (Zachman/TOGAF) or
equivalent
● Experience and/or training in appropriate platforms for the project
● Familiarity with appropriate modeling techniques such as UML and ER
modeling as appropriate

Recommended Training

● Operating Systems
● DBMS
● PowerCenter Developer and Administrator - Level I
● PowerCenter New Features
● Basic and advanced XML

Last updated: 25-May-08 16:19



Technical Project Manager

The Technical Project Manager has overall responsibility for managing the technical resources within a project. As such, he/she works with the
project sponsor, business project manager and development team to assign the
appropriate resources for a project within the scope, schedule, and budget and to
ensure that project deliverables are met.

Reports to:

● Project Sponsor or Business Project Manager

Responsibilities:

● Defines and implements the methodology adopted for the project


● Liaises with the Project Sponsor and Business Project Manager
● Manages project resources within the project scope, time-line and budget
● Ensures all business requirements are accurate
● Communicates project progress to Project Sponsor(s)
● Is responsible for ensuring delivery on commitments and ensuring that the
delivered solution fulfills the needs of the business
● Performs requirements analysis, documentation, ad-hoc reporting and
resource leadership

Qualifications/Certifications

● Translates strategies into deliverables


● Prioritizes and balances competing priorities
● Must be able to work effectively with both business and technical stakeholders
● Has knowledge of the tools and technologies used in the data integration
solution
● Holds certification in industry vertical knowledge (if applicable)

Recommended Training



● Project Management Techniques
● PowerCenter Developer Level I
● PowerCenter Administrator Level I
● Data Analyzer Introduction

Last updated: 01-Feb-07 18:51



Test Engineer

The Test Engineer is responsible for the completion of test plans and their execution. During test planning, the Test Engineer works with the Testing Manager/Quality Assurance Manager to finalize the test plans and to ensure that the requirements are testable. The Test Engineer is also responsible for complete test execution, including designing and implementing test scripts, test suites of test cases, and test data. The Test Engineer
should be able to demonstrate knowledge of testing techniques and to provide
feedback to developers. He/She uses the procedures as defined in the test strategy to
execute, report results and progress of test execution and to escalate testing issues as
appropriate.

Reports to:

● Test Manager (or Quality Assurance Manager)

Responsibilities:

● Provides input to the test plan and executes it


● Carries out requested procedures to ensure that Data Integration systems and
services meet organization standards and business requirements
● Develops and maintains test plans, test requirements documentation, test
cases and test scripts
● Verifies compliance to commitments contained in the test plans
● Escalates issues and works to resolve them
● Participates in testing and post-production verification efforts
● Executes test scripts and documents and provides the results to the test
manager
● Provides feedback to developers
● Investigates and resolves test failures

Qualifications/Certifications

● Solid understanding of the Software Development Life Cycle


● Experience with automated testing tools
● Strong knowledge of Data Warehouse and Data Integration enterprise environments
● Experience in a quality assurance and testing environment
● Experience in developing and executing test cases and in setting up complex
test environments
● Industry vertical knowledge

Recommended Training

● PowerCenter Developer Level I & II


● Data Analyzer Introduction
● SQL Basics
● Data Quality Workbench

Last updated: 01-Feb-07 18:51



Test Manager

The Test Manager is responsible for coordinating all aspects of test planning and execution. During test planning, the Test Manager becomes familiar with the business requirements in
order to develop sufficient test coverage for all planned functionality. He/she also
develops a test schedule that fits into the overall project plan. Typically, the Test
Manager works with a development counterpart during test execution; the development
manager schedules and oversees the completion of fixes for bugs found during testing.

The test manager is also responsible for the creation of the test data set. An integrated
test data set is a valuable project resource in its own right; apart from its obvious role in
testing, the test data set is very useful to the developers of integration and presentation
components. In general, separate functional and volume test data sets will be required.
In most cases, these should be derived from the production environment. It may also
be necessary to manufacture a data set which triggers all the business rules and
transformations specified for the application.

Finally, the Test Manager must continually advocate adherence to the Test Plans.
Projects at risk of delayed completion often sacrifice testing, which comes at the expense of a high-quality end result.

Reports to:

● Technical Project Manager (or Quality Assurance Manager)

Responsibilities:

● Coordinates all aspects of test planning and execution


● Carries out procedures to ensure that Data Integration systems and services
meet organization standards and business requirements
● Develops and maintains test plans, test requirements documentation, test
cases and test scripts
● Develops and maintains test data sets
● Verifies compliance to commitments contained in the test plans
● Works with the project management and development teams to resolve issues
● Communicates concerns, issues and problems with data



● Leads testing and post-production verification efforts
● Executes test scripts and documents and publishes the results
● Investigates and resolves test failures

Qualifications/Certifications

● Solid understanding of the Software Development Life Cycle


● Experience with automated testing tools
● Strong knowledge of Data Warehouse and Data Integration enterprise
environments
● Experience in a quality assurance and testing environment
● Experience in developing and executing test cases and in setting up complex
test environments
● Experience in classifying, tracking and verifying bug fixes
● Industry vertical knowledge
● Able to work effectively with both business and technical stakeholders
● Project management

Recommended Training

● PowerCenter Developer Level I


● Data Analyzer Introduction
● Data Explorer

Last updated: 01-Feb-07 18:51



Training Coordinator

The Training Coordinator is responsible for the design, development, and delivery of all requisite training materials. The deployment of a data integration solution can only be
successful if the End Users fully understand the purpose of the solution, the data and
metadata available to them, and the types of analysis they can perform using the
application. The Training Coordinator will work with the Project Management Team, the
development team, and the End Users to ensure that he/she fully understands the
training needs, and develops the appropriate training material and delivery approach.
The Training Coordinator will also schedule and manage the delivery of the actual
training material to the End Users.

Reports to:

● Business Project Manager

Responsibilities:

● Designs, develops and delivers training materials


● Schedules and manages logistical aspects of training for end users
● Performs training need analysis in conjunction with the Project Manager,
development team and end users
● Interviews subject matter experts
● Ensures delivery on training commitments

Qualifications/Certifications

● Experience in the training field


● Ability to create training materials in multiple formats (e.g., written, computer-based, instructor-led, etc.)
● Possesses excellent communication skills, both written and verbal
● Results oriented team player
● Must be able to work effectively with both business and technical stakeholders
● Has knowledge of the tools and technologies used in the data integration
solution



Recommended Training

● Training Needs Analysis


● Data Analyzer Introduction
● Data Analyzer Report Creation

Last updated: 01-Feb-07 18:51



User Acceptance Test Lead

The User Acceptance Test Lead is responsible for leading the final testing and gaining final approval from the business users. The User Acceptance Test
Lead interacts with the End Users and the design team during the development effort to
ensure the inclusion of all the user requirements within the original defined scope. He/
she then validates that the deployed solution meets the final user requirements.

Reports to:

● Business Project Manager

Responsibilities:

● Gathers and clarifies business requirements


● Interacts with the design team and end users during the development efforts to
ensure inclusion of user requirements within the defined scope
● Reviews technical design proposals
● Schedules and leads the user acceptance test effort
● Provides test script/case training to the user acceptance test team
● Reports on test activities and results
● Validates that the deployed solution meets the final user requirements

Qualifications/Certifications

● Experience planning and executing user acceptance testing


● Strong understanding of the business' processes
● Knowledge of the project solution
● Excellent communication skills

Recommended Training

● N/A



Last updated: 12-Jun-07 16:06



Phase 1: Manage

1 Manage

● 1.1 Define Project


❍ 1.1.1 Establish Business Project Scope
❍ 1.1.2 Build Business Case
❍ 1.1.3 Assess Centralized Resources
● 1.2 Plan and Manage Project
❍ 1.2.1 Establish Project Roles
❍ 1.2.2 Develop Project Estimate
❍ 1.2.3 Develop Project Plan
❍ 1.2.4 Manage Project
● 1.3 Perform Project Close



Phase 1: Manage

Description

Managing the development of a data integration solution requires extensive planning. A well-defined, comprehensive plan provides the foundation from which to build a project
solution. The goal of this phase is to address the key elements required for a solid
project foundation. These elements include:

● Scope - Clearly defined business objectives. The measurable, business-relevant outcomes expected from the project should be established early in the
development effort. Then, an estimate of the expected Return on Investment
(ROI) can be developed to gauge the level of investment and anticipated
return. The business objectives should also spell out a complete inventory of
business processes to facilitate a collective understanding of these processes
among project team members.
● Planning/Managing - The project plan should detail the project scope as well
as its objectives, required work efforts, risks, and assumptions. A thorough,
comprehensive scope can be used to develop a work breakdown structure
(WBS) and establish project roles for summary task assignments. The plan
should also spell out the change and control process that will be used for the
project.
● Project Close/Wrap-Up - At the end of each project, the final step is to obtain
project closure. Part of this closure is to ensure the completeness of the effort
and obtain sign-off for the project. Additionally, a project evaluation will help in
retaining lessons learned and assessing the success of the overall effort.

Prerequisites
None

Roles

Business Project Manager (Primary)

Data Integration Developer (Secondary)



Data Quality Developer (Secondary)

Data Transformation Developer (Secondary)

Presentation Layer Developer (Secondary)

Production Supervisor (Approve)

Project Sponsor (Primary)

Quality Assurance Manager (Approve)

Technical Architect (Primary)

Technical Project Manager (Primary)

Considerations

None

Best Practices
None

Sample Deliverables
None

Last updated: 20-May-08 18:53



Phase 1: Manage
Task 1.1 Define Project

Description

This task entails constructing the business context for the project, defining in business
terms the purpose and scope of the project as well as the value to the business (i.e.,
the business case).

Prerequisites
None

Roles

Business Analyst (Primary)

Business Project Manager (Primary)

Project Sponsor (Primary)

Considerations

There are no technical considerations during this task; in fact, any discussion of
implementation specifics should be avoided at this time. The focus here is on defining
the project deliverable in business terms with no regard for technical feasibility. Any
discussion of technologies is likely to sidetrack the strategic thinking needed to develop
the project objectives.

Best Practices
None

Sample Deliverables

Project Definition



Last updated: 01-Feb-07 18:43



Phase 1: Manage
Subtask 1.1.1 Establish Business Project Scope

Description

In many ways the potential for success of the development effort for a data
integration solution correlates directly to the clarity and focus of its business scope. If
the business purpose is unclear or the boundaries of the business objectives are poorly
defined, there is a much higher risk of failure or, at least, of a less-than-direct path to
limited success.

Prerequisites
None

Roles

Business Analyst (Primary)

Business Project Manager (Review Only)

Project Sponsor (Primary)

Considerations

The primary consideration in developing the Business Project Scope is balancing the
high-priority needs of the key beneficiaries with the need to provide results within the
near-term. The Project Manager and Business Analysts need to determine the key
business needs and determine the feasibility of meeting those needs to establish a
scope that provides value, typically within a 60 to 120 day time-frame.



Tip
As a general rule, involve as many project beneficiaries as possible in the needs
assessment and goal definition. A "forum" type of meeting may be the most efficient
way to gather the necessary information since it minimizes the amount of time
involved in individual interviews and often encourages useful dialog among the
participants. However, it is often difficult to gather all of the project beneficiaries and
the project sponsor together for any single meeting, so you may have to arrange
multiple meetings and summarize the input for the various participants.

Best Practices
None

Sample Deliverables

Project Charter

Last updated: 01-Feb-07 18:43



Phase 1: Manage
Subtask 1.1.2 Build Business Case

Description

Building support and funding for a data integration solution nearly always requires convincing executive IT management of its value to the business. The best way to do this, if possible, is to actually calculate the project's estimated return on investment (ROI) and present it in a business case.

ROI modeling is valuable because it:

● Supplies a fundamental cost-justification framework for evaluating a data integration project.


● Mandates advance planning among all appropriate parties, including IT team members, business users, and executive
management.
● Helps organizations clarify and agree on the benefits they expect, and in that process, helps them set realistic expectations for
the data integration solution or the data quality initiative.

In addition to traditional ROI modeling on data integration initiatives, quantitative and qualitative ROI assessments should also
include assessments of data quality. Poor data quality costs organizations vast sums in lost revenues. Defective data leads
to breakdowns in the supply chain, poor business decisions, and inferior customer relationship management. Moreover, poor
quality data can lead to failures in compliance with industry regulations and even to outright project failure at the IT level.

It is vital to acknowledge data quality issues at an early stage in the project. Consider a data integration project that is planned
and resourced meticulously but that is undertaken on a dataset where the data is of a poorer quality than anyone realized. This
can lead to the classic “code-load-explode” scenario, wherein the data breaks down in the target system due to a poor
understanding of the data and metadata. What is worse, a data integration project can succeed from an IT perspective but deliver
little if any business value if the data within the system is faulty. For example, a CRM system containing a dataset with a large
quantity of redundant or inaccurate records is likely to be of little value to the business. Often an organization does not realize it
has data quality issues until it is too late. For this reason, data quality should be a consideration in ROI modeling for all data
integration projects – from the beginning.

For more details on how to quantify business value and associated data integration project cost, please see Assessing the
Business Case.

Prerequisites

1.1.1 Establish Business Project Scope

Roles

Business Project Manager (Secondary)

Considerations

The Business Case must focus on business value and, as much as possible, quantify that value. The business beneficiaries
are primarily responsible for assessing the project benefits, while technical considerations drive the cost assessments. These
two assessments - benefits and costs - form the basis for determining overall ROI to the business.
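To make the benefit-and-cost comparison concrete, the sketch below applies the common convention ROI = (total benefits - total costs) / total costs over a three-year horizon. This is an illustrative Python example only: the figures are hypothetical and Velocity does not prescribe a particular formula. In practice, the benefit inputs would come from Step 1 below and the cost inputs from Step 2.

# Illustrative sketch with hypothetical three-year figures (USD).
annual_benefits = [500_000, 750_000, 900_000]   # estimated business value, years 1-3
annual_costs = [400_000, 150_000, 150_000]      # implementation, then operations/maintenance

total_benefits = sum(annual_benefits)           # 2,150,000
total_costs = sum(annual_costs)                 # 700,000

roi = (total_benefits - total_costs) / total_costs
print(f"Three-year ROI: {roi:.0%}")             # Three-year ROI: 207%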

Building the Business Case

Step 1 - Business Benefits

When creating your ROI model, it is best to start by looking at the expected business benefit of implementing the data
integration solution. Common business imperatives include:

● Improving decision-making and ensuring regulatory compliance.


● Modernizing the business to reduce costs.



● Merging and acquiring other organizations.
● Increasing business profitability.
● Outsourcing non-core business functions to be able to focus on your company’s core value proposition.

Each of these business imperatives requires support via substantial IT initiatives. Common IT initiatives include:

● Business intelligence initiatives.


● Retirement of legacy systems.
● Application consolidation initiatives.
● Establishment of data hubs for customer, supplier, and/or product data.
● Business process outsourcing (BPO) and/or Software as a Service (SaaS).

For these IT initiatives to be successful, you must be able to integrate data from a variety of disparate systems. The form of those
data integration projects may vary. You may have a:

● Data Warehousing project, which enables new business insight usually through business intelligence.
● Data Migration project, where data sources are moved to enable a new application or system.
● Data Consolidation project, where certain data sources or applications are retired in favor of another.
● Master Data Management project, where multiple data sources come together to form a more complex, master view of the
data.
● Data Synchronization project, where data between two source systems need to stay perfectly consistent to enable different
applications or systems.
● B2B Data Transformation project, where data from external partners is transformed to internal formats for processing by
internal systems and responses are transformed back to partner appropriate formats.
● Data Quality project, where the goals are to cleanse data and to correct errors such as duplicates, missing information,
mistyped information and other data deficiencies.

Once you have traced your data integration project back to its origins in the business imperatives, it is important to estimate the value derived from the data integration project. You can estimate the value by asking questions such as:

● What is the business goal of this project? Is this relevant?


● What are the business metrics or key performance indicators associated with this goal?
● How will the business measure the success of this initiative?
● How does data accessibility affect the business initiative? Does having access to all of your data improve the business
initiative?
● How does data availability affect the business initiative? Does having data available when it’s needed improve the business
initiative?
● How does data quality affect the business initiative? Does having good data quality improve the business initiative? Conversely,
what is the potential negative impact of having poor data quality on the business initiative?
● How does data auditability affect the business? Does having an audit trail of your data improve the business initiative from a
compliance perspective?
● How does data security affect the business? Does ensuring secure data improve the business initiative?

After asking the questions above, you will start to be able to equate business value, in monetary terms, with the data integration project. Remember to estimate the business value not only over the first year after implementation, but also over the longer term. Most business cases and associated ROI models factor in expected business value for at least three years.

If you are still struggling to estimate the business value of the data integration initiative, see the list below, which outlines common business value categories and how they relate to various data integration initiatives:

Business value categories (each with an explanation, typical metrics, and data integration examples):

INCREASE REVENUE

● New Customer Acquisition: Lower the costs of acquiring new customers.
❍ Typical metrics: cost per new customer acquisition; cost per lead; # new customers acquired/month per sales rep or per office/store
❍ Data integration examples: marketing analytics; integration of third party data (from credit bureaus, directory services, salesforce.com, etc.)
● Cross-Sell / Up-Sell: Increase penetration and sales within existing customers.
❍ Typical metrics: % cross-sell rate; # products/customer; % share of wallet; customer lifetime value
❍ Data integration examples: single view of customer across all products and channels; marketing analytics & customer segmentation; customer lifetime value analysis
● Sales and Channel Management: Increase sales productivity, and improve visibility into demand.
❍ Typical metrics: sales per rep or per employee; close rate; revenue per transaction
❍ Data integration examples: sales/agent productivity dashboard; sales & demand analytics; customer master data integration; demand chain synchronization
● New Product / Service Delivery: Accelerate new product/service introductions, and improve the "hit rate" of new offerings.
❍ Typical metrics: # new products launched/year; new product/service launch time; new product/service adoption rate
❍ Data integration examples: data sharing across design, development, production and marketing/sales teams; data sharing with third parties, e.g., contract manufacturers, channels, marketing agencies
● Pricing / Promotions: Set pricing and promotions to stimulate demand while improving margins.
❍ Typical metrics: margins; profitability per segment; cost-per-impression, cost-per-action
❍ Data integration examples: cross-geography/cross-channel pricing visibility; differential pricing analysis and tracking; promotions effectiveness analysis

LOWER COSTS

● Supply Chain Management: Lower procurement costs, increase supply chain visibility, and improve inventory management.
❍ Typical metrics: purchasing discounts; inventory turns; quote-to-cash cycle time; demand forecast accuracy
❍ Data integration examples: product master data integration; demand analysis; cross-supplier purchasing history
● Production & Service Delivery: Lower the costs to manufacture products and/or deliver services.
❍ Typical metrics: production cycle times; cost per unit (product); cost per transaction (service); straight-through-processing rate
❍ Data integration examples: cross-enterprise inventory rollup; scheduling and production synchronization
● Logistics & Distribution: Lower distribution costs and improve visibility into the distribution chain.
❍ Typical metrics: distribution costs per unit; average delivery times; delivery date reliability
❍ Data integration examples: integration with third party logistics management and distribution partners
● Invoicing, Collections and Fraud Prevention: Improve invoicing and collections efficiency, and detect/prevent fraud.
❍ Typical metrics: # invoicing errors; DSO (days sales outstanding); % uncollectible; % fraudulent transactions
❍ Data integration examples: invoicing/collections reconciliation; fraud detection
● Financial Management: Streamline financial management and reporting.
❍ Typical metrics: end-of-quarter days to close; financial reporting efficiency; asset utilization rates
❍ Data integration examples: financial data warehouse/reporting; financial reconciliation; asset management/tracking

MANAGE RISK

● Compliance Risk (e.g., SEC/SOX/Basel II/PCI): Prevent compliance outages to avoid investigations, penalties, and negative impact on brand.
❍ Typical metrics: # negative audit/inspection findings; probability of compliance lapse; cost of compliance lapses (fines, recovery costs, lost business); audit/oversight costs
❍ Data integration examples: financial reporting; compliance monitoring & reporting
● Financial/Asset Risk Management: Improve risk management of key assets, including financial, commodity, energy or capital assets.
❍ Typical metrics: errors & omissions; probability of loss; expected loss; safeguard and control costs
❍ Data integration examples: risk management data warehouse; reference data integration; scenario analysis; corporate performance management
● Business Continuity/Disaster Recovery Risk: Reduce downtime and lost business, prevent loss of key data, and lower recovery costs.
❍ Typical metrics: mean time between failure (MTBF); mean time to recover (MTTR); recovery time objective (RTO); recovery point objective (RPO, i.e., data loss)
❍ Data integration examples: resiliency and automatic failover/recovery for all data integration processes

Step 2 – Calculating the Costs

Now that you have estimated the monetary business value from the data integration project in Step 1, you will need to calculate the costs associated with that project in Step 2. In most cases, the data integration project is inevitable – one way or another the business initiative is going to be accomplished – so it is best to compare two alternative cost scenarios. One scenario would be implementing the data integration project with tools from Informatica, while the other scenario would be implementing the data integration project without Informatica's toolset.

Some examples of benchmarks to support the case for Informatica lowering the total cost of ownership (TCO) on data integration
and data quality projects are outlined below:

Benchmarks from Industry Analysts, Consultants, and Authors

Forrester Research, "The Total Economic Impact of Deploying Informatica PowerCenter", 2004

The average savings of using a data integration/ETL tool vs. hand coding:

• 31% in development costs

• 32% in operations costs

• 32% in maintenance costs

• 35% in overall project life-cycle costs

Gartner, "Integration Competency Center: Where Are Companies Today?", 2005

• The top-performing third of Integration Competency Centers (ICCs) will save an average of:

• 30% in data interface development time and costs

• 20% in maintenance costs



• The top-performing third of ICCs will achieve 25% reuse of integration components

Larry English, Improving Data Warehouse and Business Information Quality, Wiley Computer
Publishing, 1999.

• "The business costs of non-quality data, including irrecoverable costs, rework of products
and services, workarounds, and lost and missed revenue may be as high as 10 to 25 percent
of revenue or total budget of an organization."

• "Invalid data values in the typical customer database averages around 15 to 20 percent…
Actual data errors, even though the values may be valid, may be 25 to 30 percent or more in
those same databases."

• "Large organizations often have data redundantly stored 10 times or more."

Ponemon Institute, study of costs incurred by 14 companies that had security breaches affecting between 1,500 and 900,000 consumer records (the reported figures are reconciled in the short sketch after this list)

• Total costs to recover from a breach averaged $14 million per company, or $140 per lost
customer record

• Direct costs for incremental, out-of-pocket, unbudgeted spending averaged $5 million per
company, or $50 per lost customer for outside legal counsel, mail notification letters, calls to
individual customers, increased call center costs and discounted product offers

• Indirect costs for lost employee productivity averaged $1.5 million per company, or $15 per
customer record

• Opportunity costs covering loss of existing customers and increased difficulty in recruiting new
customers averaged $7.5 million per company, or $75 per lost customer record.

• Overall customer loss averaged 2.6 percent of all customers and ranged as high as 11 percent
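The figures reported above are internally consistent, as a quick arithmetic check confirms; the short Python sketch below simply reconciles the per-record components and the per-company averages quoted in the study.

# Per-record cost components reported in the study (USD per lost customer record).
direct_per_record = 50         # out-of-pocket response costs
indirect_per_record = 15       # lost employee productivity
opportunity_per_record = 75    # customer loss and harder customer acquisition
print(direct_per_record + indirect_per_record + opportunity_per_record)   # 140, matching $140 per record

# Per-company averages reported in the study (USD).
print(5_000_000 + 1_500_000 + 7_500_000)   # 14,000,000, matching the $14 million average total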

In addition to lowering the cost of implementing a data integration solution, Informatica adds value to the ROI model by mitigating risk in the data integration project. To quantify the value of risk mitigation, consider the cost of a project overrun and the associated likelihood of overrun when using Informatica vs. when not using Informatica for your data integration project. An example analysis of risk mitigation value is below:
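As a purely hypothetical sketch of the underlying idea, risk mitigation value can be approximated as the difference in expected overrun cost between the two scenarios; the overrun cost and probabilities below are illustrative assumptions, not benchmarks.

# Hypothetical inputs; substitute project-specific estimates.
overrun_cost = 1_000_000     # estimated cost of a significant project overrun (USD)
p_overrun_without = 0.40     # assumed likelihood of overrun without the toolset
p_overrun_with = 0.15        # assumed likelihood of overrun with the toolset

expected_cost_without = p_overrun_without * overrun_cost   # 400,000
expected_cost_with = p_overrun_with * overrun_cost         # 150,000

risk_mitigation_value = expected_cost_without - expected_cost_with
print(f"Expected overrun cost avoided: ${risk_mitigation_value:,.0f}")   # $250,000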



Step 3 – Putting it all Together

Once you have calculated the three-year business/IT benefits and the three-year costs of using PowerCenter vs. not using PowerCenter, put all of this information into a format that is easy to read for IT and line-of-business executive management. The following is a sample summary of an ROI model:



For data migration projects it is frequently necessary to prove that using Informatica technology for the data migration efforts
has benefits over traditional means. To prove the value, three areas should be considered:

1. Informatica Software can reduce the overall project timeline by accelerating migration development efforts.
2. Informatica-delivered migrations will have lower risk due to ease of maintenance, less development effort, higher data quality, and the improved project management capabilities of a metadata-driven solution.
3. Availability of lineage reports showing how the data was manipulated by the data migration process and by whom.

Best Practices
None

Sample Deliverables
None

Last updated: 20-May-08 19:09



Phase 1: Manage
Subtask 1.1.3 Assess Centralized Resources

Description

The pre-existence of any centralized resources such as an Integration Competency Center (ICC) has an obvious impact on the tasks to be undertaken in a data integration project. The objective in Velocity is not to replicate the material that is available elsewhere on the set-up and operation of an ICC (http://www.informatica.com/solutions/icc/default.htm). However, there are points in the development cycle where the availability of some degree of centralized resources has a material effect on the Velocity Work Breakdown Structure (WBS); some tasks are altered, some may no longer be required, and it is even possible that some new tasks will be created.

If an ICC does not already exist, this subtask is finished since there are no centralized
resources to assess and all the tasks in the Velocity WBS are the responsibility of the
development team.

If an ICC does exist, it is necessary to assess the extent and nature of the resources
available in order to demarcate the responsibilities between the ICC and project teams.
Typically, the ICC acquires responsibility for some or all of the data integration
infrastructure (essentially the Non-Functional Requirements) and the project teams are
liberated to focus on the functional requirements. The precise division of labor is
obviously dependent on the degree of centralization and the associated ICC model that
has been adopted.

In the task descriptions that follow, an ICC section is included under the Considerations
heading where alternative or supplementary activity is required if an ICC is in place.

Prerequisites
None

Roles

Business Project Manager (Primary)

Considerations



It is the responsibility of the project manager to review the Velocity WBS in light of the services provided by the ICC. The responsibility for each subtask should be established.

Best Practices

Selecting the Right ICC Model

Planning the ICC Implementation

Sample Deliverables
None

Last updated: 01-Feb-07 18:43



Phase 1: Manage
Task 1.2 Plan and Manage Project

Description

This task incorporates the initial project planning and management activities as well as
project management activities that occur throughout the project lifecycle. It includes the
initial structure of the project team and the project work steps based on the business
objectives and the project scope, and the continuing management of expectations
through status reporting, issue tracking and change management.

Prerequisites
None

Roles

Business Project Manager (Primary)

Data Integration Developer (Secondary)

Data Quality Developer (Secondary)

Presentation Layer Developer (Secondary)

Project Sponsor (Approve)

Technical Architect (Primary)

Technical Project Manager (Primary)

Considerations

In general, project management activities involve reconciling trade-offs between business requests as to functionality and timing with technical feasibility and budget considerations. This often means balancing between sensitivity to project goals and concerns ("being a good listener") on the one hand, and maintaining a firm grasp of what is feasible ("telling the truth") on the other.

The tools of the trade, apart from strong people skills (especially, interpersonal
communication skills), are detailed documentation and frequent review of the status of
the project effort against plan, of the unresolved issues, and of the risks regarding
enlargement of scope ("change management"). Successful project management is
predicated on regular communication of these project aspects with the project
manager, and with other management and project personnel.

For data migration projects there is often a project management office (PMO) in place. The PMO is typically found in high-dollar, high-profile projects, such as implementing a new ERP system that will often cost millions of dollars. It is important to identify the roles and gain the PMO's understanding of how these roles are needed and how they will intersect with the broader system implementation. More specifically, these roles will have responsibility beyond the data migration, so the resource requirements for the data migration must be understood and guaranteed as part of the larger effort overseen by the PMO.

For B2B projects, technical considerations typically play an important role. The format
of data received from partners (and replies sent to partners) forms a key consideration
in overall business operations and has a direct impact on the planning and scoping of
changes. Informatica recommends having the Technical Architect directly involved
throughout the process.

Best Practices
None

Sample Deliverables
None

Last updated: 20-May-08 19:13



Phase 1: Manage
Subtask 1.2.1 Establish Project Roles

Description

This subtask involves defining the roles/skill sets that will be required to complete the
project. This is a precursor to building the project team and making resource
assignments to specific tasks.

Prerequisites
None

Roles

Business Project Manager (Primary)

Project Sponsor (Approve)

Technical Project Manager (Primary)

Considerations

The Business Project Scope established in 1.1.1 Establish Business Project Scope
provides a primary indication of the required roles and skill sets. The following types of
questions are useful discussion topics and help to validate the initial indicators:

● What are the main tasks/activities of the project and what skills/roles are
needed to accomplish them?
● How complex or broad in scope are these tasks? This can indicate the level of
skills needed.
● What responsibilities will fall to the company resources and which are off-
loaded to a consultant? Who (i.e. company resource or consultant) will provide
the project management? Who will have primary responsibility for
infrastructure requirements? ...for data architecture? ...for documentation? ...
for testing? ...for deployment/training/support?



● How much development and testing will be involved?

This is a definitional activity and very distinct from the later assignment of resources.
These roles should be defined as generally as possible rather than attempting to match
a requirement with a resource at hand.

After the project scope and required roles have been defined, there is often pressure to
combine roles due to limited funding or availability of resources. There are some roles
that inherently provide a healthy balance with one another, and if one person fills both
of these roles, project quality may suffer.

The classic conflict is between development roles and highly procedural or operational
roles. For example, a QA Manager or Test Manager or Lead should not be the same
person as a Project Manager or one of the development team. The QA Manager is
responsible for determining the criteria for acceptance of project quality and managing
quality-related procedures. These responsibilities directly conflict with the developer’s
need to meet a tight development schedule. For similar reasons, development
personnel are not ideal choices for filling such operational roles as Metadata Manager,
DBA, Network Administrator, Repository Administrator, or Production Supervisor.
Those roles require operational diligence and adherence to procedure as opposed to
ad hoc development. When development roles are mixed with operational roles,
resulting ‘shortcuts’ often lead to quality problems in production systems.

Tip
Involve the Project Sponsor.

Before defining any roles, be sure that the Project Sponsor is in agreement as to the
project scope and major activities, as well as the level of involvement expected from
company personnel and consultant personnel. If this agreement has not been
explicitly accomplished, review the project scope with the Project Sponsor to resolve
any remaining questions.

In defining the necessary roles, be sure to provide the Sponsor with a full description
of all roles, indicating which will rely on company personnel and which will use
consultant personnel. This sets clear expectations for company involvement and
indicates if there is a need to fill additional roles with consultant personnel if the
company does not have personnel available in accordance with the project timing.

The Role Descriptions in Roles provide typical role definitions. The Project Role Matrix can serve as a starting point for completing the project-specific roles matrix.



Best Practices
None

Sample Deliverables

Project Definition

Project Role Matrix

Work Breakdown Structure

Last updated: 01-Feb-07 18:43



Phase 1: Manage
Subtask 1.2.2 Develop Project Estimate

Description

Once the overall project scope and roles have been defined, details on project execution must be developed. These details should answer the questions of what must be done, who will do it, how long it will take, and how much it will cost.

The objective of this subtask is to develop a complete WBS and, subsequently, a solid
project estimate.

Two important documents required for project execution are the:

● Work Breakdown Structure (WBS), which can be viewed as a list of tasks that
must be completed to achieve the desired project results. (See Developing a
Work Breakdown Structure (WBS) for more details)
● Project Estimate, which, at this time, focuses solely on development costs
without consideration for hardware and software liabilities.

Estimating a project is never an easy task, and often becomes more difficult as project
visibility increases and there is an increasing demand for an "exact estimate". It is
important to understand that estimates are never exact. However, estimates are useful
for providing a close approximation of the level of effort required by the project. Factors
such as project complexity, team skills, and external dependencies always have an
impact on the actual effort required.

The accuracy of an estimate largely depends on the experience of the estimator (or
estimators). For example, an experienced traveller who frequently travels the route
between his/her home or office and the airport can easily provide an accurate estimate
of the time required for the trip. When the same traveller is asked to estimate travel
time to or from an unfamiliar airport however, the estimation process becomes much
more complex, requiring consideration of numerous factors such as distance to the
airport, means of transportation, speed of available transportation, time of day that the
travel will occur, expected weather conditions, and so on. The traveller can arrive at a
valid overall estimate by assigning time estimates to each factor, then summing the
whole. The resulting estimate, however, is not likely to be nearly as accurate as the one based on knowledge gained through experience. The same holds true for estimating the time and resources required to complete development on a data integration solution
project.

Prerequisites
None

Roles

Business Project Manager (Primary)

Data Integration Developer (Secondary)

Data Quality Developer (Secondary)

Data Transformation Developer (Secondary)

Presentation Layer Developer (Secondary)

Project Sponsor (Approve)

Technical Architect (Secondary)

Technical Project Manager (Secondary)

Considerations

An accurate estimate depends greatly on a complete and accurate Work Breakdown Structure. Having the entire project team review the WBS when it is near completion
helps to ensure that it includes all necessary project tasks. Project deadlines often slip
because some tasks are overlooked and, therefore, not included in the initial estimates.
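As a simple illustration of why a complete WBS matters, a bottom-up estimate is just the sum of the task-level estimates, usually padded with a contingency factor for work that was overlooked. The Python sketch below uses task names from the Velocity WBS outline, but the hours and the 15 percent contingency are hypothetical.

# Hypothetical WBS fragment: task -> estimated effort in person-hours.
wbs_estimates = {
    "1.1 Define Project": 40,
    "1.2 Plan and Manage Project": 120,
    "2.2 Define Business Requirements": 160,
    "2.8 Perform Data Quality Audit": 80,
}
contingency = 0.15   # assumed buffer for overlooked tasks and unknowns

base_effort = sum(wbs_estimates.values())        # 400 hours
total_effort = base_effort * (1 + contingency)   # 460 hours
print(f"Base: {base_effort} hours; with contingency: {total_effort:.0f} hours")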

Sample Data Requirements for B2B Projects

For B2B projects (and non-B2B projects that have significant unstructured or semi-structured data transformation requirements), the actual creation and subsequent QA of transformations relies on having sufficient samples of input and output data, as well as specifications for data formats.



When estimating for projects that use Informatica’s B2B Data Transformation,
estimates should include sufficient time to allow for the collection and assembly of
sample data, any cleansing of sample data required (for example to conform to HIPAA
or financial privacy regulations), and for any data analysis or metadata discovery to be
performed on the sample data.

By their nature, the full authoring of B2B data transformations cannot be completed (or
in some cases proceed) without the availability of adequate sample data both for input
to transformations and for comparison purposes during the quality assurance process.

Best Practices
None

Sample Deliverables
None

Last updated: 20-May-08 19:17



Phase 1: Manage
Subtask 1.2.3 Develop Project Plan

Description

In this subtask, the Project Manager develops a schedule for the project using the
agreed-upon business project scope to determine the major tasks that need to be
accomplished and estimates of the amount of effort and resources required.

Prerequisites
None

Roles

Business Project Manager (Primary)

Project Sponsor (Approve)

Technical Project Manager (Secondary)

Considerations

The initial project plan is based on agreements-to-date with the Project Sponsor
regarding project scope, estimation of effort, roles, project timelines and any
understanding of requirements.

Updates to the plan (as described in Developing and Maintaining the Project Plan) are
typically based on changes to scope, approach, priorities, or simply on more precise
determinations of effort and of start and/or completion dates as the project unfolds. In
some cases, later phases of the project, like System Test (or "alpha"), Beta Test and
Deployment, are represented in the initial plan as a single set of activities, and will be
more fully defined as the project progresses. Major activities (e.g., System Test,
Deployment, etc.) typically involve their own full-fledged planning processes once the
technical design is completed. At that time, additional activities may be added to the
project plan to allow for more detailed tracking of those project activities.



Perhaps the most significant message here is that an up-to-date plan is critical for
satisfactory management of the project and for timely completion of its tasks. Keeping
the plan updated as events occur and as client understanding, needs, and expectations change requires an ongoing effort. The sooner the plan is updated and changes
communicated to the Project Sponsor and/or company management, the less likely that
expectations will be frustrated to a problematic level.

Best Practices

Data Migration Velocity Approach

Sample Deliverables

Project Roadmap

Work Breakdown Structure

Last updated: 01-Feb-07 18:43



Phase 1: Manage
Subtask 1.2.4 Manage Project

Description

In the broadest sense, project management begins before the project starts and
continues until its completion and perhaps beyond. The management effort includes:

● Managing the project beneficiary relationship(s), expectations and involvement


● Managing the project team, its make-up, involvement, priorities, activities and
schedule
● Managing all project issues as they arise, whether technical, logistical,
procedural, or personal.

In a more specific sense, project management involves being constantly aware of, or
preparing for, anything that needs to be accomplished or dealt with to further the
project objectives, and making sure that someone accepts responsibility for such
occurrences and delivers in a timely fashion.

Project management begins with pre-engagement preparation and includes:

● Project Kick-off, including the initial project scope, project organization, and
project plan
● Project Status and reviews of the plan and scope
● Project Content Reviews, including business requirements reviews and
technical reviews
● Change Management as scope changes are proposed, including changes to
staffing or priorities
● Issues Management

● Project Acceptance and Close

Prerequisites
None



Roles

Business Project Manager (Primary)

Project Sponsor (Review Only)

Technical Project Manager (Primary)

Considerations

In all management activities and actions, the Project Manager must balance the needs
and expectations of the Project Sponsor and project beneficiaries with the needs,
limitations and morale of the project team. Limitations and specific needs of the team
must be communicated clearly and early to the Project Sponsor and/or company
management to mitigate unwarranted expectations and avoid an escalation of
expectation-frustration that can have a dire effect on the project outcome. Issues that
affect the ability to deliver in any sense, and potential changes to scope, must be
brought to the Project Sponsor's attention as soon as possible and managed to
satisfactory resolution.

In addition to "expectation management", project management includes Quality Assurance for the project deliverables. This involves soliciting specific requirements with subsequent review of deliverables that include, in addition to the data integration solution itself, documentation, user interfaces, knowledge transfer, and testing procedures.

Best Practices
None

Sample Deliverables

Issues Tracking

Project Review Meeting Agenda

Project Status Report

Scope Change Assessment



Last updated: 01-Feb-07 18:43



Phase 1: Manage
Task 1.3 Perform Project Close

Description

This is a summary task that entails closing out the project and creating project wrap-up
documentation.

Each project should end with an explicit closure procedure. This process should include
Sponsor acknowledgement that the project is complete and the end product meets
expectations. A Project Close Report should be completed at the conclusion of the
effort, along with a final status report.

The project close documentation should highlight project accomplishments, lessons learned, justifications for tasks expected but not completed, and any recommendations
for future work on the end product. This task should also generate a reconciliation
document, reconciling project time/budget estimates with actual time and cost
expenditures.

As mentioned earlier in this chapter, experience is an important tool for succeeding in future efforts. Building upon the experience of a project team and publishing this
information will help future teams succeed in similar efforts.

Prerequisites
None

Roles

Business Project Manager (Primary)

Production Supervisor (Approve)

Project Sponsor (Approve)

Quality Assurance Manager (Approve)



Technical Project Manager (Approve)

Considerations

None

Best Practices
None

Sample Deliverables

Project Close Report

Last updated: 01-Feb-07 18:43



Phase 2: Analyze

2 Analyze

● 2.1 Define Business Drivers, Objectives and Goals
● 2.2 Define Business Requirements
❍ 2.2.1 Define Business Rules and Definitions
❍ 2.2.2 Establish Data Stewardship
● 2.3 Define Business Scope
❍ 2.3.1 Identify Source Data Systems
❍ 2.3.2 Determine Sourcing Feasibility
❍ 2.3.3 Determine Target Requirements
❍ 2.3.5 Build Roadmap for Incremental Delivery
● 2.4 Define Functional Requirements
● 2.5 Define Metadata Requirements
❍ 2.5.1 Establish Inventory of Technical Metadata
❍ 2.5.2 Review Metadata Sourcing Requirements
❍ 2.5.3 Assess Technical Strategies and Policies
● 2.6 Determine Technical Readiness
● 2.7 Determine Regulatory Requirements
● 2.8 Perform Data Quality Audit
❍ 2.8.1 Perform Data Quality Analysis of Source Data
❍ 2.8.2 Report Analysis Results to the Business



Phase 2: Analyze

Description

Increasingly, organizations
demand faster, better, and cheaper delivery of data integration and business
intelligence solutions. Many development failures and project cancellations can be
traced to an absence of adequate upfront planning and scope definition. Inadequately
defined or prioritized objectives and project requirements foster scenarios where
project scope becomes a moving target as requirements may change late in the game,
requiring repeated rework of design or even development tasks. The purpose of
the Analyze Phase is to build a solid foundation for project scope through a deliberate
determination of the business drivers, requirements, and priorities that will form the
basis of the project design and development.

Once the business case for a data integration or business intelligence solution is
accepted and key stakeholders are identified, the process of detailing and prioritizing
objectives and requirements can begin - with the ultimate goal of defining project scope
and, if appropriate, a roadmap for major project stages.

Prerequisites
None

Roles

Application Specialist (Primary)

Business Analyst (Primary)

Business Project Manager (Primary)

Data Architect (Primary)

Data Integration Developer (Primary)

Data Quality Developer (Primary)



Data Steward/Data Quality Steward (Primary)

Database Administrator (DBA) (Primary)

Legal Expert (Primary)

Metadata Manager (Primary)

Project Sponsor (Secondary)

Security Manager (Primary)

System Administrator (Primary)

Technical Architect (Primary)

Technical Project Manager (Primary)

Considerations

Functional and technical requirements must focus on the business goals and objectives
of the stakeholders, and must be based on commonly agreed-upon definitions of
business information. The initial business requirements are then compared to feasibility
studies of the source systems to help the prioritization process that will result in a
project roadmap and rough timeline. This sets the stage for incremental delivery of the
requirements so that some important needs are met as soon as possible, thereby
providing value to the business even though there may be a much longer timeline to
complete the entire project. In addition, during this phase it can be valuable to identify
the available technical metadata as a way to accelerate the design and improve its
quality. A successful Analyze Phase can serve as a foundation for a successful project.

Best Practices
None

Sample Deliverables
None



Last updated: 01-Feb-07 18:43



Phase 2: Analyze
Task 2.1 Define Business Drivers, Objectives and Goals

Description

In many ways, the potential for success of any data integration/business intelligence
solution correlates directly to the clarity and focus of its business scope. If the business
objectives are vague, there is a much higher risk of failure or, at best, a less direct path to limited success.

Business Drivers

The business drivers explain why the solution is needed and is being recommended at
a particular time by identifying the specific business problems, issues, or increased
business value that the project is likely to resolve or deliver. Business drivers may
include background information necessary to understand the problems and/or needs.
There should be clear links between the project’s business drivers and the company’s
underlying business strategies.

Business Objectives

Objectives are concrete statements describing what the project is trying to achieve.
Objectives should be explicitly defined so that they can be evaluated at the conclusion
of a project to determine if they were achieved.

Objectives written for a goal statement are nothing more than a deconstruction of the
goal statement into a set of necessary and sufficient objective statements. That is,
every objective must be accomplished to reach the goal, and no objective is
superfluous.

Objectives are important because they establish a consensus between the project
sponsor and the project beneficiaries regarding the project outcome. The specific
deliverables of an IT project, for instance, may or may not make sense to the project
sponsor. However, the business objectives should be written so they are
understandable by all of the project stakeholders.



Business Goals

Goal statements provide the overall context for what the project is trying to accomplish.
They should align with the company's stated business goals and strategies. Project
context is established in a goal statement by stating the project's object of study, its
purpose, its quality focus, and its viewpoint. A well-defined goal should reference the project's business benefits in terms of cost, time, and/or quality.
Because goals are high-level statements, it may take more than one project to achieve
a stated goal. If the goal's achievement can be measured, it is probably defined at too
low a level and may actually be an objective. If the goal is not achievable through any
combination of projects, it is probably too abstract and may be a vision statement.

Every project should have at least one goal. It is the agreement between the company
and the project sponsor about what is going to be accomplished by the project. The
goal provides focus and serves as the compass for determining if the project outcomes
are appropriate. In the project management life cycle, the goal is bound by a number of
objective statements. These objective statements clarify the fuzzy boundary of the goal
statement. Taken as a pair, the goal and objectives statements define the project. They
are the foundation for project planning and scope definition.

Prerequisites
None

Roles

Business Project Manager (Review Only)

Project Sponsor (Review Only)

Considerations

Business Drivers

The business drivers must be defined using business language. Identify how the
project is going to resolve or address specific business problems. Key components
when identifying business drivers include:

● Describe facts, figures, and other pertinent background information to support the existence of a problem.
● Explain how the project resolves or helps to resolve the problem in terms
familiar to the business.
● Show any links to business goals, strategies, and principles.

Large projects often have significant business and technical requirements that drive the
project's development. Consider explaining the origins of the significant requirements
as a way of explaining why the project is needed.

Business Objectives

Before the project starts, define and agree on the project objectives and the business goals they support. The deliverables of the project are created based on the objectives - not the other way around. A meeting between all major stakeholders is the best way to create the objectives and gain consensus on them at the same time. This type of meeting encourages discussion among participants and minimizes the amount of time involved in defining business objectives and goals. It may not be possible to gather all the project beneficiaries and the project sponsor together at the same time, so multiple meetings may have to be arranged and the results summarized.

While goal statements are designed to be high-level, a well-worded objective is Specific, Measurable, Attainable/Achievable, Realistic and Time-bound (SMART).

● Specific: An objective should address a specific target or accomplishment.


● Measurable: Establish a metric that indicates that an objective has been met.
● Attainable: If an objective cannot be achieved, then it's probably a goal.
● Realistic: Limit objectives to what can realistically be done with available
resources.
● Time-bound: Achieve objectives within a specified time frame.

At a minimum, make sure each objective contains four parts, as follows:

● An outcome - describe what the project will accomplish.


● A time frame - the expected completion date of the project.
● A measure - metric(s) that will measure success of the project.
● An action - how to meet the objective.

The business objectives should take into account the results of any data quality
investigations carried out before or during the project. If the project source data quality is low, then the project's ability to achieve its objectives may be compromised. If the
project has specific data-related objectives, such as regulatory compliance objectives,
then a high degree of data quality may be an objective in its own right. For this reason,
data quality investigations (such as a Data Quality Audit) should be carried out as early
as is feasible in the project life-cycle. See 2.8 Perform Data Quality Audit.

Generally speaking, the number of objectives comes down to how much business
investment is going to be made in pursuit of the project's goals. High investment
projects generally have many objectives. Low investment projects must be more
modest in the objectives they pursue. There is considerable discretion in how granular
a project manager may get in defining objectives. High-level objectives generally need
a more detailed explanation and often lead to more definition in the project's
deliverables to achieve the objective. Lower-level, detailed objectives tend to require less descriptive narrative and decompose into fewer deliverables. Regardless of
the number of objectives identified, the priority should be established by ranking the
objectives with their respective impacts, costs, and risks.

Business Goals

The goal statement must also be written in business language so that anyone who
reads it can understand it without further explanation. The goal statement should:

● Be short and to the point.


● Provide overall context for what the project is trying to accomplish.
● Be aligned to business goals in terms of cost, time and quality.

Smaller projects generally have a single goal. Larger projects may have more than one
goal, which should also be prioritized. Since the goal statement is meant to be succinct,
regardless of the number of goals a project has, the goal statement should always be
brief and to the point.

Best Practices
None

Sample Deliverables
None

Last updated: 18-May-08 17:36



Phase 2: Analyze
Task 2.2 Define Business Requirements

Description

A data integration/business intelligence solution development project typically originates from a company's need to provide management and/or customers with
business analytics or to provide business application integration. As with any technical
engagement, the first task is to determine clear and focused business requirements to
drive the technology implementation. This requires determining what information is
critical to support the project objectives and its relation to important strategic and
operational business processes. Project success will be based on clearly identifying
and accurately resolving these informational needs with the proper timing.

The goal of this task is to ensure the participation and consensus of the project sponsor
and key beneficiaries during the discovery and prioritization of these information
requirements.

Prerequisites
None

Roles

Business Project Manager (Primary)

Data Quality Developer (Secondary)

Data Steward/Data Quality Steward (Primary)

Legal Expert (Approve)

Metadata Manager (Primary)

Project Sponsor (Approve)



Considerations

In a data warehouse/business intelligence project, there can be strategic or tactical requirements.

Strategic Requirements

● The customer's management is typically interested in strategic questions that often include a significant timeframe. For example, ‘How has the turnover of
product ‘x’ increased over the last year?’ or, 'What is the revenue of area ‘a’ in
January of this year as compared to last year?’. Answers to strategic questions
provide company executives with the information required to build on the
company strengths and/or to eliminate weaknesses.

Strategic requirements are typically implemented through a data warehouse type project with appropriate visualization tools.

Tactical Requirements

● The tactical requirements serve the ‘day to day’ business. Operational level
employees want solutions to enable them to manage their on-going work and
solve immediate problems. For instance, a distributor running a fleet of trucks
has an unavailable driver on a particular day. They would want to answer
questions such as, 'How can the delivery schedule be altered in order to meet
the delivery time of the highest priority customer?' Answers to these questions
are valid and pertinent for only a short period of time in comparison to the
strategic requirements.

Tactical requirements are often implemented via operational data integration.

Best Practices
None

Sample Deliverables
None

Last updated: 02-May-08 12:05



Phase 2: Analyze
Subtask 2.2.1 Define Business Rules and Definitions

Description

A business rule is a compact and simple statement that represents some important
aspect of a business process or policy. By capturing the rules of the business—the
logic that governs its operation—systems can be created that are fully aligned with the
needs of the organization.

Business rules stem from the knowledge of business personnel and constrain some
aspect of the business. From a technical perspective, a business rule expresses
specific constraints on the creation, updating, and removal of persistent data in an
information system. For example, a new bank account cannot be created unless the
customer has provided an adequate proof of identification and address.

Prerequisites
None

Roles

Data Quality Developer (Secondary)

Data Steward/Data Quality Steward (Primary)

Legal Expert (Approve)

Metadata Manager (Primary)

Security Manager (Approve)

Considerations

Formulating business rules is an iterative process, often stemming from statements of policy in an organization. Rules are expressed in natural language. The following set of guidelines follows best practices and provides practical instructions on how to formulate business rules:

● Start with a well-defined and agreed upon set of unambiguous definitions captured in a definitions repository. Re-use existing definitions if available.
● Use meaningful and precise verbs to connect the definitions captured above.
● Use standard expressions to constrain business rules, such as must, must not,
only if, no more than, etc. For example, the total commission paid to broker
ABC can be no more than xy% of the total revenue received for the sale of
widgets.
● Use standard expressions for derivation business rules like "x is calculated from", "summed from", etc. For example, "the departmental commission paid
is calculated as the total commission multiplied by the departmental rollup
rate."

The aim is to define atomic business rules, that is, rules that cannot be decomposed
further. Each atomic business rule is a specific, formal statement of a single term, fact,
derivation, or constraint on the business. The components of business rules, once
formulated, provide direct inputs to a subsequent conceptual data modeling and
analysis phase. In this approach, definitions and connections can eventually be
mapped onto a data model and constraints and derivations can be mapped onto a set
of rules that are enforced in the data model.
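
To make the distinction between constraint and derivation rules concrete, the following minimal Python sketch expresses the example rules above as atomic, testable functions. The function names, the 10% commission cap, and the sample figures are illustrative assumptions, not part of the methodology.

# Illustrative only: the two kinds of business rule above expressed as small,
# atomic, testable functions. The cap rate and field names are invented.

COMMISSION_CAP_RATE = 0.10  # "no more than xy% of total revenue" (xy assumed to be 10 here)


def commission_within_cap(total_commission: float, total_revenue: float) -> bool:
    """Constraint rule: commission paid can be no more than the cap rate of revenue."""
    return total_commission <= COMMISSION_CAP_RATE * total_revenue


def departmental_commission(total_commission: float, departmental_rollup_rate: float) -> float:
    """Derivation rule: departmental commission is calculated from the total commission."""
    return total_commission * departmental_rollup_rate


def can_create_account(has_proof_of_identification: bool, has_proof_of_address: bool) -> bool:
    """Constraint rule: an account cannot be created unless both proofs are provided."""
    return has_proof_of_identification and has_proof_of_address


if __name__ == "__main__":
    assert commission_within_cap(total_commission=9000, total_revenue=100000)
    assert departmental_commission(total_commission=9000, departmental_rollup_rate=0.25) == 2250
    assert not can_create_account(has_proof_of_identification=True, has_proof_of_address=False)

Keeping each rule in a single, named function mirrors the aim of atomic rules: each statement can be reviewed, tested, and mapped onto the data model independently.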

Best Practices
None

Sample Deliverables

Business Requirements Specification

Last updated: 01-Feb-07 18:43



Phase 2: Analyze
Subtask 2.2.2 Establish Data Stewardship

Description

Data stewardship is about keeping the business community involved and focused on
the goals of the project being undertaken. This subtask outlines the roles and
responsibilities that key personnel can assume within the framework of an overall
stewardship program. This participation should be regarded as ongoing because
stewardship activities need to be performed at all stages of a project lifecycle and
continue through the operational phase.

Prerequisites
None

Roles

Business Analyst (Secondary)

Business Project Manager (Primary)

Data Steward/Data Quality Steward (Secondary)

Project Sponsor (Approve)

Considerations

A useful mix of personnel to staff a stewardship committee may include:

● An executive sponsor
● A business steward
● A technical steward
● A data steward

Executive Sponsor



● Chair of the data stewardship committee
● Ultimate point of arbitration
● Liaison to management for setting and reporting objectives
● Should be recruited from project sponsors or management

Technical Steward

● Member of the data stewardship committee


● Liaison with technical community
● Reference point for technical-related issues and arbitration
● Should be recruited from the technical community with a good knowledge of
the business and operational processes

Business Steward

● Member of the data stewardship committee


● Liaison with business users
● Reference point for business-related issues and arbitration
● Should be recruited from the business community

Data Steward

● Member of the data stewardship committee


● Balances data and quality targets set by the business with IT/project
parameters
● Responsible for all issues relating to the data, including defining and
maintaining business and technical rules and liaising with the business and
technical communities
● Reference point for arbitration where data is put to different uses by separate
groups of users whose requirements have to be reconciled

The mix of personnel for a particular activity should be adequate to provide expertise in
each of the major business areas that will be undertaken in the project.

The success of the stewardship function relies on the early establishment and
distribution of standardized documentation and procedures. These should be
distributed to all of the team members working on stewardship activities.



The data stewardship committee should be involved in the following activities:

● Arbitration
● Sanity checking
● Preparation of metadata
● Support
● Assessment of new functionality

Arbitration

Arbitration means resolving data contention issues, deciding which is the best data to
use, and determining how this data should best be transformed and interpreted so that
it remains meaningful and consistent. This is particularly important during the phases
where ambiguity needs to be resolved, for example, when conformed dimensions and
standardized facts are being formulated by the analysis teams.

Sanity Checking

There is a role for the data stewardship committee to check the results and ensure that
the transformation rules and processes have been applied correctly. This is a key
verification task and is particularly important in evaluating prototypes developed in the
Analyze Phase, during testing, and after the project goes live.

Preparation of Metadata

The data stewardship committee should be actively involved in the preparation and
verification of technical and business metadata. Specific tasks are:

● Determining the structure and contents of the metadata


● Determining how the metadata is to be collected
● Determining where the metadata is to reside
● Determining who is likely to use the metadata
● Determining what business benefits are provided
● Determining how the metadata is to be acquired

Depending on the tools used to determine the metadata (for example, PowerCenter
Profiling option, Informatica Data Explorer), the Data Steward may take a lead role in
this activity.



● Business metadata - The purpose of maintaining this type of information is to
clarify context, aid understanding, and provide business users with the ability
to perform high level searches for information.

Business metadata is used to answer questions such as: "How does this division of the enterprise calculate revenue?"

● Technical metadata - The purpose of maintaining this type of information is for impact analysis, auditing, and source-target analysis.

Technical metadata is used to perform analysis such as: “What would be the
impact of changing the length of a field from 20 to 30 characters and what
systems would be affected?”

Support

The data stewardship committee should be involved in the inception and preparation of user community training by answering questions about the data and the tools available to perform analytics. During the Analyze Phase, the team would provide
inputs to induction training programs prepared for system users when the project goes
live. Such programs should include, for example, technical information about how to
query the system and semantic information about the data that is retrieved.

New Functionality

The data stewardship committee needs to assess any major additions to functionality.
The assessment should consider return on investment, priority, and scalability in terms
of new hardware/software requirements. There may be a need to perform this activity
during the Analyze Phase if functionality that was initially overlooked is to be included
in the scope of the project. After the project has gone live, this activity is of key
importance because new functionality needs to be assessed for ongoing development.

Best Practices
None

Sample Deliverables
None



Last updated: 15-Feb-07 17:55



Phase 2: Analyze
Task 2.3 Define Business Scope

Description

The business scope forms the boundary that defines where the project begins and
ends. Throughout the project discussions about the business requirements and
objectives, it may appear that everyone views the project scope in the same way.
However, there is commonly confusion about what falls inside the boundary of a
specific project and what does not. Developing a detailed project scope and socializing
it with your project team, sponsors, and key stakeholders is critical.

Prerequisites
None

Roles

Business Analyst (Primary)

Data Architect (Primary)

Data Integration Developer (Primary)

Data Quality Developer (Primary)

Data Steward/Data Quality Steward (Secondary)

Metadata Manager (Primary)

Project Sponsor (Secondary)

Technical Architect (Primary)

Technical Project Manager (Primary)



Considerations

The primary consideration in developing the business scope is balancing the high-priority needs of the key beneficiaries with the need to provide results in the near term. The Project Manager and Business Analysts need to identify the key business needs and determine the feasibility of meeting those needs to establish a scope that provides value, typically within a 60- to 120-day time frame.

Quick WINS (Ways to Implement New Solutions) are accomplishments achieved in a relatively short time, without great expense, and with a positive outcome; they can be included in the business scope.

Tip
As a general rule, involve as many project beneficiaries as possible in the needs
assessment and goal definition. A "forum" type of meeting may be the most efficient
way to gather the necessary information since it minimizes the amount of time
involved in individual interviews and often encourages useful dialog among the
participants. However, it is often difficult to gather all of the project beneficiaries and
the project sponsor together for any single meeting, so you may have to arrange
multiple meetings and summarize the input for the various participants.

A common mistake made by project teams is to define the project scope only in general
terms. This lack of definition causes managers and key beneficiaries throughout the
company to make assumptions related to their own processes or systems falling inside
or outside of the scope of the project. Then later, after significant work has been
completed by the project team, some managers are surprised to learn that their
assumptions were not correct, resulting in problems for the project team. Other project
teams report problems with "scope creep" as their project gradually takes on more and
more work. The safest rule is “the more detail, the better” along with details regarding
what related elements are not within scope or will be delayed to a later effort.

Best Practices
None

Sample Deliverables
None

Last updated: 18-May-08 17:35



Phase 2: Analyze
Subtask 2.3.1 Identify Source Data Systems

Description

Before beginning any work with the data, it is necessary to determine precisely what
data is required to support the data integration solution. In addition, the developers
must also determine what source systems house the data, where the data resides in
the source systems, and how the data is accessed.

In this subtask, the development project team needs to validate the initial list of source
systems and source formats and obtain documentation from the source system owners
describing the source system schemas. For relational systems, the documentation
should include Entity-Relationship diagrams (E-R diagrams) and data dictionaries, if
available. For file based data sources (e.g., unstructured, semi-structured and complex
XML) documentation may also include data format specifications for both internal and
public (in the case of open data format standards) and any deviations from public
standards. The development team needs to carefully review the source system
documentation to ensure that it is complete (i.e., specifies data owners and
dependencies) and current. The team also needs to ensure that the data is fully
accessible to the developers and analysts that are building the data integration solution.

Prerequisites
None

Roles

Business Analyst (Primary)

Data Architect (Primary)

Data Integration Developer (Primary)

Data Quality Developer (Primary)

Data Transformation Developer (Primary)



Considerations

In determining the source systems for data elements, it is important to request copies
of the source system data to serve as samples for further analysis. This is a
requirement in 2.8.1 Perform Data Quality Analysis of Source Data, but is also important at this stage of development. As data volumes in the production environment are often large, it is advisable to request a subset of the data for evaluation purposes. However, requesting too small a subset can be dangerous in that it fails to provide a complete picture of the data and may hide quality issues that truly exist.

Another important element of the source system analysis is to determine the life
expectancy of the source system itself. Try to determine if the source system is likely
to be replaced or phased out in the foreseeable future. As companies merge, or
technologies and processes improve, many companies upgrade or replace their
systems. This can present challenges to the team as the primary knowledge of those
systems may be replaced as well. Understanding the life expectancy of the source
system will play a crucial part in the design process.

For example, assume you are building a customer data warehouse for a small bank.
The primary source of customer data is a system called Shucks, and you will be
building a staging area in the warehouse to act as a landing area for all of the source
data. After your project starts, you discover that the bank is being bought out by a
larger bank and that Shucks will be replaced within three months by the larger bank's
source of customer data: a system called Grins. Rather than redesign the entire data warehouse to handle the new source system, it may be possible to design a generic staging area that could fit any customer source system, instead of building a staging area based on one specific source system. Assuming that the bulk of your processing occurs after the data has landed in the staging area, you can minimize the impact of replacing source systems by designing a generic staging area that essentially allows you to plug in the new source system. Designing this type of staging area, however, takes a large amount of planning and adds time to the schedule, but it will be well worth the effort because the warehouse can then handle source system changes.
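
As a loose illustration of this idea (and only that), the sketch below shows one possible source-agnostic shape for a landing record, expressed in Python for brevity; in practice the staging area would be a database table, and the field names here are assumptions.

# Illustrative only: a source-agnostic staging record in which each incoming
# attribute lands as a name/value pair tagged with its source system, so a new
# source (e.g., Grins replacing Shucks) can be plugged in without a redesign.
from dataclasses import dataclass
from datetime import datetime


@dataclass
class StagingRecord:
    source_system: str      # e.g., "Shucks" today, "Grins" after the acquisition
    source_entity: str      # logical entity in the source, e.g., "customer"
    source_key: str         # natural key of the record in the source system
    attribute_name: str     # source attribute, e.g., "last_name"
    attribute_value: str    # raw value, kept as text in the landing area
    load_timestamp: datetime


record = StagingRecord(
    source_system="Shucks",
    source_entity="customer",
    source_key="CUST-000123",
    attribute_name="last_name",
    attribute_value="Smith",
    load_timestamp=datetime(2008, 5, 20, 9, 30),
)
print(record)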

For Data Migration, the source systems that are in scope should be understood at the start of the project. During the Analyze Phase, these systems should be confirmed and communicated to all key stakeholders. If there is a disconnect between which systems are in and out of scope, it is important to document and analyze the impact. Identifying new source systems may dramatically increase the amount of resources needed on the project and require re-planning. Make a point to over-communicate which systems are in scope.



Best Practices
None

Sample Deliverables
None

Last updated: 20-May-08 19:28



Phase 2: Analyze
Subtask 2.3.2 Determine Sourcing Feasibility

Description

Before beginning to work with the data, it is necessary to determine precisely what data
is required to support the data integration solution.

In addition, the developers must determine:

● what source systems house the data.


● where the data resides in the source systems.
● how the data is accessed.

Take care to focus only on data that is within the scope of the requirements.
Involvement of the business community is important in order to prioritize the business
data needs based upon how effectively the data supports the users' top priority
business problems.

Determining sourcing feasibility is a two-stage process, requiring:

● A thorough and high-level understanding of the candidate source systems.


● A detailed analysis of the data sources within these source systems.

Prerequisites
None

Roles

Application Specialist (Primary)

Business Analyst (Primary)

Data Architect (Primary)



Data Quality Developer (Primary)

Metadata Manager (Primary)

Considerations

In determining the source systems for data elements, it is important to request copies
of the source system data to serve as samples for further analysis. Because data
volumes in the production environment are often large, it is advisable to request a
subset of the data for evaluation purposes. However, requesting too small a subset can
be dangerous in that it fails to provide a complete picture of the data and may hide any
quality issues that exist.

Particular care needs to be taken when archived historical data (e.g., data archived on
tapes) or syndicated data sets (i.e., externally provided data such as market research)
is required as a source to the data integration application. Additional resources and
procedures may be required to sample and analyze these data sources.

Candidate Source System Analysis

A list of business data sources should have been prepared during the business
requirements phase. This list typically identifies 20 or more types of data that are
required to support the data integration solution and may include, for example, sales
forecasts, customer demographic data, product information (e.g., categories and
classifiers), and financial information (e.g., revenues, commissions, and budgets).

The candidate source systems (i.e., where the required data can be found) can be
identified based on this list. There may be a single source or multiple sources for the
required data.

Types of source include:

● Operational sources — The systems an organization uses to run its business. It may be any combination of the ERP and legacy operational systems.
● Strategic sources — The data may be sourced from existing strategic decision
support systems; for example, executive information systems.
● External sources — Any information source provided to the organization by an
external entity, such as Nielsen marketing data or Dun & Bradstreet.



The following checklist can help to evaluate the suitability of data sources, which can
be particularly important for resolving contention amongst the various sources.

● Appropriateness of the data source with respect to the underlying business functions.
● Source system platform.
● Unique characteristics of the data source system.
● Source systems boundaries with respect to the scope of the project being
undertaken.
● The accuracy of the data from the data source.
● The timeliness of the data from the data source.
● The availability of the data from the data source.
● Current and future deployment of the source data system.
● Access licensing requirements and limitations.

Consider, for example, a low-latency data integration application that requires credit
checks to be performed on customers seeking a loan. In this case, the relevant source
systems may be:

● A call center that captures the initial transactional request and passes this
information in real time to a data integration application.
● An external system against which a credibility check needs to be performed by
the data integration application (i.e., to determine a credit rating).
● An internal data warehouse accessed by the data integration application to
validate and complement the information.

Timeliness, reliability, accuracy of data, and a single source for reference data may be
key factors influencing the selection of the source systems. Note that projects typically
under-estimate problems in these areas. Many projects run into difficulty because poor
data quality, both at high (metadata) and low (record) levels, impacts the ability to
perform transform and load operations.

An appreciation of the underlying technical feasibility may also impact the choice of
data sources and should be within the scope of the high-level analysis being
undertaken. This activity involves compiling information about the “as is” and “as will be” technological landscape that affects the characteristics of the data source systems and their impact on the data integration solution. Factors to consider in this survey are:

● Current and future organizational standards.
● Infrastructure.
● Services.
● Networks.
● Hardware, software, operational limitations.
● Best Practices.
● Migration strategies.
● External data sources.
● Security criteria.

For B2B solutions, solutions with significant file-based data sources, and other solutions with complex data transformation requirements, it is also necessary to assess data sizes, volumes, and the frequency of data updates with respect to the ability to parse and transform the data, and the implications that this will have on hardware and software requirements.

A high-level analysis should also allow for the early identification of risks associated
with the planned development, for example:

● If a source system is likely to be replaced or phased out in the foreseeable future. As companies merge, or technologies and processes improve, many
companies upgrade or replace their systems. This can present challenges to
the team, as the primary knowledge of those systems may be replaced as
well. Understanding the life expectancy of the source system plays a crucial
part in the design process.
● If the scope of the development is larger than anticipated.
● If the data quality is determined to be inadequate in one or more respects.

Completion of this high-level analysis should reveal a number of feasible source systems, as well as the points of contact and owners of the source systems. Further, it
should produce a checklist identifying any issues about inaccuracies of the data
content, gaps in the data, and any changes in the data structures over time.

Data Quality

The next step in determining source feasibility is to perform a detailed analysis of the
data sources, both in structure and in content, and to create an accurate model of the
source data systems. Understanding data sources requires the participation of a data
source expert/Data Quality Developer and a business analyst to clarify the relevance, technical content, and business meaning of the source data.

A complete set of technical documentation and application source code should be available for this step. Documentation should include Entity-Relationship diagrams (E-R
diagrams) for the source systems; these diagrams then serve as the blueprints for
extracting data from the source systems.

It is important not to rely solely on the technical documentation to obtain accurate descriptions of the source data, since this documentation may be out of date and
inaccurate. Data profiling is a useful technique to determine the structure and integrity
of the data sources, particularly when used in conjunction with the technical
documentation. The data profiling process involves analyzing the source data, taking
an inventory of available data elements, and checking the format of those data
elements. It is important to work with the actual source data, either the complete
dataset or a representative subset, depending on the data volume. Using sample data
derived from the actual source systems is essential for identifying data quality issues
and for determining if the data meets the business requirements of the organization.
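
As a loose illustration of what basic profiling can surface, the sketch below inventories a small sample extract and applies simple format checks. The column names, sample values, and format rules are invented for illustration; a dedicated profiling tool such as Informatica Data Explorer would normally perform this analysis at scale.

# Minimal profiling sketch over a small sample extract (illustrative data only).
import pandas as pd

sample = pd.DataFrame({
    "customer_id": [101, 102, 102, None],
    "country_code": ["US", "us", "GBR", "DE"],
    "signup_date": ["2007-01-15", "15/01/2007", "2007-02-28", None],
})

# Inventory of available elements: counts, nulls, and distinct values per column.
profile = pd.DataFrame({
    "non_null_count": sample.notna().sum(),
    "null_count": sample.isna().sum(),
    "distinct_values": sample.nunique(dropna=True),
})
print(profile)

# Simple format and integrity checks: ISO-style dates, two-letter upper-case
# country codes, and duplicated natural keys.
dates = sample["signup_date"].dropna()
countries = sample["country_code"].dropna()
keys = sample["customer_id"].dropna()

print("Non-ISO dates:", list(dates[~dates.str.match(r"^\d{4}-\d{2}-\d{2}$")]))
print("Invalid country codes:", list(countries[~countries.str.match(r"^[A-Z]{2}$")]))
print("Duplicated keys:", sorted(set(keys[keys.duplicated(keep=False)])))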

The output of the data profiling effort is a survey, whose recipients include the data
stewardship committee, which documents:

● Inconsistencies of data structures with respect to the documentation.


● Gaps in data.
● Invalid data.
● Missing data.
● Missing documentation.
● Inconsistencies of data with respect to the business rules.
● Inconsistencies in standards and style.
● An assessment of whether the source data is in a suitable condition for extraction.
● Re-engineering requirements to correct content errors.

Bear in mind that the issue of data quality can cleave in two directions: discovering the
structure and metadata characteristics of the source data, and analyzing the low-level
quality of the data in terms of record accuracy, duplication, and other metrics. In-depth
structural and metadata profiling of the data sources can be conducted through
Informatica Data Explorer. Low-level/per-record data quality issues also must be
uncovered and, where necessary, corrected or flagged for correction at this stage in the
project. See 2.8 Perform Data Quality Audit for more information on required data
quality and data analysis steps.



Determine Source Availability

The next step is to determine when all source systems are likely to be available for data
extraction. This is necessary in order to determine realistic start and end times for the
load window. The developers need to work closely with the source system
administrators during this step because the administrators can provide specific
information about the hours of operations for their systems.

The Source Availability Matrix lists all the sources that are being used for data
extraction and specifies the systems' downtimes during a 24-hour period. This matrix
should contain details of the availability of the systems on different days of the week,
including weekends and holidays.
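
A minimal sketch of how such a matrix might be represented is shown below; the system names, day types, and extraction windows are entirely hypothetical.

# Hypothetical source availability matrix: for each source system, the window
# (24-hour clock) during which it is available for extraction, by day type.
source_availability = {
    "Billing ERP":   {"weekday": ("22:00", "04:00"), "weekend": ("00:00", "23:59"), "holiday": None},
    "CRM":           {"weekday": ("20:00", "06:00"), "weekend": ("00:00", "23:59"), "holiday": ("00:00", "23:59")},
    "External feed": {"weekday": ("02:00", "03:00"), "weekend": None,               "holiday": None},
}


def extraction_window(system: str, day_type: str) -> str:
    """Render the extraction window for a system on a given day type."""
    window = source_availability[system][day_type]
    return "not available" if window is None else f"{window[0]}-{window[1]}"


for system in source_availability:
    print(system,
          "| weekday:", extraction_window(system, "weekday"),
          "| weekend:", extraction_window(system, "weekend"),
          "| holiday:", extraction_window(system, "holiday"))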

For Data Migration projects, access to data is not normally a problem, given the premise of the solution. Typically, data migration projects have high-level sponsorship and whatever is needed is provided. However, for smaller-impact projects it is important that direct access is provided to all systems that are in scope. If direct access is not available, timelines should be increased and risk items should be added to the project. Historically, most projects without direct access run over time due to the limited availability of key resources to provide extracted data. If this can be avoided by providing direct access, it should be.

Determine File Transformation Constraints

For solutions with complex data transformation requirements, the final step is to
determine the feasibility of transforming the data to target formats and any implications
that will have on the eventual system design.

Very large flat file formats often require splitting processes to be introduced into the design in order to break the data into chunks of manageable size for subsequent
processing. This will require identification of appropriate boundaries for splitting and
may require additional steps to convert the data into formats that are suitable for
splitting.

For example, large PDF-based data sources may require conversion into another format, such as XML, before the data can be split.
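
A minimal sketch of record-boundary splitting for a large delimited flat file follows; the chunk size, file naming, and single-header assumption are illustrative, and any upstream conversion (such as PDF to XML) is assumed to have been handled by a dedicated transformation tool.

# Split a large delimited flat file into chunks on record (line) boundaries so
# that each chunk can be parsed and transformed independently. Illustrative only.
from typing import List


def split_flat_file(path: str, records_per_chunk: int = 100000) -> List[str]:
    chunk_paths: List[str] = []
    with open(path, "r", encoding="utf-8") as source:
        header = source.readline()  # assume a single header record
        chunk: List[str] = []
        chunk_no = 0
        for record in source:
            chunk.append(record)
            if len(chunk) == records_per_chunk:
                chunk_paths.append(write_chunk(path, chunk_no, header, chunk))
                chunk, chunk_no = [], chunk_no + 1
        if chunk:
            chunk_paths.append(write_chunk(path, chunk_no, header, chunk))
    return chunk_paths


def write_chunk(path: str, chunk_no: int, header: str, records: List[str]) -> str:
    """Write one chunk, repeating the header so each file is independently parseable."""
    chunk_path = f"{path}.part{chunk_no:04d}"
    with open(chunk_path, "w", encoding="utf-8") as target:
        target.write(header)
        target.writelines(records)
    return chunk_path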

Best Practices
None



Sample Deliverables
None

Last updated: 20-May-08 19:37



Phase 2: Analyze
Subtask 2.3.3 Determine Target Requirements

Description

This subtask provides detailed business requirements that lead to design of the target
data structures for a data integration project. For Operational Data Integration projects,
this may involve identifying a subject area or transaction set within an existing
operational schema or a new data store. For Data Warehousing / Business Intelligence
projects, this typically involves putting some structure to the informational
requirements. The preceding business requirements tasks (see Prerequisites) provide
a high-level assessment of the organization's business initiative and provide business
definitions for the information desired.

Note that if the project involves enterprise-wide data integration, it is important that the
requirements process involve representatives from all interested departments and that
those parties reach a semantic consensus early in the process.

Prerequisites
None

Roles

Application Specialist (Secondary)

Business Analyst (Primary)

Data Architect (Primary)

Data Steward/Data Quality Steward (Secondary)

Data Transformation Developer (Secondary)

Metadata Manager (Primary)



Technical Architect (Primary)

Considerations

Operational Data Integration

For an operational data integration project, requirements should be based on existing or defined business processes. However, for data warehousing projects, strategic
information needs must be explored to determine the metrics and dimensions desired.

Metrics

Metrics should indicate an actionable business measurement. An example for a consultancy might be:

"Compare the utilization rate of consultants for period x, segmented by industry, for each of the major geographies as compared to the prior period"

Often a mix of financial (e.g., budget targets) and operational (e.g., trends in customer
satisfaction) key performance metrics is required to achieve a balanced measure of the
organizational performance.

The key performance metrics may be directly sourced from an existing operational
system or may require integration of data from various systems. Market analytics may
indicate a requirement for metrics to be compared to external industry performance
criteria.

The key performance metrics should be agreed upon through a consensus of the business users to provide common and meaningful definitions. This facilitates the
design of processes to treat source metrics that may arrive in a variety of formats from
various source systems.

Dimensions

The key to determining dimension requirements is to formulate a business-oriented description for the segmentation requirements for each of the desired metrics. This may
involve an iterative process of interaction with the business community during
requirements gathering sessions, paying attention to words such as “by” and “where”.



For example, a Pay-TV operator may be interested in monitoring the effectiveness of a
new campaign geared at enrolling new subscribers. In a simple case, the number of
new subscribers would be an essential metric; it may, however, be important to the
business community to perform an analysis based on the dimensions (e.g., by
demography, by group, or by time).

A technical consideration at this stage is to understand whether the dimensions are likely to be rapidly changing or slowly changing, since this can affect the structure of an
eventual data model built from this analysis. Rapidly-changing dimensions are those
whose values may change frequently over their lifecycle (e.g., a customer attribute that
changes many times a year) as opposed to a slowly-changing dimension such as an
organization that may only change when a reorganization occurs.

It is also important at this stage to determine as many likely summarization levels of a dimension as possible. For example, time may have a hierarchical structure comprising
year, quarter, month, and day while geography may be broken down into Major Region,
Area, Subregion, etc. It is also important to clarify the lowest level of detail that is
required for reporting.
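
The sketch below (with invented subscriber data) illustrates how a single metric rolls up along such dimension hierarchies, from the lowest reported grain to a higher summarization level.

# Illustrative only: a metric (new_subscribers) summarized along a geography
# hierarchy (region > area) and a time hierarchy (year > quarter).
import pandas as pd

facts = pd.DataFrame({
    "region":  ["EMEA", "EMEA", "EMEA", "AMER"],
    "area":    ["North", "North", "South", "East"],
    "year":    [2008, 2008, 2008, 2008],
    "quarter": ["Q1", "Q2", "Q1", "Q1"],
    "new_subscribers": [120, 150, 90, 200],
})

# Lowest reported grain: area by quarter.
by_area_quarter = facts.groupby(
    ["region", "area", "year", "quarter"], as_index=False)["new_subscribers"].sum()
print(by_area_quarter)

# Roll up to a higher summarization level: region by year.
by_region_year = facts.groupby(["region", "year"], as_index=False)["new_subscribers"].sum()
print(by_region_year)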

The metric and dimension requirements should be prioritized according to perceived business value to aid in the discussion of project scope in case there are choices to
make regarding what to include or exclude.

Data Migration Projects

Data migration projects should be exclusively driven by the target system needs, not by
what is available in the source systems. Therefore, it is recommended to identify the
target system needs early in the Analyze Phase and focus the analysis activities on
those objects.

B2B Projects

For B2B and non-B2B projects that have significant flat file-based data targets,
consideration needs to be given to the target data to be generated. Considerations
include:

● What are target file and data formats?


● What sizes of target files need to be supported? Will they require
recombination of multiple intermediate data formats?
● Are there applicable intermediate or target canonical formats that can be created or leveraged?
● What XML schemas are needed to support the generation of the target
formats?
● Do target formats conform to well known proprietary or open data format
standards?
● Does target data generation need to be accomplished within specific time or
other performance related thresholds?
● How are errors both in data received and in overall B2B operation
communicated back to the internal operations staff and to external trading
partners?
● What mechanisms are used to send data back to external partners?
● What applicable middleware, communications and enterprise application
software is used in the overall B2B operation? What data transformation
implications does the choice of middleware and infrastructure software
impose?
● How is overall B2B interaction governed? What process flows are involved in
the system and how are they managed (for example via B2B Data Exchange,
external BPM software etc.)?
● Are there machine readable specifications that can be leveraged directly or on
modification to support “Specification driven transformation” based creation of
data transformation scripts?
● Is sample data available for testing and verification of any data transformation
scripts created?

At a higher level, the number and complexity of data sources, the number and
complexity of data targets and the number and complexity of intermediate data formats
and schemas determine the overall scope of the data transformation and integration
aspects of B2B data integration projects as a whole.

Best Practices
None

Sample Deliverables
None

Last updated: 20-May-08 19:46



Phase 2: Analyze
Subtask 2.3.5 Build Roadmap for Incremental Delivery

Description

Data Integration projects, whether data warehousing or operational data integration, are often large-scale, long-term
projects. This can also be the case with analytics visualization projects or metadata reporting/management projects. Any
complex project should be considered a candidate for incremental delivery. Under this strategy, the project's comprehensive objectives are broken up into prioritized deliverables, each of which can be completed within approximately three months. This produces near-term deliverables that provide early value to the business (which can be helpful in funding discussions) and, at the same time, an important avenue for early end-user feedback that may enable the development team to avoid major problems. This feedback may point out misconceptions or other design flaws which, if undetected, could cause costly rework later on.

This roadmap, then, provides the project stakeholders with a rough timeline for completion of their entire objective, but also
communicates the timing of these incremental sub-projects based on their prioritization. Below is an example of a timeline
for a Sales and Finance data warehouse with the increments roughly spaced each quarter. Each increment builds on the
completion of the prior increment, but each delivers clear value in itself.

[Example roadmap, with increments roughly one per quarter from Q1 Yr 1 through Q1 Yr 2: Implement Data Warehouse Architecture; Revenue Analytics; Complete Bookings, Billings, Backlog; GL Analytics; COGS Analysis.]

Prerequisites
None

Roles

Business Project Manager (Primary)

Data Architect (Primary)

Data Steward/Data Quality Steward (Secondary)

Project Sponsor (Secondary)

Technical Architect (Primary)

Technical Project Manager (Primary)

Considerations

The roadmap is the culmination of business requirements analysis and prioritization. The business requirements are
reviewed for logical subprojects (increments), source analysis is reviewed to provide feasibility, and business priorities are
used to set the sequence of the increments, factoring in feasibility and the interoperability or dependencies of the
increments.



The objective is to start with increments that are highly feasible, have no dependencies and provide significant value to the
business. One or two of these “quick hit” increments are important to build end-user confidence and patience, as the later,
more complex increments may be harder to deliver. It is critical to gain the buy-in of the main project stakeholders regarding
priorities and agreement on the roadmap sequence.

Advantages of incremental delivery include:

● Customer value is delivered earlier – the business sees an early start to its ROI.
● Early increments elicit feedback and sometimes clearer requirements that will be valuable in designing the later
increments.
● Much lower risk of overall project failure because of the plan for early, attainable successes.
● Highly likely that even if all of the long-term objectives are not achieved (they may prove infeasible or lose favor with
the business), the project still provides the value of the increments that are completed.
● Because the early increments reflect high-priority business needs, they may attract more visibility and have greater
perceived value than the project as a whole.

Disadvantages can include:

● There is always some extra effort involved in managing the release of multiple increments. However, there is less
risk of costly rework effort due to misunderstood (or changing) requirements because of early feedback from end-
users.
● There may be schema redesign or other rework necessary after initial increments because of unforeseen
requirements or interdependencies.

Best Practices
None

Sample Deliverables
None

Last updated: 20-May-08 17:20



Phase 2: Analyze
Task 2.4 Define Functional Requirements

Description

For any project to be ultimately successful, it must resolve the business objectives in a
way that the end users find easy to use and satisfactory in addressing their needs. A
functional requirements document is necessary to ensure that the project team
understands these needs in detail and is capable of proceeding with a system design
based upon the end user needs. The business drivers and goals provide a high-level
view of these needs and serve as the starting point for the detailed functional
requirements document. Business rules and data definitions further clarify specific
business requirements and are very important in developing detailed functional
requirements and ultimately the design itself.

Prerequisites
None

Roles

Business Project Manager (Review Only)

Considerations

Different types of projects require different functional requirements analysis processes. For example, an understanding of how key business users will use analytics reporting
should drive the functional requirements for a business analytics project, while the
requirements for data migration or operational data integration projects should be
based on an analysis of the target transactions they are expected to support and what
the receiving system needs in order to process the incoming data. Requirements for
metadata management projects involve reviewing IT requirements for reporting and
managing project metadata, surveying the corporate information technology landscape
to determine potential sources of metadata, and interviewing potential users to
determine reporting needs and preferences.

Business Analytics Projects



Developers need to understand the end users' expectations and preferences in terms of analytic reporting in order to determine the functional requirements for data warehousing or analytics projects. This understanding helps to determine the details regarding what data to provide, with what frequency, at what level of summarization and periodicity, with what special calculations, and so forth.

The analysis may include studying existing reporting, interviewing current information
providers (i.e., those currently developing reports and analyses for Finance and other
departments), and even reviewing mock-ups and usage scenarios with key end-users.

Data Migration

Functional requirements analysis for data migration projects involves a thorough understanding of the target transactions within the receiving system(s) and how the
systems will process the incoming data for those transactions. The business
requirements should indicate frequency of load for migration systems that will be run in
parallel for a period of time (i.e., repeatedly).

Operational Data Integration

These projects are similar to data migration projects in terms of the need to understand
the target transactions and how the data will be processed to accommodate them. The
processing may involve multiple load steps, each with a different purpose, some
operational and perhaps some for reporting. There may also be real-time requirements for some, and there may be a need for interfaces with queue-based messaging systems in situations where EAI-type integration between operational databases is involved or where there are master data management requirements.

Data Integration Projects

For all data integration projects (i.e., all of the above), developers also need to review
the source analysis with the DBAs to determine the functional requirements of the
source extraction processes.

B2B Projects

For B2B projects and flat file/XML-based data integration projects, the data formats that
are required for trading partners to interact with the system, the mechanisms for trading
partners and operators to determine the success and failure of transformations, and the internal interactions with legacy systems and other applications, all form part of the requirements of the system. These, in turn, may impose additional user interface and/or EAI-type integration requirements.

For large B2B projects, overall business process management will typically form part of
the overall system, which may impose requirements around the use of partner
management software such as B2B Data Exchange and/or business process
management software.

Often B2B systems may have real-time requirements and involve the use of interfaces
with queue-based messaging systems, web services and other application integration
technologies.

While these are technical rather than business requirements, for Business Process Outsourcing and other types of B2B interaction, such technical considerations often form a core component of the business operation.

Building the Specifications

For each distinct set of functional requirements, the Functional Requirements Specifications template can provide a valuable guide for determining the system
constraints, inputs, outputs, and dependencies. For projects using a phased approach,
priorities need to be assigned to functions based on the business needs, dependencies
for those functions, and general development efficiency. Prioritization will determine in
what phase certain functionality is delivered.

Best Practices
None

Sample Deliverables
None

Last updated: 20-May-08 19:51



Phase 2: Analyze
Task 2.5 Define Metadata Requirements

Description

Metadata is often articulated as ‘data about data’. It is the collection of information that
further describes the data used in the data integration project. Examples of metadata
include:

● Definition of the data element


● Business names of the element
● System abbreviations for that element
● The data type (string, date, decimal, etc.)
● Size of the element
● Source location
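
As an illustration only (the element name, values, and structure below are hypothetical and not
part of any Informatica product), a single data dictionary entry capturing these attributes might
be sketched in Python as:

from dataclasses import dataclass

@dataclass
class MetadataElement:
    definition: str           # business definition of the data element
    business_name: str        # name used by the business community
    system_abbreviation: str  # abbreviation used in the source system
    data_type: str            # string, date, decimal, etc.
    size: int                 # length/precision of the element
    source_location: str      # system and table/file where the element originates

customer_birth_date = MetadataElement(
    definition="Date on which the customer was born",
    business_name="Customer Date of Birth",
    system_abbreviation="CUST_DOB",
    data_type="date",
    size=10,
    source_location="CRM.CUSTOMER.DOB",
)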

In terms of flat file and XML sources, metadata can include open and proprietary data
standards and an organization’s interpretations of those standards. In addition to the
previous examples, flat file metadata can include:

● Standards documents governing layout and semantics of data formats and
interchanges.
● Companion or interpretation guides governing an organization’s interpretation
of data in a particular standard.
● Specifications of transformations between data formats.
● COBOL copybook definitions for flat file data to be passed to legacy or
backend systems.

All of these pieces of metadata are of interest to various members of the metadata
community: some are of interest only to certain technical staff members, while other
pieces may be very useful for business people attempting to navigate through the
enterprise data warehouse or across various business/subject area-oriented data
marts. That is, metadata can provide answers to such typical business
questions as:



● What does a particular piece of data mean (i.e., its definition in business
terminology)?
● What is the time scale for some number?
● How is some particular metric calculated?
● Who is the data owner?

Metadata also provides answers to technical questions:

● What does this mapping do (i.e., source to target dependency)?
● How will a change over here affect things over there (i.e., impact analysis)?
● Where are the bottlenecks (i.e., in reports or mappings)?
● How current is my information?
● What is the load history for a particular object?
● Which reports are being accessed most frequently and by whom?

The components of a metadata requirements document include:

● Decision on how metadata will be used in the organization
● Assign data ownership
● Decision on who should use what metadata, and why, and how
● Determine business and source system definitions and names
● Determine metadata sources (i.e., modeling tools, databases, ETL, BI, OLAP,
XML Schemas, etc.)
● Determine training requirements
● Determine the quality of the metadata sources (i.e., absolute, relative,
historical, etc.)
● Determine methods to consolidate metadata from multiple sources
● Identify where metadata will be stored (e.g., central, distributed, or both)
● Evaluate the metadata products and their capabilities (i.e., repository-based,
CASE dictionary, warehouse manager, etc.)
● Determine responsibility for:

❍ Capturing
❍ Establishing standards and procedures
❍ Maintaining and securing the metadata
❍ Proper use, quality control, and update procedures



● Establish metadata standards and procedures
● Define naming standards (i.e., abbreviations, class words, code values, etc.)
● Create a Metadata committee
● Determine if the metadata storage will be active or passive
● Determine the physical requirements of the metadata storage
● Determine and monitor measures to establish the use and effectiveness of the
metadata.

Prerequisites
None

Roles

Application Specialist (Review Only)

Business Analyst (Primary)

Business Project Manager (Primary)

Data Architect (Primary)

Data Steward/Data Quality Steward (Primary)

Database Administrator (DBA) (Primary)

Metadata Manager (Primary)

System Administrator (Primary)

Considerations

One of the primary objectives of this subtask is to attain broad consensus among all
key business beneficiaries regarding metadata business requirements and priorities; it
is therefore critical to obtain as much participation as possible in this process.



B2B Projects

For B2B and flat file-oriented data integration projects, metadata is often defined in less
structured forms than data dictionaries or other traditional means of managing
metadata. The process of designing the system may include the need to determine and
document the metadata consumed and produced by legacy and 3rd party systems. In
some cases applicable metadata may need to be mined from sample operational data
or from unstructured and semi-structured system documentation.

For B2B projects, getting adequate sample source and target data can become a
critical part of defining the metadata requirements.

Best Practices
None

Sample Deliverables
None

Last updated: 20-May-08 20:05



Phase 2: Analyze
Subtask 2.5.1 Establish
Inventory of Technical
Metadata

Description

Organizations undertaking new initiatives require access to consistent and reliable data
resources. Confidence in the underlying information assets and an understanding of
how those assets relate to one another can provide valuable leverage in the
strategic decision-making process. As organizations grow through mergers and
consolidations, systems that generate data become isolated resources unless they are
properly integrated. Integrating these data assets and turning them into key
components of the decision-making process requires significant effort.

Metadata is required for a number of purposes:

● Provide a data dictionary


● Assist with change management and impact analysis
● Provide a ‘system of record’ (lineage)
● Facilitate data auditing to comply with regulatory requirements
● Provide a basis on which formal data cleansing can be conducted
● Identify potential choices of canonical data formats
● Facilitate definition of data mappings

An inventory of sources (i.e., repositories) is necessary in order to understand the
availability and coverage of metadata, the ease of accessing and collating what is
available, and any potential gaps in metadata provisioning. The inventory is also the
basis on which the development of metadata collation and reporting can be planned. In
particular, if Metadata Manager is used, there may be a need to develop custom
resources to access certain metadata repositories, which can require significant effort.
A metadata inventory provides the basis on which informed estimates and project
plans can be prepared.

Prerequisites
None



Roles

Application Specialist (Review Only)

Business Analyst (Primary)

Data Steward/Data Quality Steward (Primary)

Metadata Manager (Primary)

Technical Architect (Primary)

Considerations

The first part of the process is to establish a Metadata Inventory that lists all metadata
sources.

This investigation will establish:

● The (generally recognized) name of each source.
● The type of metadata (usually the product maintaining it) and the format in
which it is kept (e.g., database type and version).
● The priority assigned to its investigation.
● Cross-references to other documents (e.g., design or modeling documents).
● The type of reporting expected from the metadata.
● The availability of an XConnect (assuming Metadata Manager is used) to
access the repository and collate the metadata.
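
A minimal sketch of one inventory entry, assuming the attributes listed above are captured in a
simple structure (all names and values here are hypothetical):

# One hypothetical entry in the metadata inventory; field names mirror the attributes above.
inventory_entry = {
    "source_name": "Enterprise data model",                    # generally recognized name of the source
    "metadata_type": "Data modeling tool",                     # product maintaining the metadata
    "storage_format": "Oracle 10g repository",                 # format/database in which it is kept
    "investigation_priority": 1,                               # priority assigned to investigation
    "cross_references": ["EDW logical model v2.3"],            # related design or modeling documents
    "expected_reporting": ["data lineage", "impact analysis"],
    "xconnect_available": True,                                # assuming Metadata Manager is used
}

for field, value in inventory_entry.items():
    print(f"{field}: {value}")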

The second part of the process is to investigate in detail those metadata repositories or
sources that will be required to meet the next phase of requirements. This investigation
will establish:

● Ownership and stewardship of the metadata (responsibilities of the owners
and stewards are usually pre-defined by an organization and not part of
preparing the metadata inventory).
● Existence of a metadata model (one will need to be developed if it does not
exist, usually by the Business Analysts and System Specialist).



● System and business definition of the metadata items.
● Frequency and methods of updating the repository.
● Extent of any update history.
● Those elements required for reporting/analysis purposes.
● The quality of the metadata sources (i.e., ‘quality’ can be measured
qualitatively by a questionnaire issued to users, but may be better measured
against metrics that either exist within the organization or are proposed as part
of developing this inventory).
● The effort involved in developing a method of accessing/extracting metadata
(for Metadata Manager, a custom XConnect) if none already exists. (Ideally,
the estimates should be in person-days by skill, and include a list of
prerequisites and dependencies.)

B2B Projects

For B2B and flat file-oriented data integration projects, metadata is often defined and
maintained in non-database-oriented forms such as XML schemas or data format
specifications (and specifications as to how standards should be interpreted). Metadata
may need to be mined from sample data, legacy systems, and/or mapping
specifications.

Metadata repositories may take the form of document repositories using document
management or source control technologies.

In B2B systems, the responsibility for tracking metadata may shift to members of the
technical architecture team, as traditional database design, planning, and maintenance
may play a lesser role in these systems.

Best Practices
None

Sample Deliverables

Metadata Inventory

Last updated: 20-May-08 20:12



Phase 2: Analyze
Subtask 2.5.2 Review
Metadata Sourcing
Requirements

Description

Collect business requirements about the metadata that is expected to be stored and
analyzed. These requirements are determined by the reporting needs for metadata, as
well as details of the metadata source and the ability to extract and load this
information to the respective metadata repository or metadata warehouse. A thorough
understanding of these requirements is essential for a smooth and timely
implementation of any metadata analysis solution.

Prerequisites
None

Roles

Business Project Manager (Review Only)

Metadata Manager (Primary)

System Administrator (Review Only)

Considerations

Determine Metadata Reporting Requirements

Metadata reporting requirements should drive the specific metadata to be collected, as
well as the implementation of tools to collect, store, and display it. The need to expose
metadata to developers is quite different from the need to expose metadata to
operations personnel, which in turn differs from the need to expose metadata to
business users. Each of these pieces of the metadata picture requires different
information and can be stored in and handled by different metadata repositories.



Developers typically require metadata that helps determine how source information
maps to a target, as well as information that can help with impact analysis in the case
of change to a source or target structure or transformation logic. If there are data
quality routines to be implemented, source metadata can also help to determine the
best method for such implementation, as well as specific expectations regarding the
quality of the source data.

Operations personnel generally require metadata regarding the data integration
processes, business intelligence reporting, or both. This information is helpful in
determining issues or problems with delivering information to the end-user with
regard to items such as the expected source data volumes versus the volumes actually
processed; the time to run specific processes, and whether load windows are being
met; the number of end users running specific reports; the time of day reports are being
run and when the load on the system is highest; etc. This metadata allows operations
to address issues as they arise.
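
As an illustration of the kind of check this operational metadata enables (the session names, row
counts, and load window below are hypothetical; in practice the statistics would come from the
PowerCenter repository or the BI tool's own reporting), consider:

from datetime import datetime

# Hypothetical run statistics captured for nightly data integration sessions.
runs = [
    {"session": "s_load_orders",    "end": "2008-05-20 02:40",
     "expected_rows": 1_200_000, "actual_rows": 1_198_500},
    {"session": "s_load_customers", "end": "2008-05-20 05:15",
     "expected_rows":   300_000, "actual_rows":   110_000},
]

load_window_end = datetime(2008, 5, 20, 5, 0)   # loads must finish by 05:00
volume_tolerance = 0.10                         # flag runs more than 10% off the expected volume

for run in runs:
    finished = datetime.strptime(run["end"], "%Y-%m-%d %H:%M")
    late = finished > load_window_end
    off_volume = abs(run["actual_rows"] - run["expected_rows"]) / run["expected_rows"] > volume_tolerance
    if late or off_volume:
        print(f"{run['session']}: review needed (late={late}, volume deviation={off_volume})")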

When reviewing metadata, business users want to know how the data was generated
(and related) and what manipulation, if any, was performed to produce it. The
information of interest ranges from specific reference metadata (i.e., ontologies and
taxonomies) to the transformations and/or calculations that were used to create the
final report values.

Sources of Metadata

After initial reporting requirements are developed, the location and accessibility of the
metadata must be considered.

Some sources of metadata exist only in documentation, are “home grown” within the
systems that are used to perform specific tasks on the data, or exist as knowledge
gained through the course of working with the data. If it is important to include this
information in a metadata repository or warehouse, note that there is not likely to be an
automated method of extracting and loading this type of metadata. In the best case, a
custom process can be created to load this metadata; in the worst case, this
information must be entered manually.

Various other more formalized sources of metadata usually have automated methods
for loading to a metadata repository or warehouse. This includes information that is
held in data modeling tools, data integration platforms, database management systems
and business intelligence tools.

It is important to note that most sources of metadata that can be loaded in an
automated fashion contain mechanisms for holding custom or unstructured metadata,
such as description fields. This capability may obviate the need for creating custom
methods of loading metadata or manually entering the same metadata in various
locations.

Metadata Storage and Loading

For each of the various types of metadata reporting requirements mentioned, as well as
the various types of metadata sources, different methods of storage may fit better than
others and affect how the various metadata can be sourced.

In the case of metadata for developers and operations personnel, this type can
generally be found and stored in the repositories of the software used to accomplish
the tasks, such as the PowerCenter repository or the business intelligence software
repository. Usually, these software packages include sufficient reporting capability to
meet the required needs of this type of reporting. At the same time, most of these
metadata repositories include locations for manually entering metadata, as well as
automatically importing metadata from various sources.

Specifically, when using the PowerCenter repository as a metadata hub, there are
various locations where description fields can be used to hold unstructured or more
descriptive metadata. Mechanisms such as metadata extensions also allow for
user-defined fields of metadata. In terms of automated loading of metadata,
PowerCenter can import definitions from data modeling tools using Metadata
Exchange. Also, metadata from various sources, targets, and other objects can be
imported natively from the connections the PowerCenter software can make to these
systems, including items such as database management systems, ERP systems via
PowerConnects, and XML schema definitions.

In general, however, if robust reporting is required, or reporting across multiple
software metadata repositories, a metadata warehouse platform such as Informatica
Metadata Manager may be more appropriate to handle such functions.

In the case of metadata requirements for a business user, this usually requires a
platform that can integrate the metadata from various metadata sources, as well as
provide a relatively robust reporting function, which specific software metadata
repositories usually lack. Thus, in these cases, a platform like Metadata Manager is
optimal.

When using Metadata Manager, custom XConnects need to be created to
accommodate any metadata source that does not already have a pre-built loading
interface or any source where the pre-built interface does not extract all the required
metadata. (For details about developing a custom XConnect, refer to the Informatica
Metadata Manager 8.5.1 Custom Metadata Integration Guide.) Metadata Manager
contains various XConnect interfaces for data modeling tools, the PowerCenter data
integration platform, database management systems, and business intelligence tools.
(For a specific list, refer to the Metadata Manager Administrator Guide.)

Metadata Analysis and Reports

The specific types of analysis and reports must also be considered with regard to
specifically what metadata needs to be sourced.

For metadata repositories like PowerCenter, the available analysis is very specific and
little information beyond what is normally sourced into the repository can be available
for reporting.

In the case of a metadata warehouse platform such as Metadata Manager, more
comprehensive reporting can be created.

From a high-level, the following analysis is possible with Metadata Manager:

● Metadata browsing
● Metadata searching
● Where-used analysis
● Lineage analysis
● Packaged reports

Metadata Manager provides more specific metadata analysis to help analyze source
repository metadata, including:

● Business intelligence reports – to analyze a business intelligence system, such
as report information, user activity, and how long it takes to run reports.
● Data integration reports – to analyze data integration operations, such as
reports that identify data integration problems, and analyze data integration
processes.
● Database management reports – to explore database objects, such as
schemas, structures, methods, triggers, and indexes, and the relationships
among them.
● Metamodel reports – to analyze how metadata is classified for each repository.
(For more information about metamodels, refer to the Metadata Manager
Administrator Guide.)
● ODS reports – to analyze data in particular metadata repositories.

It may be possible that even with a metadata warehouse platform like Metadata
Manager, some analysis requirements cannot be fulfilled by the above-mentioned
features and out-of-the-box reports. Analysis should be performed to identify any gaps
and to determine if any customization or design can be done within Metadata Manager
to resolve the gaps.

Bear in mind that Informatica Data Explorer (IDE) also provides a range of source data
and metadata profiling and source-to-target mapping capabilities.

Best Practices
None

Sample Deliverables
None

Last updated: 09-May-08 13:52



Phase 2: Analyze
Subtask 2.5.3 Assess
Technical Strategies and
Policies

Description

Every IT organization operates using an established set of corporate strategies and
related development policies. Understanding and detailing these approaches may
require discussions ranging from Sarbanes-Oxley compliance to specific supported
hardware and software considerations. The goal of this subtask is to detail and assess
the impact of these policies as they relate to the current project effort.

Prerequisites
None

Roles

Application Specialist (Review Only)

Business Project Manager (Primary)

Data Architect (Primary)

Database Administrator (DBA) (Primary)

System Administrator (Primary)

Considerations

Assessing the impact of an enterprise’s IT policies may incorporate a wide range of
discussions covering an equally wide range of business and developmental areas. The
following types of questions should be considered in beginning this effort.

Overall

● Is there an overall IT Mission Statement? If so, what specific directives might
affect the approach to this project effort?

Environment

● What are the current hardware or software standards? For example, NT vs.
UNIX vs. Linux? Oracle vs. SQL Server? SAP vs. PeopleSoft?
● What, if any, data extraction and integration standards currently exist?
● What source systems are currently utilized? For example, mainframe? flat file?
relational database?
● What, if any, regulatory requirements exist regarding access to and historical
maintenance of the source data?
● What, if any, load window restrictions exist regarding system and/or source
data availability?
● How many environments are used in a standard deployment? For example: 1)
Development, 2) Test, 3) QA, 4) Pre-Production, 5) Production.
● What is or will be the end-user presentation layer?

Project Team

● What is a standard project team structure? For example, Project Sponsor,
Business Analyst, Project Manager, Developer, etc.
● Are dedicated support resources assigned? Or are they often shared among
initiatives (e.g., DBAs)?
● Is all development performed by full time employees? Are contractors and/or
offshore resources employed?

Project Lifecycle

● What is a typical development lifecycle? What are standard milestones? What
criteria are typically applied to establish production readiness?
● What change control mechanisms/procedures are in place? Are these
controls strictly policy-based, or is specific change-control software in use?
● What, if any, promotion/release standards are used?
● What is the standard for production support?

Metadata and Supporting Documentation



● What types of supporting documentation are typically required?
● What, if any, is the current metadata strategy within the enterprise?

Resolving the answers to questions such as these enables greater accuracy in
project planning, scoping, and staffing efforts. Additionally, the understanding gained
from this assessment ensures that any new project effort will better align its approach
with the established practices of the organization.

Best Practices
None

Sample Deliverables
None

Last updated: 15-Feb-07 18:11



Phase 2: Analyze
Task 2.6 Determine
Technical Readiness

Description

The goal of this task is to determine the readiness of an IT organization with respect to
its technical architecture, implementation of said architecture, and the associated
staffing required to support the technical solution. Conducting this analysis, through
interviews with the existing IT team members (such as those noted in the Roles
section), provides evidence as to whether or not the critical technologies and
associated support system are sufficiently mature as to not present significant risk to
the endeavor.

Prerequisites
None

Roles

Business Project Manager (Primary)

Database Administrator (DBA) (Primary)

System Administrator (Primary)

Technical Architect (Primary)

Considerations

Carefully consider the following questions when evaluating the technical readiness of a
given enterprise:

● Has the architecture team been staffed and trained in the assessment of
critical technologies?
● Have all of the decisions been made regarding the various components of the
infrastructure, including: network, servers, and software?



● Has a schedule been established regarding the ordering, installing, and
deployment of the servers and network?
● If in place, what are the availability, capacity, scalability, and reliability of the
infrastructure?
● Has the project team been fully staffed and trained, including but not limited to:
a Project Manager, Technical Architect, System Administrator, Developer(s),
and DBA(s)? (See 1.2.1 Establish Project Roles).
● Are proven implementation practices and approaches in place to ensure a
successful project? (See 2.5.3 Assess Technical Strategies and Policies).
● Has the Technical Architect evaluated and verified the Informatica
PowerCenter Quickstart configuration requirements?
● Has the repository database been installed and configured?

By gaining a better understanding of questions such as these, developers can achieve
a clearer picture of whether or not that organization is sufficiently ready to move
forward with the project effort. This information also helps to develop a more accurate
and reliable project plan.

Best Practices
None

Sample Deliverables
None

Last updated: 01-Feb-07 18:44



Phase 2: Analyze
Task 2.7 Determine
Regulatory Requirements

Description

Many organizations must now comply with a range of regulatory requirements such as
financial services regulation, data protection, Sarbanes-Oxley, retention of data for
potential criminal investigations, and interchange of data between organizations. Some
industries may also be required to complete specialized reports for government
regulatory bodies. This can mean prescribed reporting, detailed auditing of data, and
specific controls over actions and processing of the data.

These requirements differ from the "normal" business requirements in that they are
imposed by legislation and/or external bodies. The penalties for not precisely meeting
the requirements can be severe. However, there is a "carrot and stick" element to
regulatory compliance. Regulatory requirements and industry standards can also
present the business with an opportunity to improve its data processes and update the
quality of its data in key areas. Successful compliance — for example, in the banking
sector, with the Basel II Accord — brings the potential for more productive and
profitable uses of data.

As data is prepared for the later stages in a project, the project personnel must
establish what government or industry standards the project data must adhere to and
devise a plan to meet these standards. These steps include establishing a catalog of all
reporting and auditing required, including any prescribed content, formats, processes,
and controls. The definitions of content (e.g., inclusion/exclusion rules, timescales,
units, etc.) and any metrics or calculations, are likely to be particularly important.

Prerequisites
None

Roles

Business Analyst (Primary)

Business Project Manager (Review Only)



Legal Expert (Primary)

Considerations

Areas where requirements arise include the following:

● Sarbanes-Oxley regulations in the U.S. mean a proliferation of controls on
processes and data. Developers need to work closely with an organization’s
Finance Department to ascertain exactly how Sarbanes-Oxley affects the
project. There may be implications for how environments are set up and
controls for migration between environments (e.g., between Development,
Test, and Production), as well as for sign-offs, specified verification, etc.
● Another regulatory system applicable to financial companies is the Basel II
Accord. While Basel II does not have the force of law, it is a de facto
requirement within the international financial community.
● Other industries are demanding adherence to new data standards, both
communally, by coming together around common data models such as bar
codes and RFID (radio frequency identification), and individually, as
enterprises realize the benefits of synchronizing their data storage conventions
with suppliers and customers. Such initiatives are sometimes gathered under
the umbrella of Global Data Synchronization (GDS); the key benefit of GDS
is that it is not a compliance chore but a positive and profitable initiative for a
business.

If your project must comply with a government or industry regulation, or if the business
simply insists on high standards for its data (for example, to establish a “single version
of the truth” for items in the business chain), then you must increase your focus on data
quality in the project. 2.8 Perform Data Quality Audit is dedicated to performing a Data
Quality Audit that can provide the project stakeholders with a detailed picture of the
strengths and weaknesses of the project data in key compliance areas such as
accuracy, completeness, and duplication.

For example, compliance with a request for data under Section 314 of the USA-
PATRIOT Act is likely to be difficult for a business that finds it has large numbers of
duplicate records, or records that contain empty fields, or fields populated with default
values. Such problems should be identified and addressed before the data is moved
downstream in the project.

Regulatory requirements often require the ability to clearly audit the processes affecting
the data. This may require a metadata reporting system that can provide viewing and
reporting of data lineage and ‘where-used.’ Remember, such a system can produce
spin-off benefits for IT in terms of automated project documentation and impact
analysis.

Industry and regulatory standards for data interchange may also affect data model and
ETL designs. HIPAA and HL7-compliance may dictate transaction definitions that affect
healthcare-related projects, as may SWIFT or Basel II for finance-related data.

Potentially there are now two areas to investigate in more detail: data and metadata.

● Map the requirements back to the data and/or metadata required using a
standard modeling approach.
● Use data models and the metadata catalog to assess the availability and
quality of the required data and metadata. Use the data models of the systems
and data sources involved, along with the inventory of metadata.
● Verify that the target data models meet the regulatory requirements.

Processes and Auditing Controls

It is important that data can be audited at every stage of processing where it is


necessary. To this end, review any proposed processes and audit controls to verify that
the regulatory requirements can be met and that any gaps are filled.

Also, ensure that reporting requirements can be met, again filling any gaps. It is
important to check that the format, content, and delivery mechanisms for all reports
comply with the regulatory requirements.

Best Practices
None

Sample Deliverables
None

Last updated: 15-Feb-07 18:13



Phase 2: Analyze
Task 2.8 Perform Data
Quality Audit

Description

Data Quality is a key factor for several tasks and subtasks in the Analyze Phase. The
quality of the proposed project source data, in terms of both its structure and content, is
a key determinant of the specifics of the business scope and of the success of the
project in general. For information on issues relating primarily to data structure, see
subtask 2.3.2 Determine Sourcing Feasibility; this task focuses on the quality of the
data content.

Problems with the data content must be communicated to senior project personnel as
soon as they are discovered. Poor data quality can impede the proper execution of
later steps in the project, such as data transformation and load operations, and can
also compromise the business’ ability to generate a return on the project investment.
This is compounded by the fact that most businesses underestimate the extent of their
data quality problems. There is little point in performing a data warehouse, migration, or
integration project if the underlying data is in bad shape.

The Data Quality Audit is designed to analyze representative samples of the source
data and discover their data quality characteristics so that these can be articulated to
all relevant project personnel. The project leaders can then decide what actions, if any,
are necessary to correct data quality issues and ensure that the successful completion
of the project is not in jeopardy.

Prerequisites
None

Roles

Business Project Manager (Secondary)

Data Quality Developer (Primary)

Data Steward/Data Quality Steward (Primary)



Technical Project Manager (Secondary)

Considerations

The Data Quality Audit can typically be conducted very quickly, but the actual time
required is determined by the starting condition of the data and the success criteria
defined at the beginning of the audit. The main steps are as follows:

● Representative samples of source data from all main areas are provided to the
Data Quality Developer.
● The Data Quality Developer uses a data analysis tool to determine the quality
of the data according to several criteria.
● The Data Quality Developer generates summary reports on the data and
distributes these to the relevant roles for discussion and next steps.

Two important aspects of the audit are (1) the data quality criteria used, and (2) the
type of report generated.

Data Quality Criteria

You can define any number and type of criteria for your data quality. However, there
are six standard criteria:

● Accuracy is concerned with the general accuracy of the data in a dataset. It is
often determined by comparing the dataset with a reliable reference source,
for example, a dictionary file containing product reference data.
● Completeness is concerned with missing data, that is, fields in the dataset
that have been left empty or whose default values have been left unchanged.
For example, many data input fields have a default date setting of 01/01/1900.
If a record includes 01/01/1900 as a date of birth, it is highly likely that the field
was never populated.
● Conformity is concerned with data values of a similar type that have been
entered in a confusing or unusable manner, for example, telephone numbers
that include/omit area codes.
● Consistency is concerned with the occurrence of disparate types of data
records in a dataset created for a single data type (e.g., the combination of
personal and business information in a dataset intended for business data
only).
● Integrity is concerned with the recognition of meaningful associations
between records in a dataset. For example, a dataset may contain records for
two or more family members in a household but without any means for the
organization to recognize or use this information.
● Duplication is concerned with data records that duplicate one another’s
information, that is, with identifying redundant records in the dataset or records
with meaningful information in common. For example:

❍ A dataset may contain user-entered records for “Batch No. 12345” and
“Batch 12345”, where both records describe the same batch.
❍ A dataset may contain several records with common surnames and street
addresses, indicating that the records refer to a single household; this
type of information is relevant to marketing personnel.

This list is not absolute; the characteristics above are sometimes described with other
terminology, such as redundancy or timeliness. Every organization’s data needs are
different, and the prevalence and relative priority of data quality issues differ from one
organization and one project to the next. Note that the accuracy factor differs from the
other five factors in the following respect: whereas, for example, a pair of duplicate
records may be visible to the naked eye, it can be difficult to tell simply by “eye-balling”
whether a given data record is inaccurate. Accuracy can be determined by applying
fuzzy logic to the data or by validating the records against a verified reference data set.
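
The checks behind several of these criteria can be expressed as simple rules. The sketch below is
illustrative only; the field names, sample records, and phone-number pattern are hypothetical, and
in practice a tool such as IDQ would perform this work rather than hand-written code. It shows how
completeness, conformity, and duplication checks based on the examples above might look:

import re

# Hypothetical sample records used only to illustrate the checks.
records = [
    {"name": "Acme Corp", "dob": "01/01/1900", "phone": "5550100",         "batch": "Batch No. 12345"},
    {"name": "Acme Corp", "dob": "12/03/1975", "phone": "+1 212 555 0100", "batch": "Batch 12345"},
]

# Completeness: a default date such as 01/01/1900 usually indicates a field that was never populated.
incomplete = [r for r in records if r["dob"] in ("", "01/01/1900")]

# Conformity: telephone numbers should follow a single agreed pattern (here, "+CC AAA NNN NNNN").
phone_ok = re.compile(r"^\+\d{1,3} \d{3} \d{3} \d{4}$")
nonconforming = [r for r in records if not phone_ok.match(r["phone"])]

# Duplication: normalize batch identifiers so that "Batch No. 12345" and "Batch 12345" match.
def batch_key(value):
    return re.sub(r"[^0-9]", "", value)

seen, duplicates = set(), []
for r in records:
    key = batch_key(r["batch"])
    if key in seen:
        duplicates.append(r)
    else:
        seen.add(key)

print(len(incomplete), "incomplete,", len(nonconforming), "nonconforming,", len(duplicates), "duplicate")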

Best Practices

Developing the Data Quality Business Case

Sample Deliverables
None

Last updated: 21-Aug-07 14:06



Phase 2: Analyze
Subtask 2.8.1 Perform Data Quality Analysis of Source Data

Description

The data quality audit is a business rules-based approach that aims to help define project expectations through the use of data
quality processes (or plans) and data quality scorecards. It involves conducting a data analysis on the project data, or on
a representative sample of the data, and producing an accurate and qualified summary of the data’s quality. This subtask focuses
on data quality analysis. The results are processed and presented to the business users in the next subtask 2.8.2 Report
Analysis Results to the Business.

Prerequisites
None

Roles

Business Analyst (Secondary)

Data Quality Developer (Primary)

Data Steward/Data Quality Steward (Secondary)

Considerations

There are three key steps in the process:

1. Select Target Data

The main objective of this step is to meet with the data steward and business owners to identify the data sources to be analyzed.
For each data source, the Data Quality Developer will need all available information on the data format, content, and structure, as
well as input on known data quality issues. The result of this step is a list of the sources of data to be analyzed, along with
the identification of all known issues. These define the initial scope of the audit. The following figure illustrates selecting target
data from multiple sources.

2. Run Data Quality Analysis

This step identifies and quantifies data quality issues in the source data. Data quality analysis plans are configured in Informatica
Data Quality (IDQ) Workbench. (The plans should be configured in a manner that enables the production of scorecards in the
next subtask. A scorecard is a graphical representation of the levels of data quality in the dataset.) The plans designed at this
stage identify cases of incomplete or absent data values. Using IDQ, the Data Quality Developer can identify all such data
content issues.

Data analysis provides detailed metrics to guide the next steps of the audit. For example:

● For character data, analysis identifies all distinct values (such as code values) and their frequency distribution.
● For numeric data, analysis provides statistics on the highest, lowest, average, and total, as well as the number of positive
values, negative values, zero/null values, and any non-numeric values.
● For dates, analysis identifies the highest and lowest dates, the number of blank/null fields, as well as any invalid date values.
● For consumer packaging data, analysis can detect issues such as bar codes with correct/incorrect numbers of digits.
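
As a rough illustration of these profiling metrics (column names and values are hypothetical; IDQ produces
equivalent statistics without custom code), the same measures could be computed in Python with pandas:

import pandas as pd

# Hypothetical sample extracted from a source system.
df = pd.DataFrame({
    "country_code": ["US", "US", "DE", "de", None],
    "order_amount": [120.50, -3.00, 0.0, 87.25, 15.00],
    "order_date": pd.to_datetime(["2007-01-05", "2007-02-17", None, "2006-12-30", "2007-03-02"]),
})

# Character data: distinct values and their frequency distribution.
print(df["country_code"].value_counts(dropna=False))

# Numeric data: highest, lowest, average, total, and counts of positive/negative/zero/null values.
amounts = df["order_amount"]
print(amounts.max(), amounts.min(), amounts.mean(), amounts.sum())
print((amounts > 0).sum(), (amounts < 0).sum(), (amounts == 0).sum(), amounts.isna().sum())

# Date data: highest and lowest dates and the number of blank/null values.
print(df["order_date"].max(), df["order_date"].min(), df["order_date"].isna().sum())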

The figure below shows sample IDQ report output.

3. Define Business Rules

The key objectives of this step are to identify issues in the areas of completeness, conformity, and consistency, to prioritize data
quality issues, and to define customized data quality rules. These objectives involve:

● Discussions of data quality analyses with business users to define completeness, conformity, and consistency rules for each
data element.
● Tuning and re-running the analysis plans with these business rules.

For each data set, a set of base rules must be established to test the conformity of the attributes' data values against basic
rule definitions. For example, if an attribute has a date type, then that attribute should only have date information stored. At a
minimum, all the necessary fields must be tested against the base rule sets. The following figure illustrates business rule evaluation.
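
A simplified sketch of base-rule testing follows; the attribute names, formats, and sample rows are
hypothetical, and in practice these rules would be configured as IDQ plans rather than hand-coded. Each
attribute is paired with a validation function and scored on the percentage of values that conform:

from datetime import datetime

def is_date(value, fmt="%Y-%m-%d"):
    try:
        datetime.strptime(value, fmt)
        return True
    except (TypeError, ValueError):
        return False

# Base rules: each attribute is tested against the rule implied by its declared type.
base_rules = {
    "ship_date": is_date,                                # date attributes must hold valid dates
    "quantity": lambda v: str(v).lstrip("-").isdigit(),  # integer attributes must hold whole numbers
}

rows = [
    {"ship_date": "2007-05-14", "quantity": "3"},
    {"ship_date": "14/05/2007", "quantity": "three"},
]

for attribute, rule in base_rules.items():
    passed = sum(1 for row in rows if rule(row[attribute]))
    print(f"{attribute}: {passed / len(rows):.0%} conform")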



Best Practices
None

Sample Deliverables
None

Last updated: 15-Feb-07 18:17



Phase 2: Analyze
Subtask 2.8.2 Report Analysis Results
to the Business

Description

The steps outlined in subtask 2.8.1 lead to the preparation of the Data Quality Audit Report, which
is delivered in this subtask. The Data Quality Audit report highlights the state of the data analyzed in
an easy-to-read, high-impact fashion.

The report can include the following types of file:

● Data quality scorecards - charts and graphs of data quality that can be pre-set to present
and compare data quality across key fields and data types
● Drill-down reports that permit reviewers to access the raw data underlying the summary
information
● Exception files

In this subtask, potential risk areas are identified and alternative solutions are evaluated. The Data
Quality Audit concludes with a presentation of these findings to the business and project
stakeholders and agreement on recommended next steps.

Prerequisites
None

Roles

Business Analyst (Secondary)

Business Project Manager (Secondary)

Data Quality Developer (Primary)

Data Steward/Data Quality Steward (Primary)

Technical Project Manager (Secondary)

Considerations

There are two key activities in this subtask: delivering the report, and framing a discussion for the
business about what actions to take based on the report conclusions.



Delivering the report involves formatting the analysis results from subtask 2.8.1 into a framework that
can be easily understood by the business. This includes building data quality scorecards, preparing
the data sources for the scorecards, and possibly creating audit summary documentation such as a
Microsoft Word document or a PowerPoint slideshow. The data quality issues can then be evaluated,
recommendations made, and project targets set.

Creating Scorecards

Informatica Data Quality (IDQ) is used to identify, measure, and categorize data quality issues
according to business criteria. IDQ reports information in several formats, including database tables,
CSV files, HTML files, and graphically. (Graphical displays, or scorecards, are linked to the
underlying data so that viewers can move from high-level to low-level views of the data.)

Part of the report creation process is the agreement of pass/fail scores for the data and the
assignment of weights to the data performance for different criteria. For example, the business may
state that at least 98 percent of values in address data fields must be accurate and weight the
ZIP+4 field as most important. Once the scorecards are defined, the data quality plans can be re-used
to track data quality progress over time and throughout the organization.
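
A simplified sketch of how such thresholds and weights might combine into a single scorecard figure
(the field scores, weights, and threshold below are hypothetical):

# Measured accuracy per address field (fraction of values judged accurate) and relative weights.
field_scores = {"street": 0.995, "city": 0.990, "state": 0.999, "zip_plus_four": 0.962}
field_weights = {"street": 0.2, "city": 0.2, "state": 0.1, "zip_plus_four": 0.5}  # ZIP+4 weighted highest
pass_threshold = 0.98  # e.g., "at least 98 percent of address values must be accurate"

weighted_score = sum(field_scores[f] * field_weights[f] for f in field_scores)
status = "PASS" if weighted_score >= pass_threshold else "FAIL"
print(f"Weighted address accuracy: {weighted_score:.3f} ({status})")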

The data quality scorecard can also be presented through a dashboard framework, which adds value
to the scorecard by grouping graphical information in business-intelligent ways.



As can be seen in the above figure, a dashboard can present measurements in a “traffic light”
manner (color-coded green/amber/red) to provide quick visual cues as to the quality of and actions
needed for the data.

Reviewing the Audit Results and Deciding the Next Step

By integrating various data analysis results within the dashboard application, the stakeholders can
review the current state of data quality and decide on appropriate actions within the project.

The set of stakeholders should include one or more members of the data stewardship committee, the
project manager, data experts, a Data Quality Developer, and representatives of the business.
Together, these stakeholders can review the data quality audit conclusions and conduct a cost-
benefit comparison of the desired data quality levels versus the impact on the project of the steps to
achieve these levels.

In some projects — for example, when the data must comply with government or industry regulations
— the data quality levels are non-negotiable, and the project stakeholders must work to those
regulations. In other cases, the business objectives may be achieved by data quality levels that are
less than 100 percent. In all cases, the project data must attain a minimum quality level in order to
pass through the project processes and be accepted by the target data source.



For these reasons, it is necessary to discuss data quality as early as possible in project planning.

Ongoing Audits and Data Quality Monitoring

Conducting a data quality audit one time provides insight into the then-current state of the data, but
does not reflect how project activity can change data quality over time. Tracking levels of data quality
over time, as part of an ongoing monitoring process, provides a historical view of when and how
much the quality of data has improved. The following figure illustrates how ongoing audits can chart
progress in data quality.

As part of a statistical control process, data quality levels can be tracked on a periodic basis and
charted to show if the measured levels of data quality reach and remain in an acceptable range, or
whether some event has caused the measured level to fall below what is acceptable. Statistical
control charts can help in notifying data stewards when an exception event impacts data quality and
can help to identify the offending information process. Historical statistical tracking and charting
capabilities are available within a data quality scorecard, and scorecards can be easily updated;
once configured, the scorecard typically does not need to be re-created for successive data quality
analyses.
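
For illustration, a basic three-sigma control check over periodic data quality scores might look like the
following sketch (the scores are hypothetical, and scorecard tooling would normally provide this charting):

from statistics import mean, stdev

# Baseline audits used to establish control limits, then subsequent audit scores to monitor.
baseline = [96.1, 96.4, 95.8, 96.0, 96.3]
monitored = [96.2, 92.2, 96.0]

center = mean(baseline)
sigma = stdev(baseline)
lower, upper = center - 3 * sigma, center + 3 * sigma

for period, score in enumerate(monitored, start=len(baseline) + 1):
    status = "in control" if lower <= score <= upper else "OUT OF CONTROL - notify data steward"
    print(f"period {period}: {score:.1f} ({status})")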

Best Practices
None

Sample Deliverables
None



Last updated: 15-Feb-07 17:29



Phase 3: Architect

3 Architect

● 3.1 Develop Solution


Architecture
❍ 3.1.1 Define Technical Requirements
❍ 3.1.2 Develop Architecture Logical View
❍ 3.1.3 Develop Configuration Recommendations
❍ 3.1.4 Develop Architecture Physical View
❍ 3.1.5 Estimate Volume Requirements
● 3.2 Design Development Architecture
❍ 3.2.1 Develop Quality Assurance Strategy
❍ 3.2.2 Define Development Environments
❍ 3.2.3 Develop Change Control Procedures
❍ 3.2.4 Determine Metadata Strategy
❍ 3.2.5 Develop Change Management Process
● 3.3 Implement Technical Architecture
❍ 3.3.1 Procure Hardware and Software
❍ 3.3.2 Install/Configure Software



Phase 3: Architect

Description

During this Phase of the project, the technical requirements are defined, the project
infrastructure is developed, and the development standards and strategies are defined.
The conceptual architecture is designed, which forms the basis for determining capacity
requirements and configuration recommendations. The environments and strategies for
the entire development process are defined. These strategies include development
standards, quality assurance, change control processes, and metadata strategy. It is
critical that the architecture decisions made during this phase are guided by an
understanding of the business needs. As Data Integration architectures become more
real-time and mission critical, good architecture decisions will ensure the success of the
overall effort. This phase should culminate in the implementation of the hardware and
software that will allow the Design Phase and the Build Phase of the project to begin.

Proper execution during the Architect Phase is especially important for Data
Migration projects. In the Architect Phase a series of key tasks are undertaken to
accelerate development, ensure consistency, and expedite completion of the data
migration.

Prerequisites
None

Roles

Business Analyst (Primary)

Business Project Manager (Primary)

Data Architect (Primary)

Data Integration Developer (Secondary)

Data Quality Developer (Primary)



Data Warehouse Administrator (Review Only)

Database Administrator (DBA) (Primary)

Metadata Manager (Primary)

Presentation Layer Developer (Secondary)

Project Sponsor (Approve)

Quality Assurance Manager (Primary)

Repository Administrator (Primary)

Security Manager (Secondary)

System Administrator (Primary)

Technical Architect (Primary)

Technical Project Manager (Primary)

Considerations

None

Best Practices
None

Sample Deliverables
None

Last updated: 25-May-08 16:13



Phase 3: Architect
Task 3.1 Develop Solution
Architecture

Description

The scope of solution architecture in a data integration or an enterprise data
warehouse project is quite broad and involves careful consideration of many disparate
factors.

Data integration solutions have grown in scope as well as in the volume of data they
process. This necessitates careful consideration of architectural issues across a
number of architectural domains. A well-designed solution architecture is crucial to
any data integration effort, and can be the most influential, visible part of the whole
effort. A robust solution architecture not only meets the business requirements but
also exceeds the expectations of the business community. Given the continuous state
of change that has become a trademark of information technology, it is prudent to have
an architecture that is not only easy to implement and manage, but also flexible enough
to accommodate future changes, easily extensible, reliable (with minimal or no
downtime), and highly scalable.

This task approaches the development of the architecture as a series of stepwise
refinements:

● First, reviewing the requirements.
● Then developing a logical model of the architecture for consideration.
● Refining the logical model into a physical model, and
● Validating the physical model.

In addition, because the architecture must consider anticipated data volumes, it is
necessary to develop a thorough set of estimates. The Technical Architect is
responsible for ensuring that the proposed architecture can support the estimated
volumes.

Prerequisites
None



Roles

Business Analyst (Primary)

Data Architect (Primary)

Data Quality Developer (Primary)

Data Warehouse Administrator (Review Only)

Database Administrator (DBA) (Primary)

System Administrator (Primary)

Technical Architect (Primary)

Technical Project Manager (Review Only)

Considerations

A holistic view of architecture encompasses three realms: the development
architecture, the execution architecture, and the operations architecture. These three
areas of concern provide a framework for considering how any system is built, how it
runs, and how it is operated. Although there may be some argument about whether an
integration solution is a "system," it is clear that it has all the elements of a software
system, including databases, executable programs, end users, maintenance releases,
and so forth. Of course, all of these elements must be considered in the design and
development of the enterprise solution.

Each of these architectural areas involves specific responsibilities and concerns:

● Development Architecture, which incorporates technology standards, tools,
and the techniques and services required in the development of the enterprise
solution. This may include many of the services described in the execution
architecture, but also involves services that are unique to development
environments such as security mechanisms for controlling access to
development objects, change control tools and procedures, and migration
capabilities.
● Execution Architecture, which includes the entire supporting infrastructure
required to run an application or set of applications. In the context of an
enterprise-wide integration solution, this includes client and server hardware,
operating systems, database management systems, network infrastructure,
and any other technology services employed in the runtime delivery of the
solution.
● Operations Architecture, which is a unified collection of technology services,
tools, standards, and controls required to keep a business application
production or development environment operating at the designed service
level. This differs from the execution architecture in that its primary users are
system administrators and production support personnel.

The specific activities that comprise this task focus primarily on the Execution
Architecture. 3.2 Design Development Architecture focuses on the development
architecture and the Operate Phase discusses the important aspects of operating a
data integration solution. Refer to the Operate Phase for more information on the
operations architecture.

Best Practices
None

Sample Deliverables
None

Last updated: 01-Feb-07 18:44



Phase 3: Architect
Subtask 3.1.1 Define
Technical Requirements

Description

In anticipation of architectural design and subsequent detailed technical design steps,
the business requirements and functional requirements must be reviewed and a high-
level specification of the technical requirements developed. The technical
requirements will drive these design steps by clarifying what technologies will be
employed and, at a high level, how they will satisfy the business and functional
requirements.

Prerequisites
None

Roles

Business Analyst (Primary)

Data Quality Developer (Secondary)

Technical Architect (Primary)

Technical Project Manager (Review Only)

Considerations

The technical requirements should address, at least at a conceptual level,
implementation specifications based on the findings to date (regarding data rules,
source analysis, strategic decisions, etc.) such as:

● Technical definitions of business rule derivations (including levels of
summarization).
● Definitions of source and target schema – at least at logical/conceptual level.
● Data acquisition and data flow requirements.



● Data quality requirements (at least at a high level).
● Data consolidation/integration requirements (at least at a high level).
● Report delivery and access specifications.
● Performance requirements (both “back-end” and presentation performance).
● Security requirements and structures (access, domain, administration, etc.).
● Connectivity specifications and constraints (especially limits of access to
operational systems).
● Specific technologies required (if requirements clearly indicate such).

For Data Migration projects the technical requirements are fairly consistent and known.
They will require processes to:

● Populate the reference data structures
● Acquire the data from source systems
● Convert the data to target definitions
● Load the data to the target application
● Meet the necessary audit requirements

The details of these processes will be covered in a data migration strategy.

Best Practices
None

Sample Deliverables
None

Last updated: 01-Feb-07 18:44



Phase 3: Architect
Subtask 3.1.2 Develop
Architecture Logical View

Description

Much like a logical data model, a logical view of the architecture provides a high-level
depiction of the various entities and relationships as an architectural blueprint of the
entire data integration solution. The logical architecture helps people visualize the
solution and shows how all the components work together. The major purposes of the
logical view are:

● To describe how the various solution elements work together (i.e., databases,
ETL, reporting, and metadata).
● To communicate the conceptual architecture to project participants to validate
the architecture.
● To serve as a blueprint for developing the more detailed physical view.

The logical diagram provides a road map of the enterprise initiative and an opportunity
for the architects and project planners to define and describe, in some detail, the
individual components.

The logical view should show relationships in the data flow and among the functional
components, indicating, for example, how local repositories relate to the global
repository (if applicable).

The logical view must take into consideration all of the source systems required to
support the solution, the repositories that will contain the runtime metadata, and all
known data marts and reports. This is a “living” architectural diagram, to be refined as
you implement or grow the solution.

The logical view does not contain detailed physical information such as server names,
IP addresses, hardware specifications, etc. These details will be fleshed out in the
development of the physical view.

Prerequisites
None



Roles

Data Architect (Secondary)

Technical Architect (Primary)

Considerations

The logical architecture should address reliability, availability, scalability, performance,
usability, extensibility, interoperability, security, and QA. It should incorporate all of the
high-level components of the information architecture, including but not limited to:

● All relevant source systems
● ETL repositories; BI repositories
● Metadata Management, Metadata Reporting
● Real-time Messaging, Web Services, XML Server
● Data Quality tools, Data Modeling tools
● PowerCenter Servers, Repository Server
● Target data structures, e.g., data warehouse, data marts, ODS
● Web Application Servers
● ROLAP engines, Portals, MOLAP cubes, Data Mining

For Data Migration projects, a key component is the documentation of the various utility database schemas. These will likely include legacy staging, pre-load staging, reference data, and audit database schemas. Database schemas for Informatica Data Quality and Informatica Data Explorer will also be included.

Best Practices

Designing Data Integration Architectures

PowerCenter Enterprise Grid Option

Sample Deliverables
None

Last updated: 06-Dec-07 15:36

Phase 3: Architect
Subtask 3.1.3 Develop Configuration Recommendations

Description

Using the Architecture Logical View as a guide, and considering any corporate
standards or preferences, develop a set of recommendations for how to technically
configure the analytic solution. These recommendations will serve as the basis for
discussion with the appropriate parties, including project management, the Project
Sponsor, system administrators, and potentially the user community. At this point, the
recommendations of the Data Architect and Technical Architect should be very well
formed, based on their understanding of the business requirements and the current and
planned technical standards.

The recommendations will be formally documented in the next subtask 3.1.4 Develop
Architecture Physical View but are not documented at this stage since they are still
considered open to debate. Discussions with interested constituents should focus on
the recommended architecture, not on protracted debate over the business
requirements.

It is critical that the scope of the project be set - and agreed upon - prior to developing
and documenting the technical configuration recommendations. Changes in the
requirements at this point can have a definite impact on the project delivery date.

(Refer back to the Manage Phase for a discussion of scope setting and control issues).

Prerequisites
None

Roles

Data Architect (Secondary)

Technical Architect (Primary)

Considerations

The configuration recommendations must balance a number of factors in order to be adopted:

● Technical solution - The recommended configuration must, of course, solve the technical challenges posed by the analytic solution. In particular, it must consider data capacity and volume throughput requirements.
● Conformity - The recommended solution should work well within the context
of the organization's existing infrastructure and conform to the organization's
future infrastructure direction.
● Cost - The incremental cost of the solution must fit within whatever budgetary
parameters have been established by project management. In many cases,
incremental costs can be reduced by leveraging existing available hardware
resources and leveraging PowerCenter’s server grid technology.

The primary areas to consider in developing the recommendations include, but are not
necessarily limited to:

Server Hardware and Operating System - Many IT organizations mandate – or strongly encourage – the choice of server hardware and operating system to fit into the corporate standards. Depending on the size and throughput requirements, the server may be UNIX, Linux, or NT-based. The Technical Architect should also recommend either a 32-bit or a 64-bit architecture based on the cost/benefit of each. It is advisable to consider the advantages of a 64-bit OS and PowerCenter, as this is likely to provide increased addressable memory, enable faster processing speeds, and better support the handling of large numeric values in the data. It is also important to ensure the hardware is suited to OLAP-style applications, which tend to be computationally intensive, as compared to OLTP systems, which benefit more from features such as hyper-threading; this determination is important for ensuring adequate performance. Also make sure the RAM size is determined in accordance with the systems to be built. In many cases, RAM disks can be used when increased RAM availability is an issue, which is especially important when the PowerCenter application creates very large cache files.

Consult the Platform Availability Matrix at my.informatica.com for specifics on
the applications under consideration for the project. Bear in mind that not all
applications have the same level of availability on every platform. This is also
true for database connectivity (see Database Management System below).

Disk Storage Systems – The architecture of the disk storage
system should also be included in the architecture configuration.
Some organizations leverage a Storage Area Network (SAN) to store
all data, while other organizations opt for local storage. In any case,
careful consideration should be given to disk array and striping
configuration in order to optimize performance for the related
systems (i.e., database, ETL, and BI).

Database Management System – Similar to organizational
standards that mandate hardware or operating system choices, many
organizations also mandate the choice of a database management
system. In instances where a choice of the DBMS is available, it is
important to remember that PowerCenter and Data Analyzer support
a vast array of DBMSs on a variety of platforms (refer to the
PowerCenter Installation Guide and Data Analyzer Installation Guide
for specifics). A DBMS that is supported by all components in the
technical infrastructure, such as OS, ETL, and BI, to name a few,
should be chosen.

PowerCenter Server – The PowerCenter server should, of course,
be considered when developing the architecture recommendations.
Considerations should include network traffic (between the repository
server, PowerCenter server, database server, and client machines),
the location of the PowerCenter repository database, and the physical
storage that will contain the PowerCenter executables as well as
source, target, and cache files.

Data Analyzer or other Business Intelligence Data Integration
Platforms – Whether using Data Analyzer or a different BI tool for
analytics, the goal is to develop configuration recommendations that
result in a high-performance application passing data efficiently
between source system, ETL server, database tables, and BI end-
user reports. For Web-based analytic tools such as Data Analyzer,
one should also consider user requirements that may dictate that a
secure Web-server infrastructure be utilized to provide reporting
access outside of the corporate firewall to enable features such as
reporting access from a mobile device. Typically, a secure Web-
server infrastructure that utilizes a demilitarized zone (DMZ) will
result in a different technical architecture configuration than an
infrastructure that simply supports reporting from within the
corporate firewall.

TIP
Use the Architecture Logical View as a starting point for discussing the
technical configuration recommendations. As drafts of the physical view are
developed, they will be helpful for explaining the planned architecture.

Best Practices

PowerCenter Enterprise Grid Option

Sample Deliverables
None

Last updated: 06-Dec-07 14:55

Phase 3: Architect
Subtask 3.1.4 Develop Architecture Physical View

Description

The physical view of the architecture is a refinement of the logical view, but takes into
account the actual hardware and software resources necessary to build the
architecture. Much like a physical data model, this view of the architecture depicts
physical entities (i.e., servers, workstations, and networks) and their attributes (i.e.,
hardware model, operating system, server name, IP address). In addition, each entity
should show the elements of the logical model supported by it. For example, a UNIX
server may be serving as a PowerCenter server engine, Data Analyzer server engine,
and may also be running Oracle to store the associated PowerCenter repositories.

The physical view is the summarized planning document for the architecture
implementation. The physical view is unlikely to explicitly show all of the technical
information necessary to configure the system, but should provide enough information
for domain experts to proceed with their specific responsibilities. In essence, this view
is a common blueprint that the system's general contractor (i.e. the Technical
Architect) can use to communicate to each of the subcontractors (i.e. UNIX
Administrator, Mainframe Administrator, Network Administrator, Application Server
Administrator, DBAs, etc).

Prerequisites
None

Roles

Data Warehouse Administrator (Approve)

Database Administrator (DBA) (Primary)

System Administrator (Primary)

Technical Architect (Review Only)

Considerations

None

Best Practices

PowerCenter Enterprise Grid Option

Sample Deliverables
None

Last updated: 06-Dec-07 15:35

Phase 3: Architect
Subtask 3.1.5 Estimate Volume Requirements

Description

Estimating the data volume and physical storage requirements of a data integration
project is a critical step in the architecture planning process. This subtask represents a
starting point for analyzing data volumes, but does not include a definitive discussion of
capacity planning. Due to the varying complexity and data volumes associated with
data integration solutions, it is crucial to review each technical area of the proposed
solution with the appropriate experts (i.e., DBAs, Network Administrators, Server
System Administrators, etc.).

Prerequisites
None

Roles

Data Architect (Primary)

Data Quality Developer (Primary)

Database Administrator (DBA) (Primary)

System Administrator (Primary)

Technical Architect (Secondary)

Considerations

Capacity planning and volume estimation should focus on several key areas that are
likely to become system bottlenecks or to strain system capacity, specifically:

Disk Space Considerations

Database size is the most likely factor to affect disk space usage in the data integration
solution. As the typical data integration solution does not alter the source systems,
there is usually no need to consider their size. However, the target databases, and any
ODS or staging areas demand disk storage over and above the existing operational
systems. A Database Sizing Model workbook is one effective means for estimating
these sizes.

During the Architect Phase only a rough volume estimate is required. After the Design
Phase is completed, the database sizing model should be updated to reflect the data
model and any changes to the known business requirements. The basic techniques for
database sizing are well understood by experienced DBAs. Estimates of database size
must factor in:

● Determine the upper bound of the width of each table row. This can obviously be affected by certain DBMS data types, so be sure to take into account each physical byte consumed. The documentation for the DBMS should specify storage requirements for all supported data types. After the physical data model has been developed, the row width can be calculated.
● Estimate the number of rows in each table. Depending on the type of table, this number may be vastly different for a "young" warehouse than for one at "maturity". For example, if the database is designed to store three years of historical sales data, and there is an average daily volume of 5,000 sales, the table will contain 150,000 rows after the first month, but will have swelled to nearly 5.5 million rows at full maturity. Beyond the third year, there should be a process in place for archiving data off the table, thus limiting the size to 5.5 million rows.
● Indexing can add a significant disk usage penalty to a database. Depending on
the overall size of the indexed table, and the size of the keys used in the index,
an index may require 30 to 80 percent additional disk space. Again, the DBMS
documentation should contain specifics about calculating index size.
● Partitioning the physical target can greatly increase the efficiency and
organization of the load process. However, it does increase the number of
physical units to be maintained. Be sure to discuss with the DBAs the most
intelligent structuring of the database partitions.

Using these basic factors, it is possible to construct a database sizing model (typically
in spreadsheet form) that lists all database tables and indexes, their row widths, and
estimated number of rows. Once the row number estimates have been validated, the
estimating model should produce a fairly accurate estimate of database size. Note that
the model will provide an estimate of raw data size. Be sure to consult the DBAs to
understand how to factor in the physical storage characteristics relative to the DBMS
being used, such as block parameter sizes.
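
The arithmetic behind such a sizing model is simple enough to prototype outside a spreadsheet. The following Python sketch is illustrative only; the table names, row widths, daily volumes, retention periods, and index overhead factors are assumptions to be replaced with figures agreed with the DBAs, and block or extent overhead for the specific DBMS still needs to be added.

# Minimal database sizing sketch; all table definitions and factors below are
# illustrative assumptions, not Velocity-prescribed values.

TABLES = [
    # name, row width (bytes), rows per day, retention (days), index overhead
    ("sales_fact",   48,  5_000, 3 * 365, 0.50),
    ("customer_dim", 512,   200, 3 * 365, 0.30),
    ("product_dim",  256,    50, 3 * 365, 0.30),
]

def table_size(row_width, rows_per_day, retention_days, index_overhead):
    """Raw data size plus an estimated index penalty (30 to 80 percent is typical)."""
    rows_at_maturity = rows_per_day * retention_days
    raw_bytes = rows_at_maturity * row_width
    return raw_bytes * (1 + index_overhead), rows_at_maturity

total_bytes = 0
for name, width, per_day, retention, idx in TABLES:
    size, rows = table_size(width, per_day, retention, idx)
    total_bytes += size
    print(f"{name:15s} ~{rows:>12,} rows  ~{size / 2**30:6.2f} GB incl. index")

# Raw estimate only: block parameters, free space, staging and ODS areas must
# still be factored in with the DBA for the DBMS being used.
print(f"{'TOTAL':15s} {'':>12}       ~{total_bytes / 2**30:6.2f} GB")

Using the 5,000-sales-per-day example above, the fact table works out to roughly 5.5 million rows after three years, which is where most of the estimated space is consumed.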

The estimating process also provides a good opportunity to validate the star schema
data model. For example, fact tables should contain only composite keys and discrete
facts. If a fact table is wider than 32-64 bytes, it may be wise to re-evaluate what is
being stored. The width of the fact table is very important, since a warehouse can
contain millions, tens of millions, or even hundreds of millions of fact records. The
dimension tables, on the other hand, will typically be wider than the fact tables, and
may contain redundant data (e.g., names, addresses, etc.), but will have far fewer
rows. As a result, the size of the dimension tables is rarely a major contributor to the
overall target database size.

Since there is the possibility of unstructured data being sourced, transformed and
stored, it is important to factor in any conversion in data size, either up or down, from
source to target.

It is important to remember that Business Intelligence (BI) tools may consume
significant storage space, depending on the extent to which they pre-aggregate data
and how that data is stored. Because this may be an important factor in the overall disk
space requirements, be sure to consider storage techniques carefully during the BI
platform selection process.

TIP
If you have determined that the star schema is the right model to use for the data
integration solution, be sure that the DBAs who are responsible for the target
data model understand its advantages. A DBA who is unfamiliar with the star
schema may seek to normalize the data model in order to save space. Firmly
resist this tendency to normalize.

Data Processing Volume

Data processing volume refers to the amount of data being processed by a given
PowerCenter server within a specified timeframe. In most data integration
implementations, a load window is allotted representing clock time. This window is
determined by the availability of the source systems for extracts and the end-user
requirements for access to the target data sources. Maintenance jobs that run on a
regular basis may further limit the length of the load window.

As a result of the limited load window, the PowerCenter server engine must be able to
perform its operations on all data in a given time period. The ability to do so is
constrained by three factors:

● Time it takes to extract the data (potentially including network transfer time, if
the data is on a remote server)
● Transformation time within PowerCenter
● Load time (which is also potentially impacted by network latency)

The biggest factors affecting extract and load times are, however, related to database
tuning. Refer to Performance Tuning Databases (Oracle) for suggestions on improving
database performance.

The throughput of the PowerCenter Server engine is typically the last option for improved performance. Refer to the Velocity Best Practice Tuning Sessions for Better Performance, which includes suggestions on tuning mappings and sessions to optimize performance. From an estimating standpoint, however, it is impossible to accurately project the throughput (in terms of rows per second) of a mapping due to the high variability in mapping complexity, the quantity and complexity of transformations, and the nature of the data being transformed. It is more accurate to estimate in terms of clock time to ensure processing completes within the given load window.
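
As a rough feasibility check, the estimated clock time of each stage can be compared against the available load window, as in the sketch below. The row counts and rows-per-second figures are placeholder assumptions; real numbers should come from test runs, and because the engine pipelines extract, transformation, and load, summing the stages gives a conservative upper bound.

# Rough load-window feasibility check; all throughput figures are placeholder
# assumptions to be replaced with measurements from test runs.

LOAD_WINDOW_HOURS = 6.0   # assumed clock time between source availability
                          # and the end-user access deadline

STAGES = {                # stage -> (rows to process, assumed rows per second)
    "extract":   (5_500_000, 4_000),
    "transform": (5_500_000, 3_000),
    "load":      (5_500_000, 2_500),
}

total_hours = 0.0
for stage, (rows, rows_per_second) in STAGES.items():
    hours = rows / rows_per_second / 3600
    total_hours += hours
    print(f"{stage:10s} ~{hours:5.2f} h")

print(f"{'total':10s} ~{total_hours:5.2f} h of a {LOAD_WINDOW_HOURS:.1f} h window")
if total_hours > LOAD_WINDOW_HOURS:
    print("Estimated processing exceeds the load window; revisit tuning, "
          "partitioning, or the architecture before committing to the schedule.")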

If the project includes steps dedicated to improving data quality (for example, as
described in Task 4.6) then a related performance factor is the time taken to perform
data matching (that is, record de-duplication) operations. Depending on the size of the
dataset concerned, data matching operations in Informatica Data Quality can take
several hours of processor time to complete. Data matching processes can be tuned
and executed on remote machines on the network to significantly reduce record
processing time. Refer to the Best Practice Effective Data Matching Techniques for
more information.

Network Throughput

Once the physical data row sizes and volumes have been estimated, it is possible to
estimate the required network capacity. It is important to remember the network
overhead associated with packet headers, as this can have an effect on the total
volume of data being transmitted. The Technical Architect should work closely with a
Network Administrator to examine network capacity between the different components
involved in the solution.
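
For a first-pass check of the path between source, ETL server, and target, a back-of-the-envelope calculation such as the following can help frame the discussion with the Network Administrator. Every input below (row width, row count, overhead allowance, link speed, and utilization share) is an assumption for illustration only.

# Back-of-the-envelope network transfer estimate; all inputs are assumptions.

ROW_WIDTH_BYTES   = 48          # physical row width from the sizing model
ROWS_PER_LOAD     = 5_500_000   # rows moved across the network per load
PROTOCOL_OVERHEAD = 1.10        # ~10% allowance for packet headers and protocol chatter
LINK_MBPS         = 100         # nominal link speed in megabits per second
LINK_UTILIZATION  = 0.5         # share of the link the load can realistically claim

bits_on_wire = ROW_WIDTH_BYTES * ROWS_PER_LOAD * 8 * PROTOCOL_OVERHEAD
seconds = bits_on_wire / (LINK_MBPS * 1_000_000 * LINK_UTILIZATION)
print(f"~{bits_on_wire / 8 / 2**20:,.0f} MB on the wire, "
      f"~{seconds / 60:.1f} minutes at {LINK_MBPS} Mbps "
      f"({LINK_UTILIZATION:.0%} utilization)")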

The initial estimate is likely to be rough, but should provide a sense of whether the
existing capacity is sufficient and whether the solution should be architected differently
(i.e., move source or target data prior to session execution, re-locate server engine(s),
etc.). The Network Administrator can thoroughly analyze network throughput during
system and/or performance testing, and apply the appropriate tuning techniques. It is
important to involve the network specialists early in the Architect Phase so that they
are not surprised by additional network requirements when the system goes into
production.

TIP
Informatica generally recommends having either the source or target database
co-located with the PowerCenter Server engine because this can significantly
reduce network traffic. If such co-location is not possible, it may be advisable to
FTP data from a remote source machine to the PowerCenter Server as this is a
very efficient way of transporting the data across the network.

Best Practices
None

Sample Deliverables
None

Last updated: 15-Feb-07 18:19

Phase 3: Architect
Task 3.2 Design Development Architecture

Description

The Development Architecture is the collection of technology standards, tools,
techniques, and services required to develop a solution. This task involves developing
a testing approach, defining the development environments, and determining the
metadata strategy. The benefits of defining the development architecture are achieved
later in the project, and include good communication and change controls as well as
controllable migration procedures. Ignoring proper controls is likely to lead to issues
later on in the project.

Although the various subtasks that compose this task are described here in linear fashion, all of these subtasks relate to the others, so it is important to approach the overall body of work in this task as a whole and to consider the development architecture in its entirety.

Prerequisites
None

Roles

Business Project Manager (Primary)

Data Architect (Secondary)

Data Integration Developer (Secondary)

Database Administrator (DBA) (Primary)

Metadata Manager (Primary)

Presentation Layer Developer (Secondary)

Project Sponsor (Review Only)

Quality Assurance Manager (Primary)

Repository Administrator (Primary)

Security Manager (Secondary)

System Administrator (Primary)

Technical Architect (Primary)

Technical Project Manager (Primary)

Considerations

The Development Architecture should be designed prior to the actual start of
development because many of the decisions made at the beginning of the project may
have unforeseen implications once the development team has reached its full size. The
design of the Development Architecture must consider numerous factors including the
development environment(s), naming standards, developer security, change control
procedures, and more.

The scope of a typical PowerCenter implementation, possibly covering more than one
project, is much broader than a departmentally-scoped solution. It is important to
consider this statement fully, because it has implications for the planned deployment of
a solution, as well as the requisite planning associated with the development
environment. The main difference is that a departmental data mart type project can be
created with only two or three developers in a very short time period. By contrast, a full
integration solution involving the creation of an ICC (Integration Competency Center) or
an analytic solution that approaches enterprise scale requires more of a "big team"
approach. This is because many more organizational groups are involved, adherence
to standards is much more important, and testing must be more rigorous, since the
results will be visible to a larger audience.

The following paragraphs outline some of the key differences between a departmental
development effort and an enterprise effort:

With a small development team, the environment may be simplistic:

● Communication between developers is easy; it may literally consist of shouting
over a cubicle partition.
● Only one or two repository folders may be necessary, since there is little risk of
the developers "stepping on" each other's work.
● Naming standards are not rigidly enforced.
● Migration procedures are loose; development objects are moved into
production without undue emphasis on impact analysis and change control
procedures.
● Developer security is ignored; typically, all developers use the same, often highly privileged, user IDs.

However, as the development team grows and the project becomes more complex, this
simplified environment leads to serious development issues:

● Developers accustomed to informal communication may not thoroughly inform
the entire development team of important changes to shared objects.
● Repository folders originally named to correspond to individual developers will
not adequately support subject area- or release-based development groups.
● Developers maintaining others' mappings are likely to spend unnecessary time
and effort trying to decipher unfamiliar names.
● Failure to understand the dependencies of shared objects leads to unknown
impacts on the dependent objects. The lack of rigor in testing and migrating
objects into production leads to runtime bugs and errors in the warehouse
loading process.
● Sharing a single developer ID among multiple developers makes it impossible
to determine which developer locked a development object, or who made the
last change to an object. More importantly, failure to define secured
development groups allows all developers to access all folders, leading to the
possibility of untested changes being made in test environments.

These factors represent only a subset of the issues that may occur when the
development architecture is haphazardly constructed, or "organically" grown. As is the
case with the execution environment, a departmental data mart development effort can
"get away with" minimal architectural planning. But any serious effort to develop an
enterprise-scale analytic solution must be based on well-planned architecture, including
both the development and execution environments.

In Data Migration projects, it is common to build out a set of reference data tables to support the effort. These often include tables to hold configuration details (valid values), cross-reference specifics, default values, data control structures, and table-driven parameters. These structures will be a key component in the development of re-usable objects.

Best Practices
None

Sample Deliverables
None

Last updated: 01-Feb-07 18:45

Phase 3: Architect
Subtask 3.2.1 Develop Quality Assurance Strategy

Description

Although actual testing starts with unit testing during the Build Phase, followed by the project's Test Phase, there is far more involved in producing a high-quality project. The QA Strategy includes the definition of key QA roles, key verification processes, and key QA assignments involved in detailing all of the validation procedures for the project.

Prerequisites
None

Roles

Quality Assurance Manager (Primary)

Security Manager (Secondary)

Test Manager (Primary)

Considerations

In determining what project steps will require verification, the QA Manager, or "owner" of the project's QA processes, should consider the business requirements and the project
methodology. Although it may take a “sales” effort to win over management to a QA
process that is highly involved throughout the project, the benefits can be proven
historically in the success rates of projects and their ongoing maintenance costs.
However, the trade-offs of cost vs. value will likely affect the scope of QA.

Potential areas of verification to be considered for QA processes include:

● Formal business requirements reviews with key business stakeholders and
sign-off
● Formal technical requirements reviews with IT stakeholders and sign-off
● Formal review of environments and architectures with key technical personnel
● Peer reviews of logic designs
● Peer walkthroughs of data integration logic (mappings, code, etc.)
● Unit Testing: definition of procedures, review of test plans, formal sign-off for
unit tests
● Gatekeeping for migration out of Development environment (into QA and/or
Production)
● Regression testing: definition of procedures, review of test plans, formal sign-off
● System Tests: review of Test Plans, formal acceptance process
● Defect Management: review of procedures, validation of resolution
● User Acceptance Test: review of Test Plans, formal acceptance process
● Documentation review
● Training materials review
● Review of Deployment Plan; sign-off for deployment completion

Best Practices
None

Sample Deliverables
None

Last updated: 01-Feb-07 18:45

Phase 3: Architect
Subtask 3.2.2 Define Development Environments

Description

Although the development environment was relatively simple in the early days of
computer system development when a mainframe-based development project typically
involved one or more isolated regions connected to one or more database instances,
distributed systems, such as federated data warehouses, involve much more complex
development environments, and many more "moving parts." The basic concept of
isolating developers from testers, and both from the production system, is still critical to
development success. However, relative to a centralized development effort, there are
many more technical issues, hardware platforms, database instances, and specialized
personnel to deal with.

The task of defining the development environment is, therefore, extremely important
and very difficult. Because of the wide variance in corporate technical environments,
standards, and objectives, there is no "optimal" development environment. Rather,
there are key areas of consideration and decisions that must be made with respect to
them.

After the development environment has been defined, it is important to document its
configuration, including (most importantly) the information the developers need to use
the environments. For example, developers need to understand what systems they are
logging into, what databases they are accessing, what repository (or repositories) they
are accessing, and where sources and targets reside. An important component of any
development environment is to configure it as close to the test and production
environments as possible given time and budget. This can significantly ease the
development and integration efforts downstream and will ultimately save time and cost
during the testing phases.

Prerequisites

3.1.1 Define Technical Requirements

Roles

Database Administrator (DBA) (Primary)

Repository Administrator (Primary)

System Administrator (Primary)

Technical Architect (Primary)

Technical Project Manager (Review Only)

Considerations

The development environment for any data integration solution must consider many of
the same issues as a "traditional" development project. The major differences are that
the development approach is "repository-centric" (as opposed to code-based), there
are multiple sources and targets (unlike a typical system development project, which
deals with a single database), and few (if any) hand-coded objects to build and
maintain. In addition, because of the repository-based development approach, the
development environment must consider all of the following key areas:

● Repository Configuration. This involves critical decisions, such as whether to use local repositories, a global repository, or both, as well as determining an overall metadata strategy (see 3.2.4 Determine Metadata Strategy).
● Folder structure. Within each repository, folders are used to group and
organize work units or report objects. To be effective, the folder structure must
consider the organization of the development team(s), as well as the change
control/migration approach.
● Developer security. Both PowerCenter and Data Analyzer have built-in
security features that allow an administrative user (i.e., the Repository
Administrator) to define the access rights of all other users to objects in the
repository. The organization of security groups should be carefully planned
and implemented prior to the start of development. As an additional option,
LDAP can be used to assist in simplifying the organization of users and
permissions.

Repository Configuration

Informatica's data integration platform, PowerCenter, provides capabilities for
integrating multiple heterogeneous sources and targets. The requirements of the
development team should dictate to what extent the PowerCenter capabilities are
exploited, if at all. In a simple data integration development effort, source data may be
extracted from a single database or set of flat files, and then transformed and loaded
into a single target database. More complex data integration development efforts
involve multiple source and target systems. Some of these may include mainframe
legacy systems as well as third-party ERP providers.

Most data integration solutions currently being developed involve data from multiple
sources, target multiple data marts, and include the participation of developers from
multiple areas within the corporate organization. In order to develop a cohesive analytic
solution, with shared concepts of the business entities, transformation rules, and end
results, a PowerCenter-based development environment is required.

There are basically three ways to configure an Informatica-based data integration solution, although variations on these three options are certainly possible, particularly with the addition of PowerExchange products (i.e., PowerExchange for SAP NetWeaver, PowerExchange for PeopleSoft Enterprise) and Data Analyzer for front-end reporting.
However, from a development environment standpoint, the following three
configurations serve as the basis for determining how to best configure the
environment for developers:

● Standalone PowerCenter. In this configuration, there is a single repository
that cannot be shared with any others within the enterprise. This type of
repository is referred to as a local repository and is typically used for small,
independent, departmental data marts. Many of the capabilities within
PowerCenter are available, including developer security, folder structures, and
shareable objects. The primary development restrictions are that the objects in
the repository can't be shared with other repositories, and this repository
cannot access objects in any other repositories. Multiple developers,
working on multiple projects, can still use this repository; folders can be
configured to restrict access to specified developers (or groups); and a
repository administrator with SuperUser authority can control production
objects. This means that there would be an instance of the repository for development, testing, and production. Some companies can manage co-locating development and testing in one repository by segregating code through folder strategies.
● PowerCenter Data Integration Hub with Networked Local Repositories.
This configuration combines a centralized, shared global repository with one
or more distributed local repositories. The strength of this solution is that
multiple development groups can work semi-autonomously, while sharing
common development objects. In the production environment, distributing the
server load across the PowerCenter server engines can leverage this same
configuration. This option can dramatically affect the definition of the
development environment.
● PowerCenter as a Data Integration Hub with a Data Analyzer Front-End to
the Reporting Warehouse. This configuration provides an end-to-end suite of
products that allow developers to build the entire data integration solution from
data loads to end-user reporting.

PowerCenter Data Integration Hub with Networked Local Repositories

In this advanced repository configuration, the Technical Architect must pay careful
attention to the sharing of development objects and the use of multiple repositories.
Again, there is no single "correct" solution, only general guidelines for consideration.

In most cases, the PowerCenter Global Repository becomes a development focal
point. Departmental developers wishing to access enterprise definitions of sources,
targets, and shareable objects connect to the Global Repository to do so. The layout of
this repository, and its contents, must be thoroughly planned and executed. The Global
Repository may include shareable folders containing:

● Source definitions. Because many source systems may be shared, it is
important to have a single "true" version of their schemas resident in the
Global Repository.
● Target definitions. Apply the same logic regarding source definitions.
● Shareable objects. Shared objects should be created and maintained in a
single place; the Global Repository is the place.

TIP
It is very important to house all globally-shared database schemas in the
Global Repository. Because most IT organizations prefer to maintain their
database schemas in a CASE/data modeling tool, the procedures for updating
the PowerCenter definitions of source/target schemas must include importing
these schemas from tools such as ERwin. It is far easier to develop these
procedures for a single (global) repository than for each of the (independent)
local repositories that may be using the schemas.

Of course, even if the overall development environment includes a PowerCenter Data
Integration Hub, there may still be non-shared sources, targets, and development
objects. In these cases, it is perfectly acceptable to house the definitions within a local
repository. If necessary, these objects may eventually be migrated into the shared
Global Repository. And, it may still make sense to do local development and unit
testing in a local repository - even for shared objects, since they shouldn't be shared
until they have been fully tested. During the Architect Phase and the Design Phase
the Technical Architect should work closely with Project Management and the
development lead(s) to determine the appropriate repository placement of development
objects in a PowerCenter-based environment. After the initial configuration is
determined, the Technical Architect can limit his/her involvement in this area.

For example, any data quality steps taken with Informatica Data Quality
(IDQ) applications (such as those implemented in 2.8 Perform Data Quality Audit or 5.3
Design and Build Data Quality Process) are performed using processes saved to a
discrete IDQ repository. These processes (called plans in IDQ parlance) can be added
to PowerCenter transfomations and subsequently saved with those transformations in
the PowerCenter repository. As indicated above, data quality plans can be designed
and tested within an IDQ repository before deployment in PowerCenter. Moreover,
depending on their purpose, plans may remain in an IDQ server repository, from which
they can be distributed as needed across the enterprise, for the life of the project.

In addition to the sharing advantages provided by the PowerCenter Data Integration
Hub approach, the global repository also serves as a centralized entry point for viewing
all repositories linked to it via networked local repositories. This mechanism allows a
global repository administrator to oversee multiple development projects without having
to log in separately to each of the individual local repositories. This capability is useful
for ensuring that individual project teams are adhering to enterprise standards and may
also be used by centralized QA teams, where appropriate.

Folder Architecture Options and Alternatives

Repository folders provide development teams with a simple method for grouping and
organizing work units. The process for creating and administering folders is quite
simple, and thoroughly explained in Informatica’s product documentation. The main
area for consideration is the determination of an appropriate folder structure within one
or more repositories.

TIP
If the migration approach adopted by the Technical Architect involves migrating
from a development repository to another repository (test or production), it may
make sense for the "target" repository to mirror the folder structure within the
development repository. This simplifies the repository-to-repository migration
procedures. Another possible approach is to assign the same names to
corresponding database connections in both the "source" and "target"
repositories. This is particularly useful when performing folder copies from one
environment to another because it eliminates the need to change database
connection settings after the folder copy has been completed.

The most commonly employed general approaches to folder structure are:

● Folders by Subject (Target) Area. The Subject Area Division method
provides a solid infrastructure for large data warehouse or data mart
developments by organizing work by key business area. This strategy is
particularly suitable for large projects populating numerous target tables. For
example, folder names may be SALES, DISTRIBUTION, etc.
● Folder Division by Environment. This method is easier to establish and
maintain than Folders by Subject Area, but is suitable only for small
development teams working with a minimal number of mappings. As each
developer completes unit tests in his/her individual work folders, the mappings
or objects are consolidated as they are migrated to test or QA. Migration to
production is significantly simplified, with the maximum number of required
folder copies limited to the number of environments. Eventually however, the
number of mappings in a single folder may become too large to easily
maintain. Folder names may be DEV1, DEV2, DEV3, TEST, QA, etc.
● Folder Division by Source Area. The Source Area Division method is
attractive to some development teams, particularly if development is
centralized around the source systems. In these situations, the promotion and
deployment process can be quite complex depending on the load strategy.
Folder names may be ERP, BILLING, etc.

In addition to these basic approaches, many PowerCenter development environments
also include developer folders that are used as "sandboxes," allowing for unrestricted
freedom in development and testing. Data Analyzer creates Personal Folders for each
user name which can be used as a sandbox area for report development and
test. Once the developer has completed the initial development and unit testing within
his/her own sandbox folder, he/she can migrate the results to the appropriate folder.

TIP
PowerCenter does not support nested folder hierarchies, which creates a
challenge to logically grouping development objects in different folders. A
common technique for logically grouping folders is to use standardized naming
conventions, typically prefixing folder names with a brief, unique identifier. For
example, suppose three developers are working on the development of a
Marketing department data mart. Concurrently, in the same repository, another
group of developers is working on a Sales data mart. In order to allow each
developer to work in his/her own folder, while logically grouping them together,
the folders may be named SALES_DEV1, SALES_DEV2, SALES_DEV3,
MRKT_DEV1, etc. Because the folders are arranged alphabetically, all of the
SALES-related folders will sort together, as will the MRKT folders.

Finally, it is also important to consider the migration process in the design of the folder
structures. The migration process depends largely on the folder structure that is
established, and the type of repository environment. In earlier versions of PowerCenter,
the most efficient method to migrate an object was to perform a complete folder copy.
This involves grouping mappings meaningfully within a folder, since all mappings within
the folder migrate together. However, if individual objects need to be migrated, the
migration process can become very cumbersome, since each object needs to be
"manually" migrated.

PowerCenter 7.x introduced the concept of team-based development and object
versioning, which integrated a true version-control tool within PowerCenter. Objects
can be treated as individual elements and can be checked out for development and
checked in for testing. Objects can also be linked together to facilitate their deployment
to downstream repositories.

Data Analyzer 4.x uses the export and import of repository objects for the migration
process among environments. Objects are exported and imported as individual pieces
and cannot be linked together in a deployment group as they can in PowerCenter 7.x or
migrated as a complete folder as they can in earlier versions of PowerCenter.

Developer Security

The security features built into PowerCenter and Data Analyzer allow the development
team to be grouped according to the functions and responsibilities of each member.
One common, but risky, approach is to give all developers access to the default
Administrator ID provided upon installation of the PowerCenter or Data Analyzer
software. Many projects use this approach because it allows developers to begin
developing mappings and sessions as soon as the software is installed.
INFORMATICA STRONGLY DISCOURAGES THIS PRACTICE. The following
paragraphs offer some recommendations for configuring security profiles for a
development team.

PowerCenter's and Data Analyzer’s security approach is similar to database security
environments. PowerCenter’s security management is performed through the
Repository Manager and Data Analyzer’s security is performed through tasks on the
Administrator tab. The internal security enables multi-user development through
management of users, groups, privileges, and folders. Despite the similarities,
PowerCenter UserIDs are distinct from database userids, and they are created,
managed, and maintained via administrative functions provided by the PowerCenter
Repository Manager or Data Analyzer Administrator.

Although privileges can be assigned to users or groups, it is more common to assign
privileges to groups only, and then add users to each group. This approach is simpler
than assigning privileges on a user-by-user basis since there are generally a few
groups and many users, and any user can belong to more than one group. Every user
must be assigned to at least one group.
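
Before groups are created in Repository Manager or on the Data Analyzer Administrator tab, it can be useful to sketch the intended group-to-folder privilege matrix and review it with the team. The sketch below is purely conceptual; the group names, folder names, and privilege labels are illustrative and are not PowerCenter's or Data Analyzer's internal privilege names.

# Conceptual group/privilege planning sketch; labels are illustrative only and
# do not correspond to actual PowerCenter or Data Analyzer privilege names.

GROUP_FOLDER_PRIVILEGES = {
    "SALES_DEVELOPERS": {"SALES_DEV1": "read/write/execute", "SHARED_OBJECTS": "read"},
    "TESTERS":          {"TEST": "read/execute"},   # no write: fixes go back to development
    "OPERATIONS":       {"PROD": "execute"},
    "REPO_ADMINS":      {"*": "administer"},
}

USER_GROUPS = {              # every user belongs to at least one group
    "asmith": ["SALES_DEVELOPERS"],
    "bjones": ["SALES_DEVELOPERS", "TESTERS"],
    "opsuser": ["OPERATIONS"],
}

def privileges_for(user):
    """Union of folder privileges granted through the user's group memberships."""
    granted = {}
    for group in USER_GROUPS.get(user, []):
        for folder, privilege in GROUP_FOLDER_PRIVILEGES[group].items():
            granted.setdefault(folder, set()).add(privilege)
    return granted

print(privileges_for("bjones"))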

For companies that have the capabilities to do so, LDAP integration is an available
option that can minimize the separate administration of usernames and passwords. If
you use LDAP authentication for repository users, the repository maintains an
association between repository user names and external login names. When you
create a user, you can select the login name from the external directory.

For additional information on PowerCenter and Data Analyzer security, including
suggestions for configuring user privileges and folder-level privileges, see Configuring
Security.

As development objects migrate closer to the production environment, security
privileges should be tightened. For example, the testing group is typically granted
Execute permissions in order to run mappings, but should not be given Write access to
the mappings. When the testing team identifies necessary changes, it can
communicate those changes (via a Change Request or bug report) to the development
group, which fixes the error and re-migrates the result to the test area.

The tightest security of all is reserved for promoting development objects into
production. In some environments, no member of the development team is permitted to
move anything into production. In these cases, a System Owner or other system
representative outside the development group must be given the appropriate repository
privileges to complete the migration process. The Technical Architect and Repository
Administrator must understand these conditions while designing an appropriate security
solution.

Best Practices

Configuring Security

Sample Deliverables
None

Last updated: 19-Dec-07 16:54

Phase 3: Architect
Subtask 3.2.3 Develop Change Control Procedures

Description

Changes are inevitable during the initial development and maintenance stages of any
project. Wherever and whenever the changes occur - in the logical and physical data
models, extract programs, business rules, or deployment plans - they must be
controlled.

Change control procedures include formal procedures to be followed when requesting
a change to the developed system (such as sources, targets, mappings, mapplets,
shared transformations, sessions, or batches for PowerCenter and schemas, global
variables, reports, or shared objects for Data Analyzer). The primary purpose of a
change control process is to facilitate the coordination among the various organizations
involved with effecting this change (i.e., development, test, deployment, and
operations). This process controls the timing, impact, and method by
which development changes are migrated through the promotion hierarchy. However,
the change control process must not be so cumbersome as to hinder speed of
deployment. The procedures should be thorough and rigid, without imposing undue
restrictions on the development team's goal of getting its solution into production in a
timely manner.

This subtask addresses many of the factors influencing the design of the change
control procedures. The procedures themselves should be a well-documented series of
steps, describing what happens to a development object once it has been modified (or
created) and unit tested by the developer. The change control procedures document
should also provide background contextual information, including the configuration of
the environment, repositories, and databases.

Prerequisites
None

Roles

Data Integration Developer (Secondary)

Database Administrator (DBA) (Secondary)

Presentation Layer Developer (Secondary)

Quality Assurance Manager (Approve)

Repository Administrator (Secondary)

System Administrator (Secondary)

Technical Project Manager (Primary)

Considerations

It is important to recognize that the change control procedures and the organization of
the development environment are heavily dependent upon each other. It is impossible
to thoroughly design one without considering the other. The following development
environment factors influence the approach taken to change control:

Repository Configuration

Subtask 3.2.2 Define Development Environments discusses the two basic approaches
to repository configuration. The first one, Stand-Alone PowerCenter, is the simplest
configuration in that it involves a single repository. If that single repository supports
both development and production (although this is not generally advisable), then the
change control process is fairly straightforward; migrations involve copying the relevant
object from a development folder to a production folder, or performing a complete folder
copy.

However, because of the many advantages gained by isolating development from
production environments, Informatica recommends physically separating repositories
whenever technically and fiscally feasible. This decision complicates the change control
procedures somewhat, but provides a more stable solution.

The general approach for migration is similar regardless of whether the environment is
a single repository or multiple repository approach. In either case, logical groupings of
development objects have been created, representing the various promotion levels
within the promotion hierarchy (e.g., DEV, TEST, QA, PROD). In the single repository
approach, the logical grouping is accomplished through the use of folders named
accordingly. In the multiple repository approach, an entire repository may be used for
one (or more) promotion levels. Whenever possible, the production repository should
be independent of the others. A typical configuration would be a shared repository
supporting both DEV and TEST, and a separate PROD repository. Where a given change must be applied depends on where the changed object resides, as the following rules (and the sketch after the list) illustrate:

● If the object is a global object (reusable or not reusable), the change must be
applied to the global repository.
● If the object is shared, the shortcuts referencing this object automatically
reflect the change from any location in the global or local architecture.
Therefore, only the "original" object must be migrated.
● If the object is stored in both repositories (i.e., global and local), the change
must be made in both repositories.
● Finally, if the object is only stored locally, the change is only implemented in
the local repository.
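
These rules can be summarized as a small decision helper, sketched below. This is conceptual only, not a PowerCenter API; the object categories are simply the four cases listed above.

# Conceptual sketch of the change-application rules above; not a PowerCenter API.

def repositories_to_update(object_location):
    """object_location is one of: 'global', 'shared', 'both', 'local'."""
    if object_location == "global":
        return {"global repository"}
    if object_location == "shared":
        # Shortcuts pick up the change automatically; migrate only the original.
        return {"repository holding the original object"}
    if object_location == "both":          # stored in both global and local
        return {"global repository", "local repository"}
    if object_location == "local":
        return {"local repository"}
    raise ValueError(f"unknown object location: {object_location}")

print(repositories_to_update("shared"))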

TIP
With a PowerCenter Data Integration Hub implementation, global repositories
can register local repositories. This provides access to both repositories
through one "console", simplifying the administrative tasks for completing
change requests. In this case, the global Repository Administrator can perform
all repository migration tasks.

Regardless of the repository configuration, however, the following questions must be considered in the change control procedures:

● What PowerCenter or Data Analyzer objects does this change affect?
● What other system objects are affected by the change? What processes
(migration/promotion, load) does this change impact?
● What processes does the client have in place to handle and track changes?
● Who else uses the data affected by the change and are they involved in the
change request?
● How will this change be promoted to other environments in a timely manner?
● What is the effort involved in making this change? Is there time in the project
schedule for this change? Is there sufficient time to fully test the change?

Change Request Tracking Method

The change procedures must include a means for tracking change requests and their
migration schedules, as well as a procedure for backing out changes, if necessary.
The Change Request Form should include information about the nature of the change,
the developer making the change, the timing of the request for migration, and enough
technical information about the change that it can be reversed if necessary.
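
A minimal, hypothetical record structure for such a Change Request Form might look like the sketch below; the field names are illustrative and are not part of any Informatica tool.

# Hypothetical change-request record; field names are illustrative only.
from dataclasses import dataclass, field
from datetime import date

@dataclass
class ChangeRequest:
    request_id: str
    requested_by: str                  # developer making the change
    description: str                   # nature of the change
    objects_affected: list             # e.g., mappings, sessions, schemas, reports
    requested_migration_date: date     # timing of the request for migration
    backout_notes: str                 # enough technical detail to reverse the change
    approvals: list = field(default_factory=list)

cr = ChangeRequest(
    request_id="CR-0001",
    requested_by="asmith",
    description="Add discount column to the SALES fact load",
    objects_affected=["m_load_sales_fact", "s_m_load_sales_fact"],
    requested_migration_date=date(2007, 3, 1),
    backout_notes="Revert to the previous checked-in version of the mapping.",
)
print(cr.request_id, cr.requested_migration_date)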

There are a number of ways to back out a changed development object. It is important
to note, however, that prior to PowerCenter 7.x, reversing a change to a single object in
the repository is very tedious and error-prone, and should be considered as a last
resort. The time to plan for this occurrence however, is during the implementation of the
development environment, not after an incorrect change has been migrated into
Production. Backing out a change in PowerCenter 7.x, however, is as simple as
reverting to a previous version of the object(s).

Team-Based Development, Tracking and Reverting to a Previous Version

The team-based development option provides functionality in two areas: versioning and
deployment. But other features, such as repository queries and labeling, are necessary
to ensure optimal use of versioning and deployment. The following sections describe
this functionality at a general level. For a more detailed explanation of any of the
capabilities of the team-based development features of PowerCenter, refer to the
appropriate sections of the PowerCenter documentation.

While the functionality provided via team-based development is quite powerful, it is
clear that there are better ways of using it to achieve expected goals. The activities of
coordinating development in a team environment, tracking finished work that needs to
be reviewed or migrated, managing migrations, and ensuring minimal errors can be
quite complex. The process requires a combination of PowerCenter functionality and
user process to implement effectively.

Data Migration Projects

For Data Migration projects, change control is critical for success. It is common that the
target system has continual changes during the life of the data migration project. These
cause changes to specifications, which in turn cause a need to change the mappings,
sessions, workflows, and scripts that make up the data migration project. Change
control is important to allow the project management to understand the scope of
change and to limit the impact that process changes cause to related processes. For
data migration, the key to change control is in the communication of changes to ensure
that testing activities are integrated.

Best Practices
None

Sample Deliverables
None

Last updated: 15-Feb-07 18:48

Phase 3: Architect
Subtask 3.2.4 Determine Metadata Strategy

Description

Designing, implementing and maintaining a solid metadata strategy is a key enabler of high-quality solutions. The
federated architecture model of a PowerCenter-based global metadata repository provides the ability to share metadata
that crosses departmental boundaries while allowing non-shared metadata to be maintained independently.

A proper metadata strategy provides Data Integration Developers, End-User Application Developers, and End Users with
the ability to create a common understanding of the data, where it came from, and what business rules have been applied
to it. As such, the metadata may be as important as the data itself, because it provides context and credibility to the data
being analyzed.

The metadata strategy should describe where metadata will be obtained, where it will be stored, and how it will be
accessed. After the strategy is developed, the Metadata Manager is responsible for documenting and distributing it to the
development team and end-user community. This solution allows for the following capabilities:

● Consolidation and cataloging of metadata from various source systems


● Reporting on cataloged metadata
● Lineage and where-used Analysis
● Operational reporting
● Extensibility

The Business Intelligence Metadata strategy can also assist in achieving the goals of data orientation by providing a focus
for sharing the data assets of an organization. It can provide a map for managing the expanding requirements for
reporting information that the business places upon the IT environment. The metadata strategy highlights the importance
of a central data administration department for organizations that are concerned about data quality, integrity, and reuse.
The components of a metadata strategy for Data Analyzer include:

● Determine how metadata will be used in the organization


● Data stewardship
● Data ownership
● Determine who will use what metadata and why
● Business definitions and names
● Systems definitions and names
● Determine training requirements for the power user as well as regular users

Prerequisites
None

Roles

Metadata Manager (Primary)

Repository Administrator (Primary)

Technical Project Manager (Approve)



Considerations

The metadata captured while building and deploying the analytic solution architecture should pertain to each of the system's points of integration, since these are the areas where managing metadata provides benefit to IT and/or business users. The Metadata Manager should analyze each point of integration in order to answer the following questions:

● What metadata needs to be captured?


● Who are the users that will benefit from this metadata?
● Why is it necessary to capture this metadata (i.e., what are the actual benefits of capturing this metadata)?
● Where is the metadata currently stored (i.e., its source) and where will it ultimately reside?
● How will the repository be populated initially, maintained, and accessed?

It is important to centralize metadata management functions despite the potential "metadata bottleneck" that may be
created during development. This consolidation is beneficial when a production system based on clean, reliable metadata
is unveiled to the company. The following table expands the concept of the Who, What, Why, Where, and How approach
to managing metadata:

Each row below answers the questions in order: Metadata Definition (What?), Users (Who?), Benefits (Why?), Source of Metadata (Where?), Metadata Store (Where?), Population (How?), Maintenance (How?), Access (How?).

Source Structures
● Users (Who?): Source system users and owners; data migration resources
● Benefits (Why?): Allows users to see all structures and associated elements in the source system
● Source of Metadata (Where?): Source operational system
● Metadata Store (Where?): PowerCenter Repository
● Population (How?): Source Analyzer, PowerPlugs, PowerCenter, Informatica Data Explorer
● Maintenance (How?): Captured and loaded once, maintained as necessary
● Access (How?): Repository Manager

Target Warehouse Structures
● Users (Who?): Target system users/analysts; DW Architects; data migration resources
● Benefits (Why?): Allows users to see all structures and associated elements in the target system
● Source of Metadata (Where?): Data warehouse, data marts
● Metadata Store (Where?): Informatica Repository
● Population (How?): Warehouse Designer, PowerPlugs, PowerCenter
● Maintenance (How?): Captured and loaded once, maintained as necessary
● Access (How?): Repository Manager

Source-to-Target Mappings
● Users (Who?): Data migration resources; business analysts
● Benefits (Why?): Simplifies the documentation process; allows for quicker, more efficient rework of mappings
● Source of Metadata (Where?): PowerCenter
● Metadata Store (Where?): Informatica Repository
● Population (How?): PowerCenter Designer, Informatica Data Explorer
● Maintenance (How?): Capture data changes
● Access (How?): Repository Manager

Reporting Tool
● Users (Who?): Business analysts
● Benefits (Why?): Allows users to see business names and definitions for query-building
● Source of Metadata (Where?): Data Analyzer
● Metadata Store (Where?): Informatica Repository
● Access (How?): Reporting Tool
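One hedged way to keep these answers consistent across integration points is to record each row in a small structure. The sketch below is illustrative only; the example values simply restate the Source Structures row above.

from dataclasses import dataclass
from typing import List

@dataclass
class MetadataCapturePlan:
    """Records the Who/What/Why/Where/How answers for one integration point."""
    definition: str       # What metadata is captured?
    users: List[str]      # Who benefits from it?
    benefit: str          # Why capture it?
    source: str           # Where is it stored today?
    store: str            # Where will it ultimately reside?
    population: str       # How is the store populated?
    maintenance: str      # How is it maintained?
    access: str           # How is it accessed?

source_structures = MetadataCapturePlan(
    definition="Source structures",
    users=["Source system users and owners", "Data migration resources"],
    benefit="See all structures and associated elements in the source system",
    source="Source operational system",
    store="PowerCenter Repository",
    population="Source Analyzer, PowerPlugs, PowerCenter, Informatica Data Explorer",
    maintenance="Captured and loaded once, maintained as necessary",
    access="Repository Manager",
)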

Note that the Informatica Data Explorer (IDE) application suite possesses a wide range of functional capabilities for data
and metadata profiling and for source-to-target mapping.

The Metadata Manager and Repository Manager need to work together to determine how best to capture the metadata,
always considering the following points:

● Source structures. Are source data structures captured or stored already in a CASE/data modeling tool? Are
they maintained consistently?
● Target structures. Are target data structures captured or stored already in a CASE/data modeling tool? Is
PowerCenter being used to create target data structure? Where will the models be maintained?
● Extract, Transform, and Load process. Assuming PowerCenter is being used for the ETL processing, the metadata will be created and maintained automatically within the PowerCenter repository. Also, remember that
any ETL code developed outside of a PowerCenter mapping (i.e., in stored procedures or external procedures)
will not have metadata associated with it.
● Analytic applications. Several front-end analytic tools have the ability to import PowerCenter metadata. This can
simplify the development and maintenance of the analytic solution.
● Reporting tools. End users working with Data Analyzer may need access to the PowerCenter metadata in order
to understand the business context of the data in the target database(s).
● Operational metadata. PowerCenter automatically captures rich operational data when batches and sessions are
executed. This metadata may be useful to operators and end users, and should be considered an important part
of the analytic solution.

Best Practices
None

Sample Deliverables
None

Last updated: 18-Oct-07 15:09



Phase 3: Architect
Subtask 3.2.5 Develop Change Management Process

Description

Change Management is the process for managing the implementation of changes to a project (i.e., data warehouse or data integration) including hardware, software, services, or related documentation.

Its purpose is to minimize the disruption to services caused by change and to ensure
that records of hardware, software, services and documentation are kept up to date.
The Change Management process enables the actual change to take place. Elements of the process include identifying the change, creating a request for change, impact assessment, approval, scheduling, and implementation.
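The sketch below expresses those process elements as a simple, ordered lifecycle with an allowed-transition check. The state names and the rejection path are assumptions made for illustration, not part of the methodology itself.

# Allowed transitions between change management stages (illustrative).
TRANSITIONS = {
    "Identified":        {"Request Created"},
    "Request Created":   {"Impact Assessed"},
    "Impact Assessed":   {"Approved", "Rejected"},
    "Approved":          {"Scheduled"},
    "Scheduled":         {"Implemented"},
    "Implemented":       set(),
    "Rejected":          {"Request Created"},   # resubmission with more detail
}

def advance(current: str, new: str) -> str:
    """Move a change request to a new stage, enforcing the process order."""
    if new not in TRANSITIONS.get(current, set()):
        raise ValueError(f"Cannot move from {current!r} to {new!r}")
    return new

stage = "Identified"
for nxt in ["Request Created", "Impact Assessed", "Approved", "Scheduled", "Implemented"]:
    stage = advance(stage, nxt)
print(stage)  # Implemented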

Prerequisites
None

Roles

Business Project Manager (Primary)

Project Sponsor (Review Only)

Technical Project Manager (Primary)

Considerations

Identify Change

Change Management is necessary in any of the following situations:

● A problem arises that requires a change that will affect more than one
business user or a user group such as sales, marketing, etc.



● A new requirement is identified as a result of advances in technology (e.g., a
software upgrade) or a change in needs (for new functionality).
● A change is required to fulfill a change in business strategy as identified by a
business leader or developer.

Request for Change

A request for change should be completed for each proposed change, with a checklist
of items to be considered and approved before implementing the change. The change
procedures must include a means for tracking change requests and their migration
schedules, as well as a procedure for backing out changes, if necessary. The Change
Request Form should include information about the nature of the change, the
developer making the change, the timing of the request for migration, and enough
technical information about the change that it can be reversed if necessary.

Before implementing a change request in the PowerCenter environment, it is advisable to create an additional back-up repository. Using this back-up, the repository can be
restored to a 'spare' repository database. After a successful restore, the original object
can be retrieved via object copy. In addition, be sure to:

● Track changes manually (electronic or paper change request form), then change the object back to its original form by referring to the change request form.
● Create one to 'x' number of version folders, where 'x' is the number of versions
back that repository information is maintained. If a change needs to be
reversed, the object simply needs to be copied to the original development
folder from this versioning folder. The number of 'versions' to maintain is at the
discretion of the PowerCenter Administrator. Note however, that this approach
has the disadvantage of being very time consuming and may also greatly
increase the size of the repository databases.

PowerCenter Versions 7.X and 8.X

The team-based development option provides functionality in two areas: versioning and
deployment. But, other features, such as repository queries and labeling are required to
ensure optimal use of versioning and deployment. The following sections describe this
functionality at a general level. For a more detailed explanation of any of the
capabilities of the Team-based Development features of PowerCenter, please refer to
the appropriate sections of the PowerCenter documentation.



For clients using Data Analyzer for front-end reporting, certain considerations need to be addressed with the migration of objects:
● Data Analyzer’s repository database contains user profiles in addition to
reporting objects. If users are synchronized from outside sources (like an
LDAP directory or via Data Analyzer’s API), then a repository restore from one
environment to another may delete user profiles (once the repository is linked
to LDAP).
● When reports containing references to dashboards are migrated, the
dashboards also need to be migrated to reflect the link to the report.
● In a clustered Data Analyzer configuration, certain objects that are migrated
via XML imports may only be reflected on the node that the import operation
was performed on. It may be necessary to stop and re-start the other nodes to
refresh these nodes with these changes.

Approval to Proceed

An initial review of the Change Request form should assess the cost and value of
proceeding with the change. If sufficient information is not provided on the request form
to enable the initial reviewer to thoroughly assess the change, he or she should return
the request form to the originator for further details. The originator can then resubmit
the change request with the requested information. The change request must be
tracked through all stages of the change request process, with thorough documentation
regarding approval or rejection and resubmission.

Plan and Prepare Change

Once approval to proceed has been granted, the originator may plan and prepare the
change in earnest.

The following sections on the request for change must be completed at this stage:

● Full details of change – Inform Administrator, backup repository and backup database.
● Impact on services and users – Inform business users in advance about any
anticipated outage.
● Assessment of risk of the change failing.
● Fallback plan in case of failure – Includes reverting to the old version using team-based development (TBD)
● Date and time of change – Migration / Promotion plan Test-Dev and Dev-Prod



Impact Analysis

The Change Control Process must include a formalized approach to completing impact
analysis. Any implemented change has some planned downstream impact (e.g., the
values on a report will change, additional data will be included, a new target file will be
populated, etc.). The importance of the impact analysis process is in recognizing unforeseen downstream effects prior to implementing the change. In many cases, the
impact is easy to define. For example, if a requested change is limited to changing the
target of a particular session from a flat file to a table, the impact is obvious. However,
most changes occur within mappings or within databases, and the hidden impacts can
be worrisome. For example, if a business rule change is made, how will the end results
of the mapping be affected? If a target table schema needs to be modified within the
repository, the corresponding target database must also be changed, and it must be
done in sync with the migration of the repository change.

An assessment must be completed to determine how a change request affects other objects in the analytic solution architecture. In many development projects, the initial analysis is performed and then communicated to all affected parties (e.g., Repository Administrator, DBAs, etc.) at a regularly scheduled meeting. This ensures that everyone who needs to be notified is informed, and that all parties approve the change request. For
PowerCenter, the Repository Manager can be used to identify object
interdependencies.
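Conceptually, the impact analysis is a walk of object dependencies outward from the changed object. The sketch below shows that idea on a hand-built dependency map; in practice the dependency information comes from Repository Manager, not from a hard-coded dictionary like this one, and the object names are invented.

from collections import deque

# Hypothetical "object -> objects that depend on it" map; in a real project this
# information is obtained from the repository (e.g., via Repository Manager).
DEPENDENTS = {
    "TGT_CUSTOMER_DIM": ["m_load_customer_dim"],
    "m_load_customer_dim": ["s_m_load_customer_dim"],
    "s_m_load_customer_dim": ["wf_nightly_load"],
    "wf_nightly_load": [],
}

def downstream_impact(changed_object: str) -> list:
    """Breadth-first walk returning every object affected by the change."""
    seen, queue, impacted = {changed_object}, deque([changed_object]), []
    while queue:
        for dep in DEPENDENTS.get(queue.popleft(), []):
            if dep not in seen:
                seen.add(dep)
                impacted.append(dep)
                queue.append(dep)
    return impacted

print(downstream_impact("TGT_CUSTOMER_DIM"))
# ['m_load_customer_dim', 's_m_load_customer_dim', 'wf_nightly_load']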

An impact analysis must answer the following questions:

● What PowerCenter or Data Analyzer objects does this change affect?


● What other system objects are affected by the change? What processes (i.e.,
migration/promotion, load) does this change impact?
● What processes does the client have in place to handle and track changes?
● Who else uses the data affected by the change and are they involved in the
change request?
● How will this change be promoted to other environments in a timely manner?
● What is the effort involved in making this change? Is there time in the project
schedule for this change? Is there sufficient time to fully test the change?

Implementation

Following final approval and after relevant and timely communications have been
issued, the change may be implemented in accordance with the plan and the
scheduled date and time.



After implementation, the change request form should indicate whether the change was
successful or unsuccessful so as to maintain a clear record of the outcome of the
request.

Change Control and Migration/Promotion Process

Identifying the most efficient method for applying change to all environments is
essential. Within the PowerCenter and Data Analyzer environments, the types of
objects to manage are:

● Source definitions
● Target definitions
● Mappings and mapplets
● Reusable transformations
● Sessions
● Batches
● Reports
● Schemas
● Global variables
● Dashboards
● Schedules

In addition, there are objects outside of the Informatica architecture that are directly
linked to these objects, so the appropriate procedures need to be established to ensure
that all items are synchronized.

When a change request is submitted, the following steps should occur (a sketch of this flow appears after the list):

1. Perform impact analysis on the request. List all objects affected by the change,
including development objects and databases.
2. Approve or reject the change or migration request. The Project Manager has
authority to approve/reject change requests.
3. If approved, pass the request to the PowerCenter Administrator for processing.
4. Migrate the change to the test environment.
5. Test the requested change. If the change does not pass testing, the process will
need to start over for this object.
6. Submit the promotion request for migration to QA and/or production
environments.



7. If appropriate, the Project Manager approves the request.
8. The Repository Administrator promotes the object to appropriate environments.
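The minimal sketch below walks a request through the steps above. The step functions are placeholders for the real activities (impact analysis, administrator hand-off, testing, promotion), and the pass/fail flags are assumptions for illustration.

def process_change_request(request: dict, approved: bool, test_passes: bool) -> str:
    """Walk a change request through the migration steps described above."""
    request["impacted_objects"] = ["<listed during impact analysis>"]   # step 1
    if not approved:                                                    # step 2
        return "Rejected by Project Manager"
    request["assigned_to"] = "PowerCenter Administrator"                # step 3
    request["environment"] = "TEST"                                     # step 4
    if not test_passes:                                                 # step 5
        return "Failed testing - restart the process for this object"
    request["promotion_requested"] = ["QA", "PROD"]                     # steps 6-7
    request["environment"] = "PROD"                                     # step 8
    return "Promoted by Repository Administrator"

print(process_change_request({"id": "CR-0042"}, approved=True, test_passes=True))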

Best Practices
None

Sample Deliverables
None

Last updated: 15-Feb-07 18:51



Phase 3: Architect
Task 3.3 Implement Technical Architecture

Description

While it is crucial to design and implement a technical architecture as part of the data
integration project development effort, most of the implementation work is beyond the
scope of this document. Specifically, the acquisition and installation of hardware and
system software is generally handled by internal resources, and is accomplished by
following pre-established procedures. This section touches on these topics, but is not
meant to be a step-by-step guide to the acquisition and implementation process.

After determining an appropriate technical architecture for the solution (3.1 Develop
Solution Architecture), the next step is to physically implement that architecture. This
includes procuring and installing the hardware and software required to support the
data integration processes.

Prerequisites

3.2 Design Development Architecture

Roles

Database Administrator (DBA) (Secondary)

Project Sponsor (Approve)

Repository Administrator (Primary)

System Administrator (Primary)

Technical Architect (Primary)

Technical Project Manager (Primary)



Considerations

The project schedule should be the focus of the hardware and software implementation
process. The entire procurement process, which may require a significant amount of
time, must begin as soon as possible to keep the project moving forward. Delays in this
step can cause serious delays to the project as a whole. There are, however, a number
of proven methods for expediting the procurement and installation processes, as
described in the related subtasks.

Best Practices
None

Sample Deliverables
None

Last updated: 01-Feb-07 18:45



Phase 3: Architect
Subtask 3.3.1 Procure Hardware and Software

Description

This is the first step in implementing the technical architecture. The procurement
process varies widely among organizations, but is often based on a purchase request (i.e., Request for Purchase or RFP) generated by the Project Manager after the project
architecture is planned and configuration recommendations are approved by IT
management.

An RFP is usually mandatory for procuring any new hardware or software. Although the
forms vary widely among companies, an RFP typically lists what products need to be
purchased, when they will be needed, and why they are necessary for the project. The
document is then reviewed and approved by appropriate management and the
organization's "buyer".

It is critical to begin the procurement process well in advance of the start of development.

Prerequisites

3.2 Design Development Architecture

Roles

Database Administrator (DBA) (Secondary)

Project Sponsor (Approve)

Repository Administrator (Secondary)

System Administrator (Secondary)

Technical Architect (Primary)



Technical Project Manager (Primary)

Considerations

Frequently, the Project Manager does not control purchasing new hardware and
software. Approval must be received from another group or individual within the
organization, often referred to as a "buyer". Even before product purchase decisions
are finalized, it is a good idea to notify the buyer of necessary impending purchases,
providing a brief overview of the types of products that are likely to be required and for
what reasons.

It may also be possible to begin the procurement process before all of the prerequisite
steps are complete (See 2.2 Define Business Requirements, 3.1.2 Develop
Architecture Logical View, and 3.1.3 Develop Configuration Recommendations). The
Technical Architect should have a good idea of at least some of the software and
hardware choices before a physical architecture and configuration recommendations
are solidified.

Finally, if development is ready to begin and the hardware procurement process is not
yet complete, it may be worthwhile to get started on a temporary server with the
intention of moving the work to the new server when it is available.

Best Practices
None

Sample Deliverables
None

Last updated: 01-Feb-07 18:45



Phase 3: Architect
Subtask 3.3.2 Install/Configure Software

Description

Installing, configuring, and deploying new hardware and software should not be allowed to delay the progress of a data integration project. The entire development team depends on a
properly configured technical environment. Incorrect installation or delays can have
serious negative effects on the project schedule.

Establishing and following a detailed installation plan can help avoid unnecessary
delays in development. (See 3.1.2 Develop Architecture Logical View).

Prerequisites

3.2 Design Development Architecture

Roles

Database Administrator (DBA) (Primary)

Repository Administrator (Primary)

System Administrator (Primary)

Technical Architect (Review Only)

Technical Project Manager (Review Only)

Considerations

When installing and configuring hardware and software for a typical data warehousing
project, the following Informatica software components should be considered:

● PowerCenter Services – The PowerCenter services, including the repository, integration, log, and domain services, should be installed and configured on a server machine.
● PowerCenter Client – The client tools for the PowerCenter engine must be
installed and configured on the client machines for developers. The
DataDirect ODBC drivers should also be installed on the client machines.
The PowerCenter client tools allow a developer to interact with the repository
through an easy-to-use GUI interface.
● PowerCenter Reports – PowerCenter Reports (PCR) is a reporting tool that
enables users to browse and analyze PowerCenter metadata, allowing users
to view PowerCenter operational load statistics and perform impact analysis.
PCR is based on Informatica Data Analyzer, running on an included JBOSS
application server, to manage and distribute these reports via an internet
browser interface.
● PowerCenter Reports Client – The PCR client is a web-based, thin-client tool
that uses Microsoft Internet Explorer 6 as the client. Additional client tool
installation for the PCR is usually not necessary, although the proper version
of Internet Explorer should be verified on client workstations.
● Data Analyzer Server – The analytics server engine for Data Analyzer should
be installed and configured on a server.
● Data Analyzer Client – Data Analyzer is a web-based, thin-client tool that
uses Microsoft Internet Explorer 6 as the client. Additional client tool
installation for Data Analyzer is usually not necessary, although the proper
version of Internet Explorer should be verified on the client machines of
business users to ensure that minimum requirements are met.
● PowerExchange – PowerExchange has components that must be installed on
the source system, PowerCenter server, and client.

In addition to considering the Informatica software components that should be installed, the preferred database for the data integration project should be selected and installed, keeping these important database size considerations in mind (a rough sizing sketch follows this list):

● PowerCenter Metadata Repository - Although you can create a PowerCenter metadata repository with a minimum of 100MB of database
space, Informatica recommends allocating up to 150MB for PowerCenter
repositories. Additional space should be added for versioned repositories. The
database user should have privileges to create tables, views, and indexes.
● Data Analyzer Metadata Repository - Although you can create a Data
Analyzer repository with a minimum of 60MB of database space, Informatica
recommends allocating up to 150MB for Data Analyzer repositories. The
database user should have privileges to create tables, views, and indexes.
● Metadata Manager Repository – Although you can create a Metadata Manager repository with a minimum of 550MB of database space, you may
choose to allocate more space in order to plan for future growth. The database
user should have privileges to create tables, views, and indexes.
● Data Warehouse Database – Allow ample space and plan for rapid growth.
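As a rough planning aid only, the sketch below turns those minimums and recommendations into a space estimate. The warehouse base size and growth multiplier are assumptions, since actual warehouse sizing depends entirely on data volumes.

def estimated_space_mb(versioned_powercenter: bool = False,
                       warehouse_base_mb: int = 50_000,
                       warehouse_growth_factor: float = 2.0) -> dict:
    """Rule-of-thumb database allocations (MB) drawn from the guidance above."""
    return {
        # 100MB minimum, 150MB recommended; 500MB when versioning is enabled.
        "powercenter_repository": 500 if versioned_powercenter else 150,
        # 60MB minimum, 150MB recommended.
        "data_analyzer_repository": 150,
        # 550MB minimum, plus headroom for future growth.
        "metadata_manager_repository": 550,
        # Warehouse sizing is workload-specific; the factor here is an assumption.
        "data_warehouse": int(warehouse_base_mb * warehouse_growth_factor),
    }

print(estimated_space_mb(versioned_powercenter=True))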

PowerCenter Server Installation

The PowerCenter services need to be installed and configured, along with any
necessary database connectivity drivers, such as native drivers or ODBC. Connectivity
needs to be established among all the platforms before the Informatica applications can
be used.

The recommended configuration for the PowerCenter environment is to install the PowerCenter services and the repository and target databases on the same
multiprocessor machine. This approach minimizes network interference when the
server is writing to the target database. Use this approach when available CPU and
memory resources on the multiprocessor machine allow all software processes to
operate efficiently without “pegging” the server. If available hardware dictates that the
PowerCenter Server is separated physically from the target database server,
Informatica recommends placing a high-speed network connection between the two
servers.

Some organizations house the repository database on a separate database server if they are running OLAP servers and want to consolidate metadata repositories.
Because the repository tables are typically very small in comparison to the data mart
tables, and storage parameters are set at the database level, it may be advisable to
keep the repository in a separate database.

For step-by-step instructions for installing the PowerCenter services, refer to the
Informatica PowerCenter Installation Guide. The following list is intended to
complement the installation guide when installing PowerCenter:

● Network Protocol - TCP/IP and IPX/SPX are the supported protocols for
communication between the PowerCenter services and PowerCenter client
tools. To improve repository performance, consider installing the Repository
service on a machine with a fast network connection. To optimize
performance, do not install the Repository service on a Primary Domain
Controller (PDC) or a Backup Domain Controller (BDC).
● Native Database Drivers (or ODBC in some instances) are used by the
Server to connect to the source, target, and repository databases. Ensure that all appropriate database drivers (and most recent patch levels) are installed on
the PowerCenter server to access source, target, and repository databases.
● Operating System Patches – Prior to installing PowerCenter, please refer to
the PowerCenter Release Notes documentation to ensure that all required
patches have been applied to the operating system. This step is often
overlooked and can result in operating system errors and/or failures when
running the PowerCenter Server.
● Data Movement Mode - The DataMovementMode option is set in the
PowerCenter Integration Service configuration. The DataMovementMode can
be set to ASCII or Unicode. Unicode is an international character set standard
that supports all major languages (including US, European, and Asian), as well
as common technical symbols. Unicode uses a fixed-width encoding of 16-bits
for every character. ASCII is a single-byte code page that encodes character
data with 7-bits. Although actual performance results depend on the nature of
the application, if international code page support (i.e., Unicode) is not
required, set the DataMovementMode to ASCII because the 7-bit storage of
character data results in smaller cache sizes for string data, resulting in more
efficient data movement.
● Versioning – If Versioning is enabled for a PowerCenter Repository,
developers can save multiple copies of any PowerCenter object to the
repository. Although this feature provides developers with a seamless way to
manage changes during the course of a project, it also results in larger
metadata repositories. If Versioning is enabled for a repository, Informatica
recommends allocating a minimum of 500MB of space in the database for the
PowerCenter repository.
● Lightweight Directory Access Protocol (LDAP) - If you use PowerCenter
default authentication, you create users and maintain passwords in the
PowerCenter metadata repository using Repository Manager. The Repository
service verifies users against these user names and passwords. If you use
Lightweight Directory Access Protocol (LDAP), the Repository service passes
a user login to the external directory for authentication, allowing
synchronization of PowerCenter user names and passwords with network/
corporate user names and passwords. The repository maintains an
association between repository user names and external login names. You
must create the user name-login associations, but you do not maintain user
passwords in the repository. Informatica provides a PowerCenter plug-in that
you can use to interface between PowerCenter and an LDAP server. To install
the plug-in, perform the following steps:
1. Configure the LDAP module connection information from the
Administration Console.
2. Register the package with each repository that you want to use it with.
3. Set up users in each repository.



For more information on configuring LDAP authentication, refer to the Informatica
PowerCenter Repository Guide.

PowerCenter Client Installation

The PowerCenter Client needs to be installed on all developer workstations, along with
any necessary drivers, including database connectivity drivers such as ODBC.

Before you begin the installation, verify that you have enough disk space for the
PowerCenter Client. You must have 300MB of disk space to install the PowerCenter 8
Client tools. Also, make sure you have 30MB of temporary file space available for the
PowerCenter Setup. When installing PowerCenter Client tools via a standard
installation, choose to install the “Client tools” and “ODBC” components.

TIP
You can install the PowerCenter Client tools in standard mode or silent mode.
You may want to perform a silent installation if you need to install the
PowerCenter Client on several machines on the network, or if you want to
standardize the installation across all machines in the environment. When you
perform a silent installation, the installation program uses information in a
response file to locate the installation directory. You can also perform a silent
installation for remote machines on the network.

When adding an ODBC data source name (DSN) to client workstations, it is a good
idea to keep the DSN consistent among all workstations. Aside from eliminating the
potential for confusion on individual developer machines, this is important when
importing and exporting repository registries.

The Repository Manager saves repository connection information in the registry. To simplify the process of setting up client systems, it is possible to export that information,
and then import it for a new client. The registry references the data source names used
in the exporting machine. If a registry is imported containing a DSN that does not exist
on the client system, the connection will fail at runtime.
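A quick way to catch the missing-DSN problem before runtime is to compare the DSNs a shared registry export expects against the DSNs actually defined on the workstation. The sketch below assumes a hard-coded list of expected DSN names (not a real PowerCenter export format) and uses the third-party pyodbc package to list what the machine has defined.

import pyodbc  # third-party package: pip install pyodbc

# Hypothetical list of DSN names the shared repository registry expects;
# the real export format is not parsed here.
EXPECTED_DSNS = ["DW_TARGET", "PC_REPOSITORY", "ODS_STAGE"]

def missing_dsns(expected=EXPECTED_DSNS):
    """Return expected DSN names that are not defined on this workstation."""
    defined = set(pyodbc.dataSources())   # {dsn_name: driver_name}
    return [name for name in expected if name not in defined]

if __name__ == "__main__":
    gaps = missing_dsns()
    if gaps:
        print("Define these ODBC DSNs before importing the registry:", gaps)
    else:
        print("All expected DSNs are present.")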

PowerCenter Reports Installation

PowerCenter Reports (PCR) replaces the PowerCenter Metadata Reporter. The reports are built on the Data Analyzer infrastructure. Data Analyzer must be installed
and configured, along with the application server foundation software. Currently, PCR
is shipped with the PowerCenter installation (both Standard and Advanced Editions).



The recommended configuration for the PCR environment is to place the PCR/Data
Analyzer server, application server, and repository databases on the same
multiprocessor machine. This approach minimizes network input/output as the PCR
server reads from the PowerCenter repository database. Use this approach when
available CPU and memory resources on the multiprocessor machine allow all software
processes to operate efficiently without “pegging” the server. If available hardware
dictates that the PCR server be physically separated from the PowerCenter repository
database server, Informatica recommends placing a high-speed network connection
between the two servers.

For step-by-step instructions for installing the PowerCenter Reports, refer to the
Informatica PowerCenter Installation Guide. The following list of considerations is
intended to complement the installation guide when installing PCR:

● Operating System Patch Levels – Prior to installing PCR, be sure to refer to the Data Analyzer Release Notes documentation to ensure that all required
patches have been applied to the operating system. This step is often
overlooked and can result in operating system errors and/or failures if the
correct patches are not applied.
● Lightweight Directory Access Protocol (LDAP) - If you use default
authentication, you create users and maintain passwords in the Data Analyzer
metadata repository. Data Analyzer verifies users against these user names
and passwords. However, if you use Lightweight Directory Access Protocol
(LDAP), Data Analyzer passes a user login to the external directory for
authentication, allowing synchronization of Data Analyzer user names and
passwords with network/corporate user names and passwords, as well as
PowerCenter user names and passwords. The repository maintains an
association between repository user names and external login names. You
must create the user name-login associations, but you do not have to maintain
user passwords in the repository. In order to enable LDAP, you must configure
the IAS.properties and ldaprealm.properties files. For more information on
configuring LDAP authentication, see the Data Analyzer Administration Guide.

PowerCenter Reports Client Installation

The PCR client is a web-based, thin-client tool that uses Microsoft Internet Explorer 6
as the client. The proper version of Internet Explorer should be verified on client
machines, ensuring that Internet Explorer 6 is the default web browser, and the
minimum system requirements should be validated.

In order to use PCR, the client workstation should have at least a 300MHz processor
and 128MB of RAM. Please note that these are the minimum requirements for the PCR client, and that if other applications are running on the client workstation, additional
CPU and memory are required. In most situations, users are likely to be multi-tasking
using multiple applications, so this should be taken into consideration.

Certain interactive features in the PCR require third-party plug-in software to work
correctly. Users must download and install the plug-in software on their workstation
before they can use these features. PCR uses the following third-party plug-in software:

● Microsoft SOAP Toolkit - In PCR, you can export a report to an Excel file and
refresh the data in Excel directly from the cached data in PCR or from data in
the data warehouse through PCR. To use the data refresh feature, you must
first install the Microsoft SOAP Toolkit. For information on downloading the
Microsoft SOAP Toolkit, see “Working with Reports” in the Data Analyzer User
Guide.
● Adobe SVG Viewer - In PCR, you can display interactive report charts and
chart indicators. You can click on an interactive chart to drill into the report
data and view details and select sections of the chart. To view interactive
charts, you must install Adobe SVG Viewer. For more information on
downloading Adobe SVG Viewer, see “Managing Account Information” in the
Data Analyzer User Guide.

Lastly, for PCR to display its application windows correctly, Informatica recommends
disabling any pop-up blocking utility on your browser. If a pop-up blocker is running
while you are working with PCR, the PCR windows may not display properly.

Data Analyzer Server Installation

The Data Analyzer Server needs to be installed and configured along with the
application server foundation software. Currently, Data Analyzer is certified on the
following application servers:

● BEA WebLogic
● IBM WebSphere
● JBoss Application Server

Refer to the PowerCenter Installation Guide for the current list of supported application
servers and exact version numbers.



TIP
When installing IBM WebSphere Application Server, avoid using spaces in the
installation directory path name for the application server, http server, or
messaging server.

The recommended configuration for the Data Analyzer environment is to put the Data
Analyzer Server, application server, repository, and data warehouse databases on the
same multiprocessor machine. This approach minimizes network input/output as the
Data Analyzer Server reads from the data warehouse database. Use this approach
when available CPU and memory resources on the multiprocessor machine allow all
software processes to operate efficiently without “pegging” the server. If available
hardware dictates that the Data Analyzer Server is separated physically from the data
warehouse database server, Informatica recommends placing a high-speed network
connection between the two servers.

For step-by-step instructions for installing the Data Analyzer Server components, refer
to the Informatica Data Analyzer Installation Guide. The following list of considerations
is intended to complement the installation guide when installing Data Analyzer:

● Operating System Patch Levels – Prior to installing Data Analyzer, refer to the Data Analyzer Release Notes documentation to ensure that all required
patches have been applied to the operating system. This step is often
overlooked and can result in operating system errors and/or failures if the
correct patches are not applied.
● Lightweight Directory Access Protocol (LDAP) - If you use Data Analyzer
default authentication, you create users and maintain passwords in the Data
Analyzer metadata repository. Data Analyzer verifies users against these user
names and passwords. However, if you use Lightweight Directory Access
Protocol (LDAP), Data Analyzer passes a user login to the external directory
for authentication, allowing synchronization of Data Analyzer user names and
passwords with network/corporate user names and passwords, as well as
PowerCenter user names and passwords. The repository maintains an
association between repository user names and external login names. You
must create the user name-login associations, but you do not maintain user
passwords in the repository. In order to enable LDAP, you must configure the
IAS.properties and ldaprealm.properties files. For more information on
configuring LDAP authentication, refer to the Informatica Data Analyzer
Administrator Guide.



TIP
After installing Data Analyzer on the JBoss application server, set the minimum
pool size to 0 in the file <JBOSS_HOME>/server/informatica/deploy/hsqldb-ds.xml. This ensures that the managed connections in JBOSS will be configured
properly. Without this setting it is possible that email alert messages will not be
sent properly.
TIP: Repository Preparation
Before you install Data Analyzer, be sure to clear the database transaction log
for the repository database. If the transaction log is full or runs out of space when
the Data Analyzer installation program creates the Data Analyzer repository, the
installation program will fail.

Data Analyzer Client Installation

The Data Analyzer Client is a web-based, thin-client tool that uses Microsoft Internet
Explorer 6 as the client. The proper version of Internet Explorer should be verified on
client machines, ensuring that Internet Explorer 6 is the default web browser, and the
minimum system requirements should be validated.

In order to use the Data Analyzer Client, the client workstation should have at least a
300MHz processor and 128MB of RAM. Please note that these are the minimum
requirements for the Data Analyzer Client, and that if other applications are running on
the client workstation, additional CPU and memory are required. In most situations, users
are likely to be multi-tasking using multiple applications, so this should be taken into
consideration.

Certain interactive features in Data Analyzer require third-party plug-in software to work
correctly. Users must download and install the plug-in software on their workstation
before they can use these features. Data Analyzer uses the following third-party plug-in
software:

● Microsoft SOAP Toolkit - In Data Analyzer, you can export a report to an Excel file and refresh the data in Excel directly from the cached data in Data
Analyzer or from data in the data warehouse through Data Analyzer. To use
the data refresh feature, you must first install the Microsoft SOAP Toolkit. For
information on downloading the Microsoft SOAP Toolkit, see “Working with
Reports” in the Data Analyzer User Guide.
● Adobe SVG Viewer - In Data Analyzer, you can display interactive report
charts and chart indicators. You can click on an interactive chart to drill into the
report data and view details and select sections of the chart. To view
interactive charts, you must install Adobe SVG Viewer. For more information on downloading Adobe SVG Viewer, see “Managing Account Information” in
the Data Analyzer User Guide.

Lastly, for Data Analyzer to display its application windows correctly, Informatica
recommends disabling any pop-up blocking utility on your browser. If a pop-up blocker
is running while you are working with Data Analyzer, the Data Analyzer windows may
not display properly.

Metadata Manager Installation

Metadata Manager software can be installed after the development environment configuration has been completed and approved. The following high-level steps are involved in the Metadata Manager installation process:

Metadata Manager requires a web server and a Java 2 Enterprise Edition (J2EE)-
compliant application server. Metadata Manager works with BEA WebLogic Server,
IBM WebSphere Application Server, and JBoss Application Server. If you choose to
use BEA WebLogic or IBM WebSphere, they must be installed prior to the Metadata
Manager installation. The JBoss Application Server can be installed from the Metadata
Manager installation process.

Informatica recommends that a system administrator, who is familiar with application and web servers, LDAP servers, and the J2EE platform, install the required software.
For complete information on the Metadata Manager installation process, refer to the
PowerCenter Installation Guide.

1. Install BEA WebLogic Server or IBM WebSphere Application Server on the machine where you plan to install Metadata Manager. You must install the
application server and other required software before you install Metadata
Manager.
2. You can install Metadata Manager on a machine with a Windows or UNIX
operating system.

Metadata Manager includes the following installation components:

● Metadata Manager
● Limited edition of PowerCenter
● Metadata Manager documentation in PDF format
● Metadata Manager and Data Analyzer integrated online help
● Configuration Console online help



Be sure to refer to the Metadata Manager Release Notes for information regarding the
supported versions of each application.

To install Metadata Manager for the first time, complete each of the following tasks in the order listed below (a sketch of this ordered checklist appears after the list):

1. Create database user accounts. Create one database user account for the
Metadata Manager Warehouse and Metadata Manager Server repository and
another for the Integration repository.
2. Install the application server. Install BEA WebLogic Server or IBM
WebSphere Application Server.
3. Install PowerCenter 8. Install PowerCenter 8 to manage metadata extract and
load tasks.
4. Install Metadata Manager. When installing Metadata Manager, provide the
connection information for the database user accounts for the Integration
repository and the Metadata Manager Warehouse and Metadata Manager
Server repository. The Metadata Manager installation creates both repositories
and installs other Metadata Manager components, such as the Configuration
Console, documentation, and XConnects.
5. Optionally, run the pre-compile utility (for BEA WebLogic Server and IBM
WebSphere). If you are using BEA WebLogic Server as your application server, pre-compile the JSP scripts to display the Metadata Manager
web pages faster when they are accessed for the first time.
6. Apply the product license. Apply the application server license, as well as the
PowerCenter and Metadata Manager licenses.
7. Configure the PowerCenter Server. Assign the Integration repository to the
PowerCenter Server to enable running of prepackaged XConnect workflows.
The workflow for each XConnect extracts metadata from the metadata source
repository and loads it into the Metadata Manager Warehouse.
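Because the order of these tasks matters, the sketch below models them as an ordered checklist that refuses to mark a step complete before its predecessors are done. The step names simply restate the list above; treating the pre-compile step as optional is the only assumption.

INSTALL_STEPS = [
    "Create database user accounts",
    "Install the application server",
    "Install PowerCenter 8",
    "Install Metadata Manager",
    "Run the pre-compile utility",        # optional
    "Apply the product license",
    "Configure the PowerCenter Server",
]
OPTIONAL = {"Run the pre-compile utility"}

def complete_step(done: list, step: str) -> list:
    """Mark a step done only if every required earlier step is already done."""
    prerequisites = INSTALL_STEPS[:INSTALL_STEPS.index(step)]
    outstanding = [s for s in prerequisites if s not in done and s not in OPTIONAL]
    if outstanding:
        raise RuntimeError(f"Finish these first: {outstanding}")
    return done + [step]

done = []
done = complete_step(done, "Create database user accounts")
done = complete_step(done, "Install the application server")
# complete_step(done, "Apply the product license")  # would raise: earlier steps missing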

Note: For more information about installing Metadata Manager, see “Installing
Metadata Manager” chapter of the PowerCenter Installation Guide.

After the software has been installed and tested, the Metadata Manager Administrator
can begin creating security groups, users, and the repositories. Following are some
of the initial steps for the Metadata Manager Administrator once the Metadata Manager
is installed. For more information on any of these steps, refer to the Metadata Manager
Administration Guide.

1. After completing the Metadata Manager installation, configure XConnects to extract metadata. Configure an XConnect for each source repository, and then load metadata from the source repositories into the Metadata Manager Warehouse.
2. Repository registration / creation in the Metadata Manager. Add each source
repository to Metadata Manager. This action adds the corresponding XConnect
for this repository in the Configuration Console.
3. Set up the Configuration Console. Verify the Integration repository,
PowerCenter Server, and PowerCenter Repository Server connections in the
Configuration Console. Also, specify the PowerCenter source files directory in
the Configuration Console.
4. Set up and run the XConnect for each source repository using the Configuration
Console.
5. To limit the tasks that users can perform and the type of source repository
metadata objects that users can view and modify, set user privileges and object
access permissions.

PowerExchange Installation

Before beginning the installation, take time to read the PowerExchange Installation
Guide as well as the documentation for the specific PowerExchange products you have
licensed and plan to install.

Take time to identify and notify resources you are going to need to complete the
installation. Depending on the specific product, you could need any or all of the
following:

● Database Administrator
● PowerCenter Administrator
● MVS Systems Administrator
● UNIX Systems Administrator
● Security Administrator
● Network Administrator
● Desktop (PC) Support

Installing the PowerExchange Listener on Source Systems

The process for installing PowerExchange on the source system varies greatly
depending on the source system. Take care to read through the installation
documentation prior to attempting the installation. The PowerExchange Installation
Guide has step by step instructions for installing PowerExchange on all supported
platforms.



Installing the PowerExchange Navigator on the PC

The Navigator allows you to create and edit data maps and tables. To install
PowerExchange on the desktop (PC) for the first time, complete each of the following
tasks in the order listed below:

1. Install the PowerExchange Navigator. Administrator access may be required to install the software.
2. Modify the dbmover.cfg file. Depending on your installation, modifications
may not be required. Refer to the PowerExchange Reference Manual for
information on the parameters in dbmover.cfg.

Installing PowerExchange Client for the PowerCenter Server

The PowerExchange client for the PowerCenter server allows PowerCenter to read
data from PowerExchange data sources. The PowerCenter Administrator should
perform the installation with the assistance of a server administrator. It is recommended
that a separate user account be created to run the required processes. A PowerCenter
Administrator needs to register the PowerExchange plug-in with the PowerCenter repository.

Informatica recommends that the installation be performed in one environment and tested from end-to-end (from data map creation to running workflows) before
attempting to install the product in other environments.

Best Practices
None

Sample Deliverables
None

Last updated: 15-Feb-07 18:58



Phase 4: Design

4 Design

● 4.1 Develop Data Model(s)
❍ 4.1.1 Develop Enterprise Data Warehouse Model
❍ 4.1.2 Develop Data Mart Model(s)
● 4.2 Analyze Data Sources
❍ 4.2.1 Develop Source to Target Relationships
❍ 4.2.2 Determine Source Availability
● 4.3 Design Physical Database
❍ 4.3.1 Develop Physical Database Design
● 4.4 Design Presentation Layer
❍ 4.4.1 Design Presentation Layer Prototype
❍ 4.4.2 Present Prototype to Business Analysts
❍ 4.4.3 Develop Presentation Layout Design



Phase 4: Design

Description

The Design Phase lays the foundation for the upcoming Build Phase. In the Design Phase, all data models are
developed, source systems are analyzed and physical databases are designed. The
presentation layer is designed and a prototype constructed. Each task, if done
thoroughly, enables the data integration solution to perform properly and provides an
infrastructure that allows for growth and change.

Each task in the Design Phase provides the functional architecture for the
development process using PowerCenter. The design of the target data store may include data warehouses and data marts, star schemas, web services, message queues, or custom databases to drive specific applications or effect a data migration. The Design
Phase requires that several preparatory tasks are completed before beginning the
development work of building and testing mappings, sessions, and workflows within
PowerCenter.

Prerequisites

3 Architect

Roles

Application Specialist (Primary)

Business Analyst (Primary)

Data Architect (Primary)

Data Integration Developer (Primary)

Data Quality Developer (Primary)

Database Administrator (DBA) (Primary)



Presentation Layer Developer (Primary)

System Administrator (Primary)

Technical Project Manager (Review Only)

Considerations

None

Best Practices
None

Sample Deliverables
None

Last updated: 01-Feb-07 18:45



Phase 4: Design
Task 4.1 Develop Data Model(s)

Description

A data integration/business intelligence project requires logical data models in order to begin the process of designing the target database structures that are going to support
the solution architecture. The logical data model will, in turn, lead to the initial physical
database design that will support the business requirements and be populated through
data integration logic. While this task and its subtasks focus on the data models for
Enterprise Data Warehouses and Enterprise Data Marts, many types of data
integration projects do not involve a Data Warehouse.

Data migration or synchronization projects typically have existing transactional databases as sources and targets; in these cases, the data models may be reverse
engineered directly from these databases. The same may be true of data consolidation
projects if the target is the same structure as an existing operational database.
Operational data integration projects, including data consolidation into new data structures, require a data model design, but typically one that is dictated by the functional processes. Regardless of the architecture chosen for the data integration solution, the data models for the target databases or data structures need to be developed in a logical and consistent fashion prior to development.

Depending on the structure and approach to data storage supporting the data
integration solution, the data architecture may include an Enterprise Data Warehouse
(EDW) and one or more data marts. In addition, many implementations also include an
Operational Data Store (ODS), which may also be referred to as a dynamic data store
(DDS) or staging area. Each of these data stores may exist independently of the
others, and may reside on completely different database management systems
(DBMSs) and hardware platforms. In any case, each of the database schemas
comprising the overall solution will require a corresponding logical model.

An ODS may be needed when there are operational or reporting uses for the
consolidated detail data or to provide a staging area, for example, when there is a short
time span to pull data from the source systems. It can act as a buffer between the EDW
and the source applications. The data model for the ODS is typically in third-normal
form and may be a virtual duplicate of the source systems' models. The ODS typically
receives the data after some cleansing and integration, but with little or no
summarization from the source systems; the ODS can then become the source for the
EDW.



Major business intelligence projects require an EDW to house the data imported from
many different source systems. The EDW represents an integrated, subject-oriented
view of the corporate data comprised of relevant source system data. It is typically
slightly summarized so that its information is relevant to management, as opposed to
providing all the transaction details. In addition to its numerical details (i.e., atomic-level
facts), it typically has derived calculations and subtotals. The EDW is not
generally intended for direct access by end users for reporting purposes—for that we
have the “data marts”. The EDW typically has a somewhat de-normalized structure to
support reporting and analysis, as opposed to business transactions. Depending on
size and usage, a variant of a star schema may be used.

Data marts (DMs) are effectively subsets of the EDW. Data marts are fed directly from
the enterprise data warehouse, ensuring synchronization of business rules and
snapshot times. The logical design structures are typically dimensional star or
snowflake schemas. The structures of the data marts are driven by the requirements of
particular business users and reporting tools. There may be additions and reductions to
the logical data mart design depending on the requirements for the particular data mart.
Historical data capture requirements may differ from those on the enterprise data
warehouse. A subject-oriented data mart may be able to provide for more historical
analysis, or alternatively may require none. Detailed requirements drive content, which
in turn drives the logical design that becomes the foundation of the physical database
design.

Two generic assumptions about business users also affect data mart design:

● Business users prefer systems they easily understand.


● Business users prefer systems that deliver results quickly.

These assumptions encourage the use of star and snowflake schemas in the solution
design. These types of schemas represent business activities as a series of discrete,
time-stamped events (or facts) with business-oriented names, such as orders or
shipments. These facts contain foreign key "pointers" to one or more dimensions that
place the fact into a business context, such as the fiscal quarter in which the shipment
occurred, or the sales region responsible for the order. The use of business
terminology throughout the star or snowflake schema is much more meaningful to the
end user than the typical normalized, technology-centric data model.
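For illustration, the sketch below (in Python, with purely hypothetical table and column names) shows how a time-stamped fact row carries foreign-key pointers to business-named dimensions, and how a business question such as "shipments by region for a fiscal quarter" resolves through those pointers.

dim_date = {1: {"date": "2007-02-15", "fiscal_quarter": "FY07-Q1"}}
dim_region = {10: {"region_name": "EMEA"}}

fact_shipments = [
    # each fact is a discrete, time-stamped business event
    {"date_key": 1, "region_key": 10, "shipment_qty": 250, "shipment_value": 12500.00},
]

# The business question becomes a simple join from the fact to its dimensions
# via the foreign-key pointers.
for row in fact_shipments:
    quarter = dim_date[row["date_key"]]["fiscal_quarter"]
    region = dim_region[row["region_key"]]["region_name"]
    print(quarter, region, row["shipment_qty"], row["shipment_value"])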

During the modeling phase of a data integration project, it is important to consider all
possible methods of obtaining a data model. Analyzing the cost benefits of build vs. buy
may well reveal that it is more economical to buy a pre-built subject area model than to
invest the time and money in building your own.

Prerequisites
None

Roles

Business Analyst (Secondary)

Data Architect (Primary)

Data Quality Developer (Primary)

Technical Project Manager (Review Only)

Considerations

Requirements

The question that should be asked before modeling begins is:

Are the requirements sufficiently defined in at least one subject area that the
data modeling tasks can begin?

If the data modeling requires too much guesswork, at best, time will be wasted or, at
worst, the Data Architect will design models that fail to support the business
requirements.

This question is particularly critical for designing logical data warehouse and data mart
schemas. The EDW logical model is largely dependent on source system structures.

Conventions for Names and Data Types

Some internal standards for names and data types need to be set at the beginning of
the modeling process, and it is essential that project team members adhere to
whatever conventions are chosen; deviating from them defeats the purpose of having
conventions at all. Conventions should be chosen for the prefix and suffix names of
certain types of fields. For example, numeric surrogate keys in the data warehouse
might use either seq or id as a suffix to easily identify the
type of field to the developers. (See Naming Conventions for additional information.)

Data modeling tools refer to common data types as domains. Domains can also be
hierarchical: for example, address may be defined as a string data type, with
residential and business addresses as children of address. Establishing these data types at the
beginning of the model development process is beneficial for consistency and
timeliness in implementing the subsequent physical database design.
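As a minimal sketch of how such conventions can be enforced, the Python fragment below assumes the team has agreed that numeric surrogate keys end in either _seq or _id; the column names are purely illustrative.

SURROGATE_KEY_SUFFIXES = ("_seq", "_id")

def check_surrogate_key_names(columns):
    # Return the columns that do not follow the agreed suffix convention.
    return [c for c in columns if not c.lower().endswith(SURROGATE_KEY_SUFFIXES)]

proposed_columns = ["customer_id", "order_seq", "productkey"]  # hypothetical model columns
print(check_surrogate_key_names(proposed_columns))  # ['productkey'] breaks the convention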

Metadata

A logical data model produces a significant amount of metadata and is likely to be a
major focal point for metadata during the project.

Metadata integration is a major up-front consideration if metadata is to be managed
consistently and competently throughout the project. As metadata has to be delivered
to numerous applications used in various stages of a data integration project, an
integrated approach to metadata management is required. Informatica’s Metadata
Services products can be used to deliver metadata from other application repositories
to the PowerCenter Repository and from PowerCenter to various business intelligence
(BI) tools.

Logical data models can be delivered to PowerCenter ready for data integration
development. Additionally, metadata originating from these models can be delivered to
end users through business intelligence tools. Many business intelligence vendors
have tools that can access the PowerCenter Repository through the Metadata
Services and Metadata Manager Benefits architectures.

Maintaining the Data Models

Data models are valuable documentation, both to the project and the business users.
They should be stored in a repository in order to take advantage of PowerCenter's
integrated metadata approach. Additionally, they should be regularly backed-up to file
after major changes. Versioning should take place regularly within the repository so
that it is possible to roll back several versions of a data model, if necessary. Once the
backbone of a data model is in place, a change control procedure should be
implemented to monitor any changes requested and record implementation of those
changes. Adhering to rigorous change control procedures will help to ensure that all
impacts of a change are recognized prior to their implementation.

To facilitate metadata analysis and to keep your documentation up-to-date, you may
want to consider the metadata reporting capabilities in Metadata Manager to provide
automatically updated lineage and impact analysis.

TIP
To link logical model design to the requirements specifications, use either of
these methods:

● Option 1: Allocate one of the many entity or attribute description fields that
data modeling tools provide to be the link between the elements of
the logical design and the requirements documentation. Then,
establish (and adhere to) a naming convention for the population of
this field to identify the requirements that are met by the presence of a
particular entity or attribute.
● Option 2: Record the name of the entity and associated attribute in a
spreadsheet or database with the requirements that they support.

Both options 1 and 2 allow for metadata integration. Option 1 is generally
preferable because the links can be imported into the PowerCenter Repository
through Metadata Exchange. A minimal illustration of Option 2 follows below.
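The sketch below assumes requirements are tracked under identifiers such as REQ-012; all entity, attribute and file names are hypothetical. It simply writes the traceability records to a spreadsheet-friendly CSV file.

import csv

traceability = [
    {"entity": "CUSTOMER", "attribute": "customer_id", "requirement": "REQ-012"},
    {"entity": "ORDER",    "attribute": "order_date",  "requirement": "REQ-031"},
]

with open("model_to_requirements.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["entity", "attribute", "requirement"])
    writer.writeheader()
    writer.writerows(traceability)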

Best Practices
None

Sample Deliverables
None

Last updated: 15-Feb-07 19:01

Phase 4: Design
Subtask 4.1.1 Develop Enterprise Data Warehouse Model

Description

If the aim of the data integration project is to produce an Enterprise Data Warehouse
(EDW), then the logical EDW model should encompass all of the sources that feed the
warehouse. This model will be a slightly de-normalized structure to replicate source
data from operational systems; it should be neither a full star, nor snowflake schema,
nor a highly normalized structure of the source systems. Some of the source structures
are redesigned in the model to migrate non-relational sources to relational structures.
In some cases, it may be appropriate to provide limited consolidation where common
fields are present in various incoming data sources. In summary, the developed EDW
logical model should be the sum of all the parts but should exclude detailed attribute
information.

Prerequisites
None

Roles

Business Analyst (Secondary)

Data Architect (Primary)

Technical Project Manager (Review Only)

Considerations

Analyzing Sources

Designing an Enterprise Data Warehouse (EDW) is particularly difficult because it is an
accumulation of multiple sources. The Data Architect needs to identify and replicate all
of the relevant source structures in the EDW data model. The PowerCenter Designer
client includes the Source Analyzer and Warehouse Designer tools, which can be
useful for this task. These tools can be used to analyze sources, convert them into
target structures and then expand them into universal tables. Alternatively, dedicated
modeling tools can be used.

In PowerCenter Designer, incoming non-relational structures can be normalized by use
of the Normalizer transformation object. Normalized targets defined using PowerCenter
can then be created in a database and reverse-engineered into the data model, if
desired.
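Conceptually (and independently of the Normalizer transformation itself), normalizing a record that carries a repeating group means producing one relational row per occurrence, as in the hedged Python sketch below; the field names are hypothetical.

source_record = {
    "account_no": "A-100",
    "quarterly_balances": [1500.00, 1725.50, 1610.25, 1890.00],  # repeating group
}

# one normalized row per occurrence of the repeating group
normalized_rows = [
    {"account_no": source_record["account_no"], "quarter": i + 1, "balance": bal}
    for i, bal in enumerate(source_record["quarterly_balances"])
]

for row in normalized_rows:
    print(row)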

Universal Tables

Universal tables provide some consolidation and commonality among sources. For
example, different systems may use different codes for the gender of a customer. A
universal table brings together the fields that cover the same business subject or
business rule. Universal tables are also intended to be the sum of all parts. For
example, a customer table in one source system may have only standard contact
details while a second system may supply fields for mobile phones and email
addresses, but not include a field for a fax number. A universal table should hold all of
the contact fields from both systems (i.e., standard contact details plus fields for fax,
mobile phones and email). Additionally, universal tables should ensure syntactic
consistency such that fields from different source tables represent the same data items
and possess the same data types.
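As a hedged illustration, the Python sketch below builds universal customer rows that carry the superset of contact fields from two sources and map per-source gender codes to a single convention; all field names and code values are hypothetical.

GENDER_MAP = {"M": "M", "F": "F", "1": "M", "2": "F"}  # per-source codes mapped to a common code

UNIVERSAL_FIELDS = ["customer_name", "phone", "fax", "mobile", "email", "gender"]

def to_universal(source_row):
    # keep the superset of fields; anything a source does not supply stays empty
    row = {field: source_row.get(field) for field in UNIVERSAL_FIELDS}
    row["gender"] = GENDER_MAP.get(str(source_row.get("gender", "")), None)
    return row

system_a = {"customer_name": "Acme Ltd", "phone": "555-0100", "fax": "555-0101", "gender": "1"}
system_b = {"customer_name": "Acme Ltd", "mobile": "555-0199", "email": "info@acme.example", "gender": "M"}

print(to_universal(system_a))
print(to_universal(system_b))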

Relationship Modeling

Logical modeling tools allow different types of relationships to be identified among
various entities and attributes. There are two types of relationships: identifying and
non-identifying.

● An identifying relationship is one in which the child entity relies on the
parent for its full identity. For example, in a bank, an account must have an
account type for it to be fully understood. The relationships are reflected in the
physical design that a modeling tool produces from the logical design. The tool
attempts to enforce identifying relationships through database constraints.
● Non-identifying relationships are those in which the child does not depend on
the parent for its identity. A data modeling tool does not enforce non-identifying
relationships through constraints when the logical model is used to generate a
physical database.

Many-to-many relationships, one-to-one relationships, and many-to-one relationships
can all be defined in logical models. The modeling tools hide the underlying
complexities and show those objects as part of a physical database design. In addition,
the modeling tools automatically create the lookup tables if the tool is used to generate
the database schema.
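The contrast typically surfaces in the generated physical design roughly as sketched below (generic DDL held in Python strings; the table and column names are hypothetical, and the exact syntax varies by RDBMS).

identifying = """
CREATE TABLE account (
    account_type_code  VARCHAR(10) NOT NULL,   -- parent key forms part of the child key
    account_no         VARCHAR(20) NOT NULL,
    PRIMARY KEY (account_type_code, account_no),
    FOREIGN KEY (account_type_code) REFERENCES account_type (account_type_code)
)"""

non_identifying = """
CREATE TABLE customer (
    customer_id   INTEGER NOT NULL PRIMARY KEY,
    sales_rep_id  INTEGER NULL   -- parent key is an ordinary, optional attribute; no constraint generated
)"""

print(identifying)
print(non_identifying)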

Historical Considerations

Business requirements and refresh schedules should determine the amount and type
of history that an EDW should hold. The logical history maintenance architecture
should be common to all tables within the EDW.

Capturing historical data usually involves taking snapshots of the database on a regular
basis and adding the data to the existing content with time stamps. Alternatively,
individual updates can be recorded and the previously current records can be time-
period stamped or versioned. It is also necessary to decide how far back the history
should go.
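A minimal sketch of the time-period stamping approach, assuming a simple effective-date scheme and hypothetical column names: when an attribute changes, the previously current row is end-dated and a new row is inserted.

from datetime import date

history = [
    {"customer_id": 42, "segment": "Retail", "valid_from": date(2006, 1, 1), "valid_to": None},
]

def apply_change(history, customer_id, new_segment, change_date):
    for row in history:
        if row["customer_id"] == customer_id and row["valid_to"] is None:
            if row["segment"] == new_segment:
                return                      # nothing changed; keep the current row
            row["valid_to"] = change_date   # close the previously current row
    history.append({"customer_id": customer_id, "segment": new_segment,
                    "valid_from": change_date, "valid_to": None})

apply_change(history, 42, "Corporate", date(2007, 2, 15))
for row in history:
    print(row)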

Data Quality

Data can be verified for validity and accuracy as it comes into the EDW. The EDW can
reasonably be expected to answer such questions as:

● Is the post code or currency code valid?


● Has a valid date been entered (e.g., does it meet the minimum age requirement
for a driver's license)?
● Does the data conform to standard formatting rules?

Additionally, data values can be evaluated against expected ranges. For example,
dates of birth should be in a reasonable range (not after the current date, and not
before 1st Jan 1900.) Values can also be validated against reference datasets. As well
as using industry-standard references (e.g., ISO Currency Codes, ISO Units of
Measure), it may be necessary to obtain or generate new reference data to perform all
relevant data quality checks.
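The Python sketch below illustrates the kinds of checks described above; the reference set, date bounds and field names are hypothetical stand-ins for real reference data.

from datetime import date

VALID_CURRENCY_CODES = {"USD", "EUR", "GBP"}   # stand-in for an ISO reference set
EARLIEST_BIRTH_DATE = date(1900, 1, 1)

def check_record(record, today=None):
    today = today or date.today()
    issues = []
    if record.get("currency_code") not in VALID_CURRENCY_CODES:
        issues.append("invalid currency code")
    dob = record.get("date_of_birth")
    if dob is None or dob < EARLIEST_BIRTH_DATE or dob > today:
        issues.append("date of birth out of expected range")
    return issues

print(check_record({"currency_code": "EUR", "date_of_birth": date(1985, 6, 3)}))   # []
print(check_record({"currency_code": "XXX", "date_of_birth": date(1890, 1, 1)}))   # two issues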

The Data Architect should focus on the common factors in the business requirements
as early as possible. Variations and dimensions specific to certain parts of the
organization can be dealt with later in the design. More importantly, focusing on the
commonalities early in the process also allows other tasks in the project cycle to
proceed earlier. A project to develop an integrated solution architecture is likely to
encounter such common business dimensions as organizational hierarchy, regional
definitions, a number of calendars, and product dimensions, among others.

Subject areas also incorporate metrics. Metrics are measures that businesses use to
quantify their performance. Performance measures include productivity, efficiency,
client satisfaction, turnover, profit and gross margin. Common business rules determine
the formulae for the calculation of the metrics.
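As a generic illustration (not a rule taken from this methodology), a shared metric such as gross margin reduces to a single agreed formula that every part of the organization calculates the same way:

def gross_margin_pct(revenue, cost_of_goods_sold):
    # gross margin expressed as a percentage of revenue
    return (revenue - cost_of_goods_sold) / revenue * 100.0

print(round(gross_margin_pct(125000.0, 80000.0), 1))  # 36.0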

The Data Architect may determine at this point that subject areas thought to be
common are, in fact, not common across the entire organization. Various departments
may use different rules to calculate their profit, commission payments, and customer
values. These facts need to be identified and labeled in the logical model according to
the part of the organization using the differing methods. There are two reasons for this:

● Common semantics enable business users to know if they are using the same
organizational terminology as their colleagues.
● Commonality ensures continuity between the measures a business currently
takes from an operational system and the new ones that will be available in the
data integration solution.

When the combination of dimensions and hierarchies is understood, they can be
modeled. The Data Architect can use a star or snowflake structure to denormalize the
data structures.

Trade-offs between ease of maintenance and minimal disk storage on one side, and
speed and usability on the other, determine whether a simple star or a snowflake
structure is preferable. One or two central tables should hold the facts. Variations in facts can be
included in these tables along with common organizational facts. Variations in
dimension may require additional dimensional tables.

Tip

Determining Levels of Aggregation

In the EDW, there may be limited value in holding multiple levels of


aggregation.

● If the data warehouse is feeding dependent data marts, it may be


better to aggregate using the PowerCenter server to load the
appropriate aggregate data to the data mart.
● If specific levels are required, they should be modeled in the fact tables
at the center of the star schema.

Syndicated Data Sets

Syndicated data sets, such as weather records, should be held in the data warehouse.
These external dimensions will then be available as a subset of the data warehouse. It
should be assumed that the data set will be updated periodically and that the history
will be kept for reference unless the business determines it is not necessary. If the
historical data is needed, the syndicated data sets will need to be date-stamped.

Code Lookup Tables

A single code lookup table in a data warehouse does not provide the same benefits as
a single code lookup table on an OLTP system. The function of a single code lookup
table is to provide central maintenance of codes and descriptions. This benefit cannot
be achieved when populating a data warehouse, since data warehouses are potentially
loaded from more than one source, several times.

Having a single database structure is likely to complicate matters in the future. A single
code lookup table implies the use of a single surrogate key. If problems occur in the
load, they affect all code lookups - not just one. Separate codes would have to be
loaded from their various sources and checked for existing records and updates.

A single lookup table simply increases the amount of work mapping developers need to
carry out to qualify the parts of the table they are concerned with for a particular
mapping. Individual lookup tables remove the single point of failure for code lookups
and improve development time for mappings; however, they also involve more work for
the Data Architect. The Data Architect may prefer to show a single object for codes on
the diagrams. He/she should, however, ensure that regardless of how the code tables
are modeled, they will be physically separable when the physical database
implementation takes place.

Surrogate Keys

The use of surrogate keys in most dimensional models presents an additional obstacle
that must be overcome in the solution design. It is important to determine a strategy to
create, distribute, and maintain these keys as you plan your design. Any of the
following strategies may be appropriate:

● Informatica Generated Keys. The sequence generator transformation allows


the creation of surrogate keys natively in Informatica mappings. There are
options for reusability, setting key-ranges, and continuous numbering between
loads. The limitation to this strategy is that it cannot generate a number higher
than 2^32. However, two billion is generally big enough for most dimensions.
● External Table Based. PowerCenter can access an external code table during
loads using the look-up transformation to obtain surrogate keys.
● External Code Generated. Informatica can access a stored procedure or
external .dll that contains a programmatic solution to generate surrogate keys.
This is done using either the stored procedure transformation or the external
procedure transformation.
● Triggers/Database Sequence. Create a trigger on the target table, either by
calling it from the source qualifier transformation or the stored procedure
transformation, to perform the insert into the key field.
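Purely as an illustration of the external-table style of strategy (kept outside any particular tool, with hypothetical system and key names), the sketch below exchanges natural keys from the sources for warehouse surrogate keys during a load, reusing a key once it has been assigned.

import itertools

key_map = {}                       # (source system, natural key) -> surrogate key
next_key = itertools.count(1)      # stand-in for a sequence or key generator

def surrogate_key_for(source_system, natural_key):
    composite = (source_system, natural_key)
    if composite not in key_map:
        key_map[composite] = next(next_key)
    return key_map[composite]

print(surrogate_key_for("CRM", "CUST-0099"))    # 1
print(surrogate_key_for("ERP", "000123"))       # 2
print(surrogate_key_for("CRM", "CUST-0099"))    # 1 again: the key is reused, not regenerated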

Best Practices
None

Sample Deliverables
None

Last updated: 15-Feb-07 19:03

Phase 4: Design
Subtask 4.1.2 Develop Data Mart Model(s)

Description

The data mart's logical data model supports the final step in the integrated enterprise
decision support architecture. These models should be easily identified with their
source in the data warehouse and will provide the foundation for the physical design. In
most modeling tools, the logical model can be used to automatically resolve and
generate some of the physical design, such as lookups used to resolve many-to-many
relationships.

If the data integration project was initiated for the right reasons, the aim of the data
mart is to solve a specific business issue for its business sponsors. As a subset of the
data warehouse, one data mart may focus on the business customers while another
may focus on residential services. The logical design must incorporate transformations
supplying appropriate metrics and levels of aggregation for the business users. The
metrics and aggregations must incorporate the dimensions that the data mart business
users can use to study their metrics. The structure of the dimensions must be
sufficiently simple to enable those users to quickly produce their own reports, if desired.

Prerequisites
None

Roles

Business Analyst (Secondary)

Data Architect (Primary)

Technical Project Manager (Review Only)

Considerations

The subject area of the data mart should be the first consideration because it
determines the facts that must be drawn from the Enterprise Data Warehouse into the
business-oriented data mart. The data mart will then have dimensions that the business
wants to model the facts against. The data mart may also drive an application. If so, the
application has certain requirements that must also be considered. If any additional
metrics are required, they should be placed in the data warehouse, but the need should
not arise if sufficient analysis was completed in earlier development steps.

Tip
Keep it Simple!

If, as is generally the case, the data mart is going to be used primarily as a
presentation layer by business users extracting data for analytic purposes, the mart
should use as simple a design as possible.

Best Practices
None

Sample Deliverables
None

Last updated: 01-Feb-07 18:46

Phase 4: Design
Task 4.2 Analyze Data Sources

Description

The goal of this task is to understand the various data sources that will be feeding the
solution. Completing this task successfully increases the understanding needed to
efficiently map data using PowerCenter. It is important to understand all of the data
elements from a business perspective, including the data values and dependencies on
other data elements. It is also important to understand where the data comes from, how
the data is related, and how much data there is to deal with (i.e., volume estimates).

Prerequisites
None

Roles

Application Specialist (Primary)

Business Analyst (Primary)

Data Architect (Primary)

Data Integration Developer (Primary)

Database Administrator (DBA) (Secondary)

System Administrator (Primary)

Technical Project Manager (Review Only)

Considerations

None

Best Practices

Using Data Explorer for Data Discovery and Analysis

Sample Deliverables
None

Last updated: 01-Feb-07 18:46

Phase 4: Design
Subtask 4.2.1 Develop Source to Target Relationships

Description

The third step in analyzing data sources is to determine the relationship between the
sources and targets and to identify any rework or target redesign that may be required
if specific data elements are not available. This step defines the relationships between
the data elements and clearly illuminates possible data issues, such as incompatible
data types or unavailable data elements.

Prerequisites
None

Roles

Application Specialist (Secondary)

Business Analyst (Primary)

Data Architect (Primary)

Data Integration Developer (Primary)

Technical Project Manager (Review Only)

Considerations

Creating the relationships between the sources and targets is a critical task in the
design process. It is important to map all of the data elements from the source data to
an appropriate counterpart in the target schema. Taking the necessary care in this
effort should result in the following:

● Identification of any data elements in the target schema that are not currently
available from the source. The first step determines what data is not currently
available from the source. When the source data is not available, the Data
Architect may need to re-evaluate and redesign the target schema or
determine where the necessary data can be acquired.
● Identification of any data elements that can be removed from source records
because they are not needed in the target. This step eliminates any data
elements that are not required in the target. In many cases, unnecessary data
is moved through the extraction process. Regardless of whether the data is
coming from flat files or relational sources, it is best to eliminate as much
unnecessary data as possible, as early in the process as possible.
● Determination of the data flow required for moving the data from the source to
the target. This can serve as a preliminary design specification for work to be
performed during the Build Phase . Any data modifications or translations
should be noted during this determination process as the source-to-target
relationships are established.
● Determination of the quality of the data in the source. This ensures that data in
the target is of high quality and serves its purpose. All source data should be
analyzed in a data quality application to assess its current data quality levels.
During the Design Phase , data quality processes can be introduced to fix
identified issues and/or enrich data using reference information. Data quality
should also be incorporated as an on-going process to be leveraged by the
target data source.

The next step in this subtask produces a Target-Source Matrix, which provides a
framework for matching the business requirements to the essential data elements and
defining how the source and target elements are paired. The matrix lists each of the
target tables from the data mart in the rows of the matrix and lists descriptions of the
source systems in the columns, to provide the following data:

● Operational (transactional) system in the organization


● Operational data store
● External data provider
● Operating system
● DBMS
● Data fields
● Data descriptions
● Data profiling/analysis results
● Data quality operations, where applicable

One objective of the data integration solution is to provide an integrated view of key
business data. Therefore, for each target table one or more source systems must exist.
The matrix should show all of the possible sources for this particular initiative. After this
matrix is completed, the data elements must be checked for correctness and validated
with both the Business Analyst(s) and the user community. The Project Manager is
responsible for ensuring that these parties agree that the data relationships defined in
the Target-Source Matrix are correct and meet the needs of the data integration
solution. Prior to any mapping development work, the Project Manager should obtain
sign-off from the Business Analysts and user community.
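In miniature, and with entirely hypothetical system and table names, a Target-Source Matrix can be thought of as a structure in which each target table is related to its candidate sources:

target_source_matrix = {
    "DIM_CUSTOMER": {"CRM (Oracle)": "customers", "Billing (DB2)": "cust_master"},
    "FACT_ORDERS":  {"Order Entry (flat file)": "orders_extract.dat"},
}

for target, sources in target_source_matrix.items():
    for source_system, source_object in sources.items():
        print(f"{target:<14} <- {source_system}: {source_object}")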

Undefined Data

In some cases the Data Architect cannot locate or access the data required to establish
a rule defined by the Business Analyst. When this occurs, the Business Analyst may
need to revalidate the particular rule or requirement to ensure that it meets the end-
users' needs. If it does not, the Business Analyst and Data Architect must determine if
there is another way to use the available data elements to enforce the rule. Enlisting
the services of the System Administrator or another knowledgeable source system
resource may be helpful. If no solution is found, or if the data meets requirements but
is not available, the Project Manager should communicate with the end-user community
and propose an alternative business rule.

Choosing to eliminate data too early in the process due to inaccessibility, however, may
cause problems further down the road. The Project Manager should meet with the
Business Analyst and the Data Architect to determine what rules or requirements can
be changed and which must remain as originally defined. The Data Architect can
propose data elements that can be safely dropped or changed without compromising
the integrity of the user requirements. The Project Manager must then identify any risks
inherent in eliminating or changing the data elements and decide which are acceptable
to the project.

Some of the potential risks involved in eliminating or changing data elements are:

● Losing a critical piece of data required for a business rule that was not
originally defined but is likely to be needed in the future. Such data loss may
require a substantial amount of rework and can potentially affect project
timelines.
● Any change in data that needs to be incorporated in the Source or Target data
models requires substantial time to rework and could significantly delay
development. Such a change would also push back all tasks defined and
require a change in the Project Plan.
● Changes in the Source system model may drop secondary relationships that
were not initially visible.

Source Changes after Initial Assessment

When a source changes after the initial assessment, the corresponding Target-Source
Matrix must also change. The Data Architect needs to outline everything that has
changed, including the data types, names, and definitions. Then, the various risks
involved in changing or eliminating data elements must be re-evaluated. The Data
Architect should also decide which risks are acceptable. Once again, the System
Administrator may provide useful information about the reasons for any changes to the
source system and their effect on data relationships.

Best Practices
None

Sample Deliverables
None

Last updated: 15-Feb-07 19:05

Phase 4: Design
Subtask 4.2.2 Determine Source Availability

Description

The final step in the 4.2 Analyze Data Sources task is to determine when all source
systems are likely to be available for data extraction. This is necessary in order to
determine realistic start and end times for the load window. The developers need to
work closely with the source system administrators during this step because the
administrators can provide specific information about the hours of operations for their
systems.

The final deliverable in this subtask, the Source Availability Matrix, lists all the sources
that are being used for data extraction and specifies the systems' downtimes during a
24-hour period. This matrix should contain details of the availability of the systems on
different days of the week, including weekends and holidays.
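As a hedged sketch with hypothetical availability windows (expressed as start and end hours on a shared 24-hour clock), the matrix can also be used to compute whether a common extraction window exists across all sources:

availability = {
    "Order Entry": {"weekday": (22, 24), "weekend": (0, 24)},   # (start_hour, end_hour)
    "Billing":     {"weekday": (23, 24), "weekend": (6, 20)},
}

def common_window(day_type):
    starts = [windows[day_type][0] for windows in availability.values()]
    ends = [windows[day_type][1] for windows in availability.values()]
    start, end = max(starts), min(ends)
    return (start, end) if start < end else None

print(common_window("weekday"))   # (23, 24): one shared hour on weekdays
print(common_window("weekend"))   # (6, 20)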

Prerequisites
None

Roles

Application Specialist (Primary)

Data Integration Developer (Secondary)

Database Administrator (DBA) (Secondary)

System Administrator (Primary)

Technical Project Manager (Review Only)

Considerations

The information generated in this step will be crucial later in the development process
for determining load windows and availability of source data. In many multi-national
companies, source systems are distributed globally and, therefore, may not be available
for extraction concurrently. This can pose problems when trying to extract data with
minimal (or no) disruption of users' day-to-day activities. Determining the source
availability can go a long way in determining when the load window for a regularly
scheduled extraction can run.

This information is also helpful for determining whether an Operational Data Store
(ODS) is needed. Sometimes, the extraction times can be so varied among necessary
source systems that an ODS or staging area is required purely for logistical reasons.

Best Practices
None

Sample Deliverables

Source Availability Matrix

Target-Source Matrix

Last updated: 01-Feb-07 18:46

Phase 4: Design
Task 4.3 Design Physical Database

Description

The physical database design is derived from the logical models created in Task 4.1.
Where the logical design details the relationships between logical entities in the
system, the physical design considers the following physical aspects of the database:

● How the tables are arranged, stored (i.e., on which devices), partitioned, and indexed
● The detailed attributes of all database columns
● The likely growth over the life of the database
● How each schema will be created and archived
● Hardware availability and configuration (e.g., availability of disk storage space, number of devices, and physical location of storage).

The physical design must reflect the end-user reporting requirements, organizing the
data entities to allow a fast response to the expected business queries. Physical target
schemas typically range from fully normalized (essentially OLTP structures) to
snowflake and star schemas, and may contain both detail and aggregate information.

The relevant end-user reporting tools, and the underlying RDBMS, may dictate
following a particular database structure (e.g., multi-dimensional tools may arrange the
data into data "cubes").

Prerequisites
None

Roles

Business Analyst (Secondary)

Data Architect (Primary)

Database Administrator (DBA) (Primary)

System Administrator (Review Only)

Technical Project Manager (Review Only)

Considerations

Although many factors influence the physical design of the data marts, end-user
reporting needs are the primary driver. These needs determine the likely selection
criteria, filters, selection sets and measures that will be used for reporting. These
elements may, in turn, suggest indexing or partitioning policies (i.e., to support the most
frequent cross-references between data objects or tables and identify the most
common table joins) and appropriate access rights, as well as indicate which elements
are likely to grow or change most quickly.

Long-term strategies regarding growth of a data warehouse, enhancements to its
usability and functionality, or additional data marts may all point toward specific design
decisions to support future load and/or reporting requirements. In all cases, the
physical database design is tempered by system-imposed limits such as the available
disk sizes and numbers; the functionality of the operating system or RDBMS; the
human resources available for design and creation of procedures, scripts and DBA
duties; and the volume, frequency and speed of delivery of source data. These factors
all help to determine the best-fit physical structure for the specific project.

A final consideration is how to implement the schema. Database design tools may
generate and execute the necessary processes to create the physical tables, and
the PowerCenter Metadata Exchange can interact with many common tools to pull
target table definitions into the repository. However, automated scripts may still be
necessary for dropping, truncating, and creating tables.

For Data Migration, the tables that are designed and created are normally either stage
tables or reference tables. These tables are generated to simplify the migration
process. The table definitions for the target application are almost always provided to
the data migration team. These are typically delivered with a packaged application or
already exist for the broader project implementation.

Best Practices
None

Sample Deliverables

Physical Data Model Review Agenda

Last updated: 01-Feb-07 18:46

Phase 4: Design
Subtask 4.3.1 Develop Physical Database Design

Description

As with all design tasks, there are both enterprise and workgroup considerations in
developing the physical database design. Optimally, the final design should balance the
following factors:

● Ease of end-user reporting from the target


● Ensuring the maximum throughput and potential for parallel processing
● Effective use of available system resources, disk space and devices
● Minimizing DBA and systems administration overhead
● Effective use of existing tools and procedures

Physical designs are required for target data marts, as well as any ODS/DDS schemas
or other staging tables.

The relevant end-user reporting tools, and the underlying RDBMS, may dictate
following a particular database structure (e.g., multi-dimensional tools may arrange the
data into data "cubes").

Prerequisites
None

Roles

Business Analyst (Secondary)

Data Architect (Primary)

Database Administrator (DBA) (Primary)

System Administrator (Review Only)

Technical Project Manager (Review Only)

Considerations

This task involves a number of major activities:

● Configuring the RDBMS, which involves determining what database systems
are available and identifying their strengths and weaknesses
● Resolving hardware issues such as the size, location, and number of storage
devices, networking links and required interfaces
● Determining distribution and accessibility requirements, such as 24x7 access
and local or global access
● Determining if existing tools are sufficient or, if not, selecting new ones
● Determining back-up, recovery, and maintenance requirements (i.e., will the
physical database design exceed the capabilities of the existing systems or
make upgrades difficult?)

The logical target data models provide the basic structure of the physical design. The
physical design provides a structure that enables the source data to be quickly
extracted and loaded in the transformation process, and allows a fast response to the
end-user queries.

Physical target schemas typically range from:

● Fully normalized (essentially OLTP structures)


● Denormalized relational structures (e.g., as above but with certain entities split
or merged to simplify loading into them, or extracting from them to feed other
databases)
● Classic snowflake and star schemas, ordered as fact and dimension tables in
standard RDBMS systems, optimized for end-user reporting.
● Aggregate versions of the above.
● Proprietary multi-dimensional structures, allowing very fast (but potentially less
flexible and detailed) queries.

The design must also reflect the end-user reporting requirements, organizing the data
entities to provide answers to the expected business queries.

Preferred Strategy

A typical multi-tier strategy uses a mixture of physical structures:

● Operational Data Store (ODS) design. This is usually closely related to the
individual sources, and is, therefore, relationally organized (like the source
OLTP), or simply relational copies of source flat files. Optimized for fast
loading (to allow connection to the source system to be as short as possible)
with few or no indexes or constraints.
● Data Warehouse design. Tied to subject areas, this may be based on a star-
schema (i.e., where significant end-user reporting may occur), or a more
normalized relational structure (where the data warehouse acts purely as a
feeder to several dependent data marts) to speed up extracts to the
subsequent data marts.
● Data Mart design. The usual source for complex business queries, this
typically uses a star or snowflake schema, optimized for set-based reporting
and cross-referenced against many, varied combinations of dimensional
attributes. May use multi-dimensional structures if a specific set of end-user
reporting requirements can be identified.

Tip
The tiers of a multi-tier strategy each have a specific purpose, which strongly
suggests the likely physical structure:

● ODS - Staging from source should be designed to quickly move data from
the operational system. The ODS structure should be very similar to the
source since no transformations are performed, and has few indexes or
constraints (which slow down loading).
● The Data Warehouse design should be biased toward feeding subsequent
data marts, and should be indexed to allow rapid feeds to the marts, along
with a relational structure. At the same time, since the data warehouse
functions as the enterprise-wide central point of reference, physical
partitioning of larger tables allows it to be quickly loaded via parallel
processes. Because data volumes are high, the data warehouse and ODS
structures should be as physically close as possible so as to avoid network
traffic.
● Data Marts should be strongly biased toward reporting, most likely as star-
schemas, or multi-dimensional cubes. The volumes will be smaller than the
parent data warehouse, so the impact of indexes on loading is not as
significant.

RDBMS Configuration

The physical database design is tempered by the functionality of the operating system
and RDBMS. In an ideal world, all RDBMS systems might provide the same set of
functions, level of configuration, and scalability. This is not the case, however; different
vendors include different features in their systems, and new features are included with
each new release. This may affect:

● Physical partitioning. This is not available with all systems. A lack of physical
partitioning may affect performance when loading data into growing tables.
When it is available, partitioning allows faster parallel loading to a single table,
as well as greater flexibility in table reorganizations, backup, and recovery.
● Physical device management. Ideally, using many physical devices to store
individual targets or partitions can speed loading because several tables on a
single device must use the same read-write heads when being updated in
parallel. Of course, using multiple, separate devices may result in added
administrative overhead and/or work for the DBA (i.e., to define additional
pointers and create more complex backup instructions).
● Limits to individual tables. Older systems may not allow tables to physically
grow past a certain size. This may require amending an initial physical design
to split up larger tables.

Tip
Using multiple physical devices to store whole tables allows faster parallel updates to
them.

If target tables are physically partitioned, the separate partitions can be stored on
separate physical devices, allowing a further order of parallel loading. The downside
is that extra initial and ongoing DBA and systems administration overhead is required
to fully manage partition management, although much of this can be automated
using external scripts.

Tools

The relevant end-user reporting tools may dictate following a particular database
structure, at least for the data mart and data warehouse designs.

Although many popular business intelligence tools (e.g. Business Objects,
MicroStrategy and others) can access a wide range of relational and denormalized
structures, each generally works best with a particular type (e.g., long/thin vs. short/fat
star schema designs).

● Multi-dimensional (MOLAP) tools often require specific (i.e., proprietary)


structures to be used. These tools arrange the data logically into data "cubes",
but physically use complex, proprietary systems for storage, indexing, and
organization.
● Database design tools (ErWin, Designer 2000, PowerDesigner) may generate
and execute the necessary processes to create the physical tables, but are
also subject to their own features and functions.

Hardware Issues

Physical designs should be implementable on the existing system (an exercise that can
help to identify weaknesses in the physical infrastructure). The areas to consider are:

● The size of storage available


● The number of physical devices
● The physical location of those devices (e.g., on the same box, on a closely
connected box, via a fast network, or via telephone lines).
● The existing network connections, loading, and peaks in demand.

Distribution and Accessibility

For a large system, the likely demands on the data mart should affect the physical
design. Factors to consider include:

● Will end-users require continuous (i.e., 24x7) access, or will a batch window
be available to load new data? Each involves some issues: continuous access
may require complex partitioning schemes and/or holding multiple copies of
the data, while a batch window would allow indexes/constraints to be dropped
before loading, resulting in significantly decreased load times.
● Will different users require access to the same data, but in different forms (e.
g., different levels of aggregation, or different sub-sets of the data)?
● Will all end-users access the same physical data, or local copies of it (which
need to be distributed in some way)? This issue affects the potential size of
any data mart.

Tip
If the end-users require 24x7 access, and incoming volumes of source data are very
large, it is possible with later releases of major RDBMS tools to load table-space and
index partitions entirely separately, only swapping them into the reporting target at
the end. This is not true for all databases, however, and, if available, needs to be
incorporated into the actual load mechanisms.

Back Up, Recovery And Maintenance

Finally, since unanticipated downtime is likely to affect an organization's ability to plan,


forecast, or even operate effectively, the physical structures must be designed with an
eye on any existing limits to the general data management processes. Because the
physical designs lead to real volumes of data, it is important to determine:

● Will the designs fit into existing back up processes? Will they execute within
the available timeframes and limits?
● Will recovery processes allow end-users to quickly re-gain access to their
reporting system?
● Will the structures be easy to maintain (i.e., to change, reorganize, rebuild, or
upgrade)?

Tip
Indexing frequently-used selection fields/columns can substantially speed up the
response for end-user reporting: when appropriately indexed fields are used in a
request, the database engine optimizes its search pattern rather than simply scanning
all rows of the table. The more indexes that exist on the target, however, the slower
data loads into the target, since maintaining the indexes becomes an additional load
on the database engine.

Where an appropriate batch window is available for performing the data load, the
indexes can be dropped before loading, and then re-generated after the load. If no
window is available, the strategy should be one of balancing the load and reporting
needs by careful selection of which fields to index.
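The sequencing described in the tip might be scripted roughly as below; the SQL is generic, the table and index names are hypothetical, and the exact syntax and feasibility depend on the target RDBMS and the available batch window.

PRE_LOAD = [
    "DROP INDEX idx_fact_sales_date",
    "DROP INDEX idx_fact_sales_region",
]
POST_LOAD = [
    "CREATE INDEX idx_fact_sales_date   ON fact_sales (date_key)",
    "CREATE INDEX idx_fact_sales_region ON fact_sales (region_key)",
]

def run(statements):
    for sql in statements:
        print("executing:", sql)   # stand-in for a real database call

run(PRE_LOAD)
print("-- bulk load of fact_sales runs here, inside the batch window --")
run(POST_LOAD)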

For Data Migration projects, it is rare that any tables will be designed for the source or
target application. If tables are needed they will most likely be staging tables or be used
to assist in transformation. It is common that the staging tables will mirror either the
source system or the target system. It is encouraged to create two levels of staging
where Legacy Stage will mirror the source system and Pre-Load Stage will mirror the
target system. Developers often take advantage of PowerCenter's table generation
functionality in Designer for this purpose: to quickly generate the needed tables and
subsequently reverse-engineer the table definitions with a modeling tool.

Best Practices
None

Sample Deliverables
None

Last updated: 15-Feb-07 19:08

Phase 4: Design
Task 4.4 Design Presentation Layer

Description

The objective of this task is to design a presentation layer for the end-user community.
The developers will use the design that results from this task and its associated
subtasks in the Build Phase to build the presentation layer (5.6 Build Presentation
Layer). This task includes activities to develop a prototype, demonstrate it to users and
get their feedback, and document the overall presentation layer.

The purpose of any presentation layer is to design an application that can transform
operational data into relevant business information. An analytic solution helps end
users to formulate and support business decisions by providing this information in the
form of context, summarization, and focus.

Note: Readers are reminded that this guide is intentionally analysis-neutral. This
section describes some general considerations and deliverables for determining how to
deliver information to the end user. This step may actually take place earlier in this
phase, or occur in parallel with the data integration tasks.

The presentation layer application should be capable of handling a variety of analytical


approaches, including the following:

● Ad hoc reporting. Used in situations where users need extensive direct,


interactive exploration of the data. The tool should enable users to formulate
their own queries by directly manipulating relational tables and complex joins.
Such tools must support:

❍ Query formulation that includes multipass SQL, highlighting (alerts),
semi-additive summations, and direct SQL entry.
❍ Analysis and presentation capabilities like complex formatting, pivoting,
charting and graphs, and user-changeable variables.
❍ Strong technical features.
❍ Thin client web access with ease of use, metadata access, picklist, and
seamless integration with other applications.

This approach is suitable when users want to answer questions such as, "What
were Product X revenues in the past quarter?"

● Online Analytical Processing (OLAP). Arguably the most common approach


and most often associated with analytic solution architectures. There are
several types of OLAP (e.g., MOLAP, ROLAP, HOLAP, and DOLAP are all
variants), each with their own characteristics. The tool selection process
should highlight these distinguishing characteristics in the event that OLAP is
deemed the appropriate approach for the organization. The OLAP
technologies provide multidimensional access to business information,
allowing users to drill down, drill through, and drill across data. OLAP
access is more discovery-oriented than ad hoc reporting.
● Dashboard reporting (Push-Button). Dashboard reporting from the data
warehouse effectively replaced the concept of EIS (executive information
systems), largely because EIS could not contain sufficient data for true
analysis. Nevertheless, the need for an executive style front-end still exists
and dashboard reporting (sometimes referred to as Push-Button access)
largely fills the need. Dashboard reporting emphasizes the summarization and
presentation of information to the end user in a user friendly and extremely
graphical interface. Graphical presentation of the information attempts to
highlight business trends or exceptional conditions.
● Data Mining. An artificial intelligence-based technology that integrates large
databases and proposes possible patterns or trends in the data. A commonly
cited example is the telecommunications company that uses data mining to
highlight potential fraud by comparing activity to the customer's previous
calling patterns. The key distinction is data mining's ability to deliver trend
analysis without specific requests by the end users.

Prerequisites
None

Roles

Business Analyst (Primary)

Presentation Layer Developer (Primary)

Considerations

The presentation layer tool must:

● Comply with established standards across the organization
● Be compatible with the current and future technology infrastructures

The analysis tool does not necessarily have to be "one size fits all." Meeting the
requirements of all end users may require mixing different approaches to end-user
analysis. For example, if most users are likely to be satisfied with an OLAP tool while a
group focusing on fraud detection requires data mining capabilities, the end-user
analysis solution should include several tools, each satisfying the needs of the various
user groups. The needs of the various users should be determined by the user
requirements defined in 2.2 Define Business Requirements.

Best Practices
None

Sample Deliverables

Information Requirements Specification

Last updated: 01-Feb-07 18:46

Phase 4: Design
Subtask 4.4.1 Design Presentation Layer Prototype

Description

The purpose of this subtask is to develop a prototype of the end-user presentation layer
"application" for review by the business community (or its representatives).

The result of this subtask is a working prototype for end-user review and investigation.
PowerCenter can deliver a rough cut of the data to the target schema; then, Data
Analyzer (or other business intelligence tools) can build reports relatively quickly,
thereby allowing the end-user capability to evolve through multiple iterations of the
design.

Prerequisites
None

Roles

Business Analyst (Primary)

Presentation Layer Developer (Primary)

Considerations

It is important to use actual source data in the prototype. The closer the prototype is to
what the end user will actually see upon final release, the more relevant the feedback.
In this way, end users can see an initial interpretation of their needs and validate or
expand upon certain requirements.

Also consider the benefits of baselining the user requirements through a sign-off
process. This makes it easier for the development team to focus on deliverables. A
formal change control request process complements this approach. Baselining user
requirements also allows accurate tracking of progress against the project plan and
provides transparency to changes in the user requirements. This approach helps to
ensure that the project plan remains close to schedule.

Best Practices
None

Sample Deliverables
None

Last updated: 01-Feb-07 18:46

Phase 4: Design
Subtask 4.4.2 Present Prototype to Business Analysts

Description

The purpose of this subtask is to present the presentation layer prototype to business
analysts and the end users. The result of this subtask will be a deliverable,
the Prototype Feedback document, containing detailed results from the prototype
presentation meeting or meetings.

The Prototype Feedback document should contain such administrative information as


date and time of the meeting, a list of participants, and a summary of what was
presented. The bulk of the document should contain a list of participants' approval or
rejection of various aspects of the prototype. The feedback should cover such issues
as pre-defined reports, presentation of data, review of formulas for any derived
attributes, dimensional hierarchies, and so forth. The prototype demonstration should
focus on the capabilities of the end user analysis tool and highlight the differences
between typical reporting environments and decision support architectures. This
subtask also serves to educate the users about the capabilities of their new analysis
tool. A thorough understanding of what the tool can provide enables the end users to
refine their requirements to maximize the benefit of the new tool.

Technologies such as OLAP, EIS and Data Mining often bring a new data analysis
capability and approach to end users. In an ad hoc reporting paradigm, end users must
precisely specify their queries. Multidimensional analysis allows for much more
discovery and research, which follows a different paradigm. A prototype that uses
familiar data to demonstrate these abilities helps to launch the education process while
also improving the design. The demonstration of the prototype is also an opportunity to
further refine the business requirements discovered in the requirements gathering
subtask. The end users themselves can offer feedback and ensure that the method of
data presentation and the actual data itself are correct.

Prerequisites

4.4.1 Design Presentation Layer Prototype

Roles

Business Analyst (Primary)

Presentation Layer Developer (Primary)

Considerations

The Data Integration Developer needs to be an active participant in this subtask to


ensure that the presentation layer is developed with a full understanding of the needs
of the end users. Using actual source data in the development of the prototype gives
the Data Integration Developer a knowledge base of what data is or is not available in
the source systems and in what format that data is stored. Having all parties participate
in this activity facilitates the process of working through any data issues that may be
identified.

As with the tool selection process, it is important here to assemble a group that
represents the spectrum of end-users across the organization, from business analysts
to high-level managers. A cross section of end users at various levels ensures an
accurate representation of needs across the organization. Different job functions
require different information and may also require various data access methods (i.e., ad
hoc, OLAP, EIS, Data Mining). For example, information that is important to business
analysts, such as metadata, may not be important to a high-level manager, and vice
versa.

The demonstration of the presentation layer tool prototype should not be a one-time
activity; instead it should be conducted at several points throughout design and
development to facilitate and elicit end-user feedback. Involving the end users is vital to
getting "buy-in" and ensuring that the system will meet their requirements. User
involvement also helps build support for the presentation layer tool throughout the
organization.

Best Practices
None

Sample Deliverables

Prototype Feedback

Last updated: 01-Feb-07 18:46

Phase 4: Design
Subtask 4.4.3 Develop Presentation Layout Design

Description

The goal of any data integration, warehousing or business intelligence project is to


collect and transform data into meaningful information for use by the decision makers
of a business. The next step after prototyping the presentation layer and gaining
approval from the Business Analysts, is to improve and finalize its design for use by the
end users. A well-designed interface effectively communicates this information to the
end user. If an interface is not designed intuitively, however, the end users may not be
able to successfully leverage the information to their benefit. The principles are the
same regardless of the type of application (e.g., customer relationship management
reporting or metadata reporting solution).

Prerequisites

4.4.1 Design Presentation Layer Prototype

Roles

Business Analyst (Secondary)

Presentation Layer Developer (Primary)

Considerations

Types of Layouts

Each piece of information presented to the end user has its own level of importance.
The significance and required level of detail in the information to be delivered
determine whether to present the information on a dashboard or in a report.

For example, information that needs to be concise and answers the question “Has this measurement fallen below the critical threshold number?” qualifies as an Indicator on a dashboard. The most critical information in this category, which must reach the end user without waiting for the user to log onto the system, should be implemented as an Alert. Most information delivery requirements, however, constitute detailed reports, such as sales data for all regions or revenue by product category.
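
To make the distinction above concrete, the following is a minimal sketch (purely illustrative, not Data Analyzer functionality) of how an information requirement might be mapped to a delivery style; the function name and flags are hypothetical.

```python
def choose_delivery(concise_threshold_check: bool, must_reach_user_immediately: bool) -> str:
    """Map an information requirement to a delivery style, following the
    guideline described above (hypothetical helper, not a Data Analyzer API)."""
    if concise_threshold_check and must_reach_user_immediately:
        return "alert"                 # pushed to the user without waiting for a login
    if concise_threshold_check:
        return "dashboard indicator"   # concise answer to a threshold question
    return "detailed report"           # e.g. sales by region or revenue by product category

# Example: a critical threshold breach that must be pushed out to its audience
print(choose_delivery(concise_threshold_check=True, must_reach_user_immediately=True))  # alert
```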

Dashboards

Data Analyzer dashboards contain all the critical information users need in one single
interface. Data can be provided via Alerts, Indicators, or links to Favorite Reports and
Shared Documents.

Data Analyzer facilitates the design of an appealing presentation layout for the
information by providing predefined dashboard layouts. A clear understanding of what needs to be displayed, as well as how many different types of indicators and alerts are going to be put on the dashboard, is important in selecting an appropriate dashboard layout. Generally, each subset of data should be placed in a separate container. Detailed reports can be added as links on the dashboards so that users can easily navigate to them.

Report Table Layout

Each report that you are going to build should have design features suited to the data to be displayed, so that the report communicates its message effectively. To ensure this, be sure to understand the type of data that each report is going to display before choosing a report table layout. For example, a tabular
layout would be appropriate for a sales revenue report that shows the dollar amounts
against only one dimension (e.g., product category), but a sectional layout would be
more appropriate if the end users are interested in seeing the dollar amounts for each
category of the product in each district, one at a time.

When developing either a dashboard or report, be sure to consider the following points:

● Who is your audience? You have to know who will receive the information that you are going to provide. The audience’s requirements and
preferences should drive your presentation style. Often there will be multiple
audiences for the information you have to share. On many occasions, you will
find that the same information will best serve its purpose if presented in two
different styles to two different users. For example: you may have to create
multiple dashboards in a single project and personalize each dashboard to a
specific group of end users’ needs.
● What type of information do the users need and what are their expectations?
Always remember that the users are looking for very specific pieces of
information in the presentation layout. Most of the time, the business users are not highly technically skilled personnel. They do not always have the time or
required skills to navigate to various places and search for the specific metric
or value that matters to them. Try to place yourself in the users' shoes and ask
yourself questions such as what would be the most helpful way to display the
information or what could be the possible uses of the information that you are
providing.

Additionally, the users' expectations will affect the way your information is
presented to them. Some users may be more interested in indicators and charts, while others may want to see detailed reports. The more thoroughly you understand the users' expectations, the better you can design the presentation layout.

● Why do they need it? Understanding this can help you to choose the right
layout for each piece of information that you have to present. If they want
granular information, then they are likely to want a detailed report. However, if
they just need quick glimpses of the data, indicators on a dashboard or
emailed alerts are likely to be more appropriate.
● When does the data need to be displayed? It is critical to know when
important business processes occur. This can help drive the development and
scheduling of reports – daily, weekly, monthly, etc. This can also help to
determine what type of indicators to develop, such as monthly or daily sales.
● How should the data be displayed? A well-designed chart, graph or an
indicator can convey critical information to the concerned users quickly and
accurately. It becomes important to choose the right colors and backgrounds
to catch the user’s attention where it is needed the most. A good example of
this would be using a bright red color for all your alerts, green for all the ‘good’
values and so on.

Tip
It is also important to determine if there are any enterprise standards set for
the layout designs of the reports and dashboards, especially the color codes
as given in the example above.

Best Practices
None



Sample Deliverables
None

Last updated: 15-Feb-07 19:10



Phase 5: Build

5 Build

● 5.1 Launch Build Phase


❍ 5.1.1 Review Project Scope and Plan
❍ 5.1.2 Review Physical Model
❍ 5.1.3 Define Defect Tracking Process
● 5.2 Implement Physical Database
● 5.3 Design and Build Data Quality Process
❍ 5.3.1 Design Data Quality Technical Rules
❍ 5.3.2 Determine Dictionary and Reference Data Requirements
❍ 5.3.3 Design and Execute Data Enhancement Processes
❍ 5.3.4 Design Run-time and Real-time Processes for Operate Phase
Execution
❍ 5.3.5 Develop Inventory of Data Quality Processes
❍ 5.3.6 Review and Package Data Transformation Specification
Processes and Documents
● 5.4 Design and Develop Data Integration Processes
❍ 5.4.1 Design High Level Load Process
❍ 5.4.2 Develop Error Handling Strategy
❍ 5.4.3 Plan Restartability Process
❍ 5.4.4 Develop Inventory of Mappings & Reusable Objects
❍ 5.4.5 Design Individual Mappings & Reusable Objects
❍ 5.4.6 Build Mappings & Reusable Objects
❍ 5.4.7 Perform Unit Test
❍ 5.4.8 Conduct Peer Reviews
● 5.5 Populate and Validate Database
❍ 5.5.1 Build Load Process
❍ 5.5.2 Perform Integrated ETL Testing



● 5.6 Build Presentation Layer
❍ 5.6.1 Develop Presentation Layer
❍ 5.6.2 Demonstrate Presentation Layer to Business Analysts



Phase 5: Build

Description

The Build Phase uses the design work completed in the Architect Phase and the Design Phase as inputs to physically create the data integration solution, including data quality and data transformation
development efforts.

At this point, the project scope, plan, and business requirements defined in
the Manage Phase should be re-evaluated to ensure that the project can deliver the
appropriate value at an appropriate time.

Prerequisites
None

Roles

Business Analyst (Primary)

Business Project Manager (Secondary)

Data Architect (Primary)

Data Integration Developer (Primary)

Data Quality Developer (Primary)

Data Steward/Data Quality Steward (Secondary)

Data Warehouse Administrator (Secondary)

Database Administrator (DBA) (Primary)

Presentation Layer Developer (Primary)



Project Sponsor (Approve)

Quality Assurance Manager (Primary)

Repository Administrator (Secondary)

System Administrator (Secondary)

Technical Project Manager (Primary)

Test Manager (Primary)

Considerations

PowerCenter serves as a complete data integration platform to move data from source
to target databases, perform data transformations, and automate the extract, transform,
and load (ETL) processes. As a project progresses from the Design Phase to the
Build Phase, it is helpful to review the activities involved in each of these processes.

● Extract - PowerCenter extracts data from a broad array of heterogeneous sources. Data can be accessed from sources including IBM mainframe and AS/400 systems, MQ Series, and TIBCO; ERP systems from SAP, PeopleSoft, and Siebel; relational databases; HIPAA sources; flat files; web log sources; and direct parsing of XML data files through DTDs or XML schemas. PowerCenter interfaces mask the complexities of the underlying DBMS for the developer, enabling the build process to focus on implementing the business logic of the solution.
● Transform - The majority of the work in the Build Phase focuses on
developing and testing data transformations. These transformations apply the
business rules, cleanse the data, and enforce data consistency from disparate
sources as data is moved from source to target.

● Load - PowerCenter automates much of the load process. To increase performance and throughput, loads can be multi-threaded, pipelined, streamed (concurrent execution of the extract, transform, and load steps), or serviced by more than one server. In addition, DB2, Oracle, Sybase IQ and Teradata external loaders can be used to increase performance. Data can be delivered to EAI queues for enterprise applications. Data loads can also take advantage of Open Database Connectivity (ODBC) or use native database drivers to optimize performance. Pushdown optimization can even allow some or all of the transformation work to occur in the target database itself.
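
The sketch below is purely conceptual and is not PowerCenter code or its API: it simply illustrates how the extract, transform, and load responsibilities described above can be chained as a streamed pipeline. The record fields and the single cleansing rule are hypothetical.

```python
from typing import Iterable, Iterator

def extract(rows: Iterable[dict]) -> Iterator[dict]:
    """Read records from a source (file, queue, or database cursor)."""
    for row in rows:
        yield row

def transform(rows: Iterable[dict]) -> Iterator[dict]:
    """Apply business rules and cleansing, e.g. standardize a country value."""
    for row in rows:
        row = dict(row)
        row["country"] = (row.get("country") or "UNKNOWN").strip().upper()
        yield row

def load(rows: Iterable[dict], target: list) -> int:
    """Write records to the target; a real load would be bulk, multi-threaded,
    or pushed down to the target database."""
    count = 0
    for row in rows:
        target.append(row)
        count += 1
    return count

# The three stages run as a pipeline, analogous to streamed execution.
source = [{"id": 1, "country": " us "}, {"id": 2, "country": None}]
warehouse: list = []
print(load(transform(extract(source)), warehouse))  # 2
```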

Best Practices
None

Sample Deliverables
None

Last updated: 01-Feb-07 18:46



Phase 5: Build
Task 5.1 Launch Build
Phase

Description

In order to begin the Build phase, all analysis performed in previous phases of the
project needs to be compiled, reviewed and disseminated to the members of the Build
team. Attention should be given to project schedule, scope, and risk factors. The team
should be provided with:

● Project background
● Business objectives for the overall solution effort
● Project schedule, complete with key milestones, important deliverables,
dependencies, and critical risk factors
● Overview of the technical design including external dependencies
● Mechanism for tracking scope changes, problem resolution, and other
business issues

A series of meetings may be required to transfer the knowledge from the Design team
to the Build team, ensuring that the appropriate staff is provided with relevant
information. Some or all of the following types of meetings may be required to get
development under way:

● Kick-off meeting to introduce all parties and staff involved in the Build phase
● Functional design review to discuss the purpose of the project and the benefits
expected and review the project plan
● Technical design review to discuss the source to target mappings, architecture
design, and any other technical documentation

Information provided in these meetings should enable members of the data integration
team to immediately begin development. As a result of these meetings, the integration
team should have a clear understanding of the environment in which they are to work,
including databases, operating systems, database/SQL tools available in the
environment, file systems within the repository and file structures within the
organization relating to the project, and all necessary user logons and passwords.



The team should be provided with points of contact for all facets of the environment (e.g., DBA, UNIX/NT Administrator, PowerCenter Administrator, etc.). The team should
also be aware of the appropriate problem escalation plan. When team members
encounter design problems or technical problems, there must be an appropriate path
for problem escalation. The Project Manager should establish a specific mechanism for
problem escalation along with a problem tracking report.

Prerequisites
None

Roles

Business Analyst (Secondary)

Data Architect (Primary)

Data Integration Developer (Secondary)

Data Warehouse Administrator (Secondary)

Database Administrator (DBA) (Secondary)

Presentation Layer Developer (Primary)

Quality Assurance Manager (Primary)

Repository Administrator (Review Only)

System Administrator (Review Only)

Technical Project Manager (Primary)

Test Manager (Primary)

Considerations

It is important to include all relevant parties in the launch activities. If all points of discussion cannot be resolved during the kick-off meeting, the key personnel in each
area should be present to reschedule quickly, so as not to affect the overall schedule.

Because of the nature of the development process, there are often bottlenecks in the
development flow. The Project Manager should be aware of the risk factors that emanate from outside the project, and should be able to anticipate where bottlenecks
are likely to occur. The Project Manager also needs to be aware of the external factors
that create project dependencies, and should avoid having meetings prematurely when
external dependencies have not been resolved. Having meetings prior to resolving
these issues can result in significant down time for the developers while they wait to
have their sources in place and finalized.

Best Practices
None

Sample Deliverables
None

Last updated: 01-Feb-07 18:46



Phase 5: Build
Subtask 5.1.1 Review
Project Scope and Plan

Description

The Build team needs to understand the project's objectives, scope, and plan in order
to prepare themselves for the Build Phase. There is often a tendency to waste time
developing non-critical features or functions. The team should review the project plan
and identify the critical success factors and key deliverables to avoid focusing on
relatively unimportant tasks. This helps to ensure that the project stays on its original
track and avoids much unnecessary effort. The team should be provided with:

● Detailed descriptions of deliverables and timetables.


● Dependencies that affect deliverables.
● Critical success factors.
● Risk assessments made by the design team.

With this information, the Build team should be able to enhance the project plan to
navigate through the risk areas, dependencies, and tasks to reach its goal of
developing an effective solution.

Prerequisites
None

Roles

Business Analyst (Review Only)

Data Architect (Review Only)

Data Integration Developer (Review Only)

Data Warehouse Administrator (Review Only)



Database Administrator (DBA) (Review Only)

Presentation Layer Developer (Review Only)

Quality Assurance Manager (Review Only)

Technical Project Manager (Primary)

Considerations

With the Design Phase complete, this is the first opportunity for the team to review
what it has learned during the Architect Phase and the Design Phase about the
sources of data for the solution. It is also a good time to review and update the project
plan, which was created before these findings, to incorporate the knowledge gained
during the earlier phases. For example, the team may have learned that the source of
data for marketing campaign programs is a spreadsheet that is not easily accessible by
the network on which the data integration platform resides. In this case, the team may
need to plan additional tasks and time to build a method for accessing the data. This is
also an appropriate time to review data profiling and analysis results to ensure all data
quality requirements have been taken into consideration.

During the project scope and plan review, significant effort should be made to identify
upcoming Build Phase risks and assess their potential impact on project schedule and/
or cost. Because the design is complete, risk management at this point tends to be
more tactical than strategic; however, the team leadership must be fully aware of key
risk factors that remain. Team members are responsible for identifying the risk factors
in their respective areas and notifying project management during the review process.

Best Practices
None

Sample Deliverables

Project Review Meeting Agenda

Last updated: 01-Feb-07 18:46



Phase 5: Build
Subtask 5.1.2 Review
Physical Model

Description

The data integration team needs the physical model of the target database in order to
begin analyzing the source to target mappings and develop the end user interface
known as the presentation layer.

The Data Architect can provide database specifics such as which columns are indexed, what partitions are available and how they are defined, and what type of data is stored in each table.

The Data Warehouse Administrator can provide metadata information and other source
data information, and the Data Integration Developer(s) needs to understand the entire
physical model of both the source and target systems, as well as all the dimensions,
aggregations, and transformations that will be needed to migrate the data from the
source to the target.

Prerequisites
None

Roles

Business Analyst (Secondary)

Data Architect (Primary)

Data Integration Developer (Secondary)

Data Warehouse Administrator (Secondary)

Database Administrator (DBA) (Secondary)

Presentation Layer Developer (Primary)



Quality Assurance Manager (Review Only)

Repository Administrator (Review Only)

Technical Project Manager (Review Only)

Considerations

Depending on how much up-front analysis was performed prior to the Build phase, the
project team may find that the model for the target database does not correspond well
with the source tables or files. This can lead to extremely complex and/or poorly
performing mappings. For this reason, it is advisable to allow some flexibility in the
design of the physical model to permit modifications to accommodate the sources. In
addition, some end user products may not support some datatypes specific to a
database. For example, Teradata's BYTEINT datatype is not supported by some end
user reporting tools.

As a result of the various kick-off and review meetings, the data integration team
should have sufficient understanding of the database schemas to begin work on the
Build-related tasks.

Best Practices
None

Sample Deliverables

Physical Data Model Review Agenda

Last updated: 01-Feb-07 18:46



Phase 5: Build
Subtask 5.1.3 Define Defect
Tracking Process

Description

Since testing is designed to uncover defects, it is crucial to properly record the defects
as they are identified, along with their resolution process. This requires a ‘defect
tracking system’ that may be entirely manual, based on shared documents such as
spreadsheets, or automated using, say, a database with a web browser front-end.

Whatever tool is chosen, sufficient details of the problem must be recorded to allow
proper investigation of the root cause and then the tracking of the resolution process.

The success of a defect tracking system depends on:

● Formal test plans and schedules being in place, to ensure that defects are
discovered, and that their resolutions can be retested.
● Sufficient details being recorded to ensure that any problems reported are
repeatable and can be properly investigated.

Prerequisites
None

Roles

Data Integration Developer (Review Only)

Data Warehouse Administrator (Review Only)

Database Administrator (DBA) (Review Only)

Presentation Layer Developer (Review Only)

Quality Assurance Manager (Primary)



Repository Administrator (Review Only)

System Administrator (Review Only)

Technical Project Manager (Primary)

Test Manager (Primary)

Considerations

The defect tracking process should encompass these steps:

● Testers prepare Problem Reports to describe defects identified.


● Test Manager reviews these reports and assigns priorities on an Urgent/High/
Medium/Low basis (‘Urgent’ should only be used for problems that will prevent
or severely delay further testing).
● Urgent problems are immediately passed to the Project Manager for review/
action.
● Non-urgent problems are reviewed by the Test Manager and Project Manager
on a regular basis (this can be daily at a critical development time, but is
usually less frequent) to agree priorities for all outstanding problems.
● The Project Manager assigns problems for investigation according to the
agreed-upon priorities.
● The ‘investigator’ attempts to determine the root cause of the defect and to
define the changes needed to rectify the defect.
● The Project Manager reviews the results of investigations and assigns
rectification work to ‘fixers’ according to priorities and effective use of
resources.
● The ‘fixer’ makes the required changes and conducts unit testing.
● Regression testing is also typically conducted. The Project Manager may
decide to group a number of fixes together to make effective use of resources.

The Project Manager and Test Manager review the test results at their next meeting
and agree on closure, if appropriate.
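
As a minimal illustration of the information such a system needs to capture, the sketch below models a problem report with the Urgent/High/Medium/Low priorities described above; the field names, statuses, and example values are assumptions rather than a prescribed schema.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum

class Priority(Enum):
    URGENT = "Urgent"    # will prevent or severely delay further testing
    HIGH = "High"
    MEDIUM = "Medium"
    LOW = "Low"

@dataclass
class ProblemReport:
    report_id: str
    summary: str
    steps_to_reproduce: str      # enough detail to make the problem repeatable
    reported_by: str
    reported_on: date
    priority: Priority = Priority.MEDIUM
    assigned_to: str = ""        # investigator, then fixer
    status: str = "Open"         # Open -> Investigating -> Fixed -> Retested -> Closed
    resolution_notes: str = ""

# An urgent problem goes straight to the Project Manager for review and action.
defect = ProblemReport("DW-042", "Nightly load drops duplicate customer rows",
                       "Run the customer load twice for the same business date.",
                       "tester1", date.today(), Priority.URGENT)
print(defect.priority.value)  # Urgent
```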

Best Practices
None



Sample Deliverables

Issues Tracking

Last updated: 01-Feb-07 18:46



Phase 5: Build
Task 5.2 Implement
Physical Database

Description

Implementing the physical database is a critical task that must be performed efficiently
to ensure a successful project. In many cases, correct database implementation can
double or triple the performance of the data integration processes and presentation
layer applications. Conversely, poor physical implementation generally has the greatest
negative performance impact on a system.

The information in this section is intended as an aid for individuals responsible for the
long-term maintenance, performance, and support of the database(s) used in the
solution. It should be particularly useful for programmers, Database Administrators, and
System Administrators with an in-depth understanding of their database engine and
Informatica product suite, as well as the operating system and network hardware.

Prerequisites
None

Roles

Data Architect (Secondary)

Data Integration Developer (Review Only)

Database Administrator (DBA) (Primary)

Repository Administrator (Secondary)

System Administrator (Secondary)

Considerations

Nearly everything is a trade-off in the physical database implementation. One example is trading the flexibility of a completely 3rd Normal Form data schema for the improved performance of a 2nd Normal Form database.

The DBA is responsible for determining which of the many available alternatives is the
best implementation choice for the particular database. For this reason, it is critical for
this individual to have a thorough understanding of the data, database, and desired use
of the database by the end-user community prior to beginning the physical design and
implementation processes.

The DBA should be thoroughly familiar with the design of star-schemas for Data
Warehousing and Data Integration solutions, as well as standard 3rd Normal
Form implementations for operational systems.

For data migration projects this task often refers exclusively to the development of new
tables in either a reference data schema or staging schemas. Developers are
encouraged to leverage a reference data database which will hold reference data such
as valid values, cross-reference tables, default values, exception handling details, and
other tables necessary for successful completion of the data migration. Additionally,
tables will be created in staging schemas. There should be little creation of tables in the source or target system due to the nature of the project. Therefore, most of the table development will be in the developer space rather than in the applications that are part
of the data migration.

Best Practices
None

Sample Deliverables
None

Last updated: 01-Feb-07 18:46



Phase 5: Build
Task 5.3 Design and Build
Data Quality Process

Description

Follow the steps in this task to design and build the data quality enhancement
processes that can ensure that the project data meets the standards of data quality
required for progress through the rest of the project.

The processes designed in this task are based on the results of 2.8 Perform Data
Quality Audit. Both the design and build components are captured in the Build Phase because much of this work is iterative: as intermediate builds of the data quality process are reviewed, the design is further expanded and enhanced.

Note: If the results of the Data Quality Audit indicate that the project data already
meets all required levels of data quality, then you can skip this task. However, this
is unlikely to occur.

Here again (as in subtask 2.3.1 Identify Source Data Systems) it is important to work as
far as is practicable with the actual source data. Using data derived from the actual
source systems - either the complete dataset or a subset - was essential in identifying
quality issues during the Data Quality Audit and determining if the data meets the
business requirements (i.e., if it answers the business questions identified in
the Manage Phase). The data quality enhancement processes designed in the
subtasks of this task must operate on as much of the project dataset(s) as deemed
necessary, and possibly the entire dataset.

Data quality checks can be of two types: one can cover the metadata characteristics of
the data, and the other covers the quality of the data contents from a business
perspective. In the case of complex ERP systems like SAP, where implementation has
a high degree of variation from the base product, a thorough data quality check should
be performed to consider the customizations.

Prerequisites
None

Roles



Business Analyst (Primary)

Business Project Manager (Secondary)

Data Integration Developer (Secondary)

Data Quality Developer (Primary)

Data Steward/Data Quality Steward (Secondary)

Technical Project Manager (Approve)

Considerations

Because the quality of the source system data has a major effect on the correctness of
all downstream data, it is imperative to resolve as many of the data issues as possible,
as early as possible. Making the necessary corrections at this stage eliminates many of
the questions that may otherwise arise later during testing and validation.

If the data is flawed, the development initiative faces a very real danger of failing. In
addition, eliminating errors in the source data makes it far easier to determine the
nature of any problems that may arise in the final data outputs. If data comes from
different sources, it is mandatory to correct data for each source as well as for the
integrated data. If data comes from a mainframe, it is necessary to use the proper
access method to interpret data correctly. Note, however, that Informatica Data Quality (IDQ) applications do not read data directly from mainframe systems.

As indicated above, the issue of data quality covers far more than simply whether the
source and target data definitions are compatible. From the business perspective, data
quality processes seek to answer the following questions: what standard has the data
achieved in areas that are important to the business, and what standards are required
in these areas?

There are six main areas of data quality performance: Accuracy, Completeness,
Conformity, Consistency, Integrity, and Duplication. These are fully explained in
task 2.8 Perform Data Quality Audit. The Data Quality Developer uses the results of the
Data Quality Audit as the benchmark for the data quality enhancement steps you need
to apply in the current task. Before beginning to design the data quality processes, the
Data Quality Developer, Business Analyst, Project Sponsor, and other interested parties must meet to review the outcome of the Data Quality Audit and agree on the extent of remedial action needed for the project data. The first step is to agree on the business
rules to be applied to the data. (See Subtask 5.3.1 Design Data Quality Technical
Rules.)

The tasks that follow are written from the perspective of Informatica Data Quality,
Informatica’s dedicated data quality application suite.

Best Practices

Data Cleansing

Sample Deliverables
None

Last updated: 01-Feb-07 18:46



Phase 5: Build
Subtask 5.3.1 Design Data
Quality Technical Rules

Description

Business rules are a key driver of data enhancement processes. A business rule is a
condition of the data that must be true if the data is to be valid and, in a larger sense,
for a specific business objective to succeed. In many cases, poor data quality is directly related to the data’s failure to satisfy a business rule.

In this subtask the Data Quality Developer and the Business Analyst, and optionally
other personnel representing the business, establish the business rules to be applied to
the data. An important factor in completing this task is proper documentation of the
business rules.

Prerequisites
None

Roles

Business Analyst (Primary)

Data Quality Developer (Primary)

Data Steward/Data Quality Steward (Secondary)

Considerations

All areas of data quality can be affected by business rules, and business rules can be
defined at high- and low-levels and at varying levels of complexity. Some business
rules can be tested mathematically using simple processes, whereas others may
require complex processes or reference data assistance.

For example, consider a financial institution that must store several types of information
for account holders in order to comply with the Sarbanes-Oxley or the USA-PATRIOT
Act. It defines several business rules for its database data, including:



● Field 1-Field n must not be null or populated with default values.
● Date of Birth field must contain dates within certain ranges (e.g., to indicate
that the account holder is between 18 and 100 years old).
● All account holder addresses are validated as postally correct.

These three rules are equally easy to express, but they are implemented in different
ways. All three rules can be checked in a straightforward manner using Informatica
Data Quality (IDQ), although the third rule, concerning address validation, requires
reference data verification. The decision to use external reference data is covered in
subtask 5.3.2 Determine Dictionary and Reference Data Requirements.
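
As an illustration, the first two rules above can be expressed as simple record-level checks. The sketch below is not IDQ plan logic; the field names and the set of default markers are assumptions, and the address rule is omitted because it requires reference data (see 5.3.2).

```python
from datetime import date

DEFAULT_MARKERS = {"", "N/A", "UNKNOWN"}   # assumed placeholder values

def rule_not_null_or_default(record: dict, fields: list) -> bool:
    """Rule 1: the listed fields must not be null or populated with default values."""
    return all(record.get(f) not in (None, *DEFAULT_MARKERS) for f in fields)

def rule_dob_in_range(dob: date, min_age: int = 18, max_age: int = 100) -> bool:
    """Rule 2: the account holder's age must fall within the required range."""
    today = date.today()
    age = today.year - dob.year - ((today.month, today.day) < (dob.month, dob.day))
    return min_age <= age <= max_age

record = {"account_id": "A-1001", "tax_id": "N/A", "dob": date(1975, 6, 30)}
print(rule_not_null_or_default(record, ["account_id", "tax_id"]))  # False: default tax_id
print(rule_dob_in_range(record["dob"]))                            # True
```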

When defining business rules, the Data Quality Developer must consider the following
questions:

● How to document the rules.


● How to build the data quality processes to validate the rules.

Documenting Business Rules

Documenting rules is essential as a means of tracking the implementation of the business requirements. When documenting business rules, the following information
must be provided:

● A unique ID should be provided for each rule. This can be as simple as an incremented number, or assigning a project code to each rule.
● A text description of the rule. This should be as complete as possible –
however, if the description becomes too lengthy or complex, it may be
advisable to break it down into multiple rules.
● The name of the data source containing the records affected by the rule.
● The data headers or field names containing the values affected by the rule.
The Data Quality Developer and the Business Analyst can refer back to the
results of the Data Quality Audit to identify this information.
● Add columns for the plan name and the results of implementing the rule. The
Data Quality Developer can provide this information later.
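
One lightweight way to capture this documentation consistently is a structured record per rule; the sketch below is only a suggestion, and the field names mirror the bullet points above rather than any required template.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class BusinessRuleDoc:
    rule_id: str              # unique ID, e.g. a project code plus an incremented number
    description: str          # keep it concise; split overly complex rules into several
    data_source: str          # the data source containing the affected records
    fields: List[str]         # headers/field names containing the affected values
    plan_name: str = ""       # filled in later by the Data Quality Developer
    result_summary: str = ""  # filled in after the plan implementing the rule is run

rules = [
    BusinessRuleDoc("DQ-001",
                    "Account holder date of birth must imply an age between 18 and 100.",
                    "ACCOUNT_MASTER", ["DATE_OF_BIRTH"]),
]
print(rules[0].rule_id)  # DQ-001
```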

Note: In IDQ, a discrete data quality process is called a plan. A plan has inputs,
outputs, and analysis or enhancement algorithms and is analogous to a PowerCenter mapping. It is important to understand that a data quality plan can be added to a
PowerCenter custom transformation and run within a PowerCenter mapping.

Assigning Business Rules to Data Quality Plans

When the Data Quality Developer and Business Analyst have agreed on the business
rules to apply to the data, the Data Quality Developer must decide how to convert the
rules into data quality plans. (The Data Quality Developer need not create the plans at this stage.)

The Data Quality Developer may create a plan for each rule, or may incorporate
several rules into a single plan. This decision is taken on a rule-by-rule basis. There is
a trade-off between simplicity in plan design, wherein each plan contains a single rule,
and efficiency in plan design, wherein a single plan addresses several rules.

Typically a plan handles more than one rule. One advantage of this course of action is
that the Data Quality Developer does not need to define and maintain multiple
instances of input and output data, covering small increments of data quality progress,
where a single set of inputs and outputs can do the same job in a more sophisticated
plan.

It’s also worth considering if the plan will be run from within IDQ or added to a
PowerCenter mapping for execution in a workflow. Bear in mind that the Data Quality
Integration transformation in PowerCenter accepts information from one plan. To add
several plans to a mapping, you must add the same number of transformations.

Best Practices
None

Sample Deliverables
None

Last updated: 15-Feb-07 19:12



Phase 5: Build
Subtask 5.3.2 Determine
Dictionary and Reference
Data Requirements

Description

Many data quality plans make use of reference data files to validate and improve the
quality of the input data. The main purposes of reference data are:

● To validate the accuracy of the data in question. For example, in cases where
input data is verified against tables of known-correct data.
● To enrich data records with new data or enhance partially-correct data values.
For example, in cases of address records that contain usable but incomplete
postal information. (Typos can be identified and fixed; Plus-4 information can
be added to zip codes.)

When preparing to build data quality plans, the Data Quality Developer must determine
the types of dictionary and reference files that may be used in the data quality plans,
obtain approval to use third-party data, if necessary, and define a strategy for
maintaining and distributing reference files. An important factor in completing this task
is the proper documentation of the required dictionary or reference files.

Prerequisites
None

Roles

Business Analyst (Secondary)

Business Project Manager (Secondary)

Data Quality Developer (Primary)

Data Steward/Data Quality Steward (Secondary)



Considerations

Data quality plans can make use of three types of reference data.

● Standard dictionary files. These files are installed with Informatica Data
Quality (IDQ) and can be used by several types of components in Workbench.
All dictionaries installed with IDQ are text dictionaries. These are plain-text
files saved in .DIC file format. They can be created and edited manually.

IDQ installs with a set of dictionary files in generic business information areas
including forenames, city and town names, units of measurement, and gender
identification. Informatica also provides and supports reference data of external
origin, such as postal address data endorsed by national postal carriers.

● Database dictionaries. Users with database expertise can create and specify dictionaries that are linked to database tables, and can, therefore, be updated dynamically when the underlying data is updated. Database dictionaries are useful when the reference data originated for other purposes and is likely to change independently of IDQ. By making use of a dynamic connection, data quality plans can always point to the current version of the reference data. Database dictionaries are stored as SELECT statements that query the database at the time of plan execution. IDQ does not install any database dictionaries.
● Third-party reference data. These data files originate from third-party vendors and are provided by Informatica as premium product options. The reference data provided by third-party vendors is typically in database format.

If the Data Quality Developer feels that externally-derived reference data files are
necessary, he or she must inform the Project Manager or other business personnel as
soon as possible, as this is likely to affect (1) the project budget and (2) the software
architecture implementation.

Managing and Distributing Reference Files

Managing standard-installed dictionaries is straightforward, as long as the Data Quality Developer does not move the designed plans to non-standard locations.

What is a non-standard location? One where the plans cannot see the dictionary files.



IDQ recognizes set locations for dictionary and reference data files. A Standard (i.e.,
client-only) install of IDQ looks for its dictionary files in the \Dictionaries folder of the
installation. An Enterprise (i.e., client-server) install looks in this location, and also looks
in the logged-on user’s \Dictionaries folder on the server if the plan is executed on the
server. These locations are specified in IDQ’s config.xml file.

If the relevant dictionary files are moved out of these locations, the plan cannot run
unless the config.xml file has been edited. Conversely, if the user has created new or
modified dictionaries within the standard dictionary format, and wishes to copy (publish)
plans to a server or another IDQ installation, the user must also copy the new dictionary files to a location recognized by that server or installation.

Third-party reference data adds another set of actions. The third-party data currently
available from Informatica is packaged in a manner that installs to locations recognized
by IDQ. (Again, these locations are defined in the config.xml file.) However, copying these files to other locations is not as straightforward, because their installation is more involved and because the files are licensed and delivered separately from IDQ. The
business must agree to license these files before the Data Quality Developer can
assume he or she can develop plans using third-party files, and the system
administrator must understand that the reference data will be installed in the required
locations.

Note: Informatica customers license third-party data on a subscription basis. Informatica provides regular updates to the reference data, and the customer (possibly
the system administrator) must perform the updates.

Whenever you add a dictionary or reference data file to a plan, you must document
exactly how you have done so: record the plan name, the reference file name, and the
component instance that uses the reference file. Make sure you pass the inventory of
reference data to all other personnel who are going to use the plan.

Data migration projects have additional reference data requirements, which include a need to determine the valid values for key code fields and to ensure that all input data aligns with these codes. It is recommended to build valid-value processes to perform this validation. It is also recommended to use a table-driven approach to populate hard-coded values, which allows for easy changes if the specific hard-coded values change over time. Additionally, a large number of basic cross-references are also
required for data migration projects. These data types are examples of reference data
that should be planned for by using a specific approach to populate and maintain them
with input from the business community. These needs can be met with a variety of
Informatica products, but to expedite development, they must be addressed prior to
building data integration processes.
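
A minimal sketch of the table-driven approach described here follows: the valid-value set and cross-reference mapping are shown as in-memory structures, but in practice they would be populated from tables in the reference data schema. All codes and names are hypothetical.

```python
# Valid values for a key code field (in practice, loaded from a reference table).
VALID_COUNTRY_CODES = {"US", "CA", "MX"}

# Cross-reference: legacy source codes mapped to target-system codes.
COUNTRY_XREF = {"USA": "US", "CAN": "CA", "MEX": "MX"}

def standardize_country(source_code: str) -> str:
    """Translate a legacy code via the cross-reference, then validate it against
    the valid-value set. Because both structures are table-driven, a change to a
    hard-coded value only requires a reference data update, not a code change."""
    code = source_code.strip().upper()
    code = COUNTRY_XREF.get(code, code)
    if code not in VALID_COUNTRY_CODES:
        raise ValueError(f"'{source_code}' does not map to a valid country code")
    return code

print(standardize_country("usa"))  # US
```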



Best Practices
None

Sample Deliverables
None

Last updated: 15-Feb-07 19:14



Phase 5: Build
Subtask 5.3.3 Design and Execute
Data Enhancement Processes

Description

This subtask, along with subtask 5.3.4 Design Run-time and Real-time Processes for Operate Phase Execution, concerns the design and execution of the data quality plans that will prepare the project data for the Data Integration Design and Development in the Build Phase.

While this subtask describes the creation and execution of plans through Informatica Data Quality
(IDQ) Workbench, subtask 5.3.4 focuses on the steps to deploy plans in a runtime or scheduled
environment. All plans are created in Workbench. However, there are several aspects to creating
plans primarily for runtime use, and these are covered in 5.3.4. Users who are creating plans
should read both subtasks.

Note: IDQ provides a user interface, the Data Quality Workbench, within which plans can be
designed, tested, and deployed to other Data Quality engines across the network. Workbench is an
intuitive user interface; however, the plans that users construct in Workbench can grow in size and
complexity, and Workbench, like all software applications, requires user training. These subtasks
are not a substitute for that training. Instead, they describe the rudiments of plan construction, the
elements required for various types of plans, and the next steps to plan deployment. Both subtasks
assume that the Data Quality Developer will have received formal training in IDQ.

Prerequisites
None

Roles

Data Quality Developer (Primary)

Technical Project Manager (Approve)

Considerations

A data quality plan is a discrete set of data analysis and/or enhancement operations with a data
source and a data target (or sink). At a high level, the design of a plan is not dissimilar to the
design of a PowerCenter mapping. The data sources, sinks, and analysis/enhancement
components are represented on-screen by icons, much like the sources, targets, and
transformations in a mapping. Sources, sinks, and other components can be configured through a
tabbed dialog box in the same way as PowerCenter transformations. One difference between
PowerCenter and Workbench is that users cannot define workflows that contain serial data quality plans, although this functionality is available in a runtime/batch scenario.

Data quality plans can read source data from, and write data to, files and databases. Most delimited, flat, or fixed-width file types are usable, as are DB2, Oracle, and SQL Server databases and any database accessible via ODBC. Informatica Data Quality (IDQ) stores plan data in its own MySQL data
repository. The following figure illustrates a simple data quality plan.

This data quality plan shows a data source reading from a SQL database, an operational
component analyzing the data, and a data sink component that receives the data available as plan
output. A plan can have any number of operational components.
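
Conceptually, the structure just described can be pictured as a simple chain from source to sink. The sketch below is only an illustration of that shape (it is not the IDQ object model), and the single component shown is hypothetical.

```python
from typing import Callable, Iterable, Iterator, List

Record = dict
Component = Callable[[Iterator[Record]], Iterator[Record]]

def run_plan(source: Iterable[Record], components: List[Component],
             sink: List[Record]) -> None:
    """Pass records from the data source through each operational component
    in turn, then write whatever emerges from the chain to the data sink."""
    stream: Iterator[Record] = iter(source)
    for component in components:
        stream = component(stream)
    sink.extend(stream)

def standardize_city(records: Iterator[Record]) -> Iterator[Record]:
    """Hypothetical operational component: standardize the CITY value."""
    for r in records:
        r = dict(r)
        r["CITY"] = (r.get("CITY") or "").strip().upper()
        yield r

output: List[Record] = []
run_plan([{"CITY": " dublin "}], [standardize_city], output)
print(output)  # [{'CITY': 'DUBLIN'}]
```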

Plans can be designed to fulfill several data quality requirements, including data analysis, parsing,
cleansing and standardization, enrichment, validation, matching, and consolidation. These are
described in detail in the Best Practice Data Cleansing.

When designing data quality plans, the questions to consider include:

● What types of plan are necessary to meet the needs of the project? The business should have already signed off on specific data quality goals as part of agreeing on the overall project objectives, and the Data Quality Audit should have indicated the areas
where the project data requires improvement. For example, the audit may indicate that the
project data contains a high percentage of duplicate records, and therefore matching and
pre-match grouping plans may be necessary.
● What test cycles are appropriate for the plans? Testing and tuning plans in Workbench
is a normal part of plan development. In many cases, testing a plan in Workbench is akin
to validating a mapping in PowerCenter, and need not be part of a formal test scenario.
However, the Data Quality Developer must be able to sign off on each plan as valid and
executable.
● What source data will be used for the plans? This is related to the testing issue
mentioned above. The final plans that operate on the project data are likely to operate on the complete project dataset; in every case, the plans will effect changes in the customer
data. Ideally, a complete ‘clone’ of the project dataset should be available to the Data
Quality Developer, so that the plans can be designed and tested on a fully faithful version
of the project data. At the minimum, a meaningful sample of the dataset should be
replicated and made available for plan design and test purposes.

Bear in mind that a plan that is published to a service domain repository will translate the data
source locations set at design time into new locations local to the new computer on which it
resides. See subtask 5.3.4 Design Run-time and Real-time Processes for Operate
Phase Execution and the Informatica Data Quality User Guide for more information.

Where will the plans be deployed? IDQ can be installed in a client-server configuration, with
multiple Workbench installations acting as clients to the IDQ server. The server employs service
domain architecture, so that a connected Workbench user can run a plan from a local or domain
repository to any Execution Service on the service domain. Likewise, the Data Quality Developer
may publish plans from Workbench to a remote repository on the IDQ service domain for execution
by other Data Quality Developers.

An important consideration here is whether the plans will be deployed as runtime plans. A plan is
considered a runtime plan if it is deployed in a scheduled or batch operation with other plans. In
such cases, the plan is run using a command line instruction. See subtask 5.3.4 Design Run-time
and Real-time Processes for Operate Phase Execution for details.

Bear in mind also that it is possible to add a plan to a mapping if the Data Quality Integration plug-
in has been installed client-side and server-side to PowerCenter. The Integration enables the
following types of interaction:

● It enables you to browse the Data Quality repository and add a data quality plan to the
Data Quality Integration transformation. The functional details of the plan are saved as
XML in the PowerCenter repository.
● It enables the PowerCenter Integration Service to send data quality plan XML to the Data
Quality engine when a session containing a Data Quality Integration transformation is run.

A plan designed for use in a PowerCenter mapping must set its data source and data sink
components to process data in realtime. A subset of the source and sink components can be
configured in this way (six out of twenty-one components).

Note that plans with realtime capabilities are also suitable for use in a request-response
environment, such as a point of data entry environment. These realtime plans can be called by a
third-party application to analyze keyboard data inputs and correct human error.

Best Practices
None

Sample Deliverables



None

Last updated: 12-Feb-07 15:05



Phase 5: Build
Subtask 5.3.4 Design Run-
time and Real-time
Processes for Operate
Phase Execution

Description

This subtask, along with subtask 5.3.3 Design and Execute Data Enhancement Processes, concerns the design and execution of the data quality plans that prepare the project data for the Data Integration component of the Build Phase and possibly later
phases.

While subtask 5.3.3 describes the creation and execution of plans through Data Quality
Workbench, this subtask focuses on the steps to deploy plans in a runtime or
scheduled environment. All data quality plans are created in Workbench. However,
there are several aspects to creating plans primarily for runtime use, and these are described in this subtask. Users who are creating plans should read both subtasks.

Because they can be scheduled and run in a batch, runtime plans present two
opportunities for the Data Quality Developer and the data project as a whole:

● A plan that may take several hours to run — such as a large-scale data
matching plan — can be scheduled to run overnight as a runtime plan.
● A runtime plan can be scheduled to run at regular intervals on the dataset to
analyze dataset quality; such plans can outlive the project in which they are
designed and provide a method for ongoing data monitoring in the enterprise.

Because runtime plans need not be run from a user interface, they are commonly
published or moved to a computer where higher performance is available. When
publishing or moving a runtime plan, consider the issues discussed in this subtask.

Prerequisites
None

Roles



Business Analyst (Review Only)

Data Quality Developer (Primary)

Technical Project Manager (Review Only)

Considerations

The two main factors to consider when planning to use runtime plans are:

● What data sources will the plan use?


● What reference files will the plan use?

In both cases, the source data and reference files must reside in locations that are
visible to Informatica Data Quality (IDQ). This is pertinent as the runtime plan will
typically be moved from its design-time computer to another computer for execution.

Data source locations are set in the plan at design time. If the plan connects to a
file, the name and path to the file(s) are set in the data source component. If the source
data is stored in a database, the same database connection must be available on the
machine to which the plans are moved. If the plan is run on the machine on which it
was designed, then the data locations can remain static — so long as the data source
details do not change. However, if the plan is moved to another machine, consider the
following questions:

Will the plan be run in an IDQ service domain? A plan moved to another machine
may be run through Data Quality Server (specifically, by a machine hosting a Data
Quality Execution Service). In this case, the Data Quality engine can run the plan from the repository, and you can publish the plan to the repository from the Workbench client.

When you publish a plan, bear in mind that IDQ recognizes a specific set of folders as
valid source file locations. If a Data Quality Developer defines a plan with a source file
stored in the following location on the Workbench computer:

C:\Myfiles\File.txt

A Data Quality Server on Windows will look for the file here:

C:\Program Files\Data Quality\users\user.name\Files\Myfiles



And a Data Quality Server on UNIX installed at /home/Informatica/dataquality/ will look
for the file here:

/home/informatica/dataquality/users/user.name/Files/Myfiles

where user.name is the logged-on Data Quality Developer name. (The Data Quality
Developer must be working on a Workbench machine that has a client connection to
the Data Quality Server.)

Path translations are platform-independent, that is, a Windows path will be mapped to a
UNIX path.
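
The translation convention just described can be sketched as follows. This is only an illustration of the mapping in the text, not Informatica code; the install roots and user name are placeholders.

```python
from pathlib import PureWindowsPath, PurePosixPath

def server_source_path(client_path: str, user_name: str,
                       server_root: str, unix: bool = False) -> str:
    """Map a source file path set at design time on the Workbench client to the
    location a Data Quality Server would look in, following the
    users/<user.name>/Files convention described above (placeholder values)."""
    client = PureWindowsPath(client_path)
    parts = client.parts[1:] if client.drive else client.parts  # drop the drive letter
    if unix:
        return str(PurePosixPath(server_root, "users", user_name, "Files", *parts))
    return str(PureWindowsPath(server_root, "users", user_name, "Files", *parts))

# Client-side plan setting: C:\Myfiles\File.txt
print(server_source_path(r"C:\Myfiles\File.txt", "user.name",
                         r"C:\Program Files\Data Quality"))
# C:\Program Files\Data Quality\users\user.name\Files\Myfiles\File.txt
print(server_source_path(r"C:\Myfiles\File.txt", "user.name",
                         "/home/informatica/dataquality", unix=True))
# /home/informatica/dataquality/users/user.name/Files/Myfiles/File.txt
```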

Are the source files in a non-standard location on the runtime computer? If a Data Quality Developer publishes a plan to a service domain repository for runtime
execution, and the plan source file is located in a non-standard location on the
executing computer, the Data Quality Developer can add a parameter file to the run
command, translating the location set in the plan into the required file location.

Will the plan be deployed to IDQ machines outside the service domain? If so, the plans must be saved as .xml files for runtime deployment. (Plans can also be saved
as .pln files for use in another instance of Workbench.) The Data Quality Developer can
set the run command to distinguish between plans stored in the Data Quality repository
and plans saved on the file system.

Do the plans use non-standard dictionary files, or dictionary/reference files in non-standard locations? The Data Quality Developer must check that any dictionary
or reference files added to a plan at design time are also available at the runtime
location.

If a plan uses standard dictionary files (i.e., the files that installed with the product) then
IDQ takes care of this automatically, as long as the plan resides on a service domain. If
a plan is published or copied to a network location and uses non-standard reference
files, these files must be copied to the a location that is recognizable to the IDQ
installation that will run the deployed plans. For more information on valid dictionary
and reference data files, see the Informatica Data Quality User Guide.

Implications for Plan Design

The above settings can have a significant bearing on plan design. When the Data
Quality Developer designs a plan in Workbench, he or she should ensure that the
folders created for file resources can map efficiently to the server folder structure.



For example, let’s say the Developer creates a data source file folder on a Workbench
installation at the following location:

C:\Program Files\Data Quality\Sources

When the plan runs on the server side, the Data Quality Server looks for the source file
in the following location:

C:\Program Files\Data Quality\users\user.name\Files\Program Files\Data Quality\Sources

Note that the folder path Program Files\Data Quality is repeated here: in this case,
good plan design suggests the creation of folders under C:\ that can be recreated
efficiently on the server.

Best Practices
None

Sample Deliverables
None

Last updated: 15-Feb-07 19:16



Phase 5: Build
Subtask 5.3.5 Develop
Inventory of Data Quality
Processes

Description

When the Data Quality Developer has designed and tested the plans to be used later in
the project, he or she must then create an inventory of the plans. This inventory should
be as exhaustive as possible. Data quality plans, once they achieve any size, can be
hard for personnel other than the Data Quality Developer to read. Moreover, other
project personnel and business users are likely to rely on the inventory to identify
where the plan functioned in the project.

Prerequisites
None

Roles

Data Quality Developer (Primary)

Considerations

For each plan created for use in the project (or for use in the Operate Phase and post-
project scenarios), the inventory document should answer the following questions. The
questions can be divided into two sections: one relating to the plan’s place and function
relative to the project and its objectives, and the other relating to the plan design itself.
The questions below are a subset of those included in the sample deliverable
document Data Quality Plan Documentation and Handover.

Project-related Questions

● What is the name of the plan?

● What project is the plan part of? Where does the plan fit in the overall project?



● What particular aspect of the project does the plan address?

● What are the objectives of the plan?

● What issues, if any, apply to the plan or its data?

● What department or group uses the plan output?

● What are the predicted ‘before and after’ states of the plan data?

● Where is the plan located (include machine details and folder location) and
when was it executed?

● Is the plan version-controlled? What are the creation/metadata details for the plan?

● What steps were taken or should be taken following plan execution?

Plan Design-related Questions

● What are the specific data or business objectives of the plan?

● Who ran (or should run) the plan, and when?

● In what version of IDQ was the plan designed?

● What Informatica application will run the plan, and on which machines will it run?

● Provide a screengrab of the plan layout in the Workbench user interface.

● What data source(s) are used?

● Where is the source located? What are the format and origin of the database
table?



● Is the source data an output from another IDQ plan, and if so, which one?

● Describe the activity of each component in the plan. Component functionality can be described at a high level or low level, as appropriate.

● What reference files or dictionaries are applied?

● What business rules are defined? This question can refer to the documented
business rules from subtask 5.3.1 Design Data Quality Technical Rules.
Provide the logical statements, if appropriate.

● What are the outputs for the instance, and how are they named?

● Where is the output written: report, database table, or file?

● Are there exception files? If so, where are they written?

● What is the next step in the project?

● Will the plan(s) be re-used (e.g., in a runtime environment)?

● Who receives the plan output data, and what actions are they likely to take? (A minimal sketch of an inventory entry follows this list.)
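
A minimal sketch of how one inventory entry might be captured in structured form. The field names simply mirror the questions above; they are illustrative assumptions, not a prescribed Velocity template.

from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical structure mirroring the project- and design-related questions above.
@dataclass
class PlanInventoryEntry:
    plan_name: str
    project: str
    objective: str
    known_issues: Optional[str] = None
    consuming_group: Optional[str] = None
    location: Optional[str] = None                       # machine and folder details
    version_controlled: bool = False
    idq_version: Optional[str] = None
    data_sources: List[str] = field(default_factory=list)
    reference_dictionaries: List[str] = field(default_factory=list)
    business_rules: List[str] = field(default_factory=list)
    outputs: List[str] = field(default_factory=list)     # report, table, or file targets
    exception_files: List[str] = field(default_factory=list)
    reusable_at_runtime: bool = False

# Example entry; the plan name and project are placeholders.
entry = PlanInventoryEntry(
    plan_name="customer_address_standardization",
    project="EDW Phase 5 Build",
    objective="Standardize and validate customer address data before load",
)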

Best Practices
None

Sample Deliverables
None

Last updated: 15-Feb-07 19:19



Phase 5: Build
Subtask 5.3.6 Review and
Package Data
Transformation
Specification Processes
and Documents

Description

In this subtask the Data Quality Developer collates all the documentation produced for
the data quality operations thus far in the project and makes them available to the
Project Manager, Project Sponsor, and Data Integration Developers — in short, to all
personnel who need them.

The Data Quality Developer must also ensure that the data quality plans themselves
are stored in locations known to and usable by the Data Integration Developers.

Prerequisites
None

Roles

Data Integration Developer (Secondary)

Data Quality Developer (Primary)

Technical Project Manager (Review Only)

Considerations

After the Data Quality Developer verifies that all data quality-related materials produced
in the project are complete, he or she should hand them all over to other interested
parties in the project. The Data Quality Developer should either arrange a handover
meeting with all relevant project roles or ask the Data Steward to arrange such a
meeting.



The Data Quality Developer should consider making a formal presentation at the
meeting and should prepare for a Q&A session before the meeting ends. The
presentation may constitute a PowerPoint slide show and may include dashboard
reports from data quality plans. The presentation should cover the following areas:

● Progress in treating the quality of the project data (‘before and after’ states of
the data in the key data quality areas)
● Success stories, lessons learned
● Data quality targets: met or missed?
● Recommended next steps for project data

Regarding data quality targets met or missed, the Data Quality Developer must be able
to say whether the data operated on is now in a position to proceed through the rest of
the project. If the Data Quality Developer believes that there are “show stopper” issues
in the data quality, he or she must inform the business managers and provide an
estimate of the work necessary to remedy the data issues. The business managers can
then decide if the data can pass to the next stage of the project or if remedial action is
appropriate.

The materials that the Data Quality Developer must assemble include:

● Inventory of data quality plans (prepared in subtask 5.3.5 Develop Inventory of Data Quality Processes).
● Data Quality plan files (.pln or .xml files), or locations of the Data Quality
repositories containing the plans.
● Details of backup data quality plans. (All Data Quality repositories containing
final plans should be backed up.)
● Inventory of business rules used in the plans (prepared in subtask 5.3.1
Design Data Quality Technical Rules).
● Inventory of dictionary and reference files used in the plans (prepared in
subtask 5.3.2 Determine Dictionary and Reference Data Requirements).
● Data Quality Audit results (prepared in task 2.8 Perform Data Quality Audit).
● Summary of task 5.3 Design and Build Data Quality Process.

Best Practices

Build Data Audit/Balancing Processes



Sample Deliverables

Data Quality Plan Design

Last updated: 01-Feb-07 18:47



Phase 5: Build
Task 5.4 Design and
Develop Data Integration
Processes

Description

A properly designed data integration process performs better and makes more efficient
use of machine resources than a poorly designed process. This task includes the
necessary steps for developing a comprehensive design plan for the data integration
process, which incorporates high-level standards such as error-handling strategies, and
overall load-processing strategies, as well as specific details and benefits of individual
mappings. Many development delays and oversights are attributable to an incomplete
or incorrect data integration process design, thus underscoring the importance of this
task.

When complete, this task should provide the development team with all of the detailed
information necessary to construct the data integration processes with minimal
interaction with the design team. This goal is somewhat unrealistic, however, because
requirements are likely to change, design elements need further clarification, and some
items are likely to be missed during the design process. Nevertheless, the goal of this
task should be to capture and document as much detail as possible about the data
integration processes prior to development.

Prerequisites
None

Roles

Business Analyst (Primary)

Data Integration Developer (Primary)

Data Warehouse Administrator (Secondary)

Database Administrator (DBA) (Primary)



Quality Assurance Manager (Primary)

Technical Project Manager (Review Only)

Considerations

The PowerCenter platform provides facilities for developing and executing mappings
for extraction, transformation and load operations. These mappings determine the flow
of data between sources and targets, including the business rules applied to the data
before it reaches a target. Depending on the complexity of the transformations, moving
data can be a simple matter of passing data straight from a data source through an
expression transformation to a target, or may involve a series of detailed
transformations that use complicated expressions to manipulate the data before it
reaches the target. The data may also undergo data quality operations inside or outside
PowerCenter mappings; note also that some business rules may be closely aligned
with data quality issues. (Pre-emptive steps to define business rules and to avoid data
errors may have been performed already as part of task 5.3 Design and Build Data
Quality Process.)

It is important to capture design details at the physical level. Mapping specifications should address field sizes, transformation rules, methods for handling errors or unexpected results in the data, and so forth. This is the stage where business rules are transformed into actual physical specifications, avoiding the use of vague terms and moving any business terminology to a separate "business description" area. For example, a field that stores "Total Cost" should not have a formula that reads 'Calculate total customer cost.' Instead, the formula for 'Total Cost' should be documented as:

Orders.Order_Qty * Item.Item_Price - Customer.Item_Discount where Orders.Item_Num = Item.Item_Num and Orders.Customer_Num = Customer.Customer_Num.

Data Migration projects differ from typical data integration projects in that they should have an established process and templates for most processes that are developed. This is because development is accelerated and more time is spent on data quality and driving out incomplete business rules than on traditional development. For migration projects, the data integration processes can be further subdivided into the following processes:

● Develop Acquire Processes
● Develop Convert Processes
● Develop Migrate/Load Processes



● Develop Audit Processes

Best Practices

Real-Time Integration with PowerCenter

Sample Deliverables
None

Last updated: 27-May-08 16:19



Phase 5: Build
Subtask 5.4.1 Design High
Level Load Process

Description

Designing the high-level load process involves the factors that must be considered
outside of the mapping itself. Determining load windows, availability of sources and
targets, session scheduling, load dependencies and session level error handling are all
examples of issues that developers should deal with in this task. Creating a solid load
process is an important part of developing a sound data integration solution.

This subtask incorporates three steps, all of which involve specific activities,
considerations, and deliverables. The steps are:

1. Identify load requirements. In this step, members of the development team work together to determine the load window. The load window is the amount of time available to load an individual table or an entire data warehouse or data mart. To begin this step, the team must have a thorough understanding of the business requirements developed in task 1.1 Define Project. The team should also consider the differences between the requirements for initial and subsequent loading; tables may be loaded differently in the initial load than they will be subsequently. The load document generated in this step describes the rules that should be applied to the session or mapping in order to complete the loads successfully.

2. Determine dependencies. In this step, the Database Administrator works with the Data Warehouse Administrator and Data Integration Developer to identify and document the relationships and dependencies that exist between tables within the physical database. These relationships affect the way in which a warehouse is loaded. In addition, the developers should consider other environmental factors, such as database availability, network availability, and other processes that may be executing concurrently with the data integration processes.

3. Create initial and ongoing load plan. In this step, the Data Integration Developer and Business Analyst use information created in the two earlier steps to develop a load plan document; this lists the estimated run times for the batches and sessions required to populate the data warehouse and/or data marts (a minimal sketch of such a plan entry follows).
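
A load plan entry can be sketched as a simple structure per session. The session names, run-time estimates, and the eight-hour window below are assumptions for illustration only, not a Velocity deliverable format.

from dataclasses import dataclass, field
from typing import List

# Hypothetical load plan entry: one row per session, carrying the estimates from
# step 1 and the dependencies documented in step 2.
@dataclass
class LoadPlanEntry:
    session_name: str
    target_table: str
    estimated_minutes: int
    depends_on: List[str] = field(default_factory=list)   # sessions that must finish first
    initial_load_only: bool = False

load_plan = [
    LoadPlanEntry("s_load_dim_customer", "DIM_CUSTOMER", 20),
    LoadPlanEntry("s_load_fact_orders", "FACT_ORDERS", 45,
                  depends_on=["s_load_dim_customer"]),
]

# Assumed 8-hour load window, used here only to show the kind of check the plan supports.
total_estimate = sum(e.estimated_minutes for e in load_plan)
assert total_estimate <= 8 * 60, "estimated run time exceeds the agreed load window"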

Prerequisites



None

Roles

Business Analyst (Review Only)

Data Integration Developer (Primary)

Data Warehouse Administrator (Secondary)

Database Administrator (DBA) (Primary)

Quality Assurance Manager (Approve)

Technical Project Manager (Review Only)

Considerations

Determining Load Requirements

The load window determined in step 1 of this subtask can be used by the Data Integration Developers as a performance target. Mappings should be tailored to ensure that their sessions run to successful completion within the constraints set by the load window requirements document. To assist with this goal, the Database Administrator, Data Warehouse Administrator, and Technical Architect are responsible for ensuring that their respective environments are tuned properly to allow for maximum throughput.

Subsequent loads of a table are often performed differently than the initial load; the initial load may involve only a subset of the database operations used by subsequent loads. For example, if the primary focus of a mapping is an update of a dimension, the dimension table will be empty before the first load of the warehouse. Consequently, the first load will perform a large number of inserts, while subsequent loads may perform a smaller number of both insert and update operations. The development team should consider and document such situations and convey the different load requirements to the developer creating the mappings and to the operations personnel configuring the sessions.



Identifying Dependencies

Foreign key (i.e., parent / child) relationships are the most common variable that should
be considered in this step. When designing the load plan, the parent table must always
be loaded before the child table, or integrity constraints (if applied) will be broken and
the data load will fail. The Data Integration Developer is responsible for documenting
these dependencies at a mapping level so that loads can be planned to coordinate with
the existence of dependent relationships. The Developer should also consider and
document other variables such as source and target database availability, network up/
down time, and local server processes unrelated to PowerCenter when designing the
load schedule.
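
One way to make the documented parent/child dependencies actionable is to derive a load order from them. The sketch below is a generic topological sort over hypothetical table relationships; it is an illustration of the reasoning, not a PowerCenter feature.

from graphlib import TopologicalSorter   # Python 3.9+

# Hypothetical foreign key dependencies: each table maps to the set of parent
# tables that must be loaded before it.
dependencies = {
    "DIM_CUSTOMER": set(),
    "DIM_PRODUCT": set(),
    "FACT_ORDERS": {"DIM_CUSTOMER", "DIM_PRODUCT"},
}

# static_order() raises CycleError if the documented relationships are circular,
# which would point to a modelling or documentation mistake worth flagging.
load_order = list(TopologicalSorter(dependencies).static_order())
print(load_order)   # e.g. ['DIM_CUSTOMER', 'DIM_PRODUCT', 'FACT_ORDERS']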

TIP
Load parent / child tables in the same mapping to speed development and
reduce the number of sessions that must be managed.

To load tables with parent / child relationships in the same mapping, use the
constraint-based loading option at the session level. Use the target load plan
option in PowerCenter Designer to ensure that the parent table is marked to be
loaded first. The parent table keys will be loaded before an associated child
foreign key is loaded into its table.

The load plans should be designed around the known availability of both source and
target databases; it is particularly important to consider the availability of source
systems, as these systems are typically beyond the operational control of the
development team. Similarly, if sources or targets are located across a network, the
development team should consult with the Network Administrator to discuss network
capacity and availability in order to avoid poorly performing batches and sessions.
Finally, although unrelated local processes executing on the server are not likely to
cause a session to fail, they can severely decrease performance by keeping available
processors and memory away from the PowerCenter server engine, thereby slowing
throughput and possibly causing a load window to be missed.

Best Practices
None

Sample Deliverables
None



Last updated: 15-Feb-07 19:21



Phase 5: Build
Subtask 5.4.2 Develop
Error Handling Strategy

Description

After the high-level load process is outlined and source files and tables are identified, a
decision needs to be made regarding how the load process will account for data errors.

The identification of a data error within a load process is driven by the standards of
acceptable data quality. The identification of a process error is driven by the stability
of the process itself. It is unreasonable to expect any source system to contain perfect
data. It is also unreasonable to expect any automated load process to execute correctly
100 percent of the time. Errors can be triggered by any number of events or scenarios,
including session failure, platform constraints, bad data, time constraints, mismatched
control totals, dependencies, or server availability.

The challenge in implementing an error handling strategy is to design mappings and load routines robust enough to handle any or all possible scenarios or events that may trigger an error during the course of the load process. The degree of complexity of the error handling strategy varies from project to project, depending on such variables as source data, target system, business requirements, load volumes, load windows, platform stability, end-user environments, and reporting tools.

The error handling development effort should include all the work that needs to be
performed to correct errors in a reliable, timely, and automated manner.

Several types of tasks within the Workflow Manager are designed to assist in error
handling. The following is a subset of these tasks:

● Command Task allows the user to specify one or more shell commands to
run during the workflow.
● Control Task allows the user to stop, abort, or fail the top-level workflow or
the parent workflow based on an input-link condition.
● Decision Task allows the user to enter a condition that determines the
execution of the workflow. This task determines how the PowerCenter
Integration Service executes a workflow.
● Event Task specifies the sequence of task execution in a workflow. The event
is triggered based on the completion of the sequence of tasks.



● Timer Task allows the user to specify the period of time to wait before the
Integration Service executes the next task in the workflow. The user can
choose to either set a specific time and date to start the next task or wait a
period of time after the start time of another task.
● Email Task allows the user to configure email to be sent to an administrator or business owner in the event that an error is encountered by a workflow task.

The Data Integration Developer is responsible for determining:

● What data gets rejected,
● Why the data is rejected,
● When the rejected rows are discovered and processed,
● How the mappings handle rejected data, and
● Where the rejected data is written (a minimal sketch of such a reject record follows this list).
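
A minimal sketch of a reject record that answers those questions for each rejected row. The file name, field names, and helper function are illustrative assumptions, not part of any PowerCenter reject-file format.

import csv, datetime

# Hypothetical reject record answering the what / why / when / where questions above.
def write_reject(reject_writer, row, reason, mapping_name):
    reject_writer.writerow({
        "rejected_at": datetime.datetime.now().isoformat(timespec="seconds"),
        "mapping": mapping_name,                                 # where the row was handled
        "reason": reason,                                        # why the row was rejected
        "row_data": "|".join(str(v) for v in row.values()),      # what was rejected
    })

with open("rejects_m_load_dim_customer.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["rejected_at", "mapping", "reason", "row_data"])
    writer.writeheader()
    write_reject(writer, {"CUST_ID": None, "NAME": "ACME"},
                 "CUST_ID is null", "m_load_dim_customer")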

Data integration developers should find an acceptable balance between the end users'
needs for accurate and complete information and the cost of additional time and
resources required to repair errors. The Data Integration Developer should consult
closely with the Data Quality Developer in making these determinations, and include in
the discussion the outputs from tasks 2.8 Perform Data Quality Audit and 5.3 Design
and Build Data Quality Process.

Prerequisites
None

Roles

Data Integration Developer (Primary)

Database Administrator (DBA) (Secondary)

Quality Assurance Manager (Approve)

Technical Project Manager (Review Only)

Considerations



Data Integration Developers should address the errors that commonly occur during the
load process in order to develop an effective error handling strategy. These errors
include:

● Session Failure. If a PowerCenter session fails during the load process, the failure of the session itself needs to be recognized as an error in the load process. The error handling strategy commonly includes a mechanism for notifying the process owner that the session failed, whether it is in the form of a message to a pager from operations or a post-session email from a PowerCenter Integration Service. There are several approaches to handling session failures within the Workflow Manager, including custom-written recovery routines with pre- and post-session scripts, workflow variables (such as the pre-defined task-specific variables or user-defined variables), and event tasks (e.g., the event-raise task and the event-wait task), which can be used to start specific tasks in reaction to a failed task.

● Data Rejected by Platform Constraints. A load process may reject certain data if the data itself does not comply with database and data type constraints. For instance:

❍ The database server will reject a row if the primary key field(s) of that row
already exists in the target.
❍ A PowerCenter Integration Service will reject a row if a date/time field is
sent to a character field without implicitly converting the data.

In both of these scenarios, the data will be rejected regardless of whether or not
it was accounted for in the code. Although the data is rejected without
developer intervention, accounting for it remains a challenge. In the first
scenario, the data will end up in a reject file on the PowerCenter server. In the
second scenario, the row of data is simply skipped by the Data Transformation
Manager (DTM) and is not written to the target or to any reject file. Both
scenarios require post-load reconciliation of the rejected data. An error handling
strategy should account for data that is rejected in this manner, either by parsing reject files or by balancing control totals.

● "Bad" Data. Bad data can be defined as data that enters the load process
from one or more source systems, but is prevented from entering the target
systems, which are typically staging areas, end-user environments, or
reporting environments. This data can be rejected by the load process itself or
designated as "bad" by the mapping logic created by developers.



Some of the reasons that bad data may be encountered between the time it is
extracted from the source systems and the time it is loaded to the target include:

❍ The data is simply incorrect.
❍ The data violates business rules.
❍ The data fails on foreign key validation.
❍ The data is converted improperly in a transformation.

The strategy that is implemented to handle these types of errors determines what data is available to the business as well as the accuracy of that data. This strategy can be developed with PowerCenter mappings, which flag records within the data flow for success or failure, based on the data itself and the logic applied to that data. The records flagged for success are written to the target while the records flagged for failure are written to a reject file or table for reconciliation.

● Data Rejected by Time Constraints. Load windows are typically pre-defined before data is moved to the target system. A load window is the time that is allocated for a load process to complete (i.e., start to finish) based on data volumes, business hours, and user requirements. If a load process does not complete within the load window, notification and data that has not been committed to the target system must be incorporated in the error handling strategy. Notification can take place via operations, email, or page. Data that has not been loaded within the window can be written to staging areas or processed in recovery mode at a later time.
● Irreconcilable Control Totals. One way to ensure that all data is being
loaded properly is to compare control totals captured on each session. Control
totals can be defined as detailed information about the data that is being
loaded in a session. For example, how many records entered the job stream?
How many records were written to target X? How many records were written to
target Y? A post-session script can be launched to reconcile the total records
read into the job stream with the total numbers written to the target(s). If the
number in does not match the number out, there may have been an error
somewhere in the load process.

To a degree, the PowerCenter session logs and repository tables store this type
of information. Depending on the level of detail desired to capture control totals,
some organizations run post-session reports against the repository tables and
parse the log files. Others, wishing to capture more in-depth information about
their loads, incorporate control totals in their mapping logic, spinning off check
sums, row counts, and other calculations during the load process. These totals



are then compared to figures generated by the source systems, triggering notification when numbers do not match up (a minimal reconciliation sketch follows this list).

The pmcmd command, gettaskdetails, provides information to assist in analyzing the loads. Issuing this command for a session task returns various data regarding a workflow, including the mapping name, session log file name, first error code and message, number of successful and failed rows from the source and target, and the number of transformation errors.

● Job Dependencies. Sessions and workflows can be configured to run based on dependencies. For example, the start of a session can be dependent on the availability of a source file. Or a batch may have sessions embedded that are dependent on each other's completion. If a session or batch fails at any point in the load process because of a dependency violation, the error handling strategy should catch the problem.

The use of events allows the sequence of execution within a workflow to be specified: an event raised on completion of one set of tasks triggers the initiation of another. There are two event tasks that can be included in a workflow: event-raise and event-wait tasks. An event-wait task instructs the Integration Service to wait for a specific event to be raised before continuing with the workflow, while an event-raise task triggers an event at a particular point in a workflow. Events themselves can either be defined by the user or pre-defined (i.e., a file watch event) by PowerCenter.

● Server Availability. If a node is unavailable at runtime, any sessions and workflows scheduled on it will not be run if it is the only resource configured within a domain. Similarly, if a PowerCenter Integration Service goes down during a load process, the sessions and workflows currently running on it will fail if it is the only service configured in the domain. Problems such as this are usually directly related to the stability of the server platform; network interrupts do happen, database servers do occasionally go down, and log/file space can inadvertently fill up. A thorough error handling strategy should assess and account for the probability of services not being available 100 percent of the time. This strategy may vary considerably depending on the PowerCenter configuration employed. For example, PowerCenter's High Availability options can be harnessed to eliminate many single points of failure within a domain and can help to ensure minimal service interruption.
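
The control-total comparison described under "Irreconcilable Control Totals" above can be sketched as a simple reconciliation check. The counts shown are placeholders; in practice the figures would come from session logs, repository queries, or totals produced by the mappings themselves, and the notification call is a stand-in.

# Hypothetical control totals for one session.
control_totals = {
    "source_rows_read": 100_000,
    "rows_written_target": 99_250,
    "rows_rejected": 700,
}

def reconcile(totals):
    accounted_for = totals["rows_written_target"] + totals["rows_rejected"]
    discrepancy = totals["source_rows_read"] - accounted_for
    if discrepancy != 0:
        # Trigger notification (email, page, operations ticket) per the error
        # handling strategy; the print below is only a stand-in.
        print(f"Control totals do not reconcile: {discrepancy} rows unaccounted for")
    return discrepancy

reconcile(control_totals)   # reports 50 unaccounted rows in this example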

Ensure Data Accuracy and Integrity



In addition to anticipating the common load problems, developers need to investigate
potential data problems and the integrity of source data. One of the main goals of the
load process is to ensure the accuracy of the data that is committed to the target
systems. Because end users typically build reports from target systems and managers
make decisions based on their content, the data in these systems must be sufficiently
accurate to provide users with a level of confidence that the information they are
viewing is correct.

The accuracy of the data, before any logic is applied to it, is dependent on the source
systems from which it is extracted. It is important, therefore, for developers to identify
the source systems and thoroughly examine the data in them. Task 2.8 Perform Data
Quality Audit is specifically designed to establish such knowledge about project data
quality, and task 5.3 Design and Build Data Quality Process is designed specifically to
eliminate data quality problems as far as possible before data enters the Build Phase
of the project. In the absence of dedicated data quality steps such as these, one
approach is to estimate, along with source owners and data stewards, how much of the
data is still bad (vs. good) on a column-by-column basis, and then to determine which
data can be fixed in either the source or the mappings, and which does not need to be
fixed before it enters the target. However, the dedicated data quality approach is preferable as it (1)
provides metrics to business and project personnel and (2) provides an effective means
of addressing data quality problems.

Data Integrity deals with the internal relationships of the data in the system and how
those relationships are maintained (i.e., data in one table must match corresponding
data in another table). When relationships cannot be maintained because of incorrect
information entered from the source systems, the load process needs to determine if
processing can continue or if the data should be rejected.

Including lookups in a mapping is a good way of checking for data integrity. Lookup
tables are used to match and validate data based upon key fields. The error handling
process should account for the data that does not pass validation. Ideally, data integrity
issues will not arise since the data has already been processed in the steps described
in task 4.6.

Determine Responsibility For Data Integrity/Business Data Errors

Since it is unrealistic to expect any source system to contain data that is 100 percent
accurate, it is essential to assign the responsibilities of correcting data errors. Taking
ownership of these responsibilities throughout the project is vital to correcting errors
during the load process. Specifically, individuals should be held accountable for:



● Providing business information
● Understanding the data layout
● Data stewardship (understanding the meaning and content of data elements)
● Delivering accurate data

Part of the load process validates that the data conforms to known rules from the
business. When these rules are not met by the source system data, the process should
handle these exceptions in an appropriate manner. End users should either accept the
consequences of permitting invalid data to enter the target system or they should
choose to reject the invalid data. Both options involve complex issues for the business
organization.

The individuals responsible for providing business information to the developers must
be knowledgeable and experienced in both the internal operations of the organization
and the common practices of the relevant industry. It is important to understand the
data and functionality of the source systems as well as the goals of the target
environment. If developers are not familiar with the business practices of the
organization, it is practically impossible to make valid judgments about which data
should be allowed in the target system and which data should be flagged for error
handling.

The primary purpose for developing an error handling strategy is to prevent data that
inaccurately portrays the state of the business from entering the target system.
Providers of business information play a key role in distinguishing good data from bad.

The individuals responsible for maintaining the physical data structures play an equally
crucial role in designing the error handling strategy. These individuals should be
thoroughly familiar with the format, layout, and structure of the data. After
understanding the business requirements, developers must gather data content
information from the individuals that have first-hand knowledge of how the data is laid
out in the source systems and how it is to be presented in the target systems. This
knowledge helps to determine which data should be allowed in the target system based
on the physical nature of the data as opposed to the business purpose of the data.

Data stewards, or their equivalent, are responsible for the integrity of the data in and
around the load process. They are also responsible for maintaining translation tables,
codes, and consistent descriptions across source systems. Their presence is not
always required, depending on the scope of the project, but if a data steward is
designated, he or she will be relied upon to provide developers with insight into such
things as valid values, standard codes, and accurate descriptions.



This type of information, along with robust business knowledge and a degree of
familiarity with the data architecture, will give the Build team the necessary level of
confidence to implement an error handling strategy that can ensure the delivery of
accurate data to the target system. Data stewards are also responsible for correcting
the errors that occur during the load process and in their field of expertise. If, for
example, a new code is introduced from the source system that has no equivalent in a
translation table, it should be flagged and presented to the data steward for review. The
data steward can determine if the code should be in the translation table, and if it
should have been flagged for error. The goal is to have the developers design the error
handling process according to the information provided by the experts. The error
handling process should recognize the errors and report them to the owners with the
relevant expertise to fix them.

For Data Migration projects, it is important to develop a standard method to track data
exceptions. Normally this tracking data is stored in a relational database with a
corresponding set of exception reports. By developing this important standardized
strategy, all data cleansing and data correction development will be expedited due to
having a predefined method of determining what exceptions have been raised and
which data caused the exception.
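
A minimal sketch of the kind of standardized exception store described above, using an in-memory SQLite table as a stand-in for the project's relational database. The table name, columns, and mapping names are assumptions for illustration only.

import sqlite3, datetime

# Stand-in for the relational exception-tracking store described above.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE data_exception (
        exception_id   INTEGER PRIMARY KEY,
        raised_at      TEXT,
        mapping_name   TEXT,
        source_key     TEXT,
        rule_violated  TEXT,
        detail         TEXT
    )
""")

def log_exception(mapping_name, source_key, rule_violated, detail=""):
    conn.execute(
        "INSERT INTO data_exception (raised_at, mapping_name, source_key, rule_violated, detail) "
        "VALUES (?, ?, ?, ?, ?)",
        (datetime.datetime.now().isoformat(timespec="seconds"),
         mapping_name, source_key, rule_violated, detail))

log_exception("m_convert_customer", "CUST-0042", "missing mandatory TAX_ID")

# A simple exception report grouped by rule, as input for data correction work.
for rule, count in conn.execute(
        "SELECT rule_violated, COUNT(*) FROM data_exception GROUP BY rule_violated"):
    print(rule, count)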

Best Practices

Disaster Recovery Planning with PowerCenter HA Option

Sample Deliverables
None

Last updated: 04-Dec-07 18:18



Phase 5: Build
Subtask 5.4.3 Plan
Restartability Process

Description

The process of updating a data warehouse with new data is sometimes described as
"conducting a fire drill". This is because it often involves performing data updates within
a tight timeframe, taking all or part of the data warehouse off-line while new data is
loaded. While the update process is usually very predictable, it is possible for
disruptions to occur, stopping the data load in mid-stream.

To minimize the amount of time required for data updates and further ensure the quality
of data loaded into the warehouse, the development team must anticipate and plan for
potential disruptions to the loading process. The team must design the data integration
platform so that the processes for loading data into the warehouse can be restarted
efficiently in the event that they are stopped or disrupted.

Prerequisites
None

Roles

Data Integration Developer (Primary)

Database Administrator (DBA) (Secondary)

Quality Assurance Manager (Approve)

Technical Project Manager (Review Only)

Considerations

Providing backup schemas for sources and staging areas for targets is one step toward
improving the efficiency with which a stopped or failed data loading process can be
restarted. Source data should not be changed prior to restarting a failed process, as
this may cause the PowerCenter server to return missing or repeat values. A backup



source schema allows the warehouse team to store a snapshot of source data, so that
the failed process can be restarted using its original source. Similarly, providing a
staging area for target data gives the team the flexibility of truncating target tables prior
to restarting a failed process, if necessary.

If flat file sources are being used, all sources should be date-stamped and stored until the loading processes using those sources have successfully completed. A script can be incorporated into the data update process to delete or move flat file sources only upon successful completion of the update.

A second step in planning for efficient restartability is to configure PowerCenter sessions so that they can be easily recovered. Sessions in workflows manage the process of loading data into a data warehouse.

TIP
You can configure the links between sessions to only trigger downstream
sessions upon success status.

Also, PowerCenter versions 6 and above have the ability to configure a Workflow
to Suspend on Error. This places the workflow in a state of suspension, so that
the environmental problem can be assessed and fixed, while the workflow can be
resumed from the point of suspension.

Follow these steps to identify and create points of recovery within a workflow:

1. Identify the major recovery points in the workflow. For example, suppose a workflow has tasks A, B, C, D, and E that run in sequence. In this workflow, if a failure occurs at task A, you can restart the task and the workflow will automatically recover. Since session A is able to recover by merely restarting it, it is not a major recovery point. On the other hand, if task B fails, it will impact data integrity or subsequent runs. This means that task B is a major recovery point. All tasks that may impact data integrity or subsequent runs should be recovery points (a minimal sketch of this decision follows these steps).
2. Identify the strategy for recovery.
❍ Build restartability into the mapping. If data extraction from the source is datetime-driven, create a delete path within the mapping and run the workflow in suspend mode. When configuring sessions, if multiple sessions are to be run, arrange the sessions in a sequential manner within a workflow. This is particularly important if mappings in later sessions are dependent on data created by mappings in earlier sessions.



❍ Include transaction controls in mappings. Create a copy of the workflow, and create a session-level override and a start-from date where recovery is required. One option is to delete records from the target and restart the process. Also, in some cases a special mapping that has a filter on the source may be required. This filter should be based on the recovery date or other relevant criteria.
3. Use the high availability feature in PowerCenter.
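
The recovery-point reasoning in the steps above can be sketched as a simple restart decision. The task names, flags, and messages are hypothetical and only illustrate the logic described in step 1; this is not PowerCenter functionality.

# Hypothetical workflow: each task is flagged as a major recovery point if its
# failure would affect data integrity or subsequent runs (see step 1 above).
workflow = [
    {"task": "A", "major_recovery_point": False},
    {"task": "B", "major_recovery_point": True},
    {"task": "C", "major_recovery_point": False},
    {"task": "D", "major_recovery_point": True},
    {"task": "E", "major_recovery_point": False},
]

def restart_strategy(failed_task):
    step = next(s for s in workflow if s["task"] == failed_task)
    if step["major_recovery_point"]:
        # Data integrity or later runs may be affected: apply the documented
        # recovery strategy (delete path, transaction controls, session override).
        return f"recover task {failed_task} using its documented recovery strategy"
    # Otherwise a simple restart of the failed task is sufficient.
    return f"restart task {failed_task} and resume the suspended workflow"

print(restart_strategy("A"))   # simple restart
print(restart_strategy("B"))   # needs the documented recovery strategy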

Other Ways to Design Restartability

PowerCenter Workflow Manager provides the ability to use post-session emails or to create email tasks to send notification to designated recipients informing them about a session run. Configure sessions so that an email is sent to the Workflow Operator when a session or workflow fails. This allows the operator to respond to the failed session as soon as possible.

On the session property screen, configure the session to stop if errors occur in pre-session scripts. If the session stops, review and revise scripts as necessary.

Determine whether or not a session really needs to be run in bulk mode. Successful
recovery on a bulk-load session is not guaranteed, as bulk loading bypasses the
database log. While running a session in bulk load can increase session performance,
it may be easier to recover a large, normal loading session, rather than truncating
targets and re-running a bulk-loaded session.

If a session stops because it has reached a designated number of non-fatal errors (such as Reader, Writer, or DTM errors), consider increasing the possible number of non-fatal errors allowed, or de-selecting the "Stop On" option in the session property screen.

Always be sure to examine log files when a session stops, and research and resolve
potential reasons for the stop.

Data Migration Projects often have a need to migrate significant volumes of data. Due
to this fact, re-start processing should be considered in the Architect Phase and
throughout the Design Phase and Build Phase. In many cases a full refresh is the
best course of action. However, if large amounts of data need to be loaded, then the
final load processes should include re-start processing design which should be
prototyped during the Architect Phase. This will limit the amount of time lost if any



large-volume load fails.

Best Practices

Disaster Recovery Planning with PowerCenter HA Option

Sample Deliverables
None

Last updated: 04-Dec-07 18:16



Phase 5: Build
Subtask 5.4.4 Develop
Inventory of Mappings &
Reusable Objects

Description

The next step in designing the data integration processes is breaking the development
work into an inventory of components. These components then become the work tasks
that are divided among developers and subsequently unit tested. Each of these
components would help further refine the project plan by adding the next layer of detail
for the tasks related to the development of the solution.

Prerequisites
None

Roles

Data Integration Developer (Primary)

Considerations

The smallest divisions of assignable work in PowerCenter are typically mappings and
reusable objects. The Inventory of Reusable Objects and Inventory of Mappings
created during this subtask are valuable high-level lists of development objects that
need to be created for the project.

Naturally, the lists will not be completely accurate at this point; they will be added to
and subtracted from over the course of the project and should be continually updated
as the project moves forward. Despite the ongoing changes, however, these lists are
valuable tools, particularly from the perspective of the lead developer and project
manager, because the objects on these lists can be assigned to individual developers
and their progress tracked over the course of the project.

A common mistake is to assume a source to target mapping document equates to a single mapping. This is often not the case. To load any one target table, it might easily take more than one mapping to perform all of the needed tasks to correctly populate the table.



Assume the case of loading a Data Warehouse Dimension table for which you have one source to target matrix document. You might then generate:

● Source to Staging Area Mapping (Incremental)
● Data Cleansing and Rationalization Mapping
● Staging to Warehouse Update/Insert Mapping
● Primary Key Extract Mapping (full extract of primary keys used in the delete mapping)
● Logical Delete Mapping (mark dimension records as deleted if they no longer appear in the source)

It is important to break down the work into this level of detail because from the list
above, you can see how a single source to target matrix may generate 5 separate
mappings that could each be developed by different developers. From a project
planning perspective, it is then useful to track each of these 5 mappings separately for
status and completion.
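
A minimal sketch of how those five mappings might appear in the inventory, with assignment and status fields added for tracking. The mapping names, developer identifiers, and status values are illustrative assumptions, not a prescribed template.

from collections import Counter

# Hypothetical inventory rows for the dimension-load example above.
mapping_inventory = [
    {"mapping": "m_src_to_stg_customer_incr",  "developer": "dev1", "status": "in progress"},
    {"mapping": "m_cleanse_customer",          "developer": "dev2", "status": "not started"},
    {"mapping": "m_stg_to_dim_customer_upsert","developer": "dev1", "status": "not started"},
    {"mapping": "m_extract_customer_pk",       "developer": "dev3", "status": "complete"},
    {"mapping": "m_logical_delete_customer",   "developer": "dev3", "status": "not started"},
]

# A quick status roll-up that the lead developer or project manager can track over time.
print(Counter(row["status"] for row in mapping_inventory))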

Also included in your mapping inventory are the special purpose mappings that are
involved in the end to end process but not specifically defined by the business
requirements and source to target matrixes. These would include audit mappings,
aggregate mappings, mapping generation mappings, templates and other objects that
will need to be developed during the build phase.

For reusable objects, it is important to keep a holistic view of the project in mind when determining which objects are reusable and which ones are custom built. Sometimes an object that seems sharable across any mapping that uses it may need different versions depending on its purpose.

Having a list of the common objects that are being developed across the project allows individual developers to better plan their mapping-level development efforts. By knowing that a particular mapping is going to use four reusable objects, they can focus on the work unique to that particular mapping and not duplicate the functionality of those four objects. This is another area where Metadata Manager can become very useful for developers who want to do where-used analysis for objects. As a result of the processes and tools implemented during the project, developers can achieve the communication and coordination needed to improve productivity.

Best Practices

Working with Pre-Built Plans in Data Cleanse and Match

Sample Deliverables

Mapping Inventory

Last updated: 01-Feb-07 18:47



Phase 5: Build
Subtask 5.4.5 Design
Individual Mappings &
Reusable Objects

Description

After the Inventory of Mappings and Inventory of Reusable Objects is created, the next
step is to provide detailed design for each object on each list. The detailed design
should incorporate sufficient detail to enable developers to complete the task of
developing and unit testing the reusable objects and mappings. These details include
specific physical information, down to the table, field, and datatype level, as well as
error processing and any other information requirements identified.

Prerequisites
None

Roles

Business Analyst (Secondary)

Data Integration Developer (Primary)

Considerations

A detailed design must be completed for each of the items identified in the Inventory of
Mapping and Inventory of Reusable Objects. Developers use the documents created in
subtask 5.4.4 Develop Inventory of Mappings & Reusable Objects to construct the
mappings and reusable objects, as well as any other required processes.

Reusable Objects

Three key items should be documented for the design of reusable objects: inputs,
outputs, and the transformations or expressions in between.

Developers who have a clear understanding of what reusable objects are available are
likely to create better mappings that are easy to maintain. For the project, consider



creating a shared folder for common objects like sources, targets, and
transformations. When you want to use these objects, you can create shortcuts that
point to the object. Document the process and the available objects in the shared
folder. In a multi-developer environment, assign a developer the task of keeping the
objects organized in the folder, and updating sources and targets when appropriate.

It is crucial to document reusable objects, particularly in a multi-developer environment. For example, if one developer creates a mapplet that calculates tax rate, the other developers must understand the mapplet in order to use it properly. Without documentation, developers have to browse through the mapplet objects to try to determine what the mapplet is doing. This is time consuming and often overlooks vital components of the mapplet. Documenting reusable objects provides a comprehensive overview of the workings of relevant objects and helps developers determine if an object is applicable in a specific situation.

Mappings

Before designing a mapping, it is important to have a clear picture of the end-to-end processes that the data will flow through. Then, design a high-level view of the mapping and document a picture of the process within the mapping, using a textual description to explain exactly what the mapping is supposed to accomplish and the methods or steps it follows to accomplish its goal.

After the high-level flow has been established, it is important to document pre-mapping
logic. Special joins for the source, filters, or conditional logic should be made clear
upfront. The data being extracted from the source system dictates how the developer
implements the mapping. Next, document the details at the field level, listing each of
the target fields and the source field(s) that are used to create the target field.
Document any expression that may take place in order to generate the target field (e.g.,
a sum of a field, a multiplication of two fields, a comparison of two fields, etc.).
Whatever the rules, be sure to document them and remember to keep it at a physical
level. The designer may have to do some investigation at this point for business rules
as well. For example, the business rules may say, "For active customers, calculate a
late fee rate". The designer of the mapping must determine that, on a physical level,
that translates to 'for customers with an ACTIVE_FLAG of "1", multiply the
DAYS_LATE field by the LATE_DAY_RATE field'.
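
The physical translation of the late-fee rule above can be stated directly in code. The field names come from the example; the function wrapper and sample values are illustrative only.

# Physical-level statement of the rule from the example above: for customers
# with an ACTIVE_FLAG of "1", late fee = DAYS_LATE * LATE_DAY_RATE.
def late_fee(customer_row):
    if customer_row["ACTIVE_FLAG"] == "1":
        return customer_row["DAYS_LATE"] * customer_row["LATE_DAY_RATE"]
    return 0.0

print(late_fee({"ACTIVE_FLAG": "1", "DAYS_LATE": 12, "LATE_DAY_RATE": 1.50}))   # 18.0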

Document any other information about the mapping that is likely to be helpful in
developing the mapping. Helpful information may, for example, include source and
target database connection information, lookups and how to match data in the lookup
tables, data cleansing needed at a field level, potential data issues at a field level, any
known issues with particular fields, pre or post mapping processing requirements, and
any information about specific error handling for the mapping.



The completed mapping design should then be reviewed with one or more team
members for completeness and adherence to the business requirements. The design
document should be updated if the business rules change or if more information is
gathered during the build process.

The mapping and reusable object detailed designs are a crucial input for building the
data integration processes, and can also be useful for system and unit testing. The
specific details used to build an object are useful for developing the expected results to
be used in system testing.

For Data Migrations, often the mappings are very similar for some of the stages, such
as populating the reference data structures, acquiring data from the source, loading the
target and auditing the loading process. In these cases, it is likely that a detailed
‘template’ is documented for these mapping types. For mapping specific alterations
such as converting data from source to target format, individual mapping designs
may be created. This strategy reduces the sheer documentation required for the
project, while still providing sufficient detail to develop the solution.

Best Practices
None

Sample Deliverables
None

Last updated: 15-Feb-07 19:26



Phase 5: Build
Subtask 5.4.6 Build
Mappings & Reusable
Objects

Description

With the analysis and design steps complete, the next priority is to put everything
together and build the data integration processes, including the mappings and reusable
objects.

Reusable objects can be very useful in the mapping building process. By this point,
most reusable objects should have been identified, although the need for additional
objects may become apparent during the development work. Commonly-used objects
should be put into a shared folder to allow for code reuse via shortcuts. The mapping
building process also requires adherence to naming standards, which should be
defined prior to beginning this step. Developing, and consistently using, naming
standards helps to ensure clarity and readability for the original developer and
reviewers, as well as for the maintenance team that inherits the mappings after
development is complete.

In addition to building the mappings, this subtask involves updating the design
documents to reflect any changes or additions found necessary to the original design.
Accurate, thorough documentation helps to ensure good knowledge transfer and is
critical to project success.

Once the mapping is completed, a session must be made for the mapping in Workflow
Manager. A unit testing session can be created initially to test that the mapping logic is
executing as designed. To identify and troubleshoot problems in more detail, the debug
feature may be leveraged; this feature is useful for looking at the data as it flows
through each transformation. Once the initial session testing proves satisfactory, then
pre- and post-session processes and session parameters should be incorporated and
tested (if needed), so that the session and all of its processes are ready for unit testing.

Prerequisites
None

Roles



Data Integration Developer (Primary)

Database Administrator (DBA) (Secondary)

Considerations

Although documentation for building the mapping already exists in the design
document, it is extremely important to document the sources, targets, and
transformations in the mapping at this point to help end users understand the flow of
the mapping and ensure effective knowledge transfer.

Importing the sources and targets is the first step in building a mapping. Although the targets and sources are determined during the Design Phase, the keys, fields, and definitions should be verified in this subtask to ensure that they correspond with the design documents.

TIP
When data modeling or database design tools (e.g., CA ERwin, Oracle Designer/2000, or Sybase PowerDesigner) are used in the design phase, Informatica PowerPlugs can be helpful for extracting the data structure definitions of sources and targets. Metadata Exchange for Data Models extracts table, column, index, and relationship definitions, as well as descriptions, from a data model. This can save significant time because the PowerPlugs also import documentation and help users to understand the source and target structures in the mapping.

For more information about Metadata Exchange for Data Models and PowerPlugs, refer to Informatica's web site (www.informatica.com) or the Metadata Exchange for Data Models manuals.

The design documents may specify that data can be obtained from numerous sources,
including DB/2, Informix, SQL Server, Oracle, Sybase, ASCII/EBCDIC flat files
(including OCCURS and REDEFINES), Enterprise Resource Planning (ERP)
applications, and mainframes via PowerExchange data access products. The design
documents may also define the use of target schema and specify numerous ways of
creating the target schema. Specifically, target schema may be created:

● From scratch.



● From a default schema that is then modified as desired.
● With the help of the Cubes and Dimensions wizard (for multidimensional data
models).
● By reverse-engineering the target from the database.
● With Metadata Exchange for Data Models.
TIP
When creating sources and targets in PowerCenter Designer, be sure to include
a description of the source/target in the object's comment section, and follow the
appropriate naming standards identified in the design documentation (for
additional information on source and target objects, refer to the PowerCenter
User Guide).

Reusable objects are useful when standardized logic is going to be used in multiple mappings. One type of reusable object is the mapplet. Mapplets represent a set of transformations and are constructed in the Mapplet Designer, much like creating
a "normal" mapping. When mapplets are used in a mapping, they encapsulate logic
into a single transformation object, making the flow of a mapping easier to understand.
However, because the mapplets hide their underlying logic, it is particularly important to
carefully document their purpose and function.

Other types of reusable objects, such as reusable transformations, can also be very
useful in mapping. When reusable transformations are used with mapplets, they
facilitate the overall mapping maintenance. Reusable transformations can be built in
either of two ways:

● If the design specifies that a transformation should be reusable, it can be created in the Transformation Developer, which automatically creates reusable transformations.
● If shared logic is not identified until it is needed in more than one mapping, transformations created in the Mapping Designer can be designated as reusable in the Edit Transformation dialog box. Informatica recommends using this method with care, however, because after a transformation is changed to reusable, the change cannot be undone. Changes to a reusable transformation are reflected immediately in all mappings that employ the transformation.

When all the transformations are complete, everything must be linked together (as
specified in the design documentation) and arrangements made to begin unit testing.



Best Practices

Data Connectivity using PowerCenter Connect for Web Services

Development FAQs

Using Parameters, Variables and Parameter Files

Using Shortcut Keys in PowerCenter Designer

Sample Deliverables
None

Last updated: 17-Oct-07 17:23



Phase 5: Build
Subtask 5.4.7 Perform Unit Test

Description

The success of the solution rests largely on the integrity of the data available for
analysis. If the data proves to be flawed, the solution initiative is in danger of failure.
Complete and thorough unit testing is, therefore, essential to the success of this type of
project. Within the presentation layer, there is always a risk of performing less than
adequate unit testing. This is due primarily to the iterative nature of development and
the ease with which a prototype can be deployed. Experienced developers are,
however, quick to point out that data integration solutions and the presentation layers
should be subject to more rigorous testing than transactional systems. To underscore
this point, consider which poses a greater threat to an organization: sending a supplier
an erroneous purchase order or providing a corporate vice president with flawed
information about that supplier's ranking relative to other strategic suppliers?

Prerequisites
None

Roles

Business Analyst (Review Only)

Data Integration Developer (Primary)

Considerations

Successful unit testing examines any inconsistencies in the transformation logic and
ensures correct implementation of the error handling strategy.

The first step in unit testing is to build a test plan (see Unit Test Plan). The test plan
should briefly discuss the coding inherent in each transformation of a mapping and
elaborate on the tests that are to be conducted. These tests should be based upon the
business rules defined in the design specifications rather than on the specific code
being tested. If unit tests are based only upon the code logic, they run the risk of
missing inconsistencies between the actual code and the business rules defined during
the Design Phase.

If the transformation types include data quality transformations (that is, transformations
based on the Data Quality Integration transformation that links to Informatica Data
Quality (IDQ) software), then the data quality processes (or plans) defined in IDQ are
also candidates for unit testing. Good practice holds that all data quality plans that are
going to be used on project data, whether as part of a PowerCenter transformation or
a discrete process, should be tested before formal use on such data. Consider
establishing a discrete unit test stage for data quality plans.

Test data should be available from the initial loads of the system. Depending on
volumes, a sample of the initial load may be appropriate for development and unit
testing purposes. It is important to use actual data in testing since test data does not
necessarily cover all of the anomalies that are possible with true data, and creating test
data can be very time consuming. However, depending upon the quality of the actual
data used, it may be necessary to create test data in order to test any exception, error,
and/or value threshold logic that may not be triggered by actual data.

While it is possible to analyze test data without tools, there are many good tools
available for creating and manipulating test data. Some are useful for editing data in a
flat file, and most offer some improvement in productivity.

A detailed test script is essential for unit testing; the test scripts indicate the
transformation logic being tested by each test record and should contain an expected
result for each record.

TIP
Session log tracing can be set at the transformation level within a mapping, in a
session's "Mapping" tab, or in a session's "Config Object" tab. For testing, it is
generally good practice to override logging in a session's "Mapping" tab
transformation properties. For instance, if you are testing the logic performed in
a Lookup transformation, create a test session and only activate verbose data
logging on the appropriate Lookup. This focuses the log file on the unit test at
hand.

If you change the tracing level in the mapping itself, you will have to go back
and modify the mapping after the testing has been completed. If you override
tracing in a session's "Config Object" tab properties, this will affect all
transformation objects in the mapping and potentially create a significantly
larger session log to parse.



It is also advisable to activate the test load option in the PowerCenter session
properties and indicate the number of test records that are to be sourced. This
ensures that the session does not write data to the target tables. After running
a test session, analyze and document the actual results compared to the
expected results outlined in the test script.

Running the mapping in the Debugger also allows you to view the target data
without the session writing data to the target tables. You can then document
the actual results as compared to the expected results outlined in the test
script. The ability to change the data running through the mapping while in
debug mode is an extremely valuable tool because it allows you to test all
conditions and logic as you step through the mapping, thereby ensuring
appropriate results.

The first session should load test data into empty targets. After checking for errors from
the initial load, a second run of test data should occur if the business requirements
demand periodic updates to the target database.

A thorough unit test should uncover any transformation flaws and document the
adjustments needed to meet the data integration solution's business requirements.

Best Practices
None

Sample Deliverables

Defect Log

Defect Report

Test Condition Results

Last updated: 01-Feb-07 18:47



Phase 5: Build
Subtask 5.4.8 Conduct Peer Reviews

Description

Peer review is a powerful technique for uncovering and resolving issues that otherwise
would be discovered much later in the development process (i.e., during testing) when
the cost of fixing is likely to be much higher. The main types of object that can be
subject to formal peer review are: documents, code, and configurations.

Prerequisites
None

Roles

Data Integration Developer (Secondary)

Quality Assurance Manager (Primary)

Considerations

The peer review process encompasses several steps, which vary depending on the
object (i.e., document, code, etc.) being reviewed. In general, the process should
include these steps:

● When an author confirms that an object has reached a suitable stage for review, he or she communicates this to the Quality Assurance Manager, who then schedules a review meeting.
● The number of reviewers at the meeting depends on the type of review, but should be limited to the minimally acceptable number. For example, the review meeting for a design document may include the business analyst who specified the requirements, a design authority, and one or two technical experts in the particular design aspects.
● It is a good practice to select reviewers with a direct interest in the deliverable. For example, the DBA should be involved in reviewing the logical data model to ensure that he/she has sufficient information to conduct the physical design.
● If possible, appropriate documents, code, and the review checklist should be distributed prior to the review meeting to allow preparation.


● The author or Quality Assurance Manager should lead the meeting to ensure
that it is structured and stays on point. The meeting should not be allowed to
become bogged down in resolving defects, but should reach consensus on
rating the object using a High/Medium/Low scale.
● During the meeting, reviewers should look at the object point-by-point and note
any defects found in the Defect Log. Trivial items such as spelling or
formatting errors should not be recorded in the log (to avoid ‘clutter’).
● If the number and impact of defects is small, the Quality Assurance Manager
may decide to conduct an informal mini-review after the defects are corrected
to ensure that all problems have been appropriately rectified.
● If the initial review meeting identifies a significant amount of required rework,
the Quality Assurance Manager should schedule another review meeting with
the same review team to ensure that all defects are corrected.

There are two main factors to consider when rating the ‘impact’ of defects discovered
during peer review: the effect on functionality and the saving in rework time. If a defect
would result in a significant functional deficiency, or a large amount of rework later in the
project, it should be rated as 'high impact'.

Metrics can be used to help track the value of the review meetings. The ‘cost’ of
formal peer reviews is the man-time spent on meeting preparation, the review meeting
itself, and the subsequent rework. This can be recorded in man-days.

The ‘benefit’ of such reviews is the potential time saved. Although this can be estimated
when the defect is originally noted, such estimates are unlikely to be reliable. It may be
better to assign a notional ‘benefit’ – say two hours for a low-impact defect, one day for
a medium-impact defect and two days for a high-impact defect. Adding up the benefit in
man-days allows a direct comparison with ‘cost’. If no net benefit is obtained from the
peer reviews, the Quality Assurance Manager should investigate a less intensive
review regime, which can be implemented across the project or, more likely, in specific
areas of the project.
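
As a simple illustration of this cost/benefit comparison, the following Python sketch uses the notional benefit values suggested above (roughly two hours for a low-impact defect, one day for medium, two days for high). The defect ratings and effort figures passed in are hypothetical and would in practice come from the Defect Log and review records.

    # Illustrative sketch only: comparing the man-day cost of a peer review
    # against the notional benefit of the defects it caught.
    NOTIONAL_BENEFIT_DAYS = {"low": 0.25, "medium": 1.0, "high": 2.0}  # 2 hours ~ 0.25 day

    def review_net_benefit(prep_days, meeting_days, rework_days, defect_impacts):
        """defect_impacts is a list of ratings, e.g. ["low", "high", "medium"]."""
        cost = prep_days + meeting_days + rework_days
        benefit = sum(NOTIONAL_BENEFIT_DAYS[d] for d in defect_impacts)
        return benefit - cost

    # Example: a review consuming 1.5 person-days that found one high-impact and
    # two medium-impact defects shows a positive net benefit of 2.5 person-days.
    print(review_net_benefit(0.5, 0.5, 0.5, ["high", "medium", "medium"]))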

Best Practices
None



Sample Deliverables
None

Last updated: 01-Feb-07 18:47



Phase 5: Build
Task 5.5 Populate and Validate Database

Description

This task bridges the gap between unit testing and system testing. After unit testing is
complete, the sessions for each mapping must be ordered so as to properly execute
the complete data migration from source to target. This is done by creating workflows
that contain sessions and other tasks in the proper execution order.

By incorporating link conditions and/or Decision tasks into workflows, the execution
order of each session or task is very flexible. Additionally, Event-Raise and Event-Wait
tasks can be incorporated to further develop dependencies. The tasks within the workflows
should be organized so as to achieve an optimum load in terms of data quality and
efficiency.

When this task is completed, the development team should have a completely
organized loading model that it can use to perform a system test. The objective here is
to eliminate any possible errors in the system test that relate directly to the load
process. The final product of this task - the completed workflow(s) - is not static,
however. Since the volume of data used in production may differ significantly from the
volume used for testing, it may be necessary to move sessions and workflows around
to improve performance.

Prerequisites
None

Roles

Business Analyst (Secondary)

Data Integration Developer (Primary)

Technical Project Manager (Review Only)

Test Manager (Approve)



Considerations

At a minimum, this task requires a single instance of the target database(s). Also, while
data may not be required for initial testing, the structure of the tables must be identical
to those in the operational database(s). Additionally, consider putting all mappings to
be tested in a single folder. This will allow them to be executed in the same workflows
and reordered to assess optimum performance.

Best Practices
None

Sample Deliverables
None

Last updated: 01-Feb-07 18:47



Phase 5: Build
Subtask 5.5.1 Build Load Process

Description

Proper organization of the load process is essential for achieving two primary load
goals:

● Maintaining dependencies among sessions and workflows
● Minimizing the load window

Maintaining dependencies between sessions, worklets, and workflows is critical for correct data loading; lack of dependency control results in incorrect or missing data. Minimizing the load window is not always as important; its significance depends primarily on load volumes, hardware, and the available load time.

Prerequisites
None

Roles

Business Analyst (Review Only)

Data Integration Developer (Primary)

Technical Project Manager (Review Only)

Considerations

The load development process involves the following five steps:

1. Clearly define and document all dependencies
2. Analyze and document the load volume
3. Analyze the processing resources available
4. Develop operational requirements such as notifications, external processes, and timing
5. Develop tasks, worklets, and workflows based on the results

If the volume of data is sufficiently low for the available hardware to handle, you may
consider volume analysis optional, developing the load process solely on the
dependency analysis. Also, if the hardware is not adequate to run the sessions
concurrently, you will need to prioritize them. The highest priority within a group is
usually assigned to sessions with the most child dependencies.

Another possible component to add into the load process is sending e-mail. Three e-
mail options are available for notification during the load process:

● Post-session e-mails can be sent after a session completes successfully or when it fails
● E-mail tasks can be placed in workflows before or after an event or series of events
● E-mails can be sent when workflows are suspended

When the integrated load process is complete, it should be subjected to unit test, even if
all of the individual components have already been unit tested. The larger volumes
associated with an actual operational run would be likely to hamper validation of the
overall process. With unit test data, the staff members who perform unit testing should
be able to easily identify major errors that would otherwise surface when the system is
placed in operation.

Analyzing Load Volumes

The Load Dependency Analysis should list all sessions, in order of their dependency,
together with any other events (Informatica or other), on which the sessions depend.
The analysis must clearly document the dependency relationships between each
session and/or event, the algorithm or logic needed to test the dependency conditions
during execution, and the impact of any possible dependency test results (e.g., do not
run a session, fail a session, fail a parent or worklet, etc.).

The load dependency documentation might, for example, follow this format:

The first set of sessions or events listed in the analysis (Group A) would be those with no dependencies.

The second set listed (Group B) would be those with a dependency on one or more sessions or events in the first set (Group A). Against each session in this list, the following information would be included:

● Dependency relationships (e.g., Succeed, Fail, Completed by (time), etc.)
● Action (e.g., do not run, fail parent)
● Notification (e.g., e-mail)

The third set (Group C) would be those with a dependency on one or more sessions or events in the second set (Group B). Against each session in this list, similar dependency information would be included.

The listing would continue in this way until all sessions are included.
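
The grouping into dependency levels can be derived mechanically from the documented dependencies. The following Python sketch is purely illustrative; the session names and dependency sets are invented, and in practice they would come from the Load Dependency Analysis.

    # Illustrative sketch: deriving Group A, Group B, ... from session dependencies.
    def dependency_levels(dependencies):
        """dependencies maps each session to the set of sessions it depends on.
        Returns groups: level 0 (no dependencies), level 1, and so on."""
        remaining = {s: set(d) for s, d in dependencies.items()}
        levels = []
        while remaining:
            ready = [s for s, deps in remaining.items() if not deps]
            if not ready:
                raise ValueError("Circular dependency detected")
            levels.append(sorted(ready))
            for s in ready:
                del remaining[s]
            for deps in remaining.values():
                deps.difference_update(ready)
        return levels

    sessions = {
        "s_load_customers": set(),
        "s_load_products": set(),
        "s_load_orders": {"s_load_customers", "s_load_products"},
        "s_load_order_facts": {"s_load_orders"},
    }
    for i, group in enumerate(dependency_levels(sessions)):
        print(f"Group {chr(ord('A') + i)}: {group}")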

The Load Volume Analysis should list all the sources, source row counts, and row
widths expected for each session. This should include the sources for all lookup
transformations, in addition to the extract sources, as the amount of data that is read to
initialize a lookup cache can materially affect the initialization and total execution time
of a session. The Load Volume Analysis should also list sessions in descending order
of processing time, estimated based on these factors (i.e., the number of rows extracted,
the number of rows loaded, and the number and volume of lookups in the mappings).
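
A minimal sketch of this volume-based ranking follows. The session names, row counts, and row widths are hypothetical; real figures would come from source profiling and the session designs.

    # Illustrative sketch: ranking sessions by estimated processing volume.
    def estimated_volume(sources):
        """sources: list of (row_count, row_width_bytes) covering both extract
        sources and lookup sources for one session."""
        return sum(rows * width for rows, width in sources)

    sessions = {
        "s_load_order_facts": [(5_000_000, 120), (200_000, 60)],  # extract + lookup
        "s_load_customers":   [(250_000, 400)],
        "s_load_products":    [(40_000, 250)],
    }
    ranked = sorted(((n, estimated_volume(s)) for n, s in sessions.items()),
                    key=lambda x: x[1], reverse=True)
    for name, vol in ranked:
        print(f"{name}: ~{vol / 1_000_000:.0f} MB read")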

For Data Migration projects, the final load processes are the set of load scripts,
scheduling objects, or master workflows that will be executed for the data migration. It
is important that developers work with a load plan in mind so that these load
procedures can be developed quickly, as they are often built late in the project
development cycle when time is in short supply.

It is recommended to keep the number of load scripts, schedules, and master workflows to a minimum, as the execution of each will become a line item on the migration punchlist.

Best Practices
None

Sample Deliverables



None

Last updated: 15-Feb-07 19:28



Phase 5: Build
Subtask 5.5.2 Perform Integrated ETL Testing

Description

The task of integration testing is to check that components in a software system or, one
step up, software applications at the company level, interact without error. There are a
number of strategies that can be employed for integration testing; two examples are as
follows:

● Integration testing based on business processes. In this strategy, tests examine all the system components affected by a particular business process. For instance, one set of tests might cover the processing of a customer order, from acquisition and registration through to delivery and payment. Additional business processes are incorporated into the tests until all system components or applications have been sufficiently tested.
● Integration testing based on test objectives. For example, a test objective might be the integration of system components that use a common interface. In this strategy, tests would be defined based on the interface.

These two strategies illustrate that the ETL process is merely part of the equation rather than the focus of it. It is still important to take note of the ETL load so as to ensure that such aspects as performance and data quality are not adversely affected.

Prerequisites
None

Roles

Business Analyst (Secondary)

Data Integration Developer (Primary)

Technical Project Manager (Review Only)

Test Manager (Approve)



Considerations

Although this is a minor test from an ETL perspective, it is crucial to the ultimate goal
of a successful process implementation. The primary proof point of this test is matching
the number of rows loaded to each individual target table.
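
The row-count reconciliation can be as simple as the following sketch. The table names and counts are placeholders; in practice the expected figures would come from the source extracts or session logs and the loaded figures from queries against the target.

    # Minimal sketch of the row-count reconciliation described above.
    expected_counts = {"DIM_CUSTOMER": 250_000, "DIM_PRODUCT": 40_000, "FACT_ORDERS": 5_000_000}
    loaded_counts   = {"DIM_CUSTOMER": 250_000, "DIM_PRODUCT": 39_998, "FACT_ORDERS": 5_000_000}

    for table, expected in expected_counts.items():
        actual = loaded_counts.get(table, 0)
        status = "OK" if actual == expected else f"MISMATCH ({expected - actual} rows short)"
        print(f"{table}: expected {expected}, loaded {actual} -> {status}")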

It is a good practice to keep the Load Dependency Analysis and Load Volume Analysis
in mind during this testing, particularly if the process identifies a problem in the load
order. Any deviations from those analyses are likely to cause errors in the loaded data.

The final product of this subtask, the Final Load Process document, is the layout of
workflows, worklets, and session tasks that will achieve an optimal load process. The
Final Load Process document orders workflows, worklets, and session tasks in such a
way as to maintain the required dependencies while minimizing the overall load
window. This document differs from the one generated in the previous subtask, 5.5.1
Build Load Process, in that it represents the current actual result. However, this layout
is still dynamic and may change as a result of ongoing performance testing.

Tip
The Integration Test Percentage (ITP) is a useful metric that indicates the percentage
of the project's transformation objects that have been unit and integration tested. The
formula for ITP is:

ITP = 100% * Transformation Objects Unit Tested / Total Transformation Objects

As an example, this table shows the number of transformation objects for mappings.

Mapping    Trans. Objects
M_ABC      15
M_DEF      3
M_GHI      24
M_JKL      7

If mapping M_ABC is the only one unit tested, the ITP is:

ITP = 100% * 15 / 49 = 30.61%



If mapping M_DEF is the only one unit tested, the ITP is:

ITP = 100% * 3 / 49 = 6.12%

If mappings M_GHI and M_JKL are unit tested, the ITP is:

ITP = 100% * (24 + 7) / 49 = 100% * 31 / 49 = 63.27%

And if all modules are unit tested, the ITP is:

ITP = 100% * 49 / 49 = 100%

The ITP metric provides a precise measurement as to how much unit and integration
testing has been done. On actual projects, the definition of a unit can vary. A unit
may be defined as an individual function, a group of functions, or an entire Computer
Software Unit (which can be several thousand lines of code). The ITP metric is not
based on the definition of a unit. Instead, the ITP metric is based on the actual
number of transformation objects tested with respect to the total number of
transformation objects defined in the project.
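
The ITP calculation above can be scripted directly from the object counts in the table; the following Python sketch reproduces the worked examples.

    # Sketch of the ITP calculation using the example mappings from the table above.
    transformation_objects = {"M_ABC": 15, "M_DEF": 3, "M_GHI": 24, "M_JKL": 7}

    def itp(tested_mappings):
        tested = sum(transformation_objects[m] for m in tested_mappings)
        total = sum(transformation_objects.values())
        return 100.0 * tested / total

    print(f"{itp(['M_ABC']):.2f}%")               # 30.61%
    print(f"{itp(['M_GHI', 'M_JKL']):.2f}%")      # 63.27%
    print(f"{itp(transformation_objects):.2f}%")  # 100.00%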

Best Practices
None

Sample Deliverables
None

Last updated: 01-Feb-07 18:48



Phase 5: Build
Task 5.6 Build Presentation Layer

Description

The objective of this task is to develop the end-user analysis application, using the results from 4.4
Design Presentation Layer. The result of this task should be a final presentation layer
application that satisfies the needs of the organization. While this task may run in
parallel with the building of the data integration processes, data is needed to validate
the results of any presentation layer queries. This task cannot, therefore, be completed
before tasks 5.4 Design and Develop Data Integration Processes and 5.5 Populate and
Validate Database. The Build Presentation Layer task consists of two subtasks, which
may need to be performed iteratively several times:

1. Developing the end-user presentation layer

2. Presenting the presentation layer to business analysts to elicit and incorporate their feedback.

Throughout the Build Phase, the developers should refer to the deliverables produced
during the Design Phase. These deliverables include a working prototype, end user
feedback, metadata design framework and, most importantly, the Presentation
Layer Design document, which is the final result of the Design Phase and incorporates
all efforts completed during that phase. This document provides the necessary
specifications for building the front-end application for the user community.

This task incorporates both development and unit testing. Test data will be available
from the initial loads of the target system. Depending on volumes, a sample of the initial
load may be appropriate for development and unit testing purposes. This sample data
set can be used to assist in building the presentation layer and validating reporting
results, without the added effort of fabricating test data.

Prerequisites
None

Roles



Business Analyst (Primary)

Presentation Layer Developer (Primary)

Project Sponsor (Approve)

Technical Project Manager (Review Only)

Considerations

The development of the presentation layer includes developing interfaces and predefined reports to provide end users with access to the data. It is important that data be available to validate the accuracy of the development effort.

Having end users available to review the work-in-progress is an advantage, enabling developers to incorporate changes or additions early in the review cycle.

Best Practices
None

Sample Deliverables
None

Last updated: 01-Feb-07 18:48



Phase 5: Build
Subtask 5.6.1 Develop Presentation Layer

Description

By the time you get to this subtask, all the design work should be complete, making this
subtask relatively simple. Now is the time to put everything together and build the
actual objects such as reports, alerts and indicators.

During the build, it is important to follow any naming standards that may have been
defined during the design stage, in addition to the standards set on layouts, formats,
etc. Also, keep detailed documentation of these objects during the build activity. This
will ensure proper knowledge transfer and ease of maintenance in addition to improving
the readability for everyone.

After an object is built, thorough testing should be performed to ensure that the data
presented by the object is accurate and the object is meeting the performance that is
expected.

The principles for this subtask also apply to metadata solutions providing metadata to
end users.

Prerequisites
None

Roles

Presentation Layer Developer (Primary)

Considerations

During the Build task, it is good practice to verify and review all the design options and
to be sure you have a clear picture of the goal. Keep in mind that you have to
create a report no matter what the final form of the information delivery is. In other
words, indicators and alerts are derived from a report, and hence your first task is to
create a report. The following considerations should be taken into account while
building any piece of information delivery:

Step 1: What measurements do I want to display?

The measurements, which are called metrics in the BI terminology, are perhaps the
most important part of the report. Begin the build task by selecting your metrics, unless
you are creating an Attribute-only Report. Add all the metrics that you want to see on
the report and arrange them in the required order. You can add a prompt to the report if
you want to make it more generic over, for example, time periods or product categories.
Optionally, you can choose a Time Key that you want to use as well for each metric.

Step 2: What parameters should I include?

The metrics are always measured against a set of predefined parameters. Select these
parameters, which are called Attributes in the BI terminology, and add them to the
Report (unless you are creating a Metric-only Report). You can add Prompts and Time
Keys for the attributes too, just like the metrics.

Tip
Create a query for metrics and attributes. This will help in searching for the specific
metrics or attributes much faster than manually searching in a pool of hundreds of
metrics and attributes.

Time setting preferences can vastly differ from one user’s requirement to that of
another. One group of users may be interested just in the current data while another
group may want to compare the trends and patterns over a period of time. It is
important to thoroughly analyze the end user’s requirements and expectations prior to
adding the Time Settings to reports.

Step 3: What are my data limiting criteria for this report?

Now that you have selected all the data elements that you need in the report, it is time
to make sure that you are delivering only the relevant data set to the end users. Make
sure to use the right Filters and Ranking criteria to accomplish this in the report.
Consider using Filtersets instead of just Filters so that important criteria limiting the
data sets can be standardized over a project or department, for example.

Step 4: How should I format the report?



Presenting the information to the end user in an appealing format is as important as
presenting the right data. A good portion of the formatting should be decided during the
Design phase. However, you can consider the following points while formatting the
reports:

Table report type: The data in the report can be arranged in one of the following three
table types: tabular, cross tabular, or sectional. Select the one that suits the report the
best.

Data sort order: Arrange the data such that the pattern makes it easy to find any part
of the information that one is interested in.

Chart or graph: A picture is worth a thousand words. A chart or graph can be very
useful when you are trying to make a comparison between two or more time periods,
regions, or product categories, etc.

Step 5: How do I deliver the information?

Once the report is ready, you should think about how the report should be delivered. In
doing so, be sure to address the following points:

Where should the report reside? – Select a folder that is most suited for the data that
the report contains. If the report is shared by more than one group of users, you may
want to save it in a shared folder.

Who should get the report, when, and how should they get it? – Make sure that
proper security options are implemented for each report. There may be sensitive and
confidential data that you want to ensure is not accessible by unauthorized users.

When should the report be refreshed? - You can choose to run the report on-demand
or schedule it to be automatically refreshed at regular intervals. Ad-hoc reports that are
of interest to a smaller set of individuals are usually run on-demand. However, the bulk
of the reports that are viewed regularly by different business users need to be
scheduled to refresh periodically. The refresh interval should typically consider the
period for which the business users are likely to consider the data ‘current’ as well as
the frequency of data change in the data warehouse.

Occasionally, there will be a requirement to see the data in the report as soon as the
data changes in the data warehouse (and data in the warehouse may change very
frequently). You can handle situations like this by having the report refresh in real time.



Special requirements – You should consider at this time any special requirements a report
may have, such as whether the report needs to be broadcast to users, or whether
there is a need to export the data in the report to an external format. Based on
these requirements, you can make minor changes in the report as necessary.
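
To make the decisions from Steps 1 through 5 concrete, the following hypothetical Python structure records them for one report. The field names and values are invented for illustration and do not come from the BI tool itself; they simply summarize what should be captured for each report.

    # Purely illustrative record of the build decisions for one report.
    revenue_by_region_report = {
        "metrics": ["Revenue", "Gross Margin"],                      # Step 1
        "attributes": ["Region", "Product Category", "Month"],       # Step 2
        "prompts": ["Month"],
        "filters": ["Fiscal Year = current", "Region <> 'TEST'"],    # Step 3
        "layout": {"type": "cross-tab",
                   "sort": ["Region", "Revenue desc"],
                   "chart": "stacked bar"},                          # Step 4
        "delivery": {"folder": "Shared/Sales",
                     "access": ["Sales Managers"],
                     "schedule": "daily 06:00",
                     "broadcast": False},                            # Step 5
    }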

Packing More Power into the Information

Adding certain features to the report can make it more useful for everybody. Consider
the following for each report that you build:

Title of the report: The title of the report should reflect what the report contents are
meant to convey. In rare cases, it may be difficult to name a report accurately
if the same report is viewed from two different perspectives by two different sets of users.
You may consider making a copy of the report and naming the two instances to suit
each set of users.

Analytic workflows: Analytic workflows make the information analysis process as a whole more robust. Add the report to one or more analytic workflows so that the user can get additional questions answered in the context of a particular report’s data.

Drill paths: Check to make sure that the required drill paths are set up. If you don’t find
a drill path that you think is useful for this report, you may have to contact the
Administrator and have it set up for you.

Highlighters: It may also be a good idea to use highlighters to make critical pieces of
information more conspicuous in the report.

Comments and description: Comments and Descriptions make the reports more
easily readable as well as helping when searching for a report.

Keywords: It is not uncommon to have numerous reports pertaining to the same business area residing in the same location. Including keywords in the report setup will assist users in searching for a report more easily.

Indicator Considerations

After the base report is complete, you can build indicators on top of that report. First,
you will need to determine and select the type of indicator that best suits the primary
purpose. You can use chart, table or gauge indicators. Remember that there are
several types of chart indicators as well as several different gauge indicators to choose
from. To help decide what types of indicators to use, consider the following:

Do you want to display information on one specific metric?

Gauge indicators allow you to monitor a single metric and display whether or not that
indicator is within an acceptable range. For example, you can create a gauge indicator
to monitor the revenue metric value for each division of your company. When you
create a gauge indicator, you have to determine and specify three ranges (low,
medium, and high) for the metric value. Additionally, you have to decide how the
gauge should be displayed: circular, flat, or digital.
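
A gauge indicator essentially classifies a single metric value into one of the three ranges you define. The following sketch is illustrative only; the thresholds and the revenue figure are invented.

    # Hypothetical sketch of the three-range classification a gauge performs.
    def gauge_range(value, low_max, medium_max):
        if value <= low_max:
            return "low"
        if value <= medium_max:
            return "medium"
        return "high"

    # e.g., monitoring a division's revenue metric (figures in millions)
    print(gauge_range(4.2, low_max=3.0, medium_max=5.0))  # "medium"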

Do you want to display information on multiple metrics?

If you want to display information for one or more attributes or multiple metrics, you can
create either chart or table indicators. If you choose chart indicators, you have more than
a dozen different types of charts to choose from (standard bar, stacked line, pie, etc.).
However, if you’d like to see a subset of an actual report, including sum calculations, in
a table view, choose a table indicator.

Alert Considerations

Alerts are created when something important is occurring, such as falling revenue or
record-breaking sales. When creating alerts, consider the following:

What are the important business occurrences?

These answers will come from discussions with your users. Once you find out what is
important to the users, you can define the Alert rules.

Who should receive the alert?

It is important that the alert is delivered to the appropriate audience. An alert may go on
a business unit’s dashboard or a personal dashboard.

How should the alert be delivered?

Once the appropriate Alert receiver is identified, you must determine the proper
delivery device. If the user doesn’t log into Power Analyzer on a daily basis, maybe an
email should be sent. If the alert is critical, a page could be sent. Furthermore, make
sure that the required delivery device (i.e., email, phone, fax, or pager) has been
registered in the BI tool.

Testing and Performance

Thorough testing needs to be performed on the report/indicator/alert after it is built to ensure that you are presenting accurate and desired information. Try to make sure that the individual rows, as well as aggregate values, have accurate numbers and are reported against the correct attributes.

Always keep the performance of the reports in mind. If a report takes too long to generate data, then you need to identify what is causing the bottleneck and eliminate or reduce it. The following points are worth remembering:

● Complex queries, especially against dozens of tables, can make a well-designed data warehouse look inefficient.
● Multi-pass SQL is supported by Data Analyzer.
● Indexing is important, even in simple star schemas.

Tip
If a report is taking a long time to retrieve data from the source system, view the query
behind the report. Copy the query and evaluate it, for example by running utilities such
as Explain Plan in Oracle, to make sure that it is optimized.

Best Practices
None

Sample Deliverables
None

Last updated: 18-Oct-07 15:05



Phase 5: Build
Subtask 5.6.2 Demonstrate Presentation Layer to Business Analysts

Description

After the initial development effort, the development team should present the
presentation layer to the Business Analysts to elicit and incorporate their feedback.
When educating the end users about the front-end tool, whether a business intelligence
tool or an application, it is important to focus on the capabilities of the tool and the
differences between typical reporting environments and solution architectures. When
end users thoroughly understand the capabilities of the front end that they will use, they
can offer more relevant feedback.

Prerequisites
None

Roles

Business Analyst (Primary)

Presentation Layer Developer (Primary)

Project Sponsor (Approve)

Technical Project Manager (Review Only)

Considerations

Demonstrating the presentation layer to the business analysts should be an iterative process that continues throughout the Build Phase. This approach helps the developers to gather and incorporate valuable user feedback and enables the end users to validate or clarify the interpretation of their requirements prior to the release of the end product, thereby ensuring that the end result meets the business requirements.



The Project Manager must play an active role in the process of accepting and
prioritizing end user requests. While the initial release of the presentation layer should
satisfy user requirements, in an iterative approach, some of the additional requests
may be implemented in future releases to avoid delaying the initial release. The Project
Manager needs to work closely with the developers and analysts to prioritize the
requests based upon the availability of source data to support the end users' requests
and the level of effort necessary to incorporate the changes into the initial (or current)
release. In addition, the Project Manager must communicate regularly with the end
users to set realistic expectations and establish a process for evaluating and prioritizing
feedback. This type of communication helps to avoid end-user dissatisfaction,
particularly when some requests are not included in the initial release. The Project
Manager also needs to clearly communicate release schedules and future development
plans, including specifics about the availability of new features or capabilities, to the
end-user community.

Best Practices
None

Sample Deliverables
None

Last updated: 01-Feb-07 18:48



Phase 6: Test

6 Test

● 6.1 Define Overall Test Strategy
❍ 6.1.1 Define Test Data Strategy
❍ 6.1.2 Define Unit Test Plan
❍ 6.1.3 Define System Test Plan
❍ 6.1.4 Define User Acceptance Test Plan
❍ 6.1.5 Define Test Scenarios
❍ 6.1.6 Build/Maintain Test Source Data Set
● 6.2 Prepare for Testing Process
❍ 6.2.1 Prepare Environments
❍ 6.2.2 Prepare Defect Management Processes
● 6.3 Execute System Test
❍ 6.3.1 Prepare for System Test
❍ 6.3.2 Execute Complete System Test
❍ 6.3.3 Perform Data Validation
❍ 6.3.4 Conduct Disaster Recovery Testing
❍ 6.3.5 Conduct Volume Testing
● 6.4 Conduct User Acceptance Testing
● 6.5 Tune System Performance
❍ 6.5.1 Benchmark
❍ 6.5.2 Identify Areas for Improvement
❍ 6.5.3 Tune Data Integration Performance
❍ 6.5.4 Tune Reporting Performance



Phase 6: Test

Description

The diligence with which you pursue the Test Phase of your project will inevitably determine its acceptance by its end users, and therefore its success against its business objectives. During the Test Phase you must essentially validate that your system accomplishes everything that the project objectives and requirements specified and that all the resulting data and reports are accurate. Testing is also a critical preparation against any eventuality that could impact your project, whether that be radical changes to data volumes, disasters that disrupt service for the system in some way, or spikes in concurrent usage.

The Test Phase includes the full design of your testing plans and infrastructure as well
as two categories of comprehensive system-wide verification procedures: the System
Test and the User Acceptance Test (UAT). The System Test is conducted after all
elements of the system have been integrated into the test environment. It includes a
number of detailed, technically-oriented verifications that are managed as processes by
the technical team, with primarily technical criteria for acceptance. UAT is a detailed,
user-oriented set of verifications with user acceptance as the objective. It is typically
managed by end users with participation from the technical team. No test can be
considered complete until there is verification that it has accomplished the agreed-upon
Acceptance Criteria. Because of the natural tension that exists between completion of
the preset project timeline and completion of the Acceptance Criteria (which may take
longer than expected), the Test Phase schedule is often owned by a QA Manager
or Project Sponsor rather than the Project Manager.

As a final step in the Test Phase, Velocity includes activities related to tuning system
performance. Satisfactory performance and system responsiveness can be a critical
element of user acceptance.

Prerequisites
None

Roles

Business Analyst (Primary)

Data Integration Developer (Primary)



Data Warehouse Administrator (Primary)

Database Administrator (DBA) (Primary)

End User (Primary)

Network Administrator (Primary)

Presentation Layer Developer (Primary)

Project Sponsor (Review Only)

Quality Assurance Manager (Primary)

Repository Administrator (Primary)

System Administrator (Primary)

System Operator (Primary)

Technical Project Manager (Secondary)

Test Manager (Primary)

User Acceptance Test Lead (Primary)

Considerations

To ensure the Test Phase is successful it must be preceded by diligent planning and
preparation. Early on, project leadership and project sponsors should establish test
strategies and begin building plans for System Test and UAT. Velocity recommends
that this planning process begins, at the latest, during the Design Phase, and that it
includes descriptions of timelines, participation, test tools, guidelines and scenarios, as
well as detailed Acceptance Criteria.

The Test Phase includes the development of test plans and procedures. It is intended
to overlap with the Build Phase, which includes the individual design reviews and unit
test procedures. It is difficult to determine your final testing strategy until detailed
design and build decisions have been made in the Build Phase. Thus, from a planning
perspective, it is expected that some tasks and subtasks in the Test Phase will
overlap with those in the Build Phase and possibly the Design Phase.

The Test Phase includes other important activities in addition to testing. Any defects or
deficiencies discovered must be categorized (severity, criticality, priority), recorded, and
weighed against the Acceptance Criteria (AC). The technical team should repair them
within the guidelines of the AC, and the results must be retested, with the inclusion of
satisfactory regression testing. This process has, as a prerequisite, the development
of some type of Defect Tracking System; Velocity recommends that this be
developed during the Build Phase.

Although formal user acceptance signals the completion of the Test Phase, some of its
activities will be revisited, perhaps many times, throughout the operation of the
system. Performance tuning is recommended as a recurrent process. As data volume
grows and the profile of the data changes, performance and responsiveness may
degrade. You may want to plan for regular periods of benchmarking and tuning, rather
than waiting to react to end-user complaints. By its nature, software
development is not always perfect, so some repair and retest should be expected. The
Defect Tracking System must be maintained to record defects and enhancements for
as long as the system is supported and used. Test scenarios, regression test
procedures, and other testing aids must also be retained for this purpose.

Best Practices
None

Sample Deliverables
None

Last updated: 01-Feb-07 18:48



Phase 6: Test
Task 6.1 Define Overall Test Strategy

Description

The purpose of testing is to verify that the software has been developed according to
the requirements and design specifications. Although the major testing actually occurs
at the end of the Build Phase, determining the amount and types of testing to be
performed should occur early in the development lifecycle. This enables project
management to allocate adequate time and resources to this activity. It also enables
the project to build the appropriate testing infrastructure prior to the beginning of the
testing phase. Thus, while all of the testing-related activities have been consolidated
in the Test Phase, these activities often begin as early as the
Design Phase. The detailed object-level testing plans are continually updated and
modified as the development process continues, since any change to development work
is likely to create a new scenario to test.

Planning should include the following components:

● resource requirements and schedule
● construction and maintenance of the test data
● preparation of test materials
● preparation of test environments
● preparation of the methods and control procedures for each of the major tests

Typically, there are three levels of testing:

● Unit (performed by the Developer): Testing of each individual function. For example, with data integration this includes testing individual mappings, UNIX scripts, stored procedures, or other external programs. Ideally, the developer tests all error conditions and logic branches within the code.

● System or Integration (performed by the System Test Team): Testing performed to review the system as a whole as well as its points of integration. Testing may include, but is not limited to, data integrity, reliability, and performance.

● User Acceptance (performed by the User Acceptance Testing Team): As most data integration solutions do not directly touch end users, User Acceptance Testing should focus on the front-end applications and reports, rather than the load processes themselves.

Prerequisites
None

Roles

Business Analyst (Primary)

Data Integration Developer (Primary)

End User (Primary)

Presentation Layer Developer (Primary)

Quality Assurance Manager (Approve)

Technical Project Manager (Approve)

Considerations



None

Best Practices
None

Sample Deliverables
None

Last updated: 01-Feb-07 18:48



Phase 6: Test
Subtask 6.1.1 Define Test Data Strategy

Description

Ideally, actual data from the production environment will be available for testing so that
tests can cover the full range of possible values and states in the data. However, the
full set of production data is often not available.

Additionally, there is sometimes a risk of sensitive information migrating from production to less-controlled environments (i.e., test); in some circumstances, this may even be illegal. There is also the chicken-and-egg problem of requiring the load of production source data in order to test the load of production source data. Therefore, it is important to understand that with any set of data used for testing, there is no guarantee that all possible exception cases and value ranges will occur in the sub-set of the data used.

If generated data is used, the main challenge is to ensure that it accurately reflects the
production environment. Theoretically, generated data can be made to be
representative and engineered to test all of the project functionality. While the actual
record counts in generated tables are likely to differ from production environments, the
ratios between tables should be maintained; for example, if there is a one-to-ten ratio
between products and customers in the live environment, care should be taken to
retain this same ratio in the test environment.
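
As an illustration of preserving such ratios, the following Python sketch generates a small synthetic data set with a one-to-ten product-to-customer ratio. The file names and column names are invented for the example; real generated data would need to match the agreed source structures.

    # Illustrative sketch only: generating ratio-preserving test data as CSV files.
    import csv
    import random

    def generate_test_data(n_products=100, customers_per_product=10, path_prefix="test_"):
        products = [{"product_id": i, "product_name": f"PROD_{i:04d}"}
                    for i in range(1, n_products + 1)]
        customers = [{"customer_id": j, "preferred_product_id": random.randint(1, n_products)}
                     for j in range(1, n_products * customers_per_product + 1)]
        for name, rows in (("products", products), ("customers", customers)):
            with open(f"{path_prefix}{name}.csv", "w", newline="") as f:
                writer = csv.DictWriter(f, fieldnames=rows[0].keys())
                writer.writeheader()
                writer.writerows(rows)

    generate_test_data()  # writes test_products.csv and test_customers.csv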

The deliverable from this subtask is a description and schedule for how test data will be
derived, stored, and migrated to testing environments. Adequate test data can be
important for proper unit testing and is critical for satisfactory system and user
acceptance tests.

Prerequisites
None

Roles

Business Analyst (Primary)



Data Integration Developer (Secondary)

End User (Primary)

Presentation Layer Developer (Primary)

Quality Assurance Manager (Approve)

Technical Project Manager (Approve)

Test Manager (Primary)

Considerations

In stable environments, there is less of a premium on flexible maintenance of test data structures; the overhead of developing software to load test data may not be justified. In dynamic environments (i.e., where source and/or target data structures are not finalized), the availability of a data movement tool such as PowerCenter greatly expands the range of options for test data storage and movement.

Usually, data for testing purposes is stored in the same structure as the source in the
data flow. However, it is also possible to store test data in a format that is geared
toward ease of maintenance and to use PowerCenter to transfer the data to the source
system format. So if the source is a database with a constantly changing structure, it
may be easier to store test data in XML or CSV formats where it can easily be
maintained with a text editor. The PowerCenter mappings that load the test data from
this source can make use of techniques to insulate (to some degree) the logic from
schema changes by including pass-through transformations after source qualifiers and
before targets.

For Data Migration, the test data strategy should be focused on how much source data
to use rather than on how to manufacture test data. It is strongly recommended that the
data used for testing be real production data, though most likely at a lower volume than the
production system. By using real production data, the final testing will be more
meaningful and will increase the level of confidence of the business community, thus
making ‘go/no-go’ decisions easier.

Best Practices
None



Sample Deliverables

Critical Test Parameters

Last updated: 01-Feb-07 18:48



Phase 6: Test
Subtask 6.1.2 Define Unit Test Plan

Description

Any distinct unit of development must be adequately tested by the developer before it is
designated ready for system test and for integration with the rest of the project
elements. This includes any element of the project that can, in any way, be tested on its
own. Rather than conducting unit testing in a haphazard fashion with no means of
certifying satisfactory completion, all unit testing should be measured against a
specified unit test plan and its completion criteria.

Unit test plans are based on the individual business and functional requirements and
detailed design for mappings, reports, or components for the mapping or report. The
unit test plans should include specification of inputs, tests to verify, and expected
outputs and results. The unit test is the best opportunity to discover any
misinterpretation of the design as well as errors of development logic. The creation of
the unit test plan should be a collaborative effort by the designer and the developer,
and must be validated by the designer as meeting the business and functional
requirements and design criteria. The designer should begin with a test scenario or test
data descriptions and include checklists for the required functionality; the developer
may add technical tests and make sure all logic paths are covered.

The unit test plan consists of:

● Identification section: unit name, version number, date of build or change, developer, and other identification information.
● References to all applicable requirements and design documents.
● References to all applicable data quality processes (e.g., data analysis,
cleansing, standardization, enrichment).
● Specification of test environment (e.g., system requirements, database/
schema to be used).
● Short description of test scenarios and/or types of test runs.
● Per test run:

❍ Purpose (what features/functionality are being verified).


❍ Prerequisites.



❍ Definition of test inputs.
❍ References to test data or load-files to be used.
❍ Test script (step-by-step guide to executing the test).
❍ Specification (checklist) of the expected outputs, messages, error
handling results, data output, etc.

● Comments and findings.

Prerequisites
None

Roles

Business Analyst (Secondary)

Data Integration Developer (Primary)

Presentation Layer Developer (Primary)

Quality Assurance Manager (Review Only)

Considerations

Reference to design documents should contain the name and location of any related
requirements documents, high-level and detailed design, mock-ups, workflows, and
other applicable documents.

Specification of the test environment should include such details as which reference or
conversion tables must be used to translate the source data for the appropriate target
(e.g., for conversion of postal codes, for key translation, other code translations). It
should also include specification of any infrastructure elements or tools to be used in
conjunction with the tests.

The description of test runs should include the functional coverage, and any
dependencies between test runs.

Prerequisites should include whatever is needed to create the correct environment for the test to take place, any dependencies the test has on completion of other logic or test runs, availability of reference data, adequate space in database or file system, and so forth.

The input files or tables must be specified with their locations. These data must be
maintained in a secure place to make repeatable tests possible.

Specifying the expected output is the main part of the test plan. It specifies in detail any
output records and fields, and any functional or operational results through each step of
the test run. The script should cover all of the potential logic paths and include all code
translations and other transformations that are part of the unit. Comparing the produced
output from the test run with this specification provides the verification that the build
satisfies the design.

The test script specifies all the steps needed to create the correct environment for the
test, to complete the actual test run itself, and the steps to analyze the results. Analysis
can be done by hand or by using compare scripts.
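
A compare script of this kind can be very simple. The sketch below diffs an exported copy of the actual target output against the expected output specified in the test plan; the file names and key column are hypothetical placeholders.

    # Minimal sketch of a compare script for unit test results.
    import csv

    def load_rows(path, key):
        with open(path, newline="") as f:
            return {row[key]: row for row in csv.DictReader(f)}

    def compare_outputs(expected_file, actual_file, key="order_id"):
        expected = load_rows(expected_file, key)
        actual = load_rows(actual_file, key)
        for k in sorted(expected.keys() | actual.keys()):
            if k not in actual:
                print(f"{key}={k}: missing from actual output")
            elif k not in expected:
                print(f"{key}={k}: unexpected record in actual output")
            elif expected[k] != actual[k]:
                diffs = {c: (expected[k][c], actual[k].get(c))
                         for c in expected[k] if expected[k][c] != actual[k].get(c)}
                print(f"{key}={k}: field differences {diffs}")

    # Example call (file names are placeholders):
    # compare_outputs("expected_orders.csv", "actual_orders.csv")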

The Comments and Findings section is where all errors and unexpected results found
in the test run should be logged. In addition, errors in the test plan itself can be logged
here as well. It is up to the QA Management and/or QA Strategy to determine whether
to use a more advanced error tracking system for unit testing or to wait until system
test. Some sites demand a more advanced error logging system (e.g., ClearCase)
where errors can be logged along with an indication of their severity and impact, as well
as information about who is assigned to resolve the problem.

One or more test runs can be specified in a single unit test plan. For example, one run may be an initial load against an empty target, with subsequent runs covering incremental loads against existing data, tests with empty input, tests with duplicate input records or files, and tests that produce empty reports.

Test data must contain a mix of correct and incorrect data. Correct data can be expected to produce the specified output; incorrect data should produce results according to the defined error-handling strategy, such as creating error records or aborting the process. Examples of incorrect data include (a small example data set is sketched after the list):

● Value errors: a value is not in the acceptable domain, or a mandatory field is empty.
● Syntax errors: incorrect date format, incorrect postal code format, or non-numeric data in numeric fields.
● Semantic errors: two values are individually correct, but cannot exist in the same record.
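
As a sketch of how deliberately bad records can be mixed in with correct ones, the snippet below assembles a small test data set covering each error type; the column names, formats, and values are assumptions used only for illustration.

# A correct record (illustrative columns and formats).
good = {"ORDER_ID": "1001", "ORDER_DATE": "2007-02-01", "POSTAL_CODE": "90210",
        "STATUS": "SHIPPED", "SHIP_DATE": "2007-02-03"}

test_records = [
    good,
    # Value error: mandatory STATUS field left empty.
    {**good, "ORDER_ID": "1002", "STATUS": ""},
    # Syntax errors: wrong date format and non-numeric postal code.
    {**good, "ORDER_ID": "1003", "ORDER_DATE": "01/02/2007", "POSTAL_CODE": "ABCDE"},
    # Semantic error: both dates are valid, but SHIP_DATE precedes ORDER_DATE.
    {**good, "ORDER_ID": "1004", "ORDER_DATE": "2007-02-10", "SHIP_DATE": "2007-02-05"},
]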

Note that the error handling strategy should account for any Data Quality operations built into the project. Note also that some PowerCenter transformations can make use of data quality processes, or plans, developed in Informatica Data Quality (IDQ) applications. Data quality plan instructions can be loaded into a Data Quality Integration transformation (the transformation is added to PowerCenter via a plug-in).

Data quality plans should be tested using IDQ applications before they are added to
PowerCenter transformations. The results of these tests will feed as prerequisites into
the main unit test plan. The tests for data quality processes should follow the same
guidelines as outlined in this document. A PowerCenter mapping should be validated
once the Data Quality Integration transformation has been added to it and configured
with a data quality plan.

Every difference between the output expectation and the test output itself should be
logged in the Comments and Findings section, along with information about the
severity and impact on the test process. The unit test can proceed after analysis and
error correction.

The unit test is complete when all test runs are successfully completed and the findings
are resolved and retested. At that point, the unit can be handed over to the next test
phase.

Best Practices

Testing Data Quality Plans

Sample Deliverables

Test Case List

Unit Test Plan

Last updated: 01-Feb-07 18:48

Phase 6: Test
Subtask 6.1.3 Define System Test Plan

Description

System Test (sometimes known as Integration Test) is crucial for ensuring that the
system operates reliably as a fully integrated system and functions according to the
business requirements and technical design. Success rests largely on business users'
confidence in the integrity of the data. If the system has flaws that impede its functions,
the data may also be flawed, or users may perceive it as flawed, which results in a loss
of confidence in the system. If the system does not provide adequate performance and
responsiveness, the users may abandon it (especially if it is a reporting system)
because it does not meet their perceived needs.

As with the other testing processes, it is very important to begin planning for System
Test early in the project to make sure that all necessary resources are scheduled and
prepared ahead of time.

Prerequisites
None

Roles

Quality Assurance Manager (Review Only)

Test Manager (Primary)

Considerations

Since the system test addresses multiple areas and test types, creation of the test plan
should involve several specialists. The System Test Manager is then responsible for
compiling their inputs into one consistent system test plan. All individuals participating
in executing the test plan must agree on the relevant performance indicators that are
required to determine if project goals and objectives are being met. The performance
indicators must be documented, reviewed, and signed-off on by all participating team
members.

Performance indicators are placed in the context of Test Cases, Test Levels, and Test
Types, so that the test team can easily measure and monitor their evaluation criteria.

Test Cases

The test case (i.e., unit of work to be tested) must be sufficiently specific to track and
improve data quality and performance.

Test Levels

Each test case is categorized as occurring on a specific level or levels. This helps to
clearly define the actual extent of testing expected within a given test case. Test levels
may include one or more of the following:

● System Level. Covers all "end to end" integration testing, and involves the
complete validation of total system functionality and reliability through all
system entry points and exit points. Typically, this test level is the highest, and
the last level of testing to be completed.
● Support System Level. Involves verifying the ability of existing support
systems and infrastructure to accommodate new systems or the proposed
expansion of existing systems. For example, this level of testing may
determine the effect of a potential increase in network traffic due to an
expanded system user base on overall business operations.
● Internal Interface Level. Covers all testing that involves internal system data
flow. For example, this level of testing may validate the ability of PowerCenter
to successfully connect to a particular data target and load data.
● External Interface Level. Covers all testing that involves external data
sources. For example, this level of testing may collect data from diverse
business systems into a data warehouse.
● Hardware Component Level. Covers all testing that involves verifying the
function and reliability of specific hardware components. For example, this
level of testing may validate a back-up power system by removing the primary
power source. This level of testing typically occurs during the development
cycle.
● Software Process Level. Covers all testing that involves verifying the function
and reliability of specific software applications. This level of testing typically
occurs during the development cycle.
● Data Unit Level. Covers all testing that involves verifying the function and reliability of specific data items and structures. This typically occurs during the development cycle in which data types and structures are defined and tested based on the application design constraints and requirements.

Test Types

The Data Integration Developer generates a list of the required test types based on the
desired level of testing. The defined test types determine what kind of tests must be
performed to satisfy a given test case. Test types that may be required include:

● Critical Technical Parameters (CTPs). A worksheet of specific CTPs is established, based on the identified test types. Each CTP defines specific functional units that are tested. This should include any specific data items, components, or functional parts.
● Test Condition Requirements (TCRs). Test Condition Requirement scripts
are developed to satisfy all identified CTPs. These TCRs are assigned a
numeric designation and include the test objective, list of any prerequisites,
test steps, actual results, expected results, tester ID, the current date, and the
current iteration of the test. All TCRs are included with each Test Case
Description (TCD).
● Test Execution and Progression. A detailed description of general control
procedures for executing a test, such as special conditions and processes for
returning a TCR to a developer in the event that it fails. This description is
typically provided with each TCD.
● Test Schedule. A specific test schedule that is defined within each TCD,
based upon the project plan, and maintained using MS Project or a
comparable tool. The overall Test Schedule for the project is available in the
TCD Test Schedule Summary, which identifies the testing start and end dates
for each TCD.

As part of 6.3 Execute System Test, the following specific tests should also be planned for:

● 6.3.3 Perform Data Validation
● 6.3.4 Conduct Disaster Recovery Testing
● 6.3.5 Conduct Volume Testing

The system test plan should include:

● System name, version number, list of components
● Reference to design document(s) such as high-level designs, workflow designs, database model and reference, hardware descriptions, etc.

● Specification of test environment
● Overview of the test runs (coverage, interdependencies)
● Per test run:

❍ Type and purpose of the test run (coverage, results, etc.)


❍ Prerequisites (e.g., accurate results from other test runs, availability of
reference data, space in database or file system, availability of monitoring
tools, etc.)
❍ Definition of test input
❍ References to test data or load-files to be used (note: data must be stored
in a secure place to permit repeatable tests)
❍ Specification of the expected output and system behaviour (including
record counts, error records expected, expected runtime, etc.)
❍ Specification of expected and maximum acceptable runtime
❍ Step-by-step guide to execute the test (including environment
preparation, results recording, and analysis steps, etc.)

● Defect tracking process and tools


● Description of structure for meetings to discuss progress, issues and defect
management during the test

The system test plan consists of one or more test runs, each of which must be described in detail. The interaction between the test runs must also be specified. After each run, the System Test Manager can decide, depending on the defect count and severity, whether the system test can proceed with subsequent test runs or whether errors must be corrected and the previous run repeated.

Every difference between the expected output and the test output itself should
be recorded and entered into the defect tracking system with a description of the
severity and impact on the test process. These errors and the general progress of the
system test should be discussed in a weekly or bi-weekly progress meeting. At this
meeting, participants review the progress of the system test, any problems identified,
and assignments to resolve or avoid them. The meeting should be directed by the
System Test Manager and attended by the testers and other necessary specialists like
designers, developers, systems engineers and database administrators.

After assignment of the findings, the specialists can take the necessary actions to
resolve the problems. After the solution is approved and implemented, the system test
can proceed.

When all tests are run successfully and all defects are resolved and retested, the
system test plan will have been completed.

Best Practices
None

Sample Deliverables
None

Last updated: 15-Feb-07 19:38

Phase 6: Test
Subtask 6.1.4 Define User Acceptance Test Plan

Description

User Acceptance Testing (often known as UAT) is essential for gaining approval, acceptance, and project sign-off. It is the end-user community that needs to carry out the testing and identify relevant issues for fixing. Resources for the testing include physical environment setup as well as allocation of staff from the user community. As with system testing, planning for User Acceptance Testing should begin early in the project to ensure the necessary resources are scheduled and ready. In addition, the user acceptance criteria need to be distilled from the requirements and existing gold-standard reports. These criteria need to be documented and agreed to by all parties to avoid delays caused by scope creep.

Prerequisites
None

Roles

Business Analyst (Secondary)

End User (Primary)

Quality Assurance Manager (Approve)

Test Manager (Primary)

Considerations

The plan should be constructed from the acceptance criteria, with test scripts of actions that users need to carry out to achieve certain results; for example, instructions to run particular workflows and reports within which users can then examine the data. The author of the plan needs to bear in mind that the testers from the user community may not be technically minded. Indeed, one possible benefit of having non-technical users involved is that they provide insight into the time and effort required for adoption and training when the completed data integration project is deployed.

In addition to test scripts for execution, further acceptance criteria need to be defined:

● Performance, required response time, and usability
● Data quality tolerances
● Validation procedures for verifying data quality
● Tolerable bugs, based on the defect management processes

In Data Migration projects, user acceptance testing is even more user-focused than
other data integration efforts. This testing usually takes two forms, traditional UAT and
‘day-in-the-life’. During these two phases, business users are working through the
system, executing their normal daily routine and driving out issues and
inconsistencies. It is very important that the data migration team works closely with the
business testers to both provide appropriate data for these tests and to
capture feedback to improve the data as soon as possible. This UAT activity is the best
way to find out if the data is correct and if the data migration was completed
successfully.

Best Practices
None

Sample Deliverables
None

Last updated: 01-Feb-07 18:48

Phase 6: Test
Subtask 6.1.5 Define Test Scenarios

Description

Test scenarios provide the context, the “story line”, for much of the test procedures, whether Unit Test, System Test, or UAT. How can you know that the software solution you are developing will work within its ultimate business usage? A scenario provides the business case for testing specific functionality, enabling testers to simulate the related business activity and then measure the results against expectations. For this reason, design of the scenarios is a critical activity and one that may involve significant effort in order to provide coverage for all the functionality that needs testing.

The test scenario forms the basis for development of test scripts and checklists, the
source data definitions, and other details of specific test runs.

Prerequisites
None

Roles

Business Analyst (Secondary)

End User (Primary)

Quality Assurance Manager (Approve)

Test Manager (Primary)

Considerations

Test scenarios must be based on the functional and technical requirements by dividing
them into specific functions that can be treated in a single test process.

Test scenarios may include:

● The purpose/objective of the test (functionality being tested) described in end-
user terms.
● Description of business, functional, or technical context for the test.
● Description of the type of technologies, development objects, and/or data that
should be included.
● Any known dependencies on other elements of the existing or new systems.

Typical attributes of test scenarios:

● Should be designed to represent both typical and unusual situations.


● Should include use of valid data as well as invalid or missing data.
● Test engineers may define their own unit test cases.
● Business cases and test scenarios for System and Integration Tests are
developed by the test team with assistance of developers and end-users.

Best Practices
None

Sample Deliverables
None

Last updated: 01-Feb-07 18:48

Phase 6: Test
Subtask 6.1.6 Build/Maintain Test Source Data Set

Description

This subtask deals with the procedures and considerations for actually creating, storing, and maintaining the test data. The procedures for any given project are, of course, specific to its requirements and environments, but they are also opportunistic: for some projects a comprehensive set of data (or at least a good start in that direction) already exists, while for other projects the test data may need to be created from scratch.

In addition to test data that allows full functional testing (i.e., functional test data), there
is also a need for adequate data for volume tests (i.e., volume test data). The following
paragraphs discuss each of these data types.

Functional Test Data

Creating a source data set to test the functionality of the transformation software should
be the responsibility of a specialized team largely consisting of business-aware
application experts. Business application skills are necessary to ensure that the test
data not only reflects the eventual production environment but that it is also engineered
to trigger all the functionality specified for the application. Technical skills in whatever
storage format is selected are also required to facilitate data entry and/or movement.

Volume is not a requirement of the functional test data set; indeed, too much data is
undesirable since the time taken to load it needlessly delays the functional test.

In a data integration project, while functional test data for the application sources is
indispensable, the case for a predefined data set for the targets should also be
considered. If available, such a data set makes it possible to develop an automated test
procedure to compare the actual result set to a predicted result set (making the
necessary adjustments to generated data, such as surrogate keys, timestamps, etc.).
This has additional value in that the definition of a target data set in itself serves as a
sort of design audit.

Volume Test Data

The main objective for the volume test data set is to ensure that the project satisfies
any Service Level Agreements that are in place and generally meets performance
expectations in the live environment.

Once again, PowerCenter can be used to generate volumes of data and to modify
sensitive live information in order to preserve confidentiality. There are a number of
techniques to generate multiple output rows from a single source row, such as:

● Cartesian join in source qualifier


● Normalizer transformation
● Union transformation
● Java transformation

If possible, the volume test data set should also be available to developers for unit
testing in order to identify problems as soon as possible.
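
The techniques above generate rows inside PowerCenter. Outside of it, a comparable row-multiplication step can be sketched in a few lines; the file names, multiplication factor, and the idea of suffixing the key to keep duplicates unique are assumptions for illustration.

import csv

def multiply_rows(source_path, target_path, factor=100, key_column="CUSTOMER_ID"):
    # Write each source row 'factor' times, suffixing the key so duplicates remain unique.
    with open(source_path, newline="") as src, open(target_path, "w", newline="") as tgt:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(tgt, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            for i in range(factor):
                copy = dict(row)
                copy[key_column] = f"{row[key_column]}_{i}"
                writer.writerow(copy)

# Hypothetical usage: expand a functional test extract into a volume test data set.
# multiply_rows("functional_test_customers.csv", "volume_test_customers.csv", factor=1000)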

Maintenance

In addition to the initial acquisition or generation of test data, you will need a protected
location for its storage and procedures for migrating it to test environments in such a
fashion that the original data set is preserved (for the next test sequence). In addition,
you are likely to need procedures that will enable you to rebuild or rework the test data,
as required.

Prerequisites
None

Roles

Business Analyst (Primary)

Data Integration Developer (Primary)

Considerations

Creating the source and target data sets and conducting automated testing are non-trivial and are, therefore, often dismissed as impractical. This is partly the result of a failure to appreciate the role that PowerCenter can play in the execution of the test strategy.

At some point in the test process, it is going to be necessary to compile a schedule of expected results from a given starting point. Using PowerCenter to make this information available and to compare the actual results from the execution of the workflows can greatly facilitate the process.

Data Migration projects should have little need for generating test data. It is strongly
recommended that all data migration integration and system tests use
actual production data. Therefore, effort spent generating test data on a data migration
project should be very limited.

Best Practices
None

Sample Deliverables
None

Last updated: 15-Feb-07 19:40

Phase 6: Test
Task 6.2 Prepare for Testing Process

Description

This is the first major task of the Test Phase – general preparations for System Test
and UAT. This includes preparing environments, ramping up defect management
procedures, and generally making sure the test plans and all their elements are
prepared and that all participants have been notified of the upcoming testing processes.

Prerequisites
None

Roles

Data Integration Developer (Secondary)

Database Administrator (DBA) (Primary)

Presentation Layer Developer (Secondary)

Quality Assurance Manager (Primary)

Repository Administrator (Primary)

System Administrator (Primary)

Test Manager (Primary)

Considerations

Prior to beginning this subtask, you will need to collect and review the documentation generated by the previous tasks and subtasks, including the test strategy, system test plan, and UAT plan. Verify that all required test data has been prepared and that the defect tracking system is operational. Ensure that all unit test certification procedures are being followed.

Based on the system test plan and UAT plan:

● Collect all relevant requirements, functional and internal design specifications, end-user documentation, and any other related documents. Develop the test procedures and documents for testers to follow from these.
● Verify that all expected participants have been notified of the applicable test schedule.
● Review the upcoming test processes with the Project Sponsor to ensure that they are consistent with the organization's existing QA culture (i.e., in terms of testing scope, approaches, and methods).
● Review the test environment requirements (e.g., hardware, software, communications, etc.) to ensure that everything is in place and ready.
● Review testware requirements (e.g., coverage analyzers, test tracking, problem/bug tracking, etc.) to ensure that everything is ready for the upcoming tests.

Best Practices
None

Sample Deliverables
None

Last updated: 01-Feb-07 18:48

Phase 6: Test
Subtask 6.2.1 Prepare Environments

Description

It is important to prepare the test environments in advance of System Test with the
following objectives:

● To emulate, to the extent possible, the Production environment.


● To provide test environments that enable full integration of the system, and
isolation from development.
● To provide secure environments that support the test procedures and
appropriate access.
● To allow System Tests and UAT to proceed without delays and without system
disruptions.

Prerequisites
None

Roles

Data Integration Developer (Secondary)

Database Administrator (DBA) (Primary)

Presentation Layer Developer (Secondary)

Repository Administrator (Primary)

System Administrator (Primary)

Test Manager (Primary)

Considerations

Plans

A formal test plan needs to be prepared by the Project Manager in conjunction with the
Test Manager. This plan should cover responsibilities, tasks, time-scales, resources,
training, and success criteria. It is vital that all resources, including off-project support
staff, are made available for the entire testing period.

Test scripts need to be prepared, together with a definition of the data required to
execute the scripts. The Test Manager is responsible for preparing these items, but is
likely to delegate a large part of the work.

A formal definition of the required environment also needs to be prepared, including all
necessary hardware components (i.e., server and client), software components (i.e.,
operating system, database, data movement, testing tools, application tools, custom
application components etc., including versions), security and access rights, and
networking.

Establishing security and isolation is critical for preventing any unauthorized or unplanned migration of development objects into the test environments. The test environment administrator(s) must have specific verifications, procedures, and timing for any migrations and sufficient controls to enforce them.

Review the test plans and scenarios to determine the technical requirements for the
test environments. Volume tests and disaster/recovery tests may require special
system preparations.

The System Test environment may evolve into the UAT environment, depending on
requirements and stability.

Processes

Where possible, all processes should be supported by the use of appropriate tools.
Some of the key terminology related to the preparation of the environments and the
associated processes include:

● Training testers – a series of briefings and/or training sessions should be made available. This may be any combination of formal presentations, formal training courses, computer based tutorials or self-study sessions.
● Recording test results – the results of each test must be recorded and cross-referenced to the defect reporting process.

● Reporting and resolution of defects (see 5.1.3 Define Defect Tracking
Process) – a process for recording defects, prioritizing their resolution, and
tracking the resolution process.
● Overall test management – a process for tracking the effectiveness of UAT
and the likely effort and timescale remaining

Data

The data required for testing can be derived from the test cases defined in the scripts.
This should enable a full dataset to be defined, ensuring that all possible cases are
tested. 'Live data' is usually not sufficient because it does not cover all the cases the
system should handle, and may require some sampling to keep the data volumes at
realistic levels. It is, of course, possible to use modified live data, adding the additional
cases or modifying the live data to create the required cases.

The process of creating the test data needs to be defined. Some automated approach to creating all or the majority of the data is best. There is often a need to process data through a system where some form of OLTP is involved. In this case, it must be possible to roll back to a base state of data to allow reapplication of the ‘transaction’ data – as would be achieved by restoring from backup.

Where multiple data repositories are involved, it is important to define how these
datasets relate. It is also important that the data is consistent across all the repositories
and that it can be restored to a known state (or states) as and when required.
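
One minimal sketch of such a restore-to-base-state step, assuming the base-state data is preserved in snapshot tables (suffixed _BASE here) in the same database; the table names and the DB-API connection object are assumptions.

def restore_base_state(connection, tables):
    # Reload each test table from its preserved base-state snapshot so the test run can be repeated.
    cursor = connection.cursor()
    for table in tables:
        cursor.execute(f"DELETE FROM {table}")
        cursor.execute(f"INSERT INTO {table} SELECT * FROM {table}_BASE")
    connection.commit()

# Hypothetical usage with any Python DB-API connection:
# restore_base_state(conn, ["CUSTOMER_STG", "ORDER_STG", "ORDER_FACT"])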

Environment

A properly set-up environment is critical to the success of UAT. This covers:

● Server(s) – must be available for the required duration and have sufficient disk
space and processing power for the anticipated workload.
● Client workstations – must be available and sufficiently powerful to run the
required client tools.
● Server and client software – all necessary software (OS, database, ETL, test
tools, data quality tools, connectivity etc.) should be installed at the version
used in development (normally) with databases created as required.
● Networking – all required LAN and WAN connectivity must be set up and firewalls configured to allow appropriate access. Sufficient bandwidth must be available for any particularly large data transmissions.
● Databases – all necessary schemas must be created and populated with an appropriate backup/restore strategy in place, and access rights defined and implemented.
● Application software – correct versions should be migrated from development.

For Data Migration, the system test environment should not be limited to the
Informatica environment, but should also include all source systems, target systems,
reference data and staging databases, and file systems. The system tests will be a
simulation of production systems so the entire process should execute like a production
environment.

Best Practices
None

Sample Deliverables
None

Last updated: 15-Feb-07 19:43

Phase 6: Test
Subtask 6.2.2 Prepare Defect Management Processes

Description

The key measure of software quality is, of course, the number of defects (a defect is
anything that produces results other than the expected results based on the software
design specification). Therefore it is essential for software projects to have a systematic
approach to detecting and resolving defects early in the development life cycle.

Prerequisites
None

Roles

Quality Assurance Manager (Primary)

Test Manager (Primary)

Considerations

Personal and peer reviews are primary sources of early defect detection. Unit testing, system testing, and UAT are other key sources; however, in these later project stages, defect detection is a much more resource-intensive activity. Worse yet, change requests and trouble reports are evidence of defects that have made their way to the end users.

There are two major components of successful defect management: defect prevention and defect detection. A good defect management process should enable developers both to lower the number of defects that are introduced and to remove defects early in the life cycle, prior to testing.

Defect management begins with the design of the initial QA strategy and a good, detailed test strategy. They should clearly define methods for reviewing system requirements and design and spell out guidelines for testing processes, tracking defects, and managing each type of test. In addition, many QA strategies include specific checklists that act as gatekeepers to authorize satisfactory completion of tests, especially during unit and system testing.

To support early defect resolution, you must have a defect tracking system that is readily accessible to developers and includes the following (a minimal record sketch follows the list):

● Ability to identify and type the defect, with details of its behaviour
● Means for recording the timing of the defect discovery, resolution, and retest
● Complete description of the resolution
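
A minimal record sketch with illustrative field names only; a real defect tracking tool adds workflow, security, and reporting on top of something like this.

from dataclasses import dataclass
from typing import Optional

@dataclass
class Defect:
    # One tracked defect (illustrative fields).
    defect_id: str
    defect_type: str                  # e.g., data, logic, environment
    description: str                  # details of the observed behaviour
    severity: str                     # e.g., critical, major, minor
    impact: str
    discovered_on: str                # timing of discovery
    assigned_to: Optional[str] = None
    resolved_on: Optional[str] = None
    retested_on: Optional[str] = None
    resolution: str = ""              # complete description of the resolution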

Best Practices
None

Sample Deliverables
None

Last updated: 01-Feb-07 18:48

Phase 6: Test
Task 6.3 Execute System Test

Description

System Test (sometimes known as Integration Test) is crucial for ensuring that the
system operates reliably and according to the business requirements and technical
design. Success rests largely on business users' confidence in the integrity of the data.
If the system has flaws that impede its function, the data may also be flawed, or users
may perceive it as flawed - which results in a loss of confidence in the system. If the
system does not provide adequate performance and responsiveness, the users may
abandon it (especially if it is a reporting system) because it does not meet their
perceived needs.

System testing follows unit testing, providing the first tests of the fully integrated
system, and offers an opportunity to clarify users' performance expectations and
establish realistic goals that can be used to measure actual operation after the system
is placed in production. It also offers a good opportunity to refine the data volume
estimates that were originally generated in the Architect Phase. This is useful for
determining if existing or planned hardware will be sufficient to meet the demands on
the system.

This task incorporates five steps:

1. 6.3.1 Prepare for System Test, in which the test team determines how to test the system from end-to-end to ensure a successful load as well as planning for the environments, participants, tools and timelines for the test.
2. 6.3.2 Execute Complete System Test, in which the data integration team works with the Database Administrator to run the system tests planned in the prior subtask. It is crucial to also involve end-users in the planning and review of system tests.
3. 6.3.3 Perform Data Validation, in which the QA Manager and QA team ensure that the system is capable of delivering complete, valid data to the business users.
4. 6.3.4 Conduct Disaster Recovery Testing, in which the system's robustness and recovery in case of disasters such as network or server failure is tested.
5. 6.3.5 Conduct Volume Testing, in which the system's capability to handle large volumes is tested.

Prerequisites
None

Roles

Business Analyst (Primary)

Data Integration Developer (Primary)

Database Administrator (DBA) (Primary)

End User (Primary)

Network Administrator (Secondary)

Presentation Layer Developer (Secondary)

Project Sponsor (Review Only)

Quality Assurance Manager (Review Only)

Repository Administrator (Secondary)

System Administrator (Primary)

Technical Project Manager (Review Only)

Test Manager (Primary)

Considerations

All involved individuals and departments should review and approve the test plans, test
procedures, and test results prior to beginning this subtask.

It is important to thoroughly document the system testing procedure, describing the testing strategy, acceptance criteria, scripts, and results. This information can be invaluable later on, when the system is in operation and may not be meeting performance expectations or delivering the results that users want - or expect.

For Data Migration projects, system tests are important because these are essentially
‘dress-rehearsals’ for the final migration. These tests should be executed with
production-level controls and be tracked and improved upon from system test cycle to
system test cycle. In data migration projects these system tests are often referred to as
‘mock-runs’ or ‘trial cutovers’.

Best Practices
None

Sample Deliverables
None

Last updated: 01-Feb-07 18:48

Phase 6: Test
Subtask 6.3.1 Prepare for System Test

Description

System test preparation consists primarily of creating the environment(s) required for
testing the application and staging the system integration. System Test is the first
opportunity, following comprehensive unit testing, to fully integrate all the elements of
the system, and to test the system by emulating how it will be used in production. For
this reason, the environment should be as similar as possible to the production
environment in its hardware, software, communications, and any support tools.

Prerequisites
None

Roles

Data Integration Developer (Secondary)

Database Administrator (DBA) (Secondary)

System Administrator (Secondary)

Test Manager (Primary)

Considerations

The preparations for System Test often take much more effort than expected, so they should be preceded by a detailed integration plan that describes how all of the system elements will be physically integrated within the System Test environment. The integration plan should be specific to your environment, but the following general steps are common to most integration plans.

● Migration of Informatica development folders to the system test environment. These folders may also include shared folders and/or shortcut folders that may have been added or modified during the development process. In versioned repositories, deployment groups may be used for this purpose. Often, flat files or parameter files reside on the development environment’s server and need to be copied to the appropriate directories on the system test environment server.
● Data consistency in system test environment is crucial. In order to
emulate the production environment, the data being sourced and targeted
should be as close as possible to production data in terms of data quality and
size.
● The data model of the system test environment should be very similar to
the model that is going to be implemented in production. Columns,
constraints, or indices often change throughout development, so it is important
to system test the data model before going into production.
● Synchronization of incremental logic is key when doing system testing. In order to emulate the production environment, the variables or parameters used for incremental logic need to match the values in the system test environment database(s). If the variables or parameters do not match, they can cause missing data or unusual amounts of data being sourced (a simple consistency-check sketch follows this list).
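
The snippet below is one possible consistency check, run before a test cycle, that compares an incremental-load parameter against the latest timestamp already present in the system test target. The parameter file layout, the $$LAST_EXTRACT_DATE parameter name, and the table and column names are all hypothetical.

def read_parameter(param_file, name):
    # Pull a single parameter value from a name=value style parameter file (illustrative parsing).
    with open(param_file) as f:
        for line in f:
            if line.strip().startswith(name + "="):
                return line.split("=", 1)[1].strip()
    return None

def check_incremental_sync(connection, param_file):
    # A mismatch here usually explains missing data or unusually large extract volumes.
    last_extract = read_parameter(param_file, "$$LAST_EXTRACT_DATE")   # hypothetical parameter name
    cursor = connection.cursor()
    cursor.execute("SELECT MAX(LOAD_DATE) FROM ORDER_FACT")            # hypothetical target table/column
    max_loaded = cursor.fetchone()[0]
    print(f"$$LAST_EXTRACT_DATE = {last_extract}, target MAX(LOAD_DATE) = {max_loaded}")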

For Data Migration projects, the system test should not just involve running Informatica workflows; it should also include data set-up, migrating code, executing data and process validation, and post-process auditing. The system test set-up should be part of the system test, not a pre-system test step.

Best Practices
None

Sample Deliverables

System Test Plan

Last updated: 01-Feb-07 18:48

Phase 6: Test
Subtask 6.3.2 Execute Complete System Test

Description

System testing offers an opportunity to establish performance expectations and verify that the system works as designed, as well as to refine the data volume estimates generated in the Architect Phase.

This subtask involves a number of guidelines for running the complete system test and
resolving or escalating any issues that may arise during testing.

Prerequisites
None

Roles

Business Analyst (Secondary)

Data Integration Developer (Secondary)

Database Administrator (DBA) (Review Only)

Network Administrator (Review Only)

Presentation Layer Developer (Secondary)

Quality Assurance Manager (Review Only)

Repository Administrator (Review Only)

System Administrator (Review Only)

Technical Project Manager (Review Only)

Test Manager (Primary)

Considerations

System Test Plan

The system test plan needs to include prerequisites for entering the system test phase, criteria for successfully exiting the system test phase, and defect classifications. In addition, all test conditions, expected results, and test data need to be available prior to system test.

Load Routines

Ensure that the system test plan includes all types of load that may be encountered
during the normal operation of the system. For example, a new data warehouse (or a
new instance of a data warehouse) may include a one-off initial load step. There may
also be weekly, monthly, or ad-hoc processes beyond the normal incremental load
routines.

System testing is a cyclical process. The project team should plan to execute multiple
iterations of the most common load routines within the timeframe allowed for system
testing. Applications should be run in the order specified in the test plan.

Scheduling

An understanding of dependent predecessors is crucial for the execution of end-to-end testing, as is the schedule for the testing run. Scheduling, which is the responsibility of the testing team, is generally facilitated through an application such as the PowerCenter Workflow Manager module and/or a third-party scheduling tool. Use the pmcmd command line syntax when running PowerCenter tasks and workflows with a third-party scheduler. Third-party scheduling tools can create dependencies between PowerCenter tasks and jobs that may not be possible within PowerCenter alone.

Also, the tools in PowerCenter and/or a third-party scheduling tool can be used to detect long-running sessions or tasks and alert the system test team via email. This helps to identify issues early and manage the system test timeframe effectively.
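
A scheduler-side wrapper can start a workflow through pmcmd, wait for completion, and flag failures or long runtimes to the test team. The sketch below is illustrative only: the pmcmd flags shown should be verified against your PowerCenter version, and the service, domain, credentials, folder, and workflow names are placeholders.

import subprocess
import time

def run_workflow(workflow, folder, threshold_minutes=60):
    # Start the workflow with pmcmd and wait; flag failures and long runtimes for the test team.
    cmd = [
        "pmcmd", "startworkflow",
        "-sv", "INT_SVC_TEST", "-d", "Domain_Test",   # placeholder Integration Service and domain
        "-u", "testuser", "-p", "testpassword",       # credentials normally come from a secure store
        "-f", folder, "-wait", workflow,
    ]
    start = time.time()
    result = subprocess.run(cmd)
    elapsed = (time.time() - start) / 60
    if result.returncode != 0:
        print(f"ALERT: workflow {workflow} failed (pmcmd return code {result.returncode})")
    elif elapsed > threshold_minutes:
        print(f"ALERT: workflow {workflow} ran {elapsed:.0f} minutes, over the {threshold_minutes}-minute threshold")
    return result.returncode

# run_workflow("wkf_load_orders", "SYS_TEST_FOLDER")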

System Test Results

The team executing the system test plan is responsible for tracking the expected and actual results of each session and task run. Commercial software tools are available for logging test cases and storing test results.

The details of each PowerCenter session run can be found in the Workflow Monitor. To
see the results:

● Right-click the session in the Workflow Monitor and choose ‘Properties’.
● Click the Transformation Statistics tab in the Properties dialog box.

Session statistics are also available in the PowerCenter repository view REP_SESS_LOG, or through Metadata Reporter.
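
For example, a short script can pull session row counts from REP_SESS_LOG into the test log. The column names used here (SESSION_NAME, SUCCESSFUL_ROWS, FAILED_ROWS, SUBJECT_AREA) and the bind-parameter style should be checked against your repository's MX views and database driver.

def session_statistics(repo_connection, folder):
    # Summarize session row counts from the REP_SESS_LOG MX view for the system test log.
    cursor = repo_connection.cursor()
    cursor.execute(
        "SELECT SESSION_NAME, SUCCESSFUL_ROWS, FAILED_ROWS "
        "FROM REP_SESS_LOG WHERE SUBJECT_AREA = ?", (folder,)
    )
    for session_name, loaded_rows, failed_rows in cursor.fetchall():
        status = "CHECK" if failed_rows else "OK"
        print(f"{status}  {session_name}: {loaded_rows} rows loaded, {failed_rows} rows failed")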

Resolution of Coding Defects

The testing team must document the specific statistical results of each run and
communicate those results back to the project development team. If the results do not
meet the criteria listed in the test case, or if any process fails during testing, the test
team should immediately generate a change request. The change request is assigned
to the developer(s) responsible for completing system modifications. In the case of a
PowerCenter session failure, the test team should seek the advice of the appropriate
developer and business analyst before continuing with any other dependent tests.

Ideally, all defects will be captured, fixed, and successfully retested within the system
testing timeframe. In reality, this is unlikely to happen. If outstanding defects are still
apparent at the end of the system testing period, the project team needs to decide how
to proceed. If the system test plan contains successful system test completion criteria, those criteria must be fulfilled.

Defect levels must meet established criteria for completion of the system test cycle.
Defects should be judged by their number and by their impact. Ultimately, the project
team is responsible for ensuring that the tests adhere to the system test plan and the
test cases within it (developed in Subtask 6.3.1 Prepare for System Test ). The project
team must review and sign-off on the results of the tests.

For Data Migration projects, because they are usually part of a larger implementation, the system test should be integrated with the larger project's system test. The results of this test should be reviewed, improved upon, and communicated to the project manager or project management office (PMO). It is common for these types of projects to have three or four full system tests, otherwise known as ‘mock runs’ or ‘trial cutovers’.

Best Practices
None

Sample Deliverables
None

Last updated: 15-Feb-07 19:46

Phase 6: Test
Subtask 6.3.3 Perform Data Validation

Description

The purpose of data validation is to ensure that the target data is populated from the source as per specification. The team responsible for completing the end-to-end test plan should be in a position to utilize the results detailed in the testing documentation (e.g., CTPs, TCRs, and TCDs). Test team members should review and analyze the test results to determine if project and business expectations are being met.

● If the team concludes that the expectations are being met, it can sign-off on
the end-to-end testing process.
● If expectations are not met, the testing team should perform a gap analysis on
the differences between the test results and the project and business
expectations.

The gap analysis should list the errors and requirements not met so that a Data
Integration Developer can be assigned to investigate the issue. The analysis should
also include data from initial runs in production. The Data Integration Developer should
assess the resources and time required to modify the data integration environment to
achieve the required test results. The Project Sponsor and Project Manager should
then finalize the approach for incorporating the modifications, which may include
obtaining additional funding or resources, limiting the scope of the modifications, or re-
defining the business requirements to minimize modifications.

Prerequisites
None

Roles

Business Analyst (Primary)

Data Integration Developer (Review Only)

Presentation Layer Developer (Secondary)

Project Sponsor (Review Only)

Quality Assurance Manager (Review Only)

Technical Project Manager (Review Only)

Test Manager (Primary)

Considerations

Before performing data validation, it is important to consider these issues:

● Job Run Validation. A very high-level testing validation can be performed using dashboards or custom reports using Informatica Data Explorer. The session logs and the Workflow Monitor can be used to check if the job has completed successfully. If relational database error logging is chosen, then the error tables can be checked for any transformation errors and session errors. The Data Integration Developer needs to resolve the errors identified in the error tables.

The Integration Service generates the following tables to help you track row errors (a query sketch appears after this list):

❍ PMERR_DATA. Stores data and metadata about a transformation row error and its corresponding source row.
❍ PMERR_MSG. Stores metadata about an error and the error message.
❍ PMERR_SESS. Stores metadata about the session.
❍ PMERR_TRANS. Stores metadata about the source and transformation ports, such as name and datatype, when a transformation error occurs.

● Involvement. The test team, the QA team, and, ultimately, the end-user
community are all jointly responsible for ensuring the accuracy of the data. At
the conclusion of system testing, all must sign-off to indicate their acceptance
of the data quality.

● Access To Front-End for Reviewing Results. The test team should have access to reports and/or a front-end tool to help review the results of each testing run. Before testing begins, the team should determine just how results are to be reviewed and reported, what tool(s) are to be used, and how the results are to be validated. The test team should also have access to current business reports produced in legacy and current operational systems. The current reports can be compared to those produced from data in the new system to determine if requirements are satisfied and that the new reports are accurate.
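
Where relational error logging to the PMERR_* tables (listed above) is enabled, a short query can summarize the logged row errors for the validation log. The sketch below assumes a DB-API connection to the error-logging schema; the ERROR_MSG column name should be verified against your PowerCenter version.

def summarize_row_errors(connection):
    # Count logged row errors per error message across the PMERR_MSG table.
    cursor = connection.cursor()
    cursor.execute("SELECT ERROR_MSG, COUNT(*) FROM PMERR_MSG GROUP BY ERROR_MSG")
    for message, error_count in cursor.fetchall():
        print(f"{error_count:6d}  {message}")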

The Data Validation task has enormous scope and is a significant phase in any project
cycle. Data validation can be either manual or automated.

Manual. This technique involves manually validating target data against the source and also ensuring that all the transformations have been correctly applied. Manual validation may be valid for a limited set of data or for master data.

Automated. This technique involves using various techniques and/or tools to validate data and ensure, at the end of the cycle, that all the requirements are met. The following tools are very useful for data validation:

File Diff. This utility is generally available with any testing tool and is
very useful if the source(s) and target(s) are files. Otherwise, the
result sets from the source and/or target systems can be saved as
flat files and compared using file diff utilities.

Data Analysis Using IDQ. The testing team can use Informatica
Data Quality (IDQ) Data Analysis plans to assess the level of data
quality needs. Plans can be built to identify problems with data
conformity and consistency. Once the data is analyzed, scorecards
can be used to generate a high-level view of the data quality. Using
the results from data analysis and scorecards, new test cases can be
added and new test data can be created for the testing cycle.

Using DataProfiler In Data Validation. Full data validation can be one of the most time-consuming elements of the testing process. During the System Test phase of the data integration project, you can use data profiling technology to validate the data loaded to the target database. Data profiling allows the project team to test the requirements and assumptions that were the basis for the Design Phase and Build Phase of the project, facilitating such tests as:

❍ Business rule validations

❍ Domain validations
❍ Row counts and distinct value counts
❍ Aggregation accuracy

Throughout testing, it is advisable to re-profile the source data. This provides information on any source data changes that may have taken place since the Design Phase. Additionally, it can be used to verify the makeup and diversity of any data sets extracted or created for the purposes of testing. This is particularly relevant in environments where production source data was not available during design. When development data is used to develop the business rules for the mappings, surprises commonly occur when production data finally becomes available.
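
Even without a profiling tool, several of these checks reduce to straightforward source-to-target comparisons. The sketch below compares row counts, distinct key counts, and one aggregate; the table and column names are placeholders.

def scalar(cursor, sql):
    # Run a single-value query and return the result.
    cursor.execute(sql)
    return cursor.fetchone()[0]

def basic_profile_checks(source_conn, target_conn):
    # Compare row counts, distinct keys, and an aggregate between source and target (placeholder names).
    src, tgt = source_conn.cursor(), target_conn.cursor()
    checks = {
        "row count": ("SELECT COUNT(*) FROM ORDERS",
                      "SELECT COUNT(*) FROM ORDER_FACT"),
        "distinct orders": ("SELECT COUNT(DISTINCT ORDER_ID) FROM ORDERS",
                            "SELECT COUNT(DISTINCT ORDER_ID) FROM ORDER_FACT"),
        "total amount": ("SELECT SUM(ORDER_AMOUNT) FROM ORDERS",
                         "SELECT SUM(ORDER_AMT) FROM ORDER_FACT"),
    }
    for name, (source_sql, target_sql) in checks.items():
        source_value, target_value = scalar(src, source_sql), scalar(tgt, target_sql)
        flag = "OK" if source_value == target_value else "MISMATCH"
        print(f"{flag:8s} {name}: source={source_value} target={target_value}")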

Defect Management:

The defects encountered during data validation should be organized using either a simple tool, such as an Excel (or comparable) spreadsheet, or a more advanced tool. Advanced tools may have facilities for defect assignment, defect status changes, and/or a section for defect explanation. The Data Integration Developer and the testing team must ensure that all defects are identified and corrected before changing the defect status.

For Data Migration projects, it is important to identify a set of processes and procedures to be executed to simplify the validation process. These processes and procedures should be built into the Punch List and should focus on reliability and efficiency. For large-scale data migration projects, it is important to recognize the scale of validation required. A set of tools must be developed to enable the business validation personnel to quickly and accurately validate that the data migration was complete. Additionally, it is important that the run book includes steps to verify that all technical steps were completed successfully. PowerCenter Metadata Reporter should be leveraged and documented in the punch list steps, and detailed records of all interaction points should be included in the operational procedures.

Best Practices
None

Sample Deliverables
None

Last updated: 15-Feb-07 19:48

Phase 6: Test
Subtask 6.3.4 Conduct Disaster Recovery Testing

Description

Disaster testing is crucial for proving the resilience of the system to the business
sponsors and IT support teams, and for ensuring that staff roles and responsibilities are
understood if a disaster occurs.

Prerequisites
None

Roles

Database Administrator (DBA) (Primary)

End User (Primary)

Network Administrator (Secondary)

Quality Assurance Manager (Review Only)

Repository Administrator (Secondary)

System Administrator (Primary)

Test Manager (Primary)

Considerations

Prior to disaster testing, disaster tolerance and system architecture need to be considered. These factors should already have been assessed during earlier phases of the project.

The first step is to try to quantify the risk factors that could cause a system to fail and
evaluate how long the business could cope without the system should it fail. These
determinations should allow you to judge the disaster tolerance capabilities of the
system.

Secondly, consider the system architecture. A well-designed system will minimize the
risk of disaster. If a disaster occurs, the system should allow a smooth and timely
recovery.

Disaster Tolerance

Disaster tolerance is the ability to successfully recover applications and data after a
disaster within an acceptable time period. A disaster is an event that unexpectedly
disrupts service availability, corrupts data, or destroys data. Disasters may be
triggered by natural phenomena, malicious acts of sabotage against the organization,
or terrorist activity against society in general.

The need for a disaster tolerant system depends on the risk of disaster and how long
the business can afford applications to be out of action. The location and geographical
proximity of data centers plus the nature of the business affect risk. The vulnerability of
the business to disaster depends upon the importance of the system to the business as
a whole and the nature of a system. Service level agreements (SLA) for the availability
of a system dictate the need for disaster testing. For example, a real-time message-
based transaction processing application that has to be operational 24/7 needs to be
recovered faster than a management information system with a less stringent SLA.

System Architecture

Disaster testing is strongly influenced by the system architecture. A system can be designed with a clustered architecture to reduce the impact of disaster. For example, a user acceptance system and a production system can run in a clustered environment. If the production server fails, the user acceptance machine can take over. As an extra precaution, replication technology can be used to protect critical data.

PowerCenter server grid technology is beneficial when designing and implementing a disaster tolerant system. Normally, server grids are used to balance loads and improve performance on resource-intensive tasks, but they can help reduce disaster recovery time too. Sessions in a workflow can be configured to run on any available server that is registered to the grid. The servers in the grid must be able to create and maintain a connection to each other across the network. If a server unexpectedly shuts down while it is running a session, then the workflow can be set to fail. This depends on the session settings specified and whether the server is configured as a master or worker server.

Although the failed workflow has to be manually recovered if one of the servers
unexpectedly shuts down, other servers in the grid should be available to rerun it,
unless a catastrophic network failure occurs.

The guideline is to aim to avoid single points of failure in a system where possible.
Clustering and server grid solutions alleviate single points of failure. Be aware that
single physical points of failure are often hardware and network related. Be sure to
have backup facilities and spare components available, for example auxiliary
generators, spare network cards, cooling systems; even a torch in case the lights go
out!

Perhaps the greatest risk to a system is human error. Businesses need to provide
proper training for all staff involved in maintaining and supporting the system. Also be
sure to provide documentation and procedures to cope with common support issues.

Remember that a single mistyped command or clumsy action can bring down a whole system.

Disaster Test Planning

After disaster tolerance and system architecture have been considered, you can begin
to prepare the disaster test plan. Allow sufficient time to prepare the plan. Disaster
testing requires a significant commitment in terms of staff and financial resources.
Therefore, the test plan and activities should be precise, relevant, and achievable.

The test plan identifies the overall test objectives; consider what the test goals are and
whether they are worthwhile for the allocated time and resources. Furthermore, the
plan explains the test scope, establishes the criteria for measuring success, specifies
any prerequisites and logistical requirements (e.g., the test environment), includes test
scripts, and clarifies roles and responsibilities.

Test Scope

Test scope identifies the exact systems and functions to be tested. There may not be time to test for every possible disaster scenario; if so, the scope should list the functions or scenarios that cannot be tested and explain why.

Focus on the stress points for each particular application when deciding on the test scope. For example, in a typical data warehouse it is quite easy to recover data during the extract phase (i.e., when data is being extracted from a legacy system based on date/time criteria). It may be more difficult to recover from a downstream data warehouse or data mart load process, however. Be sure to enlist the help of application developers and system architects to identify the stress points in the overall system.

Establish Success Criteria

In theory, success criteria can be measured in several ways. Success can mean identifying a weakness in the system during the test cycle, or successfully executing a series of scripts to recover critical processes that were impacted by the disaster test case.

Use SLAs to help establish quantifiable measures of success. SLAs should already
exist specifically for disaster recovery criteria.

In general, if the disaster testing results meet or beat the SLA standards, then the
exercise can be considered a success.
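
To make the SLA comparison concrete, the following minimal sketch (in Python) checks measured disaster-test figures against assumed SLA targets. The target names and numbers are illustrative assumptions, not values from an actual SLA.

# Minimal sketch: compare measured disaster-test results against SLA targets.
# All names and figures below are illustrative assumptions.
sla_targets = {
    "max_recovery_minutes": 240,   # assumed SLA: recover within 4 hours
    "max_data_loss_minutes": 60,   # assumed SLA: lose no more than 1 hour of data
}
test_results = {
    "max_recovery_minutes": 185,   # measured during the disaster test
    "max_data_loss_minutes": 30,
}

def meets_sla(targets, results):
    """Return True only if every measured figure meets or beats its SLA target."""
    return all(results[name] <= limit for name, limit in targets.items())

print("Disaster test", "PASSED" if meets_sla(sla_targets, test_results) else "FAILED")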

Environment and Logistical Requirements

Logistical requirements include schedules, materials, and premises, as well as hardware and software needs.

Try to prepare a dedicated environment for disaster testing. As new applications are created and improved, they should be tested in the isolated disaster-testing environment. It is important to test for disaster tolerance regularly, particularly if new hardware and/or software components are introduced to the system being tested. Make sure that the testing environment is kept up to date with the code and infrastructure changes that are being applied in the normal system testing environment(s).

The test schedule is important because it explains what will happen and when. For
example, if the electricity supply is going to be turned off or the plug pulled on a
particular server, it must be scheduled and communicated to all concerned parties.

Test Scripts

The disaster test plan should include test scripts, detailing the actions and activities
required to actually conduct the technical tests. These scripts can be simple or
complex, and can be used to provide instructions to test participants. The test scripts
should be prepared by the business analysts and application developers.



Staff Roles and Responsibilities

Encourage the organization's IT security team to participate in the disaster testing exercise. They can assist in simulating an attack on the database, identifying vulnerable access points on the network, and fine-tuning the test plan.

Involve business representatives as well as IT testing staff in the disaster testing exercise. IT testing staff can focus on technical recovery of the system. Business users can identify the key areas for recovery and prepare backup strategies and procedures in case system downtime exceeds normal expectations.

Ensure that the test plan is approved by the appropriate staff members and business
groups.

Executing Disaster Tests

Disaster test execution should expose any flaws in the system architecture or in the
test plan itself. The testing team should be able to run the tests based on the
information within the test plan and the instructions in the test scripts.

Any deficiencies in this area need to be addressed because a good test plan forms the
basis of an overall disaster recovery strategy for the system.

The test team is responsible for capturing and logging test results. It needs to
communicate any issues in a timely manner to the application developers, business
analysts, end-users, and system architects.

It is advisable to involve other business and IT departmental staff in the testing where
possible, not just the department members who planned the test. If other staff can
understand the plan and successfully recover the system by following it, then the
impact of a real disaster is reduced.

Data Migration Projects

While data migration projects do not require a full-blown disaster recovery solution, it is still advisable to establish a disaster recovery plan. Typically this is a simple document identifying the emergency procedures to follow if something happens to any of the major pieces of infrastructure. Additionally, a back-out plan should be in place in case the migration must stop mid-stream during the final implementation weekend.



Conclusion and Postscript

Disaster testing is a critical aspect of the overall system testing strategy. If conducted
properly, disaster testing provides valuable feedback and lessons that will prove
important if a real disaster strikes.

Postscript: Backing Up PowerCenter Components

Apply safeguards to protect important PowerCenter components, even if disaster tolerance is not considered a high priority by the business. Be sure to back up the production repository every day. The backup takes two forms: a database backup of the repository schema organized by the DBA, and a backup using the pmrep command-line syntax, which can be called from a script. It is also advisable to back up the pmserver.cfg, pmrepserver.cfg, and odbc.ini files.
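
As an illustration only, the following sketch shows how a nightly pmrep backup might be scripted (here in Python). The repository name, domain, credentials, file locations, and the exact pmrep flags are assumptions; verify them against the pmrep command reference for the PowerCenter version in use.

# Minimal sketch of a scripted nightly repository backup via pmrep.
# Repository name, domain, credentials, paths, and pmrep flags are assumptions.
import datetime
import subprocess

BACKUP_DIR = "/backups/powercenter"                     # hypothetical location
stamp = datetime.date.today().strftime("%Y%m%d")
backup_file = f"{BACKUP_DIR}/PROD_REP_{stamp}.rep"

# Connect to the repository, then write the backup file (flags are illustrative).
subprocess.run(["pmrep", "connect", "-r", "PROD_REP", "-d", "Domain_Prod",
                "-n", "backup_user", "-x", "secret"], check=True)
subprocess.run(["pmrep", "backup", "-o", backup_file], check=True)
print(f"Repository backup written to {backup_file}")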

Best Practices

Disaster Recovery Planning with PowerCenter HA Option

PowerCenter Enterprise Grid Option

Sample Deliverables
None

Last updated: 06-Dec-07 14:56



Phase 6: Test
Subtask 6.3.5 Conduct Volume Testing

Description

Basic volume testing seeks to verify that the system can cope with anticipated
production data levels. Taken to extremes, volume testing seeks to find the physical
and logical limits of a system; this is also known as stress testing. Stress and volume
testing seek to determine when and if system behavior changes as the load increases.

A volume testing exercise is similar to a disaster testing exercise. The test scenarios
encountered may never happen in the production environment. However, a well-
planned and conducted test exercise provides invaluable reassurance to the business
and IT communities regarding the stability and resilience of the system.

Prerequisites
None

Roles

Data Integration Developer (Secondary)

Database Administrator (DBA) (Primary)

Network Administrator (Secondary)

System Administrator (Secondary)

Test Manager (Primary)

Considerations

Understand Service Level Agreements

Before starting the volume test exercise, consider the Service Level Agreements (SLAs) for the particular system. The SLA should set measures for system availability and for the projected growth over time in the amount of data being stored by the system. The SLAs are the benchmark against which the volume test results are measured.

Estimate Projected Data Volumes Over Time and Consider Peak Load Periods

Enlist the help of the DBAs and Business Analysts to estimate the growth in projected
data volume across the lifetime of the system. Remember to make allowances for any
data archiving strategy that exists in the system. Data archiving helps to reduce the
volume of data in the actual core production system, although of course, the net
volume of data will increase over time. Use the projected data volumes to provide
benchmarks for testing.
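
As a simple worked example of such an estimate, the sketch below projects core system volume with an allowance for archiving; the starting volume, growth rate, and archiving window are illustrative assumptions.

# Minimal sketch of a data-volume projection with an archiving allowance.
# Starting volume, monthly growth, and the archiving window are assumptions.
start_gb = 500.0            # current core production volume
monthly_growth_gb = 40.0    # projected monthly load volume
archive_after_months = 24   # data older than this is archived out of the core system

def projected_core_volume(months):
    """Approximate core volume after 'months', assuming older loads are archived."""
    retained_months = min(months, archive_after_months)
    return start_gb + retained_months * monthly_growth_gb

for m in (6, 12, 24, 36):
    print(f"Month {m}: roughly {projected_core_volume(m):,.0f} GB in the core system")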

Organizations often experience higher than normal periods of activity at predictable times. For example, a retailer or credit card supplier may experience peak activity
during weekends or holiday periods. A bank may have month or year-end processes
and statements to produce. Volume testing exercises should aim to simulate
throughput at peak periods as well as normal periods. Stress testing goes beyond the
peak period data volumes in order to find the limits of the system.

A task such as duplicate record identification (known as data matching in Informatica Data Quality parlance) can place significant demands on system resources. Informatica
Data Quality (IDQ) can perform millions or billions of comparison operations in
a matching process. The time available for the completion of a matching process can
have a big impact on the perception that the plan is running correctly. Bear in mind that,
for these reasons, data matching operations are often scheduled for off-peak periods.

Data matching is also a processor-intensive activity: the speed of the processor has a
significant impact on how fast a matching process completes. If the project includes
data quality operations, consult with a Data Quality Developer when estimating data
volumes over time and peak load periods.

Volume Test Planning

Volume test planning is similar in many ways to disaster test planning. See 6.3.4
Conduct Disaster Recovery Testing for details on disaster test planning guidelines.

However, there are some volume-test specific issues to consider during the planning
stage:



Obtaining Volume Test Data and Data Scrambling

The test team responsible for completing the end-to-end test plan should
ensure that the volume(s) of test data accurately reflect the production business
environment. Obtaining adequate volumes of data for testing in a non-
production environment can be time-consuming and logistically difficult, so
remember to make allowances in the test plan for this.

Some organizations choose to copy data from the production environment into the test system. If data is copied from a production environment, security protocols must be maintained and the data is likely to need to be scrambled. Some of the popular RDBMS products contain built-in scrambling packages; third-party scrambling solutions are also available. Contact the DBA and the IT security manager for guidance on the data scrambling protocol of the department or organization.

For new applications, production data probably does not exist. Some
commercially-available software products can generate large volumes of data.
Alternatively, one of the developers may be able to build a customized suite of
programs to artificially generate data.
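
As an example of the kind of customized program mentioned above, the following sketch generates a large flat file of synthetic customer rows; the column names, value ranges, and row count are illustrative assumptions.

# Minimal sketch of a home-grown volume test data generator.
# Column names, value ranges, and the row count are illustrative assumptions.
import csv
import random
import string

def random_name(length=8):
    return "".join(random.choices(string.ascii_uppercase, k=length))

with open("customer_volume_test.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["customer_id", "customer_name", "balance"])
    for customer_id in range(1, 1000001):   # one million synthetic rows
        writer.writerow([customer_id, random_name(),
                         round(random.uniform(0, 50000), 2)])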

Hardware and Network Requirements and Test Timing

Remember to consider the hardware and network characteristics when conducting volume testing. Do they match the production environment? Be
sure to make allowances for the test results if there is a shortfall in processing
capacity or network limitations on the test environment. Volume testing may
involve ensuring that testing occurs at an appropriate time of day and day of
week, and taking into account any other applications that may negatively affect
the database and/or network resources.

Increasing Data Volumes

Volume testing cycles need to include normal expected volumes of data and some exceptionally high volumes of data. Incorporate peak-period loads into the volume testing schedules. If stress tests are being carried out, data volumes need to be increased even further. Additional pressure can be applied to the system, for example, by adding a high number of database users or temporarily bringing down a server.



Any particular stress test cases need to be logged in the test plan and the test
schedules.

Volume and Stress Test Execution

Volume Test Results Logging

The volume testing team is responsible for capturing volume test results. Be
sure to capture performance statistics for PowerCenter tasks, database
throughput, server performance and network efficiency.

PowerCenter Metadata Reporter provides an excellent method of logging PowerCenter session performance over time. Run the Metadata Reporter for
each test cycle to capture session and workflow lapse time. The results can be
displayed in Data Analyzer dashboards or exported to other media (e.g., PDF
files). The views in the PowerCenter Repository can also be queried directly
with SQL statements.
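
As an illustration of querying the repository views, the sketch below pulls session run times with an SQL statement issued from Python. The view and column names (REP_SESS_LOG, ACTUAL_START, SESSION_TIMESTAMP, SUCCESSFUL_ROWS) are assumptions based on the documented MX views, and the connection details are hypothetical; confirm both against the MX Views reference and the repository database in use.

# Minimal sketch: query an assumed repository view for session lapse times.
# View/column names and connection details are assumptions to be verified.
import cx_Oracle  # assumes an Oracle-hosted repository; use the relevant driver

QUERY = """
    SELECT session_name, actual_start, session_timestamp, successful_rows
      FROM rep_sess_log
     ORDER BY actual_start DESC
"""

conn = cx_Oracle.connect("rep_reader/secret@repdb")   # hypothetical credentials
for name, started, finished, rows in conn.cursor().execute(QUERY):
    elapsed = (finished - started).total_seconds()
    print(f"{name}: {rows} rows in {elapsed:.0f} seconds")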

In addition, collaborate with the network and server administrators regarding the option to capture additional statistics, such as those related to CPU usage, data transfer efficiency, and writing to disk. The types of statistics to capture depend on the operating system in use.

If jobs and tasks are being run through a scheduling tool, use the features
within the scheduling tool to capture lapse time data. Alternatively, use shell
scripts or batch file scripts to retrieve time and process data from the operating
system.
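
Where no scheduling tool is available, a small wrapper like the one below can capture lapse times; the command being timed and the log file name are hypothetical placeholders.

# Minimal sketch of capturing lapse time around a scheduled load step.
# The command and log file are hypothetical placeholders.
import subprocess
import time

start = time.time()
subprocess.run(["./run_nightly_load.sh"], check=True)   # hypothetical load step
elapsed = time.time() - start

with open("load_lapse_times.log", "a") as log:
    log.write(f"nightly_load\t{elapsed:.1f} seconds\n")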

System Limits, Scalability, and Bottlenecks

If the system has been well-designed and built, the applications are more likely
to perform in a predictable manner as data volumes increase. This is known as
scalability and is a very desirable trait in any software system.

Eventually, however, the limits of the system are likely to be exposed as data volumes reach a critical mass and other stresses are introduced into the system. Physical or user-defined limits may be reached on particular parameters. For example, exceeding the maximum file size supported on an operating system constitutes a physical limit. Alternatively, breaching sort space parameters by running a database SQL query probably constitutes a limit that has been defined by the DBA.

Bottlenecks are likely to appear in the load processes before such limits are
exceeded. For example, a SQL query called in a PowerCenter session may
experience a sudden drop in performance when data volumes reach a
threshold figure. The DBA and application developer need to investigate any
sudden drop in the performance of a particular query. Volume and stress testing
is intended to gradually increase the data load in order to expose weaknesses
in the system as a whole.

Conclusion

Volume and stress testing are important aspects of the overall system testing strategy.
The test results provide important information that can be used to resolve issues before
they occur in the live system.

However, be aware that it is not possible to test all scenarios that may cause the
system to crash. A sound system architecture and well-built software applications can
help prevent sudden catastrophic errors.

Best Practices
None

Sample Deliverables
None

Last updated: 18-Oct-07 15:11



Phase 6: Test
Task 6.4 Conduct User Acceptance Testing

Description

User Acceptance Testing (UAT) is arguably the most important step in the project and is crucial to verifying that the system meets the users' requirements. Being focused on business usage, it concentrates on the business requirements rather than on testing all the details of the technical specification. As such, UAT is considered black-box testing (i.e., testing without knowledge of all the underlying logic) that focuses on the deliverables to the end user, primarily through the presentation layer. UAT is the responsibility of the user community in terms of organization, staffing, and final acceptance, but much of the preparation will have been undertaken by IT staff working to a plan agreed with the users. The function of user acceptance testing is to obtain final functional approval from the user community for the solution to be deployed into production. As such, every effort must be made to replicate the production conditions.

Prerequisites
None

Roles

End User (Primary)

Test Manager (Primary)

User Acceptance Test Lead (Primary)

Considerations

Plans

By this time, the User Acceptance Criteria should have been precisely defined by the user community, as, of course, should the specific business objectives and requirements for the project. The UAT acceptance criteria should include:



● tolerable bug levels, based on the defect management procedures
● report validation procedures (data audit, etc.) including “gold standard” reports
to use for validation
● data quality tolerances that must be met
● validation procedures that will be used for comparison against existing systems (especially for validation of data migration/synchronization projects or operational integration)
● required performance tolerances, including response time and usability

As the testers may not have a technical background, the plan should include detailed
procedures for testers to follow. The success of UAT depends on having certain critical
items in place:

● Formal testing plan supported by detailed test scripts


● Properly configured environment, including the required test data (ideally a
copy of the real, production environment and data)
● Adequately experienced test team members from the end user community
● Technical support personnel to support the testing team and to evaluate and
remedy problems and defects discovered

Staffing the User Acceptance Testing

It is important that the user acceptance testers and their management are thoroughly committed to the new system and to ensuring its success. There needs to be communication with the user community so that they are informed of the project's progress and able to identify appropriate members of staff to make available to carry out the testing. These participants will become the users best equipped to adopt the new system and so should be considered "super-users" who may participate in user training thereafter.

Best Practices
None

Sample Deliverables
None

Last updated: 16-Feb-07 14:07



Phase 6: Test
Task 6.5 Tune System Performance

Description

Tuning a system can, in some cases, provide orders of magnitude performance gains.
However, tuning is not something that should just be performed after the system is in
production; rather, it is a concept of continual analysis and optimization. More
importantly, tuning is a philosophy. The concept of performance must permeate all
stages of development, testing, and deployment. Decisions made during the
development process can seriously impact performance and no level of production
tuning can compensate for an inefficient design that must be redeveloped.

The information in this section is intended for use by Data Integration Developers, Data
Quality Developers, Database Administrators, and System Administrators, but should
be useful for anyone responsible for the long-term maintenance, performance, and
support of PowerCenter Sessions, Data Quality Plans, PowerExchange
Connectivity and Data Analyzer Reports.

Prerequisites
None

Roles

Data Integration Developer (Primary)

Data Warehouse Administrator (Primary)

Database Administrator (DBA) (Primary)

Network Administrator (Primary)

Presentation Layer Developer (Primary)

Quality Assurance Manager (Review Only)



Repository Administrator (Primary)

System Administrator (Primary)

System Operator (Primary)

Technical Project Manager (Review Only)

Test Manager (Primary)

Considerations

Tuning the performance of the data integration environment involves more than simply tuning PowerCenter or any other Informatica product. True system performance analysis requires looking at all areas of the environment to determine opportunities for better performance from relational database systems, file systems, network bandwidth, and even hardware. The tuning effort requires benchmarking, followed by small incremental tuning changes to the environment, then re-executing the benchmarked data integration processes to determine the effect of the tuning changes.

Often, tuning efforts mistakenly focus on PowerCenter as the only point of concern
when there may be other areas causing the bottleneck and needing attention. If you
are sourcing data from a relational database for example, your data integration loads
can never be faster than the source database can provide data. If the source database
is poorly indexed, poorly implemented, or underpowered - no amount of downstream
tuning in PowerCenter, hardware, network, file systems etc. can fix the problem of slow
source data access. Throughout the tuning process, the entire end-to-end process
must be considered and measured. The unit of work being baselined may be a single
PowerCenter session for example, but it is always necessary to consider the end-to-
end process of that session in the tuning efforts.

Another important consideration in system tuning is the availability of an ongoing means to monitor system performance. While it is certainly important to focus on a specific area, tune it, and deploy to production to gain benefit, continuously monitoring the performance of the system may reveal areas that show degradation over time, and sometimes even immediate, extreme degradation for one reason or another. Quick identification of these areas allows proactive tuning and adjustments before the problems become catastrophic. A good monitoring system may involve a variety of technologies to provide a full view of the environment.



Note: The PowerCenter Administrator's Guide provides extensive information on
performance tuning and is an excellent reference source on this topic.

For data migration projects, performance is often an important consideration. If a data migration project is the result of the implementation of a new packaged application or operational system, downtime is usually required. Because this downtime may prevent the business from operating, the scheduled outage window must be as short as possible. Therefore, performance tuning is often addressed between system tests.

Best Practices
None

Sample Deliverables
None

Last updated: 01-Feb-07 18:49



Phase 6: Test
Subtask 6.5.1 Benchmark

Description

Benchmarking is the process of running sessions or reports and collecting run statistics to set a baseline for comparison. The benchmark can be used as the standard for comparison after the session or report is tuned for performance. When determining a benchmark, the two key statistics to record are:

● session duration from start to finish, and


● rows per second throughput.

Prerequisites
None

Roles

Data Integration Developer (Primary)

Data Warehouse Administrator (Primary)

Database Administrator (DBA) (Primary)

Network Administrator (Primary)

Presentation Layer Developer (Primary)

Repository Administrator (Primary)

System Administrator (Primary)

Test Manager (Primary)

Considerations



Since the goal of this task is to improve the performance of the entire system, it is
important to choose a variety of mappings to benchmark. Having a variety of
mappings ensures that optimizing one session does not adversely affect the
performance of another session. It is important to work with the same exact data set
each time you run a session for benchmarking and performance tuning. For example, if
you run 1,000 rows for the benchmark, it is important to run the exact same rows for
future performance tuning tests.

After choosing a set of mappings, create a set of new sessions that use the default
settings. Run these sessions when no other processes are running in the background.

Tip
Tracking Results

One way to track benchmarking results is to create a reference spreadsheet. This should record the number of rows processed for each source and target, the session start time, end time, time to complete, and rows per second throughput.

Track two values for rows per second throughput: rows per second as calculated by PowerCenter (from the transformation statistics in the session properties), and the average rows processed per second (the number of rows loaded divided by the total time duration).

If it is not possible to run the session without background processes, schedule the session to run daily at a time when there are not many processes running on the server. Be sure that the session runs at the same time each day or night for benchmarking, and that it runs at the same time for future tests.

Track the performance results in a spreadsheet over a period of days or for several runs. After the statistics are gathered, compile the average of the results in a new spreadsheet. Once the average results are calculated, identify the sessions that have the lowest throughput or that miss their load window. These sessions are the first candidates for performance tuning.
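
A minimal sketch of that bookkeeping follows: it averages the throughput of several benchmark runs per session and lists the slowest sessions first. The session names and run figures are illustrative assumptions.

# Minimal sketch: average benchmark runs and flag the slowest sessions.
# Session names and run figures are illustrative assumptions.
runs = {
    # session name: list of (rows_loaded, duration_seconds) per benchmark run
    "s_m_load_customers": [(100000, 420), (100000, 415), (100000, 440)],
    "s_m_load_orders":    [(250000, 1900), (250000, 2100), (250000, 2050)],
}

averages = {
    session: sum(rows / secs for rows, secs in results) / len(results)
    for session, results in runs.items()
}

# Sessions with the lowest average throughput are the first tuning candidates.
for session, throughput in sorted(averages.items(), key=lambda item: item[1]):
    print(f"{session}: {throughput:,.0f} rows per second on average")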

When the benchmark is complete, the sessions should be tuned for performance. It
should be possible to identify potential areas for improvement by considering the
machine, network, database, and PowerCenter session and server process.

Data Analyzer benchmarking should focus on the time taken to run the source query, generate the report, and display it in the user's browser.

Best Practices
None

Sample Deliverables
None

Last updated: 01-Feb-07 18:49



Phase 6: Test
Subtask 6.5.2 Identify Areas for Improvement

Description

The goal of this subtask is to identify areas for improvement, based on the performance
benchmarks established in Subtask 6.5.1 Benchmark .

Prerequisites
None

Roles

Data Integration Developer (Primary)

Data Warehouse Administrator (Primary)

Database Administrator (DBA) (Primary)

Network Administrator (Primary)

Presentation Layer Developer (Primary)

Repository Administrator (Primary)

System Administrator (Primary)

Test Manager (Primary)

Considerations

After performance benchmarks are established (in 6.5.1 Benchmark ), careful analysis
of the results can reveal areas that may be improved through tuning. It is important to
consider all possible areas for improvement, including:



● Machine. The server itself is a candidate for tuning, regardless of whether the system is UNIX- or NT-based.
● Network. An often-overlooked facet of system performance, network optimization can have a major effect on overall system performance. For example, if the process of moving or FTPing files from a remote server takes four hours and the PowerCenter session takes four minutes, then optimizing and tuning the network may help to shorten the overall process of data movement, session processing, and backup. Key considerations for network performance include the network card and its settings, the network protocol employed, available bandwidth, packet size settings, etc.
● Database. Database tuning is, in itself, an art form and is largely dependent on the DBA's skill, finesse, and in-depth understanding of the database engine. A major consideration in tuning databases is defining throughput versus response time. It is important to understand that analytic solutions define their performance in terms of response time, while many OLTP systems measure their performance in throughput, and most DBAs are schooled in OLTP performance tuning rather than response time tuning. Each of the three functional areas of database tuning (i.e., memory, disk I/O, and processing) must be addressed for optimal performance, or one of the other areas will suffer.
● PowerCenter. Most systems need to tune the PowerCenter session and
server process in order to achieve an acceptable level of performance. Tuning
the server daemon process and individual sessions can increase performance
by a factor of 2 or 3, or more. These goals can be achieved by decreasing the
number of network hops between the server and the databases, and by
eliminating paging of memory on the server running the PowerCenter sessions.
● Data Analyzer. Tuning may be required for the source queries and the reports themselves if generating a report on screen takes too long.

The actual tuning process can begin after the areas for improvement have been
identified and documented.

For data migration projects, other considerations must be included in the performance tuning activities. Many ERP applications have two-step processes in which the data is loaded through simulated on-line processes. More specifically, an API is executed that replicates, in a batch scenario, the way that the on-line entry works, executing all edits. In such a case, performance will not be the same as in a scenario where a relational database is being populated. The best approach to performance tuning is to set the expectation that all data errors should be identified and corrected in the ETL layer prior to the load to the target application. This approach can improve performance by as much as 80%.



Best Practices
None

Sample Deliverables
None

Last updated: 01-Feb-07 18:49



Phase 6: Test
Subtask 6.5.3 Tune Data Integration Performance

Description

The goal of this subtask is to implement system changes to improve overall system
performance, based on the areas for improvement that were identified and documented
in Subtask 6.5.2 Identify Areas for Improvement .

Prerequisites
None

Roles

Data Integration Developer (Primary)

Database Administrator (DBA) (Primary)

Network Administrator (Primary)

Quality Assurance Manager (Review Only)

Repository Administrator (Primary)

System Operator (Primary)

Technical Project Manager (Review Only)

Test Manager (Primary)

Considerations

Performance tuning should include the following steps:



1. Run a session and monitor the server to determine whether the system is paging memory or whether the CPU load is too high for the number of available processors. If the system is paging, correcting the system to prevent paging (e.g., increasing the physical memory available on the machine) can greatly improve performance. A minimal monitoring sketch follows these steps.

2. Re-run the session and monitor the performance details, watching the buffer
input and outputs for the sources and targets.

3. Tune the source system and target system based on the performance details.
Once the source and target are optimized, re-run the PowerCenter session or
Data Analyzer report to determine the impact of the changes.

4. Only after the server, source, and target have been tuned to their peak
performance should the mapping and session be analyzed for tuning. This is
because, in most cases, the mapping is driven by business rules. Since the
purpose of most mappings is to enforce the business rules, and the business
rules are usually dictated by the business unit in concert with the end-user
community, it is rare that the mapping itself can be greatly tuned. Points to look
for in tuning mappings are: filtering unwanted data early, cached lookups,
aggregators that can be eliminated by programming finesse and using sorted
input on certain active transformations. For more details on tuning mappings
and sessions refer to the Best Practices.

5. After the tuning achieves a desired level of performance, the DTM (data
transformation manager) process should be the slowest portion of the session
details. This indicates that the source data is arriving quickly, the target is
inserting the data quickly, and the actual application of the business rules is the
slowest portion. This is the optimal desired performance. Only minor tuning of
the session can be conducted at this point and usually has only a minimal effect.

6. Finally, re-run the benchmark sessions, comparing the new performance with
the old performance. In some cases, optimizing one or two sessions to run
quickly can have a disastrous effect on another mapping and care should be
taken to ensure that this does not occur.
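
The monitoring mentioned in step 1 can be done with the operating system's own tools; purely as an illustration, the sketch below samples CPU load and swap activity from Python using the third-party psutil package. The thresholds and sampling window are assumptions.

# Minimal sketch: watch CPU load and swap (paging) activity during a session run.
# Requires the third-party psutil package; thresholds are illustrative assumptions.
import psutil

prev_swapped_out = psutil.swap_memory().sout
for _ in range(60):                          # roughly one minute of observation
    cpu = psutil.cpu_percent(interval=1)     # blocks for one second per sample
    swapped_out = psutil.swap_memory().sout
    if swapped_out > prev_swapped_out:
        print(f"WARNING: paging detected ({swapped_out - prev_swapped_out} bytes), CPU {cpu:.0f}%")
    elif cpu > 90.0:
        print(f"WARNING: CPU load high at {cpu:.0f}%")
    prev_swapped_out = swapped_out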

Best Practices

Session and Data Partitioning

Sample Deliverables
None



Last updated: 18-Oct-07 15:14



Phase 6: Test
Subtask 6.5.4 Tune Reporting Performance

Description

The goal of this subtask is to identify areas where changes can be made to improve the
performance of Data Analyzer reports.

Prerequisites
None

Roles

Database Administrator (DBA) (Primary)

Network Administrator (Secondary)

Presentation Layer Developer (Primary)

Quality Assurance Manager (Review Only)

Repository Administrator (Primary)

System Administrator (Primary)

Technical Project Manager (Review Only)

Test Manager (Primary)

Considerations

Database Performance

1. Generate SQL for each report and explain this SQL in the database to determine if the most efficient access paths are being used. Tune the database hosting the data warehouse and add indexes on the key tables. Take care in adding indexes since indexes affect ETL load times.

2. Analyze the SQL requests made against the database to identify common patterns in user queries. If you find that many users are running aggregations against detail tables, consider creating an aggregate table in the database and performing the aggregations via ETL processing. This saves time when the user runs the report because the data will already be aggregated.
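
Purely as an illustration of point 2, the sketch below builds a monthly aggregate from an assumed detail table; the table and column names and the connection details are hypothetical, and in practice the aggregate would be refreshed by the ETL load rather than created ad hoc.

# Minimal sketch: pre-aggregate a detail table so reports avoid large scans.
# Table/column names and connection details are hypothetical assumptions.
import cx_Oracle  # assumes an Oracle-hosted warehouse; use the relevant driver

AGGREGATE_SQL = """
    CREATE TABLE sales_monthly_agg AS
    SELECT product_id,
           TRUNC(sale_date, 'MM') AS sale_month,
           SUM(sale_amount)       AS total_sales
      FROM sales_detail
     GROUP BY product_id, TRUNC(sale_date, 'MM')
"""

conn = cx_Oracle.connect("dw_loader/secret@dwdb")   # hypothetical credentials
conn.cursor().execute(AGGREGATE_SQL)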

Data Analyzer Performance

1. Within Data Analyzer, use filters within reports as much as possible. Try to
restrict as much data as possible. Also try to architect reports to start out with a
high-level query, then provide analytic workflows to drill down to more detail.
Data Analyzer report rendering performance is directly related to the number of
rows returned from the database.

2. If the data within the report does not get updated frequently, make the report a
cached report. If the data is being updated frequently, make the report a
dynamic report.

3. Try to avoid sectional reports as much as possible, since they take more time to render.

4. Schedule reports to run during off-peak hours. Reports run in batches can use considerable resources, so such reports should be run when there is least use on the system, subject to other dependencies.

Application Server Performance

1. Fine tune the application server Java Virtual Machine (JVM) to correspond with
the recommendations in the Best Practice on Data Analyzer Configuration and
Performance Tuning. This should significantly enhance Data Analyzer's
reporting performance.

2. Ensure that the application server has sufficient CPU and memory to handle the expected user load. Strawman estimates for CPU and memory are as follows (a small sizing sketch appears after this list):

❍ 1 CPU per 50 users


❍ 1-2 GB RAM per CPU



3. You may need additional memory if a large number of reports are cached, and additional CPUs if a large number of reports are run on demand.
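
As a quick worked example of the strawman arithmetic above, the sketch below turns a concurrent user count into rough CPU and memory figures; these remain planning estimates only.

# Minimal sketch of the strawman sizing arithmetic (1 CPU per 50 users,
# 1-2 GB RAM per CPU). Figures are rough planning estimates only.
import math

def strawman_sizing(concurrent_users, gb_per_cpu=2):
    cpus = max(1, math.ceil(concurrent_users / 50))
    return cpus, cpus * gb_per_cpu

cpus, ram_gb = strawman_sizing(180)
print(f"180 concurrent users: roughly {cpus} CPUs and {ram_gb} GB RAM")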

Best Practices
None

Sample Deliverables
None

Last updated: 16-Feb-07 14:09



Phase 7: Deploy

7 Deploy

● 7.1 Plan Deployment


❍ 7.1.1 Plan User Training
❍ 7.1.2 Plan Metadata Documentation and Rollout
❍ 7.1.3 Plan User Documentation Rollout
❍ 7.1.5 Develop Communication Plan
❍ 7.1.6 Develop Run Book
● 7.2 Deploy Solution
❍ 7.2.1 Train Users
❍ 7.2.2 Migrate Development to Production
❍ 7.2.3 Package Documentation



Phase 7: Deploy

Description

Upon completion of the Build Phase (when both development and testing are finished), the data integration solution is ready to be installed in a production environment and submitted to the ultimate test as a viable solution that meets the users' requirements.

The deployment strategy developed during the Architect Phase is now put into action. During the Build Phase, components are created that may require special initialization steps and procedures. For the production deployment, checklists and procedures are developed to ensure that crucial steps are not missed in the production cutover.

To the end user, this is where the fruits of the project are exposed and the end user
acceptance begins. Up to this point, developers have been developing data cleansing,
data transformations, load processes, reports, and dashboards in one or more
development environments. But whether a project team is developing the back-end
processes for a legacy migration project or the front-end presentation layer for a
metadata management system, deploying a data integration solution is the final step in
the development process.

Metadata, which is the cornerstone of any data integration solution, should play an
integral role in the documentation and training rollout to users. Not only is metadata
critical to the current data integration effort, but it will be integral to planned metadata
management projects down the road. After the solution is actually deployed, it must be
maintained to ensure stability and scalability.

All data integration solutions must be designed to support change as user requirements
and the needs of the business change. As data volumes grow and user interest
increases, organizations face many hurdles such as software upgrades, additional
functionality requests, and regular maintenance. Use the Deploy Phase as a guide to
deploying an on-time, scalable, and maintainable data integration solution that provides
business value to the user community.

Prerequisites
None



Roles

Business Analyst (Primary)

Business Project Manager (Primary)

Data Architect (Secondary)

Data Quality Developer (Primary)

Data Warehouse Administrator (Primary)

Database Administrator (DBA) (Primary)

End User (Secondary)

Metadata Manager (Primary)

Presentation Layer Developer (Primary)

Project Sponsor (Approve)

Quality Assurance Manager (Approve)

Repository Administrator (Primary)

System Administrator (Primary)

Technical Architect (Secondary)

Technical Project Manager (Review Only)

Considerations

None



Best Practices
None

Sample Deliverables
None

Last updated: 01-Feb-07 18:49



Phase 7: Deploy
Task 7.1 Plan Deployment

Description

The success or failure associated with deployment often determines how users and
management perceive the completed data integration solution. The steps involved in
planning and implementing deployment are, therefore, critical to project success. This
task addresses three key areas of deployment planning:

● Training
● Metadata documentation
● User documentation

Prerequisites
None

Roles

Application Specialist (Secondary)

Business Analyst (Review Only)

Data Integration Developer (Secondary)

Database Administrator (DBA) (Primary)

End User (Secondary)

Metadata Manager (Primary)

Project Sponsor (Primary)

Quality Assurance Manager (Review Only)



System Administrator (Secondary)

Technical Project Manager (Secondary)

Considerations

Although training and documentation are considered part of the Deploy Phase, both
activities need to start early in the development effort and continue throughout the
project lifecycle. Neither can be planned nor implemented effectively without the
following:

● Thorough understanding of the business requirements that the data integration solution is intended to address
● In-depth knowledge of the system features and functions and its ability to meet
business users' needs
● Understanding of the target users, including how, when, and why they will be
using the system

Companies that have training and documentation groups in place should include
representatives of these groups in the project development team. Companies that do
not have groups in place need to assign resources on the project team to these tasks,
ensuring effective knowledge transfer throughout the development effort. And,
everyone involved in the system design and build should understand the need for good
documentation and make it a part of his or her everyday activities. This "in-process"
documentation then serves as the foundation for the training curriculum and user
documentation that is generated during the Deploy Phase.

Although most companies have training programs and facilities in place, it is sometimes
necessary to create these facilities to provide training on the data integration solution. If
this is the case, the determination to create a training program must be made as early
in the project lifecycle as possible, and the project plan must specify the necessary
resources and development time. Creating a new training program is a double-edged
sword: it can be quite time-consuming and costly, especially if additional personnel and/or physical facilities are required, but it also gives project management the opportunity
to tailor a training program specifically for users of the solution rather than "fitting" the
training needs into an existing program.

Project management also needs to determine policies and procedures for documenting
and automating metadata reporting early in the deployment process rather than making
reporting decisions on-the-fly.



Finally, it is important to recognize the need to revise the end-user documentation and
training curriculum over the course of the project lifecycle as the system and user
requirements change. Documentation and training should both be developed with an
eye toward flexibility and future change.

For Data Migration projects it is very important that the operations team has the tools
and processes to allow for a mass deployment of large amounts of code at one time, in
a consistent manner. Capabilities should include:

● The ability to migrate code efficiently with little effort


● The ability to report what was deployed
● The ability to roll back changes if necessary

This is why team-based development is normally a part of any data migration project.

Best Practices
None

Sample Deliverables
None

Last updated: 01-Feb-07 18:49



Phase 7: Deploy
Subtask 7.1.1 Plan User Training

Description

Companies often misjudge the level of effort and resources required to plan, create, and successfully
implement a user training program. In some cases, such as legacy migration initiatives, it may be that very
little training is required on the data integration component of the project. However, in most cases, multiple
training programs are required in order to address a wide assortment of user types and needs. For
example, when deploying a metadata management system, it may be necessary to train administrative
users, presentation layer users, and business users separately. When deploying a data conversion project,
on the other hand, it may only be necessary to train administrative users. Note also that users of data
quality applications such as Informatica Data Quality or Informatica Data Explorer will require training, and
that these products may be of interest to personnel at several layers of the organization.

The project plan should include sufficient time and resources for implementing the training program - from
defining the system users and their needs, to developing class schedules geared toward training as many
users as possible, efficiently and effectively, with minimal disruption of everyday activities.

In developing a training curriculum, it is important to understand that there is seldom a "one size fits all"
solution. The first step in planning user training is identifying the system users and understanding both their
needs and their existing level of expertise. It is generally best to focus the curriculum on the needs of
"average" users who will be trained prior to system deployment, then consider the specialized needs of
high-end (i.e., expert) users and novice users who may be completely unfamiliar with decision-support
capabilities. The needs of these specialized users can be addressed most effectively in follow-up classes.

Planning user training also entails ensuring the availability of appropriate facilities. Ideally, training should
take place on a system that is separate from the development and production environments. In most cases,
this system mirrors the production environment, but is populated with only a small subset of data. If a
separate system is not available, training can use either a development or production platform, but this
arrangement raises the possibility of affecting either the development efforts or the production data. In any
case, if sensitive production data is used in a training database, ensure appropriate security measures are
in place to prevent unauthorized users in training from accessing confidential data.

Prerequisites
None

Roles

End User (Secondary)

Considerations

Successful training begins with careful planning. Training content and duration must correspond with end-
user requirements. A well-designed and well-planned training program is a "must have" for a data
integration solution to be considered successfully deployed.



● Business users often do not need to understand the back-end processes and mechanisms
inherent in a data integration solution, but they do need to understand the access tools, the
presentation layer, and the underlying data content to use it effectively. Thus, training should focus
on these aspects, simplifying the necessary information as much as possible and organizing it to
match the users' requirements.
● Training for business users usually focuses on three areas:

❍ The presentation layer


❍ Data content
❍ Application

● While the presentation layer is often the primary focus of training, data content and application
training are also important to business users. Many companies overlook the importance of training
users on the data content and application, providing only data access tool training. In this case,
users often fail to understand the full capabilities of the data integration system and the company
is unlikely to achieve optimal value from the system.

Careful curriculum preparation includes developing clear, attractive training materials, including good
graphics and well-documented exercise materials that encourage users to practice using the system
features and functions. Laboratory materials can make or break a training program by encouraging users to
try using the system on their own. Training materials that contain obvious errors or poorly documented
procedures actually discourage users from trying to use the system, as does a poorly-designed
presentation layer. If users do not gain confidence using the system during training, they are unlikely to use
the data integration solution on a regular basis in their everyday activities.

The training curriculum should include a post-training evaluation process that provides users with an
opportunity to critique the training program, identifying both its strengths and weaknesses and making
recommendations for future or follow-up training classes. The evaluation should address the effectiveness
of both the course and the trainer because both are crucial to the success of a training program.

As an example, the curriculum for a two-day training class on a data integration solution might look
something like this:

2-Day Data Integration Solution Training Class

Curriculum                                                  Duration

Day 1
Introduction and orientation                                1 hour
High-level description & conceptualization tour
of the data integration architecture                        2 hours
Lunch                                                       1 hour
Data content training                                       2 hours
Introduction to the presentation layer                      1 hour

Day 2
Introduction to the application                             1 hour
Introduction to metadata                                    2 hours
Lunch                                                       1 hour
Integrated application & presentation layer laboratory      3 hours

Best Practices
None

Sample Deliverables
None

Last updated: 01-Feb-07 18:49



Phase 7: Deploy
Subtask 7.1.2 Plan Metadata Documentation and Rollout

Description

Whether a data integration project is being implemented as a “single-use effort” such as a legacy migration project, or as a longer-term initiative such as data
synchronization (e.g., “Single View of Customer”), metadata documentation is critical to
the overall success of the project. Metadata is the information map for any data
integration effort. Proper use and enforcement of metadata standards will, for example,
help ensure that future audit requirements are met, and that business users have the
ability to learn exactly how their data is migrated, transformed, and stored throughout
various systems. When metadata management systems are built, thorough metadata
documentation provides end users with an even clearer picture of the potentially vast
impact of seemingly minor changes in data structures.

This subtask uses the example of a PowerCenter development environment to discuss the importance of documenting metadata. However, it is important to remember that
metadata documentation is just as important for metadata management and
presentation-layer development efforts.

On the front-end, the PowerCenter development environment is graphical, easy-to-understand, and intuitive. On the back-end, it is possible to capture each step of the
data integration process in the metadata, using manual and automatic entries into the
metadata repository. Manual entries may include descriptions and business names, for
example; automatic entries are produced while importing a source or saving a mapping.

Because every aspect of design can potentially be captured in the PowerCenter repository, careful planning is required early in the development process to properly
capture the desired metadata. Although it is not always easy to capture important
metadata, every effort must be expended to satisfy this component of business
documentation requirements.

Prerequisites
None

Roles



Database Administrator (DBA) (Primary)

Metadata Manager (Primary)

System Administrator (Secondary)

Technical Project Manager (Review Only)

Considerations

During this subtask, it is important to decide what metadata to capture, how to access
it, and when to place change control check points in the process to maintain all the
changes in the metadata.

The decision about which kinds of metadata to capture is driven by business requirements and project timelines. While it may be beneficial for a developer to enter
detailed descriptions of each column, expression, variable, and so forth, it would also
be very time-consuming. The decision, therefore, should be based on how much
metadata is actually required by the systems that use metadata.

From the developer's perspective, PowerCenter provides the ability to enter descriptive
information for all repository objects, sources, targets, and transformations. Moreover,
column level descriptions of the columns in a table, as well as all information about
column size and scale, datatypes, and primary keys are stored in the repository. This
enables business users to maintain information on the actual business name and
description of a field on a particular table. This ability helps users in a number of ways: for example, it eliminates confusion about which columns should be used for a calculation. 'C_Year' and 'F_Year' might be column names on a table, but 'Calendar Year' and 'Fiscal Year' are more useful to business users trying to calculate market share for the company's fiscal year.

Informatica does not recommend accessing the repository tables directly, even for
select access, because the repository structure can change with any product release.
Informatica provides several methods of gaining access to this data:

● The PowerCenter Metadata Reporter (PCMR) provides Web-based access to the PowerCenter repository. With PCMR, developers and administrators can
perform both operational and impact analysis on their data integration projects.
● Informatica continues to provide the “MX Views”, a set of views that are installed with the PowerCenter repository. The MX Views are meant to provide query-level access to repository metadata.

MX2 is a set of encapsulated objects that can communicate with the metadata
repository through a standard interface. These MX2 objects offer developers an
advanced object-based API for accessing and manipulating the PowerCenter
Repository from a variety of programming languages.

Best Practices
None

Sample Deliverables
None

Last updated: 01-Feb-07 18:49



Phase 7: Deploy
Subtask 7.1.3 Plan User Documentation Rollout

Description

Good system and user documentation is invaluable for a number of data integration
system users, such as:

● New data integration or presentation layer developers;


● Enterprise architects trying to develop a clear picture of how systems, data, and metadata are connected throughout an organization;
● Management users who are learning to navigate reports and dashboards; and
● Business users trying to pull together analytical information for an executive
report.

A well-documented project can save development and production team members both time and effort in getting the new system into production and new employees up to speed.

User documentation usually consists of two sets: one geared toward ad-hoc users,
providing details about the data integration architecture and configuration; and another
geared toward "push button" users, focusing on understanding the data, and providing
details on how and where they can find information within the system. This increasingly
includes documentation on how to use and/or access metadata.

Prerequisites
None

Roles

Business Analyst (Review Only)

Quality Assurance Manager (Review Only)

Considerations



Good documentation cannot be implemented in a haphazard manner. It requires
careful planning and frequent review to ensure that it meets users' needs and is easily
accessible to everyone that needs it. In addition, it should incorporate a feedback
mechanism that encourages users to evaluate it and recommend changes or additions.

To improve users' ability to access the content effectively and to increase their understanding of it, many companies create resource groups within the
business organization. Group members attend detailed training sessions and work with
the documentation and training specialists to develop materials that are geared toward
the needs of typical, or frequent, system users like themselves. Such groups have two
benefits: they help to ensure that training and documentation materials are on-target for
the needs of the users, and they serve as in-house experts on the data integration
architecture, reducing users' reliance on the central support organization.

Best Practices
None

Sample Deliverables
None

Last updated: 01-Feb-07 18:49



Phase 7: Deploy
Subtask 7.1.5 Develop Communication Plan

Description

A communication plan should be developed that details the communications and coordination for the production rollout of the data integration solution. The plan should specify where key communication information will be stored, who will be communicated with, and how much communication will be provided. This information will initially reside in a stand-alone document; upon project management approval, it will be added to the run book.

A comprehensive communication plan can ensure that all required people in the
organization are ready for the production deployment. Since many of them can be
outside of the immediate data integration project team, it cannot be assumed
that everyone is always up to date on the production go-live planning and timing. For
example you may need to communicate with DBA's, IT infrastructure, web support
teams, and other system owners that may have assigned tasks and
monitoring activities during the first production run. The communication plan will
ensure proper and timely communication across the organization so there are no
surprises when the production run is initiated.

Prerequisites

7.1.4 Develop Punch List

Roles

Application Specialist (Secondary)

Data Integration Developer (Secondary)

Database Administrator (DBA) (Secondary)

Production Supervisor (Primary)



Project Sponsor (Review Only)

System Administrator (Secondary)

Technical Project Manager (Secondary)

Considerations

The communication plan must include the steps to take if a specific person on the plan is unresponsive, escalation procedures, and emergency communication protocols (e.g., how the entire core project team would communicate in a dire emergency). Since many go-live events occur over weekends, it is also important to record not only business contact information but also weekend contact information, such as cell phone or pager numbers, in case a key contact needs to be reached on a non-business day.
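To make the escalation path unambiguous, some teams also capture the contact and escalation details in a small, machine-readable structure that can be pasted into the run book. The following is a minimal sketch in Python; the names, numbers, and escalation order are placeholders for illustration only and are not part of the Velocity deliverables.

from dataclasses import dataclass, field
from typing import List

@dataclass
class Contact:
    """One entry in the go-live communication plan (all values are placeholders)."""
    name: str
    role: str
    business_phone: str
    weekend_phone: str      # cell or pager for non-business days
    escalation_order: int   # 1 = contact first

@dataclass
class CommunicationPlan:
    contacts: List[Contact] = field(default_factory=list)

    def escalation_path(self) -> List[Contact]:
        """Return contacts in the order they should be called if someone is unresponsive."""
        return sorted(self.contacts, key=lambda c: c.escalation_order)

# Hypothetical usage
plan = CommunicationPlan(contacts=[
    Contact("A. Admin", "DBA", "x1234", "555-0100", escalation_order=2),
    Contact("P. Super", "Production Supervisor", "x5678", "555-0101", escalation_order=1),
])
for c in plan.escalation_path():
    print(f"{c.escalation_order}. {c.name} ({c.role}) - weekend: {c.weekend_phone}")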

Best Practices
None

Sample Deliverables

Data Migration Communication Plan

Last updated: 01-Feb-07 18:49



Phase 7: Deploy
Subtask 7.1.6 Develop Run
Book

Description

The Run Book contains detailed descriptions of the tasks from the punch list used for the first production run. It details those tasks more explicitly for the individual mock runs and the final go-live production run.

Typically, the punch list is created for the first trial cutover or mock run, and the run book is developed during the first and second trial cutovers and completed by the start of the final production go-live.

Prerequisites

7.1.4 Develop Punch List

7.1.5 Develop Communication Plan

Roles

Application Specialist (Secondary)

Data Integration Developer (Secondary)

Database Administrator (DBA) (Secondary)

Production Supervisor (Primary)

Project Sponsor (Review Only)

System Administrator (Secondary)

Technical Project Manager (Secondary)



Considerations

One of the biggest challenges in completing a run book (as with an operations manual) is providing an adequate level of detail. It is important to find a balance between providing too much information, which makes the document unwieldy and unlikely to be used, and providing too little detail, which could jeopardize the successful execution of the tasks.

For data migration projects this is even more important, since there is normally only one critical go-live event. This is the one chance to achieve a successful production go-live without negatively impacting the operational systems that depend on the migrated data. The run book is developed and refined during the trial cutovers and should contain all the information necessary to ensure a successful migration, including the go/no-go procedure. The run book for a data migration project eliminates the need for the operations manual that accompanies most other data integration solutions.

Best Practices
None

Sample Deliverables

Data Migration Run Book

Last updated: 01-Feb-07 18:50



Phase 7: Deploy
Task 7.2 Deploy Solution

Description

Successfully deploying a data integration solution involves managing the migration from development through production, training end users, and providing clear and consistent documentation. These are all critical factors in determining the success (or failure) of an implementation effort.

Before the deployment tasks are undertaken however, it is necessary to determine the
organization's level of preparedness for the deployment and thoroughly plan end-user
training materials and documentation. If all prerequisites are not satisfactorily
completed, it may be advisable to delay the migration, training, and delivery of finalized
documentation rather than hurrying through these tasks solely to meet a predetermined
target delivery date.

For data migration projects, it is important to understand that some packaged applications, such as SAP, have their own deployment strategies. The deployment strategies for Informatica processes should take this into account and, when applicable, align with them.

Prerequisites
None

Roles

Business Analyst (Primary)

Business Project Manager (Primary)

Data Architect (Secondary)

Data Integration Developer (Primary)

Data Warehouse Administrator (Primary)



Database Administrator (DBA) (Primary)

Presentation Layer Developer (Primary)

Production Supervisor (Approve)

Quality Assurance Manager (Approve)

Repository Administrator (Primary)

System Administrator (Primary)

Technical Architect (Secondary)

Technical Project Manager (Approve)

Considerations

None

Best Practices
None

Sample Deliverables
None

Last updated: 01-Feb-07 18:50



Phase 7: Deploy
Subtask 7.2.1 Train Users

Description

Before training can begin, company management must work with the development
team to review the training curriculum to ensure that it meets the needs of the various
application users. First, however, management and the development team need to
understand just who the users are and how they are likely to use the application.
Application users may include individuals who have reporting needs and need to
understand the presentation layer; operational users who need to review the content
being delivered by a data conversion system; administrative users managing the
sourcing and delivery of metadata across the enterprise; production operations
personnel responsible for day-to-day operations and maintenance; and more.

After the training curriculum is planned and users are scheduled to attend classes appropriate to their needs, a training environment must be prepared for the training sessions. This involves ensuring that a "laboratory environment" is set up properly for multiple concurrent users, and that clean data is available in that environment. If the presentation layer is not ready, or the data appears incomplete or inaccurate, users may lose interest in the application and choose not to use it for their regular business tasks. This lack of interest can leave a resource that is critical to business success underutilized.

It is also important to prevent untrained users from accessing the system; otherwise, the support staff is likely to be overburdened and to spend a significant amount of time providing on-the-job training.

Prerequisites
None

Roles

Business Analyst (Primary)

Business Project Manager (Primary)

Data Integration Developer (Primary)



Data Warehouse Administrator (Secondary)

Presentation Layer Developer (Primary)

Technical Project Manager (Review Only)

Considerations

It is important to consider the many and varied roles of all application users when
planning user training. The user roles should be defined up-front to ensure that
everyone who needs training receives it. If the roles are not defined up-front, some key
users may not be properly trained, resulting in a less-than-optimal hand-off to the user
departments. For example, in addition to training obvious users such as the operational
staff, it may be important to consider users such as DBAs, data modelers, and
metadata managers, at least from a high-level perspective, and ensure that they
receive appropriate training.

The training curriculum should educate users about the data content as well as the
effective use of the data integration system. While correct and effective use of the
system is important, a thorough understanding of the data content helps to ensure that
training moves along smoothly without interruption for ad-hoc questions about the
meaning or significance of the data itself. Additionally, it is important to remember that
no one training curriculum can address all needs of all users. The basic training class
should be geared toward the average user with follow-up classes scheduled for those
users needing training on the application's advanced features.

It is also wise to schedule follow-up training for data and tool issues that are likely to
arise after the deployment is complete and the end-users have had time to work with
the tools and data. This type of training can be held in informal "question and answer"
sessions rather than formal classes.

Finally, be sure that training objectives are clearly communicated between company
management and the development team to ensure complete satisfaction with the
training deliverable. If the training needs of the various user groups vary widely, it may
be necessary to obtain additional training staff or services from a vendor or consulting
firm.

Best Practices
None



Sample Deliverables

Training Evaluation

Last updated: 01-Feb-07 18:50



Phase 7: Deploy
Subtask 7.2.2 Migrate
Development to Production

Description

To successfully migrate PowerCenter or Data Analyzer objects from one environment to another (from development to production, for example), a number of tasks must be completed. These tasks are organized into three phases:

● Pre-deployment phase
● Deployment phase
● Post-deployment phase

Each phase is detailed in the ‘Considerations’ section.

While there are multiple tasks to perform in the deployment process, the actual migration phase consists of moving objects from one environment to another. A migration can include the following objects:

● PowerCenter - mappings, sessions, workflows, scripts, parameter files, stored procedures, etc.
● Data Analyzer - schemas, reports, dashboards, schedules, global variables.
● PowerExchange/CDC - datamaps and registrations.
● Data Quality - plans and dictionaries.

Prerequisites
None

Roles

Data Warehouse Administrator (Primary)

Database Administrator (DBA) (Primary)



Production Supervisor (Approve)

Quality Assurance Manager (Approve)

Repository Administrator (Primary)

System Administrator (Primary)

Technical Project Manager (Approve)

Considerations

The tasks below should be completed before, during, and after the migration to ensure
a successful deployment. Failure to complete one or more of these tasks can result in
an incomplete or incorrect deployment.

Pre-deployment tasks:

● Ensure all objects have been successfully migrated and tested in the Quality
Assurance environment.
● Ensure the Production environment is compliant with specifications and is
ready to receive the deployment.
● Obtain sign-off from the deployment team and project teams to deploy to the
Production environment.
● Obtain sign-off from the business units to migrate to the Production
environment.

Deployment tasks:

● Verify the consistency of the connection object names across environments to ensure that the connections are being made to the production sources and targets. If not, manually change the connections for each incorrect session to source and target the production environment (a simple name-comparison sketch follows this list).
● Determine the method of migration to use (i.e., folder copy or deployment group). If you are going to use the folder copy method, make sure the shared folders are copied before the non-shared folders. If you are going to use the deployment group method, make sure all of the objects to be migrated are checked in, and refresh the deployment group accordingly.

● Data Analyzer objects that reference new tables require that the schemas be migrated before the reports. Make sure the new tables are associated with the proper data source and that the data connectors are connected to the new schemas.
● Synchronize the deployment window with the maintenance window to minimize the impact on end users. If the deployment window is longer than the regular maintenance window, it may be necessary to coordinate with the business unit to minimize the impact on end users.
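The connection-name verification in the first deployment task above can be scripted. A minimal sketch, assuming the connection names for each environment have been exported to plain text files (one name per line) by the Repository Administrator; the file names are illustrative:

# Minimal sketch: compare connection object names between two environments.
# Assumes each environment's connection names were exported to a text file,
# one name per line (how you export them depends on your tooling).

def load_names(path: str) -> set:
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}

def compare_connections(qa_file: str, prod_file: str) -> None:
    qa, prod = load_names(qa_file), load_names(prod_file)
    missing_in_prod = sorted(qa - prod)
    extra_in_prod = sorted(prod - qa)
    if not missing_in_prod and not extra_in_prod:
        print("Connection names are consistent across environments.")
        return
    for name in missing_in_prod:
        print(f"WARNING: connection '{name}' exists in QA but not in production")
    for name in extra_in_prod:
        print(f"NOTE: connection '{name}' exists only in production")

if __name__ == "__main__":
    compare_connections("qa_connections.txt", "prod_connections.txt")

Running the comparison before the go-live gives the team an objective checkpoint for this task.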

Post-deployment tasks:

● Communicate with the management team members on all aspects of the migration (i.e., problems encountered, solutions, tips and tricks, etc.).
● Finalize and deliver the documentation.
● Obtain final user and project sponsor acceptance.

Finally, when deployment is complete, develop a project close document to evaluate the overall effectiveness of the project (i.e., successes, recommended improvements, lessons learned, etc.).

Best Practices

Deployment Groups

Migration Procedures - PowerCenter

Using PowerCenter Labels

Migration Procedures - PowerExchange

Deploying Data Analyzer Objects

Sample Deliverables

Project Close Report



Last updated: 01-Feb-07 18:50



Phase 7: Deploy
Subtask 7.2.3 Package
Documentation

Description

The final tasks in deploying the new application are:

● Gathering all of the various documents that have been created during the life of the project;
● Updating and/or revising them as necessary; and
● Distributing them to the departments and individuals that will need them to use or supervise use of the application. By this point, management should have reviewed and approved all of the documentation.

Documentation types and content vary widely among projects, depending on the type of engagement, expectations, project scope, and so forth. Some typical deliverables include those listed in the Sample Deliverables section.

Prerequisites
None

Roles

Business Analyst (Approve)

Business Project Manager (Primary)

Data Architect (Secondary)

Data Integration Developer (Primary)

Data Warehouse Administrator (Secondary)

Database Administrator (DBA) (Secondary)



Presentation Layer Developer (Primary)

Production Supervisor (Approve)

Technical Architect (Primary)

Technical Project Manager (Review Only)

Considerations

None

Best Practices
None

Sample Deliverables
None

Last updated: 01-Feb-07 18:50



Phase 8: Operate

8 Operate

● 8.1 Define Production Support Procedures
❍ 8.1.1 Develop Operations Manual
● 8.2 Operate Solution
❍ 8.2.1 Execute First Production Run
❍ 8.2.2 Monitor Load Volume
❍ 8.2.3 Monitor Load Processes
❍ 8.2.4 Track Change Control Requests
❍ 8.2.5 Monitor Usage
❍ 8.2.6 Monitor Data Quality
● 8.3 Maintain and Upgrade Environment
❍ 8.3.1 Maintain Repository
❍ 8.3.2 Upgrade Software



Phase 8: Operate

Description

The Operate Phase is the final step in the development of a data integration solution. This phase is sometimes referred to as production support.

During its day-to-day operations the system continually faces new challenges such as
increased data volumes, hardware and software upgrades, and network or other
physical constraints. The goal of this phase is to keep the system operating smoothly
by anticipating these challenges before they occur and planning for their resolution.

Planning is probably the most important task in the Operate Phase. Often, the project
team plans the system's development and deployment, but does not allow adequate
time to plan and execute the turnover to day-to-day operations. Many companies have
dedicated production support staff with both the necessary tools for system monitoring
and a standard escalation process. This team requires only the appropriate system
documentation and lead time to be ready to provide support. Thus, it is imperative for
the project team to acknowledge this support capability by providing ample time to
create, test, and turn over the deliverables discussed throughout this phase.

Prerequisites
None

Roles

Business Project Manager (Primary)

Data Integration Developer (Secondary)

Data Steward/Data Quality Steward (Primary)

Data Warehouse Administrator (Secondary)

Database Administrator (DBA) (Primary)



Presentation Layer Developer (Secondary)

Repository Administrator (Primary)

System Administrator (Primary)

System Operator (Primary)

Technical Project Manager (Review Only)

Considerations

None

Best Practices
None

Sample Deliverables
None

Last updated: 01-Feb-07 18:50



Phase 8: Operate
Task 8.1 Define Production
Support Procedures

Description

In this task, the project team produces an Operations Manual, which tells system
operators how to run the system on a day-to-day basis. The manual should include
information on how to restart failed processes and who to contact in the event of a
failure. In addition, this task should produce guidelines for performing system upgrades
and other necessary changes to the system throughout the project's lifetime. Note that
this task must occur prior to the system actually going live. The production support
procedures should be clear to system operators even before the system is in
production, because any production issues that are going to arise will probably do so
very shortly after the system goes live.

Prerequisites
None

Roles

Data Integration Developer (Secondary)

Production Supervisor (Primary)

System Operator (Review Only)

Considerations

The watchword here is: Plan Ahead. Most organizations have well-established and documented system support procedures in place. The support procedures for the solution should fit into these existing procedures, deviating only where absolutely necessary, and then only with the prior knowledge and approval of the Project Manager and Production Supervisor. Any such deviations should be determined and documented as early as possible in the development effort, preferably before the system actually goes live. Be sure to thoroughly document specific procedures and contact information for problem escalation, especially if the procedures or contacts differ from the existing problem escalation plan.

Best Practices
None

Sample Deliverables
None

Last updated: 01-Feb-07 18:50



Phase 8: Operate
Subtask 8.1.1 Develop
Operations Manual

Description

After the system is deployed, the Operations Manual is likely to be the most frequently-
used document in the operations environment. The system operators - the individuals
who monitor the system on a day-to-day basis - use this manual to determine how to
run the various pieces of the implemented solution. In addition, the manual provides the
operators with error processing information, as well as reprocessing steps in the event
of a system failure.

The Operations Manual should contain a high-level overview of the system in order to
familiarize the operations staff with new concepts along with the specific details
necessary to successfully execute day-to-day operations. For data visualization, the
Operations Manual should contain high-level explanations of reports, dashboards, and
shared objects in order to familiarize the operations staff with those concepts.

For a data integration/migration/consolidation solution, the manual should provide operators with the necessary information to perform the following tasks:

● Run workflows, worklets, tasks, and any external code
● Recover and restart workflows
● Notify the appropriate second-tier support personnel in the event of a serious system malfunction
● Record the appropriate monitoring data during and after workflow execution (i.e., load times, data volumes, etc.)

For a data visualization or metadata reporting solution, the manual should include details on the following:

● Run reports and schedules
● Rerun scheduled reports
● Source, target, database, web server, and application server information
● Notify the appropriate second-tier support personnel in the event of a serious system malfunction

● Record the appropriate monitoring data (i.e., report run times, frequency, data
volumes, etc.)

Operations manuals for all projects should provide information for performing the
following tasks:

● Start servers
● Stop servers
● Notify the appropriate second-tier support personnel in the event of a serious
system malfunction
● Test the health of the reporting and/or data integration environment (i.e., check DB connections to the repositories, source and target databases/files, and real-time feeds; check CPU and memory usage on the PowerCenter and Data Analyzer servers). A minimal health-check sketch follows this list.
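Parts of the health check in the last bullet can be scripted so that operators run a single command rather than several manual checks. A minimal sketch, assuming simple TCP reachability is an acceptable first-level check; the host names and ports are placeholders and should be replaced with the values documented in the Operations Manual:

# Minimal health-check sketch (placeholder hosts/ports; adapt to your environment).
import socket

CHECKS = [
    ("Repository database", "repo-db.example.com", 1521),
    ("Target warehouse database", "dw-db.example.com", 1521),
    ("PowerCenter server host", "pc-server.example.com", 6001),
    ("Data Analyzer web server", "da-web.example.com", 80),
]

def port_is_open(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def run_health_checks() -> int:
    failures = 0
    for label, host, port in CHECKS:
        ok = port_is_open(host, port)
        print(f"{label:35s} {host}:{port:<6} {'OK' if ok else 'UNREACHABLE'}")
        failures += 0 if ok else 1
    return failures

if __name__ == "__main__":
    raise SystemExit(run_health_checks())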

Prerequisites
None

Roles

Data Integration Developer (Secondary)

Production Supervisor (Primary)

System Operator (Review Only)

Considerations

A draft version of the Operations Manual can be started during the Build Phase as the
developers document the individual components. Documents such as mapping
specifications, report specifications, and unit and integration testing plans contain a
great deal of information that can be transferred into the Operations Manual. Bear in
mind that data quality processes are executed earlier, during the Design Phase,
although the Data Quality Developer and Data Integration Developer will be available
during the Build Phase to agree on any data quality measures (such as ongoing run-
time data quality process deployment) that need to be added to the Operations Manual.



The Operations Manual serves as the handbook for the production support team.
Therefore, it is imperative that it be accurate and kept up-to-date. For example, an
Operations Manual typically contains names and phone numbers for on-call support
personnel. Keeping this information consolidated in a central place in the document
makes it easier to maintain.

Restart and recovery procedures should be thoroughly tested and documented, and
the processing window should be calculated and published. Escalation procedures
should be thoroughly discussed and distributed so that members of the development
and operations staff are fully familiar with them. In addition, the manual should include
information on any manual procedures that may be required, along with step-by-step
instructions for implementing the procedures. This attention to detail helps to ensure a
smooth transition into the Operate Phase.

Although it is important, the Operations Manual is not meant to replace user manuals
and other support documentation. Rather, it is intended to provide system operators
with a consolidated source of documentation to help them support the system. The
Operations Manual also does not replace proper training on PowerCenter, Data
Analyzer, and supporting products.

Best Practices
None

Sample Deliverables

Operations Manual

Last updated: 01-Feb-07 18:50



Phase 8: Operate
Task 8.2 Operate Solution

Description

After the data integration solution has been built and deployed, the job of running it
begins. For a data migration or consolidation solution, the system must be monitored to
ensure that data is being loaded into the database. A data visualization or metadata
reporting solution should be monitored to ensure that the system is accessible to the
end users. The goal of this task is to ensure that the necessary processes are in place
to facilitate the monitoring of and the reporting on the system's daily processes.

Prerequisites
None

Roles

Business Project Manager (Primary)

Data Steward/Data Quality Steward (Primary)

Data Warehouse Administrator (Secondary)

Database Administrator (DBA) (Primary)

Presentation Layer Developer (Secondary)

Project Sponsor (Primary)

Repository Administrator (Review Only)

System Administrator (Primary)

System Operator (Primary)



Technical Project Manager (Review Only)

Considerations

None

Best Practices
None

Sample Deliverables
None

Last updated: 01-Feb-07 18:50



Phase 8: Operate
Subtask 8.2.1 Execute First
Production Run

Description

Once a data integration solution is fully developed, tested, and signed off for production, it is time to execute the first run in the production environment. During the implementation, the first run is key to a successful deployment. While the first run is often similar to the ongoing load process, it can be distinctly different. There are often specific one-time setup tasks that need to be executed on the first run that will not be part of the regular daily data integration process.

In most cases the first production run is a high-profile set of activities that must be
executed, documented, and improved for all future production runs. This run should
leverage a Punch List and should execute a set of tested workflows or scripts
(not manual steps such as executing a specific SQL statement for set-up).

It is important that the first run is executed successfully with limited manual interactions.
Any manual steps should be closely monitored, controlled, documented and
communicated.

This first run should be executed following the Punch List and should be revisited upon
completion of the execution.

Prerequisites

6.3.2 Execute Complete System Test

7.2.2 Migrate Development to Production

Roles

Database Administrator (DBA) (Primary)

Production Supervisor (Primary)



System Administrator (Primary)

System Operator (Primary)

Technical Project Manager (Review Only)

Considerations

For some projects (such as a data migration effort), the first production run is the production system; it will not go on beyond the first production run, since a data migration by its nature requires a single movement of the production data. Further, the set of tasks that make up the production run may not be executed again. Any future runs will be part of an execution that addresses a specific data problem, not the entire batch.

For data warehouses, the first production run often includes loading historical data as well as initial loads of code tables and dimension tables. The load process may execute much longer than a typical ongoing load because of the extra volume of data and the different criteria used to pick up the historical data. There may be extra data validation and verification at the end of the first production run to ensure that the system is properly initialized and ready for ongoing loads. It is important to plan and execute the first load properly, as the subsequent periodic refreshes of the data warehouse (daily, hourly, real time) depend on the setup and success of the first production run.

Best Practices
None

Sample Deliverables

Data Migration Run Book

Operations Manual

Punch List

Last updated: 01-Feb-07 18:50



Phase 8: Operate
Subtask 8.2.2 Monitor Load
Volume

Description

Increasing data volume is a challenge throughout the life of a data integration solution.
As the data migration or consolidation system matures and new data sources are
introduced, the amount of data processed and loaded into the database continues to
grow. Similarly, as a data visualization or metadata management system matures, the
amount of data processed and presented increases. One of the operations team's
greatest tasks is to monitor the data volume processed by the system to determine any
trends that are developing.

If generated correctly, the data volume estimates used by the Technical Architect and the development team in building the architecture should ensure that the system is capable of growing to meet ever-changing business requirements. By continuously monitoring volumes, however, the development and operations teams can act proactively as data volumes increase. Monitoring affords team members the time necessary to determine how best to accommodate the increased volumes.

Prerequisites
None

Roles

Production Supervisor (Secondary)

System Operator (Primary)

Considerations

Installing PowerCenter Reporting using Data Analyzer with the Repository and Administrative reports can help monitor load volumes. The Session Run Details report can be configured to provide the following:

● Successful rows sourced
● Successful rows written
● Failed rows sourced
● Failed rows written
● Session duration

The Session Run Details report can also be configured to display data over ranges of
time for trending. This information provides the project team with both a measure of the
increased volume over time and an understanding of the increased volume's impact on
the data load window.

Dashboards and alerts can be set up to monitor loads on an ongoing basis, alerting data integration administrators if load times exceed specified thresholds. By customizing the standard reports, data integration support staff can create any variety of monitoring levels -- from individual projects to full daily load processing statistics -- across all projects.
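Where the packaged reports are not available, the same session-level statistics can be trended with a small script. A minimal sketch, assuming the operations team exports session run details to a CSV file; the column names used here are illustrative, not a PowerCenter-defined layout:

# Minimal sketch: trend successful rows written per session from an exported CSV.
# Expected (illustrative) columns: session_name, run_date (YYYY-MM-DD), rows_written
import csv
from collections import defaultdict

def volume_trend(csv_path: str) -> dict:
    """Return {session_name: [(run_date, rows_written), ...]} sorted by date."""
    trend = defaultdict(list)
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            trend[row["session_name"]].append((row["run_date"], int(row["rows_written"])))
    return {name: sorted(runs) for name, runs in trend.items()}

def report_growth(csv_path: str) -> None:
    for name, runs in volume_trend(csv_path).items():
        first, last = runs[0][1], runs[-1][1]
        growth = (last - first) / first * 100 if first else float("inf")
        print(f"{name}: {first} -> {last} rows ({growth:+.1f}% over {len(runs)} runs)")

if __name__ == "__main__":
    report_growth("session_run_details.csv")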

Best Practices
None

Sample Deliverables
None

Last updated: 01-Feb-07 18:50



Phase 8: Operate
Subtask 8.2.3 Monitor Load
Processes

Description

After the data integration solution is deployed, the system operators begin the task of
monitoring the daily processes. For data migration and consolidation solutions, this
includes monitoring the processes that load the database. For presentation layers and
metadata management reporting solutions, this includes monitoring the processes that
create the end-user reports. This monitoring is necessary to ensure that the system is
operating at peak efficiency. It is important to ensure that any processes that stop, are
delayed, or simply fail to run are noticed and appropriate steps are taken.

It is important to recognize that in data migration and consolidation solutions, processing time may increase as the system matures, new data sources are added, and existing sources grow. For data visualization and metadata management reporting solutions, processing time can increase as the system matures, more users access the system, and reports are run more frequently. If the processes are not monitored, they may cause problems as the daily load processing begins to overlap the system's user availability window. Therefore, the system operator needs to monitor and report on processing times as well as data volumes.

Prerequisites
None

Roles

Presentation Layer Developer (Secondary)

System Operator (Primary)

Considerations

Data Analyzer with Repository and Administration Reports installed can provide
information about session run details, average loading times, and server load trends by
day. Administrative and operational dashboards can display all vital metrics needing to
be monitored. They can also provide the project management team with a high-level understanding of the health of the analytic support system.

Large installations may already have monitoring software in place that can be adapted to monitor the load processes of the analytic solution. This software typically includes both visual monitors for the System Operator's client desktop and electronic alerts that can be programmed to contact various project team members.
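Where no enterprise monitoring tool is in place, a simple threshold alert can cover the basics. The sketch below emails the operations distribution list when a load exceeds its expected duration; the SMTP host, addresses, workflow names, and thresholds are placeholders:

# Minimal sketch: alert when a load runs longer than its expected duration.
# SMTP host, addresses, and thresholds below are placeholders.
import smtplib
from email.message import EmailMessage

THRESHOLD_MINUTES = {"wf_daily_sales_load": 90, "wf_customer_dim_load": 30}

def alert_if_slow(workflow: str, duration_minutes: float) -> None:
    limit = THRESHOLD_MINUTES.get(workflow)
    if limit is None or duration_minutes <= limit:
        return
    msg = EmailMessage()
    msg["Subject"] = f"Load alert: {workflow} ran {duration_minutes:.0f} min (limit {limit})"
    msg["From"] = "etl-monitor@example.com"
    msg["To"] = "dw-operations@example.com"
    msg.set_content("Review the session logs and the Operations Manual escalation steps.")
    with smtplib.SMTP("mail.example.com") as smtp:
        smtp.send_message(msg)

# Hypothetical usage after a run completes:
# alert_if_slow("wf_daily_sales_load", duration_minutes=127)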

Best Practices
None

Sample Deliverables
None

Last updated: 01-Feb-07 18:50



Phase 8: Operate
Subtask 8.2.4 Track
Change Control Requests

Description

The process of tracking change control requests is integral to the Operate Phase. It is
here that any production issues are documented and resolved. The change control
process allows the project team to prioritize the problems and create schedules for their
resolution and eventual promotion into the production environment.

Prerequisites
None

Roles

Business Project Manager (Primary)

Project Sponsor (Primary)

Considerations

Ideally, a change control process was implemented during the Architect Phase,
enabling the developers to follow a well-established process during the Operate
Phase. Many companies rely on a Configuration Control Board to prioritize and
approve work for the various maintenance releases.

The Change Control Procedure document, created in conjunction with the change control procedures in the Architect Phase, should describe precisely how the project team is going to identify and resolve problems that come to light during system development or operation.

Most companies use a Change Request Form to kick off the change control procedure. These forms should include the following:

● The individual or department requesting the change.
● A clear description of the change requested.
● The problem or issue that the requested change addresses.
● The priority level of the change requested.
● The expected release date.
● An estimate of the development time.
● The impact of the requested change on project(s) in development, if any.
● A Resolutions section to be filled in after the Change Request is resolved, specifying whether the change was implemented, in what release, and by whom.

This type of change control documentation can be invaluable if questions subsequently arise as to why a system operates the way that it does, or why it doesn't function like an earlier version.
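Teams that track change requests electronically often mirror the form fields above in a small record structure so that requests can be filtered and reported on. A minimal sketch, with field names chosen here purely for illustration:

# Minimal sketch of a change request record mirroring the form fields above.
from dataclasses import dataclass
from typing import Optional

@dataclass
class ChangeRequest:
    requested_by: str                 # individual or department requesting the change
    description: str                  # clear description of the change requested
    problem_addressed: str            # problem or issue the change addresses
    priority: str                     # e.g. "High", "Medium", "Low"
    expected_release: str             # target release date or release label
    estimated_dev_days: float         # development time estimate
    impact_on_projects: str = ""      # impact on projects in development, if any
    resolution: Optional[str] = None  # filled in once resolved
    implemented_in_release: Optional[str] = None
    resolved_by: Optional[str] = None

def open_requests(requests):
    """Return requests that have not yet been resolved, highest priority first."""
    order = {"High": 0, "Medium": 1, "Low": 2}
    return sorted((r for r in requests if r.resolution is None),
                  key=lambda r: order.get(r.priority, 99))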

Best Practices
None

Sample Deliverables

Change Request Form

Last updated: 01-Feb-07 18:50



Phase 8: Operate
Subtask 8.2.5 Monitor
Usage

Description

One of the most important aspects of the Operate Phase is monitoring how and when
the organization's end users use the data integration solution. This subtask enables the
project team to gauge what information is the most useful, how often it is retrieved, and
what type of user generally requests it. All of this information can then be used to
gauge the system's return on investment and to plan future enhancements.

Monitoring the use of the presentation layer during User Acceptance Testing can
indicate bottlenecks. When the project is complete, Operations continues to monitor the
tasks to maintain system performance. The monitoring results can be used to plan for
changes in hardware and/or network facilities to support increased requests to the
presentation layer. For example, new requirements may be determined by the number
of users requesting a particular report or by requests for more or different information in
the report. These requirements may trigger changes in hardware capabilities and/or
network bandwidth.

Prerequisites
None

Roles

Business Project Manager (Primary)

Data Warehouse Administrator (Secondary)

Database Administrator (DBA) (Primary)

Production Supervisor (Primary)

Project Sponsor (Review Only)



Repository Administrator (Review Only)

System Administrator (Approve)

System Operator (Review Only)

Considerations

Most business organizations have tools in place to monitor the use of their production
systems. Some end-user reporting tools have built-in reports for such purposes. The
project team should review the available tools, as well as software that may be bundled
with the RDBMS, and determine which tools best suit the project's monitoring needs.
Informatica provides tools and metadata sources that meet the need for monitoring information from the presentation layer, as well as metadata on the processes used to provide the presentation layer with data. This information can be extracted using Informatica tools to provide a complete view of information presentation usage.

Best Practices
None

Sample Deliverables
None

Last updated: 01-Feb-07 18:50



Phase 8: Operate
Subtask 8.2.6 Monitor Data
Quality

Description

This subtask is concerned with data quality processes that may have been scoped into
the project for late-project or post-project use. Such processes are an optional
deliverable for most projects. However, there is a strong argument for building into the
project plan data quality initiatives that will outlast the project: ongoing monitoring should be considered a key deliverable because it provides a means to monitor the existing data and to ensure that previously identified data quality issues do not recur. For new data entering the system, monitoring provides a means to ensure that any new feeds do not compromise the integrity of the existing data. Moreover, the processes created for the Data Quality Audit task in the Analyze Phase may still be suitable for application to the data in the Operate Phase, either as-is or with a reasonable amount of tuning.

Three types of data quality processes are relevant in this context:

● Processes that can be scheduled to monitor data quality on an ongoing basis
● Processes that can address or repair any data quality issues discovered
● Processes that can run at the point of data entry to prevent bad data from entering the system

This subtask is concerned with agreeing to a strategy to use any or all such processes
to validate the continuing quality of the business’ data and to safeguard against lapses
in data quality in the future.

Prerequisites
None

Roles

Data Steward/Data Quality Steward (Primary)



Production Supervisor (Secondary)

Considerations

Ongoing data quality initiatives bring the data quality process full-circle. This subtask is
the logical conclusion to a process that began with the performance of a Data Quality
Audit in the Analyze Phase and the creation of data quality processes (called plans in
Informatica Data Quality terminology) in the Design Phase.

The plans created during and after the Operate Phase are likely to be runtime or real-
time plans. A runtime plan is one that can be scheduled for automated, regular
execution (e.g., nightly or weekly). A real-time plan is one that can accept a live data
feed, for example, from a third-party application, and write output data back to a live
application.

Real-time plans are useful in data entry scenarios; they can be used to capture data
problems at the point of keyboard entry and thus before they are saved to the data
system. The real-time plan can be used to check data entries, pass them if accurate,
cleanse them of error, or reject them as unusable.

Runtime plans can be used to monitor the data stored to the system; these plans can
be run during periods of relative inactivity (e.g., weekends). For example, the Data
Quality Developer may design a plan to identify duplicate records in the system, and
the Developer or the system administrator can schedule the plan to run overnight. Any
duplication issues found in the system can be addressed manually or by other data
quality plans.

The Data Quality Developer must discuss the importance of ongoing data quality
management with the business early in the project, so that the business can decide
what data quality management steps to take within the project or outside of it.

The Data Quality Developer must also consider the impact that ongoing data quality
initiatives are likely to have on the business systems. Should the data quality plans be
deployed to several locations or centralized? Will the reference data be updated at
regular intervals and by whom? Can plan resource files be moved easily across the
enterprise? Once the project resources are unwound, these matters require a
committed strategy from the business. However, the results — clean, complete,
compliant data — are well worth it.

Best Practices



None

Sample Deliverables
None

Last updated: 01-Feb-07 18:50



Phase 8: Operate
Task 8.3 Maintain and
Upgrade Environment

Description

The goal in this task is to develop and implement an upgrade procedure to facilitate
upgrading the hardware, software, and/or network hardware that supports the overall
analytic solution. This plan should enable both the development and operations staff to
plan for and execute system upgrades in an efficient, timely manner, with as little
impact on the system's end users as possible. The deployed system incorporates
multiple components, many of which are likely to undergo upgrades during the system's
lifetime. Ideally, upgrading system components should be treated as a system change
and as such, use many of the techniques discussed in 8.2.4 Track Change Control
Requests. After these changes are prioritized and authorized by the Project Manager,
an upgrade plan should be developed and executed. This plan should include the tasks
necessary to perform the upgrades as well as the tasks necessary to update system
documentation and the Operations Manual, when appropriate.

Prerequisites
None

Roles

Database Administrator (DBA) (Primary)

Repository Administrator (Primary)

System Administrator (Secondary)

Considerations

Once the Build Phase has been completed, the development and operations staff
should begin determining how upgrades should be carried out. The team should
consider all aspects of the systems' architecture including any software and hardware
being used. Special attention should be paid to software release schedules, hardware
limitations, network limitations, and vendor release support schedules. This information will give the team an idea of how often and when various upgrades are likely to be
required. When combined with knowledge of the data load windows, this will allow the
operations team to schedule upgrades without adversely affecting the end users.
Upgrading the Informatica software has some special implications. Many times, the
software upgrade requires a repository upgrade as well. Thus, the operations team
should factor in the time required to back up the repository, along with the time to
perform the upgrade itself. In addition, the development staff should be involved in
order to ensure that all current sessions are running as designed after the upgrade
occurs.

Best Practices
None

Sample Deliverables
None

Last updated: 01-Feb-07 18:50



Phase 8: Operate
Subtask 8.3.1 Maintain
Repository

Description

A key operational aspect of maintaining PowerCenter repositories involves creating and implementing backup policies. These backups become invaluable if some catastrophic event occurs that requires the repository to be restored. Another key operational aspect is monitoring the size and growth of these repository databases, since daily use of these applications adds metadata to the repositories.

The Administration Console manages Repository Services and repository content, including backup and restoration. The following repository-related functions can be performed through the Administration Console:

● Enable or disable a Repository Service or service process.
● Alter the operating mode of a Repository Service.
● Create and delete repository content.
● Back up, copy, restore, or delete a repository.
● Promote a local repository to a global repository.
● Register and unregister a local repository.
● Manage user connections and locks.
● Send repository notification messages.
● Manage repository plug-ins.
● Upgrade a repository and upgrade a Repository Server to a Repository Service.

Additional information about upgrades is available in the "Upgrading PowerCenter" chapter of the PowerCenter Installation and Configuration Guide.

Prerequisites
None

Roles



Database Administrator (DBA) (Secondary)

Repository Administrator (Primary)

System Administrator (Secondary)

Considerations

Enabling and Disabling the Repository Service

A service process starts on a designated node when a Repository Service is enabled. PowerCenter's High Availability (HA) feature enables a service to fail over to another node if the original node becomes unavailable. Administrative duties can be performed through the Administration Console only when the Repository Service is enabled.

Exclusive Mode

The Repository Service executes in normal or exclusive mode. Running the Repository
Service in exclusive mode allows only one user to access the repository through the
Administrative Console or pmrep command line program.

It is advisable to set the Repository Service mode to exclusive when performing administrative tasks that require configuration updates, such as deleting repository content, enabling version control, repository promotion, plug-in registration, or repository upgrades.

Running in exclusive mode requires full privileges and permissions on the Repository Service. Precautions to take before switching to exclusive mode include notifying users of the intended switch and verifying that they have disconnected. The Repository Service must be stopped and restarted to complete the mode switch.

Repository Backup

Although PowerCenter database tables may be included in Database Administration backup procedures, PowerCenter repository backup procedures and schedules are established to prevent data loss due to hardware, software, or user mishaps.

The Repository Service provides backup processing for repositories through the Administration Console or the pmrep command line program. The Repository Service backup function saves repository objects, connection information, and code page information in a file stored on the server in the backup location.

PowerCenter backup scheduling should account for repository change frequency. Because development repositories typically change more frequently than production repositories, it may be desirable to back up the development repository nightly during heavy development efforts. Production repositories, on the other hand, may only need backup processing after development promotions are registered. Preserve the repository name and backup date as part of the backup file name and, as new backup files are added, delete the older ones.

TIP
A simple approach to automating PowerCenter repository backups is to use the
pmrep command line program. Commands can be packaged and scheduled so
that backups occur on a desired schedule without manual intervention. The
backup file name should minimally include repository name and backup date
(yyyymmdd).
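A scripted version of this approach might look like the following sketch. It assumes pmrep is on the PATH and that the connect and backup options shown match the installed release; the repository, domain, credential, and directory values are placeholders, and the exact pmrep arguments should be confirmed against the PowerCenter Command Line Reference.

# Minimal sketch: date-stamped repository backup via the pmrep command line program.
# Repository, domain, and credential values are placeholders; confirm the exact
# pmrep options against the Command Line Reference for your PowerCenter release.
import datetime
import subprocess

def backup_repository(repo: str, domain: str, user: str, password: str,
                      backup_dir: str = "/opt/infa/backups") -> str:
    stamp = datetime.date.today().strftime("%Y%m%d")
    backup_file = f"{backup_dir}/{repo}_{stamp}.rep"
    # Connect to the repository, then back it up to a date-stamped file.
    subprocess.run(["pmrep", "connect", "-r", repo, "-d", domain,
                    "-n", user, "-x", password], check=True)
    subprocess.run(["pmrep", "backup", "-o", backup_file], check=True)
    return backup_file

if __name__ == "__main__":
    print("Backup written to", backup_repository("DEV_REPO", "Domain_Dev",
                                                 "repo_admin", "change_me"))

Scheduled nightly for development repositories, and after each promotion for production repositories, this matches the backup frequency suggested above.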

A repository backup file is invaluable for reference when, as occasionally happens, questions arise as to the integrity of the repository or users encounter problems using it. A backup file enables technical support staff to validate repository integrity to, for example, eliminate the repository as a source of user problems. In addition, if the development or production repository is corrupted, the backup repository can be used to recover quickly.

TIP
Keep in mind that you cannot restore a single folder or mapping from a
repository backup. If, for example, a single important mapping is deleted by
accident, you need to obtain a temporary database space from the DBA in order
to restore the backup to a temporary repository DB. With the PowerCenter client
tools, copy the lost metadata, and then remove the temporary repository from
the database and the cache.

If the developers need this service often, it may be prudent to keep the
temporary database around all the time and copy over the development
repository to the backup repository on a daily basis in addition to backing up to a
file. Only the DBA should have access to the backup repository and requests
should be made through him/her.



Repository Performance

Repositories may grow in size due to the execution of workflows, especially in large
projects. As the repository grows, response may become slower. Consider these
techniques to maintain a repository for better performance:

● Delete Old Session/Workflow Log Information. Write a simple SQL script to delete old log information (a hedged sketch follows this list). Assuming that repository backups are taken on a consistent basis, you can always get old log information from the repository backup, if necessary.
● Perform Defragmentation. Much like any other database, repository databases should undergo periodic "housecleaning" through statistics updates and defragmentation. Work with the DBAs to schedule this as a regular job.
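A hedged sketch of the log-cleanup idea from the first bullet is shown below. The table and column names are placeholders only; the actual repository log tables vary by PowerCenter release, so confirm them in the repository documentation and agree on any direct repository SQL with the DBA (and Informatica support) before scheduling it.

# Minimal sketch: purge old run-log rows from a repository database.
# REPO_LOG_TABLES and the date column are PLACEHOLDERS -- confirm the real
# table/column names for your PowerCenter release before running anything.
import datetime

RETENTION_DAYS = 90
REPO_LOG_TABLES = [("PLACEHOLDER_SESSION_LOG", "LOG_DATE"),
                   ("PLACEHOLDER_WORKFLOW_LOG", "LOG_DATE")]

def purge_old_logs(connection) -> None:
    """connection: a DB-API 2.0 connection to the repository database."""
    cutoff = datetime.date.today() - datetime.timedelta(days=RETENTION_DAYS)
    cur = connection.cursor()
    for table, date_col in REPO_LOG_TABLES:
        # Parameter placeholder style (%s vs. ?) depends on the database driver.
        cur.execute(f"DELETE FROM {table} WHERE {date_col} < %s", (cutoff,))
        print(f"{table}: deleted {cur.rowcount} rows older than {cutoff}")
    connection.commit()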

Audit Trail

The SecurityAuditTrail configuration option in the Repository Service properties in the Administration Console allows tracking of changes to repository users, groups, privileges, and permissions. Enabling the audit trail causes the Repository Service to record security changes to the Repository Service log. Logged security changes include changes to an object's owner, owner's group, or folder permissions; password changes for another user; user maintenance; group maintenance; global object permissions; and privileges.

Best Practices

Disaster Recovery Planning with PowerCenter HA Option

Sample Deliverables
None

Last updated: 04-Dec-07 18:21



Phase 8: Operate
Subtask 8.3.2 Upgrade Software

Description

Upgrading the application software of a data integration solution to a new release is a continuous operations task as
new releases are offered periodically by every software vendor. New software releases offer expanded functionality,
new capabilities, and fixes to existing functionality that can benefit the data integration environment and future
integration work. However, an upgrade can be a disruptive event since project work may halt while the upgrade
process is in progress.

Data integration environments often contain a host of different applications, including Informatica software, database systems, operating systems, EAI tools, BI tools, and other related technologies; an upgrade to any one of these technologies may require an upgrade of any number of other software programs for the full system to function properly. System architects and administrators must continually evaluate the new software offerings across the various products in their data integration environment and balance the desire to upgrade with the impact of an upgrade.

Software upgrades require a continuous assessment and planning process. A regular schedule should be defined
where new releases are evaluated on functionality and need in the environment. Once approved, upgrades must be
coordinated with on-going development work and on-going production data integration. Appropriate planning and
coordination of software upgrades allow a data integration environment to stay current on its technology stack with
minimal disruptions to production data integration efforts and development projects.

Prerequisites
None

Roles

Database Administrator (DBA) (Secondary)

Repository Administrator (Primary)

System Administrator (Secondary)

Considerations

When faced with a new software release, the first consideration is to decide whether the upgrade is appropriate for
the data integration environment. The pros and cons of every upgrade decision typically include the following:

Pros:

● New functionality and features
● Bug fixes and refinements of existing functionality
● Often provides enhanced performance
● Support for older releases of software is dropped, forcing an upgrade to maintain support
● May be required to support newer releases of other software in the environment

Cons:

● Disruptive to the development environment
● Disruptive to the production environment
● May require new training and adversely affect productivity
● May require other pieces of software to be upgraded to function properly



The upgrade decision can be to:

● Upgrade to the latest software release immediately.
● Upgrade at some time in the future.
● Do not upgrade to this software version at all.

Architects sometimes decide to forgo a particular software version and skip ahead to the future releases if the
current release does not provide enough benefit to warrant the disruption to the environment. It is not uncommon
for data integration teams to skip minor releases (and sometimes even major releases) if they aren’t appropriate for
their environment or when the upgrade effort outweighs the benefits.

Whether you are in a production environment or still in development mode, an upgrade requires careful planning to
ensure a successful transition and minimal disruption. The following issues need to be factored into the overall
upgrade plan:

● Training - New releases of software often include new features and functionality that are likely to require
some level of training for administrators and developers. Proper planning of the necessary training can
ensure that employees are trained ahead of the upgrade so that productivity does not suffer once the new
software is in place. Because it is impossible to properly estimate and plan the upgrade effort if you do not
have knowledge of the new features and potential environment changes, best practice dictates training a
core set of architects and system administrators early in the upgrade process so they can assist in the
upgrade planning process.
● Environment Assessment - A future release of software may range from minimal architectural changes to
major changes in the overall data integration architecture. Investigation and strategy around potential
architecture changes should occur early. In PowerCenter, for example, as the architecture has moved to a Service-Oriented Architecture with high availability and failover, the underlying physical setup and location
of software components has changed from release to release. Planning for these architecture changes
allows users to take full advantage of the new features when the software upgrade is deployed. Often
these changes provide an opportunity to redesign and improve the existing architecture in coordination of
the software upgrade.
● Testing - Often more than 60 percent of the total upgrade time is devoted to testing the data integration
environment with the new software release. Ensuring that data continues to flow correctly, software
versions are compatible, and new features do not cause unexpected results requires detailed
testing. Developing a well thought-out test plan is crucial to a successful upgrade.
● New Features - A new software release likely includes new and expanded features that may create a need
to alter the current data integration processes. During the upgrade process, existing processes may be
altered to incorporate and implement the new features. Time is required to make and test these changes
as well. Reviewing the new features and assessing the impact on the upgrade process is a key pre-
planning step.
● Sandbox Upgrade - In environments with production systems, it is advisable to copy the production
environment to a ‘sandbox’ instance. The ‘sandbox’ environment should be as close to an exact copy of
production as possible, including production data. A software upgrade is then performed on the sandbox
instance, and data integration processes are run on both the current production and the sandbox instance for a
period of time. In this way, results can be compared over time to ensure that no unforeseen differences
occur in the new software version. If differences do occur, they can be investigated, resolved,
and accounted for in the final upgrade plan (a sample command sequence for running the comparison workflows follows this list).
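
The comparison runs between the current production environment and the upgraded sandbox can be scripted so that exactly the same workflow is executed against both. The commands below are a minimal sketch only: the Integration Service, domain, user, folder, and workflow names are placeholders, and the pmcmd flags shown should be verified against the command reference for the installed PowerCenter version.

REM Run the nightly load against the current production Integration Service
pmcmd startworkflow -sv IS_PROD -d Domain_Main -u admin -p admin_pwd -f MARKETING_PROD -wait wf_daily_load

REM Run the same workflow against the upgraded sandbox Integration Service
pmcmd startworkflow -sv IS_SANDBOX -d Domain_Main -u admin -p admin_pwd -f MARKETING_PROD -wait wf_daily_load

Because the -wait option blocks until each workflow completes, the two runs can be followed immediately by whatever row-count or data-comparison checks the test plan calls for.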

Once a comprehensive plan for the upgrade is in place, the time comes to perform the actual upgrade on the
development, test, and production environments. The Installation Guides for each of the Informatica products and
online help provide instructions on upgrading and the step-by-step process for applying the new version of the
software. However, there are a few important steps to emphasize in the upgrade process:



● Make a copy of the current database instance housing the repository prior to any upgrade.
● In addition to the copy, ALWAYS make multiple backups of the current version of the repository before
attempting the upgrade. Upgrades have been known to fail in production environments, leaving partially
upgraded repositories unusable. The only recourse at that point is to restore from the backup. The
backups created using the Repository Manager are reliable and can be used to successfully restore the
original repository. Restoring from a backup may be slower than restoring from the database copy, but it
provides a fail-safe insurance policy (see the example following this list).
● Always remove all repository locks through Repository Manager before attempting an upgrade.
● Carefully monitor the upgraded systems for a period of time after the upgrade to ensure the success of the
upgrade.
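
Repository backups can be scripted with pmrep so that they are taken consistently before every upgrade attempt. The commands below are a hedged sketch: the repository, domain, user, and file names are placeholders, and the flags follow the PowerCenter 8.x pmrep reference and should be confirmed for the version being upgraded.

REM Connect to the repository that is about to be upgraded
pmrep Connect -r INFAPROD -d Domain_Main -n admin -x admin_pwd

REM Write a backup file that can be restored if the upgrade fails
pmrep Backup -o c:\infa_backups\INFAPROD_pre_upgrade.rep

The resulting backup file serves as the fail-safe referred to above if a partially upgraded repository has to be discarded.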

A well-planned upgrade process is key to ensuring success during the transition from the current version to a new
version, with minimal disruption to the development and production environments. A smooth upgrade process
enables data integration teams to take advantage of the latest technologies and advances in data integration.

Best Practices
None

Sample Deliverables
None

Last updated: 01-Feb-07 18:51



Best Practices

● Configuration Management and Security
❍ Data Analyzer Security
❍ Database Sizing
❍ Deployment Groups
❍ Migration Procedures - PowerCenter
❍ Migration Procedures - PowerExchange
❍ Running Sessions in Recovery Mode
❍ Using PowerCenter Labels
● Data Quality and Profiling
❍ Build Data Audit/Balancing Processes
❍ Data Cleansing
❍ Data Profiling
❍ Data Quality Mapping Rules
❍ Effective Data Matching Techniques
❍ Effective Data Standardizing Techniques
❍ Integrating Data Quality Plans with PowerCenter
❍ Managing Internal and External Reference Data
❍ Real-Time Matching Using PowerCenter
❍ Testing Data Quality Plans
❍ Tuning Data Quality Plans
❍ Using Data Explorer for Data Discovery and Analysis
❍ Working with Pre-Built Plans in Data Cleanse and Match
● Development Techniques
❍ Designing Data Integration Architectures
❍ Development FAQs
❍ Event Based Scheduling



❍ Key Management in Data Warehousing Solutions
❍ Mapping Auto-Generation
❍ Mapping Design
❍ Mapping Templates
❍ Naming Conventions
❍ Naming Conventions - Data Quality
❍ Performing Incremental Loads
❍ Real-Time Integration with PowerCenter
❍ Session and Data Partitioning
❍ Using Parameters, Variables and Parameter Files
● Error Handling
❍ Error Handling Process
❍ Error Handling Strategies - Data Warehousing
❍ Error Handling Strategies - General
❍ Error Handling Techniques - PowerCenter Mappings
❍ Error Handling Techniques - PowerCenter Workflows and Data Analyzer
● Metadata and Object Management
❍ Creating Inventories of Reusable Objects & Mappings
❍ Metadata Reporting and Sharing
❍ Repository Tables & Metadata Management
❍ Using Metadata Extensions
● Operations
❍ Daily Operations
❍ Third Party Scheduler
● Performance and Tuning
❍ Determining Bottlenecks
❍ Performance Tuning Databases (Oracle)
❍ Performance Tuning Databases (SQL Server)
❍ Performance Tuning Databases (Teradata)
❍ Performance Tuning in a Real-Time Environment



❍ Performance Tuning UNIX Systems
❍ Performance Tuning Windows 2000/2003 Systems
❍ Recommended Performance Tuning Procedures
❍ Tuning and Configuring Data Analyzer and Data Analyzer Reports
❍ Tuning Mappings for Better Performance
❍ Tuning Sessions for Better Performance
❍ Tuning SQL Overrides and Environment for Better Performance
● PowerCenter Configuration
❍ Advanced Client Configuration Options
❍ Advanced Server Configuration Options
❍ Organizing and Maintaining Parameter Files & Variables
❍ Platform Sizing
● PowerExchange Configuration
❍ PowerExchange for Oracle CDC
❍ PowerExchange for SQL Server CDC
❍ PowerExchange Installation (for Mainframe)
● Project Management
❍ Assessing the Business Case
❍ Defining and Prioritizing Requirements
❍ Developing a Work Breakdown Structure (WBS)
❍ Developing and Maintaining the Project Plan
❍ Developing the Business Case
❍ Managing the Project Lifecycle
❍ Using Interviews to Determine Corporate Data Integration Requirements



Data Analyzer Security

Challenge

Using Data Analyzer's sophisticated security architecture to establish a robust security system that
safeguards valuable business information across a range of technologies and security models. Ensuring
that Data Analyzer security provides appropriate mechanisms to support and augment the security
infrastructure of a Business Intelligence environment at every level.

Description

Four main architectural layers must be completely secure: user layer, transmission layer, application
layer and data layer.

User Layer

Users must be authenticated and authorized to access data. Data Analyzer integrates seamlessly with the
following LDAP-compliant directory servers:

● SunOne/iPlanet Directory Server 4.1
● Sun Java System Directory Server 5.2
● Novell eDirectory Server 8.7
● IBM SecureWay Directory 3.2
● IBM SecureWay Directory 4.1
● IBM Tivoli Directory Server 5.2
● Microsoft Active Directory 2000
● Microsoft Active Directory 2003

In addition to the directory server, Data Analyzer supports Netegrity SiteMinder for centralizing
authentication and access control for the various web applications in the organization.

Transmission Layer

The data transmission must be secure and hacker-proof. Data Analyzer supports the standard security
protocol Secure Sockets Layer (SSL) to provide a secure environment.

Application Layer

Only appropriate application functionality should be provided to users with associated privileges. Data
Analyzer provides three basic types of application-level security:

● Report, Folder and Dashboard Security. Restricts access for users or groups to specific
reports, folders, and/or dashboards.
● Column-level Security. Restricts users and groups to particular metric and attribute columns.
● Row-level Security. Restricts users to specific attribute values within an attribute column of a
table.

Components for Managing Application Layer Security

Data Analyzer users can perform a variety of tasks based on the privileges that you grant them. Data
Analyzer provides the following components for managing application layer security:

● Roles. A role can consist of one or more privileges. You can use system roles or create custom
roles. You can grant roles to groups and/or individual users. When you edit a custom role, all



groups and users with the role automatically inherit the change.
● Groups. A group can consist of users and/or groups. You can assign one or more roles to a
group. Groups are created to organize logical sets of users and roles. After you create groups,
you can assign users to the groups. You can also assign groups to other groups to organize
privileges for related users. When you edit a group, all users and groups within the edited group
inherit the change.
● Users. A user has a user name and password. Each person accessing Data Analyzer must have
a unique user name. To set the tasks a user can perform, you can assign roles to the user or
assign the user to a group with predefined roles.

Types of Roles

● System roles - Data Analyzer provides a set of roles when the repository is created. Each role
has sets of privileges assigned to it.
● Custom roles - The end user can create and assign privileges to these roles.

Managing Groups

Groups allow you to classify users according to a particular function. You may organize users into groups
based on their departments or management level. When you assign roles to a group, you grant the same
privileges to all members of the group. When you change the roles assigned to a group, all users in the
group inherit the changes. If a user belongs to more than one group, the user has the privileges from all
groups. To organize related users into related groups, you can create group hierarchies. With hierarchical
groups, each subgroup automatically receives the roles assigned to the group it belongs to. When you
edit a group, all subgroups contained within it inherit the changes.

For example, you may create a Lead group and assign it the Advanced Consumer role. Within the Lead
group, you create a Manager group with a custom role Manage Data Analyzer. Because the Manager
group is a subgroup of the Lead group, it has both the Manage Data Analyzer and Advanced Consumer
role privileges.

Belonging to multiple groups has an inclusive effect. For example, if group 1 has access to something but
group 2 is excluded from that object, a user belonging to both groups 1 and 2 will have access to the
object.



Preventing Data Analyzer from Updating Group Information

If you use Windows Domain or LDAP authentication, you typically maintain users and groups in the
Windows Domain or LDAP directory service rather than in Data Analyzer. However, some organizations keep
only user accounts in the directory service and set up groups in Data Analyzer to organize the Data Analyzer
users. Data Analyzer provides a way for you to keep user accounts in the authentication server and still keep
the groups in Data Analyzer.

Ordinarily, when Data Analyzer synchronizes the repository with the Windows Domain or LDAP directory
service, it updates the users and groups in the repository and deletes users and groups that are not found
in the Windows Domain or LDAP directory service.

To prevent Data Analyzer from deleting or updating groups in the repository, you can set a property in the
web.xml file so that Data Analyzer updates only user accounts, not groups. You can then create and
manage groups in Data Analyzer for users in the Windows Domain or LDAP directory service.

The web.xml file is stored in the Data Analyzer EAR file. To access the files in the Data Analyzer EAR
file, use the EAR Repackager utility provided with Data Analyzer.

Note: Be sure to back up the web.xml file before you modify it.

To prevent Data Analyzer from updating group information in the repository:

1. In the directory where you extracted the Data Analyzer EAR file, locate the web.xml file in the
following directory:

/custom/properties

2. Open the web.xml file with a text editor and locate the line containing the following property:

enableGroupSynchronization

The enableGroupSynchronization property determines whether Data Analyzer updates the groups
in the repository.



3. To prevent Data Analyzer from updating group information in the Data Analyzer repository,
change the value of the enableGroupSynchronization property to false:

<init-param>
  <param-name>InfSchedulerStartup.com.informatica.ias.scheduler.enableGroupSynchronization</param-name>
  <param-value>false</param-value>
</init-param>

When the value of enableGroupSynchronization property is false, Data Analyzer does not
synchronize the groups in the repository with the groups in the Windows Domain or LDAP
directory service.

4. Save the web.xml file and add it back to the Data Analyzer EAR file.

5. Restart Data Analyzer.

When the enableGroupSynchronization property in the web.xml file is set to false, Data Analyzer
updates only the user accounts in Data Analyzer the next time it synchronizes with the Windows
Domain or LDAP authentication server. You must create and manage groups, and assign users to
groups in Data Analyzer.

Managing Users

Each user must have a unique user name to access Data Analyzer. To perform Data Analyzer tasks, a
user must have the appropriate privileges. You can assign privileges to a user with roles or groups.

Data Analyzer creates a System Administrator user account when you create the repository. The default
user name for the System Administrator user account is admin. The system daemon, ias_scheduler/
padaemon, runs the updates for all time-based schedules. System daemons must have a unique user
name and password in order to perform Data Analyzer system functions and tasks. You can change the
password for a system daemon, but you cannot change the system daemon user name via the GUI. Data
Analyzer permanently assigns the daemon role to system daemons. You cannot assign new roles to
system daemons or assign them to groups.

To change the password for a system daemon, complete the following steps:

1. Change the password on the Administration tab in Data Analyzer.
2. Change the password in the web.xml file in the Data Analyzer folder.
3. Restart Data Analyzer.

Access LDAP Directory Contacts



To access contacts in the LDAP directory service, you can add the LDAP server on the LDAP Settings
page. After you set up the connection to the LDAP directory service, users can email reports and shared
documents to LDAP directory contacts.

When you add an LDAP server, you must provide a value for the BaseDN (distinguished name) property.
In the BaseDN property, enter the Base DN entries for your LDAP directory. The Base distinguished
name entries define the type of information that is stored in the LDAP directory. If you do not know the
value for BaseDN, contact your LDAP system administrator.
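
As a purely illustrative example, a directory that stores entries under a corporate domain might use a BaseDN such as dc=mycompany,dc=com, possibly qualified with an organizational unit such as ou=people,dc=mycompany,dc=com; the actual value depends entirely on how the LDAP directory tree was defined.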

Customizing User Access

You can customize Data Analyzer user access with the following security options:

● Access permissions. Restrict user and/or group access to folders, reports, dashboards,
attributes, metrics, template dimensions, or schedules. Use access permissions to restrict access
to a particular folder or object in the repository.
● Data restrictions. Restrict user and/or group access to information in fact and dimension tables
and operational schemas. Use data restrictions to prevent certain users or groups from
accessing specific values when they create reports.
● Password restrictions. Restrict users from changing their passwords. Use password restrictions
when you do not want users to alter their passwords.

When you create an object in the repository, every user has default read and write permissions for that
object. By customizing access permissions for an object, you determine which users and/or groups can
read, write, delete, or change access permissions for that object.

When you set data restrictions, you determine which users and groups can view particular attribute
values. If a user with a data restriction runs a report, Data Analyzer does not display the restricted data to
that user.

Types of Access Permissions

Access permissions determine the tasks that you can perform for a specific repository object. When you
set access permissions, you determine which users and groups have access to the folders and repository
objects. You can assign the following types of access permissions to repository objects:

● Read. Allows you to view a folder or object.


● Write. Allows you to edit an object. Also allows you to create and edit folders and objects within a
folder.
● Delete. Allows you to delete a folder or an object from the repository.
● Change permission. Allows you to change the access permissions on a folder or object.

By default, Data Analyzer grants read and write access permissions to every user in the repository. You
can use the General Permissions area to modify default access permissions for an object, or turn off
default access permissions.



Data Restrictions

You can restrict access to data based on the values of related attributes. Data restrictions are set to keep
sensitive data from appearing in reports. For example, you may want to restrict data related to the
performance of a new store from outside vendors. You can set a data restriction that excludes the store
ID from their reports.

You can set data restrictions using one of the following methods:

● Set data restrictions by object. Restrict access to attribute values in a fact table, operational
schema, real-time connector, and real-time message stream. You can apply the data restriction
to users and groups in the repository. Use this method to apply the same data restrictions to
more than one user or group.
● Set data restrictions for one user at a time. Edit a user account or group to restrict user or
group access to specified data. You can set one or more data restrictions for each user or group.
Use this method to set custom data restrictions for different users or groups

Types of Data Restrictions

You can set two kinds of data restrictions:

● Inclusive. Use the IN option to allow users to access data related to the attributes you select. For
example, to allow users to view only data from the year 2001, create an “IN 2001” rule.
● Exclusive. Use the NOT IN option to restrict users from accessing data related to the attributes
you select. For example, to allow users to view all data except from the year 2001, create a “NOT
IN 2001” rule.

Restricting Data Access by User or Group

You can edit a user or group profile to restrict the data the user or group can access in reports. When you
edit a user profile, you can set data restrictions for any schema in the repository, including operational
schemas and fact tables.

You can set a data restriction to limit user or group access to data in a single schema based on the
attributes you select. If the attributes apply to more than one schema in the repository, you can also
restrict the user or group access from related data across all schemas in the repository. For example, you
may have a Sales fact table and Salary fact table. Both tables use the Region attribute. You can set one
data restriction that applies to both the Sales and Salary fact tables based on the region you select.

To set data restrictions for a user or group, you need the following role or privilege:

● System Administrator role
● Access Management privilege

When Data Analyzer runs scheduled reports that have provider-based security, it runs reports against the
data restrictions for the report owner. However, if the reports have consumer-based security, the Data
Analyzer Server creates a separate report for each unique security profile.



The following information applies only to changing the admin user for WebLogic.

To change the Data Analyzer system administrator username on WebLogic 8.1 (DA 8.1):

● Repository authentication. You must use the Update System Accounts utility to change the
system administrator account name in the repository.
● LDAP or Windows Domain Authentication. Set up the new system administrator account in
Windows Domain or LDAP directory service. Then use the Update System Accounts utility to
change the system administrator account name in the repository.

To change the Data Analyzer default users (admin and ias_scheduler/padaemon):

1. Back up the repository.

2. Go to the WebLogic library directory: .\bea\wlserver6.1\lib

3. Open the file ias.jar and locate the file entry called InfChangeSystemUserNames.class

4. Extract the file InfChangeSystemUserNames.class into a temporary directory (example: d:\temp)

5. This extracts the file as 'd:\temp\Repository Utils\Refresh\InfChangeSystemUserNames.class'

6. Create a batch file (change_sys_user.bat) with the following commands in the directory D:\Temp\Repository Utils\Refresh\

REM To change the system user name and password
REM *******************************************
REM Change the BEA home here
REM ************************
set JAVA_HOME=E:\bea\wlserver6.1\jdk131_06
set WL_HOME=E:\bea\wlserver6.1
set CLASSPATH=%WL_HOME%\sql
set CLASSPATH=%CLASSPATH%;%WL_HOME%\lib\jconn2.jar
set CLASSPATH=%CLASSPATH%;%WL_HOME%\lib\classes12.zip
set CLASSPATH=%CLASSPATH%;%WL_HOME%\lib\weblogic.jar
set CLASSPATH=%CLASSPATH%;%WL_HOME%\lib\ias.jar
set CLASSPATH=%CLASSPATH%;%WL_HOME%\lib\ias_securityadapter.jar
set CLASSPATH=%CLASSPATH%;%WL_HOME%\infalicense
REM Change the DB information here and also
REM the -Dias_scheduler and -Dadmin values to values of your choice
REM *************************************************************
%JAVA_HOME%\bin\java -Ddriver=com.informatica.jdbc.sqlserver.SQLServerDriver -Durl=jdbc:informatica:sqlserver://host_name:port;SelectMethod=cursor;DatabaseName=database_name -Duser=userName -Dpassword=userPassword -Dias_scheduler=pa_scheduler -Dadmin=paadmin repositoryutil.refresh.InfChangeSystemUserNames
REM END OF BATCH FILE



7. Make changes in the batch file as directed in the remarks (REM lines).

8. Save the file, open a command prompt window, and navigate to D:\Temp\Repository Utils\Refresh\

9. At the prompt, type change_sys_user.bat and press Enter.

The users "ias_scheduler" and "admin" will be changed to "pa_scheduler" and "paadmin", respectively.

10. Modify web.xml and weblogic.xml (located at .\bea\wlserver6.1\config\informatica\applications\ias\WEB-INF) by replacing ias_scheduler with 'pa_scheduler'.

11. Replace ias_scheduler with pa_scheduler in the xml file weblogic-ejb-jar.xml.

This file is in the iasEjb.jar file located in the directory .\bea\wlserver6.1\config\informatica\applications\

To edit the file, make a copy of the iasEjb.jar:

● mkdir \tmp
● cd \tmp
● jar xvf \bea\wlserver6.1\config\informatica\applications\iasEjb.jar META-INF
● cd META-INF
● Update META-INF/weblogic-ejb-jar.xml, replacing ias_scheduler with pa_scheduler
● cd \
● jar uvf \bea\wlserver6.1\config\informatica\applications\iasEjb.jar -C \tmp .

Note: There is a trailing period at the end of the command above.

12. Restart the server.

Last updated: 04-Jun-08 15:51



Database Sizing

Challenge

Database sizing involves estimating the types and sizes of the components of a data architecture.
This is important for determining the optimal configuration for the database servers in order to
support the operational workloads. Individuals involved in a sizing exercise may be data architects,
database administrators, and/or business analysts.

Description

The first step in database sizing is to review system requirements to define such things as:

● Expected data architecture elements (will there be staging areas? operational data stores?
centralized data warehouse and/or master data? data marts?)

Each additional database element requires more space. This is even more true where data is replicated
across multiple components, such as a data warehouse that also maintains an operational data store. The
same data present in the ODS is also present in the warehouse, albeit in a different format.

● Expected source data volume

It is useful to analyze how each row in the source system translates into the target system. In
most situations the row count in the target system can be calculated by following the data
flows from the source to the target. For example, say a sales order table is being built by
denormalizing a source table. The source table holds sales data for 12 months in a single row
(one column for each month). Each row in the source translates to 12 rows in the target. So a
source table with one million rows ends up as a 12 million row table.

● Data granularity and periodicity

Granularity refers to the lowest level of information that is going to be stored in a fact table.
Granularity affects the size of a database to a great extent, especially for aggregate tables.
The level at which a table has been aggregated increases or decreases a table's row count.
For example, a sales order fact table's size is likely to be greatly affected by whether the
table is being aggregated at a monthly level or at a quarterly level. The granularity of fact
tables is determined by the dimensions linked to that table. The number of dimensions that
are connected to the fact tables affects the granularity of the table and hence the size of the
table.

● Load frequency and method (full refresh? incremental updates?)



Load frequency affects the space requirements for the staging areas. A load plan that
updates a target less frequently is likely to load more data at one go. Therefore, more space
is required by the staging areas. A full refresh requires more space for the same reason.

● Estimated growth rates over time and retained history

Determining Growth Projections

One way to estimate projections of data growth over time is to use scenario analysis. As an example,
for scenario analysis of a sales tracking data mart you can use the number of sales transactions to
be stored as the basis for the sizing estimate. In the first year, 10 million sales transactions are
expected; this equates to 10 million fact-table records.

Next, use the sales growth forecasts for the upcoming years for database growth calculations. That
is, an annual sales growth rate of 10 percent translates into 11 million fact table records for the next
year. At the end of five years, the fact table is likely to contain about 60 million records. You may
want to calculate other estimates based on five-percent annual sales growth (case 1) and 20-percent
annual sales growth (case 2). Multiple projections for best and worst case scenarios can be very
helpful.

Oracle Table Space Prediction Model

Oracle (10g and onwards) provides a mechanism to predict the growth of a database. This feature
can be useful in predicting table space requirements.

Oracle incorporates a table space prediction model in the database engine that provides projected
statistics for space used by a table. The following Oracle 10g query returns projected space usage
statistics:

SELECT *
FROM TABLE(DBMS_SPACE.object_growth_trend('schema','tablename','TABLE'))
ORDER BY timepoint;

The results of this query are shown below:

TIMEPOINT                      SPACE_USAGE SPACE_ALLOC QUALITY
------------------------------ ----------- ----------- ------------
11-APR-04 02.55.14.116000 PM          6372       65536 INTERPOLATED
12-APR-04 02.55.14.116000 PM          6372       65536 INTERPOLATED
13-APR-04 02.55.14.116000 PM          6372       65536 INTERPOLATED
13-MAY-04 02.55.14.116000 PM          6372       65536 PROJECTED
14-MAY-04 02.55.14.116000 PM          6372       65536 PROJECTED
15-MAY-04 02.55.14.116000 PM          6372       65536 PROJECTED
16-MAY-04 02.55.14.116000 PM          6372       65536 PROJECTED
The QUALITY column indicates the quality of the output as follows:



● GOOD - The data for the timepoint relates to data within the AWR repository with a
timestamp within 10 percent of the interval.
● INTERPOLATED - The data for this timepoint did not meet the GOOD criteria but was
based on data gathered before and after the timepoint.
● PROJECTED - The timepoint is in the future, so the data is estimated based on previous
growth statistics.

Baseline Volumetric

Next, use the physical data models for the sources and the target architecture to develop a baseline
sizing estimate. The administration guides for most DBMSs contain sizing guidelines for the various
database structures such as tables, indexes, sort space, data files, log files, and database cache.

Develop a detailed sizing using a worksheet inventory of the tables and indexes from the physical
data model, along with field data types and field sizes. Various database products use different
storage methods for data types. For this reason, be sure to use the database manuals to determine
the size of each data type. Add up the field sizes to determine row size. Then use the data volume
projections to determine the number of rows to multiply by the table size.

The default estimate for index size is to assume the same size as the table. Also estimate the
temporary space for sort operations. For data warehouse applications where summarizations are
common, plan on large temporary spaces; the temporary space can be as much as 1.5 times larger
than the largest table in the database.

Another approach that is sometimes useful is to load the data architecture with representative data
and determine the resulting database sizes. This test load can be a fraction of the actual data and is
used only to gather basic sizing statistics. You then need to apply growth projections to these
statistics. For example, after loading ten thousand sample records to the fact table, you determine
the size to be 10MB. Based on the scenario analysis, you can expect this fact table to contain 60
million records after five years. So, the estimated size for the fact table is about 60GB [i.e., 10 MB *
(60,000,000/10,000)]. Don't forget to add indexes and summary tables to the calculations.

Guesstimating

When there is not enough information to calculate an estimate as described above, use educated
guesses and “rules of thumb” to develop as reasonable an estimate as possible.

● If you don't have the source data model, use what you do know of the source data to
estimate the average field size and the average number of fields in a row to determine table size.
Based on your understanding of transaction volume over time, determine your growth
metrics for each type of data and calculate your source data volume (SDV) from the table
size and growth metrics.

● If your target data architecture is not complete enough to determine table sizes, base
your estimates on multiples of the SDV (a brief worked example follows this list):

❍ If it includes staging areas: add another SDV for any source subject area that you will



stage multiplied by the number of loads you’ll retain in staging.
❍ If you intend to consolidate data into an operational data store, add the SDV multiplied
by the number of loads to be retained in the ODS for historical purposes (e.g.,
keeping one year’s worth of monthly loads = 12 x SDV)
❍ Data warehouse architectures are based on the periodicity and granularity of
the warehouse; this may be another SDV + (.3n x SDV where n = number of time
periods loaded in the warehouse over time)
❍ If your data architecture includes aggregates, add a percentage of the warehouse
volumetrics based on how much of the warehouse data will be aggregated and to what
level (e.g., if the rollup level represents 10 percent of the dimensions at the details
level, use 10 percent).
❍ Similarly, for data marts add a percentage of the data warehouse based on how much
of the warehouse data is moved into the data mart.
❍ Be sure to consider the growth projections over time and the history to be retained in all
of your calculations.
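
As a rough illustration of these rules of thumb, the figures below are purely hypothetical and assume a 100GB source data volume (SDV), three retained staging loads, 12 monthly loads kept in the ODS, and 36 time periods loaded into the warehouse:

Staging: 3 x 100GB = 300GB
ODS: 12 x 100GB = 1,200GB
Warehouse: 100GB + (0.3 x 36 x 100GB) = 1,180GB
Aggregates (10 percent of the warehouse): approximately 118GB
Running total before data marts and fudge-factor: approximately 2,800GB

The goal of such a calculation is not precision but a defensible order-of-magnitude figure that can be refined as the data architecture firms up.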

And finally, remember that there is always much more data than you expect, so you may want to add
a reasonable fudge-factor to the calculations as a margin of safety.

Last updated: 19-Jul-07 14:14



Deployment Groups

Challenge

Selectively migrating objects from one repository folder to another requires a versatile, flexible
mechanism that can overcome such limitations as confinement to a single source folder.

Description

Regulations such as Sarbanes-Oxley (SOX) and HIPAA require tracking, monitoring, and reporting of
changes in information technology systems. Automating change control processes using deployment
groups and pmrep commands provides organizations with a means to comply with regulations for
configuration management of software artifacts in a PowerCenter repository.

Deployment Groups are containers that hold references to objects that need to be
migrated. This includes objects such as mappings, mapplets, reusable transformations,
sources, targets, workflows, sessions and tasks, as well as the object holders (i.e., the
repository folders). Deployment groups are faster and more flexible than folder moves
for incremental changes. In addition, they allow for migration “rollbacks” if necessary.
Migrating a deployment group moves objects from multiple folders in the source repository into
multiple folders in the target repository in a single copy operation. When copying a deployment
group, individual objects can be selected for copying, as opposed to the entire contents of a folder.

There are two types of deployment groups - static and dynamic.

● Static deployment groups contain direct references to versions of objects


that need to be moved. Users explicitly add the version of the object to be
migrated to the deployment group. If the set of deployment objects is not
expected to change between deployments, static deployment groups can be
created.
● Dynamic deployment groups contain a query that is executed at the time of
deployment. The results of the query (i.e., object versions in the repository) are
then selected and copied to the deployment group. If the set of deployment
objects is expected to change frequently between deployments, dynamic
deployment groups should be used.



Dynamic deployment groups are generated from a query. While any available criteria
can be used, it is advisable to have developers use labels to simplify the query. For
more information, refer to the “Strategies for Labels” section of Using PowerCenter
Labels. When generating a query for deployment groups with mappings and mapplets
that contain non-reusable objects, an additional query condition should be used beyond the
specific selection criteria: the query must include a condition for Is Reusable with a qualifier
covering both Reusable and Non-Reusable. Without this qualifier, the deployment may encounter
errors if there are non-reusable objects held within the mapping or mapplet.

A deployment group exists in a specific repository. It can be used to move items to any
other accessible repository/folder. A deployment group maintains a history of all
migrations it has performed. It tracks what versions of objects were moved from which
folders in which source repositories, and into which folders in which target repositories
those versions were copied (i.e., it provides a complete audit trail of all migrations
performed). Because the deployment group knows what it moved and where, an administrator can,
if necessary, have the deployment group “undo” the most recent deployment, reverting the target
repository to its pre-deployment state. Using labels (as described in the Using PowerCenter Labels
Best Practice) allows objects in the subsequent repository to be tracked back to a specific deployment.

It is important to note that the deployment group only migrates the objects it contains to
the target repository/folder. It does not, itself, move to the target repository. It still
resides in the source repository.

Deploying via the GUI

Migrations can be performed via the GUI or the command line (pmrep). In order to
migrate objects via the GUI, simply drag a deployment group from the repository it
resides in onto the target repository where the referenced objects are to be moved. The
Deployment Wizard appears and steps the user through the deployment process. Once
the wizard is complete, the migration occurs, and the deployment history is created.

Deploying via the Command Line

Alternatively, the PowerCenter pmrep command can be used to automate both Folder
Level deployments (e.g., in a non-versioned repository) and deployments using
Deployment Groups. The commands DeployFolder and DeployDeploymentGroup in
pmrep are used respectively for these purposes. Whereas deployment via the GUI
requires stepping through a wizard and answering a series of questions to deploy, the
command-line deployment requires an XML control file that contains the same



information that the wizard requests. This file must be present before the deployment is
executed.

The following steps can be used to create a script that wraps pmrep commands and
automates PowerCenter deployments (a sample script follows this list):

1. Use pmrep ListObjects to return object metadata that can be parsed and passed to other
pmrep commands.
2. Use pmrep CreateDeploymentGroup to create a dynamic or static deployment group.
3. Use pmrep ExecuteQuery to output the query results to a persistent input file. This input
file can also be used with the AddToDeploymentGroup command.
4. Use DeployDeploymentGroup to copy a deployment group to a different repository.
A control file with all the specifications is required for this command.
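
A minimal sketch of such a script is shown below. It assumes a versioned source repository named INFATEST, a target repository named INFAPROD, a shared query named qry_marketing_release, and a deployment control file named dg_control.xml that has already been prepared; these names are placeholders, and the flags follow the PowerCenter 8.x pmrep reference and should be verified against the installed version.

REM Connect to the source repository
pmrep Connect -r INFATEST -d Domain_Main -n deploy_user -x deploy_pwd

REM Create a static deployment group for this release
pmrep CreateDeploymentGroup -p DG_MARKETING_REL1 -t static

REM Run the shared query and write the matching object versions to a persistent input file
pmrep ExecuteQuery -q qry_marketing_release -t shared -u query_results.txt

REM Add the query results to the deployment group
pmrep AddToDeploymentGroup -p DG_MARKETING_REL1 -i query_results.txt

REM Deploy the group to the target repository using the prepared control file
pmrep DeployDeploymentGroup -p DG_MARKETING_REL1 -c dg_control.xml -r INFAPROD

The persistent input file and the deployment history in the target repository together provide an audit trail of exactly which object versions were migrated by each run of the script.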

Additionally, a web interface can be built for entering/approving/rejecting code


migration requests. This can provide additional traceability and reporting capabilities to
the automation of PowerCenter code migrations.

Considerations for Deployment and Deployment Groups

Simultaneous Multi-Phase Projects

If multiple phases of a project are being developed simultaneously in separate folders,


it is possible to consolidate them by mapping folders appropriately through the
deployment group migration wizard. When migrating with deployment groups in this
way, the override buttons in the migration wizard are used to select specific folder
mappings.

Rolling Back a Deployment

Deployment groups help to ensure that there is a back-out methodology and that the
latest version of a deployment can be rolled back. To do this:

In the target repository (where the objects were migrated to), go to:

Versioning>>Deployment>>History>>View History>>Rollback.

The rollback purges all objects (of the latest version) that were in the deployment group. Initiate a
rollback on a deployment in order to roll back only the latest versions of the objects. The rollback
ensures that the check-in time for the repository objects is the same as the deploy time. The pmrep
command RollBackDeployment can also be used to automate rollbacks. Remember that you cannot
roll back part of a deployment; you must roll back all the objects in the deployment group.

Managing Repository Size

As objects are checked in and objects are deployed to target repositories, the number
of object versions in those repositories increases, as does the size of the repositories.

In order to manage repository size, use a combination of Check-in Date and Latest Status (both are
query parameters) to purge unneeded versions from the repository and retain only the very latest
version. All deleted versions of objects should also be purged to reduce the size of the repository.

If it is necessary to keep more than the latest version, labels can be included in the
query. These labels are ones that have been applied to the repository for the specific
purpose of identifying objects for purging.

Off-Shore, On-Shore Migration

When migrating from an off-shore development environment to an on-shore environment, other
aspects of the computing environment may make it desirable to generate a dynamic deployment
group. Instead of migrating the group itself to the next repository, a query can be used to select
the objects for migration and save them to a single XML file, which can then be transmitted to the
on-shore environment through alternative methods. If the on-shore repository is versioned,
importing the file activates the import wizard as if a deployment group were being received.

Code Migration from Versioned Repository to a Non-Versioned


Repository

In some instances, it may be desirable to migrate objects from a versioned repository to a
non-versioned repository. Note that migrating in this manner changes the wizards used, and that
the export from the versioned repository must take place using XML export.

Last updated: 27-May-08 13:20



Migration Procedures - PowerCenter

Challenge

Develop a migration strategy that ensures clean migration between development, test, quality assurance (QA), and
production environments, thereby protecting the integrity of each of these environments as the system evolves.

Description

Ensuring that an application has a smooth migration process between development, QA, and production
environments is essential for the deployment of an application. Deciding which migration strategy works best for a
project depends on two primary factors.

● How is the PowerCenter repository environment designed? Are there individual repositories for
development, QA, and production, or are there just one or two environments that share one or all of these
phases?
● How has the folder architecture been defined?

Each of these factors plays a role in determining the migration procedure that is most beneficial to the project.

PowerCenter offers flexible migration options that can be adapted to fit the need of each application. PowerCenter
migration options include repository migration, folder migration, object migration, and XML import/export. In
versioned PowerCenter repositories, users can also use static or dynamic deployment groups for migration, which
provides the capability to migrate any combination of objects within the repository with a single command.

This Best Practice is intended to help the development team decide which technique is most appropriate for the
project. The following sections discuss various options that are available, based on the environment and architecture
selected. Each section describes the major advantages of its use, as well as its disadvantages.

Repository Environments

The following section outlines the migration procedures for standalone and distributed repository environments. The
distributed environment section touches on several migration architectures, outlining the pros and cons of each.
Also, please note that any methods described in the Standalone section may also be used in a Distributed
environment.

Standalone Repository Environment

In a standalone environment, all work is performed in a single PowerCenter repository that serves as the metadata
store. Separate folders are used to represent the development, QA, and production workspaces and segregate work.
This type of architecture within a single repository ensures seamless migration from development to QA, and from
QA to production.

The following example shows a typical architecture. In this example, the company has chosen to create separate
development folders for each of the individual developers for development and unit test purposes. A single shared or
common development folder, SHARED_MARKETING_DEV, holds all of the common objects, such as sources,
targets, and reusable mapplets. In addition, two test folders are created for QA purposes. The first contains all of the
unit-tested mappings from the development folder. The second is a common or shared folder that contains all of the
tested shared objects. Eventually, as the following paragraphs explain, two production folders will also be built.



Proposed Migration Process – Single Repository

DEV to TEST – Object Level Migration

Now that we've described the repository architecture for this organization, let's discuss how it will migrate mappings
to test, and then eventually to production.

After all mappings have completed their unit testing, the process for migration to test can begin. The first step in this
process is to copy all of the shared or common objects from the SHARED_MARKETING_DEV folder to the
SHARED_MARKETING_TEST folder. This can be done using one of two methods:

● The first, and most common method, is object migration via an object copy. In this case, a user opens the
SHARED_MARKETING_TEST folder and drags the object from the SHARED_MARKETING_DEV into the
appropriate workspace (i.e., Source Analyzer, Warehouse Designer, etc.). This is similar to dragging a file
from one folder to another using Windows Explorer.
● The second approach is object migration via object XML import/export. A user can export each of the
objects in the SHARED_MARKETING_DEV folder to XML, and then re-import each object into
SHARED_MARKETING_TEST via XML import. With XML import/export, the XML files can be checked into
a third-party versioning tool if the organization has standardized on such a tool; otherwise, versioning
can be enabled in PowerCenter. Migration with versioned PowerCenter repositories is covered later in this
document. (A sample pmrep export/import sequence follows this list.)
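
The XML export and import can also be driven from the command line with pmrep, which is convenient when the exported XML is being checked into an external version control system. The commands below are a sketch only: the repository, folder, mapping, and file names are placeholders, the import requires a control file (impcntl.xml here) that maps the source folder to the target folder, and the flags should be confirmed against the pmrep reference for the installed PowerCenter version.

REM Connect to the standalone repository
pmrep Connect -r INFAREPO -d Domain_Main -n admin -x admin_pwd

REM Export a unit-tested mapping from a developer's folder to an XML file
pmrep ObjectExport -n m_load_customer -o mapping -f DEVELOPER1_DEV -u m_load_customer.xml

REM Import the XML into the test folder; the control file resolves folder mapping and conflict handling
pmrep ObjectImport -i m_load_customer.xml -c impcntl.xml

The XML file can be placed under version control between the export and the import, providing the same traceability as a third-party versioning tool.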

After you've copied all common or shared objects, the next step is to copy the individual mappings from each
development folder into the MARKETING_TEST folder. Again, you can use either of the two object-level migration
methods described above to copy the mappings to the folder, although the XML import/export method is the most
intuitive method for resolving shared object conflicts. However, the migration method is slightly different here when
you're copying the mappings because you must ensure that the shortcuts in the mapping are associated with the
SHARED_MARKETING_TEST folder. Designer prompts the user to choose the correct shortcut folder that you
created in the previous example, which points to SHARED_MARKETING_TEST (see image below). You can then
continue the migration process until all mappings have been successfully migrated. In PowerCenter 7 and later
versions, you can export multiple objects into a single XML file, and then import them at the same time.



The final step in the process is to migrate the workflows that use those mappings. Again, the object-level migration
can be completed either through drag-and-drop or by using XML import/export. In either case, this process is very
similar to the steps described above for migrating mappings, but differs in that the Workflow Manager provides a
Workflow Copy Wizard to guide you through the process. The following steps outline the full process for successfully
copying a workflow and all of its associated tasks.

1. The Wizard prompts for the name of the new workflow. If a workflow with the same name exists in the
destination folder, the Wizard prompts you to rename it or replace it. If no such workflow exists, a default
name is used. Then click “Next” to continue the copy process.
2. The Wizard then checks whether each task already exists in the destination folder (as shown below). If the
task is present, you can rename or replace the current one; if it does not exist, the default name is used. Then click “Next.”

3. Next, the Wizard prompts you to select the mapping associated with each session task in the workflow.
Select the mapping and continue by clicking “Next".



4. If connections exist in the target repository, the Wizard prompts you to select the connection to use for the
source and target. If no connections exist, the default settings are used. When this step is completed, click
"Finish" and save the work.

Initial Migration – New Folders Created

The initial move to production is very different from subsequent migrations of changes to mappings and workflows.
Since the repository only contains folders for development and test, two new folders must be created to house the
production-ready objects. Create these folders after testing of the objects in SHARED_MARKETING_TEST and
MARKETING_TEST has been approved.

The following steps outline the creation of the production folders and, at the same time, address the initial test to
production migration.

1. Open the PowerCenter Repository Manager client tool and log into the repository.
2. To make a shared folder for the production environment, highlight the SHARED_MARKETING_TEST folder,
drag it, and drop it on the repository name.
3. The Copy Folder Wizard appears to guide you through the copying process.



4. The first Wizard screen asks if you want to use the typical folder copy options or the advanced options. In this
example, we'll use the advanced options.

5. The second Wizard screen prompts you to enter a folder name. By default, the folder name that appears on
this screen is the folder name followed by the date. In this case, enter the name as
“SHARED_MARKETING_PROD.”



6. The third Wizard screen prompts you to select a folder to override. Because this is the first time you are
transporting the folder, you won’t need to select anything.

7. The final screen begins the actual copy process. Click "Finish" when the process is complete.



Repeat this process to create the MARKETING_PROD folder. Use the MARKETING_TEST folder as the
original to copy and associate the shared objects with the SHARED_MARKETING_PROD folder that you just
created.

At the end of the migration, you should have two additional folders in the repository environment for
production: SHARED_MARKETING_PROD and MARKETING_ PROD (as shown below). These folders
contain the initially migrated objects. Before you can actually run the workflow in these production folders, you
need to modify the session source and target connections to point to the production environment.

When you copy or replace a PowerCenter repository folder, the Copy Wizard copies the permissions for the
folder owner to the target folder. The wizard does not copy permissions for users, groups, or all others in the
repository to the target folder. Previously, the Copy Wizard copied the permissions for the folder owner,
owner’s group, and all users in the repository to the target folder.

Incremental Migration – Object Copy Example

Now that the initial production migration is complete, let's take a look at how future changes will be migrated into the
folder.



Any time an object is modified, it must be re-tested and migrated into production for the actual change to occur.
These types of changes in production take place on a case-by-case or periodically-scheduled basis. The following
steps outline the process of moving these objects individually.

1. Log into PowerCenter Designer. Open the destination folder and expand the source folder. Click on the object
to copy and drag-and-drop it into the appropriate workspace window.
2. Because this is a modification to an object that already exists in the destination folder, Designer prompts you
to choose whether to Rename or Replace the object (as shown below). Choose the option to Replace the
object.

3. In PowerCenter 7 and later versions, you can choose to compare conflicts whenever migrating any object in
Designer or Workflow Manager. By comparing the objects, you can ensure that the changes that you are
making are what you intend. See below for an example of the mapping compare window.



4. After the object has been successfully copied, save the folder so the changes can take place.
5. The newly copied mapping is now tied to any sessions that the replaced mapping was tied to.
6. Log into Workflow Manager and make the appropriate changes to the session or workflow so it can update
itself with the changes.

Standalone Repository Example

In this example, we look at moving development work to QA and then from QA to production, using a separate
development folder for each developer, with the test and production folders organized by the data mart they
represent. For this example, we focus solely on the MARKETING_DEV data mart, first explaining how to move
objects and mappings from each individual folder to the test folder and then how to move tasks, worklets, and
workflows to the new area.

Follow these steps to copy a mapping from Development to QA:

1. If using shortcuts, first follow these steps; if not using shortcuts, skip to step 2.
❍ Copy the tested objects from the SHARED_MARKETING_DEV folder to the SHARED_MARKETING_TEST folder.
❍ Drag all of the newly copied objects from the SHARED_MARKETING_TEST folder to MARKETING_TEST.
❍ Save your changes.
2. Copy the mapping from Development into Test.
❍ In the PowerCenter Designer, open the MARKETING_TEST folder, and drag and drop the mapping from each development folder into the MARKETING_TEST folder.
❍ When copying each mapping in PowerCenter, Designer prompts you to either Replace, Rename, or Reuse the object, or Skip for each reusable object, such as source and target definitions. Choose to Reuse the object for all shared objects in the mappings copied into the MARKETING_TEST folder.
❍ Save your changes.
3. If a reusable session task is being used, follow these steps. Otherwise, skip to step 4.
❍ In the PowerCenter Workflow Manager, open the MARKETING_TEST folder and drag and drop each reusable session from the developers’ folders into the MARKETING_TEST folder. A Copy Session Wizard guides you through the copying process.
❍ Open each newly copied session and click on the Source tab. Change the source to point to the source database for the Test environment.
❍ Click the Target tab. Change each connection to point to the target database for the Test environment. Be sure to double-check the workspace from within the Target tab to ensure that the load options are correct.
❍ Save your changes.
4. While the MARKETING_TEST folder is still open, copy each workflow from Development to Test.
❍ Drag each workflow from the development folders into the MARKETING_TEST folder. The Copy Workflow Wizard appears. Follow the same steps listed above to copy the workflow to the new folder.
❍ As mentioned earlier, in PowerCenter 7 and later versions, the Copy Wizard allows you to compare conflicts from within Workflow Manager to ensure that the correct migrations are being made.
❍ Save your changes.
5. Implement the appropriate security.
❍ In Development, the owner of the folders should be a user(s) in the development group.

❍ In Test, change the owner of the test folder to a user(s) in the test group.
❍ In Production, change the owner of the folders to a user in the production group.
❍ Revoke all rights to Public other than Read for the production folders.

Rules to Configure Folder and Global Object Permissions

Rules in 8.5:

● The folder or global object owner, or a user assigned the Administrator role for the Repository Service, can grant folder and global object permissions.
● Permissions can be granted to users, groups, and all others in the repository.
● The folder or global object owner and a user assigned the Administrator role for the Repository Service have all permissions, which you cannot change.

Rules in previous versions:

● Users with the appropriate repository privileges could grant folder and global object permissions.
● Permissions could be granted to the owner, owner’s group, and all others in the repository.
● You could change the permissions for the folder or global object owner.

Disadvantages of a Single Repository Environment

The biggest challenge with a single repository environment is managing database connections during migration.
When migrating objects from development to test to production, the same connection objects cannot be reused
because they point to the development or test databases; connections must be switched to the appropriate
environment as part of each migration. A single repository structure can also create confusion because the same
users and groups exist across all environments, and the number of folders can grow rapidly.

Distributed Repository Environment



A distributed repository environment maintains separate, independent repositories, hardware, and software for
development, test, and production environments. Separating repository environments is preferable for handling
development to production migrations. Because the environments are segregated from one another, work performed
in development cannot impact QA or production.

With a fully distributed approach, separate repositories function much like the separate folders in a standalone
environment. Each repository has a similar name, like the folders in the standalone environment. For instance, in our
Marketing example we would have three repositories, INFADEV, INFATEST, and INFAPROD. In the following
example, we discuss a distributed repository architecture.

There are four techniques for migrating from development to production in a distributed repository architecture, with
each involving some advantages and disadvantages.

● Repository Copy
● Folder Copy
● Object Copy
● Deployment Groups

Repository Copy

So far, this document has covered object-level migrations and folder migrations through drag-and-drop object
copying and object XML import/export. This section discusses migrations in a distributed repository environment
through repository copies.

The main advantages of this approach are:

● The ability to copy all objects (i.e., mappings, workflows, mapplets, reusable transformation, etc.) at once
from one environment to another.
● The ability to automate this process using pmrep commands, thereby eliminating many of the manual
processes that users typically perform.
● The ability to move everything without breaking or corrupting any of the objects.

This approach also involves a few disadvantages.



● The first is that everything is moved at once (which is also an advantage). The problem with this is that
everything is moved -- ready or not. For example, we may have 50 mappings in QA, but only 40 of them are
production-ready. The 10 untested mappings are moved into production along with the 40 production-ready
mappings, which leads to the second disadvantage.
● Significant maintenance is required to remove any unwanted or excess objects.
● There is also a need to adjust server variables, sequences, parameters/variables, database connections,
etc. Everything must be set up correctly before the actual production runs can take place.
● Lastly, the repository copy process requires that the existing Production repository be deleted, and then the
Test repository can be copied. This results in a loss of production environment operational metadata such
as load statuses, session run times, etc. High-performance organizations leverage the value of operational
metadata to track trends over time related to load success/failure and duration. This metadata can be a
competitive advantage for organizations that use this information to plan for future growth.

Now that we've discussed the advantages and disadvantages, we'll look at three ways to accomplish the Repository
Copy method:

● Copying the Repository


● Repository Backup and Restore
● PMREP

Copying the Repository

Copying the Test repository to Production through the GUI client tools is the easiest of all the migration
methods. First, ensure that all users are logged out of the destination repository and then connect to the
PowerCenter Repository Administration Console (as shown below).

If the Production repository already exists, you must delete the repository before you can copy the Test repository. Before you can delete the repository, you must run the Repository Service in exclusive mode.

1. Click the INFA_PROD repository in the left pane to select it, and change the running mode to exclusive by clicking the Edit button in the right pane under the Properties tab.



2. Delete the Production repository by selecting it and choosing “Delete” from the context menu.



3. Click the Action drop-down list and choose Copy Contents From.



4. In the new window, choose the domain name and the repository service “INFA_TEST” from the drop-down menus. Enter the username and password for the Test repository.

5. Click OK to begin the copy process.


6. When you've successfully copied the repository to the new location, exit from the PowerCenter Administration Console.
7. In the Repository Manager, double-click on the newly copied repository and log-in with a valid username and
password.
8. Verify connectivity, then highlight each folder individually and rename them. For example, rename the
MARKETING_TEST folder to MARKETING_PROD, and the SHARED_MARKETING_TEST to
SHARED_MARKETING_PROD.
9. Be sure to remove all objects that are not pertinent to the Production environment from the folders before
beginning the actual testing process.
10. When this cleanup is finished, you can log into the repository through the Workflow Manager. Modify the
server information and all connections so they are updated to point to the new Production locations for all
existing tasks and workflows.

Repository Backup and Restore

Backup and Restore Repository is another simple method of copying an entire repository. This process backs up the
repository to a binary file that can be restored to any new location. This method is preferable to the repository copy process because the backup is written to a binary file on the repository server; if any error occurs during the migration, the repository can be restored again from that file.

From 8.5 onwards, security information is maintained at the domain level. Before you back up a repository and
restore it in a different domain, verify that users and groups with privileges for the source Repository Service exist in
the target domain. The Service Manager periodically synchronizes the list of users and groups in the repository with
the users and groups in the domain configuration database. During synchronization, users and groups that do not
exist in the target domain are deleted from the repository.

You can use infacmd to export users and groups from the source domain and import them into the target domain.
Use infacmd ExportUsersAndGroups to export the users and groups to a file. Use infacmd ImportUsersAndGroups to
import the users and groups from the file into a different PowerCenter domain.
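As a hedged illustration of this step, the commands below show how the export and import might be scripted before a cross-domain restore. The option names (-dn for the domain, -un and -pd for the administrator credentials, -f for the export file) are assumptions based on common infacmd conventions and should be verified against the Command Line Reference for your release:

REM Export users and groups from the source domain (option names are assumptions; verify in the Command Line Reference)
infacmd ExportUsersAndGroups -dn SOURCE_DOMAIN -un Administrator -pd AdminPwd -f users_and_groups.xml

REM Import them into the target domain before restoring the repository backup there
infacmd ImportUsersAndGroups -dn TARGET_DOMAIN -un Administrator -pd AdminPwd -f users_and_groups.xml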

The following steps outline the process of backing up and restoring the repository for migration.

1. Launch the PowerCenter Administration Console, and highlight the INFA_TEST repository service. Select
Action -> Backup Contents from the drop-down menu.



2. A screen appears and prompts you to supply a name for the backup file as well as the Administrator
username and password. The file is saved to the Backup directory within the repository server’s home
directory.

3. After you've selected the location and file name, click OK to begin the backup process.

4. The backup process creates a .rep file containing all repository information. Stay logged into the Manage
Repositories screen. When the backup is complete, select the repository connection to which the backup will be restored (i.e., the Production repository).



5. The system will prompt you to supply a username, password, and the name of the file to be restored. Enter
the appropriate information and click OK.

When the restoration process is complete, you must repeat the steps listed in the copy repository option to delete all of the unused objects and rename the folders.

PMREP

Using the PMREP commands is essentially the same as the Backup and Restore Repository method except that it is
run from the command line rather than through the GUI client tools. pmrep is installed in the PowerCenter Client and
PowerCenter Services bin directories. PMREP utilities can be used from the Informatica Server or from any client
machine connected to the server. Refer to the Repository Manager Guide for a list of PMREP commands.

PMREP backup backs up the repository to the file specified with the -o option. You must provide the backup file
name. Use this command when the repository is running. You must be connected to a repository to use this
command.

The BackUp command uses the following syntax:

backup
-o <output_file_name>
[-d <description>]
[-f (overwrite existing output file)]
[-b (skip workflow and session logs)]
[-j (skip deploy group history)]
[-q (skip MX data)]
[-v (skip task statistics)]

The following is a sample of the command syntax used within a Windows batch file to connect to and back up a repository. Using this code example as a model, you can write scripts to run on a daily basis to perform functions such as connect, backup, restore, etc.:

backupproduction.bat

REM This batch file uses pmrep to connect to and back up the repository Production on the server Central

@echo off

echo Connecting to Production repository...

"<Informatica Installation Directory>\Server\bin\pmrep" connect -r INFAPROD -n Administrator -x Adminpwd -h infarepserver -o 7001

echo Backing up Production repository...

“<Informatica Installation Directory>\Server\bin\pmrep” backup -o c:\backup\Production_backup.rep

Alternatively, the following steps can be used:

1. Use infacmd commands to run the Repository Service in exclusive mode.
2. Use the pmrep backup command to back up the source repository.
3. Use the pmrep delete command to delete the content of the target repository (if content already exists in the target repository).
4. Use the pmrep restore command to restore the backup file into the target repository. A command-line sketch of this sequence is shown below.
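The following is a minimal sketch of that sequence as a Windows batch file, modeled on the backup example shown earlier. The repository names and file paths are illustrative, and the Delete and Restore options in particular are assumptions that should be confirmed in the pmrep Command Line Reference before use:

REM migrate_repository.bat - hedged sketch of steps 1 through 4 above
REM Step 1 (running the target Repository Service in exclusive mode) is assumed to have been
REM completed via the Administration Console or infacmd before this script runs.

echo Backing up the source (Test) repository...
"<Informatica Installation Directory>\Server\bin\pmrep" connect -r INFATEST -n Administrator -x Adminpwd -h infarepserver -o 7001
"<Informatica Installation Directory>\Server\bin\pmrep" backup -o c:\backup\Test_backup.rep -f

echo Deleting the content of the target (Production) repository...
"<Informatica Installation Directory>\Server\bin\pmrep" connect -r INFAPROD -n Administrator -x Adminpwd -h infarepserver -o 7001
REM Delete may prompt for confirmation or require a force option; check the Command Line Reference.
"<Informatica Installation Directory>\Server\bin\pmrep" delete

echo Restoring the backup into the target repository...
REM The input-file option shown here (-i) is an assumption; confirm the Restore syntax before use.
"<Informatica Installation Directory>\Server\bin\pmrep" restore -i c:\backup\Test_backup.rep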

Post-Repository Migration Cleanup

After you have used one of the repository migration procedures to migrate into Production, follow these steps to
convert the repository to Production:

1. Disable workflows that are not ready for Production or simply delete the mappings, tasks, and workflows.

❍ Disable the workflows not being used in the Workflow Manager by opening the workflow
properties, then checking the Disabled checkbox under the General tab.
❍ Delete the tasks not being used in the Workflow Manager and the mappings in the Designer

2. Modify the database connection strings to point to the production sources and targets.

❍ In the Workflow Manager, select Relational connections from the Connections menu.
❍ Edit each relational connection by changing the connect string to point to the production
sources and targets.
❍ If you are using lookup transformations in the mappings and the connect string is anything
other than $SOURCE or $TARGET, you will need to modify the connect strings appropriately.

3. Modify the pre- and post-session commands and SQL as necessary.

❍ In the Workflow Manager, open the session task properties, and from the Components tab
make the required changes to the pre- and post-session scripts.

4. Implement appropriate security, such as:



❍ In Development, ensure that the owner of the folders is a user in the development group.
❍ In Test, change the owner of the test folders to a user in the test group.
❍ In Production, change the owner of the folders to a user in the production group.
❍ Revoke all rights to Public other than Read for the Production folders.

Folder Copy

Although deployment groups are becoming a very popular migration method, the folder copy method has historically
been the most popular way to migrate in a distributed environment. Copying an entire folder allows you to quickly
promote all of the objects located within that folder. All source and target objects, reusable transformations,
mapplets, mappings, tasks, worklets and workflows are promoted at once. Because of this, however, everything in
the folder must be ready to migrate forward. If some mappings or workflows are not valid, then developers (or the
Repository Administrator) must manually delete these mappings or workflows from the new folder after the folder is
copied.

The three advantages of using the folder copy method are:

● The Repository Manager's Folder Copy Wizard makes it almost seamless to copy an entire folder and all the
objects located within it.
● If the project uses a common or shared folder and this folder is copied first, then all shortcut relationships
are automatically converted to point to this newly copied common or shared folder.
● All connections, sequences, mapping variables, and workflow variables are copied automatically.

The primary disadvantage of the folder copy method is that the repository is locked while the folder copy is being
performed. Therefore, it is necessary to schedule this migration task during a time when the repository is least
utilized. Remember that a locked repository means that no jobs can be launched during this process. This can be a
serious consideration in real-time or near real-time environments.

The following example steps through the process of copying folders from each of the different environments. The first
example uses three separate repositories for development, test, and production.

1. If using shortcuts, copy the common or shared folder first by following these sub-steps; otherwise skip to step 2:

● Open the Repository Manager client tool.


● Connect to both the Development and Test repositories.
● Highlight the folder to copy and drag it to the Test repository.
● The Copy Folder Wizard appears to step you through the copy process.
● When the folder copy process is complete, open the newly copied folder in both the
Repository Manager and Designer to ensure that the objects were copied properly.

2. Copy the Development folder to Test. If you skipped step 1, follow these sub-steps:

● Open the Repository Manager client tool.


● Connect to both the Development and Test repositories.
● Highlight the folder to copy and drag it to the Test repository.

The Copy Folder Wizard will appear.



3. Follow these steps to ensure that all shortcuts are reconnected.

● Use the advanced options when copying the folder across.


● Select Next to use the default name of the folder

4. If the folder already exists in the destination repository, choose to replace the folder.

The following screen appears to prompt you to select the folder where the new shortcuts are located.



In a situation where the folder names do not match, a folder compare will take place. The Copy
Folder Wizard then completes the folder copy process. Rename the folder as appropriate and
implement the security.

5. When testing is complete, repeat the steps above to migrate to the Production repository.

When the folder copy process is complete, log onto the Workflow Manager and change the connections to point to
the appropriate target location. Ensure that all tasks updated correctly and that folder and repository security is
modified for test and production.

Object Copy

Copying mappings into the next stage in a networked environment involves many of the same advantages and
disadvantages as in the standalone environment, but the process of handling shortcuts is simplified in the networked
environment. For additional information, see the earlier description of Object Copy for the standalone environment.

One advantage of Object Copy in a distributed environment is that it provides more granular control over objects.

Two distinct disadvantages of Object Copy in a distributed environment are:

● Much more work to deploy an entire group of objects


● Shortcuts must exist prior to importing/copying mappings

Below are the steps to complete an object copy in a distributed repository environment:

1. If using shortcuts, follow these sub-steps, otherwise skip to step 2:

● In each of the distributed repositories, create a common folder with the exact same name and case.
● Copy the shortcuts into the common folder in Production, making sure the shortcut has the exact same
name.

2. Copy the mapping from the Test environment into Production.



● In the Designer, connect to both the Test and Production repositories and open the appropriate folders in
each.
● Drag-and-drop the mapping from Test into Production.
● During the mapping copy process, PowerCenter 7 and later versions allow a comparison of this mapping to
an existing copy of the mapping already in Production. Note that the ability to compare objects is not limited
to mappings, but is available for all repository objects including workflows, sessions, and tasks.

3. Create or copy a workflow with the corresponding session task in the Workflow Manager to run the mapping
(first ensure that the mapping exists in the current repository).

● If copying the workflow, follow the Copy Wizard.


● If creating the workflow, add a session task that points to the mapping and enter all the appropriate
information.

4. Implement appropriate security.

● In Development, ensure the owner of the folders is a user in the development group.
● In Test, change the owner of the test folders to a user in the test group.
● In Production, change the owner of the folders to a user in the production group.
● Revoke all rights to Public other than Read for the Production folders.

Deployment Groups

For versioned repositories, the use of Deployment Groups for migrations between distributed environments allows
the most flexibility and convenience. With Deployment Groups, you can migrate individual objects as you would in
an object copy migration, but can also have the convenience of a repository- or folder-level migration as all objects
are deployed at once. The objects included in a deployment group have no restrictions and can come from one or
multiple folders. For additional convenience, you can set up a dynamic deployment group that allows the
objects in the deployment group to be defined by a repository query, rather than being added to the deployment
group manually. Lastly, because deployment groups are available on versioned repositories, they also have the
ability to be rolled back, reverting to the previous versions of the objects, when necessary.

Advantages of Using Deployment Groups


● Backup and restore of the Repository needs to be performed only once.
● Copying a Folder replaces the previous copy.
● Copying a Mapping allows for different names to be used for the same object.
● Uses for Deployment Groups

❍ Deployment Groups are containers that hold references to objects that need to be migrated.
❍ Allows for version-based object migration.
❍ Faster and more flexible than folder moves for incremental changes.
❍ Allows for migration “rollbacks”
❍ Allows specifying individual objects to copy, rather than the entire contents of a folder.

Types of Deployment Groups


● Static

❍ Contain direct references to versions of objects that need to be moved.



❍ Users explicitly add the version of the object to be migrated to the deployment group.

● Dynamic

❍ Contain a query that is executed at the time of deployment.


❍ The results of the query (i.e. object versions in the repository) are then selected and copied to the
target repository

Pre-Requisites

Create required folders in the Target Repository

Creating Labels

A label is a versioning object that you can associate with any versioned object or group of versioned objects in a
repository.

● Advantages

❍ Tracks versioned objects during development.


❍ Improves query results.
❍ Associates groups of objects for deployment.
❍ Associates groups of objects for import and export.

● Create label

❍ Create labels through the Repository Manager.


❍ After creating the labels, go to edit mode and lock them.
❍ The "Lock" option is used to prevent other users from editing or applying the label.
❍ This option can be enabled only when the label is edited.
❍ Some Standard Label examples are:

■ Development
■ Deploy_Test
■ Test
■ Deploy_Production
■ Production

● Apply Label

❍ Create a query to identify the objects that need to be labeled.


❍ Run the query and apply the labels.

Note: By default, the latest version of the object gets labeled.

Queries

A query is an object used to search for versioned objects in the repository that meet specific conditions.

● Advantages

❍ Tracks objects during development



❍ Associates a query with a deployment group
❍ Finds deleted objects you want to recover
❍ Finds groups of invalidated objects you want to validate

● Create a query

❍ The Query Browser allows you to create, edit, run, or delete object queries

● Execute a query

❍ Execute through Query Browser


❍ EXECUTE QUERY (from the command line): ExecuteQuery -q query_name -t query_type -u persistent_output_file_name -a append -c column_separator -r end-of-record_separator -l end-of-listing_indicator -b verbose (see the example below)
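For example, using the syntax above, a shared query could be executed from the command line and its matching object versions written to a text file (the query name and output file shown here are illustrative):

REM Run a shared object query and write the matching object versions to a results file
pmrep executequery -q Deploy_Production_Query -t shared -u c:\deploy\query_results.txt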

Creating a Deployment Group

Follow these steps to create a deployment group:

1. Launch the Repository Manager client tool and log in to the source repository.
2. Expand the repository, right-click on “Deployment Groups” and choose “New Group.”

3. In the dialog window, give the deployment group a name, and choose whether it should be static or
dynamic. In this example, we are creating a static deployment group. Click OK.



Adding Objects to a Static Deployment Group

Follow these steps to add objects to a static deployment group:

1. In Designer, Workflow Manager, or Repository Manager, right-click an object that you want to add to the
deployment group and choose “Versioning” -> “View History.” The “View History” window appears.

2. In the “View History” window, right-click the object and choose “Add to Deployment Group.”



3. In the Deployment Group dialog window, choose the deployment group that you want to add the object to,
and click OK.

4. In the final dialog window, choose whether you want to add dependent objects. In most cases, you will want
to add dependent objects to the deployment group so that they will be migrated as well. Click OK.



NOTE: The “All Dependencies” option should be used for any new code that is migrating forward. However, this
option can cause issues when moving existing code forward because “All Dependencies” also flags shortcuts. During
the deployment, PowerCenter tries to re-insert or replace the shortcuts. This does not work, and causes the
deployment to fail.

The object will be added to the deployment group at this time.

Although the deployment group allows the most flexibility, the task of adding each object to the deployment group is
similar to the effort required for an object copy migration. To make deployment groups easier to use, PowerCenter
allows the capability to create dynamic deployment groups.

Adding Objects to a Dynamic Deployment Group

Dynamic Deployment groups are similar in function to static deployment groups, but differ in the way that objects are
added. In a static deployment group, objects are manually added one by one. In a dynamic deployment group, the
contents of the deployment group are defined by a repository query. Writing a repository query is quite simple and is aided by the PowerCenter GUI interface.

Follow these steps to add objects to a dynamic deployment group:

1. First, create a deployment group, just as you did for a static deployment group, but in this case, choose the
dynamic option. Also, select the “Queries” button.



2. The “Query Browser” window appears. Choose “New” to create a query for the dynamic deployment group.

3. In the Query Editor window, provide a name and query type (Shared). Define criteria for the objects that
should be migrated. The drop-down list of parameters lets you choose from 23 predefined metadata
categories. In this case, the developers have assigned the “RELEASE_20050130” label to all objects that
need to be migrated, so the query is defined as “Label Is Equal To ‘RELEASE_20050130’”. The creation and
application of labels are discussed in Using PowerCenter Labels.



4. Save the Query and exit the Query Editor. Click OK on the Query Browser window, and close the Deployment
Group editor window.

Executing a Deployment Group Migration

A Deployment Group migration can be executed through the Repository Manager client tool, or through the pmrep
command line utility. With the client tool, you simply drag the deployment group from the source repository and drop
it on the destination repository. This opens the Copy Deployment Group Wizard, which guides you through the step-
by-step options for executing the deployment group.

Rolling Back a Deployment

To roll back a deployment, you must first locate the deployment in the target repository via the menu bar (i.e., Deployments -> History -> View History -> Rollback).

Automated Deployments

For the optimal migration method, you can set up a UNIX shell or Windows batch script that calls the pmrep DeployDeploymentGroup command, which can execute a deployment group migration without human intervention. This is ideal because the deployment group allows the greatest flexibility and convenience, and the script can be scheduled to run overnight, thereby causing minimal impact on developers and the PowerCenter administrator. You can also use the pmrep utility to automate importing objects via XML.
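A minimal sketch of such a script is shown below, reusing the connection syntax from the earlier backup example. The DeployDeploymentGroup options shown (-p for the deployment group name, -c for the XML deployment control file, -r for the target repository) are assumptions and should be confirmed against the pmrep Command Line Reference for your release:

REM deploy_release.bat - hedged sketch of an automated deployment group migration
echo Connecting to the source (Test) repository...
"<Informatica Installation Directory>\Server\bin\pmrep" connect -r INFATEST -n Administrator -x Adminpwd -h infarepserver -o 7001

echo Deploying the release deployment group to Production...
REM -p deployment group name, -c deployment control file, -r target repository (assumed option names)
"<Informatica Installation Directory>\Server\bin\pmrep" deploydeploymentgroup -p DG_RELEASE -c c:\deploy\deploy_control.xml -r INFAPROD

Scheduling a script like this through the operating system scheduler provides the overnight, hands-off migration described above.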

Recommendations

Informatica recommends using the following process when running in a three-tiered environment with development,
test, and production servers.



Non-Versioned Repositories

For migrating from development into test, Informatica recommends using the Object Copy method. This method
gives you total granular control over the objects that are being moved. It also ensures that the latest development
mappings can be moved over manually as they are completed. For recommendations on performing this copy
procedure correctly, see the steps listed in the Object Copy section.

Versioned Repositories

For versioned repositories, Informatica recommends using the Deployment Groups method for repository migration in
a distributed repository environment. This method provides the greatest flexibility in that you can promote any object
from within a development repository (even across folders) into any destination repository. Also, by using labels,
dynamic deployment groups, and the enhanced pmrep command line utility, the use of the deployment group
migration method results in automated migrations that can be executed without manual intervention.

Third-Party Versioning

Some organizations have standardized on third-party version control software. PowerCenter’s XML import/export
functionality offers integration with such software and provides a means to migrate objects. This method is most
useful in a distributed environment because objects can be exported into an XML file from one repository and
imported into the destination repository.

The XML Object Copy Process allows you to copy nearly all repository objects, including sources, targets, reusable
transformations, mappings, mapplets, workflows, worklets, and tasks. Beginning with PowerCenter 7, the export/import functionality allows the export/import of multiple objects to a single XML file. This can
significantly cut down on the work associated with object level XML import/export.

The following steps outline the process of exporting the objects from source repository and importing them into the
destination repository:

Exporting

1. From Designer or Workflow Manager, login to the source repository. Open the folder and highlight the object
to be exported.
2. Select Repository -> Export Objects



3. The system prompts you to select a directory location on the local workstation. Choose the directory to save
the file. Using the default name for the XML file is generally recommended.
4. Open Windows Explorer and go to the PowerCenter client installation directory (for example, C:\Program Files\Informatica PowerCenter <version>\Client). This may vary depending on where you installed the client tools.
5. Find the powrmart.dtd file, make a copy of it, and paste the copy into the directory where you saved the XML
file.
6. Together, these files are now ready to be added to the version control software

Importing

Log in to the Designer or Workflow Manager client tool and connect to the destination repository. Open the folder where the object is to be imported.

1. Select Repository -> Import Objects.


2. The system prompts you to select a directory location and file to import into the repository.
3. The following screen appears with the steps for importing the object.

4. Select the mapping and add it to the Objects to Import list.



5. Click "Next", and then click "Import". Since the shortcuts have been added to the folder, the mapping will now
point to the new shortcuts and their parent folder.
6. It is important to note that the pmrep command line utility was greatly enhanced in PowerCenter 7 and later versions, allowing the activities associated with XML import/export to be automated through pmrep (a hedged sketch follows this list).
7. Click on the destination repository service in the left pane and choose Action -> Restore. (Remember, if the destination repository has content, it must be deleted prior to restoring.)
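As a hedged sketch, the export/import pair might be scripted as follows so that the XML files can be checked into the version control tool and later imported into the destination repository. The mapping, folder, and file names are illustrative, and the ObjectExport/ObjectImport options shown (-n object name, -o object type, -f folder, -u output file, -i input file, -c import control file) are assumptions to be verified against the pmrep Command Line Reference:

REM Export a mapping from the source repository to XML (option names are assumptions)
"<Informatica Installation Directory>\Server\bin\pmrep" connect -r INFADEV -n Administrator -x Adminpwd -h infarepserver -o 7001
"<Informatica Installation Directory>\Server\bin\pmrep" objectexport -n m_load_customers -o mapping -f MARKETING_DEV -u c:\vcs\m_load_customers.xml

REM Import the same XML into the destination repository, driven by an import control file
"<Informatica Installation Directory>\Server\bin\pmrep" connect -r INFATEST -n Administrator -x Adminpwd -h infarepserver -o 7001
"<Informatica Installation Directory>\Server\bin\pmrep" objectimport -i c:\vcs\m_load_customers.xml -c c:\vcs\import_control.xml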

Last updated: 04-Jun-08 16:18



Migration Procedures - PowerExchange

Challenge

To facilitate the migration of PowerExchange definitions from one environment to another.

Description

There are two approaches to perform a migration.

● Using the DTLURDMO utility


● Using the PowerExchange Client tool (Detail Navigator)

DTLURDMO Utility

Step 1: Validate connectivity between the client and listeners

● Test communication between clients and all listeners in the production environment with:

dtlrexe prog=ping loc=<nodename>.

● Run selected jobs to exercise data access through PowerExchange data maps.

Step 2: Run DTLURDMO to copy PowerExchange objects.

At this stage, if PowerExchange is to run against new versions of the PowerExchange objects rather than
existing libraries, you need to copy the datamaps. To do this, use the PowerExchange Copy Utility
DTLURDMO. The following section assumes that the entire datamap set is to be copied. DTLURDMO
does have the ability to copy selectively, however, and the full functionality of the utility is documented in
the PowerExchange Utilities Guide.

The types of definitions that can be managed with this utility are:

● PowerExchange data maps



● PowerExchange capture registrations
● PowerExchange capture extraction data maps

On MVS, the input statements for this utility are taken from SYSIN.

On non-MVS platforms, the input argument points to a file containing the input definition. If no input argument is provided, the utility looks for a file named dtlurdmo.ini in the current path.

The utility runs on all capture platforms.

Windows and UNIX Command Line

Syntax: DTLURDMO <dtlurdmo definition file>

For example: DTLURDMO e:\powerexchange\bin\dtlurdmo.ini

● DTLURDMO Definition file specification - This file is used to specify how the DTLURDMO
utility operates. If no definition file is specified, it looks for a file dtlurdmo.ini in the current path.

MVS DTLURDMO job utility

Run the utility by submitting the DTLURDMO job, which can be found in the RUNLIB library.

● DTLURDMO Definition file specification - This file is used to specify how the DTLURDMO
utility operates and is read from the SYSIN card.

AS/400 utility

Syntax: CALL PGM(<location and name of DTLURDMO executable file>)

For example: CALL PGM(dtllib/DTLURDMO)

● DTLURDMO Definition file specification - This file is used to specify how the DTLURDMO
utility operates. By default, the definition is in the member CFG/DTLURDMO in the current datalib
library.

If you want to create a separate DTLURDMO definition file rather than use the default location, you must
give the library and filename of the definition file as a parameter. For example: CALL PGM(dtllib/
DTLURDMO) parm ('datalib/deffile(dtlurdmo)')

Running DTLURDMO

The utility should be run extracting information from the files locally, then writing out the datamaps through
the new PowerExchange V8.x.x Listener. This causes the datamaps to be written out in the format
required for the upgraded PowerExchange. DTLURDMO must be run once for the datamaps, then again
for the registrations, and then the extract maps if this is a capture environment. Commands for mixed
datamaps, registrations, and extract maps cannot be run together.



If only a subset of the PowerExchange datamaps, registrations, and extract maps are required, then
selective copies can be carried out. Details of performing selective copies are documented fully in the
PowerExchange Utilities Guide. This document assumes that everything is going to be migrated from the
existing environment to the new V8.x.x format.

Definition File Example

The following example shows a definition file to copy all datamaps from the existing local datamaps (the
local datamaps are defined in the DATAMAP DD card in the MVS JCL or by the path on Windows or
UNIX) to the V8.x.x listener (defined by the TARGET location node1):

USER DTLUSR;

EPWD A3156A3623298FDC;

SOURCE LOCAL;

TARGET NODE1;

DETAIL;

REPLACE;

DM_COPY;

SELECT schema=*;

Note: The encrypted password (EPWD) is generated from the FILE, ENCRYPT PASSWORD option from
the PowerExchange Navigator.

PowerExchange Client tool (Detail Navigator)

Step 1: Validate connectivity between the client and listeners

● Test communication between clients and all listeners in the production environment with:

dtlrexe prog=ping loc=<nodename>.



● Run selected jobs to exercise data access through PowerExchange data maps.

Step 2: Start the PowerExchange Navigator

● Select the datamap that is going to be promoted to production.


● On the menu bar, select a file to send to the remote node.

On the drop-down list box, choose the appropriate location (in this case, mvs_prod).

Supply the user name and password and click OK.


A confirmation message for successful migration is displayed.



Last updated: 06-Feb-07 11:39



Running Sessions in Recovery Mode

Challenge

Understanding the recovery options that are available for PowerCenter when errors are
encountered during the load.

Description

When a task in the workflow fails at any point, one option is to truncate the target and
run the workflow again from the beginning. As an alternative, the workflow can be
suspended and the error fixed, avoiding the need to re-process the portion of the
workflow that completed without errors. This option, "Suspend on Error", results in accurate and
complete target data, as if the session had completed successfully in one run. There are
also recovery options available for workflows and tasks that can be used to handle
different failure scenarios.

Configure Mapping for Recovery

For consistent recovery, the mapping needs to produce the same result, and in the
same order, in the recovery execution as in the failed execution. This can be achieved
by sorting the input data using either the sorted ports option in Source Qualifier (or
Application Source Qualifier) or by using a sorter transformation with distinct rows
option immediately after source qualifier transformation. Additionally, ensure that all the
targets received data from transformations that produce repeatable data.

Configure Session for Recovery

The recovery strategy can be configured on the Properties page of the Session task.
Enable the session for recovery by selecting one of the following three Recovery
Strategies:

● Resume from the last checkpoint

❍ The Integration Service saves the session recovery information and


updates recovery tables for a target database.
❍ If a session interrupts, the Integration Service uses the saved recovery
information to recover it.



❍ The Integration Service recovers a stopped, aborted or terminated
session from the last checkpoint.

● Restart task

❍ The Integration Service does not save session recovery information.


❍ If a session interrupts, the Integration Service reruns the session during
recovery.

● Fail task and continue workflow

❍ The Integration Service recovers a workflow; it does not recover the


session. The session status becomes failed and the Integration Service
continues running the workflow.

Configure Workflow for Recovery

The Suspend on Error option directs the Integration Service to suspend the workflow
while the error is being fixed and then it resumes the workflow. The workflow is
suspended when any of the following tasks fail:

● Session
● Command
● Worklet
● Email

When a task fails in the workflow, the Integration Service stops running tasks in the
path. The Integration Service does not evaluate the output link of the failed task. If no
other task is running in the workflow, the Workflow Monitor displays the status of the
workflow as "Suspended."

If one or more tasks are still running in the workflow when a task fails, the Integration
Service stops running the failed task and continues running tasks in other paths. The
Workflow Monitor displays the status of the workflow as "Suspending." When the status
of the workflow is "Suspended" or "Suspending," you can fix the error, such as a target
database error, and recover the workflow in the Workflow Monitor. When you recover a
workflow, the Integration Service restarts the failed tasks and continues evaluating the
rest of the tasks in the workflow. The Integration Service does not run any task that
already completed successfully.

Truncate Target Table



If the truncate table option is enabled in a recovery-enabled session, the target table is
not truncated during recovery process.

Session Logs

In a suspended workflow scenario, the Integration Service uses the existing session log
when it resumes the workflow from the point of suspension. However, the earlier runs
that caused the suspension are recorded in the historical run information in the
repository.

Suspension Email

The workflow can be configured to send an email when the Integration Service
suspends the workflow. When a task fails, the workflow is suspended and suspension
email is sent. The error can be fixed and the workflow can be resumed subsequently.
If another task fails while the Integration Service is suspending the workflow, another
suspension email is not sent. The Integration Service only sends out another
suspension email if another task fails after the workflow resumes. Check the "Browse
Emails" button on the General tab of the Workflow Designer Edit sheet to configure the
suspension email.

Suspending Worklets

When the "Suspend On Error" option is enabled for the parent workflow, the Integration
Service also suspends the worklet if a task within the worklet fails. When a task in the
worklet fails, the Integration Service stops executing the failed task and other tasks in
its path. If no other task is running in the worklet, the status of the worklet is
"Suspended". If other tasks are still running in the worklet, the status of the worklet is
"Suspending". The parent workflow is also suspended when the worklet is "Suspended"
or "Suspending".

Starting Recovery

The recovery process can be started using the Workflow Manager or the Workflow Monitor. Alternatively, the recovery process can be started by using pmcmd in command-line mode or from a script.
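As a hedged example, a suspended or failed workflow might be recovered from the command line as follows; the service, domain, folder, and workflow names are illustrative, and the recoverworkflow option flags should be verified against the pmcmd Command Line Reference for your release:

REM Recover a workflow without opening the client tools (names and option flags are assumptions)
pmcmd recoverworkflow -sv INT_SVC_PROD -d Domain_PROD -u Administrator -p AdminPwd -f MARKETING_PROD wf_load_customers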

Recovery Tables and Recovery Process

When the Integration Service runs a session that has a resume recovery strategy, it



writes to recovery tables on the target database system. When the Integration Service
recovers the session, it uses information in the recovery tables to determine where to
begin loading data to target tables. If you want the Integration Service to create the
recovery tables, grant table creation privilege to the database user name that is
configured in the target database connection. If you do not want the Integration Service
to create the recovery tables, create the recovery tables manually. The Integration
Service creates the following recovery tables in the target database:

PM_RECOVERY - Contains target load information for the session run. The Integration
Service removes the information from this table after each successful session and
initializes the information at the beginning of subsequent sessions.

PM_TGT_RUN_ID - Contains information that the Integration Service uses to identify each target on the database. The information remains in the table between session
runs. If you manually create this table, you must create a row and enter a value other
than zero for LAST_TGT_RUN_ID to ensure that the session recovers successfully.

PM_REC_STATE - When the Integration Service runs a real-time session that uses the
recovery table and that has recovery enabled, it creates a recovery table,
PM_REC_STATE, on the target database to store message IDs and commit numbers.
When the Integration Service recovers the session, it uses information in the recovery
tables to determine if it needs to write the message to the target table. The table
contains information that the Integration Service uses to determine if it needs to write
messages to the target table during recovery for a real-time session.

If you edit or drop the recovery tables before you recover a session, the Integration
Service cannot recover the session. If you disable recovery, the Integration Service
does not remove the recovery tables from the target database and you must manually
remove them

Session Recovery Considerations

The following options affect whether the session is incrementally recoverable:

● Output is deterministic. A property that determines if the transformation


generates the same set of data for each session run.
● Output is repeatable. A property that determines if the transformation
generates the data in the same order for each session run. You can set this
property for Custom transformations.
● Lookup source is static. A Lookup transformation property that determines if
the lookup source is the same between the session and recovery. The



Integration Service uses this property to determine if the output is
deterministic.

Inconsistent Data During Recovery Process

For recovery to be effective, the recovery session must produce the same set of rows;
and in the same order. Any change after initial failure (in mapping, session and/or in the
Integration Service) that changes the ability to produce repeatable data, results in
inconsistent data during the recovery process. The following situations may produce
inconsistent data during a recovery session:

● Session performs incremental aggregation and the Integration Service stops


unexpectedly.
● Mapping uses sequence generator transformation.
● Mapping uses a normalizer transformation.
● Source and/or target changes after initial session failure.
● Data movement mode change after initial session failure.
● Code page (server, source or target) changes, after initial session failure.
● Mapping changes in a way that causes server to distribute or filter or
aggregate rows differently.
● The session uses a configuration that PowerCenter does not support for session recovery.
● Mapping uses a lookup table and the data in the lookup table changes
between session runs.
● Session sort order changes, when server is running in Unicode mode.

HA Recovery

Highly-available recovery allows the workflow to resume automatically in the case of Integration Service failover. The following options are available on the Properties tab of the workflow:

● Enable HA recovery - Allows the workflow to be configured for high availability.
● Automatically recover terminated tasks - Recovers terminated Session or Command tasks without user intervention.
● Maximum automatic recovery attempts - When you automatically recover terminated tasks, you can choose the number of times the Integration Service attempts to recover the task. The default setting is 5.

Last updated: 26-May-08 11:28



Using PowerCenter Labels

Challenge

Using labels effectively in a data warehouse or data integration project to assist with
administration and migration.

Description

A label is a versioning object that can be associated with any versioned object or group of
versioned objects in a repository. Labels provide a way to tag a number of object versions with
a name for later identification. Therefore, a label is a named object in the repository, whose
purpose is to be a “pointer” or reference to a group of versioned objects. For example, a label
called “Project X version X” can be applied to all object versions that are part of that project and
release.

Labels can be used for many purposes:

● Track versioned objects during development


● Improve object query results.
● Create logical groups of objects for future deployment.
● Associate groups of objects for import and export.

Note that labels apply to individual object versions, and not objects as a whole. So if a mapping
has ten versions checked in, and a label is applied to version 9, then only version 9 has that
label. The other versions of that mapping do not automatically inherit that label. However,
multiple labels can point to the same object for greater flexibility.

The “Use Repository Manager” privilege is required in order to create or edit labels. To create a label, choose Versioning >> Labels from the Repository Manager.



When creating a new label, choose a name that is as descriptive as possible. For example, a
suggested naming convention for labels is: Project_Version_Action. Include comments for
further meaningful description.

Locking the label is also advisable. This prevents anyone from accidentally associating
additional objects with the label or removing object references for the label.

Labels, like other global objects such as Queries and Deployment Groups, can have user and
group privileges attached to them. This allows an administrator to create a label that can only
be used by specific individuals or groups. Only those people working on a specific project
should be given read/write/execute permissions for labels that are assigned to that project.



Once a label is created, it should be applied to related objects. To apply the label to objects,
invoke the “Apply Label” wizard from the Versioning >> Apply Label menu option from the menu
bar in the Repository Manager (as shown in the following figure).

Applying Labels

Labels can be applied to any object and cascaded upwards and downwards to parent and/or
child objects. For example, to group dependencies for a workflow, apply a label to all children
objects. The Repository Service applies labels to the sources, targets, mappings, and tasks
associated with the workflow. Use the “Move label” property to point the label to the latest
version of the object(s).

Note: Labels can be applied to any object version in the repository except checked-out
versions. Execute permission is required for applying labels.
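Labels can also be created and applied from the command line, which is useful when label application is scripted as part of a release process. A hedged sketch is shown below; the CreateLabel and ApplyLabel options (-a label name, -c comment, -n object name, -o object type, -f folder) are assumptions to be verified against the pmrep Command Line Reference, and the label, mapping, and folder names are illustrative:

REM Create a release label and apply it to a mapping (option names are assumptions)
pmrep connect -r INFADEV -n Administrator -x Adminpwd -h infarepserver -o 7001
pmrep createlabel -a ProjectX_V1_ReadyForTest -c "Objects unit-tested and ready for migration"
pmrep applylabel -a ProjectX_V1_ReadyForTest -n m_load_customers -o mapping -f MARKETING_DEV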

After the label has been applied to related objects, it can be used in queries and deployment
groups (see the Best Practice on Deployment Groups). Labels can also be used to manage the
size of the repository (i.e. to purge object versions).

Using Labels in Deployment

An object query can be created using the existing labels (as shown below). Labels can be
associated only with a dynamic deployment group. Based on the object query, objects
associated with that label can be used in the deployment.



Strategies for Labels

Repository Administrators and other individuals in charge of migrations should develop their
own label strategies and naming conventions in the early stages of a data integration project.
Be sure that developers are aware of the uses of these labels and when they should apply
labels.

For each planned migration between repositories, choose three labels for the development and
subsequent repositories:

● The first is to identify the objects that developers can mark as ready for migration.
● The second should apply to migrated objects, thus developing a migration audit trail.
● The third is to apply to objects as they are migrated into the receiving repository,
completing the migration audit trail.



When preparing for the migration, use the first label to construct a query to build a dynamic
deployment group. The second and third labels in the process are optionally applied by the
migration wizard when copying folders between versioned repositories. Developers and
administrators do not need to apply the second and third labels manually.

Additional labels can be created with developers to allow the progress of mappings to be
tracked if desired. For example, when an object is successfully unit-tested by the developer, it
can be marked as such. Developers can also label the object with a migration label at a later
time if necessary. Using labels in this fashion along with the query feature allows complete or
incomplete objects to be identified quickly and easily, thereby providing an object-based view of
progress.

Last updated: 04-Jun-08 13:47



Build Data Audit/Balancing Processes

Challenge

Data Migration and Data Integration projects are often challenged to verify that the data in an application is complete; more specifically, to verify that all the appropriate data was extracted from a source system and propagated to its final target. This best practice illustrates how to do this in an efficient and repeatable fashion for increased productivity and reliability. This is particularly important in businesses that are highly regulated, internally or externally, or that must comply with a host of government regulations such as Sarbanes-Oxley, BASEL II, HIPAA, the Patriot Act, and many others.

Description

The common practice for audit and balancing solutions is to produce a set of common tables that
can hold various control metrics regarding the data integration process. Ultimately, business
intelligence reports provide insight at a glance to verify that the correct data has been pulled from
the source and completely loaded to the target. Each control measure that is being tracked will
require development of a corresponding PowerCenter process to load the metrics to the Audit/
Balancing Detail table.

To drive out this type of solution execute the following tasks:

1. Work with business users to identify what audit/balancing processes are needed. Some
examples of this may be:
a. Customers – (Number of Customers or Number of Customers by Country)
b. Orders – (Qty of Units Sold or Net Sales Amount)
c. Deliveries – (Number of shipments or Qty of units shipped of Value of all shipments)
d. Accounts Receivable – (Number of Accounts Receivable Shipments or Total
Accounts Receivable Outstanding)
2. Define for each process defined in #1 which columns should be used for tracking purposes
for both the source and target system.
3. Develop a data integration process that will read from the source system and populate the
detail audit/balancing table with the control totals.
4. Develop a data integration process that will read from the target system and populate the
detail audit/balancing table with the control totals.
5. Develop a reporting mechanism that will query the audit/balancing table and identify whether the source and target entries match or whether there is a discrepancy.

An example audit/balancing table definition looks like this:

Audit/Balancing Details



Column Name          Data Type     Size
AUDIT_KEY            NUMBER        10
CONTROL_AREA         VARCHAR2      50
CONTROL_SUB_AREA     VARCHAR2      50
CONTROL_COUNT_1      NUMBER        10
CONTROL_COUNT_2      NUMBER        10
CONTROL_COUNT_3      NUMBER        10
CONTROL_COUNT_4      NUMBER        10
CONTROL_COUNT_5      NUMBER        10
CONTROL_SUM_1        NUMBER (p,s)  10,2
CONTROL_SUM_2        NUMBER (p,s)  10,2
CONTROL_SUM_3        NUMBER (p,s)  10,2
CONTROL_SUM_4        NUMBER (p,s)  10,2
CONTROL_SUM_5        NUMBER (p,s)  10,2
UPDATE_TIMESTAMP     TIMESTAMP
UPDATE_PROCESS       VARCHAR2      50

Control Column Definition by Control Area/Control Sub Area

Column Name          Data Type  Size
CONTROL_AREA         VARCHAR2   50
CONTROL_SUB_AREA     VARCHAR2   50
CONTROL_COUNT_1      VARCHAR2   50
CONTROL_COUNT_2      VARCHAR2   50
CONTROL_COUNT_3      VARCHAR2   50
CONTROL_COUNT_4      VARCHAR2   50
CONTROL_COUNT_5      VARCHAR2   50
CONTROL_SUM_1        VARCHAR2   50
CONTROL_SUM_2        VARCHAR2   50
CONTROL_SUM_3        VARCHAR2   50
CONTROL_SUM_4        VARCHAR2   50
CONTROL_SUM_5        VARCHAR2   50
UPDATE_TIMESTAMP     TIMESTAMP
UPDATE_PROCESS       VARCHAR2   50

The following is a screenshot of a single mapping that populates both the source and target control values:



The following two screenshots show how two mappings could be used to provide the same results:



Note: One key challenge is how to capture the appropriate control values from the source system
if it is continually being updated. The first example with one mapping will not work due to the
changes that occur in the time between the extraction of the data from the source and the
completion of the load to the target application. In those cases you may want to take advantage of
an aggregator transformation to collect the appropriate control totals as illustrated in this screenshot:

The following is a straw-man example of an audit/balancing report, which is the end result of this type of process:

Data Area    Leg count   TT count   Diff   Leg amt     TT amt      Diff
Customer     11000       10099      1      0
Orders       9827        9827       0      11230.21    11230.21    0
Deliveries   1298        1288       0      21294.22    21011.21    283.01

In summary, there are two big challenges in building audit/balancing processes:

1. Identifying what the control totals should be


2. Building processes that will collect the correct information at the correct granularity

There is also a set of basic tasks that can be leveraged and shared across any audit/balancing
needs. By building a common model for meeting audit/balancing needs, projects can lower the
time needed to develop these solutions and still provide risk reductions by having this type of
solution in place.



Last updated: 04-Jun-08 18:17



Data Cleansing

Challenge

Poor data quality is one of the biggest obstacles to the success of many data integration projects. A 2005
study by the Gartner Group stated that the majority of currently planned data warehouse projects will suffer
limited acceptance or fail outright. Gartner declared that the main cause of project problems was a lack of
attention to data quality.

Moreover, once in the system, poor data quality can cost organizations vast sums in lost revenues.
Defective data leads to breakdowns in the supply chain, poor business decisions, and inferior customer
relationship management. It is essential that data quality issues are tackled during any large-scale data
project to enable project success and future organizational success.

Therefore, the challenge is twofold: to cleanse project data, so that the project succeeds, and to ensure
that all data entering the organizational data stores provides for consistent and reliable decision-making.

Description

A significant portion of time in the project development process should be dedicated to data quality,
including the implementation of data cleansing processes. In a production environment, data quality
reports should be generated after each data warehouse implementation or when new source systems are
integrated into the environment. There should also be provision for rolling back if data quality testing
indicates that the data is unacceptable.

Informatica offers two application suites for tackling data quality issues: Informatica Data Explorer (IDE)
and Informatica Data Quality (IDQ). IDE focuses on data profiling, and its results can feed into the data
integration process. However, its unique strength is its metadata profiling and discovery capability. IDQ
has been developed as a data analysis, cleansing, correction, and de-duplication tool, one that provides a
complete solution for identifying and resolving all types of data quality problems and preparing data for the
consolidation and load processes.

Concepts

Following are some key concepts in the field of data quality. These data quality concepts provide a
foundation that helps to develop a clear picture of the subject data, which can improve both efficiency and
effectiveness. The list of concepts can be read as a process, leading from profiling and analysis to
consolidation.

Profiling and Analysis - whereas data profiling and data analysis are often synonymous terms, in Informatica terminology these tasks are assigned to IDE and IDQ respectively. Thus, profiling is primarily concerned with metadata discovery and definition, and IDE is ideally suited to these tasks. IDQ can discover data quality issues at a record and field level, and Velocity best practice recommends the use of IDQ for such purposes.

Note: The remaining items in this document therefore focus on IDQ usage.



Parsing - the process of extracting individual elements within the records, files, or data entry forms in
order to check the structure and content of each field and to create discrete fields devoted to specific
information types. Examples may include: name, title, company name, phone number, and SSN.

Cleansing and Standardization - refers to arranging information in a consistent manner or preferred format. Examples include the removal of dashes from phone numbers or SSNs; a brief SQL sketch of this kind of operation follows the concept list below. For more information, see the Best Practice Effective Data Standardizing Techniques.

Enhancement - refers to adding useful, but optional, information to existing data, or to completing data. Examples may include: sales volume, number of employees for a given business, and zip+4 codes.

Validation - the process of correcting data using algorithmic components and secondary reference data
sources, to check and validate information. Example: validating addresses with postal directories.

Matching and de-duplication - refers to removing, or flagging for removal, redundant or poor-quality
records where high-quality records of the same information exist. Use matching components and business
rules to identify records that may refer, for example, to the same customer. For more information, see the
Best Practice Effective Data Matching Techniques.

Consolidation - using the data sets defined during the matching process to combine all cleansed or
approved data into a single, consolidated view. Examples are building best record, master record, or
house-holding.
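
To make the cleansing and standardization concept concrete, the following is a minimal SQL sketch of the kind of rule a standardization step applies. The CUSTOMER_STAGING table and its columns are illustrative assumptions, and the sketch relies on an Oracle-style REGEXP_REPLACE function; in practice, rules like these are normally implemented in IDQ plans or PowerCenter expressions rather than hand-written SQL.

-- Strip non-digit characters from phone numbers and SSNs, and trim/uppercase names
SELECT UPPER(TRIM(customer_name))                  AS customer_name_std,
       REGEXP_REPLACE(phone_number, '[^0-9]', '')  AS phone_number_std,
       REGEXP_REPLACE(ssn, '[^0-9]', '')           AS ssn_std
FROM   customer_staging;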

Informatica Applications

The Informatica Data Quality software suite has been developed to resolve a wide range of data quality
issues, including data cleansing. The suite comprises the following elements:

● IDQ Workbench - a stand-alone desktop tool that provides a complete set of data quality functionality on a single computer (Windows only).
● IDQ Server - a set of processes that enables the deployment and management of data quality procedures and resources across a network of any size through TCP/IP.
● IDQ Integration - a plug-in component that integrates Workbench with PowerCenter, enabling PowerCenter users to embed data quality procedures defined in IDQ in their mappings.
● Data Quality Repository - IDQ stores all of its processes as XML in the Data Quality repository (MySQL). IDQ Server enables the creation and management of multiple repositories.

Using IDQ in Data Projects

IDQ can be used effectively alongside PowerCenter in data projects, to run data quality procedures in its
own applications or to provide them for addition to PowerCenter transformations.

Through its Workbench user-interface tool, IDQ tackles data quality in a modular fashion. That is,
Workbench enables you to build discrete procedures (called plans in Workbench) which contain data input
components, output components, and operational components. Plans can perform analysis, parsing,
standardization, enhancement, validation, matching, and consolidation operations on the specified data.
Plans are saved into projects that can provide a structure and sequence to your data quality endeavors.

The following figure illustrates how data quality processes can function in a project setting:



In stage 1, you analyze the quality of the project data according to several metrics, in consultation with the
business or project sponsor. This stage is performed in Workbench, which enables the creation of versatile
and easy to use dashboards to communicate data quality metrics to all interested parties.

In stage 2, you verify the target levels of quality for the business according to the data quality
measurements taken in stage 1, and in accordance with project resourcing and scheduling.

In stage 3, you use Workbench to design the data quality plans and projects to achieve the
targets. Capturing business rules and testing the plans are also covered in this stage.

In stage 4, you deploy the data quality plans. If you are using IDQ Workbench and Server, you can deploy
plans and resources to remote repositories and file systems through the user interface. If you are running
Workbench alone on remote computers, you can export your plans as XML. Stage 4 is the phase in which
data cleansing and other data quality tasks are performed on the project data.

In stage 5, you test and measure the results of the plans and compare them to the initial data quality
assessment to verify that targets have been met. If targets have not been met, this information feeds into
another iteration of data quality operations in which the plans are tuned and optimized.

In a large data project, you may find that data quality processes of varying sizes and impact are necessary
at many points in the project plan. At a high level, stages 1 and 2 ideally occur very early in the project, at
a point defined as the Manage Phase within Velocity. Stages 3 and 4 typically occur during the Design
Phase of Velocity. Stage 5 can occur during the Design and/or Build Phase of Velocity, depending on the
level of unit testing required.

Using the IDQ Integration



Data Quality Integration is a plug-in component that enables PowerCenter to connect to the Data Quality
repository and import data quality plans to a PowerCenter transformation. With the Integration component,
you can apply IDQ plans to your data without necessarily interacting with or being aware of IDQ
Workbench or Server.

The Integration interacts with PowerCenter in two ways:

● On the PowerCenter client side, it enables you to browse the Data Quality repository and add
data quality plans to custom transformations. The data quality plans’ functional details are saved
as XML in the PowerCenter repository.
● On the PowerCenter server side, it enables the PowerCenter Server (or Integration service) to
send data quality plan XML to the Data Quality engine for execution.

The Integration requires that at least the following IDQ components are available to PowerCenter:

● Client side: PowerCenter needs to access a Data Quality repository from which to import plans.
● Server side: PowerCenter needs an instance of the Data Quality engine to execute the plan
instructions.

An IDQ-trained consultant can build the data quality plans, or you can use the pre-built plans provided by
Informatica. Currently, Informatica provides a set of plans dedicated to cleansing and de-duplicating North
American name and postal address records.

The Integration component enables the following process:

● Data quality plans are built in Data Quality Workbench and saved from there to the Data Quality
repository.
● The PowerCenter Designer user opens a Data Quality Integration transformation and configures it to read from the Data Quality repository. Next, the user selects a plan from the Data Quality repository and adds it to the transformation.
● The PowerCenter Designer user saves the transformation and the mapping containing it to the
PowerCenter repository. The plan information is saved with the transformation as XML.

The PowerCenter Integration service can then run a workflow containing the saved mapping. The relevant
source data and plan information will be sent to the Data Quality engine, which processes the data (in
conjunction with any reference data files used by the plan) and returns the results to PowerCenter.

Last updated: 06-Feb-07 12:43



Data Profiling

Challenge

Data profiling is an option in PowerCenter version 7.0 and later that leverages existing PowerCenter functionality and a data
profiling GUI front-end to provide a wizard-driven approach to creating data profiling mappings, sessions, and workflows. This
Best Practice is intended to provide new users with an introduction to its usage.

Bear in mind that Informatica’s Data Quality (IDQ) applications also provide data profiling capabilities. Consult the following
Velocity Best Practice documents for more information:

● Data Cleansing
● Using Data Explorer for Data Discovery and Analysis

Description
Creating a Custom or Auto Profile

The data profiling option provides visibility into the data contained in source systems and enables users to measure changes
in the source data over time. This information can help to improve the quality of the source data.

An auto profile is particularly valuable when you are data profiling a source for the first time, since auto profiling offers a good
overall perspective of a source. It provides a row count, candidate key evaluation, and redundancy evaluation at the source
level, and domain inference, distinct value and null value count, and min, max, and average (if numeric) at the column level.
Creating and running an auto profile is quick and helps to gain a reasonably thorough understanding of a source in a short
amount of time.

A custom data profile is useful when there is a specific question about a source, for example when you want to validate a business rule or verify that data matches a particular pattern.

Setting Up the Profile Wizard

To customize the profile wizard for your preferences:

● Open the Profile Manager and choose Tools > Options.


● If you are profiling data using a database user that is not the owner of the tables to be sourced, check the “Use
source owner name during profile mapping generation” option.
● If you are in the analysis phase of your project, choose “Always run profile interactively” since most of your data-
profiling tasks will be interactive. (In later phases of the project, uncheck this option because more permanent data
profiles are useful in these phases.)

Running and Monitoring Profiles

Profiles are run in one of two modes: interactive or batch. Choose the appropriate mode by checking or unchecking “Configure
Session” on the "Function-Level Operations” tab of the wizard.

● Use Interactive to create quick, single-use data profiles. The sessions are created with default configuration
parameters.
● For data-profiling tasks that are likely to be reused on a regular basis, create the sessions manually in Workflow
Manager and configure and schedule them appropriately.

Generating and Viewing Profile Reports

Use Profile Manager to view profile reports. Right-click on a profile and choose View Report.



For greater flexibility, you can also use Data Analyzer to view reports. Each PowerCenter client includes a Data Analyzer
schema and reports xml file. The xml files are located in the \Extensions\DataProfile\IPAReports subdirectory of the client
installation.

You can create additional metrics, attributes, and reports in Data Analyzer to meet specific business requirements. You can
also schedule Data Analyzer reports and alerts to send notifications in cases where data does not meet preset quality limits.

Sampling Techniques

Four types of sampling techniques are available with the PowerCenter data profiling option:

● No sampling - Uses all source data. Suitable for relatively small data sources.
● Automatic random sampling - PowerCenter determines the appropriate percentage to sample, then samples random rows. Suitable for larger data sources where you want a statistically significant data analysis.
● Manual random sampling - PowerCenter samples random rows of the source data based on a user-specified percentage. Use this to sample more or fewer rows than the automatic option chooses.
● Sample first N rows - Samples the number of user-selected rows. Provides a quick readout of a source (e.g., the first 200 rows).

Profile Warehouse Administration

Updating Data Profiling Repository Statistics

The Data Profiling repository contains nearly 30 tables with more than 80 indexes. To ensure that queries run optimally, be
sure to keep database statistics up to date. Run the query below as appropriate for your database type, then capture the script
that is generated and run it.

ORACLE

select 'analyze table ' || table_name || ' compute statistics;' from user_tables where table_name like 'PMDP%';

select 'analyze index ' || index_name || ' compute statistics;' from user_indexes where index_name like 'DP%';
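
Each of these queries generates one statement per Data Profiling table or index; you then run the generated script against the repository database. As a rough illustration, the Oracle output would look something like the following (the table and index names shown here are hypothetical):

-- Sample of the generated script (object names are illustrative only)
analyze table PMDP_PROFILE_RUN compute statistics;
analyze table PMDP_COLUMN_METRICS compute statistics;
analyze index DP_PROFILE_RUN_IDX1 compute statistics;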

Microsoft SQL Server

select 'update statistics ' + name from sysobjects where name like 'PMDP%'

SYBASE

select 'update statistics ' + name from sysobjects where name like 'PMDP%'

INFORMIX



select 'update statistics low for table ', tabname, ' ; ' from systables where tabname like 'PMDP%'

IBM DB2

select 'runstats on table ' || rtrim(tabschema) || '.' || tabname || ' and indexes all; ' from syscat.tables where tabname like 'PMDP%'

TERADATA

select 'collect statistics on ', tablename, ' index ', indexname from dbc.indices where tablename like 'PMDP%' and
databasename = 'database_name'

where database_name is the name of the repository database.

Purging Old Data Profiles

Use the Profile Manager to purge old profile data from the Profile Warehouse. Choose Target Warehouse>Connect and
connect to the profiling warehouse. Choose Target Warehouse>Purge to open the purging tool.

Last updated: 01-Feb-07 18:52



Data Quality Mapping Rules

Challenge

Use PowerCenter to create data quality mapping rules to enhance the usability of the
data in your system.

Description

The issue of poor data quality is one that frequently hinders the success of data
integration projects. It can produce inconsistent or faulty results and ruin the credibility
of the system with the business users.

This Best Practice focuses on techniques for use with PowerCenter and third-party or
add-on software. Comments that are specific to the use of PowerCenter are enclosed
in brackets.

Bear in mind that you can augment or supplant the data quality handling capabilities of
PowerCenter with Informatica Data Quality (IDQ), the Informatica application suite
dedicated to data quality issues. Data analysis and data enhancement processes, or
plans, defined in IDQ can deliver significant data quality improvements to your project
data. A data project that has built-in data quality steps, such as those described in the
Analyze and Design phases of Velocity, enjoys a significant advantage over a project
that has not audited and resolved issues of poor data quality. If you have added these
data quality steps to your project, you are likely to avoid the issues described below.

A description of the range of IDQ capabilities is beyond the scope of this document. For
a summary of Informatica’s data quality methodology, as embodied in IDQ, consult the
Best Practice Data Cleansing.

Common Questions to Consider

Data integration/warehousing projects often encounter general data problems that may
not merit a full-blown data quality project, but which nonetheless must be addressed.
This document discusses some methods to ensure a base level of data quality; much
of the content discusses specific strategies to use with PowerCenter.

The quality of data is important in all types of projects, whether it be data warehousing, data synchronization, or data migration. Certain questions need to be considered for all
of these projects, with the answers driven by the project’s requirements and the
business users that are being serviced. Ideally, these questions should be addressed
during the Design and Analyze Phases of the project because they can require a
significant amount of re-coding if identified later.

Some of the areas to consider are:

Text Formatting

The most common hurdle here is capitalization and trimming of spaces. Often, users
want to see data in its “raw” format without any capitalization, trimming, or formatting
applied to it. This is easily achievable as it is the default behavior, but there is danger in
taking this requirement literally since it can lead to duplicate records when some of
these fields are used to identify uniqueness and the system is combining data from
various source systems.

One solution to this issue is to create additional fields that act as a unique key to a
given table, but which are formatted in a standard way. Since the “raw” data is stored in
the table, users can still see it in this format, but the additional columns mitigate the risk
of duplication.
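
As a sketch of this approach (the table and column names are illustrative, and the same logic can be built as PowerCenter expression ports rather than SQL), standardized key columns can be derived while the raw values are left untouched:

-- Keep the raw value for display, and add a standardized column for uniqueness checks
SELECT customer_name               AS customer_name_raw,
       UPPER(TRIM(customer_name))  AS customer_name_key,
       company_name                AS company_name_raw,
       UPPER(TRIM(company_name))   AS company_name_key
FROM   customer_staging;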

Another possibility is to explain to the users that “raw” data in unique, identifying fields
is not as clean and consistent as data in a common format. In other words, push back
on this requirement.

This issue can be particularly troublesome in data migration projects where matching
the source data is a high priority. Failing to trim leading/trailing spaces from data can
often lead to mismatched results since the spaces are stored as part of the data value.
The project team must understand how spaces are handled from the source systems to
determine the amount of coding required to correct this. (When using PowerCenter and sourcing flat files, the options provided while configuring the File Properties may be sufficient.) Remember that certain RDBMS products use the data type CHAR, which
then stores the data with trailing blanks. These blanks need to be trimmed before
matching can occur. It is usually only advisable to use CHAR for 1-character flag fields.



Note that many fixed-width files use a space rather than a null as the pad character. Therefore, developers must put one space beside the text radio button, and also tell the product that the space is repeating to fill out the rest of the precision of the column. The strip trailing blanks facility then strips off any remaining spaces from the end of the data value. In PowerCenter, avoid embedding database text manipulation functions in lookup transformations: the resulting SQL override forces the developer to cache the lookup table, and on very large tables caching is not always realistic or feasible.

Datatype Conversions

It is advisable to use explicit tool functions when converting the data type of a particular
data value.

[In PowerCenter, if the TO_CHAR function is not used, an implicit conversion is performed, and 15 digits are carried forward, even when they are not needed or desired. PowerCenter can handle some conversions without function calls (these are detailed in the product documentation), but this may cause subsequent support or maintenance headaches.]

Dates

Dates can cause many problems when moving and transforming data from one place
to another because an assumption must be made that all data values are in a
designated format.

[Informatica recommends first checking a piece of data to ensure it is in the proper format before trying to convert it to a Date data type. If the check is not performed first, then a developer increases the risk of transformation errors, which can cause data to be lost.]

An example piece of code would be: IIF(IS_DATE(in_RECORD_CREATE_DT, 'YYYYMMDD'), TO_DATE(in_RECORD_CREATE_DT, 'YYYYMMDD'), NULL)

If the majority of the dates coming from a source system arrive in the same format, then
it is often wise to create a reusable expression that handles dates, so that the proper
checks are made. It is also advisable to determine if any default dates should be
defined, such as a low date or high date. These should then be used throughout the
system for consistency. However, do not fall into the trap of always using default dates
as some are meant to be NULL until the appropriate time (e.g., birth date or death date).

The NULL in the example above could be changed to one of the standard default dates
described here.

Decimal Precision

With numeric data columns, developers must determine the expected or required
precisions of the columns. (By default, to increase performance, PowerCenter treats all
numeric columns as 15 digit floating point decimals, regardless of how they are defined
in the transformations. The maximum numeric precision in PowerCenter is 28 digits.)

If it is determined that a column realistically needs a higher precision, then the Enable
Decimal Arithmetic in the Session Properties option needs to be checked. However, be
aware that enabling this option can slow performance by as much as 15 percent. The
Enable Decimal Arithmetic option must be enabled when comparing two numbers for
equality.

Trapping Poor Data Quality Techniques



The most important technique for ensuring good data quality is to prevent incorrect,
inconsistent, or incomplete data from ever reaching the target system. This goal may
be difficult to achieve in a data synchronization or data migration project, but it is very
relevant when discussing data warehousing or ODS. This section discusses techniques
that you can use to prevent bad data from reaching the system.

Checking Data for Completeness Before Loading

When requesting a data feed from an upstream system, be sure to request an audit file
or report that contains a summary of what to expect within the feed. Common requests
here are record counts or summaries of numeric data fields. If you have performed a
data quality audit, as specified in the Analyze Phase these metrics and others should
be readily available.

Assuming that the metrics can be obtained from the source system, it is advisable to
then create a pre-process step that ensures your input source matches the audit file. If
the values do not match, stop the overall process from loading into your target system.
The source system can then be alerted to verify where the problem exists in its feed.
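
A minimal sketch of such a pre-process check, assuming the supplied audit metrics are loaded into an illustrative AUDIT_CONTROL table and the feed into a STAGE_ORDERS table (both names are assumptions), might look like this; any row returned would be a reason to stop the load:

-- Compare the record count supplied by the source system with the rows actually staged
SELECT a.expected_row_count,
       s.staged_row_count
FROM   AUDIT_CONTROL a
CROSS JOIN (SELECT COUNT(*) AS staged_row_count FROM STAGE_ORDERS) s
WHERE  a.feed_name = 'ORDERS_DAILY'
AND    a.expected_row_count <> s.staged_row_count;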

Enforcing Rules During Mapping

Another method of filtering bad data is to have a set of clearly defined data rules built
into the load job. The records are then evaluated against these rules and routed to an
Error or Bad Table for further re-processing accordingly. An example of this is to check
all incoming Country Codes against a Valid Values table. If the code is not found, then
the record is flagged as an Error record and written to the Error table.
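
As an illustration of this kind of rule (the table and column names are hypothetical), the following SQL routes records whose country code is not found in the valid-values table to an error table; in PowerCenter the same logic is typically implemented with a Lookup and a Router transformation rather than SQL:

-- Route records with unknown country codes to the error table for later reprocessing
INSERT INTO ORDER_ERRORS (order_id, country_code, error_reason)
SELECT s.order_id,
       s.country_code,
       'Invalid country code'
FROM   STAGE_ORDERS s
WHERE  NOT EXISTS (SELECT 1
                   FROM   VALID_COUNTRY_CODES v
                   WHERE  v.country_code = s.country_code);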

A pitfall of this method is that you must determine what happens to the record once it
has been loaded to the Error table. If the record is pushed back to the source system to
be fixed, then a delay may occur until the record can be successfully loaded to the
target system. In fact, if the proper governance is not in place, the source system may
refuse to fix the record at all. In this case, a decision must be made to either: 1) fix the
data manually and risk not matching with the source system; or 2) relax the business
rule to allow the record to be loaded.

Oftentimes, in the absence of an enterprise data steward, it is a good idea to assign a
team member the role of data steward. It is this person’s responsibility to patrol these
tables and push back to the appropriate systems as necessary, as well as help to make
decisions about fixing or filtering bad data. A data steward should have a good
command of the metadata, and he/she should also understand the consequences to
the user community of data decisions.



Another solution applicable in cases with a small number of code values is to try to
anticipate any mistyped error codes and translate them back to the correct codes. The
cross-reference translation data can be accumulated over time. Each time an error is
corrected, both the incorrect and correct values should be put into the table and used to
correct future errors automatically.

Dimension Not Found While Loading Fact

The majority of current data warehouses are built using a dimensional model. A
dimensional model relies on the presence of dimension records existing before loading
the fact tables. This can usually be accomplished by loading the dimension tables
before loading the fact tables. However, there are some cases where a corresponding
dimension record is not present at the time of the fact load. When this occurs, consistent rules are needed to handle the situation so that data is not improperly exposed to, or hidden from, the users.

One solution is to continue to load the data to the fact table, but assign the foreign key
a value that represents Not Found or Not Available in the dimension. These keys must
also exist in the dimension tables to satisfy referential integrity, but they provide a clear
and easy way to identify records that may need to be reprocessed at a later date.

Another solution is to filter the record from processing since it may no longer be
relevant to the fact table. The team will most likely want to flag the row through the use
of either error tables or process codes so that it can be reprocessed at a later time.

A third solution is to use dynamic caches and load the dimensions when a record is not
found there, even while loading the fact table. This should be done very carefully since
it may add unwanted or junk values to the dimension table. One occasion when this
may be advisable is in cases where dimensions are simply made up of the distinct
combination values in a data set. Thus, this dimension may require a new record if a
new combination occurs.
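
The first approach (assigning a default Not Found key) can be sketched in SQL as follows, assuming an illustrative fact staging table, a customer dimension, and a reserved surrogate key of -1 that also exists in the dimension table:

-- Resolve the dimension key; fall back to the reserved 'Not Found' key (-1) when no match exists
SELECT f.order_id,
       f.order_amount,
       COALESCE(d.customer_key, -1) AS customer_key
FROM   STAGE_FACT_ORDERS f
LEFT JOIN DIM_CUSTOMER d
       ON d.customer_source_id = f.customer_source_id;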

It is imperative that all of these solutions be discussed with the users before making
any decisions since they will eventually be the ones making decisions based on the
reports.

Last updated: 01-Feb-07 18:52



Effective Data Matching Techniques

Challenge

Identifying and eliminating duplicates is a cornerstone of effective marketing efforts and customer resource
management initiatives, and it is an increasingly important driver of cost-efficient compliance with regulatory
initiatives such as KYC (Know Your Customer).

Once duplicate records are identified, you can remove them from your dataset, and better recognize key
relationships among data records (such as customer records from a common household). You can also match
records or values against reference data to ensure data accuracy and validity.

This Best Practice is targeted toward Informatica Data Quality (IDQ) users familiar with Informatica's matching
approach. It has two high-level objectives:

● To identify the key performance variables that affect the design and execution of IDQ matching plans.
● To describe plan design and plan execution actions that will optimize plan performance and results.

To optimize your data matching operations in IDQ, you must be aware of the factors that are discussed below.

Description

All too often, an organization's datasets contain duplicate data in spite of numerous attempts to cleanse the data or
prevent duplicates from occurring. In other scenarios, the datasets may lack common keys (such as customer
numbers or product ID fields) that, if present, would allow clear ‘joins’ between the datasets and improve business
knowledge.

Identifying and eliminating duplicates in datasets can serve several purposes. It enables the creation of a single
view of customers; it can help control costs associated with mailing lists by preventing multiple pieces of mail from
being sent to the same person or household; and it can assist marketing efforts by identifying households or
individuals who are heavy users of a product or service.

Data can be enriched by matching across production data and reference data sources. Business intelligence
operations can be improved by identifying links between two or more systems to provide a more complete picture
of how customers interact with a business.

IDQ’s matching capabilities can help to resolve dataset duplications and deliver business results. However, a
user’s ability to design and execute a matching plan that meets the key requirements of performance and match
quality depends on understanding the best-practice approaches described in this document.

An integrated approach to data matching involves several steps that prepare the data for matching and improve the
overall quality of the matches. The following table outlines the processes in each step.

● Profiling - Typically the first stage of the data quality process, profiling generates a picture of the data and indicates the data elements that can comprise effective group keys. It also highlights the data elements that require standardizing to improve match scores.
● Standardization - Removes noise, excess punctuation, variant spellings, and other extraneous data elements. Standardization reduces the likelihood that match quality will be affected by data elements that are not relevant to match determination.
● Grouping - A post-standardization function in which the group key fields identified in the profiling stage are used to segment data into logical groups that facilitate matching plan performance.
● Matching - The process whereby the data values in the created groups are compared against one another and record matches are identified according to user-defined criteria.
● Consolidation - The process whereby duplicate records are cleansed. It identifies the master record in a duplicate cluster and permits the creation of a new dataset or the elimination of subordinate records. Any child data associated with subordinate records is linked to the master record.

The sections below identify the key factors that affect the performance (or speed) of a matching plan and the
quality of the matches identified. They also outline the best practices that ensure that each matching plan is
implemented with the highest probability of success. (This document does not make any recommendations on
profiling, standardization or consolidation strategies. Its focus is grouping and matching.)

The following table identifies the key variables that affect matching plan performance and the quality of matches
identified.

● Group size (impact: plan performance) - The number and size of groups have a significant impact on plan execution speed.
● Group keys (impact: quality of matches) - The proper selection of group keys ensures that the maximum number of possible matches are identified in the plan.
● Hardware resources (impact: plan performance) - Processors, disk performance, and memory require consideration.
● Size of dataset(s) (impact: plan performance) - This is not a high-priority issue. However, it should be considered when designing the plan.
● Informatica Data Quality components (impact: plan performance) - The plan designer must weigh file-based versus database matching approaches when considering plan requirements.
● Time window and frequency of execution (impact: plan performance) - The time taken for a matching plan to complete execution depends on its scale. Timing requirements must be understood up-front.
● Match identification (impact: quality of matches) - The plan designer must weigh deterministic versus probabilistic approaches.

Group Size

Grouping breaks large datasets down into smaller ones to reduce the number of record-to-record comparisons
performed in the plan, which directly impacts the speed of plan execution. When matching on grouped data, a
matching plan compares the records within each group with one another. When grouping is implemented properly,
plan execution speed is increased significantly, with no meaningful effect on match quality.

The most important determinant of plan execution speed is the size of the groups to be processed — that is, the
number of data records in each group.

For example, consider a dataset of 1,000,000 records, for which a grouping strategy generates 10,000 groups. If
9,999 of these groups have an average of 50 records each, the remaining group will contain more than 500,000
records; based on this one large group, the matching plan would require 87 days to complete, processing
1,000,000 comparisons a minute! In comparison, the remaining 9,999 groups could be matched in about 12
minutes if the group sizes were evenly distributed.
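
The arithmetic behind these figures is straightforward: within a group of n records, every record is compared with every other record, so the number of comparisons is n(n-1)/2. As a rough check of the figures quoted above:

\[
\binom{500{,}000}{2} = \frac{500{,}000 \times 499{,}999}{2} \approx 1.25 \times 10^{11} \text{ comparisons}
\]
\[
\frac{1.25 \times 10^{11} \text{ comparisons}}{10^{6} \text{ comparisons per minute}} \approx 125{,}000 \text{ minutes} \approx 87 \text{ days}
\]
\[
9{,}999 \times \binom{50}{2} = 9{,}999 \times 1{,}225 \approx 1.2 \times 10^{7} \text{ comparisons} \approx 12 \text{ minutes}
\]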

Group size can also have an impact on the quality of the matches returned in the matching plan. Large groups
perform more record comparisons, so more likely matches are potentially identified. The reverse is true for small
groups. As groups get smaller, fewer comparisons are possible, and the potential for missing good matches is
increased. The goal of grouping is to optimize performance while minimizing the possibility that valid matches will
be overlooked because like records are assigned to different groups. Therefore, groups must be defined
intelligently through the use of group keys.

Group Keys

Group keys determine which records are assigned to which groups. Group key selection, therefore, has a significant effect on the success of matching operations.

Grouping splits data into logical chunks and thereby reduces the total number of comparisons performed by the
plan. The selection of group keys, based on key data fields, is critical to ensuring that relevant records are
compared against one another.

When selecting a group key, two main criteria apply:

● Candidate group keys should represent a logical separation of the data into distinct units where there is a
low probability that matches exist between records in different units. This can be determined by
profiling the data and uncovering the structure and quality of the content prior to grouping.
● Candidate group keys should also have high scores in three key areas of data quality: completeness,
conformity, and accuracy. Problems in these data areas can be improved by standardizing the data prior
to grouping.

For example, geography is a logical separation criterion when comparing name and address data. A record for a person living in Canada is unlikely to match someone living in Ireland. Thus, the country-identifier field can provide a useful group key. However, if you are working with national data (e.g., Swiss data), duplicate data may exist for an individual living in Geneva, who may also be recorded as living in Genf or Geneve. If the group key in this case is based on city name, records for Geneva, Genf, and Geneve will be written to different groups and never compared unless variant city names are standardized.
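
A minimal SQL sketch of group key derivation follows (the table and column names are illustrative; in IDQ this is normally done with grouping components against standardized fields):

-- Build a group key from the country code plus the standardized city name,
-- so that Geneva / Genf / Geneve records land in the same group after standardization
SELECT customer_id,
       country_code || '|' || UPPER(TRIM(city_standardized)) AS group_key
FROM   customer_staging;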

Size of Dataset

In matching, the size of the dataset typically does not have as significant an impact on plan performance as the
definition of the groups within the plan. However, in general terms, the larger the dataset, the more time required to
produce a matching plan — both in terms of the preparation of the data and the plan execution.

IDQ Components

All IDQ components serve specific purposes, and very little functionality is duplicated across the components.
However, there are performance implications for certain component types, combinations of components, and the
quantity of components used in the plan.

Several tests have been conducted on IDQ (version 2.11) to test source/sink combinations and various operational
components. In tests comparing file-based matching against database matching, file-based matching outperformed
database matching in UNIX and Windows environments for plans containing up to 100,000 groups. Also, matching
plans that wrote output to a CSV Sink outperformed plans with a DB Sink or Match Key Sink. Plans with a Mixed
Field Matcher component performed more slowly than plans without a Mixed Field Matcher.

Raw performance should not be the only consideration when selecting the components to use in a matching plan.
Different components serve different needs and may offer advantages in a given scenario.

Time Window

IDQ can perform millions or billions of comparison operations in a single matching plan. The time available for the
completion of a matching plan can have a significant impact on the perception that the plan is running correctly.

Knowing the time window for plan completion helps to determine the hardware configuration choices, grouping
strategy, and the IDQ components to employ.

Frequency of Execution

The frequency with which plans are executed is linked to the time window available. Matching plans may need to
be tuned to fit within the cycle in which they are run. The more frequently a matching plan is run, the more the
execution time will have to be considered.

Match Identification

The method used by IDQ to identify good matches has a significant effect on the success of the plan. Two key
methods for assessing matches are:

● deterministic matching
● probabilistic matching

Deterministic matching applies a series of checks to determine if a match can be found between two records. IDQ's fuzzy matching algorithms can be combined with this method. For example, a deterministic check may first check if the last name comparison score was greater than 85 percent. If this is true, it next checks the address. If an 80 percent match is found, it then checks the first name. If a 90 percent match is found on the first name, then the entire record is considered successfully matched.

The advantages of deterministic matching are: (1) it follows a logical path that can be easily communicated to
others, and (2) it is similar to the methods employed when manually checking for matches. The disadvantages to
this method are its rigidity and its requirement that each dependency be true. This can result in matches being
missed, or can require several different rule checks to cover all likely combinations.

Probabilistic matching takes the match scores from fuzzy matching components and assigns weights to them in
order to calculate a weighted average that indicates the degree of similarity between two pieces of information.

The advantage of probabilistic matching is that it is less rigid than deterministic matching. There are no
dependencies on certain data elements matching in order for a full match to be found. Weights assigned to
individual components can place emphasis on different fields or areas in a record. However, even if a heavily-
weighted score falls below a defined threshold, match scores from less heavily-weighted components may still
produce a match.

The disadvantages of this method are a higher degree of required tweaking on the user’s part to get the right
balance of weights in order to optimize successful matches. This can be difficult for users to understand and
communicate to one another.

Also, the cut-off mark for good matches versus bad matches can be difficult to assess. For example, a matching
plan with 95 to 100 percent success may have found all good matches, but matching plan success between 90 and
94 percent may map to only 85 percent genuine matches. Matches between 85 and 89 percent may correspond to
only 65 percent genuine matches, and so on. The following table illustrates this principle.

Close analysis of the match results is required because of the relationship between match quality and the match threshold scores assigned, since there may not be a one-to-one mapping between the plan's weighted score and the number of records that can be considered genuine matches.
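
For contrast with the deterministic sketch earlier, a probabilistic score is simply a weighted average of the individual field scores. The weights, threshold, and table name below are illustrative and would be tuned as described above:

-- Weighted match score: 50% last name, 30% address, 20% first name; the 0.88 threshold is illustrative
SELECT record_pair_id,
       (0.5 * last_name_score) + (0.3 * address_score) + (0.2 * first_name_score) AS weighted_score,
       CASE
         WHEN (0.5 * last_name_score) + (0.3 * address_score) + (0.2 * first_name_score) >= 0.88
         THEN 'MATCH'
         ELSE 'REVIEW'
       END AS match_flag
FROM   candidate_match_scores;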

Best Practice Operations

The following section outlines best practices for matching with IDQ.



Capturing Client Requirements

Capturing client requirements is key to understanding how successful and relevant your matching plans are likely
to be. As a best practice, be sure to answer the following questions, as a minimum, before designing and
implementing a matching plan:

● How large is the dataset to be matched?


● How often will the matching plans be executed?
● When will the match process need to be completed?
● Are there any other dependent processes?
● What are the rules for determining a match?
● What process is required to sign-off on the quality of match results?
● What processes exist for merging records?

Test Results

Performance tests demonstrate the following:

● IDQ has near-linear scalability in a multi-processor environment.


● Scalability in standard installations, as achieved in the allocation of matching plans to multiple processors,
will eventually level off.

Performance is the key to success in high-volume matching solutions. IDQ’s architecture supports massive
scalability by allowing large jobs to be subdivided and executed across several processors. This scalability greatly
enhances IDQ’s ability to meet the service levels required by users without sacrificing quality or requiring an overly
complex solution.

If IDQ is integrated with PowerCenter, matching scalability can be achieved using PowerCenter's partitioning
capabilities.

Managing Group Sizes

As stated earlier, group sizes have a significant effect on the speed of matching plan execution. Also, the quantity of small groups should be minimized to ensure that the greatest number of comparisons are captured. Keep the following parameters in mind when designing a grouping plan.

● Maximum group size - 5,000 records. Exception: large datasets over 2M records with uniform data; minimize the number of groups containing more than 5,000 records.
● Minimum number of single-record groups - 1,000 groups per one million record dataset.
● Optimum number of comparisons - 500,000,000 comparisons per 1 million records (+/- 20 percent).



In cases where the datasets are large, multiple group keys may be required to segment the data to ensure that
best practice guidelines are followed. Informatica Corporation can provide sample grouping plans that automate
these requirements as far as is practicable.

Group Key Identification

Identifying appropriate group keys is essential to the success of a matching plan. Ideally, any dataset that is about
to be matched has been profiled and standardized to identify candidate keys.

Group keys act as a “first pass” or high-level summary of the shape of the dataset(s). Remember that only data
records within a given group are compared with one another. Therefore, it is vital to select group keys that have
high data quality scores for completeness, conformity, consistency, and accuracy.

Group key selection depends on the type of data in the dataset, for example whether it contains name and address
data or other data types such as product codes.

Hardware Specifications

Matching is a resource-intensive operation, especially in terms of processor capability. Three key variables determine the effect of hardware on a matching plan: processor speed, disk performance, and memory.

The majority of the activity required in matching is tied to the processor. Therefore, the speed of the processor has
a significant effect on how fast a matching plan completes. Although the average computational speed for IDQ is
one million comparisons per minute, the speed can range from as low as 250,000 comparisons to 6.5 million
comparisons per minute, depending on the hardware specification, background processes running, and
components used. As a best practice, higher-specification processors (e.g., 1.5 GHz minimum) should be used for
high-volume matching plans.

Hard disk capacity and available memory can also determine how fast a plan completes. The hard disk reads and
writes data required by IDQ sources and sinks. The speed of the disk and the level of defragmentation affect how
quickly data can be read from, and written to, the hard disk. Information that cannot be stored in memory during
plan execution must be temporarily written to the hard disk. This increases the time required to retrieve information
that otherwise could be stored in memory, and also increases the load on the hard disk. A RAID drive may be
appropriate for datasets of 3 to 4 million records and a minimum of 512MB of memory should be available.

The following table is a rough guide for hardware estimates based on IDQ Runtime on Windows platforms.
Specifications for UNIX-based systems vary.

Match volumes                      Suggested hardware specification
< 1,500,000 records                1.5 GHz computer, 512MB RAM
1,500,000 to 3 million records     Multi-processor server, 1GB RAM
> 3 million records                Multi-processor server, 2GB RAM, RAID 5 hard disk

Single Processor vs. Multi-Processor

With IDQ Runtime, it is possible to run multiple processes in parallel. Matching plans, whether they are file-based
or database-based, can be split into multiple plans to take advantage of multiple processors on a server. Be aware
however, that this requires additional effort to create the groups and consolidate the match output. Also, matching
plans split across four processors do not run four times faster than a single-processor matching plan. As a result,
multi-processor matching may not significantly improve performance in every case.



Using IDQ with PowerCenter and taking advantage of PowerCenter's partitioning capabilities may also improve
throughput. This approach has the advantage that splitting plans into multiple independent plans is not typically
required.

The following table can help in estimating the execution time between a single and multi-processor match plan.

● Standardization/grouping - Single processor: depends on the operations and the size of the data set (time equals Y). Multiprocessor: single-processor time plus 20 percent (time equals Y * 1.20).
● Matching - Single processor: estimated 1 million comparisons a minute (time equals X). Multiprocessor: time for single-processor matching divided by the number of processors (NP), plus 25 percent (time equals (X / NP) * 1.25).

For example, if a single-processor plan takes one hour to group and standardize the data and eight hours to match, a four-processor match plan should require approximately one hour and 12 minutes to group and standardize and two and one half hours to match. The time difference between a single- and multi-processor plan in this case would be more than five hours (i.e., nine hours for the single-processor plan versus roughly three hours and 42 minutes for the quad-processor plan).

Deterministic vs. Probabilistic Comparisons

No best-practice research has yet been completed on which type of comparison is most effective at determining a
match. Each method has strengths and weaknesses. A 2006 article by Forrester Research stated a preference for
deterministic comparisons since they remove the burden of identifying a universal match threshold from the user.

Bear in mind that IDQ supports deterministic matching operations only. However, IDQ’s Weight Based Analyzer
component lets plan designers calculate weighted match scores for matched fields.

Database vs. File-Based Matching

File-based matching and database matching perform essentially the same operations. The major differences
between the two methods revolve around how data is stored and how the outputs can be manipulated after
matching is complete. With regards to selecting one method or the other, there are no best practice
recommendations since this is largely defined by requirements.

The following table outlines the strengths and weakness of each method:

                                       File-Based Method                    Database Method
Ease of implementation                 Easy to implement                    Requires SQL knowledge
Performance                            Fastest method                       Slower than file-based method
Space utilization                      Requires more hard-disk space        Lower hard-disk space requirement
Operating system restrictions          Possible limit to number of          None
                                       groups that can be created
Ability to control/manipulate output   Low                                  High

High-Volume Data Matching Techniques

This section discusses the challenges facing IDQ matching plan designers in optimizing their plans for speed of execution and quality of results. It highlights the key factors affecting matching performance and discusses the results of IDQ performance testing in single and multi-processor environments.

Checking for duplicate records where no clear connection exists among data elements is a resource-intensive
activity. In order to detect matching information, a record must be compared against every other record in a
dataset. For a single data source, the quantity of comparisons required to check an entire dataset increases
geometrically as the volume of data increases. A similar situation arises when matching between two datasets,
where the number of comparisons required is a multiple of the volumes of data in each dataset.

When the volume of data increases into the tens of millions, the number of comparisons required to identify
matches — and consequently, the amount of time required to check for matches — reaches impractical levels.

Approaches to High-Volume Matching

Two key factors control the time it takes to match a dataset:

● The number of comparisons required to check the data.


● The number of comparisons that can be performed per minute.

The first factor can be controlled in IDQ through grouping, which involves logically segmenting the dataset into
distinct elements, or groups, so that there is a high probability that records within a group are not duplicates of
records outside of the group. Grouping data greatly reduces the total number of required comparisons without
affecting match accuracy.

IDQ affects the number of comparisons per minute in two ways:

● Its matching components maximize the comparison activities assigned to the computer processor. This reduces the amount of disk I/O communication in the system and increases the number of comparisons per minute. Therefore, hardware with higher processor speeds has higher match throughputs.
● IDQ architecture also allows matching tasks to be broken into smaller tasks and shared across multiple
processors. The use of multiple processors to handle matching operations greatly enhances IDQ
scalability with regard to high-volume matching problems.

The following section outlines how a multi-processor matching solution can be implemented and illustrates the results obtained in Informatica Corporation testing.

Multi-Processor Matching: Solution Overview

IDQ does not automatically distribute its load across multiple processors. To scale a matching plan to take
advantage of a multi-processor environment, the plan designer must develop multiple plans for execution in parallel.

To develop this solution, the plan designer first groups the data to prevent the plan from running low-probability
comparisons. Groups are then subdivided into one or more subgroups (the number of subgroups depends on the
plan being run and the number of processors in use). Each subgroup is assigned to a discrete matching plan, and the plans are executed in parallel.

The following diagram outlines how multi-processor matching can be implemented in a database model. Source data is first grouped and then subgrouped according to the number of processors available to the job. Each subgroup of data is loaded into a separate staging area, and the discrete match plans are run in parallel against each table. Results from each plan are consolidated to generate a single match result for the original source data.

Informatica Corporation Match Plan Tests

Informatica Corporation performed match plan tests on a 2GHz Intel Xeon dual-processor server running Windows
2003 (Server edition). Two gigabytes of RAM were available. The hyper-threading ability of the Xeon processors
effectively provided four CPUs on which to run the tests.

Several tests were performed using file-based and database-based matching methods and single and multiple
processor methods. The tests were performed on one million rows of data. Grouping of the data limited the total
number of comparisons to approximately 500,000,000.

Test results using file-based and database-based methods showed near-linear scalability as the number of available processors increased. As the number of processors increased, so too did the demand on disk I/O resources. As the processor capacity began to scale upward, disk I/O in this configuration eventually limited the benefits of adding additional processor capacity. This is demonstrated in the graph below.



Execution times for multiple processors were based on the longest execution time of the jobs run in parallel. Therefore, having an even distribution of records across all processors was important to maintaining scalability. When the data was not evenly distributed, some match plans ran longer than others, and the benefits of scaling over multiple processors were not as evident.

Last updated: 26-May-08 17:52



Effective Data Standardizing Techniques

Challenge

To enable users to streamline their data cleansing and standardization processes (or plans) with Informatica Data Quality (IDQ). The intent is to shorten development timelines and ensure a consistent, methodical approach to cleansing and standardizing project data.

Description

Data cleansing refers to operations that remove non-relevant information and "noise" from the content of the data. Examples of cleansing operations include the removal of person names, "care of" information, excess character spaces, or punctuation from postal address data.

Data standardization refers to operations that modify the appearance of the data so that it takes on a more uniform structure, and that enrich the data by deriving additional details from existing content.

Cleansing and Standardization Operations

Data can be transformed into a “standard” format appropriate for its business type. This is typically
performed on complex data types such as name and address or product data. A data
standardization operation typically profiles data by type (e.g., word, number, code) and parses
data strings into discrete components. This reveals the content of the elements within the data as
well as standardizing the data itself.
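
As a minimal, hypothetical sketch of these two operations outside of IDQ (the noise terms, punctuation rules, and token types are illustrative assumptions, not IDQ's actual profiling logic), the following Python example removes simple noise and then tags each remaining token by type:

import re

NOISE = re.compile(r"\b(C/O|CARE OF)\b", re.IGNORECASE)

def cleanse(value):
    # Remove "care of" noise, punctuation, and excess spaces (simple illustrative rules).
    value = NOISE.sub(" ", value)
    value = re.sub(r"[.,]", " ", value)
    return re.sub(r"\s+", " ", value).strip()

def profile_tokens(value):
    # Tag each token by type: number, code (mixed alphanumeric), or word.
    tags = []
    for tok in value.split():
        if tok.isdigit():
            tags.append((tok, "number"))
        elif any(ch.isdigit() for ch in tok):
            tags.append((tok, "code"))
        else:
            tags.append((tok, "word"))
    return tags

print(profile_tokens(cleanse("C/O J. Smith, 12B  Main St.")))
# -> [('J', 'word'), ('Smith', 'word'), ('12B', 'code'), ('Main', 'word'), ('St', 'word')]

In IDQ, the equivalent work is performed by the cleansing and parsing components discussed in this Best Practice; the sketch is only intended to make the word/number/code profiling idea concrete.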

For best results, the Data Quality Developer should carry out these steps in consultation with a
member of the business. Often, this individual is the data steward, the person who best
understands the nature of the data within the business scenario.

● Within IDQ, the Profile Standardizer is a powerful tool for parsing unsorted data into the
correct fields. However, when using the Profile Standardizer, be aware that there is a
finite number of profiles (500) that can be contained within a cleansing plan. Users can
extend the number of profiles by using the first 500 profiles within one component and
then feeding the data overflow into a second Profile Standardizer via the Token Parser
component.

After the data is parsed and labeled, it should be evident if reference dictionaries will be needed to
further standardize the data. It may take several iterations of dictionary construction and review
before the data is standardized to an acceptable level. Once acceptable standardization has been
achieved, data quality scorecard or dashboard reporting can be introduced. For information on
dashboard reporting, see the Report Viewer chapter of the Informatica Data Quality 3.1 User
Guide.

Discovering Business Rules

At this point, the business user may discover and define business rules applicable to the data.
These rules should be documented and converted to logic that can be contained within a data
quality plan. When building a data quality plan, be sure to group related business rules together in
a single rules component whenever possible; otherwise the plan may become very difficult to read.
If there are rules that do not lend themselves easily to regular IDQ components (for example, when standardizing product data), it may be necessary to perform some custom scripting using IDQ's scripting component. This requirement may arise when a string or an element within a string needs to be treated as an array.

Standard and Third-Party Reference Data

Reference data can be a useful tool when standardizing data. Terms with variant formats or
spellings can be standardized to a single form. IDQ installs with several reference dictionary files
that cover common name and address and business terms. The illustration below shows part of a
dictionary of street address suffixes.

Common Issues when Cleansing and Standardizing Data

If the customer has expectations of a bureau-style service, it may be advisable to re-emphasize the score-carding and graded-data approach to cleansing and standardizing. This helps to ensure
that the customer develops reasonable expectations of what can be achieved with the data set
within an agreed-upon timeframe.

Standardizing Ambiguous Data

Data values can often appear ambiguous, particularly in name and address data where name,
address, and premise values can be interchangeable. For example, Hill, Park, and Church are all
common surnames. In some cases, the position of the value is important. “ST” can be a suffix for
street or a prefix for Saint, and sometimes they can both occur in the same string.

The address string “St Patrick’s Church, Main St” can reasonably be interpreted as “Saint Patrick’s
Church, Main Street.” In this case, if the delimiter is a space (thus ignoring any commas and
periods), the string has five tokens. You may need to write business rules using the IDQ Scripting
component, as you are treating the string as an array. "St" at position 1 within the string would be standardized to meaning_1, whereas "St" at position 5 would be standardized to meaning_2.
Each data value can then be compared to a discrete prefix and suffix dictionary.
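
The following Python sketch illustrates the position-based rule described above. It is not IDQ scripting syntax; the prefix and suffix dictionaries and the token handling are assumptions for illustration only.

# Position-aware standardization of "ST" using assumed prefix and suffix dictionaries.
PREFIX_DICT = {"ST": "Saint", "MT": "Mount"}
SUFFIX_DICT = {"ST": "Street", "AVE": "Avenue", "RD": "Road"}

def standardize_tokens(address):
    # Treat the string as an array of space-delimited tokens, ignoring commas and periods.
    tokens = address.replace(",", " ").replace(".", " ").split()
    out = []
    for i, tok in enumerate(tokens):
        key = tok.upper()
        if i == 0 and key in PREFIX_DICT:                   # first token: prefix meaning
            out.append(PREFIX_DICT[key])
        elif i == len(tokens) - 1 and key in SUFFIX_DICT:   # last token: suffix meaning
            out.append(SUFFIX_DICT[key])
        else:
            out.append(tok)
    return " ".join(out)

print(standardize_tokens("St Patrick's Church, Main St"))
# -> "Saint Patrick's Church Main Street"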

Conclusion

Using the data cleansing and standardization techniques described in this Best Practice can help an organization to recognize the value of incorporating IDQ into its development methodology. Because data quality is an iterative process, the business rules initially developed
may require ongoing modification, as the results produced by IDQ will be affected by the starting
condition of the data and the requirements of the business users.

When data arrives in multiple languages, it is worth creating similar IDQ plans for each country
and applying the same rules across these plans. The data would typically be staged in a database,
and the plans developed using a SQL statement as input, with a “where country_code= ‘DE’”
clause, for example. Country dictionaries are identifiable by country code to facilitate such
statements. Remember that IDQ installs with a large set of reference dictionaries and additional
dictionaries are available from Informatica.
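
As a small, hypothetical example of staging the input for each country-specific plan (the table and column names are assumptions), the per-country SQL input could be generated as follows:

# Sketch: build the staging query used as input to a country-specific plan.
def staging_query(country_code):
    # Table and column names are assumed; the country code matches the dictionary naming.
    return ("SELECT * FROM stg_customer_address "
            "WHERE country_code = '%s'" % country_code)

print(staging_query("DE"))   # input query for the plan that handles German data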

IDQ provides several components that focus on verifying and correcting the accuracy of name and
postal address data. These components leverage address reference data that originates from
national postal carriers such as the United States Postal Service. Such datasets enable IDQ to
validate an address to premise level. Please note, the reference datasets are licensed and
installed as discrete Informatica products, and thus it is important to discuss their inclusion in the
project with the business in advance so as to avoid budget and installation issues. Several types of
reference data, with differing levels of address granularity, are available from Informatica. Pricing
for the licensing of these components may vary and should be discussed with the Informatica
Account Manager.

Last updated: 01-Feb-07 18:52

Integrating Data Quality Plans with PowerCenter

Challenge

This Best Practice outlines the steps to integrate an Informatica Data Quality (IDQ) plan into a PowerCenter
mapping. This document assumes that the appropriate setup and configuration of IDQ and PowerCenter have
been completed as part of the software installation process and these steps are not included in this document.

Description
Preparing IDQ Plans for PowerCenter Integration

IDQ plans are typically developed and tested by executing them from Workbench. Plans running locally from Workbench can use any of the available IDQ Source and Sink components. This is not true for plans that are integrated into PowerCenter, as they can only use Source and Sink components that provide the "Enable Real-time processing" check box; specifically, those components are CSV Source, CSV Match Source, CSV Sink, and CSV Match Sink. In addition, the Real-time Source and Sink can be used; however, they require additional setup as each field name and length must be defined. Database sources and sinks are not allowed in PowerCenter integration.

When IDQ plans are integrated within a PowerCenter mapping, the source and sink need to be enabled by setting
the Enable Real-time processing option on them. Consider the following points when developing a plan for integration in PowerCenter.

● If the IDQ plan was developed using a database source and/or sink, you must replace them with CSV Sink/Source or CSV Match Sink/Source.
● If the IDQ plan was developed using group sink/source (or dual group sink), you must replace them with
either CSV Sink/Source or CSV Match Sink/Source depending on the functionality you are replacing.
When replacing a group sink, you must also add functionality to the PowerCenter mapping to replicate the grouping. This is done by placing a join and sort prior to the IDQ plan containing the match.
● PowerCenter only sees the input and output ports of the IDQ plan from within the PC mapping. This is
driven by the input file used for the workbench plan and the fields selected as output in the sink. If you
don’t see a field after the plan is integrated in PowerCenter, it means the field is not in the input file or not
selected as output.
● PowerCenter integration does not allow input ports to be selected as output if the IDQ transformation is
defined as a passive transformation. If the IDQ transformation is configured as active this is not an issue
as you must select all fields needed as output from the IDQ transformation within the sink transformation
of the IDQ plan. Passive and active IDQ transformations follow the general restrictions and rules for
active and passive transformations in PowerCenter.
● The delimiter of the Source and Sink must be a comma for integrated IDQ plans. Other delimiters, such as pipe, will cause an error within the PowerCenter Designer. If you encounter this error, go back to Workbench, change the delimiter to comma, save the plan, and then return to PowerCenter Designer and perform the import of the plan again.
● For reusability of IDQ plans, use generic naming conventions for the input and output ports. For example,
rather than naming a field Customer address1, customer address2, customer city, name the field
address1, address2, city, etc. Thus, if the same standardization and cleansing is needed by multiple
sources you can integrate the same IDQ plan, which will reduce development time as well as ongoing
maintenance.
● Use only necessary fields as input to each mapping plan. If you are working with an input file that has 50
fields and you only really need 10 fields for the IDQ plan, create a file that contains only the necessary
field names, save it as a comma delimited file and then point to that newly created file from the source of
the IDQ plan. This changes the input field reference to only those fields that must be visible in the
PowerCenter integration.
● Once the source and sink are converted to real-time, you cannot run the plan within Workbench, only within the PowerCenter mapping. However, you may change the check box at any time to revert to standalone processing. Be careful not to refresh the IDQ plan in the mapping within PowerCenter while real-time processing is not enabled. If you do so, the PowerCenter mapping will display an error message and will not allow that plan to be integrated until real-time processing is enabled again.

Integrating IDQ Plans into PowerCenter Mappings

After the IDQ plans are real-time enabled, they are ready to integrate into a PowerCenter mapping.

Integrating into PowerCenter requires proper installation and configuration of the IDQ/PowerCenter integration,
including:

● Making appropriate changes to environment variables (to .profile for UNIX)


● Installing IDQ on the PowerCenter server
● Running IDQ Integration and Content install on the server
● Registering IDQ plug-in via the PowerCenter Admin console

Note: The plug-in must be registered in each repository from which an IDQ transformation is to be
developed.

● Installing IDQ workbench on the workstation


● Installing IDQ Integration and Content on the workstation using the PowerCenter Designer

When all of the above steps are executed correctly, the IDQ transformation icon, shown below, is visible in the
PowerCenter repository.

To integrate an IDQ plan, open the mapping, and click on the IDQ icon. Then click in the mapping workspace to
insert the transformation into the mapping. The following dialog box appears:

Select Active or Passive, as appropriate. Typically, an active transformation is necessary only for a matching
plan. If selecting Active, IDQ plan input needs to have all input fields passed through, as typical PowerCenter
rules apply to Active and Passive transformation processing.

As the following figure illustrates, the IDQ transformation is "empty" in its initial, unconfigured state. Notice that all ports are currently blank; they will be populated upon import/integration of the IDQ plan.

Double-click on the title bar for the IDQ transformation to open it for editing.

Then select the far right tab, “Configuration”.

When first integrating an IDQ plan, the connection and repository displays are blank. Click the Connect button to
establish a connection to the appropriate IDQ repository.

In the Host Name box, specify the name of the computer on which the IDQ repository is installed. This is usually
the PowerCenter server. If the default Port Number (3306) was changed during installation, specify the correct
value. Next, click Test Connection.

Note: In some cases, if the user name has not been granted privileges on the host server, you will not be allowed to connect. The procedure for granting privileges to the IDQ (MySQL) repository is explained at the end of this document.

When the connection is established, click the down arrow to the right of the Plan Name box, and the following
dialog is displayed:

Browse to the plan you want to import, then click on the Validate button. If there is an error in the plan, a dialog
box appears. For example, if the Source and Sink have not been configured correctly, the following dialog box
appears.

If the plan is valid for PowerCenter integration, the following dialog is displayed.

After a valid plan has been configured, the PowerCenter ports (equivalent to the IDQ Source and Sink fields) are visible and can be connected just as in any other PowerCenter transformation.

Refreshing IDQ Plans for PowerCenter Integration

After Data Quality Plans are integrated in PowerCenter, changes made to the IDQ plan in Workbench are not
reflected in the PowerCenter mapping until the plan is manually refreshed in the PowerCenter mapping. When
you save an IDQ plan, it is saved in the MySQL repository. When you integrate that plan into PowerCenter, a copy
of that plan is then integrated in the PowerCenter metadata; the MySQL repository and the PowerCenter
repository do not communicate updates automatically.

The following steps detail the process for refreshing integrated IDQ plans when necessary to reflect changes made in Workbench.

● Double-click on the IDQ transformation in the PowerCenter mapping.
● Select the Configuration tab.
● Select Refresh. This reads the current version of the plan and refreshes it within PowerCenter.
● Select Apply. If any PowerCenter-specific errors were created when the plan was modified, an error dialog is displayed.
● Update input, output, and pass-through ports as necessary, then save the mapping in PowerCenter, and
test the changes.

Saving IDQ Plans to the Appropriate Repository – MySQL Permissions

Plans that are to be integrated into PowerCenter mappings must be saved to an IDQ Repository that is visible to
the PowerCenter Designer prior to integration. The usual practice is to save the plan to the IDQ repository located
on the PowerCenter server.

In order for a Workbench client to save a plan to that repository, the client machine must be granted permissions on the MySQL instance on the server. If the client machine has not been granted access, the client receives an error message when attempting to access the server repository. The person at your organization who has login rights to the server on which IDQ is installed needs to perform this task for all users who need to save or retrieve plans from the IDQ Server. This procedure is detailed below.

● Identify the IP address for any client machine that needs to be granted access.
● Login to the server on which the MySQL repository is located and login to MySQL:

mysql -u root

● For a user to connect to the IDQ server and to save and retrieve plans, enter the following command:

grant all privileges on *.* to 'admin'@'<idq_client_ip>';

● For a user to integrate an IDQ plan into PowerCenter, grant the following privilege:

grant all privileges on *.* to 'root'@'<powercenter_client_ip>';

Last updated: 20-May-08 23:18

Managing Internal and External Reference Data

Challenge

To provide guidelines for the development and management of the reference data sources that can be
used with data quality plans in Informatica Data Quality (IDQ). The goal is to ensure the smooth transition
from development to production for reference data files and the plans with which they are associated.

Description

Reference data files can be used by a plan to verify or enhance the accuracy of the data inputs to the plan.
A reference data file is a list of verified-correct terms and, where appropriate, acceptable variants on those
terms. It may be a list of employees, package measurements, or valid postal addresses — any data set
that provides an objective reference against which project data sources can be checked or corrected.
Reference files are essential to some, but not all data quality processes.

Reference data can be internal or external in origin.

Internal data is specific to a particular project or client. Such data is typically generated from internal
company information. It may be custom-built for the project.

External data has been sourced or purchased from outside the organization. External data is used when
authoritative, independently-verified data is needed to provide the desired level of data quality to a
particular aspect of the source data. Examples include the dictionary files that install with IDQ, postal
address data sets that have been verified as current and complete by a national postal carrier, such as
United States Postal Service, or company registration and identification information from an industry-
standard source such as Dun & Bradstreet.

Reference data can be stored in a file format recognizable to Informatica Data Quality or in a format that
requires intermediary (third-party) software in order to be read by Informatica applications.

Internal data files, as they are often created specifically for data quality projects, are typically saved in the
dictionary file format or as delimited text files, which are easily portable into dictionary format. Databases
can also be used as a source for internal data.

External files are more likely to remain in their original format. For example, external data may be
contained in a database or in a library whose files cannot be edited or opened on the desktop to reveal
discrete data values.

Working with Internal Data

Obtaining Reference Data

Most organizations already possess much information that can be used as reference data — for example,
employee tax numbers or customer names. These forms of data may or may not be part of the project
source data, and they may be stored in different parts of the organization.

The question arises: are internal data sources sufficiently reliable for use as reference data? Bear in mind that in some cases the reference data does not need to be 100 percent accurate. It can be good enough to compare project data against reference data and to flag inconsistencies between them, particularly in cases where both sets of data are highly unlikely to share common errors.

Saving the Data in .DIC File Format

IDQ installs with a set of reference dictionaries that have been created to handle many types of business
data. These dictionaries are created using a proprietary .DIC file name extension. DIC is abbreviated from
dictionary, and dictionary files are essentially comma delimited text files.

You can create a new dictionary in three ways:

● You can save an appropriately formatted delimited file as a .DIC file into the Dictionaries folders of
your IDQ (client or server) installation.
● You can use the Dictionary Manager within Data Quality Workbench. This method allows you to
create text and database dictionaries.
● You can write from plan files directly to a dictionary using the IDQ Report Viewer (see below).

The figure below shows a dictionary file open in IDQ Workbench and its underlying .DIC file open in a text
editor. Note that the dictionary file has at least two columns of data. The Label column contains the correct
or standardized form of each datum from the dictionary’s perspective. The Item columns contain versions
of each datum that the dictionary recognizes as identical to or coterminous with the Label entry. Therefore,
each datum in the dictionary must have at least two entries in the DIC file (see the text editor illustration
below). A dictionary can have multiple Item columns.
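
In case the illustration is not reproduced here, a hypothetical extract from such a street-suffix dictionary might look like the following in a text editor, with the Label first on each line followed by one or more Item variants (the values are examples only):

Street,Street,St,Str
Avenue,Avenue,Ave,Av
Boulevard,Boulevard,Blvd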

To edit a dictionary value, open the DIC file and make your changes. You can make changes either
through a text editor or by opening the dictionary in the Dictionary Manager.

To add a value to a dictionary, open the DIC file in Dictionary Manager, place the cursor in an empty row,
and add a Label string and at least one Item string. You can also add values in a text editor by placing the
cursor on a new line and typing Label and Item values separated by commas.

Once saved, the dictionary is ready for use in IDQ.

Note: IDQ users with database expertise can create and specify dictionaries that are linked to database
tables, and that thus can be updated dynamically when the underlying data is updated. Database
dictionaries are useful when the reference data has been originated for other purposes and is likely to
change independently of data quality. By making use of a dynamic connection, data quality plans can
always point to the current version of the reference data.

Sharing Reference Data Across the Organization

Just as you can publish or export plans from a local Data Quality repository to server repositories, so too can you copy dictionaries across the network. The File Manager within IDQ Workbench provides an Explorer-like mechanism for moving files to other machines across the network.

Bear in mind that Data Quality looks for .DIC files in pre-set locations within the IDQ installation when
running a plan. By default, Data Quality relies on dictionaries being located in the following locations:

● The Dictionaries folders installed with Workbench and Server.


● The user’s file space in the Data Quality service domain.

IDQ does not recognize a dictionary file that is not in such a location, even if you can browse to the file
when designing the data quality plan. Thus, any plan that uses a dictionary in a non-standard location will
fail.

This is most relevant when you publish or export a plan to another machine on the network. You must
ensure that copies of any dictionary files used in the local plan are available in a suitable location on the
service domain — in the user space on the server, or at a location in the server’s Dictionaries folders that
corresponds to the dictionaries’ location on Workbench — when the plan is copied to the server-side
repository.

Note: You can change the locations in which IDQ looks for plan dictionaries by editing the config.xml file.
However, this is the master configuration file for the product and you should not edit it without consulting
Informatica Support. Bear in mind that Data Quality looks only in the locations set in the config.xml file.

Version Controlling Updates and Managing Rollout from Development to Production

Plans can be version-controlled during development in Workbench and when published to a domain
repository. You can create and annotate multiple versions of a plan, and review/roll back to earlier versions
when necessary.

Dictionary files are not version-controlled by IDQ, however. You should define a process to log changes and back up your dictionaries, using version control software if possible or a manual method otherwise. If modifications are to be made to the versions of dictionary files installed by the software, it is recommended that these modifications be made to a copy of the original file, renamed or relocated as desired. This approach avoids the risk that a subsequent installation might overwrite changes.

Database reference data can also be version controlled, although this presents difficulties if the database
is very large in size. Bear in mind that third-party reference data, such as postal address data, should not
ordinarily be changed, and so the need for a versioning strategy for these files is debatable.

Working with External Data

Formatting Data into Dictionary Format

External data may or may not permit the copying of data into text format — for example, external data
contained in a database or in library files. Currently, third-party postal address validation data is provided
to Informatica users in this manner, and IDQ leverages software from the vendor to read these files. (The
third-party software has a very small footprint.) However, some software files can be amenable to data
extraction to file.

Obtaining Updates for External Reference Data

External data vendors produce regular data updates, and it’s vital to refresh your external reference data
when updates become available. The key advantage of external data — its reliability — is lost if you do not
apply the latest files from the vendor. If you obtained third-party data through Informatica, you will be kept
up to date with the latest data as it becomes available for as long as your data subscription warrants. You
can check that you possess the latest versions of third-party data by contacting your Informatica Account
Manager.

Managing Reference Updates and Rolling Out Across the Organization

If your organization has a reference data subscription, you will receive either regular data files on compact
disc or regular information on how to download data from Informatica or vendor web sites. You must
develop a strategy for distributing these updates to all parties who run plans with the external data. This
may involve installing the data on machines in a service domain.

Bear in mind that postal address data vendors update their offerings every two or three months, and that a
significant percentage of postal addresses can change in such time periods.

You should plan for the task of obtaining and distributing updates in your organization at frequent intervals.
Depending on the number of IDQ installations that must be updated, updating your organization with third-
party reference data can be a sizable task.

Strategies for Managing Internal and External Reference Data

Experience working with reference data leads to a series of best practice tips for creating and managing
reference data files.

Using Workbench to Build Dictionaries

With IDQ Workbench, you can select data fields or columns from a dataset and save them in a dictionary-
compatible format.

Let’s say you have designed a data quality plan that identifies invalid or anomalous records in a customer
database. Using IDQ, you can create an exception file of these bad records, and subsequently use this file
to create a dictionary-compatible file.

For example, let’s say you have an exception file containing suspect or invalid customer account records.
Using a very simple data quality plan, you can quickly parse the account numbers from this file to create a
new text file containing the account serial numbers only. This file effectively constitutes the labels column
of your dictionary.

By opening this file in Microsoft Excel or a comparable program and copying the contents of Column A into
Column B, and then saving the spreadsheet as a CSV file, you create a file with Label and Item1 columns.
Rename the file with a .DIC suffix and add it to the Dictionaries folder of your IDQ installation: the
dictionary is now visible to the IDQ Dictionary Manager. You now have a dictionary file of bad account
numbers that you can use in any plans checking the validity of the organization's account records.
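
The same steps can be scripted. The following Python sketch performs the equivalent of the spreadsheet copy described above; the file names and the ACCOUNT_NUMBER column are assumptions for illustration only.

import csv

# Build a .DIC file of bad account numbers from an exception file of invalid records.
with open("exception_records.csv", newline="") as src, \
     open("bad_account_numbers.dic", "w", newline="") as dic:
    writer = csv.writer(dic)
    for row in csv.DictReader(src):
        serial = row["ACCOUNT_NUMBER"].strip()
        if serial:
            writer.writerow([serial, serial])   # Label column and identical Item1 column

Placing the resulting file in a Dictionaries folder of the IDQ installation makes it visible to the Dictionary Manager, as described earlier.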

Using Report Viewer to Build Dictionaries

The IDQ Report Viewer allows you to create exception files and dictionaries on-the-fly from report data.
The figure below illustrates how you can drill-down into report data, right-click on a column, and save the
column data as a dictionary file. This file will be populated with Label and Item1 entries corresponding to
the column data.

In this case, the dictionary created is a list of serial numbers from invalid customer records (specifically,
records containing bad zip codes). The plan designer can now create plans to check customer databases
against these serial numbers. You can also append data to an existing dictionary file in this manner.

As a general rule, it is a best practice to follow the dictionary organization structure installed by the
application, adding to that structure as necessary to accommodate specialized and supplemental
dictionaries. Subsequent users are then relieved of the need to examine the config.xml file for possible
modifications, thereby lowering the risk of accidental errors during migration. When following the original
dictionary organization structure is not practical or contravenes other requirements, take care to document
the customizations.

Since external data may be obtained from third parties and may not be in file format, the most efficient way
to share its content across the organization is to locate it on the Data Quality Server machine. (Specifically,
this is the machine that hosts the Execution Service.)

Moving Dictionary Files After IDQ Plans are Built

This is a similar issue to that of sharing reference data across the organization. If you must move or
relocate your reference data files post-plan development, you have three options:

● You can reset the location to which IDQ looks by default for dictionary files.
● You can reconfigure the plan components that employ the dictionaries to point to the new
location. Depending on the complexity of the plan concerned, this can be very labor-intensive.
● If deploying plans in a batch or scheduled task, you can append the new location to the plan
execution command. You can do this by appending a parameter file to the plan execution
instructions on the command line. The parameter file is an xml file that can contain a simple
command to use one file path instead of another.

Last updated: 08-Feb-07 17:09

Real-Time Matching Using PowerCenter

Challenge

This Best Practice describes the rationale for matching in real-time along with the concepts and strategies used in
planning for and developing a real-time matching solution. It also provides step-by-step instructions on how to build this
process using Informatica’s PowerCenter and Data Quality.

The cheapest and most effective way to eliminate duplicate records from a system is to prevent them from ever being entered in the first place. Whether the data is coming from a website, an application entry, EDI feeds, messages on a queue, changes captured from a database, or other common data feeds, matching these records against the existing master data allows only the new, unique records to be added.

Benefits of preventing duplicate records include:

● Better ability to service customers, with the most accurate and complete information readily available
● Reduced risk of fraud or over-exposure
● Trusted information at the source
● Less effort in BI, data warehouse, and/or migration projects

Description

Performing effective real-time matching involves multiple puzzle pieces.

1. There is a master data set (or possibly multiple master data sets) that contains clean and unique customers, prospects, suppliers, products, and/or many other types of data.
2. To interact with the master data set, there is an incoming transaction, typically thought to be a new item. This transaction can be anything from a new customer signing up on the web to a list of new products; it is anything that is assumed to be new and intended to be added to the master.
3. There must be a process to determine whether a "new" item really is new or whether it already exists within the master data set. In a perfect world of consistent IDs, spellings, and representations of data across all companies and systems, checking for duplicates would simply be an exact lookup into the master to see if the item already exists. Unfortunately, this is not the case, and even being creative and using %LIKE% syntax does not provide thorough results. For example, comparing Bob to Robert or GRN to Green requires a more sophisticated approach.

Standardizing Data in Advance of Matching

The first prerequisite for successful matching is to cleanse and standardize the master data set. This process requires
well-defined rules for important attributes. Applying these rules to the data should result in complete, consistent,
conformant, valid data, which really means trusted data. These rules should also be reusable so they can be used with
the incoming transaction data prior to matching. The more compromises made in the quality of master data by failing to
cleanse and standardize, the more effort will need to be put into the matching logic, and the less value the organization
will derive from it. There will be many more chances of missed matches allowing duplicates to enter the system.

Once the master data is cleansed, the next step is to develop criteria for candidate selection. For efficient matching,
there is no need to compare records that are so dissimilar that they cannot meet the business rules for matching. On
the other hand, the set of candidates must be sufficiently broad to minimize the chance that similar records will not be
compared. For example, when matching consumer data on name and address, it may be sensible to limit candidate
pull records to those having the same zip code and the same first letter of the last name, because we can reason that if
those elements are different between two records, those two records will not match.

There also may be cases where multiple candidate sets are needed. This would be the case if there are multiple sets of
match rules that the two records will be compared against. Adding to the previous example, think of matching on name
and address for one set of match rules and name and phone for a second. This would require selecting records from
the master that have the same phone number and first letter of the last name.

Once the candidate selection process is resolved, the matching logic can be developed. This can consist of matching one or many elements of the input record to each candidate pulled from the master. Once the data is compared, each pair of records (one input and one candidate) has a match score or a series of match scores. Scores below a certain threshold can then be discarded, and potential matches can be output or displayed.

The full real-time match process flow includes the following steps (a simplified sketch of the pair-scoring and filtering steps follows the list):

1. The input record coming into the server


2. The server then standardizes the incoming record and retrieves candidate records from the master data source
that could match the incoming record
3. Match pairs are then generated, one for each candidate, consisting of the incoming record and the candidate
4. The match pairs then go through the matching logic resulting in a match score
5. Records with a match score below a given threshold are discarded
6. The returned result set consists of the candidates that are potential matches to the incoming record
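
The following Python sketch illustrates the pair-scoring and threshold-filtering steps (3 through 5) above. The similarity measure, field weights, and threshold are stand-in assumptions; in the actual solution, IDQ's matching components produce the match scores.

from difflib import SequenceMatcher

THRESHOLD = 0.80   # assumed business threshold

def field_score(a, b):
    # Stand-in similarity measure; IDQ's matching components would be used in practice.
    return SequenceMatcher(None, (a or "").upper(), (b or "").upper()).ratio()

def match_score(incoming, candidate):
    # Weighted combination of name and address similarity (weights are assumptions).
    name = field_score(incoming["name"], candidate["name"])
    addr = field_score(incoming["address"], candidate["address"])
    return 0.5 * name + 0.5 * addr

def potential_matches(incoming, candidates):
    scored = [(match_score(incoming, c), c) for c in candidates]   # steps 3 and 4
    return [(s, c) for s, c in scored if s >= THRESHOLD]           # step 5

The surviving candidates correspond to step 6: the result set returned to the caller.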

Developing an Effective Candidate Selection Strategy

Determining which records from the master should be compared with the incoming record is a critical decision in an
effective real-time matching system. For most organizations it is not realistic to match an incoming record to all master
records. Consider even a modest customer master data set with one million records; the amount of processing, and thus the wait in real time, would be unacceptable.

Candidate selection for real-time matching is synonymous with grouping or blocking for batch matching. The goal of
candidate selection is to select only that subset of the records from the master that are definitively related by a field, part
of a field, or combination of multiple parts/fields. The selection is done using a candidate key or group key. Ideally this
key would be constructed and stored in an indexed field within the master table(s) allowing for the quickest retrieval.
There are many instances where multiple keys are used to allow for one key to be missing or different, while another
pulls in the record as a candidate.

What specific data elements the candidate key should consist of very much depends on the scenario and the match
rules. The one common theme with candidate keys is the data elements used should have the highest levels of
completeness and validity possible. It is also best to use elements that can be verified as valid, such as a postal code
or a National ID. The table below lists multiple common matching elements and how group keys could be used around
the data.

The ideal size of the candidate record sets, for sub-second response times, should be under 300 records. For
acceptable two to three second response times, candidate record counts should be kept under 5000 records.
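
As a minimal sketch of candidate selection (the key rule, table name, and column names are assumptions for illustration, and the CANDIDATE_KEY column is assumed to be indexed), the incoming record's key can be built and used to pull the candidate set:

import sqlite3  # stands in for the master database connection

def candidate_key(zipcode, house_number, street_name):
    # Assumed rule: first 3 characters of the zip + house number + first letter of street name.
    return (zipcode or "")[:3] + (house_number or "") + (street_name or "")[:1].upper()

def fetch_candidates(conn, incoming):
    key = candidate_key(incoming["zipcode"], incoming["house_number"],
                        incoming["street_name"])
    cur = conn.execute(
        "SELECT * FROM CUSTOMER_MASTER WHERE CANDIDATE_KEY = ?", (key,))
    return cur.fetchall()

Where a second rule set (for example, name and phone) is used, a second key can be stored and queried in the same way, and the two candidate sets combined before matching.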

Step by Step Development

The following instructions further explain the steps for building a solution to real-time matching using the Informatica
suite. They involve the following applications:

● Informatica PowerCenter 8.5.1 - utilizing Web Services Hub


● Informatica Data Explorer 5.0 SP4
● Informatica Data Quality 8.5 SP1 – utilizing North American Country Pack
● SQL Server 2000

Scenario:

● A customer master file is provided with the following structure

● In this scenario, we are performing a name and address match


● Because address is part of the match, we will use the recommended address grouping strategy for our candidate key (see Table 1)
● The desire is that different applications from the business will be able to make a web service call to determine if
the data entry represents a new customer or an existing customer

Solution:

1. The first step is to analyze the customer master file. Assume that this analysis shows the postcode field is complete for all records and the majority of it is of high accuracy. Assume also that neither the first name nor the last name field is completely populated; thus the match rules must account for blank names.
2. The next step is to load the customer master file into the database. Below is a list of tasks that should be
implemented in the mapping that loads the customer master data into the database:

● Standardize and validate the address, outputting the discreet address components such as house
number, street name, street type, directional, and suite number. (Pre-built mapplet to do this; country
pack)
● Generate the candidate key field, populate that with the selected strategy (assume it is the first 3
characters of the zip, house number, and the first character of street name), and generate an index on
that field. (Expression, output of previous mapplet, hint: substr(in_ZIPCODE, 0, 3)||
in_HOUSE_NUMBER||substr(in_STREET_NAME, 0, 1))
● Standardize the phone number. (Pre-built mapplet to do this; country pack)
● Parse the name field into individual fields. Although the data structure indicates names are already
parsed into first, middle, and last, assume there are examples where the names are not properly
fielded. Also remember to output a value to handle nicknames. (Pre-built mapplet to do this; country
pack)
● Once complete, your customer master table should look something like this:

3. Now that the customer master has been loaded, a Web Service mapping must be created to handle real-time
matching. For this project, assume that the incoming record will include a full name field, address, city, state,
zip, and a phone number. All fields will be free-form text. Since we are providing the Service, we will be using a
Web Service Provider source and target. Follow these steps to build the source and target definitions.

● Within PowerCenter Designer, go to the source analyzer and select the source menu. From there
select Web Service Provider and then Create Web Service Definition.

● You will see a screen like the one below where the Service can be named and input and output ports
can be created. Since this is a matching scenario, the potential that multiple records will be returned
must be taken into account. Select the Multiple Occurring Elements checkbox for the output ports
section. Also add a match score output field to return the percentage at which the input record matches
the different potential matching records from the master.

● Both the source and target should now be present in the project folder.

4. An IDQ match plan must be built for use within the mapping. In developing a plan for real-time use, the most significant difference from a similar match plan designed for standalone IDQ is the use of a CSV source and CSV sink, both enabled for real-time processing. The source will have the _1 and _2 fields that a Group Source would supply built into it, e.g., Firstname_1 and Firstname_2. Another difference from batch matching in PowerCenter is
that the DQ transformation can be set to passive. The following steps illustrate converting the North America
Country Pack’s Individual Name and Address Match Plan from a plan built for use in a batch mapping to a plan
built for use in a real-time mapping.

● Open the DCM_NorthAmerica project and from within the Match folder make a copy of the
“Individual Name and Address Match” plan. Rename it to “RT Individual Name and Address
Match”.
● Create a new stub CSV file with only the header row. This will be used to generate a new CSV
Source within the plan. This header must use all of the input fields used by the plan before
modification. For convenience, a sample stub header is listed below. The header for the stub
file will duplicate all of the fields, with one set having a suffix of _1 and the other _2.

IN_GROUP_KEY_1,IN_FIRSTNAME_1,IN_FIRSTNAME_ALT_1,
IN_MIDNAME_1,IN_LASTNAME_1,IN_POSTNAME_1,
IN_HOUSE_NUM_1,IN_STREET_NAME_1,IN_DIRECTIONAL_1,
IN_ADDRESS2_1,IN_SUITE_NUM_1,IN_CITY_1,IN_STATE_1,
IN_POSTAL_CODE_1,IN_GROUP_KEY_2,IN_FIRSTNAME_2,
IN_FIRSTNAME_ALT_2,IN_MIDNAME_2,IN_LASTNAME_2,
IN_POSTNAME_2,IN_HOUSE_NUM_2,IN_STREET_NAME_2,
IN_DIRECTIONAL_2,IN_ADDRESS2_2,IN_CITY_2,IN_STATE_2,
IN_POSTAL_CODE_2

● Now delete the CSV Match Source from the plan and add a new CSV Source, and point it at the
new stub file.
● Because the components were originally mapped to the CSV Match Source and that was
deleted, the fields within your plan need to be reselected. As you open the different match
components and RBAs, you can see the different instances that need to be reselected as they
appear with a red diamond, as seen below.

● Also delete the CSV Match Sink and replace it with a CSV Sink. Only the match score field(s)
must be selected for output. This plan will be imported into a passive transformation.
Consequently, data can be passed around it and does not need to be carried through the
transformation. With this implementation you can output multiple match scores so it is possible
to see why two records matched or didn’t match on a field by field basis.
● Select the check box for Enable Real-time Processing in both the source and the sink and the
plan will be ready to be imported into PowerCenter.

5. The mapping will consist of:


a. The source and target previously generated
b. An IDQ transformation importing the plan just built
c. The same IDQ cleansing and standardization transformations used to load then master data (Refer to
step 2 for specifics)
d. An Expression transformation to generate the group key and build a single directional field
e. A SQL transformation to get the candidate records from the master table
f. A Filter transformation to filter out those records whose match score is below a certain threshold
g. A Sequence transformation to build a unique key for each matching record returned in the SOAP
response

● Within PowerCenter Designer, create a new mapping and drag the web service source and target
previously created into the mapping.
● Add the following country pack mapplets to standardize and validate the incoming record from the web
service:

❍ mplt_dq_p_Personal_Name_Standardization_FML
❍ mplt_dq_p_USA_Address_Validation
❍ mplt_dq_p_USA_Phone_Standardization_Validation

● Add an Expression Transformation and build the candidate key from the Address Validation mapplet
output fields. Remember to use the same logic as in the mapping that loaded the customer master.
Also within the expression, concatenate the pre and post directional field into a single directional field
for matching purposes.
● Add a SQL transformation to the mapping. The SQL transform will present a dialog box with a few
questions related to the SQL transformation. For this example select Query mode, MS SQL Server
(change as desired), and a Static connection. For details on the other options refer to the PowerCenter
help.
● Connect all necessary fields from the source qualifier, DQ mapplets, and Expression transformation to
the SQL transformation. These fields should include:

❍ XPK_n4_Envelope (This is the Web Service message key)


❍ Parsed name elements
❍ Standardized and parsed address elements, which will be used for matching.
❍ Standardized phone number

● The next step is to build the query from within the SQL transformation to select the candidate records.
Make sure that the output fields agree with the query in number, name, and type.

The output of the SQL transform will be the incoming customer record along with the candidate record.
These will be stacked records where the Input/Output fields will represent the input record and the
Output only fields will represent the Candidate record. A simple example of this is shown in the table
below where a single incoming record will be paired with two candidate records:

● Comparing the new record to the candidates is done by embedding the IDQ plan converted in step 4
into the mapping through the use of the Data Quality transformation. When this transformation is
created, select passive as the transformation type. The output of the Data Quality transformation will
be a match score. This match score will be in a float type format between 0.0 and 1.0.
● Using a Filter transformation, all records that have a match score below a certain threshold are filtered out. For this scenario, the cut-off will be 80%. (Hint: TO_FLOAT(out_match_score) >= .80)
● Any record coming out of the filter transformation is a potential match that exceeds the specified
threshold, and the record will be included in the response. Each of these records needs a new Unique
ID so the Sequence Generator transformation will be used.
● To complete the mapping, the output of the Filter and Sequence Generator transformations need to be
mapped to the target. Make sure to map the input primary key field (XPK_n4_Envelope_output) to the
primary key field of the envelope group in the target (XPK_n4_Envelope) and to the foreign key of the
response element group in the target (FK_n4_Envelope). Map the output of the Sequence Generator
to the primary key field of the response element group.
● The mapping should look like this:

6. Before testing the mapping, create a workflow.

● Using the Workflow Manager, generate a new workflow and session for this mapping using all the
defaults.
● Once created, edit the session task. On the Mapping tab select the SQL transformation and make sure
the connection type is relational. Also make sure to select the proper connection. For more advanced
tweaking and web service settings see the PowerCenter documentation.

● The final step is to expose this workflow as a Web Service. This is done by editing the workflow and selecting the enabled checkbox for Web Services. Once the Web Service is enabled, it should be configured. For the specific details of this, refer to the PowerCenter documentation, but for the purpose of this scenario:

a. Give the service the name you would like to see exposed to the outside world
b. Set the timeout to 30 seconds
c. Allow 2 concurrent runs
d. Set the workflow to be visible and runnable

7. The web service is ready for testing.

Last updated: 26-May-08 12:57

Testing Data Quality Plans

Challenge

To provide a guide for testing data quality processes or plans created using Informatica
Data Quality (IDQ) and to manage some of the unique complexities associated with
data quality plans.

Description

Testing data quality plans is an iterative process that occurs as part of the Design
Phase of Velocity. Plan testing often precedes the project’s main testing activities, as
the tested plan outputs will be used as inputs in the Build Phase. It is not necessary to
formally test the plans used in the Analyze Phase of Velocity.

The development of data quality plans typically follows a prototyping methodology of create, execute, analyze. Testing is performed as part of the third step, in order to
determine that the plans are being developed in accordance with design and project
requirements. This method of iterative testing helps support rapid identification and
resolution of bugs.

Bear in mind that data quality plans are designed to analyze and resolve data content
issues. These are not typically cut-and-dried problems, but more often represent a
continuum of data improvement issues where it is possible that every data instance is
unique and there is a target level of data quality rather than a “right or wrong answer”.
Data quality plans tend to resolve problems in terms of percentages and probabilities
that a problem is fixed. For example, the project may set a target of 95 percent
accuracy in its customer addresses. The acceptable level of inaccuracy is also likely to change over time, based upon the importance of a given data field to the underlying business process. In addition, accuracy should continuously improve as the data quality rules are applied and the existing data sets adhere to a higher standard of quality.

Common Questions in Data Quality Plan Testing

● What dataset will you use to test the plans? While the ideal situation is to
use a data set that exactly mimics the project production data, you may not
gain access to this data. If you obtain a full cloned set of the project data for
testing purposes, bear in mind that some plans (specifically some data
matching plans) can take several hours to complete. Consider testing data
matching plans overnight.
● Are the plans using reference dictionaries? Reference dictionary
management is an important factor since it is possible to make changes to a
reference dictionary independently of IDQ and without making any changes to
the plan itself. When you pass an IDQ plan as tested, you must ensure that no
additional work is carried out on any dictionaries referenced in the plan.
Moreover, you must ensure that the dictionary files reside in locations that are valid for IDQ.
● How will the plans be executed? Will they be executed on a remote IDQ
Server and/or via a scheduler? In cases like these, it’s vital to ensure that your
plan resources, including source data files and reference data files, are in valid
locations for use by the Data Quality engine. For details on the local and
remote locations to which IDQ looks for source and reference data files, refer
to the Informatica Data Quality 8.5 User Guide.
● Will the plans be integrated into a PowerCenter transformation? If so, the
plans must have real-time enabled data source and sink components.

Strategies for Testing Data Quality Plans

The best practice steps for testing plans can be grouped under two headings.

Testing to Validate Rules

1. Identify a small, representative sample of source data.


2. To determine the results to expect when the plans are run, manually process
the data based on the rules for profiling, standardization or matching that the
plans will apply.
3. Execute the plans on the test dataset and validate the plan results against the
manually-derived results.
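
A minimal sketch of step 3, assuming the expected and actual results are both available as delimited files keyed on a record identifier (the file and column names are assumptions for illustration):

import csv

# Compare plan output against manually derived expected results, keyed on a record ID.
def load(path, key="RECORD_ID"):
    with open(path, newline="") as f:
        return {row[key]: row for row in csv.DictReader(f)}

expected = load("expected_results.csv")
actual = load("plan_output.csv")

mismatches = []
for record_id, exp_row in expected.items():
    act_row = actual.get(record_id)
    if act_row is None or any(exp_row[col] != act_row.get(col) for col in exp_row):
        mismatches.append(record_id)

print("%d of %d records differ from the expected results"
      % (len(mismatches), len(expected)))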

Testing to Validate Plan Effectiveness

This process is concerned with establishing that a data enhancement plan has been
properly designed; that is, that the plan delivers the required improvements in data
quality.

This is largely a matter of comparing the business and project requirements for data
quality and establishing whether the plans are on course to deliver these. If not, the plans may need a thorough redesign, or the business and project targets may need to be
revised. In either case, discussions should be held with the key business stakeholders
to review the results of the IDQ plan and determine the appropriate course of action. In
addition, once the entire data set is processed against the business rules, there may be
other data anomalies that were unaccounted for that may require additional
modifications to the underlying business rules and IDQ plans.

Last updated: 05-Dec-07 16:02

Tuning Data Quality Plans

Challenge

This document gives an insight into the type of considerations and issues a user needs
to be aware of when making changes to data quality processes defined in Informatica
Data Quality (IDQ). In IDQ, data quality processes are called plans.

The principal focus of this best practice is to know how to tune your plans without
adversely affecting the plan logic. This best practice is not intended to replace training materials but to serve as a guide for decision-making in the areas of adding, removing, or
changing the operational components that comprise a data quality plan.

Description

You should consider the following questions prior to making changes to a data quality
plan:

● What is the purpose of changing the plan? You should consider changing a plan if you believe the plan is not optimally configured, if it is not functioning properly and there is a problem at execution time, or if it is not delivering the expected results per the plan design principles.
● Are you trained to change the plan? Data quality plans can be complex.
You should not alter a plan unless you have been trained or are highly
experienced with IDQ methodology.
● Is the plan properly documented? You should ensure all plan
documentation on the data flow and the data components are up-to-date. For
guidelines on documenting IDQ plans, see the Sample Deliverable Data
Quality Plan Design.
● Have you backed up the plan before editing? If you are using IDQ in a
client-server environment, you can create a baseline version of the plan using
IDQ version control functionality. In addition, you should copy the plan to a
new project folder (e.g., Work_Folder) in the Workbench for changing and
testing, and leave the original plan untouched during testing.
● Is the plan operating directly on production data? This applies especially
to standardization plans. When editing a plan, always work on staged data
(database or flat-file). You can later migrate the plan to the production
environment after complete and thorough testing.



You should have a clear goal whenever you plan to change an existing plan. An event
may prompt the change: for example, input data changing (in format or content), or
changes in business rules or business/project targets. You should take into account all
current change-management procedures, and the updated plans should be thoroughly
tested before production processes are updated. This includes integration and
regression testing. (See also Testing Data Quality Plans.)

Bear in mind that at a high level there are two types of data quality plans: data analysis
and data enhancement plans.

● Data analysis plans produce reports on data patterns and data quality across
the input data. The key objective in data analysis is to determine the levels of
completeness, conformity, and consistency in the dataset. In pursuing these
objectives, data analysis plans can also identify cases of missing, inaccurate
or “noisy” data.
● Data enhancement plans correct completeness, conformity, and consistency
problems; they can also identify duplicate data entries and fix accuracy issues
through the use of reference data.

Your goal in a data analysis plan is to discover the quality and usability of your data. It
is not necessarily your goal to obtain the best scores for your data. Your goal in a data
enhancement plan is to resolve the data quality issues discovered in the data analysis.

Adding Components

In general, simply adding a component to a plan is not likely to directly affect results if
no further changes are made to the plan. However, once the outputs from the new
component are integrated into existing components, the data process flow is changed
and the plan must be re-tested and results reviewed in detail before migrating the plan
into production.

Bear in mind, particularly in data analysis plans, that improved plan statistics do not
always mean that the plan is performing better. It is possible to configure a plan that
moves “beyond the point of truth” by focusing on certain data elements and excluding
others.

When added to existing plans, some components have a larger impact than others. For
example, adding a “To Upper” component to convert text into upper case may not
cause the plan results to change meaningfully, although the presentation of the output
data will change. However, adding and integrating a Rule Based Analyzer component
(designed to apply business rules) may cause a severe impact, as the rules are likely to
change the plan logic.

As well as adding a new component — that is, a new icon — to the plan, you can add a
new instance to an existing component. This can have the same effect as adding and
integrating a new component icon. To avoid overloading a plan with too many
components, it is a good practice to add multiple instances to a single component,
within reason. Good plan design suggests that instances within a single component
should be logically similar and work on the selected inputs in similar ways. The overall
name for the component should also be changed to reflect the logic of the instances
contained in the component. If you add a new instance to a component, and that
instance behaves very differently to the other instances in that component — for
example, if it acts on an unrelated set of outputs or performs an unrelated type of action
on the data — you should probably add a new component for this instance. This will
also help you keep track of your changes onscreen.

To avoid making plans over-complicated, it is often a good practice to split tasks into
multiple plans where a large amount of data quality measures need to be checked. This
makes plans and business rules easier to maintain and provides a good framework for
future development. For example, in an environment where a large number of attributes
must be evaluated against the six standard data quality criteria (i.e., completeness,
conformity, consistency, accuracy, duplication and consolidation) using one plan per
data quality criterion may be a good way to move forward. Alternatively, splitting plans
up by data entity may be advantageous. Similarly, during standardization, you can
create plans for specific function areas (e.g., address, product, or name) as opposed to
adding all standardization tasks to a single large plan.

For more information on the six standard data quality criteria, see Data Cleansing.

Removing Components

Removing a component from a plan is likely to have a major impact since, in most
cases, data flow in the plan will be broken. If you remove an integrated component,
configuration changes will be required to all components that use the outputs from the
component. The plan cannot run without these configuration changes being completed.

The only exceptions to this case are when the output(s) of the removed component are
used solely by a CSV Sink component or by a frequency component. However, in these
cases, you must note that the plan output changes since the column(s) no longer
appear in the result set.



Editing Component Configurations

Changing the configuration of a component can have an impact on the overall plan
comparable to adding or removing a component – the plan’s logic changes, and
therefore, so do the results that it produces. However, although adding or removing a
component may make a plan non-executable, changing the configuration of a
component can impact the results in more subtle ways. For example, changing the
reference dictionary used by a parsing component does not “break” a plan, but may
have a major impact on the resulting output.

Similarly, changing the name of a component instance output does not break a plan. By
default, component output names “cascade” through the other components in the plan,
so when you change an output name, all subsequent components automatically update
with the new output name. It is not necessary to change the configuration of dependent
components.

Last updated: 26-May-08 11:12



Using Data Explorer for Data Discovery and Analysis

Challenge

To understand and make full use of Informatica Data Explorer’s potential to profile and define mappings for your
project data.

Data profiling and mapping provide a firm foundation for virtually any project involving data movement, migration,
consolidation or integration, from data warehouse/data mart development, ERP migrations, and enterprise
application integration to CRM initiatives and B2B integration. These types of projects rely on an accurate
understanding of the true structure of the source data in order to correctly transform the data for a given target
database design. However, the data’s actual form rarely coincides with its documented or supposed form.

The key to success for data-related projects is to fully understand the data as it actually is, before attempting to
cleanse, transform, integrate, mine, or otherwise operate on it. Informatica Data Explorer is a key tool for this
purpose.

This Best Practice describes how to use Informatica Data Explorer (IDE) in data profiling and mapping scenarios.

Description

Data profiling and data mapping involve a combination of automated and human analyses to reveal the quality,
content and structure of project data sources. Data profiling analyzes several aspects of data structure and
content, including characteristics of each column or field, the relationships between fields, and the commonality
of data values between fields— often an indicator of redundant data.

Data Profiling

Data profiling involves the explicit analysis of source data and the comparison of observed data characteristics
against data quality standards. Data quality and integrity issues include invalid values, multiple formats within a
field, non-atomic fields (such as long address strings), duplicate entities, cryptic field names, and others. Quality
standards may either be the native rules expressed in the source data’s metadata, or an external standard (e.g.,
corporate, industry, or government) to which the source data must be mapped in order to be assessed.



Data profiling in IDE is based on two main processes:

● Inference of characteristics from the data


● Comparison of those characteristics with specified standards, as an assessment of data quality

Data mapping involves establishing relationships among data elements in various data structures or sources, in
terms of how the same information is expressed or stored in different ways in different sources. By performing
these processes early in a data project, IT organizations can preempt the “code/load/explode” syndrome, wherein
a project fails at the load stage because the data is not in the anticipated form.

Data profiling and mapping are fundamental techniques applicable to virtually any project. The following figure
summarizes and abstracts these scenarios into a single depiction of the IDE solution.

The overall process flow for the IDE Solution is as follows:



1. Data and metadata are prepared and imported into IDE.

2. IDE profiles the data, generates accurate metadata (including a normalized schema), and documents
cleansing and transformation requirements based on the source and normalized schemas.
3. The resultant metadata are exported to and managed in the IDE Repository.
4. In a derived-target scenario, the project team designs the target database by modeling the existing data
sources and then modifying the model as required to meet current business and performance
requirements. In this scenario, IDE is used to develop the normalized schema into a target database.

The normalized and target schemas are then exported to IDE’s FTM/XML tool, which documents
transformation requirements between fields in the source, normalized, and target schemas.

OR
5. In a fixed-target scenario, the design of the target database is a given (e.g., because another
organization is responsible for developing it, or because an off-the-shelf package or industry standard is
to be used). In this scenario, the schema development process is bypassed. Instead, FTM/XML is used to
map the source data fields to the corresponding fields in an externally-specified target schema, and to
document transformation requirements between fields in the normalized and target schemas. FTM is
used for SQL-based metadata structures, and FTM/XML is used to map SQL and/or XML-based
metadata structures. Externally specified targets are typical for ERP package migrations, business-to-
business integration projects, or situations where a data modeling team is independently designing the
target schema.
6. The IDE Repository is used to export or generate reports documenting the cleansing, transformation, and
loading or formatting specs developed with IDE applications.

IDE's Methods of Data Profiling

IDE employs three methods of data profiling:

Column profiling - infers metadata from the data for a column or set of columns. IDE infers both the most likely
metadata and alternate metadata which is consistent with the data.



Table Structural profiling - uses the sample data to infer relationships among the columns in a table. This
process can discover primary and foreign keys, functional dependencies, and sub-tables.

Cross-Table profiling - determines the overlap of values across a set of columns, which may come from
multiple tables.
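The following Python sketch is purely illustrative and is not part of IDE; it approximates, in a few lines, the kinds of inference that column profiling and cross-table profiling perform (fill rate, inferred type, length range, distinct counts, and value overlap). The column data in the usage example is invented.

```python
# Illustrative approximation of column-level and cross-table profiling checks.
def profile_column(values):
    """Infer basic characteristics of one column: fill rate, inferred type, lengths."""
    non_null = [v for v in values if v not in ("", None)]
    numeric = sum(1 for v in non_null if str(v).replace(".", "", 1).isdigit())
    inferred_type = "numeric" if non_null and numeric == len(non_null) else "text"
    lengths = [len(str(v)) for v in non_null] or [0]
    return {
        "fill_rate": len(non_null) / len(values) if values else 0.0,
        "inferred_type": inferred_type,
        "min_len": min(lengths),
        "max_len": max(lengths),
        "distinct": len(set(non_null)),
    }

def value_overlap(col_a, col_b):
    """Cross-table style check: share of distinct values common to two columns."""
    a, b = set(col_a) - {""}, set(col_b) - {""}
    return len(a & b) / len(a | b) if a | b else 0.0

# Example usage with made-up data
print(profile_column(["94063", "10001", "", "30301"]))
print(value_overlap(["CUST1", "CUST2"], ["CUST2", "CUST3"]))
```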



Profiling against external standards requires that the data source be mapped to the standard before being
assessed (as shown in the following figure). Note that the mapping is performed by IDE’s Fixed Target Mapping
tool (FTM). IDE can also be used in the development and application of corporate standards, making them
relevant to existing systems as well as to new systems.

Data profiling projects may involve iterative profiling and cleansing as well since data cleansing may improve the
quality of the results obtained through dependency and redundancy profiling. Note that Informatica Data Quality
should be considered as an alternative tool for data cleansing.

IDE and Fixed-Target Migration

Fixed-target migration projects involve the conversion and migration of data from one or more sources to an
externally defined or fixed target. IDE is used to profile the data and develop a normalized schema representing
the data source(s), while IDE’s Fixed Target Mapping tool (FTM) is used to map from the normalized schema to
the fixed target.

The general sequence of activities for a fixed-target migration project, as shown in the figure below, is as follows:

1. Data is prepared for IDE. Metadata is imported into IDE.


2. IDE profiles the data, generates accurate metadata (including a normalized schema), and documents
cleansing and transformation requirements based on the source and normalized schemas. The cleansing
requirements can be reviewed and modified by the Data Quality team.
3. The resultant metadata are exported to and managed by the IDE Repository.
4. FTM maps the source data fields to the corresponding fields in an externally specified target schema, and
documents transformation requirements between fields in the normalized and target schemas. Externally-
specified targets are typical for ERP migrations or projects where a data modeling team is independently
designing the target schema.
5. The IDE Repository is used to export or generate reports documenting the cleansing, transformation, and
loading or formatting specs developed with IDE and FTM.
6. The cleansing, transformation, and formatting specs can be used by the application development or Data
Quality team to cleanse the data, implement any required edits and integrity management functions, and
develop the transforms or configure an ETL product to perform the data conversion and migration.

The following screen shot shows how IDE can be used to generate a suggested normalized schema, which may
discover ‘hidden’ tables within tables.



Depending on the staging architecture used, IDE can generate the data definition language (DDL) needed to
establish several of the staging databases between the sources and target, as shown below:

Derived-Target Migration

Derived-target migration projects involve the conversion and migration of data from one or more sources to a
target database defined by the migration team. IDE is used to profile the data and develop a normalized schema
representing the data source(s), and to further develop the normalized schema into a target schema by adding
tables and/or fields, eliminating unused tables and/or fields, changing the relational structure, and/or
denormalizing the schema to enhance performance. When the target schema is developed from the normalized
schema within IDE, the product automatically maintains the mappings from the source to normalized schema,
and from the normalized to target schemas.

The figure below shows that the general sequence of activities for a derived-target migration project is as follows:



1. Data is prepared for IDE. Metadata is imported into IDE.
2. IDE is used to profile the data, generate accurate metadata (including a normalized schema), and
document cleansing and transformation requirements based on the source and normalized schemas. The
cleansing requirements can be reviewed and modified by the Data Quality team.
3. IDE is used to modify and develop the normalized schema into a target schema. This generally involves
removing obsolete or spurious data elements, incorporating new business requirements and data
elements, adapting to corporate data standards, and denormalizing to enhance performance.
4. The resultant metadata are exported to and managed by the IDE Repository.
5. FTM is used to develop and document transformation requirements between the normalized and target
schemas. The mappings between the data elements are automatically carried over from the IDE-based
schema development process.
6. The IDE Repository is used to export an XSLT document containing the transformation and the formatting
specs developed with IDE and FTM/XML.
7. The cleansing, transformation, and formatting specs are used by the application development or Data
Quality team to cleanse the data, implement any required edits and integrity management functions, and
develop the transforms or configure an ETL product to perform the data conversion and migration.

Last updated: 09-Feb-07 12:55



Working with Pre-Built Plans in Data Cleanse and Match

Challenge

To provide a set of best practices for users of the pre-built data quality processes designed for use with the Informatica Data
Cleanse and Match (DC&M) product offering.

Informatica Data Cleanse and Match is a cross-application data quality solution that installs two components to the PowerCenter
system:

● Data Cleanse and Match Workbench, the desktop application in which data quality processes - or plans - can be
designed, tested, and executed. Workbench installs with its own Data Quality repository, where plans are stored until
needed.
● Data Quality Integration, a plug-in component that integrates Informatica Data Quality and PowerCenter. The plug-in
adds a transformation to PowerCenter, called the Data Quality Integration transformation; PowerCenter Designer users
can connect to the Data Quality repository and read data quality plan information into this transformation.

Informatica Data Cleanse and Match has been developed to work with Content Packs developed by Informatica. This document
focuses on the plans that install with the North America Content Pack, which was developed in conjunction with the components
of Data Cleanse and Match. The North America Content Pack delivers data parsing, cleansing, standardization, and de-duplication
functionality to United States and Canadian name and address data through a series of pre-built data quality plans and address
reference data files.

This document focuses on the following areas:

● when to use one plan vs. another for data cleansing.
● what behavior to expect from the plans.
● how best to manage exception data.

Description

The North America Content Pack installs several plans to the Data Quality Repository:

● Plans 01-04 are designed to parse, standardize, and validate United States name and address data.
● Plans 05-07 are designed to enable single-source matching operations (identifying duplicates within a data set) or dual
source matching operations (identifying matching records between two datasets).

The processing logic for data matching is split between PowerCenter and Informatica Data Quality (IDQ) applications.

Plans 01-04: Parsing, Cleansing, and Validation

These plans provide modular solutions for name and address data. The plans can operate on highly unstructured and well-
structured data sources. The level of structure contained in a given data set determines the plan to be used.

The following diagram demonstrates how the level of structure in address data maps to the plans required to standardize and
validate an address.



In cases where the address is well structured and specific data elements (i.e., city, state, and zip) are mapped to specific fields,
only the address validation plan may be required. Where the city, state, and zip are mapped to address fields, but not specifically
labeled as such (e.g., as Address1 through Address5), a combination of the address standardization and validation plans
is required. In extreme cases, where the data is not mapped to any address columns, a combination of the general parser, address
standardization, and validation plans may be required to obtain meaning from the data.

The purpose of making the plans modular is twofold:

● It is possible to apply these plans on an individual basis to the data. There is no requirement that the plans be run in
sequence with each other. For example, the address validation plan (plan 03) can be run successfully to validate input
addresses discretely from the other plans. In fact, the Data Quality Developer will not run all seven plans consecutively on
the same dataset. Plans 01 and 02 are not designed to operate in sequence, nor are plans 06 and 07.
● Modular plans facilitate faster performance. Designing a single plan to perform all the processing tasks contained in the
seven plans, even if it were desirable from a functional point of view, would result in significant performance degradation
and extremely complex plan logic that would be difficult to modify and maintain.

01 General Parser

The General Parser plan was developed to handle highly unstructured data and to parse it into type-specific fields. For example,
consider data stored in the following format:

Field1           | Field2           | Field3           | Field4               | Field5
100 Cardinal Way | Informatica Corp | CA 94063         | info@informatica.com | Redwood City
Redwood City     | 38725            | 100 Cardinal Way | CA 94063             | info@informatica.com

While it is unusual to see data fragmented and spread across a number of fields in this way, it can and does happen. In cases such
as this, data is not stored in any specific fields. Street addresses, email addresses, company names, and dates are scattered
throughout the data. Using a combination of dictionaries and pattern recognition, the General Parser plan sorts such data into type-
specific fields of address, names, company names, Social Security Numbers, dates, telephone numbers, and email addresses,
depending on the profile of the content. As a result, the above data will be parsed into the following format:



Address1         | Address2         | Address3     | E-mail               | Date       | Company
100 Cardinal Way | CA 94063         | Redwood City | info@informatica.com |            | Informatica Corp
Redwood City     | 100 Cardinal Way | CA 94063     | info@informatica.com | 08/01/2006 |

The General Parser does not attempt to apply any structure or meaning to the data. Its purpose is to identify and sort data by
information type. As the address fields in the above example demonstrate, the contents are labeled as addresses but are not
arranged in a standard address format; they are flagged as addresses in the order in which they were encountered in the file.

The General Parser does not attempt to validate the correctness of a field. For example, the dates are accepted as valid because
they have a structure of symbols and numbers that represents a date. A value of 99/99/9999 would also be parsed as a date.

The General Parser does not attempt to handle multiple information types in a single field. For example, if a person name and
address element are contained in the same field, the General Parser would label the entire field either a name or an address - or
leave it unparsed - depending on the elements in the field it can identify first (if any).

While the General Parser does not make any assumption about the data prior to parsing, it parses based on the elements of data
that it can make sense of first. In cases where no elements of information can be labeled, the field is left in a pipe-delimited form
containing unparsed data.
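The sketch below is a simplified, standalone illustration of this kind of dictionary- and pattern-driven sorting; it is not the General Parser's implementation, and the tiny dictionary and regular expressions are assumptions made for the example only.

```python
# Rough illustration of classifying field values by information type.
import re

COMPANY_SUFFIXES = {"corp", "inc", "llc", "ltd"}        # sample dictionary entries
EMAIL = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
DATE = re.compile(r"^\d{1,2}/\d{1,2}/\d{2,4}$")
PHONE = re.compile(r"^\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}$")
HAS_DIGITS = re.compile(r"\d")                          # crude: digits suggest an address element

def classify(value):
    """Label a field value with an information type, or leave it unparsed."""
    v = value.strip()
    if EMAIL.match(v):
        return "email"
    if DATE.match(v):
        return "date"          # structural test only; 99/99/9999 would also pass
    if PHONE.match(v):
        return "phone"
    if any(tok.lower().strip(".,") in COMPANY_SUFFIXES for tok in v.split()):
        return "company"
    if HAS_DIGITS.search(v):
        return "address"
    return "unparsed"

for field in ["100 Cardinal Way", "Informatica Corp", "info@informatica.com", "99/99/9999"]:
    print(field, "->", classify(field))
```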

The effectiveness of the General Parser to recognize various information types is a function of the dictionaries used to identify that
data and the rules used to sort them. Adding or deleting dictionary entries can greatly affect the effectiveness of this plan.

Overall, the General Parser is likely to be used only in limited cases where certain types of information may be mixed together (e.g.,
telephone and email in the same contact field), or where the data has been badly managed, such as when several files of
differing structures have been merged into a single file.

02 Name Standardization

The Name Standardization plan is designed to take in person name or company name information and apply parsing and
standardization logic to it. Name Standardization follows two different tracks: one for person names and one for company names.

The plan input fields include two inputs for company names. Data entered in these fields is assumed to be valid company
names, and no additional tests are performed to validate that the data is an existing company name. Any combination of letters,
numbers, and symbols can represent a company; therefore, in the absence of an external reference data source, further tests to
validate a company name are not likely to yield usable results.

Any data entered into the company name fields is subjected to two processes. First, the company name is standardized using the
Word Manager component, standardizing any company suffixes included in the field. Second, the standardized company name is
matched against the company_names.dic dictionary, which returns the standardized Dun & Bradstreet company name, if found.

The second track for name standardization is person names standardization. While this track is dedicated to standardizing person
names, it does not necessarily assume that all data entered here is a person name. Person names in North America tend to follow
a set structure and typically do not contain company suffixes or digits. Therefore, values entered in this field that contain a company
suffix or a company name are taken out of the person name track and moved to the company name track. Additional logic is
applied to identify people whose last name is similar (or equal) to a valid company name (for example John Sears); inputs that
contain an identified first name and a company name are treated as a person name.

If the company name track inputs are already fully populated for the record in question, then any company name detected in a
person name column is moved to a field for unparsed company name output. If the name is not recognized as a company name (e.
g., by the presence of a company suffix) but contains digits, the data is parsed into the non-name data output field. Any remaining
data is accepted as being a valid person name and parsed as such.

North American person names are typically entered in one of two different styles: either in a “firstname middlename surname”
format or “surname, firstname middlename” format. Name parsing algorithms have been built using this assumption.

Name parsing occurs in two passes. The first pass applies a series of dictionaries to the name fields, attempting to parse out name
prefixes, name suffixes, firstnames, and any extraneous data (“noise”) present. Any remaining details are assumed to be middle
name or surname details. A rule is applied to the parsed details to check if the name has been parsed correctly. If not, “best guess”
parsing is applied to the field based on the possible assumed formats.

When name details have been parsed into first, last, and middle name formats, the first name is used to derive additional details
including gender and the name prefix. Finally, using all parsed and derived name elements, salutations are generated.

In cases where no clear gender can be generated from the first name, the gender field is typically left blank or indeterminate.

The salutation field is generated according to the derived gender information. This can be easily replicated outside the data quality
plan if the salutation is not immediately needed as an output from the process (assuming the gender field is an output).
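As a rough illustration of replicating salutation generation downstream, the following sketch derives a salutation from a gender indicator and a parsed surname. The function and field names are hypothetical and do not correspond to the plan's actual output ports.

```python
# Sketch of deriving a salutation outside the plan, assuming gender and parsed
# name fields are available as plan outputs. Names and formats are illustrative.
def build_salutation(gender, last_name, prefix=""):
    """Derive a greeting line from derived gender and the parsed surname."""
    if prefix:                       # honor an explicitly parsed prefix first
        return f"Dear {prefix} {last_name}"
    if gender and gender.upper().startswith("M"):
        return f"Dear Mr. {last_name}"
    if gender and gender.upper().startswith("F"):
        return f"Dear Ms. {last_name}"
    return f"Dear {last_name}"       # gender blank or indeterminate

print(build_salutation("F", "Prince"))          # Dear Ms. Prince
print(build_salutation("", "King", "Dr."))      # Dear Dr. King
```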

Depending on the data entered in the person name fields, certain companies may be treated as person names and parsed
according to person name processing rules. Likewise, some person names may be identified as companies and standardized
according to company name processing logic. This is typically a result of the dictionary content. If this is a significant problem when
working with name data, some adjustments to the dictionaries and the rule logic for the plan may be required.

Non-name data encountered in the name standardization plan may be standardized as names depending on the contents of the
fields. For example, an address datum such as “Corporate Parkway” may be standardized as a business name, as “Corporate” is
also a business suffix. Any text data that is entered in a person name field is always treated as a person or company, depending on
whether or not the field contains a recognizable company suffix in the text.

To ensure that the name standardization plan is delivering adequate results, Informatica strongly recommends pre- and post-
execution analysis of the data.

Based on the following input:

ROW ID IN NAME1
1 Steven King
2 Chris Pope Jr.
3 Shannon C. Prince
4 Dean Jones
5 Mike Judge
6 Thomas Staples
7 Eugene F. Sears
8 Roy Jones Jr.
9 Thomas Smith, Sr
10 Eddie Martin III
11 Martin Luther King, Jr.
12 Staples Corner
13 Sears Chicago
14 Robert Tyre
15 Chris News

The following outputs are produced by the Name Standardization plan:



The last entry (Chris News) is identified as a company in the current plan configuration – such results can be refined by changing
the underlying dictionary entries used to identify company and person names.

03 US Canada Standardization

This plan is designed to apply basic standardization processes to city, state/province, and zip/postal code information for United
States and Canadian postal address data. The purpose of the plan is to deliver basic standardization to address elements where
processing time is critical and one hundred percent validation is not possible due to time constraints. The plan also organizes key
search elements into discrete fields, thereby speeding up the validation process.

The plan accepts up to six generic address fields and attempts to parse out city, state/province, and zip/postal code information. All
remaining information is assumed to be address information and is absorbed into the address line 1-3 fields. Any information that
cannot be parsed into the remaining fields is merged into the non-address data field.

The plan makes a number of assumptions that may or may not suit your data:

● When parsing city, state, and zip details, the address standardization dictionaries assume that these data elements are
spelled correctly. Variation in town/city names is very limited, and in cases where punctuation differences exist or where
town names are commonly misspelled, the standardization plan may not correctly parse the information.
● Zip codes are all assumed to be five-digit. In some files, zip codes that begin with “0” may lack this first number and so
appear as four-digit codes, and these may be missed during parsing. Adding four-digit zips to the dictionary is not
recommended, as these will conflict with the “Plus 4” element of a zip code. Zip codes may also be confused with other
five-digit numbers in an address line such as street numbers.
● City names are also commonly found in street names and other address elements. For example, “United” is part of a
country (United States of America) and is also a town name in the U.S. Bear in mind that the dictionary parsing operates
from right to left across the data, so that country name and zip code fields are analyzed before city names and street
addresses. Therefore, the word “United” may be parsed and written as the town name for a given address before the
actual town name datum is reached.
● The plan appends a country code to the end of a parsed address if it can identify it as U.S. or Canadian. Therefore, there
is no need to include any country code field in the address inputs when configuring the plan.

Most of these issues can be dealt with, if necessary, by minor adjustments to the plan logic or to the dictionaries, or by adding
some pre-processing logic to a workflow prior to passing the data into the plan.
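For example, the four-digit zip issue could be handled with a small pre-processing step such as the sketch below, applied before the record reaches the plan. This is an illustrative assumption about the workflow, not part of the Content Pack.

```python
# Hypothetical pre-processing step: restore a dropped leading zero before the
# record is passed to the standardization plan. Field handling is illustrative.
def repair_zip(zip_value):
    """Left-pad numeric ZIPs that lost a leading zero back to five digits."""
    z = str(zip_value).strip()
    if z.isdigit() and len(z) == 4:          # e.g. "7652" exported from a spreadsheet
        return z.zfill(5)                    # -> "07652"
    return z

for raw in ["7652", "94063", "07652-1234"]:
    print(raw, "->", repair_zip(raw))
```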

The plan assumes that all data entered into it are valid address elements. Therefore, once city, state, and zip details have been
parsed out, the plan assumes all remaining elements are street address lines and parses them in the order they occurred as
address lines 1-3.

04 NA Address Validation

The purposes of the North America Address Validation plan are:

● To match input addresses against known valid addresses in an address database, and
● To parse, standardize, and enrich the input addresses.



Performing these operations is a resource-intensive process. Using the US Canada Standardization plan before the NA Address
Validation plan helps to improve validation plan results in cases where city, state, and zip code information are not already in
discrete fields. City, state, and zip are key search criteria for the address validation engine, and they need to be mapped into
discrete fields. Not having these fields correctly mapped prior to plan execution leads to poor results and slow execution times.

The address validation APIs store specific area information in memory and continue to use that information from one record to the
next, when applicable. Therefore, when running validation plans, it is advisable to sort address data by zip/postal code in order to
maximize the usage of data in memory.
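A simple way to apply this recommendation when the data is staged as a flat file is sketched below; the file names and the zip field name are hypothetical placeholders for whatever the staging layout actually uses.

```python
# Sketch of the sort-by-postal-code recommendation: order the batch so records
# for the same ZIP/postal code arrive consecutively, letting the validation
# engine reuse area data it has already loaded.
import csv

with open("addresses_staged.csv", newline="") as f:
    records = list(csv.DictReader(f))

if records:
    # Sort on the discrete zip/postal field produced by the standardization step
    records.sort(key=lambda r: r.get("zip", ""))

    with open("addresses_sorted.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=records[0].keys())
        writer.writeheader()
        writer.writerows(records)
```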

In cases where status codes, error codes, or invalid results are generated as plan outputs, refer to the Informatica Data Quality 3.1
User Guide for information on how to interpret them.

Plans 05-07: Pre-Match Standardization, Grouping, and Matching

These plans take advantage of PowerCenter and IDQ capabilities and are commonly used in pairs. Users will run either plans 05
and 06 or plans 05 and 07. These plans work as follows:

● 05 Match Standardization and Grouping. This plan is used to perform basic standardization and grouping operations on
the data prior to matching.
● 06 Single Source Matching. Single source matching seeks to identify duplicate records within a single data set.
● 07 Dual Source Matching. Dual source matching seeks to identify duplicate records between two datasets.

Note that the matching plans are designed for use within a PowerCenter mapping and do not deliver optimal results when executed
directly from IDQ Workbench. Note also that the Standardization and Matching plans are geared towards North American English
data. Although they work with datasets in other languages, the results may be sub-optimal.

Matching Concepts

To ensure the best possible matching results and performance, match plans usually use a pre-processing step to standardize and
group the data.

The aim of standardization here is different from that of a classic standardization plan – the intent is to ensure that different spellings,
abbreviations, etc. are as similar to each other as possible in order to return a better match set. For example, without standardization,
123 Main Rd. and 123 Main Road will obtain an imperfect match score, although they clearly refer to the same street address.
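The idea can be illustrated with the following minimal sketch, which expands a few common street abbreviations before comparison. The abbreviation dictionary shown is a tiny invented sample, not IDQ reference data.

```python
# Minimal sketch of pre-match standardization: expand common street abbreviations
# so that equivalent values compare as identically as possible.
STREET_TERMS = {"rd": "road", "rd.": "road", "st": "street", "st.": "street", "ave": "avenue"}

def standardize_for_matching(address):
    tokens = [STREET_TERMS.get(t.lower(), t.lower()) for t in address.split()]
    return " ".join(tokens)

print(standardize_for_matching("123 Main Rd."))   # 123 main road
print(standardize_for_matching("123 Main Road"))  # 123 main road -> now identical
```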

Grouping, in a matching context, means sorting input records based on identical values in one or more user-selected fields. When a
matching plan is run on grouped data, serial matching operations are performed on a group-by-group basis, so that data records
within a group are matched but records across groups are not. A well-designed grouping plan can dramatically cut plan processing
time while minimizing the likelihood of missed matches in the dataset.

Grouping performs two functions. It sorts the records in a dataset to increase matching plan performance, and it creates new data
columns to provide group key options for the matching plan. (In PowerCenter, the Sorter transformation can organize the data to
facilitate matching performance. Therefore, the main function of grouping in a PowerCenter context is to create candidate group
keys.
In both Data Quality and PowerCenter, grouping operations do not affect the source dataset itself.)

Matching on un-grouped data involves a large number of comparisons that realistically will not generate a meaningful quantity of
additional matches. For example, when looking for duplicates in a customer list, there is little value in comparing the record for John
Smith with the record for Angela Murphy as they are obviously not going to be considered as duplicate entries. The type of
grouping used depends on the type of information being matched; in general, productive fields for grouping name and address data
are location-based (e.g. city name, zip codes) or person/company based (surname and company name composites). For more
information on grouping strategies for best result/performance relationship, see the Best Practice Effective Data Matching
Techniques.
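Some rough arithmetic shows why grouping matters. The sketch below compares the number of pairwise comparisons needed for an ungrouped dataset with the number needed when a group key splits the data into many small groups; the record counts are arbitrary examples, not benchmarks.

```python
# Back-of-the-envelope sketch: pairwise comparisons grow quadratically on
# ungrouped data but only with group sizes once records are grouped.
def pair_count(n):
    return n * (n - 1) // 2

total_records = 1_000_000
ungrouped = pair_count(total_records)

# Suppose a zip-based group key splits the data into 20,000 groups of ~50 records
grouped = 20_000 * pair_count(50)

print(f"Ungrouped comparisons: {ungrouped:,}")   # 499,999,500,000
print(f"Grouped comparisons:   {grouped:,}")     # 24,500,000
```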

Plan 05 (Match Standardization and Grouping) performs cleansing and standardization operations on the data before
group keys are generated. It offers a number of grouping options. The plan generates the following group keys:



● OUT_ZIP_GROUP: first 5 digits of ZIP code
● OUT_ZIP_NAME3_GROUP: first 5 digits of ZIP code and the first 3 characters of the last name
● OUT_ZIP_NAME5_GROUP: first 5 digits of ZIP code and the first 5 characters of the last name
● OUT_ZIP_COMPANY3_GROUP: first 5 digits of ZIP code and the first 3 characters of the cleansed company name
● OUT_ZIP_COMPANY5_GROUP: first 5 digits of ZIP code and the first 5 characters of the cleansed company name

The grouping output used depends on the data contents and data volume.
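For illustration only, the sketch below shows how keys of this shape can be derived from cleansed fields. Plan 05 builds these keys internally, so the code is simply a plain-language restatement of the key formats listed above; the function and its inputs are assumptions.

```python
# Illustrative derivation of group keys of the form listed above.
def group_keys(zip_code, last_name, company):
    z = str(zip_code).strip()[:5]
    name = str(last_name).strip().upper()
    comp = str(company).strip().upper()
    return {
        "OUT_ZIP_GROUP": z,
        "OUT_ZIP_NAME3_GROUP": z + name[:3],
        "OUT_ZIP_NAME5_GROUP": z + name[:5],
        "OUT_ZIP_COMPANY3_GROUP": z + comp[:3],
        "OUT_ZIP_COMPANY5_GROUP": z + comp[:5],
    }

print(group_keys("94063-1234", "Smith", "Informatica Corp")["OUT_ZIP_NAME3_GROUP"])  # 94063SMI
```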

Plans 06 Single Source Matching and 07 Dual Source Matching

Plans 06 and 07 are set up in similar ways and assume that person name, company name, and address data inputs will be used.
However, in PowerCenter, plan 07 requires the additional input of a Source tag, typically generated by an Expression
transform upstream in the PowerCenter mapping.

A number of matching algorithms are applied to the address and name elements. To ensure the best possible result, a weight-
based component and a custom rule are applied to the outputs from the matching components. For further information on IDQ
matching components, consult the Informatica Data Quality 3.1 User Guide.

By default, the plans are configured to write as output all records that match with an 85 percent or higher degree of certainty. The
Data Quality Developer can easily adjust this figure in each plan.
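Conceptually, the threshold acts as a filter on the composite match score, as in the following sketch. The pair list and score values are invented; in practice the threshold is applied inside the matching plan itself.

```python
# Illustrative sketch of applying an 85% certainty threshold to match-pair scores.
MATCH_THRESHOLD = 0.85

candidate_pairs = [                      # (record_a, record_b, composite score) - sample data
    ("1001", "2044", 0.93),
    ("1002", "2187", 0.78),
    ("1003", "2310", 0.85),
]

matches = [(a, b, s) for a, b, s in candidate_pairs if s >= MATCH_THRESHOLD]
print(matches)   # pairs at or above 85% certainty are written as output
```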

PowerCenter Mappings

When configuring the Data Quality Integration transformation for the matching plan, the Developer must select a valid grouping field.

To ensure best matching results, the PowerCenter mapping that contains plan 05 should include a Sorter transformation that sorts
data according to the group key to be used during matching. This transformation should follow standardization and grouping
operations. Note that a single mapping can contain multiple Data Quality Integration transformations, so that the Data Quality
Developer or Data Integration Developer can add plan 05 to one Integration transformation and plan 06 or 07 to another in the
same mapping. The standardization plan requires a passive transformation, whereas the matching plan requires an active
transformation.

The developer can add a Sequence Generator transformation to the mapping to generate a unique identifier for each input record if one is not
present in the source data. (Note that a unique identifier is not required for matching processes.)

When working with the dual source matching plan, additional PowerCenter transformations are required to pre-process the data for
the Integration transformation. Expression transformations are used to label each input with a source tag of A and B respectively.
The data from the two sources is then joined together using a Union transformation, before being passed to the Integration
transformation containing the standardization and grouping plan. From here on, the mapping has the same design as the single
source version.
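The source-tagging and union steps can be pictured with the following sketch, which mimics in plain Python what the Expression and Union transformations do in the mapping. The sample records and the source_tag field name are illustrative assumptions.

```python
# Sketch of dual-source preparation: tag each source, union the two streams,
# then hand the combined set to standardization, grouping, and matching.
source_a = [{"name": "Steven King", "zip": "94063"}]     # illustrative records
source_b = [{"name": "Steve King",  "zip": "94063"}]

tagged = (
    [dict(rec, source_tag="A") for rec in source_a]      # Expression transform equivalent
    + [dict(rec, source_tag="B") for rec in source_b]    # Union transform equivalent
)

for rec in tagged:
    print(rec)
```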

Last updated: 09-Feb-07 13:18



Designing Data Integration Architectures

Challenge

Develop a sound data integration architecture that can serve as a foundation for data integration
solutions.

Description

Historically, organizations have approached the development of a "data warehouse" or "data mart"
as a departmental effort, without considering an enterprise perspective. The result has been silos
of corporate data and analysis, which very often conflict with each other in terms of both detailed
data and the business conclusions implied by it. Data integration efforts are often the cornerstone
in today's IT initiatives. Taking an enterprise-wide, architect stance in developing data
integration solutions provides many advantages, including:

● A sound architectural foundation ensures the solution can evolve and scale with the
business over time.
● Proper architecture can isolate the application component (business context) of the data
integration solution from the technology.
● Broader data integration efforts will be simplified by using a holistic, enterprise-based
approach.
● Lastly, architectures allow for reuse - reuse of skills, design objects, and knowledge.

As the evolution of data integration solutions (and the corresponding nomenclature) has
progressed, the necessity of building these solutions on a solid architectural framework has
become more and more clear. To understand why, a brief review of the history of data
integration solutions and their predecessors is warranted.

As businesses become more global, Service Oriented Architecture (SOA) becomes more of an
Information Technology standard. Having a solid architecture is paramount to the success of data
integration efforts.

Historical Perspective

Online Transaction Processing Systems (OLTPs) have always provided a very detailed,
transaction-oriented view of an organization's data. While this view was indispensable for the day-
to-day operation of a business, its ability to provide a "big picture" view of the operation, critical for
management decision-making, was severely limited. Initial attempts to address this problem took
several directions:

● Reporting directly against the production system. This approach minimized the effort
associated with developing management reports, but introduced a number of significant issues.
The nature of OLTP data is, by definition, "point-in-time"; thus, reports run at different times of
the year, month, or even the day were inconsistent with each other. Ad hoc queries against the
production database introduced uncontrolled performance issues, resulting in slow reporting
results and degradation of OLTP system performance. Finally, trending and aggregate analysis
was difficult (or impossible) with the detailed data available in the OLTP systems.

● Mirroring the production system in a reporting database. While this approach
alleviated the performance degradation of the OLTP system, it did nothing to address the
other issues noted above.
● Reporting databases . To address the fundamental issues associated with reporting
against the OLTP schema, organizations began to move toward dedicated reporting
databases. These databases were optimized for the types of queries typically run by
analysts, rather than those used by systems supporting data entry clerks or customer
service representatives. These databases may or may not have included pre-aggregated
data, and took several forms, including traditional RDBMS as well as newer technology
Online Analytical Processing (OLAP) solutions.

The initial attempts at reporting solutions were typically point solutions; they were developed
internally to provide very targeted data to a particular department within the enterprise. For
example, the Marketing department might extract sales and demographic data in order to infer
customer purchasing habits. Concurrently, the Sales department was also extracting sales data for
the purpose of awarding commissions to the sales force. Over time, these isolated silos of
information became irreconcilable, since the extracts and business rules applied to the data during
the extract process differed for the different departments.

The result of this evolution was that the Sales and Marketing departments might report completely
different sales figures to executive management, resulting in a lack of confidence in both
departments' "data marts." From a technical perspective, the uncoordinated extracts of the same
data from the source systems multiple times placed undue strain on system resources.

The solution seemed to be the "centralized" or "galactic" data warehouse. This warehouse would
be supported by a single set of periodic extracts of all relevant data into the data warehouse (or
Operational Data Store), with the data being cleansed and made consistent as part of the extract
process. The problem with this solution was its enormous complexity, typically resulting in project
failure. The scale of these failures led many organizations to abandon the concept of the enterprise
data warehouse in favor of the isolated, "stovepipe" data marts described earlier. While these
solutions still had all of the issues discussed previously, they had the clear advantage of providing
individual departments with the data they needed without the unmanageability of the enterprise
solution.

As individual departments pursued their own data and data integration needs, they not only created
data stovepipes, they also created technical islands. The approaches to populating the data marts
and performing the data integration tasks varied widely, resulting in a single enterprise evaluating,
purchasing, and being trained on multiple tools and adopting multiple methods for performing these
tasks. If, at any point, the organization did attempt to undertake an enterprise effort, it was likely to
face the daunting challenge of integrating the disparate data as well as the widely varying
technologies. To deal with these issues, organizations began developing approaches that
considered the enterprise-level requirements of a data integration solution.

Centralized Data Warehouse

The first approach to gain popularity was the centralized data warehouse. Designed to solve the
decision support needs for the entire enterprise at one time, with one effort, the data integration
process extracts the data directly from the operational systems. It transforms the data according to
the business rules and loads it into a single target database serving as the enterprise-wide data
warehouse.

Advantages

The centralized model offers a number of benefits to the overall architecture, including:

● Centralized control . Since a single project drives the entire process, there is centralized
control over everything occurring in the data warehouse. This makes it easier to manage a
production system while concurrently integrating new components of the warehouse.
● Consistent metadata . Because the warehouse environment is contained in a single
database and the metadata is stored in a single repository, the entire enterprise can be
queried whether you are looking at data from Finance, Customers, or Human Resources.



● Enterprise view . Developing the entire project at one time provides a global view of how
data from one workgroup coordinates with data from others. Since the warehouse is highly
integrated, different workgroups often share common tables such as customer, employee,
and item lists.
● High data integrity . A single, integrated data repository for the entire enterprise would
naturally avoid all data integrity issues that result from duplicate copies and versions of the
same business data.

Disadvantages

Of course, the centralized data warehouse also involves a number of drawbacks, including:

● Lengthy implementation cycle. With the complete warehouse environment developed
simultaneously, many components of the warehouse become daunting tasks, such as
analyzing all of the source systems and developing the target data model. Even minor
tasks, such as defining how to measure profit and establishing naming conventions,
snowball into major issues.
● Substantial up-front costs . Many analysts who have studied the costs of this approach
agree that this type of effort nearly always runs into the millions. While this level of
investment is often justified, the problem lies in the delay between the investment and the
delivery of value back to the business.
● Scope too broad . The centralized data warehouse requires a single database to satisfy
the needs of the entire organization. Attempts to develop an enterprise-wide warehouse
using this approach have rarely succeeded, since the goal is simply too ambitious. As a
result, this wide scope has been a strong contributor to project failure.
● Impact on the operational systems . Different tables within the warehouse often read
data from the same source tables, but manipulate it differently before loading it into the
targets. Since the centralized approach extracts data directly from the operational
systems, a source table that feeds into three different target tables is queried three times
to load the appropriate target tables in the warehouse. When combined with all the other
loads for the warehouse, this can create an unacceptable performance hit on the
operational systems.
● Potential integration challenges. A centralized data warehouse has the disadvantage of
limited scalability. As businesses change and consolidate, adding new interfaces and/or
merging a potentially disparate data source into the centralized data warehouse can be a
challenge.

Independent Data Mart

The second warehousing approach is the independent data mart, which gained popularity in 1996
when DBMS magazine ran a cover story featuring this strategy. This architecture is based on the
same principles as the centralized approach, but it scales down the scope from solving the
warehousing needs of the entire company to the needs of a single department or workgroup.

Much like the centralized data warehouse, an independent data mart extracts data directly from the
operational sources, manipulates the data according to the business rules, and loads a single
target database serving as the independent data mart. In some cases, the operational data may be
staged in an Operational Data Store (ODS) and then moved to the mart.

Advantages

The independent data mart is the logical opposite of the centralized data warehouse. The
disadvantages of the centralized approach are the strengths of the independent data mart:

● Impact on operational databases localized. Because the independent data mart is
trying to solve the DSS needs of a single department or workgroup, only the few
operational databases containing the information required need to be analyzed.
● Reduced scope of the data model . The target data modeling effort is vastly reduced
since it only needs to serve a single department or workgroup, rather than the entire
company.
● Lower up-front costs . The data mart is serving only a single department or workgroup;
thus hardware and software costs are reduced.
● Fast implementation . The project can be completed in months, not years. The process
of defining business terms and naming conventions is simplified since "players from the
same team" are working on the project.

Disadvantages



Of course, independent data marts also have some significant disadvantages:

● Lack of centralized control . Because several independent data marts are needed to
solve the decision support needs of an organization, there is no centralized control. Each
data mart or project controls itself, but there is no central control from a single location.
● Redundant data . After several data marts are in production throughout the organization,
all of the problems associated with data redundancy surface, such as inconsistent
definitions of the same data object or timing differences that make reconciliation
impossible.
● Metadata integration . Due to their independence, the opportunity to share metadata - for
example, the definition and business rules associated with the Invoice data object - is lost.
Subsequent projects must repeat the development and deployment of common data
objects.
● Manageability . The independent data marts control their own scheduling routines and
therefore store and report their metadata differently, with a negative impact on the
manageability of the data warehouse. There is no centralized scheduler to coordinate the
individual loads appropriately or metadata browser to maintain the global metadata and
share development work among related projects.

Dependent Data Marts (Federated Data Warehouses)

The third warehouse architecture is the dependent data mart approach supported by the hub-and-
spoke architecture of PowerCenter and PowerExchange. After studying more than one hundred
different warehousing projects, Informatica introduced this approach in 1998, leveraging the
benefits of the centralized data warehouse and independent data mart.

The more general term being adopted to describe this approach is the "federated data warehouse."
Industry analysts have recognized that, in many cases, there is no "one size fits all" solution.
Although the goal of true enterprise architecture, with conformed dimensions and strict standards,
is laudable, it is often impractical, particularly for early efforts. Thus, the concept of the federated
data warehouse was born. It allows for the relatively independent development of data marts, but
leverages a centralized PowerCenter repository for sharing transformations, source and target
objects, business rules, etc.

Recent literature describes the federated architecture approach as a way to get closer to the goal
of a truly centralized architecture while allowing for the practical realities of most organizations. The
centralized warehouse concept is sacrificed in favor of a more pragmatic approach, whereby the
organization can develop semi-autonomous data marts, so long as they subscribe to a common
view of the business. This common business model is the fundamental, underlying basis of the
federated architecture, since it ensures consistent use of business terms and meanings throughout
the enterprise.

With the exception of the rare case of a truly independent data mart, where no future growth is
planned or anticipated, and where no opportunities for integration with other business areas exist,
the federated data warehouse architecture provides the best framework for building a data
integration solution.

Informatica's PowerCenter and PowerExchange products provide an essential capability for
supporting the federated architecture: the shared Global Repository. When used in conjunction
with one or more Local Repositories, the Global Repository serves as a sort of "federal" governing
body, providing a common understanding of core business concepts that can be shared across the
semi-autonomous data marts. These data marts each have their own Local Repository, which
typically include a combination of purely local metadata and shared metadata by way of links to the
Global Repository.

This environment allows for relatively independent development of individual data marts, but also
supports metadata sharing without obstacles. The common business model and names described
above can be captured in metadata terms and stored in the Global Repository. The data marts use
the common business model as a basis, but extend the model by developing departmental
metadata and storing it locally.

A typical characteristic of the federated architecture is the existence of an Operational Data Store
(ODS). Although this component is optional, it can be found in many implementations that extract
data from multiple source systems and load multiple targets. The ODS was originally designed to
extract and hold operational data that would be sent to a centralized data warehouse, working as a
time-variant database to support end-user reporting directly from operational systems. A typical
ODS had to be organized by data subject area because it did not retain the data model from the
operational system.

Informatica's approach to the ODS, by contrast, has virtually no change in data model from the
operational system, so it need not be organized by subject area. The ODS does not permit direct

end-user reporting, and its refresh policies are more closely aligned with the refresh schedules of
the enterprise data marts it may be feeding. It can also perform more sophisticated consolidation
functions than a traditional ODS.

Advantages

The Federated architecture brings together the best features of the centralized data warehouse
and independent data mart:

● Room for expansion . While the architecture is designed to quickly deploy the initial data
mart, it is also easy to share project deliverables across subsequent data marts by
migrating local metadata to the Global Repository. Reuse is built in.
● Centralized control . A single platform controls the environment from development to test
to production. Mechanisms to control and monitor the data movement from operational
databases into the data integration environment are applied across the data marts, easing
the system management task.
● Consistent metadata . A Global Repository spans all the data marts, providing a
consistent view of metadata.
● Enterprise view . Viewing all the metadata from a central location also provides an
enterprise view, easing the maintenance burden for the warehouse administrators.
Business users can also access the entire environment when necessary (assuming that
security privileges are granted).
● High data integrity . Using a set of integrated metadata repositories for the entire
enterprise removes data integrity issues that result from duplicate copies of data.
● Minimized impact on operational systems . Frequently accessed source data, such as
customer, product, or invoice records, is moved into the decision support environment
once, leaving the operational systems unaffected by the number of target data marts.

Disadvantages

Disadvantages of the federated approach include:

● Data propagation . This approach moves data twice: first to the ODS, then into the individual
data mart. This requires extra database space to store the staged data as well as extra
time to move the data. However, the disadvantage can be mitigated by not saving the data
permanently in the ODS. After the warehouse is refreshed, the ODS can be truncated, or
a rolling three months of data can be saved.
● Increased development effort during initial installations . For each target table, a load
must be developed from the ODS to the target, in addition to the loads that move data
from the sources into the ODS.

Operational Data Store

Using a staging area or ODS differs from a centralized data warehouse approach since the ODS is
not organized by subject area and is not customized for viewing by end users or even for reporting.

The primary focus of the ODS is in providing a clean, consistent set of operational data for creating
and refreshing data marts. Separating out this function allows the ODS to provide more reliable
and flexible support.

Data from the various operational sources is staged in the ODS for subsequent extraction by
target systems. In the ODS, data is cleansed while remaining normalized, tables from different
databases are joined, and a refresh policy is carried out (for instance, a change/capture facility
may be used to schedule ODS refreshes).

The ODS and the data marts may reside in a single database or be distributed across several
physical databases and servers.

Characteristics of the Operational Data Store are:

● Normalized
● Detailed (not summarized)
● Integrated
● Cleansed
● Consistent

Within an enterprise data mart, the ODS can consolidate data from disparate systems in a number
of ways:

● Normalizes data where necessary (such as non-relational mainframe data), preparing it for
storage in a relational system.
● Cleans data by enforcing commonalties in dates, names and other data types that appear
across multiple systems.
● Maintains reference data to help standardize other formats; references might range from
zip codes and currency conversion rates to product-code-to-product-name translations.

The ODS may apply fundamental transformations to some database tables in order to
reconcile common definitions, but the ODS is not intended to be a transformation
processor for end-user reporting requirements.

Its role is to consolidate detailed data within common formats. This enables users to create wide
varieties of data integration reports, with confidence that those reports will be based on the same
detailed data, using common definitions and formats.

The following table compares the key differences in the three architectures:

Architecture          Centralized Data Warehouse   Independent Data Mart   Federated Data Warehouse

Centralized Control   Yes                          No                      Yes
Consistent Metadata   Yes                          No                      Yes
Cost Effective        No                           Yes                     Yes
Enterprise View       Yes                          No                      Yes
Fast Implementation   No                           Yes                     Yes
High Data Integrity   Yes                          No                      Yes
Immediate ROI         No                           Yes                     Yes
Repeatable Process    No                           Yes                     Yes

The Role of Enterprise Architecture

The federated architecture approach allows for the planning and implementation of an enterprise
architecture framework that addresses not only short-term departmental needs, but also the long-
term enterprise requirements of the business. This does not mean that the entire architectural
investment must be made in advance of any application development. However, it does mean that
development is approached within the guidelines of the framework, allowing for future growth
without significant technological change. The remainder of this chapter will focus on the process of
designing and developing a data integration solution architecture using PowerCenter as the
platform.

Fitting Into the Corporate Architecture

Very few organizations have the luxury of creating a "green field" architecture to support their
decision support needs. Rather, the architecture must fit within an existing set of corporate
guidelines regarding preferred hardware, operating systems, databases, and other software. The
Technical Architect, if not already an employee of the organization, should ensure that he/she has
a thorough understanding of the existing (and future vision of) technical infrastructure. Doing so will
eliminate the possibility of developing an elegant technical solution that will never be implemented
because it defies corporate standards.

Last updated: 29-May-08 13:30

Development FAQs

Challenge

Using the PowerCenter product suite to effectively develop, name, and document components of the data integration solution.
While the most effective use of PowerCenter depends on the specific situation, this Best Practice addresses some questions
that are commonly raised by project teams. It provides answers in a number of areas, including Logs, Scheduling, Backup
Strategies, Server Administration, Custom Transformations, and Metadata. Refer to the product guides supplied with
PowerCenter for additional information.

Description

The following pages summarize some of the questions that typically arise during development and suggest potential resolutions.

Mapping Design

Q: How does source format affect performance? (i.e., is it more efficient to source from a flat file rather than a database?)

In general, a flat file that is located on the server machine loads faster than a database located on the server machine. Fixed-
width files are faster than delimited files because delimited files require extra parsing. However, if there is an intent to perform
intricate transformations before loading to target, it may be advisable to first load the flat file into a relational database, which
allows the PowerCenter mappings to access the data in an optimized fashion by using filters, custom transformations, and
custom SQL SELECTs where appropriate.

Q: What are some considerations when designing the mapping? (i.e., what is the impact of having multiple targets populated by
a single map?)

With PowerCenter, it is possible to design a mapping with multiple targets. If each target has a separate source qualifier, you
can then load the targets in a specific order using Target Load Ordering. However, the recommendation is to limit the amount of
complex logic in a mapping. Not only is it easier to debug a mapping with a limited number of objects, but such mappings can
also be run concurrently and make use of more system resources. When using multiple output files (targets), consider writing to
multiple disks or file systems simultaneously. This minimizes disk writing contention and applies to a session writing to multiple
targets, and to multiple sessions running simultaneously.

Q: What are some considerations for determining how many objects and transformations to include in a single mapping?

The business requirement is always the first consideration, regardless of the number of objects it takes to fulfill the requirement.
Beyond this, consideration should be given to having objects that stage data at certain points to allow both easier debugging
and better understandability, as well as to create potential partition points. This should be balanced against the fact that more
objects means more overhead for the DTM process.

It should also be noted that the most expensive use of the DTM is passing unnecessary data through the mapping. It is best to
use filters as early as possible in the mapping to remove rows of data that are not needed. This is the SQL equivalent of the
WHERE clause. Using the filter condition in the Source Qualifier to filter out the rows at the database level is a good way to
increase the performance of the mapping. If this is not possible, a filter or router transformation can be used instead.

Log File Organization

Q: How does PowerCenter handle logs?

The Service Manager provides accumulated log events from each service in the domain and for sessions and workflows. To
perform the logging function, the Service Manager runs a Log Manager and a Log Agent.

The Log Manager runs on the master gateway node. It collects and processes log events for Service Manager domain
operations and application services. The log events contain operational and error messages for a domain. The Service

Manager and the application services send log events to the Log Manager. When the Log Manager receives log events, it
generates log event files, which can be viewed in the Administration Console.

The Log Agent runs on the nodes to collect and process log events for session and workflows. Log events for workflows include
information about tasks performed by the Integration Service, workflow processing, and workflow errors. Log events for
sessions include information about the tasks performed by the Integration Service, session errors, and load summary and
transformation statistics for the session. You can view log events for the last workflow run with the Log Events window in the
Workflow Monitor.

Log event files are binary files that the Administration Console Log Viewer uses to display log events. When you view log
events in the Administration Console, the Log Manager uses the log event files to display the log events for the domain or
application service. For more information, please see Chapter 16: Managing Logs in the Administrator Guide.

Q: Where can I view the logs?

Logs can be viewed in two locations: the Administration Console or the Workflow Monitor. The Administration Console displays
domain-level operational and error messages. The Workflow Monitor displays session and workflow level processing and error
messages.

Q: Where is the best place to maintain Session Logs?

One often-recommended location is a shared directory location that is accessible to the gateway node. If you have more than
one gateway node, store the logs on a shared disk. This keeps all the logs in the same directory. The location can be changed
in the Administration Console.

If you have more than one PowerCenter domain, you must configure a different directory path for each domain’s Log Manager.
Multiple domains can not use the same shared directory path.

For more information, please refer to Chapter 16: Managing Logs of the Administrator Guide.

Q: What documentation is available for the error codes that appear within the error log files?

Log file errors and descriptions appear in Chapter 39: LGS Messages of the PowerCenter Troubleshooting Guide. Error
information also appears in the PowerCenter Help File within the PowerCenter client applications. For other database-specific
errors, consult your Database User Guide.

Scheduling Techniques

Q: What are the benefits of using workflows with multiple tasks rather than a workflow with a stand-alone session?

Using a workflow to group logical sessions minimizes the number of objects that must be managed to successfully load the
warehouse. For example, a hundred individual sessions can be logically grouped into twenty workflows. The Operations group
can then work with twenty workflows to load the warehouse, which simplifies the operations tasks associated with loading the
targets.

Workflows can be created to run tasks sequentially or concurrently, or have tasks in different paths doing either.

● A sequential workflow runs sessions and tasks one at a time, in a linear sequence. Sequential workflows help ensure
that dependencies are met as needed. For example, a sequential workflow ensures that session1 runs before session2
when session2 is dependent on the load of session1, and so on. It's also possible to set up conditions to run the next
session only if the previous session was successful, or to stop on errors, etc.
● A concurrent workflow groups logical sessions and tasks together, like a sequential workflow, but runs all the tasks at
one time. This can reduce the load times into the warehouse, taking advantage of hardware platforms' symmetric multi-
processing (SMP) architecture.

Other workflow options, such as nesting worklets within workflows, can further reduce the complexity of loading the warehouse.
This capability allows for the creation of very complex and flexible workflow streams without the use of a third-party scheduler.

Q: Assuming a workflow failure, does PowerCenter allow restart from the point of failure?

No. When a workflow fails, you can choose to start a workflow from a particular task but not from the point of failure. It is
possible, however, to create tasks and flows based on error handling assumptions. If a previously running real-time workflow
fails, first recover and then restart that workflow from the Workflow Monitor.

Q: How can a failed workflow be recovered if it is not visible from the Workflow Monitor?

Start the Workflow Manager and open the corresponding workflow. Find the failed task and right click to "Recover Workflow
From Task."

Q: What guidelines exist regarding the execution of multiple concurrent sessions / workflows within or across applications?

Workflow execution needs to be planned around two main constraints on available system
resources:

● Processors
● Memory

The number of sessions that can run efficiently at one time depends on the number of processors available on the server. The
load manager is always running as a process. If bottlenecks with regard to I/O and network are addressed, a session will be
compute-bound, meaning its throughput is limited by the availability of CPU cycles. Most sessions are transformation intensive,
so the DTM always runs. However, some sessions require more I/O, so they use less processor time. A general rule is that a
session needs about 120 percent of a processor for the DTM, reader, and writer in total.

For concurrent sessions:

One session per processor is about right; you can run more, but that requires a "trial and error" approach to determine what
number of sessions starts to affect session performance and possibly adversely affect other executing tasks on the server.

If possible, sessions should run at "off-peak" hours to have as many available resources as possible.

Even after available processors are determined, it is necessary to look at overall system resource usage. Determining memory
usage is more difficult than the processors calculation; it tends to vary according to system load and number of PowerCenter
sessions running.

The first step is to estimate memory usage, accounting for:

● Operating system kernel and miscellaneous processes


● Database engine
● Informatica Load Manager

Next, each session being run needs to be examined with regard to the memory usage, including the DTM buffer size and any
cache/memory allocations for transformations such as lookups, aggregators, ranks, sorters and joiners.

At this point, you should have a good idea of what memory is utilized during concurrent sessions. It is important to arrange the
production run to maximize use of this memory. Remember to account for sessions with large memory requirements; you may
be able to run only one large session, or several small sessions concurrently.

Load-order dependencies are also an important consideration because they often create additional constraints. For example,
load the dimensions first, then facts. Also, some sources may only be available at specific times; some network links may
become saturated if overloaded; and some target tables may need to be available to end users earlier than others.
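
As a rough illustration of the sizing exercise described above, the following sketch (a simplified example in Python, not an Informatica utility; the session figures, the reserved-memory value, and the 120-percent-of-a-processor rule of thumb are assumptions taken from the discussion above) totals CPU and memory estimates for a set of candidate sessions and flags whether they are likely to fit on a given server.

# Rough capacity check for running PowerCenter sessions concurrently.
# All figures are illustrative assumptions, not measured values.

SESSIONS = [
    # name, DTM buffer MB, transformation cache MB (lookups, aggregators, sorters, ...)
    ("s_m_load_customers", 24, 150),
    ("s_m_load_products", 24, 40),
    ("s_m_load_orders_fact", 64, 600),
]

CPU_PER_SESSION = 1.2          # ~120% of a processor per session (DTM + reader + writer)
RESERVED_MEMORY_MB = 4096      # OS kernel, database engine, Load Manager, etc.


def check_capacity(total_processors: int, total_memory_mb: int) -> None:
    cpu_needed = CPU_PER_SESSION * len(SESSIONS)
    mem_needed = RESERVED_MEMORY_MB + sum(dtm + cache for _, dtm, cache in SESSIONS)

    print(f"Estimated CPU needed : {cpu_needed:.1f} of {total_processors} processors")
    print(f"Estimated memory     : {mem_needed} of {total_memory_mb} MB")

    if cpu_needed > total_processors or mem_needed > total_memory_mb:
        print("Plan to stagger sessions or move some to off-peak hours.")
    else:
        print("Sessions should fit concurrently; confirm with trial runs.")


if __name__ == "__main__":
    check_capacity(total_processors=4, total_memory_mb=8192)

An estimate of this kind is only a starting point; actual cache sizes and throughput should be confirmed by monitoring trial runs before fixing the production schedule.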

Q: Is it possible to perform two "levels" of event notification? At the application level and the PowerCenter Server level to notify
the Server Administrator?

The application level of event notification can be accomplished through post-session email. Post-session email allows you to

create two different messages; one to be sent upon successful completion of the session, the other to be sent if the session
fails. Messages can be a simple notification of session completion or failure, or a more complex notification containing specifics
about the session. You can use the following variables in the text of your post-session email:

Email Variable   Description

%s               Session name
%l               Total records loaded
%r               Total records rejected
%e               Session status
%t               Table details, including read throughput in bytes/second and write throughput in rows/second
%b               Session start time
%c               Session completion time
%i               Session elapsed time (session completion time - session start time)
%g               Attaches the session log to the message
%m               Name and version of the mapping used in the session
%d               Name of the folder containing the session
%n               Name of the repository containing the session
%a<filename>     Attaches the named file. The file must be local to the Informatica Server. The following are valid filenames: %a<c:\data\sales.txt> or %a</users/john/data/sales.txt>. On Windows NT, you can attach a file of any type. On UNIX, you can only attach text files; if you attach a non-text file, the send may fail. Note: The filename cannot include the greater-than character (>) or a line break.

The PowerCenter Server on UNIX uses rmail to send post-session email. The repository user who starts the PowerCenter
server must have the rmail tool installed in the path in order to send email.

To verify the rmail tool is accessible:

1. Login to the UNIX system as the PowerCenter user who starts the PowerCenter Server.
2. Type rmail <fully qualified email address> at the prompt and press Enter.
3. Type '.' to indicate the end of the message and press Enter.
4. You should receive a blank email from the PowerCenter user's email account. If not, locate the directory where rmail

resides and add that directory to the path.
5. When you have verified that rmail is installed correctly, you are ready to send post-session email.

The output should look like the following:

Session complete.
Session name: sInstrTest
Total Rows Loaded = 1
Total Rows Rejected = 0
Completed

Table Name   Rows Loaded   Rows Rejected   ReadThroughput (bytes/sec)   WriteThroughput (rows/sec)   Status
t_Q3_sales   1             0               30                           1

No errors encountered.
Start Time: Tue Sep 14 12:26:31 1999
Completion Time: Tue Sep 14 12:26:41 1999
Elapsed time: 0:00:10 (h:m:s)

This information, or a subset, can also be sent to any text pager that accepts email.

Backup Strategy Recommendation

Q: Can individual objects within a repository be restored from the backup or from a prior version?

At the present time, individual objects cannot be restored from a backup using the PowerCenter Repository Manager (i.e., you
can only restore the entire repository). But, it is possible to restore the backup repository into a different database and then
manually copy the individual objects back into the main repository.

It should be noted that PowerCenter does not restore repository backup files created in previous versions of PowerCenter. To
correctly restore a repository, the version of PowerCenter used to create the backup file must be used for the restore as well.

An option for the backup of individual objects is to export them to XML files. This allows for the granular re-importation of
individual objects, mappings, tasks, workflows, etc.

Refer to Migration Procedures - PowerCenter for details on promoting new or changed objects between development, test, QA,
and production environments.

Server Administration

Q: What built-in functions does PowerCenter provide to notify someone in the event that the server goes down, or some other
significant event occurs?

The Repository Service can be used to send messages notifying users that the server will be shut down. Additionally, the
Repository Service can be used to send notification messages about repository objects that are created, modified, or deleted by
another user. Notification messages are received through the PowerCenter Client tools.

Q: What system resources should be monitored? What should be considered normal or acceptable server performance levels?

The pmprocs utility, which is available for UNIX systems only, shows the currently executing PowerCenter processes.

Pmprocs is a script that combines the ps and ipcs commands. It is available through Informatica Technical Support. The utility
provides the following information:

● CPID - Creator PID (process ID)


● LPID - Last PID that accessed the resource
● Semaphores - used to sync the reader and writer
● 0 or 1 - shows slot in LM shared memory

A variety of UNIX and Windows NT commands and utilities are also available. Consult your UNIX and/or Windows NT
documentation.

Q: What cleanup (if any) should be performed after a UNIX server crash? Or after an Oracle instance crash?

If the UNIX server crashes, you should first check to see if the repository database is able to come back up successfully. If this
is the case, then you should try to start the PowerCenter server. Use the pmserver.err log to check if the server has started
correctly. You can also use ps -ef | grep pmserver to see if the server process (the Load Manager) is running.

Custom Transformations

Q: What is the relationship between the Java or SQL transformation and the Custom transformation?

Many advanced transformations, including Java and SQL, were built using the Custom transformation. Custom transformations
operate in conjunction with procedures you create outside of the Designer interface to extend PowerCenter functionality.

Other transformations that were built using Custom transformations include HTTP, SQL, Union, XML Parser, XML Generator,
and many others. Below is a summary of noticeable differences.

Transformation # of Input Groups # of Output Groups Type

Custom Multiple Multiple Active/Passive

HTTP One One Passive

Java One One Active/Passive

SQL One One Active/Passive

Union Multiple One Active

XML Parser One Multiple Active

XML Generator Multiple One Active

For further details, please see the Transformation Guide.

Q: What is the main benefit of a Custom transformation over an External Procedure transformation?

A Custom transformation allows for the separation of input and output functions, whereas an External Procedure transformation
handles both the input and output simultaneously. Additionally, an External Procedure transformation’s parameters consist of all
the ports of the transformation.

The ability to separate input and output functions is especially useful for sorting and aggregation, which require all input rows to

be processed before outputting any output rows.

Q: How do I change a Custom transformation from Active to Passive, or vice versa?

After the creation of the Custom transformation, the transformation type cannot be changed. In order to set the appropriate type,
delete and recreate the transformation.

Q: What is the difference between active and passive Java transformations? When should one be used over the other?

An active Java transformation allows for the generation of more than one output row for each input row. Conversely, a passive
Java transformation only allows for the generation of one output row per input row.

Use active if you need to generate multiple rows with each input. For example, a Java transformation contains two input ports
that represent a start date and an end date. You can generate an output row for each date between the start and end date. Use
passive when you need one output row for each input.

Q: What are the advantages of a SQL transformation over a Source Qualifier?

A SQL transformation allows for the processing of SQL queries in the middle of a mapping. It allows you to insert, delete,
update, and retrieve rows from a database. For example, you might need to create database tables before adding new
transactions. The SQL transformation allows for the creation of these tables from within the workflow.

Q: What is the difference between the SQL transformation’s Script and Query modes?

Script mode allows for the execution of externally located ANSI SQL scripts. Query mode executes a query that you define in a
query editor. You can pass strings or parameters to the query to define dynamic queries or change the selection parameters.

For more information, please see Chapter 22: SQL Transformation in the Transformation Guide.

Metadata

Q: What recommendations or considerations exist as to naming standards or repository administration for metadata that may
be extracted from the PowerCenter repository and used in others?

With PowerCenter, you can enter description information for all repository objects, sources, targets, transformations, etc, but the
amount of metadata that you enter should be determined by the business requirements. You can also drill down to the column
level and give descriptions of the columns in a table if necessary. All information about column size and scale, data types, and
primary keys are stored in the repository.

The decision on how much metadata to create is often driven by project timelines. While it may be beneficial for a developer to
enter detailed descriptions of each column, expression, variable, etc, it is also very time consuming to do so. Therefore, this
decision should be made on the basis of how much metadata is likely to be required by the systems that use the metadata.

There are some time-saving tools that are available to better manage a metadata strategy and content, such as third-party
metadata software and, for sources and targets, data modeling tools.

Q: What procedures exist for extracting metadata from the repository?

Informatica offers an extremely rich suite of metadata-driven tools for data warehousing applications. All of these tools store,
retrieve, and manage their metadata in Informatica's PowerCenter repository. The motivation behind the original Metadata
Exchange (MX) architecture was to provide an effective and easy-to-use interface to the repository.

Today, Informatica and several key Business Intelligence (BI) vendors, including Brio, Business Objects, Cognos, and
MicroStrategy, are effectively using the MX views to report and query the Informatica metadata.

Informatica strongly discourages accessing the repository tables directly, even for SELECT access, because some releases of
PowerCenter change the structure of the repository tables, resulting in a maintenance task for you. Rather, views have

been created to provide access to the metadata stored in the repository.

Additionally, Informatica's Metadata Manager and Data Analyzer allow for more robust reporting against the repository
database and are able to present reports to end users and management.

Versioning

Q: How can I keep multiple copies of the same object within PowerCenter?

A: With PowerCenter, you can use version control to maintain previous copies of every changed object.

You can enable version control after you create a repository. Version control allows you to maintain multiple versions of an
object, control development of the object, and track changes. You can configure a repository for versioning when you create it,
or you can upgrade an existing repository to support versioned objects.

When you enable version control for a repository, the repository assigns all versioned objects version number 1 and each object
has an active status.

You can perform the following tasks when you work with a versioned object:

● View object version properties. Each versioned object has a set of version properties and a status. You can also
configure the status of a folder to freeze all objects it contains or make them active for editing.
● Track changes to an object. You can view a history that includes all versions of a given object, and compare any
version of the object in the history to any other version. This allows you to determine changes made to an object over
time.
● Check the object version in and out. You can check out an object to reserve it while you edit the object. When you
check in an object, the repository saves a new version of the object and allows you to add comments to the version.
You can also find objects checked out by yourself and other users.
● Delete or purge the object version. You can delete an object from view and continue to store it in the repository. You
can recover, or undelete, deleted objects. If you want to permanently remove an object version, you can purge it from
the repository.

Q: Is there a way to migrate only the changed objects from Development to Production without having to spend too much time
on making a list of all changed/affected objects?

A: Yes there is.

You can create Deployment Groups that allow you to group versioned objects for migration to a different repository. You can
create the following types of deployment groups:

● Static. You populate the deployment group by manually selecting objects.


● Dynamic. You use the result set from an object query to populate the deployment group.

To make a smooth transition/migration to Production, you need to have a query associated with your Dynamic deployment
group. When you associate an object query with the deployment group, the Repository Agent runs the query at the time of
deployment. You can associate an object query with a deployment group when you edit or create a deployment group.

If the repository is enabled for versioning, you may also copy the objects in a deployment group from one repository to another.
Copying a deployment group allows you to copy objects in a single copy operation from across multiple folders in the source
repository into multiple folders in the target repository. Copying a deployment group also allows you to specify individual objects
to copy, rather than the entire contents of a folder.

Performance

Q: Can PowerCenter sessions be load balanced?

A: Yes, if the PowerCenter Enterprise Grid Option is available. The Load Balancer is a component of the Integration
Service that dispatches tasks to Integration Service processes running on nodes in a grid. It matches task requirements with
resource availability to identify the best Integration Service process to run a task. It can dispatch tasks on a single node or
across nodes.

Tasks can be dispatched in three ways: Round-robin, Metric-based, and Adaptive. Additionally, you can set the Service Levels
to change the priority of each task waiting to be dispatched. This can be changed in the Administration Console’s domain
properties.

For more information, please refer to Chapter 11: Configuring the Load Balancer in the Administrator Guide.

Web Services

Q: How does Web Services Hub work in PowerCenter?

A: The Web Services Hub is a web service gateway for external clients. It processes SOAP requests from web service clients
that want to access PowerCenter functionality through web services. Web service clients access the Integration Service and
Repository Service through the Web Services Hub.

The Web Services Hub hosts Batch and Real-time Web Services. When you install PowerCenter Services, the PowerCenter
installer installs the Web Services Hub. Use the Administration Console to configure and manage the Web Services Hub. For
more information, please refer to Creating and Configuring the Web Services Hub in the Administrator Guide.

The Web Services Hub connects to the Repository Service and the Integration Service through TCP/IP. Web service clients log
in to the Web Services Hub through HTTP(s). The Web Services Hub authenticates the client based on repository user name
and password. You can use the Web Services Hub console to view service information and download Web Services
Description Language (WSDL) files necessary for running services and workflows.

Last updated: 06-Dec-07 15:00

Event Based Scheduling

Challenge

In an operational environment, the beginning of a task often needs to be triggered by


some event, either internal or external to the Informatica environment. In versions of
PowerCenter prior to version 6.0, this was achieved through the use of indicator files. In
PowerCenter 6.0 and forward, it is achieved through the use of the Event-Raise and
Event-Wait Workflow and Worklet tasks, as well as indicator files.

Description

Event-based scheduling with versions of PowerCenter prior to 6.0 was achieved


through the use of indicator files. Users specified the indicator file configuration in the
session configuration under advanced options. When the session started, the
PowerCenter Server looked for the specified file name; if it wasn’t there, it waited until it
appeared, then deleted it, and triggered the session.

In PowerCenter 6.0 and above, event-based scheduling is triggered by Event-Wait and


Event-Raise tasks. These tasks can be used to define task execution order within a
workflow or worklet. They can even be used to control sessions across workflows.

● An Event-Raise task represents a user-defined event (i.e., an indicator file).


● An Event-Wait task waits for an event to occur within a workflow. After the
event triggers, the PowerCenter Server continues executing the workflow from
the Event-Wait task forward.

The following paragraphs describe the types of events that an Event-Wait task can wait for.

Waiting for Pre-Defined Events

To use a pre-defined event, you need a session, shell command, script, or batch file to
create an indicator file. You must create the file locally or send it to a directory local to
the PowerCenter Server. The file can be any format recognized by the PowerCenter
Server operating system. You can choose to have the PowerCenter Server delete the
indicator file after it detects the file, or you can manually delete the indicator file. The
PowerCenter Server marks the status of the Event-Wait task as "failed" if it cannot
delete the indicator file.

When you specify the indicator file in the Event-Wait task, specify the directory in which
the file will appear and the name of the indicator file. Do not use either a source or
target file name as the indicator file name. You must also provide the absolute path for
the file and the directory must be local to the PowerCenter Server. If you only specify
the file name, and not the directory, Workflow Manager looks for the indicator file in the
system directory. For example, on Windows NT, the system directory is C:/winnt/
system32. You can enter the actual name of the file or use server variables to specify
the location of the files. The PowerCenter Server writes the time the file appears in the
workflow log.

Follow these steps to set up a pre-defined event in the workflow:

1. Create an Event-Wait task and double-click the Event-Wait task to open the
Edit Tasks dialog box.
2. In the Events tab of the Edit Task dialog box, select Pre-defined.
3. Enter the path of the indicator file.
4. If you want the PowerCenter Server to delete the indicator file after it detects
the file, select the Delete Indicator File option in the Properties tab.
5. Click OK.

Pre-defined Event

A pre-defined event is a file-watch event. For pre-defined events, use an Event-Wait


task to instruct the PowerCenter Server to wait for the specified indicator file to appear
before continuing with the rest of the workflow. When the PowerCenter Server locates
the indicator file, it starts the task downstream of the Event-Wait.
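
The indicator file itself is typically produced by the upstream session, shell command, script, or batch job. The sketch below is a minimal Python illustration of that step; the directory and file name are placeholders and must match the absolute path configured in the Event-Wait task.

# Minimal sketch: drop an indicator file for a pre-defined (file-watch) event.
# The path is a hypothetical placeholder and must match the Event-Wait task
# configuration; the file's contents are irrelevant, only its presence matters.
import os

INDICATOR_FILE = "/data/indicators/orders_extract_complete.ind"  # hypothetical path


def raise_file_event() -> None:
    os.makedirs(os.path.dirname(INDICATOR_FILE), exist_ok=True)
    # Create (or touch) the indicator file; the PowerCenter Server can be
    # configured to delete it once the Event-Wait task detects it.
    with open(INDICATOR_FILE, "w") as f:
        f.write("")


if __name__ == "__main__":
    raise_file_event()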

User-defined Event

A user-defined event is defined at the workflow or worklet level and the Event-Raise
task triggers the event at one point of the workflow/worklet. If an Event-Wait task is
configured in the same workflow/worklet to listen for that event, then execution will
continue from the Event-Wait task forward.

The following is an example of using user-defined events:

Assume that you have four sessions that you want to execute in a workflow. You want
P1_session and P2_session to execute concurrently to save time. You also want to
execute Q3_session after P1_session completes. You want to execute Q4_session
only when P1_session, P2_session, and Q3_session complete. Follow these steps:

1. Link P1_session and P2_session concurrently.
2. Add Q3_session after P1_session
3. Declare an event called P1Q3_Complete in the Events tab of the workflow
properties
4. In the workspace, add an Event-Raise task after Q3_session.
5. Specify the P1Q3_Complete event in the Event-Raise task properties. This
allows the Event-Raise task to trigger the event when P1_session and
Q3_session complete.
6. Add an Event-Wait task after P2_session.
7. Specify the P1Q3_Complete event for the Event-Wait task.
8. Add Q4_session after the Event-Wait task. When the PowerCenter Server
processes the Event-Wait task, it waits until the Event-Raise task triggers
P1Q3_Complete before it executes Q4_session.

The PowerCenter Server executes the workflow in the following order:

1. The PowerCenter Server executes P1_session and P2_session concurrently.


2. When P1_session completes, the PowerCenter Server executes Q3_session.
3. The PowerCenter Server finishes executing P2_session.
4. The Event-Wait task waits for the Event-Raise task to trigger the event.
5. The PowerCenter Server completes Q3_session.
6. The Event-Raise task triggers the event, P1Q3_Complete.
7. The PowerCenter Server executes Q4_session because the event,
P1Q3_Complete, has been triggered.

Be sure to take care in setting the links, though. If they are left as the default and Q3
fails, the Event-Raise will never happen. The Event-Wait will then wait forever and the
workflow will run until it is stopped. To avoid this, check the workflow option ‘suspend
on error’. With this option, if a session fails, the whole workflow goes into suspended
mode and can send an email to notify developers.

Last updated: 01-Feb-07 18:53

Key Management in Data Warehousing
Solutions

Challenge

Key management refers to the technique that manages key allocation in a decision
support RDBMS to create a single view of reference data from multiple sources.
Informatica recommends a concept of key management that ensures loading
everything extracted from a source system into the data warehouse.

This Best Practice provides some tips for employing the Informatica-recommended
approach to key management, an approach that deviates from many traditional data
warehouse solutions, which apply logical and data warehouse (surrogate) key strategies
in which transactions are rejected and logged as errors when referential integrity issues arise.

Description

Key management in a decision support RDBMS comprises three techniques for


handling the following common situations:

● Key merging/matching
● Missing keys
● Unknown keys

All three methods are applicable to a Reference Data Store, whereas only the missing
and unknown keys are relevant for an Operational Data Store (ODS). Key management
should be handled at the data integration level, thereby making it transparent to the
Business Intelligence layer.

Key Merging/Matching

When companies source data from more than one transaction system of a similar type,
the same object may have different, non-unique legacy keys. Additionally, a single key
may have several descriptions or attributes in each of the source systems. The
independence of these systems can result in incongruent coding, which poses a
greater problem than records being sourced from multiple systems.

A business can resolve this inconsistency by undertaking a complete code
standardization initiative (often as part of a larger metadata management effort) or
applying a Universal Reference Data Store (URDS). Standardizing code requires an
object to be uniquely represented in the new system. Alternatively, URDS contains
universal codes for common reference values. Most companies adopt this pragmatic
approach, while embarking on the longer term solution of code standardization.

The bottom line is that nearly every data warehouse project encounters this issue and
needs to find a solution in the short term.

Missing Keys

A problem arises when a transaction is sent through without a value in a column where
a foreign key should exist (i.e., a reference to a key in a reference table). This normally
occurs during the loading of transactional data, although it can also occur when loading
reference data into hierarchy structures. In many older data warehouse solutions, this
condition would be identified as an error and the transaction row would be rejected.
The row would have to be processed through some other mechanism to find the correct
code and loaded at a later date. This is often a slow and cumbersome process that
leaves the data warehouse incomplete until the issue is resolved.

The more practical way to resolve this situation is to allocate a special key in place of
the missing key, which links it with a dummy 'missing key' row in the related table. This
enables the transaction to continue through the loading process and end up in the
warehouse without further processing. Furthermore, the row ID of the bad transaction
can be recorded in an error log, allowing the addition of the correct key value at a later
time.

The major advantage of this approach is that any aggregate values derived from the
transaction table will be correct because the transaction exists in the data warehouse
rather than being in some external error processing file waiting to be fixed.

Simple Example:

PRODUCT     CUSTOMER   SALES REP   QUANTITY   UNIT PRICE

Audi TT18   Doe10224               1          35,000

In the transaction above, there is no code in the SALES REP column. As this row is
processed, a dummy sales rep key (9999999) is added to the record to link it to the
'Missing Rep' record in the SALES REP table. A data warehouse key (8888888) is also added to the

transaction.

PRODUCT CUSTOMER SALES REP QUANTITY UNIT PRICE DWKEY

Audi TT18 Doe10224 9999999 1 35,000 8888888

The related sales rep record may look like this:

REP CODE REP NAME REP MANAGER

1234567 David Jones Mark Smith

7654321 Mark Smith

9999999 Missing Rep

An error log entry to identify the missing key on this transaction may look like:

ERROR CODE TABLE NAME KEY NAME KEY

MSGKEY ORDERS SALES REP 8888888

This type of error reporting is not usually necessary because the transactions with
missing keys can be identified using standard end-user reporting tools against the data
warehouse.
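
A minimal sketch of this substitution logic is shown below. It assumes an in-memory reference lookup, the reserved dummy key 9999999, and a simple error-log list; it simply mirrors the example above and is not a representation of any specific PowerCenter transformation.

# Missing-key handling: substitute a dummy key so the row still loads,
# and record the transaction's warehouse key for later correction.
# Values mirror the example above and are illustrative only.

SALES_REP_LOOKUP = {"1234567": "David Jones", "7654321": "Mark Smith"}
MISSING_REP_KEY = "9999999"   # dummy "Missing Rep" row in the reference table
error_log = []


def resolve_sales_rep(transaction: dict) -> dict:
    rep_code = transaction.get("SALES_REP")
    if not rep_code:
        transaction["SALES_REP"] = MISSING_REP_KEY
        error_log.append(
            {"ERROR_CODE": "MSGKEY", "TABLE_NAME": "ORDERS",
             "KEY_NAME": "SALES REP", "KEY": transaction["DWKEY"]}
        )
    return transaction


row = {"PRODUCT": "Audi TT18", "CUSTOMER": "Doe10224", "SALES_REP": None,
       "QUANTITY": 1, "UNIT_PRICE": 35000, "DWKEY": "8888888"}
print(resolve_sales_rep(row))
print(error_log)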

Unknown Keys

Unknown keys need to be treated much like missing keys except that the load process
has to add the unknown key value to the referenced table to maintain integrity rather
than explicitly allocating a dummy key to the transaction. The process also needs to
make two error log entries. The first, to log the fact that a new and unknown key has
been added to the reference table and a second to record the transaction in which the
unknown key was found.

Simple example:

The sales rep reference data record might look like the following:

DWKEY REP NAME REP MANAGER

1234567 David Jones Mark Smith

7654321 Mark Smith

9999999 Missing Rep

A transaction comes into ODS with the record below:

PRODUCT CUSTOMER SALES REP QUANTITY UNIT PRICE

Audi TT18 Doe10224 2424242 1 35,000

In the transaction above, the code 2424242 appears in the SALES REP column. As
this row is processed, a new row has to be added to the Sales Rep reference table.
This allows the transaction to be loaded successfully.

DWKEY REP NAME REP MANAGER

2424242 Unknown

A data warehouse key (8888889) is also added to the transaction.

PRODUCT CUSTOMER SALES REP QUANTITY UNIT PRICE DWKEY

Audi TT18 Doe10224 2424242 1 35,000 8888889

Some warehouse administrators like to have an error log entry generated to identify the
addition of a new reference table entry. This can be achieved simply by adding the
following entries to an error log.

ERROR CODE TABLE NAME KEY NAME KEY

NEWROW SALES REP SALES REP 2424242

A second log entry can be added with the data warehouse key of the transaction in
which the unknown key was found.

ERROR CODE TABLE NAME KEY NAME KEY

UNKNKEY ORDERS SALES REP 8888889

As with missing keys, error reporting is not essential because the unknown status is
clearly visible through the standard end-user reporting.

Moreover, regardless of the error logging, the system is self-healing because the newly
added reference data entry will be updated with full details as soon as these changes
appear in a reference data feed.

This would result in the reference data entry looking complete.

DWKEY REP NAME REP MANAGER

2424242 David Digby Mark Smith
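
The unknown-key variant of the same logic is sketched below, again as a simplified Python illustration rather than PowerCenter code: when a code is present but not found in the reference table, a placeholder reference row is inserted so that referential integrity holds, and two error-log entries are written.

# Unknown-key handling: add a placeholder reference row for the new code,
# then log both the new row and the transaction that introduced it.
# Values mirror the example above and are illustrative only.

sales_rep_table = {"1234567": "David Jones", "7654321": "Mark Smith",
                   "9999999": "Missing Rep"}
error_log = []


def resolve_unknown_rep(transaction: dict) -> dict:
    rep_code = transaction["SALES_REP"]
    if rep_code not in sales_rep_table:
        sales_rep_table[rep_code] = "Unknown"   # self-healing: updated by the next reference feed
        error_log.append({"ERROR_CODE": "NEWROW", "TABLE_NAME": "SALES REP",
                          "KEY_NAME": "SALES REP", "KEY": rep_code})
        error_log.append({"ERROR_CODE": "UNKNKEY", "TABLE_NAME": "ORDERS",
                          "KEY_NAME": "SALES REP", "KEY": transaction["DWKEY"]})
    return transaction


row = {"PRODUCT": "Audi TT18", "CUSTOMER": "Doe10224", "SALES_REP": "2424242",
       "QUANTITY": 1, "UNIT_PRICE": 35000, "DWKEY": "8888889"}
resolve_unknown_rep(row)
print(sales_rep_table["2424242"], error_log)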

Employing the Informatica recommended key management strategy produces the


following benefits:

● All rows can be loaded into the data warehouse


● All objects are allocated a unique key
● Referential integrity is maintained
● Load dependencies are removed

Last updated: 01-Feb-07 18:53

Mapping Auto-Generation

Challenge

In the course of developing mappings for PowerCenter, situations can arise where a set of similar functions/procedures must be
executed for each mapping. The first reaction to this issue is generally to employ a mapplet. These objects are suited to
situations where all of the individual fields/data are the same across uses of the mapplet. However, in cases where the fields are
different – but the ‘process’ is the same – a requirement emerges to ‘generate’ multiple mappings using a standard template of
actions and procedures.

The potential benefits of Autogeneration are focused on a reduction in the Total Cost of Ownership (TCO) of the integration
application and include:

● Reduced build time


● Reduced requirement for skilled developer resources
● Promotion of pattern-based design
● Built in quality and consistency
● Reduced defect rate through elimination of manual errors
● Reduced support overhead

Description

From the outset, it should be emphasized that auto-generation should be integrated into the overall development strategy. It is
probable that some components will still need to be manually developed and many of the disciplines and best practices that are
documented elsewhere in Velocity still apply. It is best to regard autogeneration as a productivity aid in specific situations and not
as a technique that works in all situations. Currently, the autogeneration of 100% of the components required is not a realistic
objective.

All of the techniques discussed here revolve around the generation of an XML file which shares the standard format of exported
PowerCenter components as defined in the powrmart.dtd schema definition. After being generated, the resulting XML document
is imported into PowerCenter using standard facilities available through the user interface or via command line.

With Informatica technology, there are a number of options for XML targeting which can be leveraged to implement
autogeneration. Thus you can exploit these features to make the technology self-generating.

The stages in implementing an autogeneration strategy are:

1. Establish the Scope for Autogeneration


2. Design the Assembly Line(s)
3. Build the Assembly Line
4. Implement the QA and Testing Strategies

These stages are discussed in more detail in the following sections.

1. Establish the Scope for Autogeneration

There are three types of opportunities for manufacturing components:

● Pattern-Driven
● Rules-Driven
● Metadata-Driven

A Pattern-Driven build is appropriate when a single pattern of transformation is to be replicated for multiple source-target

combinations. For example, the initial extract in a standard data warehouse load typically extracts some source data with
standardized filters, and then adds some load metadata before populating a staging table which essentially replicates the source
structure.

The potential for Rules-Driven build typically arises when non-technical users are empowered to articulate transformation
requirements in a format which is the source for a process generating components. Usually, this is accomplished via a
spreadsheet which defines the source-to-target mapping and uses a standardized syntax to define the transformation rules. To
implement this type of autogeneration, it is necessary to build an application (typically based on a PowerCenter mapping) which
reads the spreadsheet, matches the sources and targets against the metadata in the repository and produces the XML output.

Finally, the potential for Metadata-Driven build arises when the import of source and target metadata enables transformation
requirements to be inferred which also requires a mechanism for mapping sources to target. For example, when a text source
column is mapped to a numeric target column the inferred rule is to test for data type compatibility.

The first stage in the implementation of an autogeneration strategy is to decide which of these autogeneration types is applicable
and to ensure that the appropriate technology is available.

In most case, it is the Pattern-Driven build which is the main area of interest; this is precisely the requirement which the mapping
generation license option within PowerCenter is designed to address. This option uses the freely distributed Informatica Data
Stencil design tool for Microsoft Visio and freely distributed Informatica Velocity-based mapping templates to accelerate and
automate mapping design.

Generally speaking, applications which involve a small number of highly-complex flows of data tailored to very specific source/
target attributes are not good candidates for pattern-driven autogeneration.

Currently, there is a great deal of product innovation in the areas of Rules-Driven and Metadata-Driven autogeneration. One
option is to use PowerCenter itself, via an XML target, to generate the XML files that are later imported as mappings.
Depending on the scale and complexity of both the autogeneration rules and the functionality of the generated components, it
may be advisable to acquire a license for the PowerCenter Unstructured Data option.

In conclusion, at the end of this stage the type of autogeneration should be identified and all the required technology licenses
should be acquired.

2. Design the Assembly Line

It is assumed that the standard development activities in the Velocity Architect and Design phases have been undertaken and at
this stage, the development team should understand the data and the value to be added to it.

It should be possible to identify the patterns of data movement.

The main stages in designing the assembly line are:

● Manually develop a prototype


● Distinguish between the generic and the flow-specific components
● Establish the boundaries and inter-action between generated and manually built components
● Agree the format and syntax for the specification of the rules (usually Excel)
● Articulate the rules in the agreed format
● Incorporate component generation in the overall development process
● Develop the manual components (if any)

It is recommended that a prototype is manually developed for a representative subset of the sources and targets since the
adoption of autogeneration techniques does not obviate the need for a re-usability strategy. Even if some components are
generated rather than built, it is still necessary to distinguish between the generic and the flow-specific components. This will
allow the generic functionality to be mapped onto the appropriate re-usable PowerCenter components – mapplets,
transformations, user defined functions etc.

The manual development of the prototype also allows the scope of the autogeneration to be established. It is unlikely that every
single required PowerCenter component can be generated, and generation may be restricted by the current capabilities of the
PowerCenter Visio Stencil. It is necessary to establish the demarcation between generated and manually-built components.

It will also be necessary to devise a customization strategy if the autogeneration is seen as a repeatable process. How are manual modifications to the generated components to be implemented? Should these be isolated in discrete components that are called from the generated components?

If the autogeneration strategy is based on an application rather than the Visio stencil mapping generation option, ensure that the
components you are planning to generate are consistent with the restrictions on the XML export file by referring to the product
documentation.

TIP
If you modify an exported XML file, you need to make sure that the XML file conforms to the structure of powrmart.dtd. You
also need to make sure the metadata in the XML file conforms to Designer and Workflow Manager rules. For example, when
you define a shortcut to an object, define the folder in which the referenced object resides as a shared folder. Although
PowerCenter validates the XML file before importing repository objects from it, it might not catch all invalid changes. If you
import into the repository an object that does not conform to Designer or Workflow Manager rules, you may cause data
inconsistencies in the repository.

Do not modify the powrmart.dtd file.

CRCVALUE Codes

Informatica restricts which elements you can modify in the XML file. When you export a Designer object, the PowerCenter
Client might include a Cyclic Redundancy Checking Value (CRCVALUE) code in one or more elements in the XML file. The
CRCVALUE code is another attribute in an element.

When the PowerCenter Client includes a CRCVALUE code in the exported XML file, you can modify some attributes and
elements before importing the object into a repository. For example, VSAM source objects always contain a CRCVALUE
code, so you can only modify some attributes in a VSAM source object. If you modify certain attributes in an element that
contains a CRCVALUE code, you cannot import the object.

For more information, refer to the Chapter on Exporting and Importing Objects in the PowerCenter Repository Guide.

3. Build the Assembly Line

Essentially, the requirements for the autogeneration may be discerned from the XML exports of the manually developed
prototype.

Autogeneration Based on Visio Data Stencil

(Refer to the product documentation for more information on installation, configuration and usage.)

It is important to confirm that all the required PowerCenter transformations are supported by the installed version of the Stencil.

The use of an external industry-standard interface such as MS Visio allows the tool to be used by Business Analysts rather than
PowerCenter specialists. Apart from allowing the mapping patterns to be specified, the Stencil may also be used as a
documentation tool.

Essentially, there are three usage stages:

● Implement the design in a Visio template
● Publish the design
● Generate the PowerCenter components



A separate Visio template is defined for every pattern identified in the design phase. A template can be created from scratch or imported from a mapping export.

The icons for transformation objects should be familiar to PowerCenter users. Less familiar is the concept of properties for the links (i.e., relationships) between the objects in the Stencil. These link rules define which ports propagate from one transformation to the next, and there may be multiple rules in a single link.

Essentially, the process of developing the template consists of identifying the dynamic components in the pattern and parameterizing them, for example:

● Source and target table name
● Source primary key, target primary key
● Lookup table name and foreign keys
● Transformations

Once the template is saved and validated, it needs to be "published", which simply makes it available in formats that the generating mechanisms can understand:

● Mapping template parameter XML
● Mapping template XML

One of the outputs from publishing is a parameter file template for the definition of the parameters specified in the template. An example of a modified parameter file is shown below:

<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE PARAMETERS SYSTEM "parameters.dtd">
<PARAMETERS REPOSITORY_NAME="REP_MAIN" REPOSITORY_VERSION="179"
    REPOSITORY_CODEPAGE="MS1252" REPOSITORY_DATABASETYPE="Oracle">
    <MAPPING NAME="M_LOAD_CUSTOMER_GENERATED" FOLDER_NAME="PTM_2008_VISIO_SOURCE"
        DESCRIPTION="M_LOAD_CUSTOMER">
        <PARAM NAME="$SRC_KEY$" VALUE="CUSTOMER_CODE" />
        <PARAM NAME="$TGT$" VALUE="CUSTOMER_DIM" />
        <PARAM NAME="$TGT_KEY$" VALUE="CUSTOMER_ID" />
        <PARAM NAME="$SRC$" VALUE="CUSTOMER_MASTER" />
    </MAPPING>
    <MAPPING NAME="M_LOAD_PRODUCT_GENERATED" FOLDER_NAME="PTM_2008_VISIO_SOURCE"
        DESCRIPTION="M_LOAD_CUSTOMER">
        <PARAM NAME="$SRC_KEY$" VALUE="PRODUCT_CODE" />
        <PARAM NAME="$TGT$" VALUE="PRODUCT_DIM" />
        <PARAM NAME="$TGT_KEY$" VALUE="PRODUCT_ID" />
        <PARAM NAME="$SRC$" VALUE="PRODUCT_MASTER" />
    </MAPPING>
</PARAMETERS>

This file is only used in scripted generation.

The other output from the publishing is the template in XML format. This file is only used in manual generation.

There is a choice of either manual or scripted mechanisms for generating components from the published files.

The manual mechanism involves the importation of the published XML template through the Mapping Template Import Wizard in
the PowerCenter Designer. The parameters defined in the template are entered manually through the user interface.

Alternatively, the scripted process is based on a supplied command-line utility, mapgen. The first stage is to manually modify the
published parameter file to specify values for all the mappings to be generated. The second stage is to use PowerCenter to
export source and target definitions for all the objects referenced in the parameter file. These are required in order to generate
the ports.

Mapgen requires the following syntax:

● <-t> Visio Drawing File (i.e., mapping source)
● <-p> ParameterFile (i.e., parameters)
● <-o> MappingFile (i.e., output)
● [-d] TableDefinitionDir (i.e., metadata sources & targets)
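For example, a scripted generation run might be invoked as follows (the file and directory names here are hypothetical, and the exact syntax should be confirmed against the product documentation for the installed version):

    mapgen -t LoadDimension.vsd -p LoadDimension_params.xml -o Generated_Mappings.xml -d TableDefs

The -d directory would contain the exported source and target definitions referenced in the parameter file.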

The generated output file is imported using the standard import facilities in PowerCenter.

TIP
Even if the scripted option is selected as the main generating mechanism, use the Mapping Template Import Wizard in the PowerCenter Designer to generate the first mapping; this allows the early identification of any errors or inconsistencies in the template.

Autogeneration Based on Informatica Application

This strategy generates PowerCenter XML but can be implemented through either PowerCenter itself or the Unstructured Data
option. Essentially, it will require the same build sub-stages as any other data integration application. The following components
are anticipated:

● Specification of the formats for source-to-target mapping and transformation rules definition (see the sketch after this list)
● Development of a mapping to load the specification spreadsheets into a table
● Development of a mapping to validate the specification and report errors
● Development of a mapping to generate the XML output excluding critical errors
● Development of a component to automate the importation of the XML output into PowerCenter
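A minimal sketch of such a specification format is shown below; the column layout and values are hypothetical, and a real specification would typically also carry data types, keys, and pattern identifiers:

    SOURCE_TABLE       SOURCE_COLUMN    TARGET_TABLE    TARGET_COLUMN    RULE
    CUSTOMER_MASTER    CUSTOMER_CODE    CUSTOMER_DIM    CUSTOMER_ID      Look up surrogate key
    CUSTOMER_MASTER    CUSTOMER_NAME    CUSTOMER_DIM    CUSTOMER_NAME    Trim and uppercase

Each row of the specification is loaded, validated, and then used to drive the generation of the corresponding transformation logic in the output XML.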

One of the main issues to be addressed is whether there is a single generation engine which deals with all of the required
patterns, or a series of pattern-specific generation engines.

One of the drivers for the design should be the early identification of errors in the specifications; otherwise, the first indication of any problem will be the failure of the XML output to import into PowerCenter.

It is very important to define the process around the generation and to allocate responsibilities appropriately.

Autogeneration Based on Java Application



Assuming the appropriate skills are available in the development team, an alternative technique is to develop a Java application to generate the mapping XML files. The PowerCenter Mapping SDK is a Java API that provides all of the elements required to generate mappings. The Mapping SDK can be found in the client installation directory. It contains:

● The Javadoc (api directory), which describes all of the classes in the Java API
● The API itself (lib directory), which contains the jar files used by Mapping SDK applications
● Some basic samples that show how Java development with the Mapping SDK is done

The Java application also requires a mechanism to define the final mapping between source and target structures; the
application interprets this data source and combines it with the metadata in the repository in order to output the required mapping
XML.

4. Implement the QA and Testing Strategies

There should be less of a requirement for QA and testing with generated components, but this does not mean that the need to test no longer exists. To some extent, the testing effort should be redirected to the components of the assembly line itself.

There is a great deal of material in Velocity to support QA and Test activities. In particular, refer to Naming Conventions. Informatica suggests adopting a naming convention that distinguishes between generated and manually-built components.

For more information on the QA strategy refer to Using PowerCenter Metadata Manager and Metadata Exchange Views for Quality Assurance.

Otherwise, the main areas of focus for testing are:

Last updated: 26-May-08 18:26



Mapping Design

Challenge

Optimizing PowerCenter to create an efficient execution environment.

Description

Although PowerCenter environments vary widely, most sessions and/or mappings can
benefit from the implementation of common objects and optimization procedures.
Follow these procedures and rules of thumb when creating mappings to help ensure
optimization.

General Suggestions for Optimizing

1. Reduce the number of transformations. There is always overhead involved in moving data between transformations.
2. Consider more shared memory for a large number of transformations. Session shared memory between 12MB and 40MB should suffice.
3. Calculate once, use many times.
   ❍ Avoid calculating or testing the same value over and over.
   ❍ Calculate it once in an expression, and set a True/False flag.
   ❍ Within an expression, use variable ports to calculate a value that can be used multiple times within that transformation (see the sketch after this list).
4. Only connect what is used.
   ❍ Delete unnecessary links between transformations to minimize the amount of data moved, particularly in the Source Qualifier.
   ❍ This is also helpful for maintenance. If a transformation needs to be reconnected, it is best to only have necessary ports set as input and output to reconnect.
   ❍ In lookup transformations, change unused ports to be neither input nor output. This makes the transformations cleaner looking. It also makes the generated SQL override as small as possible, which cuts down on the amount of cache necessary and thereby improves performance.
5. Watch the data types.
   ❍ The engine automatically converts compatible types.
   ❍ Sometimes data conversion is excessive. Data types are automatically converted when types differ between connected ports. Minimize data type changes between transformations by planning data flow prior to developing the mapping.
6. Facilitate reuse.
   ❍ Plan for reusable transformations upfront.
   ❍ Use variables. Use both mapping variables and ports that are variables. Variable ports are especially beneficial when they can be used to calculate a complex expression or perform a disconnected lookup call only once instead of multiple times.
   ❍ Use mapplets to encapsulate multiple reusable transformations.
   ❍ Use mapplets to leverage the work of critical developers and minimize mistakes when performing similar functions.
7. Only manipulate data that needs to be moved and transformed.
   ❍ Reduce the number of non-essential records that are passed through the entire mapping.
   ❍ Use active transformations that reduce the number of records as early in the mapping as possible (i.e., placing filters and aggregators as close to the source as possible).
   ❍ Select the appropriate driving/master table when using joins. The table with the lesser number of rows should be the driving/master table for a faster join.
8. Utilize single-pass reads.
   ❍ Redesign mappings to utilize one Source Qualifier to populate multiple targets. This way the server reads this source only once. If you have different Source Qualifiers for the same source (e.g., one for delete and one for update/insert), the server reads the source for each Source Qualifier.
   ❍ Remove or reduce field-level stored procedures.
9. Utilize Pushdown Optimization.
   ❍ Design mappings so they can take advantage of the Pushdown Optimization feature. This improves performance by allowing the source and/or target database to perform the mapping logic.
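The following is a minimal sketch of the "calculate once" guideline in item 3, expressed in the PowerCenter transformation language (all port names and the tax rate are hypothetical). The variable port is evaluated once per row and then reused by several output ports:

    v_NET_AMOUNT     (variable port):  GROSS_AMOUNT - DISCOUNT_AMOUNT
    o_NET_AMOUNT     (output port):    v_NET_AMOUNT
    o_TAX_AMOUNT     (output port):    v_NET_AMOUNT * 0.0825
    o_IS_HIGH_VALUE  (output port):    IIF(v_NET_AMOUNT > 1000, 'Y', 'N')

Without the variable port, the subtraction would be repeated in each of the three output expressions.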

Lookup Transformation Optimizing Tips



1. When your source is large, cache lookup table columns for those lookup tables
of 500,000 rows or less. This typically improves performance by 10 to 20
percent.
2. The rule of thumb is not to cache any table over 500,000 rows. This is only true
if the standard row byte count is 1,024 or less. If the row byte count is more
than 1,024, then you need to adjust the 500K-row standard down as the
number of bytes increase (i.e., a 2,048 byte row can drop the cache row count
to between 250K and 300K, so the lookup table should not be cached in this
case). This is just a general rule though. Try running the session with a large
lookup cached and not cached. Caching is often faster on very large lookup
tables.
3. When using a Lookup Table Transformation, improve lookup performance by
placing all conditions that use the equality operator = first in the list of conditions
under the condition tab.
4. Cache lookup tables only if the number of lookup calls is more than 10 to 20 percent of the lookup table rows. For a smaller number of lookup calls, do not cache if the number of lookup table rows is large. For small lookup tables (i.e., less than 5,000 rows), cache if there are more than 5 to 10 lookup calls.
5. Replace the lookup with a DECODE or IIF expression for small sets of values (see the example after this list).
6. If caching lookups and performance is poor, consider replacing with an
unconnected, uncached lookup.
7. For overly large lookup tables, use dynamic caching along with a persistent
cache. Cache the entire table to a persistent file on the first run, enable the
"update else insert" option on the dynamic cache and the engine never has to
go back to the database to read data from this table. You can also partition this
persistent cache at run time for further performance gains.
8. When handling multiple matches, use the "Return any matching value" setting
whenever possible. Also use this setting if the lookup is being performed to
determine that a match exists, but the value returned is irrelevant. The lookup
creates an index based on the key ports rather than all lookup transformation
ports. This simplified indexing process can improve performance.
9. Review complex expressions.
   ❍ Examine mappings via Repository Reporting and Dependency Reporting within the mapping.
   ❍ Minimize aggregate function calls.
   ❍ Replace Aggregate Transformation object with an Expression Transformation object and an Update Strategy Transformation for certain types of Aggregations.
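As a minimal sketch of tip 5 (the column name and code values are hypothetical), a small, static code set can be resolved in an Expression transformation instead of a Lookup:

    o_COUNTRY_NAME:  DECODE(COUNTRY_CODE,
                            'US', 'United States',
                            'CA', 'Canada',
                            'GB', 'United Kingdom',
                            'Unknown')

The equivalent nested IIF form, IIF(COUNTRY_CODE = 'US', 'United States', IIF(COUNTRY_CODE = 'CA', 'Canada', ...)), produces the same result; DECODE is usually easier to read once more than two or three values are involved.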

Operations and Expression Optimizing Tips

1. Numeric operations are faster than string operations.


2. Optimize char-varchar comparisons (i.e., trim spaces before comparing).



3. Operators are faster than functions (i.e., || vs. CONCAT); see the example after this list.
4. Optimize IIF expressions.
5. Avoid date comparisons in lookup; replace with string.
6. Test expression timing by replacing with constant.
7. Use flat files.
   ❍ Flat files located on the server machine load faster than a database located on the server machine.
   ❍ Fixed-width files are faster to load than delimited files because delimited files require extra parsing.
   ❍ If processing intricate transformations, consider first loading the source flat file into a relational database, which allows the PowerCenter mappings to access the data in an optimized fashion by using filters and custom SQL SELECTs where appropriate.

8. If working with a source that cannot return sorted data (e.g., Web logs), consider using the Sorter Advanced External Procedure.
9. Use a Router Transformation to separate data flows instead of multiple Filter
Transformations.
10. Use a Sorter Transformation or hash-auto keys partitioning before an
Aggregator Transformation to optimize the aggregate. With a Sorter
Transformation, the Sorted Ports option can be used even if the original source
cannot be ordered.
11. Use a Normalizer Transformation to pivot rows rather than multiple instances of
the same target.
12. Rejected rows from an update strategy are logged to the bad file. Consider
filtering before the update strategy if retaining these rows is not critical because
logging causes extra overhead on the engine. Choose the option in the update
strategy to discard rejected rows.
13. When using a Joiner Transformation, be sure to make the source with the
smallest amount of data the Master source.
14. If an update override is necessary in a load, consider using a Lookup
transformation just in front of the target to retrieve the primary key. The primary
key update is much faster than the non-indexed lookup override.
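A brief illustration of tip 3 (the field names are hypothetical): the || concatenation operator and the CONCAT function produce the same result, but the operator form is generally evaluated more efficiently and is easier to read:

    Operator form:  FIRST_NAME || ' ' || LAST_NAME
    Function form:  CONCAT(CONCAT(FIRST_NAME, ' '), LAST_NAME)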

Suggestions for Using Mapplets

A mapplet is a reusable object that represents a set of transformations. It allows you to reuse transformation logic and can contain as many transformations as necessary. Use the Mapplet Designer to create mapplets.



Last updated: 01-Feb-07 18:53



Mapping Templates

Challenge

Mapping Templates demonstrate proven solutions for tackling challenges that commonly occur during data integration development efforts. Mapping Templates can be used to make the development phase of a project more efficient. They can also serve as a medium for introducing development standards that developers need to follow into the mapping development process.

A wide array of Mapping Template examples can be obtained for the most current
PowerCenter version from the Informatica Customer Portal. As "templates," each of the
objects in Informatica's Mapping Template Inventory illustrates the transformation logic
and steps required to solve specific data integration requirements. These sample
templates, however, are meant to be used as examples, not as a means to implement development standards.

Description

Reuse Transformation Logic

Templates can be heavily used in a data integration and warehouse environment when loading information from multiple source providers into the same target structure, or when similar source system structures are employed to load different target instances. Using templates ensures that any transformation logic that has been developed and tested correctly once can be applied across multiple mappings as needed. In some instances, the process can be further simplified if the source/target structures have the same attributes, by simply creating multiple instances of the session, each with its own connection/execution attributes, instead of duplicating the mapping.

Implementing Development Techniques

When the process is not simply a matter of duplicating transformation logic to load the same target, Mapping Templates can help to reproduce transformation techniques. In this case, the implementation process requires more than just replacing source/target transformations. This scenario is most useful when certain logic (i.e., a logical group of transformations) is employed across mappings. In many instances this can be further simplified by making use of mapplets. Additionally, user-defined functions can be utilized to reuse expression logic and to build complex expressions using the transformation language.

Transport mechanism

Once Mapping Templates have been developed, they can be distributed by any of the
following procedures:

● Copy mapping from development area to the desired repository/folder


● Export mapping template into XML and import to the desired repository/folder.

Mapping template examples

The following Mapping Templates can be downloaded from the Informatica Customer
Portal and are listed by subject area:

Common Data Warehousing Techniques

● Aggregation using Sorted Input


● Tracking Dimension History
● Constraint-Based Loading
● Loading Incremental Updates
● Tracking History and Current
● Inserts or Updates

Transformation Techniques

● Error Handling Strategy


● Flat File Creation with Headers and Footers
● Removing Duplicate Source Records
● Transforming One Record into Multiple Records
● Dynamic Caching
● Sequence Generator Alternative
● Streamline a Mapping with a Mapplet
● Reusable Transformations (Customers)
● Using a Sorter



● Pipeline Partitioning Mapping Template
● Using Update Strategy to Delete Rows
● Loading Heterogenous Targets
● Load Using External Procedure

Advanced Mapping Concepts

● Aggregation Using Expression Transformation


● Building a Parameter File
● Best Build Logic
● Comparing Values Between Records
● Transaction Control Transformation

Source-Specific Requirements

● Processing VSAM Source Files


● Processing Data from an XML Source
● Joining a Flat File with a Relational Table

Industry-Specific Requirements

● Loading SWIFT 942 Messages
● Loading SWIFT 950 Messages

Last updated: 01-Feb-07 18:53



Naming Conventions

Challenge

A variety of factors are considered when assessing the success of a project. Naming standards are an important, but often overlooked, component. The application and enforcement of naming standards not only establishes consistency in the repository, but also provides a developer-friendly environment. Choose a good naming standard and adhere to it to ensure that the repository can be easily understood by all developers.

Description
Although naming conventions are important for all repository and database objects, the suggestions in this Best Practice focus on the
former. Choosing a convention and sticking with it is the key.

Having a good naming convention facilitates smooth migrations and improves readability for anyone reviewing or carrying out
maintenance on the repository objects. It helps them to understand the processes being affected. If consistent names and descriptions
are not used, significant time may be needed to understand the workings of mappings and transformation objects. If no description is
provided, a developer is likely to spend considerable time going through an object or mapping to understand its objective.

The following pages offer suggested naming conventions for various repository objects. Whatever convention is chosen, it is important
to make the selection very early in the development cycle and communicate the convention to project staff working on the repository.
The policy can be enforced by peer review and at test phases by adding processes to check conventions both to test plans and to test
execution documents.

Suggested Naming Conventions

Designer Object: Suggested Naming Convention

Mapping: m_{PROCESS}_{SOURCE_SYSTEM}_{TARGET_NAME}, or suffix with _{descriptor} if there are multiple mappings for that single target table.

Mapplet: mplt_{DESCRIPTION}

Target: {update_type(s)}_{TARGET_NAME}. This naming convention should only occur within a mapping, as the actual target name object affects the actual table that PowerCenter will access.

Aggregator Transformation: AGG_{FUNCTION} that leverages the expression and/or a name that describes the processing being done.

Application Source Qualifier Transformation: ASQ_{TRANSFORMATION}_{SOURCE_TABLE1}_{SOURCE_TABLE2}; represents data from an application source.

Custom Transformation: CT_{TRANSFORMATION} name that describes the processing being done.

Data Quality Transform: IDQ_{descriptor}_{plan}, with the descriptor describing what this plan is doing and the optional plan name included if desired.

Expression Transformation: EXP_{FUNCTION} that leverages the expression and/or a name that describes the processing being done.

External Procedure Transformation: EXT_{PROCEDURE_NAME}

Filter Transformation: FIL_ or FILT_{FUNCTION} that leverages the expression or a name that describes the processing being done.

Flexible Target Key: Fkey{descriptor}

HTTP: http_{descriptor}

Idoc Interpreter: idoci_{Descriptor}_{IDOC Type}, defining what the IDoc does and possibly the IDoc message.

Idoc Prepare: idocp_{Descriptor}_{IDOC Type}, defining what the IDoc does and possibly the IDoc message.

Java Transformation: JV_{FUNCTION} that leverages the expression or a name that describes the processing being done.

Joiner Transformation: JNR_{DESCRIPTION}

Lookup Transformation: LKP_{TABLE_NAME}, or suffix with _{descriptor} if there are multiple look-ups on a single table. For unconnected look-ups, use ULKP in place of LKP.

Mapplet Input Transformation: MPLTI_{DESCRIPTOR}, indicating the data going into the mapplet.

Mapplet Output Transformation: MPLTO_{DESCRIPTOR}, indicating the data coming out of the mapplet.

MQ Source Qualifier Transformation: MQSQ_{DESCRIPTOR}, defining the messaging being selected.

Normalizer Transformation: NRM_{FUNCTION} that leverages the expression or a name that describes the processing being done.

Rank Transformation: RNK_{FUNCTION} that leverages the expression or a name that describes the processing being done.

Router Transformation: RTR_{DESCRIPTOR}

SAP DMI Prepare: dmi_{Entity Descriptor}_{Secondary Descriptor}, defining what entity is being loaded and a secondary description if multiple DMI objects are being leveraged in a mapping.

Sequence Generator Transformation: SEQ_{DESCRIPTOR}; if generating keys for a target table entity, refer to that entity.

Sorter Transformation: SRT_{DESCRIPTOR}

Source Qualifier Transformation: SQ_{SOURCE_TABLE1}_{SOURCE_TABLE2}. Using all source tables can be impractical if there are a lot of tables in a source qualifier, so refer to the type of information being obtained, for example a certain type of product: SQ_SALES_INSURANCE_PRODUCTS.

Stored Procedure Transformation: SP_{STORED_PROCEDURE_NAME}

Transaction Control Transformation: TCT_ or TRANS_{DESCRIPTOR}, indicating the function of the transaction control.

Union Transformation: UN_{DESCRIPTOR}

Unstructured Data Transform: UDO_{descriptor}, with the descriptor identifying the kind of data being parsed by the UDO transform.

Update Strategy Transformation: UPD_{UPDATE_TYPE(S)}, or UPD_{UPDATE_TYPE(S)}_{TARGET_NAME} if there are multiple targets in the mapping (e.g., UPD_UPDATE_EXISTING_EMPLOYEES).

Web Service Consumer: WSC_{descriptor}

XML Generator Transformation: XMG_{DESCRIPTOR}, defining the target message.

XML Parser Transformation: XMP_{DESCRIPTOR}, defining the messaging being selected.

XML Source Qualifier Transformation: XMSQ_{DESCRIPTOR}, defining the data being selected.
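As an illustration of these conventions applied together (all object names below are hypothetical), a mapping that loads a customer dimension from a CRM source might contain:

    m_DW_CRM_CUSTOMER_DIM             (mapping)
    SQ_CUSTOMER_MASTER                (Source Qualifier)
    EXP_STANDARDIZE_CUSTOMER_NAME     (Expression)
    LKP_CUSTOMER_DIM                  (Lookup against the existing dimension)
    UPD_INSERT_UPDATE_CUSTOMER_DIM    (Update Strategy)
    INSERT_UPDATE_CUSTOMER_DIM        (target instance within the mapping)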

Port Names

Port names should remain the same as the source unless some other action is performed on the port. In that case, the port should be
prefixed with the appropriate name.

When the developer brings a source port into a lookup, the port should be prefixed with ‘in_’. This helps the user immediately identify
the ports that are being input without having to line up the ports with the input checkbox. In any other transformation, if the input port
is transformed into an output port with the same name, prefix the input port with ‘in_’.

Generated output ports can also be prefixed. This helps trace the port value throughout the mapping as it may travel through many
other transformations. If it is intended to be able to use the autolink feature based on names, then outputs may be better left as the
name of the target port in the next transformation. For variables inside a transformation, the developer can use the prefix ‘v’, 'var_’ or
‘v_' plus a meaningful name.

With some exceptions, port standards apply when creating a transformation object. The exceptions are the Source Definition, the
Source Qualifier, the Lookup, and the Target Definition ports, which must not change since the port names are used to retrieve data
from the database.

Other transformations that are not applicable to the port standards are:

● Normalizer - The ports created in the Normalizer are automatically formatted when the developer configures it.
● Sequence Generator - The ports are reserved words.
● Router - Because output ports are created automatically, prefixing the input ports with an I_ prefixes the output ports with I_
as well. Port names should not have any prefix.
● Sorter, Update Strategy, Transaction Control, and Filter - These ports are always input and output. There is no need to
rename them unless they are prefixed. Prefixed port names should be removed.
● Union - The group ports are automatically assigned to the input and output; therefore prefixing with anything is reflected in
both the input and output. The port names should not have any prefix.

All other transformation object ports can be prefixed or suffixed with:

● ‘in_’ or ‘i_’for Input ports


● ‘o_’ or ‘_out’ for Output ports
● ‘io_’ for Input/Output ports
● ‘v’,‘v_’ or ‘var_’ for variable ports
● ‘lkp_’ for returns from look ups
● ‘mplt_’ for returns from mapplets

Prefixes are preferable because they are generally easier to see; developers do not need to expand the columns to see the suffix for
longer port names.
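For example (the port names are hypothetical), an Expression transformation that derives a full name might expose:

    in_FIRST_NAME  (input port)
    in_LAST_NAME   (input port)
    v_FULL_NAME    (variable port):  in_FIRST_NAME || ' ' || in_LAST_NAME
    o_FULL_NAME    (output port):    v_FULL_NAME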

Transformation object ports can also:

● Have the Source Qualifier port name.


● Be unique.
● Be meaningful.
● Be given the target port name.

Transformation Descriptions

This section defines the standards to be used for transformation descriptions in the Designer.



● Source Qualifier Descriptions. Should include the aim of the source qualifier and the data it is intended to select.

Should also indicate if any overrides are used. If so, it should describe the filters or settings used. Some projects prefer items
such as the SQL statement to be included in the description as well.

● Lookup Transformation Descriptions. Describe the lookup along the lines of the [lookup attribute] obtained from [lookup
table name] to retrieve the [lookup attribute name].

Where:

❍ Lookup attribute is the name of the column being passed into the lookup and is used as the lookup criteria.
❍ Lookup table name is the table on which the lookup is being performed.
❍ Lookup attribute name is the name of the attribute being returned from the lookup. If appropriate, specify the condition
when the lookup is actually executed.

It is also important to note lookup features such as persistent cache or dynamic lookup.

● Expression Transformation Descriptions. Must adhere to the following format:

“This expression … [explanation of what transformation does].”

Expressions can be distinctly different depending on the situation; therefore the explanation should be specific to the actions
being performed.

Within each Expression, transformation ports have their own description in the format:

“This port … [explanation of what the port is used for].”

● Aggregator Transformation Descriptions. Must adhere to the following format:

“This Aggregator … [explanation of what transformation does].”

Aggregators can be distinctly different, depending on the situation; therefore the explanation should be specific to the actions
being performed.

Within each Aggregator, transformation ports have their own description in the format:

“This port … [explanation of what the port is used for].”

● Sequence Generators Transformation Descriptions. Must adhere to the following format:

“This Sequence Generator provides the next value for the [column name] on the [table name].”

Where:

❍ Table name is the table being populated by the sequence number, and the
❍ Column name is the column within that table being populated.

● Joiner Transformation Descriptions. Must adhere to the following format:

“This Joiner uses … [joining field names] from [joining table names].”

Where:



Joining field names are the names of the columns on which the join is done, and the

Joining table names are the tables being joined.

● Normalizer Transformation Descriptions. Must adhere to the following format:

“This Normalizer … [explanation].”

Where:

❍ explanation describes what the Normalizer does.

● Filter Transformation Descriptions. Must adhere to the following format:

“This Filter processes … [explanation].”

Where:

❍ explanation describes what the filter criteria are and what they do.

● Stored Procedure Transformation Descriptions. Explain the stored procedure’s functionality within the mapping (i.e., what
does it return in relation to the input ports?).

● Mapplet Input Transformation Descriptions. Describe the input values and their intended use in the mapplet.

● Mapplet Output Transformation Descriptions. Describe the output ports and the subsequent use of those values. As an example, for an exchange rate mapplet, describe what currency the output value will be in. Answer questions such as: Is the currency fixed or based on other data? What kind of rate is used? Is it a fixed inter-company rate, an inter-bank rate, a business rate, or a tourist rate? Has the conversion gone through an intermediate currency?

● Update Strategies Transformation Descriptions. Describe the Update Strategy and whether it is fixed in its function or
determined by a calculation.

● Sorter Transformation Descriptions. Explanation of the port(s) that are being sorted and their sort direction.

● Router Transformation Descriptions. Describes the groups and their functions.

● Union Transformation Descriptions. Describe the source inputs and indicate what further processing on those inputs (if
any) is expected to take place in later transformations in the mapping.

● Transaction Control Transformation Descriptions. Describe the process behind the transaction control and the function of
the control to commit or rollback.

● Custom Transformation Descriptions. Describe the function that the custom transformation accomplishes and what data is
expected as input and what data will be generated as output. Also indicate the module name (and location) and the procedure
which is used.

● External Procedure Transformation Descriptions. Describe the function of the external procedure and what data is
expected as input and what data will be generated as output. Also indicate the module name (and location) and the
procedure that is used.

● Java Transformation Descriptions. Describe the function of the java code and what data is expected as input and what data
is generated as output. Also indicate whether the java code determines the object to be an Active or Passive transformation.

● Rank Transformation Descriptions. Indicate the columns being used in the rank, the number of records returned from the
rank, the rank direction, and the purpose of the transformation.

● XML Generator Transformation Descriptions. Describe the data expected for the generation of the XML and indicate the
purpose of the XML being generated.

● XML Parser Transformation Descriptions. Describe the input XML expected and the output from the parser and indicate the purpose of the transformation.

Mapping Comments

These comments describe the source data obtained and the structures (files, tables, or facts and dimensions) that it populates. Remember to use business terms along with such technical details as table names. This is beneficial when maintenance is required or if issues arise that need to be discussed with business analysts.

Mapplet Comments

These comments are used to explain the process that the mapplet carries out. Always be sure to see the notes regarding descriptions
for the input and output transformation.

Repository Objects

Repositories, as well as repository-level objects, should also have meaningful names. Repositories should be prefixed with either ‘L_’ for local or ‘G_’ for global, plus a descriptor. Descriptors usually include information about the project and/or the level of the environment (e.g., PROD, TEST, DEV).

Folders and Groups

Working folder names should be meaningful and include project name and, if there are multiple folders for that one project, a
descriptor. User groups should also include project name and descriptors, as necessary. For example, folder DW_SALES_US and
DW_SALES_UK could both have TEAM_SALES as their user group. Individual developer folders or non-production folders should be prefixed with ‘z_’ so that they are grouped together and not confused with working production folders.

Shared Objects and Folders

Any object within a folder can be shared across folders and maintained in one central location. These objects are sources, targets,
mappings, transformations, and mapplets. To share objects in a folder, the folder must be designated as shared. In addition to
facilitating maintenance, shared folders help reduce the size of the repository since shortcuts are used to link to the original, instead of
copies.

Only users with the proper permissions can access these shared folders. These users are responsible for migrating the folders across
the repositories and, with help from the developers, for maintaining the objects within the folders. For example, if an object is created
by a developer and is to be shared, the developer should provide details of the object and the level at which the object is to be shared
before the Administrator accepts it as a valid entry into the shared folder. The developers, not necessarily the creator, control the
maintenance of the object, since they must ensure that a subsequent change does not negatively impact other objects.

If the developer has an object that he or she wants to use in several mappings or across multiple folders, like an Expression
transformation that calculates sales tax, the developer can place the object in a shared folder. Then use the object in other folders by
creating a shortcut to the object. In this case, the naming convention is ‘sc_’ (e.g., sc_EXP_CALC_SALES_TAX). The folder should be prefixed with ‘SC_’ to identify it as a shared folder and to keep all shared folders grouped together in the repository.

Workflow Manager Objects

Workflow Object: Suggested Naming Convention

Session: s_{MappingName}

Command Object: cmd_{DESCRIPTOR}

Worklet: wk or wklt_{DESCRIPTOR}

Workflow: wkf or wf_{DESCRIPTOR}

Email Task: email_ or eml_{DESCRIPTOR}

Decision Task: dcn_ or dt_{DESCRIPTOR}

Assign Task: asgn_{DESCRIPTOR}

Timer Task: timer_ or tmr_{DESCRIPTOR}

Control Task: ctl_{DESCRIPTOR}. Specify when and how the PowerCenter Server is to stop or abort a workflow by using the Control task in the workflow.

Event Wait Task: wait_ or ew_{DESCRIPTOR}. Waits for an event to occur; once the event triggers, the PowerCenter Server continues executing the rest of the workflow.

Event Raise Task: raise_ or er_{DESCRIPTOR}. Represents a user-defined event. When the PowerCenter Server runs the Event-Raise task, the Event-Raise task triggers the event. Use the Event-Raise task with the Event-Wait task to define events.

ODBC Data Source Names

All Open Database Connectivity (ODBC) data source names (DSNs) should be set up in the same way on all client machines.
PowerCenter uniquely identifies a source by its Database Data Source (DBDS) and its name. The DBDS is the same name as the
ODBC DSN since the PowerCenter Client talks to all databases through ODBC.

Also be sure to set up the ODBC DSNs as system DSNs so that all users of a machine can see the DSN. This approach reduces the chance of discrepancies occurring when users work on different (i.e., colleagues') machines and have to recreate a DSN on a separate machine.

If ODBC DSNs are different across multiple machines, there is a risk of analyzing the same table using different names. For example, machine1 has ODBC DSN Name0 that points to database1. TableA is analyzed on machine1 and is uniquely identified as Name0.TableA in the repository. Machine2 has ODBC DSN Name1 that points to database1. TableA is analyzed on machine2 and is uniquely identified as Name1.TableA in the repository. The result is that the repository may refer to the same object by multiple names, creating confusion for developers, testers, and potentially end users.

Also, refrain from using environment tokens in the ODBC DSN. For example, do not call it dev_db01. When migrating objects from dev,
to test, to prod, PowerCenter can wind up with source objects called dev_db01 in the production repository. ODBC database names
should clearly describe the database they reference to ensure that users do not incorrectly point sessions to the wrong databases.

Database Connection Information

Security considerations may dictate using the company name of the database or project instead of {user}_{database name}, except for
developer scratch schemas, which are not found in test or production environments. Be careful not to include machine names or
environment tokens in the database connection name. Database connection names must be very generic to be understandable and
ensure a smooth migration.

The naming convention should be applied across all development, test, and production environments. This allows seamless migration
of sessions when migrating between environments. If an administrator uses the Copy Folder function for migration, session information
is also copied. If the Database Connection information does not already exist in the folder the administrator is copying to, it is also
copied. So, if the developer uses connections with names like Dev_DW in the development repository, they are likely to eventually
wind up in the test, and even the production repositories as the folders are migrated. Manual intervention is then necessary to change
connection names, user names, passwords, and possibly even connect strings.

Instead, if the developer just has a DW connection in each of the three environments, when the administrator copies a folder from the
development environment to the test environment, the sessions automatically use the existing connection in the test repository. With
the right naming convention, you can migrate sessions from the test to production repository without manual intervention.



TIP
At the beginning of a project, have the Repository Administrator or DBA setup all connections in all environments based on
the issues discussed in this Best Practice. Then use permission options to protect these connections so that only specified
individuals can modify them. Whenever possible, avoid having developers create their own connections using different
conventions and possibly duplicating connections.

Administration Console Objects

Administration console objects such as domains, nodes, and services should also have meaningful names.

Object: Recommended Naming Convention (Example)

Domain: DOM_ or DMN_[PROJECT]_[ENVIRONMENT] (e.g., DOM_PROCURE_DEV)

Node: NODE[#]_[SERVER_NAME]_[optional_descriptor] (e.g., NODE02_SERVER_rs_b, a backup node for the repository service)

Services:

- Integration: INT_SVC_[ENVIRONMENT]_[optional descriptor] (e.g., INT_SVC_DEV_primary)

- Repository: REPO_SVC_[ENVIRONMENT]_[optional descriptor] (e.g., REPO_SVC_TEST)

- Web Services Hub: WEB_SVC_[ENVIRONMENT]_[optional descriptor] (e.g., WEB_SVC_PROD)

PowerCenter PowerExchange Application/Relational Connections

Before the PowerCenter Server can access a source or target in a session, you must configure connections in the Workflow Manager.
When you create or modify a session that reads from, or writes to, a database, you can select only configured source and target
databases. Connections are saved in the repository.

For PowerExchange Client for PowerCenter, you configure relational database and/or application connections. The connection you
configure depends on the type of source data you want to extract and the extraction mode (e.g., PWX[MODE_INITIAL]_[SOURCE]_
[Instance_Name]). The following table shows some examples.

Source Type / Extraction Mode: Application or Relational Connection; Connection Type; Recommended Naming Convention

DB2/390 Bulk Mode: Relational; PWX DB2390; PWXB_DB2_Instance_Name

DB2/390 Change Mode: Application; PWX DB2390 CDC Change; PWXC_DB2_Instance_Name

DB2/390 Real Time Mode: Application; PWX DB2390 CDC Real Time; PWXR_DB2_Instance_Name

IMS Batch Mode: Application; PWX NRDB Batch; PWXB_IMS_Instance_Name

IMS Change Mode: Application; PWX NRDB CDC Change; PWXC_IMS_Instance_Name

IMS Real Time: Application; PWX NRDB CDC Real Time; PWXR_IMS_Instance_Name

Oracle Change Mode: Application; PWX Oracle CDC Change; PWXC_ORA_Instance_Name

Oracle Real Time: Application; PWX Oracle CDC Real Time; PWXR_ORA_Instance_Name

PowerCenter PowerExchange Target Connections

The connection you configure depends on the type of target data you want to load.

Target Type: Connection Type; Recommended Naming Convention

DB2/390: PWX DB2390 relational database connection; PWXT_DB2_Instance_Name

DB2/400: PWX DB2400 relational database connection; PWXT_DB2_Instance_Name

Last updated: 05-Dec-07 16:20



Naming Conventions - Data Quality

Challenge

As with any other development process, the use of clear, consistent, and documented naming conventions
contributes to the effective use of Informatica Data Quality (IDQ). This Best Practice provides suggested naming
conventions for the major structural elements of the IDQ Designer and IDQ Plans.

Description
IDQ Designer

The IDQ Designer is the user interface for the development of IDQ plans.

Each IDQ plan holds the business rules and operations for a distinct process. IDQ plans may be constructed for use
inside the IDQ Designer (a runtime plan), using the athanor-rt command line utility (also runtime), or within an
integration with PowerCenter (a real-time plan).

IDQ requires that each IDQ plan belong to a project. Optionally, plans may be organized in folders within a project.
Folders may be nested to span more than one level.

The organizational structure of IDQ is summarized below.

Element Parent

Repository None. This is the top level organization structure.

Project Repository. There may be multiple projects in a repository.

Folder Project or Folder. Folders may be nested.

Plan Project or Folder.

At any common level of visibility, IDQ requires that all elements have distinct names. Thus no two projects within a
repository may share the same name. Likewise, no two folders at the same level within a project may share the
same name. The rule also applies to plans within the same folder.

IDQ will not permit an element to be renamed if the new name would conflict with an existing element at the same
level. A dialog will explain the error.



To prevent naming conflicts when an element is copied, it will be prefixed with “Copy of “ if it is pasted at the same
level as the source of the copy. If the length of the new name is longer than the allowed length for names of the
type of element, the name will be truncated.

Naming Projects

When a project is created, it will by default have the name “New Project”.

Project naming should be clear and consistent within a repository. The exact approach to naming will vary
depending on an organization’s needs. Suggested naming rules include:

1. Limit project names to 22 characters if possible. The limit imposed by the repository is 30 characters.
Limiting project names to 22 characters allows “Copy of” to be prefixed to copies of a project without
truncating characters.
2. Include enough descriptive information within the project name so an unfamiliar user will have a reasonable
idea of what plans may be included in the project.
3. If plans within a project will operate on only one data source, including the data source in the project name
may be helpful.
4. If abbreviations are used, they should be consistent and documented.

Naming Folders

When a new project is created, by default it will contain four folders, named “Consolidation”, “Matching”, “Profiling”,
and “Standardization”.



This naming convention for folders tracks the major types of IDQ plans.

While the default naming convention may prove satisfactory in many cases, it imposes an organizational structure
for plans that may not be optimal. Therefore, another naming convention may make more sense in a particular
circumstance.

Naming guidelines for folders include:

1. Limit folder names to 42 characters if possible. The limit imposed by the repository is 50 characters.
Limiting folder names to 42 characters allows “Copy of” to be prefixed to copies of a folder without truncating
characters.
2. Include enough descriptive information within the folder name so an unfamiliar user will have a reasonable
idea of what plans may be included in the folder.
3. If abbreviations are used, they should be consistent and documented.

Naming Plans

When a new plan is created, the user is required to select from one of the four main plan classifications, “Analysis”,
“Matching”, “Standardization”, or “Consolidation”. By default, the new plan name will correspond to the option
selected.



Including the plan type as part of the plan name is helpful in describing what the plan does. Other suggested
naming rules include:

1. Limit plan names to 42 characters if possible. The limit imposed by the repository is 50 characters. Limiting
plan names to 42 characters allows “Copy of” to be prefixed to copies of a plan without truncating
characters.
2. Include enough descriptive information within the plan name so an unfamiliar user will have a reasonable
idea of what the plan does at a high level.
3. While the project and folder structure will be visible within the IDQ Designer and will be required when using
athanor-rt, it is not as readily visible within PowerCenter. Therefore, repetition of the information conveyed
by the project and folder names may be advisable.
4. If abbreviations are used, they should be consistent and documented.

Naming Components

Within the Designer, component types may be identified by their unique icons as well as by hovering over a
component with a mouse.



However, the component has no visible name at this level. It is only after opening a component for viewing that the
component’s name becomes visible.

It is suggested that component names be prefixed with an acronym identifying the component type. While less
critical than field naming, as discussed below, using a prefix allows for consistent naming, for clarity, and it makes
field naming more efficient in some cases.

Suggested prefixes are listed below.

Component Prefix

Address Validator AV_

Bigram BG_

Character Labeller CL_



Context Parser CP_

Edit Distance ED_

Hamming Distance HD_

Jaro Distance JD_

Merge MG_

Mixed Field Matcher MFM_

Nysiis NYS_

Profile Standardizer PS_

Rule Based Analyzer RBA_

Scripting SC_

Search Replace SR_

Soundex SX_

Splitter SPL_

To Upper TU_

Token Labeller TL_

Token Parser TP_

Weight Based Analyzer WBA_

Word Manager WM_

In addition, names for components should take into account the following suggested rules:

1. Limit names to a reasonably short length. A limit of 32 characters is suggested. In many cases, component
names are also useful for field names, and databases limit field lengths at varying sizes.
2. Consider using the name of the input field or at least the field type.
3. Consider limiting names to alphabetic characters, spaces, underscores, and numbers. This will make the
corresponding field names compatible with most likely output destinations.
4. If the component type abbreviation itself is not sufficient to identify what the component does, include an
identifier for the function of the component in its name.
5. If abbreviations are used, they should be consistent and documented.



Naming Dictionaries

Dictionaries may be given any name suitable for the operating system on which they will be used.

It is suggested that dictionary naming consider the following rules:

1. Limit dictionary names to characters permitted by the operating system. If a dictionary is to be used on both
Windows and UNIX, avoid using spaces.
2. If a dictionary supplied by Informatica is to be modified, it is suggested that the dictionary be renamed and/
or moved to a new folder. This will avoid accidentally overwriting the modifications when an update is
installed.
3. If abbreviations are used, they should be consistent and documented.

Naming Fields

Careful field naming is probably the most critical standard to follow when using IDQ.

● IDQ requires that all fields output by components have unique names; a name cannot be carried through
from component to component.
● The power of IDQ leads to complex plans with many components.
● IDQ does not have the data lineage feature of PowerCenter, so the component name is the clearest
indicator of the source of an input component when a plan is being examined.

With those considerations in mind, the following naming rules are suggested:

1. Prefix each output field name with the type of component.

Component Prefix

Address Validator AV_

Bigram BG_

Character Labeller CL_

Context Parser CP_

Edit Distance ED_

Hamming Distance HD_

Jaro Distance JD_

Merge MG_

Mixed Field Matcher MFM_



Nysiis NYS_

Profile Standardizer PS_

Rule Based Analyzer RBA_

Scripting SC_

Search Replace SR_

Soundex SX_

Splitter SPL_

To Upper TU_

Token Labeller TL_

Token Parser TP_

Weight Based Analyzer WBA_

Word Manager WM_

2. Use meaningful field names, with consistent, documented abbreviations.


3. Use consistent casing.
4. While it is possible to rename output fields in sink components, this practice should be avoided when
practical, since there is no convenient way to determine which source field provides data to the renamed
output field.

Last updated: 04-Jun-08 18:50



Performing Incremental Loads

Challenge

Data warehousing incorporates very large volumes of data. The process of loading the
warehouse in a reasonable timescale without compromising its functionality is
extremely difficult. The goal is to create a load strategy that can minimize downtime for
the warehouse and allow quick and robust data management.

Description

As time windows shrink and data volumes increase, it is important to understand the
impact of a suitable incremental load strategy. The design should allow data to be
incrementally added to the data warehouse with minimal impact on the overall system.
This Best Practice describes several possible load strategies.

Incremental Aggregation

Incremental aggregation is useful for applying incrementally-captured changes in the
source to aggregate calculations in a session.

If the source changes only incrementally, and you can capture those changes, you can
configure the session to process only those changes with each run. This allows the
PowerCenter Integration Service to update the target incrementally, rather than forcing
it to process the entire source and recalculate the same calculations each time you run
the session.

If the session performs incremental aggregation, the PowerCenter Integration Service
saves index and data cache information to disk when the session finishes. The next
time the session runs, the PowerCenter Integration Service uses this historical
information to perform the incremental aggregation. To utilize this functionality set the
“Incremental Aggregation” Session attribute. For details see Chapter 24 in the
Workflow Administration Guide.

Use incremental aggregation under the following conditions:

● Your mapping includes an aggregate function.


● The source changes only incrementally.

● You can capture incremental changes (i.e., by filtering source data by
timestamp).
● You get only delta records (i.e., you may have implemented the CDC (Change
Data Capture) feature of PowerExchange).

Do not use incremental aggregation in the following circumstances:

● You cannot capture new source data.


● Processing the incrementally-changed source significantly changes the target.
If processing the incrementally-changed source alters more than half the
existing target, the session may not benefit from using incremental
aggregation.
● Your mapping contains percentile or median functions.

Some conditions that may help in making a decision on an incremental strategy include:

● Error handling, loading and unloading strategies for recovering, reloading, and
unloading data.
● History tracking requirements for keeping track of what has been loaded and
when
● Slowly-changing dimensions. Informatica Mapping Wizards are a good start to
an incremental load strategy. The Wizards generate generic mappings as a
starting point (refer to Chapter 15 in the Designer Guide)

Source Analysis

Data sources typically fall into the following possible scenarios:

● Delta records. Records supplied by the source system include only new or
changed records. In this scenario, all records are generally inserted or updated
into the data warehouse.
● Record indicator or flags. Records that include columns that specify the
intention of the record to be populated into the warehouse. Records can be
selected based upon this flag for all inserts, updates, and deletes.
● Date stamped data. Data is organized by timestamps, and loaded into the
warehouse based upon the last processing date or the effective date range.
● Key values are present. When only key values are present, data must be
checked against what has already been entered into the warehouse. All values
must be checked before entering the warehouse.

● No key values present. When no key values are present, surrogate keys are
created and all data is inserted into the warehouse based upon validity of the
records.

Identify Records for Comparison

After the sources are identified, you need to determine which records need to be
entered into the warehouse and how. Here are some considerations:

● Compare with the target table. When source delta loads are received,
determine if the record exists in the target table. The timestamps and natural
keys of the record are the starting point for identifying whether the record is
new, modified, or should be archived. If the record does not exist in the target,
insert the record as a new row. If it does exist, determine if the record needs to
be updated, inserted as a new record, removed (deleted from the target), or
filtered out and not added to the target.
● Record indicators. Record indicators can be beneficial when lookups into the
target are not necessary. Take care to ensure that the record exists for update
or delete scenarios, or does not exist for successful inserts. Some design
effort may be needed to manage errors in these situations.

Determine Method of Comparison

There are four main strategies in mapping design that can be used as a method of
comparison:

● Joins of sources to targets. Records are directly joined to the target using
Source Qualifier join conditions or using Joiner transformations after the
Source Qualifiers (for heterogeneous sources). When using Joiner
transformations, take care to ensure the data volumes are manageable and
that the smaller of the two datasets is configured as the Master side of the join.
● Lookup on target. Using the Lookup transformation, lookup the keys or
critical columns in the target relational database. Consider the caches and
indexing possibilities.
● Load table log. Generate a log table of records that have already been
inserted into the target system. You can use this table for comparison with
lookups or joins, depending on the need and volume. For example, store keys
in a separate table and compare source records against this log table to
determine load strategy. Another example is to store the dates associated with
the data already loaded into a log table.
● MD5 checksum function. Generate a unique value for each row of data and
then compare previous and current unique checksum values to determine
whether the record has changed.
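
For illustration, here is a minimal sketch of the checksum comparison written as ports in an Expression transformation. It assumes the MD5() expression function is available in your PowerCenter version; the port names (CUST_NAME, CUST_ADDR, PREV_CHECKSUM) are hypothetical, and PREV_CHECKSUM would typically come from a Lookup on the target or on a load log table.

v_NEW_CHECKSUM (variable port):  MD5(CUST_NAME || '|' || CUST_ADDR)
o_CHANGED_FLAG (output port):    IIF(v_NEW_CHECKSUM != PREV_CHECKSUM, 'Y', 'N')

Rows flagged 'Y' can then be routed to an update path, while rows with no matching PREV_CHECKSUM are treated as inserts.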

Source-Based Load Strategies

Complete Incremental Loads in a Single File/Table

The simplest method for incremental loads is from flat files or a database in which all
records are going to be loaded. This strategy requires bulk loads into the warehouse
with no overhead on processing of the sources or sorting the source records.

Data can be loaded directly from the source locations into the data warehouse. There is
no additional overhead produced in moving these sources into the warehouse.

Date-Stamped Data

This method involves data that has been stamped using effective dates or sequences.
The incremental load can be determined by dates greater than the previous load date
or data that has an effective key greater than the last key processed.

With the use of relational sources, the records can be selected based on this effective
date and only those records past a certain date are loaded into the warehouse. Views
can also be created to perform the selection criteria. This way, the processing does not
have to be incorporated into the mappings but is kept on the source component.
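
As an illustration of keeping the selection criteria on the source side, the following is a minimal sketch of such a view; the object names (V_ORDERS_DELTA, ORDERS, LAST_UPDATE_DT, ETL_CONTROL) are hypothetical, and the last-load date is assumed to be maintained in a control table.

CREATE OR REPLACE VIEW V_ORDERS_DELTA AS
SELECT *
  FROM ORDERS
 WHERE LAST_UPDATE_DT > (SELECT LAST_LOAD_DT
                           FROM ETL_CONTROL
                          WHERE SUBJECT_AREA = 'ORDERS');

The mapping then reads from V_ORDERS_DELTA instead of the base table, and the control table is updated after each successful load.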

Placing the load strategy logic in the mapping components instead is more flexible,
because it is controlled by the Data Integration developers and captured in the associated metadata.

To compare the effective dates, you can use mapping variables to provide the previous
date processed (see the description below). An alternative to Repository-maintained
mapping variables is the use of control tables to store the dates and update the control
table after each load.

Non-relational data can be filtered as records are loaded based upon the effective
dates or sequenced keys. A Router transformation or filter can be placed after the
Source Qualifier to remove old records.
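
For example, a minimal sketch of such a Filter transformation condition, assuming a CREATE_DATE source field and the $$INCREMENT_DATE mapping variable described in the Using Mapping Variables section below:

CREATE_DATE > $$INCREMENT_DATE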

Changed Data Based on Keys or Record Information

Data that is uniquely identified by keys can be sourced according to selection criteria.
For example, records that contain primary keys or alternate keys can be used to
determine if they have already been entered into the data warehouse. If they exist, you

INFORMATICA CONFIDENTIAL Velocity v8 Methodology - Data Warehousing 715 of 1017


can also check to see if you need to update these records or discard the source record.

It may be possible to perform a join with the target tables in which new data can be
selected and loaded into the target. It may also be feasible to lookup in the target to
see if the data exists.

Target-Based Load Strategies

● Loading directly into the target. Loading directly into the target is possible
when the data is going to be bulk loaded. The mapping is then responsible for
error control, recovery, and update strategy.
● Load into flat files and bulk load using an external loader. The
mapping loads data directly into flat files. You can then invoke an external
loader to bulk load the data into the target. This method reduces the load times
(with less downtime for the data warehouse) and provides a means of
maintaining a history of data being loaded into the target. Typically, this
method is only used for updates into the warehouse.
● Load into a mirror database. The data is loaded into a mirror database to
avoid downtime of the active data warehouse. After data has been loaded, the
databases are switched, making the mirror the active database and the active
the mirror.

Using Mapping Variables

You can use a mapping variable to perform incremental loading. By referencing a date-
based mapping variable in the Source Qualifier or join condition, it is possible to select
only those rows with a date greater than the previously captured date (i.e., the newly inserted
source data). However, the source system must have a reliable date to use.

The steps involved in this method are:

Step 1: Create mapping variable

In the Mapping Designer, choose Mappings > Parameters > Variables. Or, to create
variables for a mapplet, choose Mapplet > Parameters > Variables in the Mapplet
Designer.

Click Add and enter the name of the variable (i.e., $$INCREMENT_DATE). In this case,
make your variable a date/time. For the Aggregation option, select MAX.

In the same screen, state your initial value. This date is used during the initial run of the
session and as such should represent a date earlier than the earliest desired data. The
date can use any one of these formats:

● MM/DD/RR
● MM/DD/RR HH24:MI:SS
● MM/DD/YYYY
● MM/DD/YYYY HH24:MI:SS

Step 2: Reference the mapping variable in the Source Qualifier

The select statement should look like the following:

Select * from table_A
where CREATE_DATE > TO_DATE('$$INCREMENT_DATE', 'MM-DD-YYYY HH24:MI:SS')

Step 3: Refresh the mapping variable for the next session run using
an Expression Transformation

Use an Expression transformation and the pre-defined variable functions to set and use
the mapping variable.

In the expression transformation, create a variable port and use the SETMAXVARIABLE
variable function to capture the maximum source date selected during each run.

SETMAXVARIABLE($$INCREMENT_DATE,CREATE_DATE)

CREATE_DATE in this example is the date field from the source that should be used to
identify incremental rows.

You can use the variable functions in the following transformations:

● Expression
● Filter
● Router
● Update Strategy

As the session runs, the variable is refreshed with the max date value encountered
between the source and variable. So, if one row comes through with 9/1/2004, then the
variable gets that value. If all subsequent rows are LESS than that, then 9/1/2004 is
preserved.

Note: This behavior has no effect on the date used in the Source Qualifier. The initial
select always contains the maximum date value encountered during the previous
successful session run.

When the mapping completes, the PERSISTENT value of the mapping variable is
stored in the repository for the next run of your session. You can view the value of the
mapping variable in the session log file.

The advantage of the mapping variable and incremental loading is that it allows the
session to use only the new rows of data. No table is needed to store the max(date)
since the variable takes care of it.

After a successful session run, the PowerCenter Integration Service saves the final
value of each variable in the repository. So when you run your session the next time,
only new data from the source system is captured. If necessary, you can override the
value saved in the repository with a value saved in a parameter file.
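
For example, a minimal sketch of such a parameter file entry; the folder, workflow, and session names are hypothetical:

[DW_FOLDER.WF:wf_incremental_load.ST:s_m_load_orders]
$$INCREMENT_DATE=01/01/2004 00:00:00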

Using PowerExchange Change Data Capture

PowerExchange (PWX) Change Data Capture (CDC) greatly simplifies the
identification, extraction, and loading of change records. It supports all key mainframe
and midrange database systems, requires no changes to the user application, uses
vendor-supplied technology where possible to capture changes, and eliminates the
need for programming or the use of triggers. Once PWX CDC collects changes, it
places them in a “change stream” for delivery to PowerCenter. Included in the change
data is useful control information, such as the transaction type (insert/update/delete)
and the transaction timestamp. In addition, the change data can be made available
immediately (i.e., in real time) or periodically (i.e., where changes are condensed).

The native interface between PowerCenter and PowerExchange is PowerExchange
Client for PowerCenter (PWXPC). PWXPC enables PowerCenter to pull the change
data from the PWX change stream if real-time consumption is needed or from PWX
condense files if periodic consumption is required. The changes are applied directly. So
if the action flag is “I”, the record is inserted. If the action flag is “U”, the record is
updated. If the action flag is “D”, the record is deleted. There is no need for change
detection logic in the PowerCenter mapping.

In addition, by leveraging “group source” processing, where multiple sources are
placed in a single mapping, the PowerCenter session reads the committed changes for
multiple sources in a single efficient pass, and in the order they occurred. The changes
are then propagated to the targets, and upon session completion, restart tokens
(markers) are written out to a PowerCenter file so that the next session run knows the
point to extract from.

Tips for Using PWX CDC

● After installing PWX, ensure the PWX Listener is up and running and that
connectivity is established to the Listener. For best performance, the Listener
should be co-located with the source system.

● In the PWX Navigator client tool, use metadata to configure data access. This
means creating data maps for the non-relational to relational view of
mainframe sources (such as IMS and VSAM) and capture registrations for all
sources (mainframe, Oracle, DB2, etc). Registrations define the specific tables
and columns desired for change capture. There should be one registration per
source. Group the registrations logically, for example, by source database.

● For an initial test, make changes in the source system to the registered
sources. Ensure that the changes are committed.

● Still working in PWX Navigator (and before using PowerCenter), perform Row
Tests to verify the returned change records, including the transaction action
flag (the DTL__CAPXACTION column) and the timestamp. Set the required
access mode: CAPX for change and CAPXRT for real time. Also, if desired,
edit the PWX extraction maps to add the Change Indicator (CI) column. This
CI flag (Y or N) allows for field level capture and can be filtered in the
PowerCenter mapping.

● Use PowerCenter to materialize the targets (i.e., to ensure that sources and
targets are in sync prior to starting the change capture process). This can be
accomplished with a simple pass-through “batch” mapping. This same bulk
mapping can be reused for CDC purposes, but only if specific CDC columns
are not included, and by changing the session connection/mode.

● Import the PWX extraction maps into Designer. This requires the PWXPC
component. Specify the CDC Datamaps option during the import.

● Use “group sourcing” to create the CDC mapping by including multiple sources
in the mapping. This enhances performance because only one read/
connection is made to the PWX Listener and all changes (for the sources in
the mapping) are pulled at one time.

● Keep the CDC mappings simple. There are some limitations; for instance, you
cannot use active transformations. In addition, if loading to a staging area,
store the transaction types (i.e., insert/update/delete) and the timestamp for
subsequent processing downstream. Also, if loading to a staging area, include
an Update Strategy transformation in the mapping with DD_INSERT or
DD_UPDATE in order to override the default behavior and store the action
flags.

● Set up the Application Connection in Workflow Manager to be used by the
CDC session. This requires the PWXPC component. There should be one
connection and token file per CDC mapping/session. Set the UOW (unit of
work) to a low value for faster commits to the target for real-time sessions.
Specify the restart token location and file on the PowerCenter Integration
Service (within the infa_shared directory) and specify the location of the PWX
Listener.

● In the CDC session properties, enable session recovery (i.e., set the Recovery
Strategy to “Resume from last checkpoint”).

● Use post-session commands to archive the restart token files for restart/
recovery purposes. Also, archive the session logs.

Last updated: 01-Feb-07 18:53

Real-Time Integration with PowerCenter

Challenge

Configure PowerCenter to work with various PowerExchange data access products to process real-time
data. This Best Practice discusses guidelines for establishing a connection with PowerCenter and setting
up a real-time session to work with PowerCenter.

Description

PowerCenter with real-time option can be used to process data from real-time data sources. PowerCenter
supports the following types of real-time data:

● Messages and message queues. PowerCenter with the real-time option can be used to
integrate third-party messaging applications using a specific PowerExchange data access
product. Each PowerExchange product supports a specific industry-standard messaging
application, such as WebSphere MQ, JMS, MSMQ, SAP NetWeaver, TIBCO, and webMethods.
You can read from messages and message queues and write to messages, messaging
applications, and message queues. WebSphere MQ uses a queue to store and exchange data.
Other applications, such as TIBCO and JMS, use a publish/subscribe model. In this case, the
message exchange is identified using a topic.
● Web service messages. PowerCenter can receive a web service message from a web service
client through the Web Services Hub, transform the data, and load the data to a target or send a
message back to a web service client. A web service message is a SOAP request from a web
service client or a SOAP response from the Web Services Hub. The Integration Service
processes real-time data from a web service client by receiving a message request through the
Web Services Hub and processing the request. The Integration Service can send a reply back to
the web service client through the Web Services Hub or write the data to a target.
● Changed source data. PowerCenter can extract changed data in real time from a source table
using the PowerExchange Listener and write data to a target. Real-time sources supported by
PowerExchange are ADABAS, DATACOM, DB2/390, DB2/400, DB2/UDB, IDMS, IMS, MS SQL
Server, Oracle and VSAM.

Connection Setup

PowerCenter uses some attribute values in order to correctly connect and identify the third-party
messaging application and message itself. Each PowerExchange product supplies its own connection
attributes that need to be configured properly before running a real-time session.

Setting Up Real-Time Session in PowerCenter

The PowerCenter real-time option uses a zero latency engine to process data from the messaging
system. Depending on the messaging systems and the application that sends and receives messages,
there may be a period when there are many messages and, conversely, there may be a period when
there are no messages. PowerCenter uses the attribute ‘Flush Latency’ to determine how often the
messages are being flushed to the target. PowerCenter also provides various attributes to control when
the session ends.

The following reader attributes determine when a PowerCenter session should end:

● Message Count - Controls the number of messages the PowerCenter Server reads from the
source before the session stops reading from the source.
● Idle Time - Indicates how long the PowerCenter Server waits when no messages arrive before it
stops reading from the source.
● Time Slice Mode - Indicates a specific range of time during which the server reads messages from the
source. Only PowerExchange for WebSphere MQ uses this option.
● Reader Time Limit - Indicates the number of seconds the PowerCenter Server spends reading
messages from the source.

The specific filter conditions and options available to you depend on which Real-Time source is being
used. For example, consider the attributes for PowerExchange for DB2 for i5/OS:

Set the attributes that control how the reader ends. One or more attributes can be used to control the end
of session.

For example, if the Reader Time Limit attribute is set to 3600, the reader ends after 3600 seconds. If the
Idle Time limit is set to 500 seconds, the reader ends if it does not process any changes for 500 seconds
(i.e., it remains idle for 500 seconds).

If more than one attribute is selected, the first attribute that satisfies the condition is used to control the
end of session.

Note: The real-time attributes can be found in the Reader Properties for PowerExchange for JMS,
TIBCO, webMethods, and SAP iDoc. For PowerExchange for WebSphere MQ , the real-time attributes
must be specified as a filter condition.

The next step is to set the Real-time Flush Latency attribute. The Flush Latency defines how often
PowerCenter should flush messages, expressed in milliseconds.

For example, if the Real-time Flush Latency is set to 2000, PowerCenter flushes messages every two
seconds. The messages will also be flushed from the reader buffer if the Source Based Commit condition
is reached. The Source Based Commit condition is defined in the Properties tab of the session.

The message recovery option can be enabled to ensure that no messages are lost if a session fails as a
result of unpredictable error, such as power loss. This is especially important for real-time sessions
because some messaging applications do not store the messages after the messages are consumed by
another application.

A unit of work (UOW) is a collection of changes within a single commit scope made by a transaction on
the source system from an external application. Each UOW may consist of a different number of rows
depending on the transaction to the source system. When you use the UOW Count Session condition, the
Integration Service commits source data to the target when it reaches the number of UOWs specified in
the session condition.

For example, if the value for UOW Count is 10, the Integration Service commits all data read from the
source after the 10th UOW enters the source. The lower you set the value, the faster the Integration
Service commits data to the target. The lower value also causes the system to consume more resources.

Executing a Real-Time Session

A real-time session often has to be up and running continuously to listen to the messaging application and
to process messages immediately after the messages arrive. Set the reader attribute Idle Time to -1 and
Flush Latency to a specific time interval. This is applicable for all PowerExchange products except for
PowerExchange for WebSphere MQ where the session continues to run and flush the messages to the
target using the specific flush latency interval.

Another scenario is the ability to read data from another source system and immediately send it to a real-
time target. For example, reading data from a relational source and writing it to WebSphere MQ. In this
case, set the session to run continuously so that every change in the source system can be immediately
reflected in the target.

A real-time session may run continuously until a condition is met to end the session. In some situations it
may be required to periodically stop the session and restart it. This is sometimes necessary to execute a
post-session command or run some other process that is not part of the session. To stop the session and
restart it, it is useful to deploy continuously running workflows. The Integration Service starts the next run
of a continuous workflow as soon as it completes the first.

To set a workflow to run continuously, edit the workflow and select the ‘Scheduler’ tab. Edit the
‘Scheduler’ and select ‘Run Continuously’ from ‘Run Options’. A continuous workflow starts automatically
when the Integration Service initializes. When the workflow stops, it restarts immediately.

Real-Time Sessions and Active Transformations

Some of the transformations in PowerCenter are ‘active transformations’, which means that the number of
input rows and output rows of the transformation may differ. In most cases, an active transformation
requires all of the input rows to be processed before it passes output rows to the next transformation
or target. For a real-time session, the flush latency is ignored if the DTM needs to wait for all the rows to
be processed.

Depending on user needs, active transformations such as Aggregator, Rank, and Sorter can be used in a real-
time session by setting the Transaction Scope property in the active transformation to ‘Transaction’. This
signals the session to process the data in the transformation for every transaction. For example, if a real-time
session uses an Aggregator that sums a field of the input, the summation is done per transaction,
as opposed to across all rows. The result may or may not be correct, depending on the requirement. Use an
active transformation in a real-time session only if you want to process the data per transaction.

Custom transformations can also be defined to handle data per transaction so that they can be used in a
real-time session.

PowerExchange Real Time Connections

PowerExchange NRDB CDC Real Time connections can be used to extract changes from ADABAS,
DATACOM, IDMS, IMS and VSAM sources in real time.

The DB2/390 connection can be used to extract changes for DB2 on OS/390 and the DB2/400 connection
to extract from AS/400. There is a separate connection to read from DB2 UDB in real time.

The NRDB CDC connection requires the application name and the restart token file name to be
overridden for every session. When the PowerCenter session completes, the PowerCenter Server writes
the last restart token to a physical file called the RestartToken File. The next time the session starts, the
PowerCenter Server reads the restart token from the file and then starts reading changes from the point
where it last left off. Every PowerCenter session needs to have a unique restart token filename.

Informatica recommends archiving the file periodically. The reader timeout or the idle timeout can be used
to stop a real-time session. A post-session command can be used to archive the RestartToken file.

The encryption mode for this connection can slow down the read performance and increase resource
consumption. Compression mode can help in situations where the network is a bottleneck; using
compression also increases the CPU and memory usage on the source system.

Archiving PowerExchange Tokens

When the PowerCenter session completes, the Integration Service writes the last restart token to a
physical file called the RestartToken File. The token in the file indicates the end point where the read job
ended. The next time the session starts, the PowerCenter Server reads the restart token from the file and
then starts reading changes from the point where it left off. The token file is overwritten each time the
session has to write a token out. PowerCenter does not implicitly maintain an archive of these tokens.

If, for some reason, the changes from a particular point in time have to be “replayed”, we need the
PowerExchange token from that point in time.

To enable such a process, it is a good practice to periodically copy the token file to a backup folder. This
procedure is necessary to maintain an archive of the PowerExchange tokens. A real-time PowerExchange
session may be stopped periodically, using either the reader time limit or the idle time limit. A post-session
command is used to copy the restart token file to an archive folder. The session will be part of a
continuous running workflow, so when the session completes after the post session command, it
automatically restarts again. From a data processing standpoint very little changes; the process pauses
for a moment, archives the token, and starts again.

The following are examples of post-session commands that can be used to copy a restart token file
(session.token) and append the current system date/time to the file name for archive purposes:

UNIX:

cp session.token session`date '+%m%d%H%M'`.token

Windows:

copy session.token session-%date:~4,2%-%date:~7,2%-%date:~10,4%-%time:~0,2%-%time:~3,2%.token

PowerExchange for WebSphere MQ

1. In the Workflow Manager, connect to a repository and choose Connections > Queue.
2. The Queue Connection Browser appears. Select New > Message Queue.
3. The Connection Object Definition dialog box appears.

You need to specify three attributes in the Connection Object Definition dialog box:

● Name - the name for the connection. (Use <queue_name>_<QM_name> to uniquely identify the
connection.)
● Queue Manager - the Queue Manager name for the message queue. (in Windows, the default
Queue Manager name is QM_<machine name>)
● Queue Name - the Message Queue name

To obtain the Queue Manager and Message Queue names:

● Open the MQ Series Administration Console. The Queue Manager should appear on the left
panel
● Expand the Queue Manager icon. A list of the queues for the queue manager appears on the left
panel

Note that the Queue Manager’s name and Queue Name are case-sensitive.

PowerExchange for JMS

PowerExchange for JMS can be used to read messages from or write messages to various JMS providers,
such as WebSphere MQ JMS and BEA WebLogic Server.

There are two types of JMS application connections:

● JNDI Application Connection, which is used to connect to a JNDI server during a session run.
● JMS Application Connection, which is used to connect to a JMS provider during a session run.

JNDI Application Connection Attributes are:

● Name
● JNDI Context Factory
● JNDI Provider URL
● JNDI UserName
● JNDI Password
● JMS Application Connection

JMS Application Connection Attributes are:

● Name
● JMS Destination Type
● JMS Connection Factory Name
● JMS Destination
● JMS UserName
● JMS Password

Configuring the JNDI Connection for WebSphere MQ

The JNDI settings for WebSphere MQ JMS can be configured using a file system service or LDAP
(Lightweight Directory Access Protocol).

The JNDI setting is stored in a file named JMSAdmin.config. The file should be installed in the
WebSphere MQ Java installation/bin directory.

If you are using a file system service provider to store your JNDI settings, remove the number sign (#)
before the following context factory setting:

INITIAL_CONTEXT_FACTORY=com.sun.jndi.fscontext.RefFSContextFactory

Or, if you are using the LDAP service provider to store your JNDI settings, remove the number sign (#)
before the following context factory setting:

INITIAL_CONTEXT_FACTORY=com.sun.jndi.ldap.LdapCtxFactory

Find the PROVIDER_URL settings.

If you are using a file system service provider to store your JNDI settings, remove the number sign (#)
before the following provider URL setting and provide a value for the JNDI directory.

PROVIDER_URL=file:/<JNDI directory>

<JNDI directory> is the directory where you want JNDI to store the .binding file.

Or, if you are using the LDAP service provider to store your JNDI settings, remove the number sign (#)
before the provider URL setting and specify a hostname.

PROVIDER_URL=ldap://<hostname>/context_name

For example, you can specify:

PROVIDER_URL=ldap://<localhost>/o=infa,c=rc

If you want to provide a user DN and password for connecting to JNDI, you can remove the # from the
following settings and enter a user DN and password:

PROVIDER_USERDN=cn=myname,o=infa,c=rc
PROVIDER_PASSWORD=test

The following table shows the JMSAdmin.config settings and the corresponding attributes in the JNDI
application connection in the Workflow Manager:

JMSAdmin.config Settings: JNDI Application Connection Attribute

INITIAL_CONTEXT_FACTORY JNDI Context Factory

PROVIDER_URL JNDI Provider URL

PROVIDER_USERDN JNDI UserName

PROVIDER_PASSWORD JNDI Password

Configuring the JMS Connection for WebSphere MQ

The JMS connection is defined using a tool in JMS called jmsadmin, which is available in the WebSphere
MQ Java installation/bin directory. Use this tool to configure the JMS Connection Factory.

The JMS Connection Factory can be a Queue Connection Factory or Topic Connection Factory.

● When Queue Connection Factory is used, define a JMS queue as the destination.
● When Topic Connection Factory is used, define a JMS topic as the destination.

The command to define a queue connection factory (qcf) is:

def qcf(<qcf_name>) qmgr(queue_manager_name) hostname(QM_machine_hostname) port(QM_machine_port)

The command to define JMS queue is:

def q(<JMS_queue_name>) qmgr(queue_manager_name) qu(queue_manager_queue_name)

The command to define JMS topic connection factory (tcf) is:

def tcf(<tcf_name>) qmgr(queue_manager_name) hostname(QM_machine_hostname) port(QM_machine_port)

The command to define the JMS topic is:

def t(<JMS_topic_name>) topic(pub/sub_topic_name)

The topic name must be unique. For example: topic (application/infa)

The following table shows the JMS object types and the corresponding attributes in the JMS application
connection in the Workflow Manager:

JMS Object Types                                   JMS Application Connection Attribute

QueueConnectionFactory or TopicConnectionFactory   JMS Connection Factory Name

JMS Queue Name or JMS Topic Name                   JMS Destination

Configure the JNDI and JMS Connection for WebSphere

Configure the JNDI settings for WebSphere to use WebSphere as a provider for JMS sources or targets in
a PowerCenterRT session.

JNDI Connection

Add the following option to the file JMSAdmin.bat to configure JMS properly:

-Djava.ext.dirs=<WebSphere Application Server>bin

For example: -Djava.ext.dirs=WebSphere\AppServer\bin

The JNDI connection resides in the JMSAdmin.config file, which is located in the MQ Series Java/bin
directory.

INITIAL_CONTEXT_FACTORY=com.ibm.websphere.naming.wsInitialContextFactory

PROVIDER_URL=iiop://<hostname>/

For example:

PROVIDER_URL=iiop://localhost/

PROVIDER_USERDN=cn=informatica,o=infa,c=rc
PROVIDER_PASSWORD=test

JMS Connection

The JMS configuration is similar to the JMS Connection for WebSphere MQ.

Configure the JNDI and JMS Connection for BEA WebLogic

Configure the JNDI settings for BEA WebLogic to use BEA WebLogic as a provider for JMS sources or
targets in a PowerCenterRT session.

PowerCenter Connect for JMS and the WebLogic server hosting JMS do not need to be on the same
machine. PowerCenter Connect for JMS just needs a URL, as long as the URL points to the right place.

JNDI Connection

The WebLogic Server automatically provides a context factory and URL during the JNDI set-up
configuration for WebLogic Server. Enter these values to configure the JNDI connection for JMS sources
and targets in the Workflow Manager.

Enter the following value for JNDI Context Factory in the JNDI Application Connection in the Workflow
Manager:

weblogic.jndi.WLInitialContextFactory

Enter the following value for JNDI Provider URL in the JNDI Application Connection in the Workflow
Manager:

t3://<WebLogic_Server_hostname>:<port>

where WebLogic Server hostname is the hostname or IP address of the WebLogic Server and port is the
port number for the WebLogic Server.

JMS Connection

The JMS connection is configured from the BEA WebLogic Server console. Select JMS -> Connection
Factory.

The JMS Destination is also configured from the BEA WebLogic Server console.

From the Console pane, select Services > JMS > Servers > <JMS Server name> > Destinations under
your domain.

Click Configure a New JMSQueue or Configure a New JMSTopic.

The following table shows the JMS object types and the corresponding attributes in the JMS application
connection in the Workflow Manager:

WebLogic Server JMS Object              JMS Application Connection Attribute

Connection Factory Settings: JNDIName   JMS Connection Factory Name

Destination Settings: JNDIName          JMS Destination

In addition to JNDI and JMS setting, BEA WebLogic also offers a function called JMS Store, which can be
used for persistent messaging when reading and writing JMS messages. The JMS Stores configuration is
available from the Console pane: select Services > JMS > Stores under your domain.

Configuring the JNDI and JMS Connection for TIBCO

TIBCO Rendezvous Server does not adhere to JMS specifications. As a result, PowerCenter Connect for
JMS can’t connect directly with the Rendezvous Server. TIBCO Enterprise Server, which is JMS-
compliant, acts as a bridge between the PowerCenter Connect for JMS and TIBCO Rendezvous Server.
Configure a connection-bridge between TIBCO Rendezvous Server and TIBCO Enterprise Server for
PowerCenter Connect for JMS to be able to read messages from and write messages to TIBCO
Rendezvous Server.

To create a connection-bridge between PowerCenter Connect for JMS and TIBCO Rendezvous Server,
follow these steps:

1. Configure PowerCenter Connect for JMS to communicate with TIBCO Enterprise Server.
2. Configure TIBCO Enterprise Server to communicate with TIBCO Rendezvous Server.

Configure the following information in your JNDI application connection:

● JNDI Context Factory: com.tibco.tibjms.naming.TibjmsInitialContextFactory
● Provider URL: tibjmsnaming://<host>:<port>, where host and port are the host name and port
number of the Enterprise Server.

To make a connection-bridge between TIBCO Rendezvous Server and TIBCO Enterprise Server:

1. In the file tibjmsd.conf, enable the tibrv transport configuration parameter as in the example below,
so that TIBCO Enterprise Server can communicate with TIBCO Rendezvous messaging systems:

tibrv_transports = enabled

2. Enter the following transports in the transports.conf file:

[RV]
type = tibrv // type of external messaging system
topic_import_dm = TIBJMS_RELIABLE // only reliable/certified messages can transfer
daemon = tcp:localhost:7500 // default daemon for the Rendezvous server

The transports in the transports.conf configuration file specify the communication protocol between
TIBCO Enterprise for JMS and the TIBCO Rendezvous system. The import and export properties
on a destination can list one or more transports to use to communicate with the TIBCO
Rendezvous system.

3. Optionally, specify the name of one or more transports for reliable and certified message delivery
in the export property in the file topics.conf, as in the following example:

topicname export="RV"

The export property allows messages published to a topic by a JMS client to be exported to the external
systems with configured transports. Currently, you can configure transports for TIBCO Rendezvous
reliable and certified messaging protocols.

PowerExchange for webMethods

When importing webMethods sources into the Designer, be sure the webMethods host name does not contain
a ‘.’ character. You cannot use fully-qualified names for the connection when importing webMethods sources.
You can use fully-qualified names for the connection when importing webMethods targets because
PowerCenter doesn’t use the same grouping method for importing sources and targets. To get around
this, modify the host file to resolve the name to the IP address.

For example:

Host File:

crpc23232.crp.informatica.com crpc23232

Use crpc23232 instead of crpc23232.crp.informatica.com as the host name when importing webMethods
source definition. This step is only required for importing PowerExchange for webMethods sources into
the Designer.

If you are using the request/reply model in webMethods, PowerCenter needs to send an appropriate
document back to the broker for every document it receives. PowerCenter populates some of the
envelope fields of the webMethods target to enable webMethods broker to recognize that the published
document is a reply from PowerCenter. The envelope fields ‘destid’ and ‘tag’ are populated for the request/
reply model. ‘Destid’ should be populated from the ‘pubid’ of the source document and ‘tag’ should be
populated from ‘tag’ of the source document. Use the option ‘Create Default Envelope Fields’ when
importing webMethods sources and targets into the Designer in order to make the envelope fields
available in PowerCenter.

Configuring the PowerExchange for webMethods Connection

To create or edit the PowerExchange for webMethods connection select Connections > Application >
webMethods Broker from the Workflow Manager.

PowerExchange for webMethods connection attributes are:

● Name
● Broker Host
● Broker Name
● Client ID
● Client Group
● Application Name
● Automatic Reconnect
● Preserve Client State

Enter the connection to the Broker Host in the following format: <hostname>:<port>.

If you are using the request/reply method in webMethods, you have to specify a client ID in the
connection. Be sure that the client ID used in the request connection is the same as the client ID used in
the reply connection. Note that if you are using multiple request/reply document pairs, you need to set up
different webMethods connections for each pair because they cannot share a client ID.

Last updated: 04-Jun-08 15:21

Session and Data Partitioning

Challenge

Improving performance by identifying strategies for partitioning relational tables, XML,
COBOL, and standard flat files, and by coordinating the interaction between sessions,
partitions, and CPUs. These strategies take advantage of the enhanced partitioning
capabilities in PowerCenter.

Description

On hardware systems that are under-utilized, you may be able to improve performance
by processing partitioned data sets in parallel in multiple threads of the same session
instance running on the PowerCenter Server engine. However, parallel execution may
impair performance on over-utilized systems or systems with smaller I/O capacity.

In addition to hardware, consider these other factors when determining if a session is
an ideal candidate for partitioning: source and target database setup, target type,
mapping design, and certain assumptions that are explained in the following
paragraphs. Use the Workflow Manager client tool to implement session partitioning.

Assumptions

The following assumptions pertain to the source and target systems of a session that is
a candidate for partitioning. These factors can help to maximize the benefits that can
be achieved through partitioning.

● Indexing has been implemented on the partition key when using a relational
source.
● Source files are located on the same physical machine as the PowerCenter
Server process when partitioning flat files, COBOL, and XML, to reduce
network overhead and delay.
● All possible constraints are dropped or disabled on relational targets.
● All possible indexes are dropped or disabled on relational targets.
● Table spaces and database partitions are properly managed on the target
system.
● Target files are written to the same physical machine that hosts the PowerCenter
process in order to reduce network overhead and delay.
● Oracle External Loaders are utilized whenever possible.

First, determine if you should partition your session. Parallel execution benefits
systems that have the following characteristics:

Check idle time and busy percentage for each thread. This gives high-level
information about the bottleneck point(s). To do this, open the session log and
look for messages starting with “PETL_” under the “RUN INFO FOR TGT LOAD
ORDER GROUP” section. These PETL messages give the following details for the
reader, transformation, and writer threads:

● Total Run Time


● Total Idle Time
● Busy Percentage

Under-utilized or intermittently-used CPUs. To determine if this is the case, check
the CPU usage of your machine. The ID column displays the percentage of the
specified interval during which the CPU was idle without any I/O wait. If there are CPU cycles
available (i.e., twenty percent or more idle time), then this session's performance may
be improved by adding a partition.

● Windows 2000/2003 - check the task manager performance tab.
● UNIX - type vmstat 1 10 on the command line.

Sufficient I/O. To determine the I/O statistics:

● Windows 2000/2003 - check the task manager performance tab.


● UNIX - type iostat on the command line. The column %IOWAIT displays the
percentage of CPU time spent idling while waiting for I/O requests. The
column %idle displays the total percentage of the time that the CPU spends
idling (i.e., the unused capacity of the CPU.)

Sufficient memory. If too much memory is allocated to your session, you will receive a
memory allocation error. Check to see that you're using as much memory as you can. If
the session is paging, increase the memory. To determine if the session is paging:

● Windows 2000/2003 - check the task manager performance tab.
● UNIX - type vmstat 1 10 on the command line. PI displays the number of pages
swapped in from the page space during the specified interval. PO displays the
number of pages swapped out to the page space during the specified interval.
If these values indicate that paging is occurring, it may be necessary to
allocate more memory, if possible.

If you determine that partitioning is practical, you can begin setting up the partition.

Partition Types

PowerCenter provides increased control of the pipeline threads. Session performance
can be improved by adding partitions at various pipeline partition points. When you
configure the partitioning information for a pipeline, you must specify a partition type.
The partition type determines how the PowerCenter Server redistributes data across
partition points. The Workflow Manager allows you to specify the following partition
types:

Round-robin Partitioning

The PowerCenter Server distributes data evenly among all partitions. Use round-robin
partitioning when you need to distribute rows evenly and do not need to group data
among partitions.

In a pipeline that reads data from file sources of different sizes, use round-robin
partitioning. For example, consider a session based on a mapping that reads data from
three flat files of different sizes.

● Source file 1: 100,000 rows


● Source file 2: 5,000 rows
● Source file 3: 20,000 rows

In this scenario, the recommended best practice is to set a partition point after the
Source Qualifier and set the partition type to round-robin. The PowerCenter Server
distributes the data so that each partition processes approximately one third of the
data.

Hash Partitioning

The PowerCenter Server applies a hash function to a partition key to group data among
partitions.

Use hash partitioning where you want to ensure that the PowerCenter Server
processes groups of rows with the same partition key in the same partition. For
example, consider a scenario where you need to sort items by item ID but do not know the
number of items that have a particular ID number. If you select hash auto-keys, the
PowerCenter Server uses all grouped or sorted ports as the partition key. If you select
hash user keys, you specify a number of ports to form the partition key.

An example of this type of partitioning is when you are using Aggregators and need to
ensure that groups of data based on a primary key are processed in the same partition.

Key Range Partitioning

With this type of partitioning, you specify one or more ports to form a compound
partition key for a source or target. The PowerCenter Server then passes data to each
partition depending on the ranges you specify for each port.

Use key range partitioning where the sources or targets in the pipeline are partitioned
by key range. Refer to Workflow Administration Guide for further directions on setting
up Key range partitions.

For example, with key range partitioning set at End range = 2020, the PowerCenter
Server passes in data where values are less than 2020. Similarly, for Start range =
2020, the PowerCenter Server passes in data where values are equal to or greater than
2020. Null values or values that may not fall in either partition are passed through the
first partition.

Pass-through Partitioning

In this type of partitioning, the PowerCenter Server passes all rows at one partition
point to the next partition point without redistributing them.

Use pass-through partitioning where you want to create an additional pipeline stage to
improve performance, but do not want to (or cannot) change the distribution of data
across partitions. The Data Transformation Manager spawns a master thread on each
session run, which in turn creates three threads (reader, transformation, and writer
threads) by default. Each of these threads can, at the most, process one data set at a
time and hence, three data sets simultaneously. If there are complex transformations in
the mapping, the transformation thread may take a longer time than the other threads,
which can slow data throughput.

It is advisable to define partition points at these transformations. This creates another
pipeline stage and reduces the overhead of a single transformation thread.

When you have considered all of these factors and selected a partitioning strategy, you
can begin the iterative process of adding partitions. Continue adding partitions to the
session until you meet the desired performance threshold or observe degradation in
performance.

Tips for Efficient Session and Data Partitioning

● Add one partition at a time. To best monitor performance, add one partition
at a time, and note your session settings before adding additional partitions.
Refer to the Workflow Administration Guide for more information on Restrictions on
the Number of Partitions.
● Set DTM buffer memory. For a session with n partitions, set this value to at
least n times the original value for the non-partitioned session.
● Set cached values for sequence generator. For a session with n partitions,
there is generally no need to use the Number of Cached Values property of
the sequence generator. If you must set this value to a value greater than
zero, make sure it is at least n times the original value for the non-partitioned
session.
● Partition the source data evenly. The source data should be partitioned into
equal sized chunks for each partition.
● Partition tables. A notable increase in performance can also be realized when
the actual source and target tables are partitioned. Work with the DBA to
discuss the partitioning of source and target tables, and the setup of
tablespaces.
● Consider using external loader. As with any session, using an external
loader may increase session performance. You can only use Oracle external
loaders for partitioning. Refer to the Session and Server Guide for more
information on using and setting up the Oracle external loader for partitioning.
● Write throughput. Check the session statistics to see if you have increased
the write throughput.
● Paging. Check to see if the session is now causing the system to page. When
you partition a session and there are cached lookups, you must make sure
that DTM memory is increased to handle the lookup caches. When you
partition a source that uses a static lookup cache, the PowerCenter Server
creates one memory cache for each partition and one disk cache for each
transformation. Thus, memory requirements grow for each partition. If the
memory is not bumped up, the system may start paging to disk, causing
degradation in performance.

When you finish partitioning, monitor the session to see if the partition is degrading or
improving session performance. If the session performance is improved and the
session meets your requirements, add another partition.

Session on Grid and Partitioning Across Nodes

Session on Grid provides the ability to run a session on multi-node Integration
Services. This is most suitable for large sessions. For small and medium-size
sessions, it is more practical to distribute whole sessions to different nodes using
Workflow on Grid. Session on Grid leverages existing partitions of a session by
executing threads in multiple DTMs. The Log Service can be used to get the cumulative log.
See PowerCenter Enterprise Grid Option for detailed configuration information.

Dynamic Partitioning

Dynamic partitioning is also called parameterized partitioning because a single
parameter can determine the number of partitions. With the Session on Grid option,
more partitions can be added when more resources are available. Also, the number of
partitions in a session can be tied to partitions in the database, which makes it easier
to maintain PowerCenter partitioning and to leverage database partitioning.

Last updated: 06-Dec-07 15:04

Using Parameters, Variables and Parameter Files

Challenge

Understanding how parameters, variables, and parameter files work and using them for maximum efficiency.

Description

Prior to the release of PowerCenter 5, the only variables inherent to the product were those defined within
specific transformations, plus the server variables that were global in nature. Transformation variables were defined as
variable ports in a transformation and could only be used in that specific transformation object (e.g., Expression,
Aggregator, and Rank transformations). Similarly, global parameters defined within Server Manager would affect
the subdirectories for source files, target files, log files, and so forth.

More current versions of PowerCenter made variables and parameters available across the entire mapping rather
than for a specific transformation object. In addition, they provide built-in parameters for use within Workflow
Manager. Using parameter files, these values can change from session-run to session-run. With the addition of
workflows, parameters can now be passed to every session contained in the workflow, providing more flexibility
and reducing parameter file maintenance. Other important functionality that has been added in recent releases is
the ability to dynamically create parameter files that can be used in the next session in a workflow or in other
workflows.

Parameters and Variables

Use a parameter file to define the values for parameters and variables used in a workflow, worklet, mapping, or
session. A parameter file can be created using a text editor such as WordPad or Notepad. List the parameters or
variables and their values in the parameter file. Parameter files can contain the following types of parameters and
variables:

● Workflow variables
● Worklet variables
● Session parameters
● Mapping parameters and variables

When using parameters or variables in a workflow, worklet, mapping, or session, the Integration Service checks
the parameter file to determine the start value of the parameter or variable. Use a parameter file to initialize
workflow variables, worklet variables, mapping parameters, and mapping variables. If not defining start values for
these parameters and variables, the Integration Service checks for the start value of the parameter or variable in
other places.

Session parameters must be defined in a parameter file. Because session parameters do not have default values,
if the Integration Service cannot locate the value of a session parameter in the parameter file, it fails to initialize
the session. To include parameter or variable information for more than one workflow, worklet, or session in a
single parameter file, create separate sections for each object within the parameter file.

Also, create multiple parameter files for a single workflow, worklet, or session and change the file that these tasks
use, as necessary. To specify the parameter file that the Integration Service uses with a workflow, worklet, or
session, do either of the following:

● Enter the parameter file name and directory in the workflow, worklet, or session properties.
● Start the workflow, worklet, or session using pmcmd and enter the parameter filename and directory in
the command line.

If entering a parameter file name and directory in the workflow, worklet, or session properties and in the pmcmd
command line, the Integration Service uses the information entered in the pmcmd command line.

Parameter File Format

When entering values in a parameter file, precede the entries with a heading that identifies the workflow, worklet
or session whose parameters and variables are to be assigned. Assign individual parameters and variables
directly below this heading, entering each parameter or variable on a new line. List parameters and variables in
any order for each task.

The following heading formats can be defined:

● Workflow variables - [folder name.WF:workflow name]
● Worklet variables - [folder name.WF:workflow name.WT:worklet name]
● Worklet variables in nested worklets - [folder name.WF:workflow name.WT:worklet name.WT:worklet
name...]
● Session parameters, plus mapping parameters and variables - [folder name.WF:workflow name.ST:
session name] or [folder name.session name] or [session name]

Below each heading, define parameter and variable values as follows:

● parameter name=value
● parameter2 name=value
● variable name=value
● variable2 name=value

For example, a session in the production folder, s_MonthlyCalculations, uses a string mapping parameter,
$$State, that needs to be set to MA, and a datetime mapping variable, $$Time. $$Time already has an initial value
of 9/30/2000 00:00:00 saved in the repository, but this value needs to be overridden to 10/1/2000 00:00:00. The
session also uses session parameters to connect to source files and target databases, as well as to write the
session log to the appropriate session log file. The following table shows the parameters and variables that can
be defined in the parameter file:

Parameter and Variable Type               Parameter and Variable Name   Desired Definition
String Mapping Parameter                  $$State                       MA
Datetime Mapping Variable                 $$Time                        10/1/2000 00:00:00
Source File (Session Parameter)           $InputFile1                   Sales.txt
Database Connection (Session Parameter)   $DBConnection_Target          Sales (database connection)
Session Log File (Session Parameter)      $PMSessionLogFile             d:/session logs/firstrun.txt

The parameter file for the session includes the folder and session name, as well as each parameter and variable:

[Production.s_MonthlyCalculations]
$$State=MA
$$Time=10/1/2000 00:00:00
$InputFile1=Sales.txt
$DBConnection_Target=Sales
$PMSessionLogFile=d:/session logs/firstrun.txt

The next time the session runs, edit the parameter file to change the state to MD and delete the $$Time variable.
This allows the Integration Service to use the value for the variable that was saved in the repository from the
previous session run.
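
For illustration, the edited parameter file for that second run might then look like this (with $$Time removed so
the repository value is used):

[Production.s_MonthlyCalculations]
$$State=MD
$InputFile1=Sales.txt
$DBConnection_Target=Sales
$PMSessionLogFile=d:/session logs/firstrun.txt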

Mapping Variables

Declare mapping variables in PowerCenter Designer using the menu option Mappings -> Parameters and
Variables (See the first figure, below). After selecting mapping variables, use the pop-up window to create a
variable by specifying its name, data type, initial value, aggregation type, precision, and scale. This is similar to
creating a port in most transformations (See the second figure, below).

Variables, by definition, are objects that can change value dynamically. PowerCenter provides four functions to
effect changes to mapping variables:

● SetVariable
● SetMaxVariable
● SetMinVariable
● SetCountVariable

A mapping variable can store the last value from a session run in the repository to be used as the starting value
for the next session run.

● Name. The name of the variable should be descriptive and be preceded by $$ (so that it is easily
identifiable as a variable). A typical variable name is: $$Procedure_Start_Date.
● Aggregation type. This entry creates specific functionality for the variable and determines how it stores
data. For example, with an aggregation type of Max, the value stored in the repository at the end of each
session run would be the maximum value across ALL records until the value is deleted.
● Initial value. This value is used during the first session run when there is no corresponding and
overriding parameter file. This value is also used if the stored repository value is deleted. If no initial value
is identified, then a data-type specific default value is used.

Variable values are not stored in the repository when the session:

● Fails to complete.
● Is configured for a test load.
● Is a debug session.

● Runs in debug mode and is configured to discard session output.

Order of Evaluation

The start value is the value of the variable at the start of the session. The start value can be a value defined in the
parameter file for the variable, a value saved in the repository from the previous run of the session, a user-defined
initial value for the variable, or the default value based on the variable data type. The Integration Service looks for
the start value in the following order:

1. Value in session parameter file
2. Value saved in the repository
3. Initial value
4. Default value

Mapping Parameters and Variables

Since parameter values do not change over the course of the session run, the value used is based on:

● Value in session parameter file
● Initial value
● Default value

Once defined, mapping parameters and variables can be used in the Expression Editor section of the following
transformations:

● Expression
● Filter
● Router
● Update Strategy
● Aggregator

Mapping parameters and variables also can be used within the Source Qualifier in the SQL query, user-defined
join, and source filter sections, as well as in a SQL override in the lookup transformation.
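
For example, a Source Qualifier source filter might reference a string mapping parameter directly; the column
and parameter names below are hypothetical, and the single quotes are required because the parameter holds a
string value:

REGION_CODE = '$$Region'

With $$Region defined in the parameter file, the same mapping can extract a different region on each run.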

Guidelines for Creating Parameter Files

Use the following guidelines when creating parameter files:

● Enter folder names for non-unique session names. When a session name exists more than once in a
repository, enter the folder name to indicate the location of the session.
● Create one or more parameter files. Assign parameter files to workflows, worklets, and sessions
individually. Specify the same parameter file for all of these tasks or create several parameter files.
● If including parameter and variable information for more than one session in the file, create a new
section for each session. The folder name is optional.

[folder_name.session_name]

parameter_name=value

variable_name=value

mapplet_name.parameter_name=value

[folder2_name.session_name]

parameter_name=value

variable_name=value

mapplet_name.parameter_name=value

● Specify headings in any order. Place headings in any order in the parameter file. However, if defining
the same parameter or variable more than once in the file, the Integration Service assigns the parameter
or variable value using the first instance of the parameter or variable.
● Specify parameters and variables in any order. Below each heading, the parameters and variables
can be specified in any order.
● When defining parameter values, do not use unnecessary line breaks or spaces. The Integration
Service may interpret additional spaces as part of the value.
● List all necessary mapping parameters and variables. Values entered for mapping parameters and
variables become the start value for parameters and variables in a mapping. Mapping parameter and
variable names are not case sensitive.
● List all session parameters. Session parameters do not have default values. An undefined session
parameter can cause the session to fail. Session parameter names are not case sensitive.
● Use correct date formats for datetime values. When entering datetime values, use the following date
formats:

MM/DD/RR

MM/DD/RR HH24:MI:SS

MM/DD/YYYY

MM/DD/YYYY HH24:MI:SS

● Do not enclose parameter or variable values in quotes in the parameter file. The Integration Service
interprets everything after the equal sign as part of the value.
● Do enclose parameters in single quotes in a Source Qualifier SQL override when the parameter represents
a string or date/time value.
● Precede parameters and variables created in mapplets with the mapplet name as follows:

mapplet_name.parameter_name=value

mapplet2_name.variable_name=value
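
Pulling these guidelines together, a parameter file covering two sessions might look like the following sketch; the
folder, workflow, session, mapplet, and parameter names are illustrative only:

[FINANCE.WF:wf_daily.ST:s_load_gl]
$$Region=NORTHEAST
$$Load_Date=06/22/2008
$InputFile1=gl_extract.dat
mplt_Common.$$Currency=USD

[FINANCE.s_load_ar]
$DBConnection_Source=AR_PROD
$PMSessionLogFile=s_load_ar.log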

Sample: Parameter Files and Session Parameters

Parameter files, along with session parameters, allow you to change certain values between sessions. A
commonly-used feature is the ability to create user-defined database connection session parameters to reuse
sessions for different relational sources or targets. Use session parameters in the session properties, and then
define the parameters in a parameter file. To do this, name all database connection session parameters with the
prefix $DBConnection, followed by any alphanumeric and underscore characters. Session parameters and
parameter files help reduce the overhead of creating multiple mappings when only certain attributes of a mapping
need to be changed.

Using Parameters in Source Qualifiers

Another commonly used feature is the ability to reference parameters in Source Qualifier transformations, which
allows the same mapping to be reused by different sessions to extract whatever data is specified in the
parameter file each session references. It can also be useful for one mapping to create a parameter file and for a
second mapping to use it: the first mapping builds a flat-file target in parameter file format, and the second
mapping pulls its data using a parameter in the Source Qualifier transformation that is read from the parameter
file created by the first mapping.
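
For instance, the first session might write a small parameter file such as the following (the folder, session, and
parameter names and the value are hypothetical), which the second session then references so that its Source
Qualifier filter only extracts rows added since the recorded date:

[ORDERS.s_extract_order_changes]
$$Last_Extract_Date=06/22/2008 00:00:00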

Sample: Variables and Parameters in an Incremental Strategy

Variables and parameters can enhance incremental strategies. The following example uses a mapping variable,
an expression transformation object, and a parameter file for restarting.

Scenario

Company X wants to start with an initial load of all data, but wants subsequent process runs to select only new
information. The environment data has an inherent Post_Date that is defined within a column named
Date_Entered that can be used. The process will run once every twenty-four hours.

Sample Solution

Create a mapping with source and target objects. From the menu create a new mapping variable named
$$Post_Date with the following attributes:

● TYPE Variable
● DATATYPE Date/Time
● AGGREGATION TYPE MAX
● INITIAL VALUE 01/01/1900

Note that there is no need to encapsulate the INITIAL VALUE with quotation marks. However, if this value is used
within the Source Qualifier SQL, it may be necessary to use native RDBMS functions to convert it (e.g.,
TO_DATE(--,--)). Within the Source Qualifier transformation, use the following in the Source Filter attribute:
DATE_ENTERED > TO_DATE('$$Post_Date','MM/DD/YYYY HH24:MI:SS') [please be aware that this sample
refers to Oracle as the source RDBMS]. Also note that the initial value 01/01/1900 will be expanded by the
Integration Service to 01/01/1900 00:00:00, hence the need to convert the parameter to a datetime.

The next step is to forward $$Post_Date and Date_Entered to an Expression transformation. This is where the
function for setting the variable will reside. An output port named Post_Date is created with a data type of date/
time. In the expression code section, place the following function:

SETMAXVARIABLE($$Post_Date,DATE_ENTERED)

The function evaluates each value for DATE_ENTERED and updates the variable with the Max value to be
passed forward. For example:

DATE_ENTERED   Resultant POST_DATE
9/1/2000       9/1/2000
10/30/2001     10/30/2001
9/2/2000       10/30/2001

Consider the following with regard to the functionality:

1. In order for the function to assign a value, and ultimately store it in the repository, the port must be
connected to a downstream object. It need not go to the target, but it must go to another Expression
Transformation. The reason is that the memory will not be instantiated unless it is used in a downstream
transformation object.
2. In order for the function to work correctly, the rows have to be marked for insert. If the mapping is an
update-only mapping (i.e., Treat Rows As is set to Update in the session properties) the function will not
work. In this case, make the session Data Driven and add an Update Strategy after the transformation
containing the SETMAXVARIABLE function, but before the Target.
3. If the intent is to store the original Date_Entered per row and not the evaluated date value, then add an
ORDER BY clause to the Source Qualifier. This way, the dates are processed and set in order and data is
preserved.

The first time this mapping is run, the SQL will select from the source where Date_Entered is > 01/01/1900
providing an initial load. As data flows through the mapping, the variable gets updated to the Max Date_Entered it
encounters. Upon successful completion of the session, the variable is updated in the repository for use in the
next session run. To view the current value for a particular variable associated with the session, right-click on the
session in the Workflow Manager and choose View Persistent Values.

The following graphic shows that after the initial run, the Max Date_Entered was 02/03/1998. The next time this
session is run, based on the variable in the Source Qualifier Filter, only sources where Date_Entered >
02/03/1998 will be processed.

Resetting or Overriding Persistent Values

To reset the persistent value to the initial value declared in the mapping, view the persistent value from Workflow
Manager (see graphic above) and press Delete Values. This deletes the stored value from the repository, causing
the Order of Evaluation to use the Initial Value declared from the mapping.

If a session run is needed for a specific date, use a parameter file. There are two basic ways to accomplish this:

● Create a generic parameter file, place it on the server, and point all sessions to that parameter file. A
session may (or may not) have a variable, and the parameter file need not have variables and
parameters defined for every session using the parameter file. To override the variable, either change,
uncomment, or delete the variable in the parameter file.
● Run pmcmd for that session, but declare the specific parameter file within the pmcmd command.

Configuring the Parameter File Location

Specify the parameter filename and directory in the workflow or session properties. To enter a parameter file in
the workflow or session properties:

● Select either the Workflow or Session, choose Edit, and click the Properties tab.
● Enter the parameter directory and name in the Parameter Filename field.
● Enter either a direct path or a server variable directory. Use the appropriate delimiter for the Integration
Service operating system.

The following graphic shows the parameter filename and location specified in the session task.

The next graphic shows the parameter filename and location specified in the Workflow.

In this example, after the initial session is run, the parameter file contents may look like:

[Test.s_Incremental]

;$$Post_Date=

By using the semicolon, the variable override is ignored and the Initial Value or Stored Value is used. If, in the
subsequent run, the data processing date needs to be set to a specific date (for example: 04/21/2001), then a
simple Perl script or manual change can update the parameter file to:

[Test.s_Incremental]

$$Post_Date=04/21/2001

Upon running the sessions, the order of evaluation looks to the parameter file first, sees a valid variable and value
and uses that value for the session run. After successful completion, run another script to reset the parameter file.
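
As a minimal sketch of such a script (the file path is hypothetical), a UNIX shell step could write the override
before the run and a second step could reset it afterwards; note the quoted here-document delimiter, which
prevents the shell from expanding $$:

# set the override for a specific processing date (hypothetical path)
cat > /data/parmfiles/incremental.txt <<'EOF'
[Test.s_Incremental]
$$Post_Date=04/21/2001
EOF

# after a successful run, reset the file so the stored repository value is used again
cat > /data/parmfiles/incremental.txt <<'EOF'
[Test.s_Incremental]
;$$Post_Date=
EOF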

Sample: Using Session and Mapping Parameters in Multiple Database Environments

Reusable mappings that can source a common table definition across multiple databases, regardless of differing
environmental definitions (e.g., instances, schemas, user/logins), are required in a multiple database environment.

Scenario

Company X maintains five Oracle database instances. All instances have a common table definition for sales
orders, but each instance has a unique instance name, schema, and login.

DB Instance   Schema     Table        User    Password
ORC1          aardso     orders       Sam     max
ORC99         environ    orders       Help    me
HALC          hitme      order_done   Hi      Lois
UGLY          snakepit   orders       Punch   Judy
GORF          gmer       orders       Brer    Rabbit

Each sales order table has a different name, but the same definition:

ORDER_ID NUMBER (28) NOT NULL,
DATE_ENTERED DATE NOT NULL,
DATE_PROMISED DATE NOT NULL,
DATE_SHIPPED DATE NOT NULL,
EMPLOYEE_ID NUMBER (28) NOT NULL,
CUSTOMER_ID NUMBER (28) NOT NULL,
SALES_TAX_RATE NUMBER (5,4) NOT NULL,
STORE_ID NUMBER (28) NOT NULL

Sample Solution

Using Workflow Manager, create multiple relational connections. In this example, the strings are named according
to the DB Instance name. Using Designer, create the mapping that sources the commonly defined table. Then
create a Mapping Parameter named $$Source_Schema_Table with the following attributes:

Note that the parameter attributes vary based on the specific environment. Also, the initial value is not required
since this solution uses parameter files.

Open the Source Qualifier and use the mapping parameter in the SQL Override as shown in the following graphic.

Open the Expression Editor and select Generate SQL. The generated SQL statement shows the columns.

Override the table names in the SQL statement with the mapping parameter.

Using Workflow Manager, create a session based on this mapping. Within the Source Database connection drop-
down box, choose the following parameter:

$DBConnection_Source.

Point the target to the corresponding target and finish.

Now create the parameter files. In this example, there are five separate parameter files.

Parmfile1.txt

[Test.s_Incremental_SOURCE_CHANGES]
$$Source_Schema_Table=aardso.orders
$DBConnection_Source=ORC1

Parmfile2.txt

[Test.s_Incremental_SOURCE_CHANGES]
$$Source_Schema_Table=environ.orders
$DBConnection_Source=ORC99

Parmfile3.txt

[Test.s_Incremental_SOURCE_CHANGES]
$$Source_Schema_Table=hitme.order_done
$DBConnection_Source=HALC

Parmfile4.txt

[Test.s_Incremental_SOURCE_CHANGES]
$$Source_Schema_Table=snakepit.orders
$DBConnection_Source=UGLY

Parmfile5.txt

[Test.s_Incremental_SOURCE_CHANGES]
$$Source_Schema_Table=gmer.orders
$DBConnection_Source=GORF

Use pmcmd to run the five sessions in parallel. The syntax for pmcmd for starting a workflow with a particular
parameter file is as follows:

pmcmd startworkflow -s serveraddress:portno -u Username -p Password -paramfile parmfilename s_Incremental

You may also use "-pv pwdvariable" if the named environment variable contains the encrypted form of the actual
password.

Notes on Using Parameter Files with Startworkflow

When starting a workflow, you can optionally enter the directory and name of a parameter file. The PowerCenter
Integration Service runs the workflow using the parameters in the file specified. For UNIX shell users, enclose the
parameter file name in single quotes:

-paramfile '$PMRootDir/myfile.txt'

For Windows command prompt users, the parameter file name cannot have beginning or trailing spaces. If the
name includes spaces, enclose the file name in double quotes:

-paramfile "$PMRootDir\my file.txt"

Note: When writing a pmcmd command that includes a parameter file located on another machine, use the
backslash (\) with the dollar sign ($). This ensures that the machine where the variable is defined expands the
server variable.

pmcmd startworkflow -uv USERNAME -pv PASSWORD -s SALES:6258 -f east -w wSalesAvg -paramfile '\
$PMRootDir/myfile.txt'

In the event that it is necessary to run the same workflow with different parameter files, use the following five
separate commands:

pmcmd startworkflow -u tech_user -p pwd -s 127.0.0.1:4001 -f Test s_Incremental_SOURCE_CHANGES -paramfile \$PMRootDir\ParmFiles\Parmfile1.txt 1 1

pmcmd startworkflow -u tech_user -p pwd -s 127.0.0.1:4001 -f Test s_Incremental_SOURCE_CHANGES -paramfile \$PMRootDir\ParmFiles\Parmfile2.txt 1 1

pmcmd startworkflow -u tech_user -p pwd -s 127.0.0.1:4001 -f Test s_Incremental_SOURCE_CHANGES -paramfile \$PMRootDir\ParmFiles\Parmfile3.txt 1 1

pmcmd startworkflow -u tech_user -p pwd -s 127.0.0.1:4001 -f Test s_Incremental_SOURCE_CHANGES -paramfile \$PMRootDir\ParmFiles\Parmfile4.txt 1 1

pmcmd startworkflow -u tech_user -p pwd -s 127.0.0.1:4001 -f Test s_Incremental_SOURCE_CHANGES -paramfile \$PMRootDir\ParmFiles\Parmfile5.txt 1 1

Alternatively, run the sessions in sequence with one parameter file. In this case, a pre- or post-session script
can change the parameter file for the next session.

Dynamically Creating Parameter Files with a Mapping

Using advanced techniques, a PowerCenter mapping can be built that produces, as its target file, a parameter
file (.parm) that can be referenced by other mappings and sessions. When many mappings use the same
parameter file, it is desirable to be able to easily re-create the file when mapping parameters are changed or
updated. This can also be beneficial when parameters change from run to run. There are a few different methods
of creating a parameter file with a mapping.

There is a mapping template example on my.informatica.com that illustrates a method of using a PowerCenter
mapping to source from a process table containing mapping parameters and to create a parameter file. The
same result can also be achieved by sourcing a flat file in parameter file format with code characters in the fields
to be altered.

[folder_name.session_name]

parameter_name= <parameter_code>

variable_name=value

mapplet_name.parameter_name=value

[folder2_name.session_name]

parameter_name= <parameter_code>

variable_name=value

mapplet_name.parameter_name=value

In place of the text <parameter_code> one could place the text filename_<timestamp>.dat. The mapping would
then perform a string replace wherever the text <timestamp> occurred and the output might look like:

Src_File_Name= filename_20080622.dat

This method works well when values change often and parameter groupings utilize different parameter sets. The
overall benefits of using this method are such that if many mappings use the same parameter file, changes can be
made by updating the source table and recreating the file. Using this process is faster than manually updating the
file line by line.
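
The substitution step itself can also be performed outside of PowerCenter; as a minimal sketch (the template and
output file names are hypothetical), a shell script could replace the <timestamp> placeholder in a template copy
of the parameter file:

# replace the <timestamp> placeholder with today's date (hypothetical file names)
RUN_DATE=$(date +%Y%m%d)
sed "s/<timestamp>/${RUN_DATE}/g" parmfile_template.parm > parmfile_current.parm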

Final Tips for Parameters and Parameter Files

Use a single parameter file to group parameter information for related sessions.

When sessions are likely to use the same database connection or directory, you might want to include them in the
same parameter file. When connections or directories change, you can update information for all sessions by
editing one parameter file. Sometimes you reuse session parameters in a cycle. For example, you might run a
session against a sales database every day, but run the same session against sales and marketing databases
once a week. You can create separate parameter files for each session run. Instead of changing the parameter
file in the session properties each time you run the weekly session, use pmcmd to specify the parameter file to
use when you start the session.

Use reject file and session log parameters in conjunction with target file or target database connection
parameters.

When you use a target file or target database connection parameter with a session, you can keep track of reject
files by using a reject file parameter. You can also use the session log parameter to write the session log to the
target machine.

Use a resource to verify the session runs on a node that has access to the parameter file.

In the Administration Console, you can define a file resource for each node that has access to the parameter file
and configure the Integration Service to check resources. Then, edit the session that uses the parameter file and
assign the resource. When you run the workflow, the Integration Service runs the session with the required
resource on a node that has the resource available.

Save all parameter files in one of the process variable directories.

If you keep all parameter files in one of the process variable directories, such as $SourceFileDir, use the process
variable in the session property sheet. If you need to move the source and parameter files at a later date, you can
update all sessions by changing the process variable to point to the new directory.

Last updated: 29-May-08 17:43

Error Handling Process

Challenge
For an error handling strategy to be implemented successfully, it must be integral to the load process as a
whole. The method of implementation for the strategy will vary depending on the data integration
requirements for each project.

The resulting error handling process should, however, always involve the following three steps:

1. Error identification
2. Error retrieval
3. Error correction

This Best Practice describes how each of these steps can be facilitated within the PowerCenter
environment.

Description
A typical error handling process leverages the best-of-breed error management technology available in
PowerCenter, such as:

• Relational database error logging
• Email notification of workflow failures
• Session error thresholds
• The reporting capabilities of PowerCenter Data Analyzer
• Data profiling

These capabilities can be integrated to facilitate error identification, retrieval, and correction as described in
the flow chart below:

Error Identification

The first step in the error handling process is error identification. Error identification is often achieved
through the use of the ERROR() function within mappings, enablement of relational error logging in
PowerCenter, and referential integrity constraints at the database.

This approach ensures that row-level issues such as database errors (e.g., referential integrity failures),
transformation errors, and business rule exceptions for which the ERROR() function was called are captured
in relational error logging tables.

Enabling the relational error logging functionality automatically writes row-level data to a set of four error
handling tables (PMERR_MSG, PMERR_DATA, PMERR_TRANS, and PMERR_SESS). These tables can
be centralized in the PowerCenter repository and store information such as error messages, error data, and
source row data. Row-level errors trapped in this manner include any database errors, transformation errors,
and business rule exceptions for which the ERROR() function was called within the mapping.

Error Retrieval

The second step in the error handling process is error retrieval. After errors have been captured in the
PowerCenter repository, it is important to make their retrieval simple and automated so that the process is
as efficient as possible. Data Analyzer can be customized to create error retrieval reports from the
information stored in the PowerCenter repository. A typical error report prompts a user for the folder and
workflow name, and returns a report with information such as the session, error message, and data that
caused the error. In this way, the error is successfully captured in the repository and can be easily retrieved
through a Data Analyzer report, or an email alert that identifies a user when a certain threshold is crossed
(such as “number of errors is greater than zero”).
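
Where Data Analyzer is not available, the same information can be pulled directly from the relational error
logging tables with SQL. The sketch below is illustrative only: the join keys and column names are assumptions
and should be checked against the actual PMERR table definitions in your repository database:

-- illustrative query; column names are assumptions, verify against the PMERR table definitions
SELECT s.FOLDER_NAME,
       s.WORKFLOW_NAME,
       m.ERROR_TIMESTAMP,
       m.ERROR_MSG
FROM   PMERR_SESS s,
       PMERR_MSG  m
WHERE  m.SESS_INST_ID = s.SESS_INST_ID
AND    s.WORKFLOW_NAME = 'wf_daily_load'
ORDER BY m.ERROR_TIMESTAMP;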

Error Correction

The final step in the error handling process is error correction. As PowerCenter automates the process of
error identification, and Data Analyzer can be used to simplify error retrieval, error correction is
straightforward. After retrieving an error through Data Analyzer, the error report (which contains information
such as workflow name, session name, error date, error message, error data, and source row data) can
be exported to various file formats including Microsoft Excel, Adobe PDF, CSV, and others. Upon retrieval of
an error, the error report can be extracted into a supported format and emailed to a developer or DBA to
resolve the issue, or it can be entered into a defect management tracking tool. The Data Analyzer interface
supports emailing a report directly through the web-based interface to make the process even easier.

For further automation, a report broadcasting rule that emails the error report to a developer’s inbox can be
set up to run on a pre-defined schedule. After the developer or DBA identifies the condition that caused the
error, a fix for the error can be implemented. The exact method of data correction depends on various
factors such as the number of records with errors, data availability requirements per SLA, the level of data
criticality to the business unit(s), and the type of error that occurred. Considerations made during error
correction include:

• The ‘owner’ of the data should always fix the data errors. For example, if the source data is
coming from an external system, then the errors should be sent back to the source system to be
fixed.
• In some situations, a simple re-execution of the session will reprocess the data.
• Determine whether partial data that has already been loaded into the target systems needs to be backed out
in order to avoid duplicate processing of rows.
• Lastly, errors can also be corrected through a manual SQL load of the data. If the volume of
errors is low, the rejected data can be easily exported to Microsoft Excel or CSV format and
corrected in a spreadsheet from the Data Analyzer error reports. The corrected data can then be
manually inserted into the target table using a SQL statement.

Any approach to correct erroneous data should be precisely documented and followed as a standard.

If the data errors occur frequently, then the reprocessing process can be automated by designing a special
mapping or session to correct the errors and load the corrected data into the ODS or staging area.

Data Profiling Option

For organizations that want to identify data irregularities post-load but do not want to reject such rows at load
time, the PowerCenter Data Profiling option can be an important part of the error management solution. The
PowerCenter Data Profiling option enables users to create data profiles through a wizard-driven GUI that
provides profile reporting such as orphan record identification, business rule violation, and data irregularity
identification (such as NULL or default values). The Data Profiling option comes with a license to use Data
Analyzer reports that source the data profile warehouse to deliver data profiling information through an
intuitive BI tool. This is a recommended best practice since error handling reports and data profile reports
can be delivered to users through the same easy-to-use application.

Integrating Error Handling, Load Management, and Metadata

Error handling forms only one part of a data integration application. By necessity, it is tightly coupled to the
load management process and the load metadata; it is the integration of all these approaches that ensures
the system is sufficiently robust for successful operation and management. The flow chart below illustrates
this in the end-to-end load process.

Error handling underpins the data integration system from end-to-end. Each of the load
components performs validation checks, the results of which must be reported to the operational team.
These components are not just PowerCenter processes such as business rule and field validation, but cover
the entire data integration architecture, for example:

• Process Validation. Are all the resources in place for the processing to begin (e.g., connectivity
to source systems)?
• Source File Validation. Is the source file datestamp later than the previous load?
• File Check. Does the number of rows successfully loaded match the source rows read?
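
As an example of how simple some of these checks can be, the source file datestamp validation might be
scripted ahead of the PowerCenter load as follows (the file paths are hypothetical); the marker file is touched
only after a successful load:

# abort if the new extract is not newer than the last successfully loaded file (hypothetical paths)
if [ ! /landing/sales_extract.dat -nt /landing/.last_load_marker ]; then
    echo "Source file is not newer than the previous load - aborting" >&2
    exit 1
fi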

Last updated: 09-Feb-07 13:42

Error Handling Strategies - Data Warehousing

Challenge
A key requirement for any successful data warehouse or data integration project is that it attains credibility
within the user community. At the same time, it is imperative that the warehouse be as up-to-date as
possible since the more recent the information derived from it is, the more relevant it is to the business
operations of the organization, thereby providing the best opportunity to gain an advantage over the
competition.

Transactional systems can manage to function even with a certain amount of error since the impact of an
individual transaction (in error) has a limited effect on the business figures as a whole, and corrections can
be applied to erroneous data after the event (i.e., after the error has been identified). In data warehouse
systems, however, any systematic error (e.g., for a particular load instance) not only affects a larger number
of data items, but may potentially distort key reporting metrics. Such data cannot be left in the warehouse
"until someone notices" because business decisions may be driven by such information.

Therefore, it is important to proactively manage errors, identifying them before, or as, they occur. If errors
occur, it is equally important either to prevent them from getting to the warehouse at all, or to remove them
from the warehouse immediately (i.e., before the business tries to use the information in error).

The types of error to consider include:

• Source data structures
• Sources presented out-of-sequence
• ‘Old’ sources re-presented in error
• Incomplete source files
• Data-type errors for individual fields
• Unrealistic values (e.g., impossible dates)
• Business rule breaches
• Missing mandatory data
• O/S errors
• RDBMS errors

These cover both high-level (i.e., related to the process or a load as a whole) and low-level (i.e., field or
column-related errors) concerns.

Description
In an ideal world, when an analysis is complete, you have a precise definition of source and target data; you
can be sure that every source element was populated correctly, with meaningful values, never missing a
value, and fulfilling all relational constraints. At the same time, source data sets always have a fixed
structure, are always available on time (and in the correct order), and are never corrupted during transfer to
the data warehouse. In addition, the OS and RDBMS never run out of resources, or have permissions and
privileges change.

Realistically, however, the operational applications are rarely able to cope with every possible business
scenario or combination of events; operational systems crash, networks fall over, and users may not use the
transactional systems in quite the way they were designed. The operational systems also typically need
some flexibility to allow non-fixed data to be stored (typically as free-text comments). In every case, there is
a risk that the source data does not match what the data warehouse expects.

Because of the credibility issue, in-error data must not be propagated to the metrics and measures used by
the business managers. If erroneous data does reach the warehouse, it must be identified and removed
immediately (before the current version of the warehouse can be published). Preferably, error data should
be identified during the load process and prevented from reaching the warehouse at all. Ideally, erroneous
source data should be identified before a load even begins, so that no resources are wasted trying to load it.

As a principle, data errors should be corrected at the source. As soon as any attempt is made to correct errors
within the warehouse, there is a risk that the lineage and provenance of the data will be lost. From that point
on, it becomes impossible to guarantee that a metric or data item came from a specific source via a specific
chain of processes. As a by-product, adopting this principle also helps to tie both the end-users and those
responsible for the source data into the warehouse process; source data staff understand that their
professionalism directly affects the quality of the reports, and end-users become owners of their data.

As a final consideration, error management (the implementation of an error handling strategy) complements
and overlaps load management, data quality and key management, and operational processes and
procedures.

Load management processes record at a high-level if a load is unsuccessful; error management records the
details of why the failure occurred.

Quality management defines the criteria whereby data can be identified as in error; and error management
identifies the specific error(s), thereby allowing the source data to be corrected.

Operational reporting shows a picture of loads over time, and error management allows analysis to identify
systematic errors, perhaps indicating a failure in operational procedure.

Error management must therefore be tightly integrated within the data warehouse load process. This is
shown in the high level flow chart below:

Error Management Considerations

High-Level Issues

From previous discussion of load management, a number of checks can be performed before any attempt is
made to load a source data set. Without load management in place, it is unlikely that the warehouse process
will be robust enough to satisfy any end-user requirements, and error correction processing becomes moot
(in so far as nearly all maintenance and development resources will be working full time to manually correct
bad data in the warehouse). The following assumes that you have implemented load management
processes similar to Informatica’s best practices.

• Process Dependency checks in the load management can identify when a source data set is
missing, duplicates a previous version, or has been presented out of sequence, and where the
previous load failed but has not yet been corrected.
• Load management prevents this source data from being loaded. At the same time, error
management processes should record the details of the failed load; noting the source instance, the
load affected, and when and why the load was aborted.
• Source file structures can be compared to expected structures stored as metadata, either from
header information or by attempting to read the first data row.
• Source table structures can be compared to expectations; typically this can be done by interrogating
the RDBMS catalogue directly (and comparing to the expected structure held in metadata), or by
simply running a ‘describe’ command against the table (again comparing to a pre-stored version in
metadata).
• Control file totals (for file sources) and row number counts (table sources) are also used to
determine if files have been corrupted or truncated during transfer, or if tables have no new data in
them (suggesting a fault in an operational application).
• In every case, information should be recorded to identify where and when an error occurred, what
sort of error it was, and any other relevant process-level details.

Low-Level Issues

Assuming that the load is to be processed normally (i.e., that the high-level checks have not caused the load
to abort), further error management processes need to be applied to the individual source rows and fields.

• Individual source fields can be compared to expected data-types against standard metadata within
the repository, or additional information added by the development. In some instances, this is
enough to abort the rest of the load; if the field structure is incorrect, it is much more likely that the
source data set as a whole either cannot be processed at all or (more worryingly) is likely to be
processed unpredictably.
• Data conversion errors can be identified on a field-by-field basis within the body of a mapping. Built-
in error handling can be used to spot failed date conversions, conversions of string to numbers, or
missing required data. In rare cases, stored procedures can be called if a specific conversion fails;
however this cannot be generally recommended because of the potentially crushing impact on
performance if a particularly error-filled load occurs.
• Business rule breaches can then be picked up. It is possible to define allowable values, or
acceptable value ranges within PowerCenter mappings (if the rules are few, and it is clear from the
mapping metadata that the business rules are included in the mapping itself). A more flexible
approach is to use external tables to codify the business rules. In this way, only the rules tables
need to be amended if a new business rule needs to be applied. Informatica has suggested
methods to implement such a process.
• Missing Key/Unknown Key issues have already been defined in their own best practice document
Key Management in Data Warehousing Solutions with suggested management techniques for
identifying and handling them. However, from an error handling perspective, such errors must still
be identified and recorded, even when key management techniques do not formally fail source rows
with key errors. Unless a record is kept of the frequency with which particular source data fails, it is
difficult to realize when there is a systematic problem in the source systems.
• Inter-row errors may also have to be considered. These may occur when a business process
expects a certain hierarchy of events (e.g., a customer query, followed by a booking request,
followed by a confirmation, followed by a payment). If the events arrive from the source system in
the wrong order, or where key events are missing, it may indicate a major problem with the source
system, or the way in which the source system is being used.
• An important principle to follow is to try to identify all of the errors on a particular row before halting
processing, rather than rejecting the row at the first instance. This seems to break the rule of not
wasting resources trying to load a sourced data set if we already know it is in error; however, since
the row needs to be corrected at source, then reprocessed subsequently, it is sensible to identify all
the corrections that need to be made before reloading, rather than fixing the first, re-running, and
then identifying a second error (which halts the load for a second time).

OS and RDBMS Issues

Since best practice means that referential integrity (RI) issues are proactively managed within the loads,
instances where the RDBMS rejects data for referential reasons should be very rare (i.e., the load should
already have identified that reference information is missing).

However, there is little that can be done to identify the more generic RDBMS problems that are likely to
occur; changes to schema permissions, running out of temporary disk space, dropping of tables and
schemas, invalid indexes, no further table space extents available, missing partitions and the like.

Similarly, interaction with the OS means that changes in directory structures, file permissions, disk space,
command syntax, and authentication may occur outside of the data warehouse. Often such changes are
driven by Systems Administrators who, from an operational perspective, are not aware that there is likely to
be an impact on the data warehouse, or are not aware that the data warehouse managers need to be kept
up to speed.

In both of the instances above, the nature of the errors may be such that not only will they cause a load to
fail, but it may be impossible to record the nature of the error at that point in time. For example, if RDBMS
user ids are revoked, it may be impossible to write a row to an error table if the error process depends on
the revoked id; if disk space runs out during a write to a target table, this may affect all other tables
(including the error tables); if file permissions on a UNIX host are amended, bad files themselves (or even
the log files) may not be accessible.

Most of these types of issues can be managed by a proper load management process, however. Since
setting the status of a load to ‘complete’ should be absolutely the last step in a given process, any failure
before, or including, that point leaves the load in an ‘incomplete’ state. Subsequent runs should note this,
and enforce correction of the last load before beginning the new one.

The best practice to manage such OS and RDBMS errors is, therefore, to ensure that the Operational
Administrators and DBAs have proper and working communication with the data warehouse management to
allow proactive control of changes. Administrators and DBAs should also be available to the data warehouse
operators to rapidly explain and resolve such errors if they occur.

Auto-Correction vs. Manual Correction

Load management and key management best practices (Key Management in Data Warehousing Solutions)
have already defined auto-correcting processes; the former to allow loads themselves to launch, rollback,
and reload without manual intervention, and the latter to allow RI errors to be managed so that the
quantitative quality of the warehouse data is preserved, and incorrect key values are corrected as soon as
the source system provides the missing data.

We cannot conclude from these two specific techniques, however, that the warehouse should attempt to
change source data as a general principle. Even if this were possible (which is debatable), such functionality
would mean that the absolute link between the source data and its eventual incorporation into the data
warehouse would be lost. As soon as one of the warehouse metrics was identified as incorrect, unpicking
the error would be impossible, potentially requiring a whole section of the warehouse to be reloaded entirely
from scratch.

In addition, such automatic correction of data might hide the fact that one or other of the source systems had
a generic fault, or more importantly, had acquired a fault because of on-going development of the
transactional applications, or a failure in user training.

The principle to apply here is to identify the errors in the load, and then alert the source system users that
data should be corrected in the source system itself, ready for the next load to pick up the right data. This
maintains the data lineage, allows source system errors to be identified and ameliorated in good time, and
permits extra training needs to be identified and managed.

Error Management Techniques

Simple Error Handling Structure

The following data structure is an example of the error metadata that should be captured as a minimum
within the error handling strategy.

The example defines three main sets of information:

• The ERROR_DEFINITION table, which stores descriptions for the various types of errors, including:

o process-level (e.g., incorrect source file, load started out-of-sequence)
o row-level (e.g., missing foreign key, incorrect data-type, conversion errors) and
o reconciliation (e.g., incorrect row numbers, incorrect file total etc.).

• The ERROR_HEADER table provides a high-level view on the process, allowing a quick
identification of the frequency of error for particular loads and of the distribution of error types. It is
linked to the load management processes via the SRC_INST_ID and PROC_INST_ID, from which
other process-level information can be gathered.
• The ERROR_DETAIL table stores information about actual rows with errors, including how to
identify the specific row that was in error (using the source natural keys and row number) together
with a string of field identifier/value pairs concatenated together. It is not expected that this
information will be deconstructed as part of an automatic correction load, but if necessary this can
be pivoted (e.g., using simple UNIX scripts) to separate out the field/value pairs for subsequent
reporting.
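
Since the underlying diagram is not reproduced here, the following DDL is offered only as a hedged sketch of how
these three tables might be realized; the column names and data types beyond those mentioned above are
assumptions to be adapted to local standards:

-- illustrative DDL sketch; adjust names, types, and keys to local standards
CREATE TABLE ERROR_DEFINITION (
    ERROR_TYPE_ID     NUMBER        NOT NULL,  -- surrogate key for the error type
    ERROR_CATEGORY    VARCHAR2(30)  NOT NULL,  -- process-level, row-level, or reconciliation
    ERROR_DESC        VARCHAR2(255)
);

CREATE TABLE ERROR_HEADER (
    ERROR_HDR_ID      NUMBER        NOT NULL,
    SRC_INST_ID       NUMBER        NOT NULL,  -- links to the load management source instance
    PROC_INST_ID      NUMBER        NOT NULL,  -- links to the load management process instance
    ERROR_TYPE_ID     NUMBER        NOT NULL,
    ERROR_DATE        DATE          NOT NULL
);

CREATE TABLE ERROR_DETAIL (
    ERROR_HDR_ID      NUMBER        NOT NULL,
    SRC_ROW_NUM       NUMBER,                  -- row number of the offending source record
    SRC_NATURAL_KEY   VARCHAR2(255),           -- natural key(s) identifying the source row
    FIELD_VALUE_PAIRS VARCHAR2(4000)           -- concatenated field identifier/value pairs
);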

Last updated: 01-Feb-07 18:53

Error Handling Strategies - General

Challenge

The challenge is to accurately and efficiently load data into the target data architecture. This Best Practice describes
various loading scenarios, the use of data profiles, an alternate method for identifying data errors, methods for
handling data errors, and alternatives for addressing the most common types of problems. For the most part, these
strategies are relevant whether your data integration project is loading an operational data structure (as with data
migrations, consolidations, or loading various sorts of operational data stores) or loading a data warehousing
structure.

Description

Regardless of target data structure, your loading process must validate that the data conforms to known rules of the
business. When the source system data does not meet these rules, the process needs to handle the exceptions in
an appropriate manner. The business needs to be aware of the consequences of either permitting invalid data to
enter the target or rejecting it until it is fixed. Both approaches present complex issues. The business must decide
what is acceptable and prioritize two conflicting goals:

● The need for accurate information.
● The ability to analyze or process the most complete information available with the understanding that errors
can exist.

Data Integration Process Validation

In general, there are three methods for handling data errors detected in the loading process:

● Reject All. This is the simplest to implement since all errors are rejected from entering the target when they
are detected. This provides a very reliable target that the users can count on as being correct, although it
may not be complete. Both dimensional and factual data can be rejected when any errors are encountered.
Reports indicate what the errors are and how they affect the completeness of the data.

Dimensional or Master Data errors can cause valid factual data to be rejected because a foreign key
relationship cannot be created. These errors need to be fixed in the source systems and reloaded on a
subsequent load. Once the corrected rows have been loaded, the factual data will be reprocessed and
loaded, assuming that all errors have been fixed. This delay may cause some user dissatisfaction since the
users need to take into account that the data they are looking at may not be a complete picture of the
operational systems until the errors are fixed. For an operational system, this delay may affect downstream
transactions.

The development effort required to fix a Reject All scenario is minimal, since the rejected data can be
processed through existing mappings once it has been fixed. Minimal additional code may need to be written
since the data will only enter the target if it is correct, and it would then be loaded into the data mart using
the normal process.

● Reject None. This approach gives users a complete picture of the available data without having to consider
data that was not available due to it being rejected during the load process. The problem is that the data
may not be complete or accurate. All of the target data structures may contain incorrect information that
can lead to incorrect decisions or faulty transactions.

With Reject None, the complete set of data is loaded, but the data may not support correct transactions or
aggregations. Factual data can be allocated to dummy or incorrect dimension rows, resulting in grand total
numbers that are correct, but incorrect detail numbers. After the data is fixed, reports may change, with
detail information being redistributed along different hierarchies.

The development effort to fix this scenario is significant. After the errors are corrected, a new loading
process needs to correct all of the target data structures, which can be a time-consuming effort based on the
delay between an error being detected and fixed. The development strategy may include removing
information from the target, restoring backup tapes for each night’s load, and reprocessing the data. Once
the target is fixed, these changes need to be propagated to all downstream data structures or data marts.

● Reject Critical. This method provides a balance between missing information and incorrect information. It
involves examining each row of data and determining the particular data elements to be rejected. All
changes that are valid are processed into the target to allow for the most complete picture. Rejected
elements are reported as errors so that they can be fixed in the source systems and loaded on a
subsequent run of the ETL process.

This approach requires categorizing the data in two ways: 1) as key elements or attributes, and 2) as inserts
or updates.

Key elements are required fields that maintain the data integrity of the target and allow for hierarchies to be
summarized at various levels in the organization. Attributes provide additional descriptive information per
key element.

Inserts are important for dimensions or master data because subsequent factual data may rely on the
existence of the dimension data row in order to load properly. Updates do not affect the data integrity as
much because the factual data can usually be loaded with the existing dimensional data unless the update is
to a key element.

The development effort for this method is more extensive than Reject All since it involves classifying fields
as critical or non-critical, and developing logic to update the target and flag the fields that are in error. The
effort also incorporates some tasks from the Reject None approach, in that processes must be developed to
fix incorrect data in the entire target data architecture.

Informatica generally recommends using the Reject Critical strategy to maintain the accuracy of the target.
By providing the most fine-grained analysis of errors, this method allows the greatest amount of valid data to
enter the target on each run of the ETL process, while at the same time screening out the unverifiable data
fields. However, business management needs to understand that some information may be held out of the
target, and also that some of the information in the target data structures may be at least temporarily
allocated to the wrong hierarchies.
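
As an illustration of the Reject Critical classification, the following sketch (hypothetical Python, with made-up field names and rules, not Velocity-defined checks) separates key-element failures, which reject the whole row, from attribute failures, which are flagged so the row can still load.

    # A minimal sketch of a Reject Critical check; field names and rules are illustrative only.
    KEY_ELEMENTS = {"location_id", "product_code"}   # required for integrity and hierarchies

    def classify_row(row, validators):
        """Return ('reject' | 'load_with_flags' | 'load', errors_by_field)."""
        errors = {}
        for field, check in validators.items():
            message = check(row.get(field))
            if message:
                errors[field] = message
        if errors.keys() & KEY_ELEMENTS:
            return "reject", errors            # a key element failed: hold the whole row back
        if errors:
            return "load_with_flags", errors   # only attributes failed: load and flag for fixing
        return "load", errors

    validators = {                             # illustrative rules, not Velocity-defined checks
        "location_id": lambda v: None if v else "missing key element",
        "color":       lambda v: None if v in (None, "Red", "Black") else "invalid color",
    }
    print(classify_row({"location_id": "5001", "color": "BRed"}, validators))
    # -> ('load_with_flags', {'color': 'invalid color'})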

Handling Errors in Dimension Profiles

Profiles are tables used to track historical changes to the source data. As the source systems change, profile records
are created with date stamps that indicate when each change took place. This allows power users to review the target
data using either current (As-Is) or past (As-Was) views of the data.

A profile record should occur for each change in the source data. Problems occur when two fields change in the
source system and one of those fields results in an error. The first value passes validation, which produces a new
profile record, while the second value is rejected and is not included in the new profile. When this error is fixed, it
would be desirable to update the existing profile rather than creating a new one, but the logic needed to perform this
UPDATE instead of an INSERT is complicated. If a third field is changed in the source before the error is fixed, the
correction process is complicated further.

The following example represents three field values in a source system. The first row on 1/1/2000 shows the original
values. On 1/5/2000, Field 1 changes from Closed to Open, and Field 2 changes from Black to BRed, which is
invalid. On 1/10/2000, Field 3 changes from Open 9-5 to Open 24hrs, but Field 2 is still invalid. On 1/15/2000, Field
2 is finally fixed to Red.

Date Field 1 Value Field 2 Value Field 3 Value

1/1/2000 Closed Sunday Black Open 9 – 5

1/5/2000 Open Sunday BRed Open 9 – 5

1/10/2000 Open Sunday BRed Open 24hrs

1/15/2000 Open Sunday Red Open 24hrs

Three methods exist for handling the creation and update of profiles:

1. The first method produces a new profile record each time a change is detected in the source. If a field value
was invalid, then the original field value is maintained.

Date Profile Date Field 1 Value Field 2 Value Field 3 Value

1/1/2000 1/1/2000 Closed Sunday Black Open 9 – 5

1/5/2000 1/5/2000 Open Sunday Black Open 9 – 5

1/10/2000 1/10/2000 Open Sunday Black Open 24hrs

1/15/2000 1/15/2000 Open Sunday Red Open 24hrs

By applying all corrections as new profiles, this method keeps the process simple: every change detected in the
source system is applied directly to the target. Each change -- regardless of whether it is a fix to a previous error
-- creates a new profile. This incorrectly shows in the target that two changes occurred to the source information
when, in reality, a mistake was entered on the first change and the fix should be reflected in the first profile; the
second profile should not have been created.

2. The second method updates the first profile created on 1/5/2000 until all fields are corrected on 1/15/2000,
which loses the profile record for the change to Field 3.

If we try to apply changes to the existing profile, as in this method, we run the risk of losing profile
information. If the third field changes before the second field is fixed, we show the third field changed at the
same time as the first. When the second field was fixed, it would also be added to the existing profile, which
incorrectly reflects the changes in the source system.

3. The third method creates only two new profiles, but then causes an update to the profile records on
1/15/2000 to fix the Field 2 value in both.

Date        Profile Date         Field 1 Value   Field 2 Value   Field 3 Value

1/1/2000    1/1/2000             Closed Sunday   Black           Open 9 – 5

1/5/2000    1/5/2000             Open Sunday     Black           Open 9 – 5

1/10/2000   1/10/2000            Open Sunday     Black           Open 24hrs

1/15/2000   1/5/2000 (Update)    Open Sunday     Red             Open 9 – 5

1/15/2000   1/10/2000 (Update)   Open Sunday     Red             Open 24hrs

If we try to implement a method that updates old profiles when errors are fixed, as in this option, we need to create
complex algorithms that handle the process correctly. It involves being able to determine when an error occurred
and examining all profiles generated since then and updating them appropriately. And, even if we create the
algorithms to handle these methods, we still have an issue of determining if a value is a correction or a new value. If
an error is never fixed in the source system, but a new value is entered, we would identify it as a previous error,
causing an automated process to update old profile records, when in reality a new profile record should have been
entered.

Recommended Method

A method exists to track old errors so that we know when a value was rejected. Then, when the process encounters
a new, correct value it flags it as part of the load strategy as a potential fix that should be applied to old Profile
records. In this way, the corrected data enters the target as a new Profile record, but the process of fixing old Profile
records, and potentially deleting the newly inserted record, is delayed until the data is examined and an action is
decided. Once an action is decided, another process examines the existing Profile records and corrects them as
necessary. This method only delays the As-Was analysis of the data until the correction method is determined
because the current information is reflected in the new Profile.
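
A minimal sketch of this recommended method, assuming simple dictionary structures (not actual Velocity tables), shows how a corrected value can be flagged as a potential fix to old Profile records rather than silently updating them:

    # Sketch only: previously rejected values are tracked, and a newly valid value for the
    # same entity and field is flagged for manual review before old profiles are touched.
    rejected_values = {("WH-5001", "field2"): "BRed"}   # keyed by (entity, field); assumed layout

    def load_valid_value(entity, field, value, profile_queue):
        """Insert the value as a new profile, flagged if the field had a prior rejection."""
        flag = "POTENTIAL_FIX" if (entity, field) in rejected_values else "NORMAL"
        profile_queue.append({"entity": entity, "field": field, "value": value, "flag": flag})
        if flag == "POTENTIAL_FIX":
            rejected_values.pop((entity, field))        # hand off to the manual review step

    queue = []
    load_valid_value("WH-5001", "field2", "Red", queue)
    print(queue)   # the flagged row awaits a decision before old profiles are corrected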

Data Quality Edits

Quality indicators can be used to record definitive statements regarding the quality of the data received and stored
in the target. The indicators can be appended to existing data tables or stored in a separate table linked by the primary
key. Quality indicators can be used to:

● Show the record and field level quality associated with a given record at the time of extract.
● Identify data sources and errors encountered in specific records.
● Support the resolution of specific record error types via an update and resubmission process.

Quality indicators can be used to record several types of errors – e.g., fatal errors (missing primary key value),
missing data in a required field, wrong data type/format, or invalid data value. If a record contains even one error,
data quality (DQ) fields will be appended to the end of the record, one field for every field in the record. A data
quality indicator code is included in the DQ fields corresponding to the original fields in the record where the errors
were encountered. Records containing a fatal error are stored in a Rejected Record Table and associated to the
original file name and record number. These records cannot be loaded to the target because they lack a primary
key field to be used as a unique record identifier in the target.

The following types of errors cannot be processed:

● A source record does not contain a valid key. This record would be sent to a reject queue. Metadata will be
saved and used to generate a notice to the sending system indicating that x number of invalid records were
received and could not be processed. However, in the absence of a primary key, no tracking is possible to
determine whether the invalid record has been replaced or not.
● The source file or record is illegible. The file or record would be sent to a reject queue. Metadata indicating
that x number of invalid records were received and could not be processed may or may not be available for
a general notice to be sent to the sending system. In this case, due to the nature of the error, no tracking is
possible to determine whether the invalid record has been replaced or not. If the file or record is illegible, it
is likely that individual unique records within the file are not identifiable. While information can be provided
to the source system site indicating there are file errors for x number of records, specific problems may not
be identifiable on a record-by-record basis.

In these error types, the records can be processed, but they contain errors:

● A required (non-key) field is missing.


● The value in a numeric or date field is non-numeric.
● The value in a field does not fall within the range of acceptable values identified for the field. Typically, a
reference table is used for this validation.

When an error is detected during ingest and cleansing, the identified error type is recorded.

Quality Indicators (Quality Code Table)

The requirement to validate virtually every data element received from the source data systems mandates the
development, implementation, capture and maintenance of quality indicators. These are used to indicate the quality
of incoming data at an elemental level. Aggregated and analyzed over time, these indicators provide the
information necessary to identify acute data quality problems, systemic issues, business process problems and
information technology breakdowns.

The quality indicators ("0"-No Error, "1"-Fatal Error, "2"-Missing Data from a Required Field, "3"-Wrong Data Type/
Format, "4"-Invalid Data Value and "5"-Outdated Reference Table in Use) provide a concise indication of the quality of
the data within specific fields for every data type. They give operations staff, data quality analysts and users a quick
way to identify issues potentially impacting the quality of the data, while also providing the level of detail necessary
for acute quality problems to be remedied in a timely manner.
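
As an illustration, the sketch below assigns the quality indicator codes above to each field of a record and appends the DQ fields to it; the record layout and validation checks are illustrative assumptions (fatal-error handling via the Rejected Record Table is omitted).

    # Sketch: one DQ indicator is produced per original field and appended to the record.
    from datetime import datetime

    NO_ERROR, MISSING, WRONG_TYPE, INVALID = "0", "2", "3", "4"

    def dq_code(value, required=False, valid_set=None, date_format=None):
        if required and value in (None, ""):
            return MISSING
        if date_format and value:
            try:
                datetime.strptime(value, date_format)
            except ValueError:
                return WRONG_TYPE
        if valid_set is not None and value not in valid_set:
            return INVALID
        return NO_ERROR

    record = {"CUST_ID": "1001", "STATUS": "Vallid", "OPEN_DATE": "2006-02-30"}
    record.update({
        "CUST_ID_DQ":   dq_code(record["CUST_ID"], required=True),
        "STATUS_DQ":    dq_code(record["STATUS"], valid_set={"Valid", "Closed"}),
        "OPEN_DATE_DQ": dq_code(record["OPEN_DATE"], date_format="%m/%d/%Y"),
    })
    print(record)   # DQ fields appended: '0', '4' (invalid value) and '3' (wrong format)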

Handling Data Errors

The need to periodically correct data in the target is inevitable. But how often should these corrections be
performed?

The correction process can be as simple as updating field information to reflect actual values, or as complex as
deleting data from the target, restoring previous loads from tape, and then reloading the information correctly.
Although we try to avoid performing a complete database restore and reload from a previous point in time, we
cannot rule this out as a possible solution.

Reject Tables vs. Source System

As errors are encountered, they are written to a reject file so that business analysts can examine reports of the data
and the related error messages indicating the causes of error. The business needs to decide whether analysts
should be allowed to fix data in the reject tables, or whether data fixes will be restricted to source systems. If errors
are fixed in the reject tables, the target will not be synchronized with the source systems. This can present credibility
problems when trying to track the history of changes in the target data architecture. If all fixes occur in the source
systems, then these fixes must be applied correctly to the target data.

Attribute Errors and Default Values

Attributes provide additional descriptive information about a dimension concept. Attributes include things like the
color of a product or the address of a store. Attribute errors are typically things like an invalid color or inappropriate
characters in the address. These types of errors do not generally affect the aggregated facts and statistics in the
target data; the attributes are most useful as qualifiers and filtering criteria for drilling into the data (e.g., to find
specific patterns for market research). Attribute errors can be fixed by waiting for the source system to be corrected
and reapplied to the data in the target.

When attribute errors are encountered for a new dimensional value, default values can be assigned to let the new
record enter the target. Some rules that have been proposed for handling defaults are as follows:

Value Types        Description                               Default

Reference Values   Attributes that are foreign keys to       Unknown
                   other tables

Small Value Sets   Y/N indicator fields                      No

Other              Any other type of attribute               Null or business-provided value

Reference tables are used to normalize the target model to prevent the duplication of data. When a source value
does not translate into a reference table value, we use the ‘Unknown’ value. (All reference tables contain a value of
‘Unknown’ for this purpose.)

The business should provide default values for each identified attribute. Fields that are restricted to a limited domain
of values (e.g., On/Off or Yes/No indicators) are referred to as small-value sets. When errors are encountered in
translating these values, we use the value that represents off or ‘No’ as the default. Other values, like numbers, are
handled on a case-by-case basis. In many cases, the data integration process is set to populate ‘Null’ into these
fields, which means “undefined” in the target. After a source system value is corrected and passes validation, it is
corrected in the target.
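
A minimal sketch of these default rules, with assumed value-type names, might look like the following:

    # Sketch of the default-value rules in the table above; the type labels are assumptions.
    DEFAULTS = {
        "reference": "Unknown",   # foreign keys to reference tables
        "small_set": "No",        # Y/N indicator fields default to the 'No' value
        "other":     None,        # Null ("undefined") unless the business provides a value
    }

    def apply_default(value, value_type, passed_validation):
        return value if passed_validation else DEFAULTS[value_type]

    print(apply_default("XX", "reference", passed_validation=False))   # -> 'Unknown'
    print(apply_default("Y", "small_set", passed_validation=True))     # -> 'Y'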

Primary Key Errors

The business also needs to decide how to handle new dimensional values such as locations. Problems occur when
the new key is actually an update to an old key in the source system. For example, a location number is assigned
and the new location is transferred to the target using the normal process; then the location number is changed due
to some source business rule such as: all Warehouses should be in the 5000 range. The process assumes that the
change in the primary key is actually a new warehouse and that the old warehouse was deleted. This type of error
causes a separation of fact data, with some data being attributed to the old primary key and some to the new. An
analyst would be unable to get a complete picture.

Fixing this type of error involves integrating the two records in the target data, along with the related facts.
Integrating the two rows involves combining the profile information, taking care to coordinate the effective dates of
the profiles to sequence properly. If two profile records exist for the same day, then a manual decision is required as
to which is correct. If facts were loaded using both primary keys, then the related fact rows must be added together
and the originals deleted in order to correct the data.

The situation is more complicated when the opposite condition occurs (i.e., two primary keys mapped to the same
target data ID really represent two different IDs). In this case, it is necessary to restore the source information for
both dimensions and facts from the point in time at which the error was introduced, deleting affected records from
the target and reloading from the restore to correct the errors.

DM Facts Calculated from EDW Dimensions

If information is captured as dimensional data from the source, but used as measures residing on the fact records in
the target, we must decide how to handle the facts. From a data accuracy standpoint, we would like to reject the fact until
the value is corrected. If we load the facts with the incorrect data, the process to fix the target can be time-consuming
and difficult to implement.

If we let the facts enter downstream target structures, we need to create processes that update them after the
dimensional data is fixed. If we reject the facts when these types of errors are encountered, the fix process becomes
simpler. After the errors are fixed, the affected rows can simply be loaded and applied to the target data.

Fact Errors

If there are no business rules that reject fact records except for relationship errors to dimensional data, then when
we encounter errors that would cause a fact to be rejected, we save these rows to a reject table for reprocessing the
following night. This nightly reprocessing continues until the data successfully enters the target data structures.
Initial and periodic analyses should be performed on the errors to determine why they are not being loaded.

Data Stewards

Data Stewards are generally responsible for maintaining reference tables and translation tables, creating new
entities in dimensional data, and designating one primary data source when multiple sources exist. Reference data
and translation tables enable the target data architecture to maintain consistent descriptions across multiple source
systems, regardless of how the source system stores the data. New entities in dimensional data include new
locations, products, hierarchies, etc. Multiple source data occurs when two source systems can contain different
data for the same dimensional entity.

Reference Tables

The target data architecture may use reference tables to maintain consistent descriptions. Each table contains a
short code value as a primary key and a long description for reporting purposes. A translation table is associated
with each reference table to map the codes to the source system values. Using both of these tables, the ETL
process can load data from the source systems into the target structures.

The translation tables contain one or more rows for each source value and map the value to a matching row in the
reference table. For example, the SOURCE column in FILE X on System X can contain ‘O’, ‘S’ or ‘W’. The data
steward would be responsible for entering in the translation table the following values:

Source Value Code Translation

O OFFICE

S STORE

W WAREHSE

These values are used by the data integration process to correctly load the target. Other source systems that
maintain a similar field may use a two-letter abbreviation like ‘OF’, ‘ST’ and ‘WH’. The data steward would make the
following entries into the translation table to maintain consistency across systems:

Source Value Code Translation

OF OFFICE

ST STORE

WH WAREHSE

The data stewards are also responsible for maintaining the reference table that translates the codes into
descriptions. The ETL process uses the reference table to populate the following values into the target:

Code Translation Code Description

OFFICE Office

STORE Retail Store

WAREHSE Distribution Warehouse
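
A small sketch of the translation-table and reference-table lookups described above (the system names are illustrative placeholders for the two source systems):

    # Sketch: source values are translated to codes, and codes to reporting descriptions.
    translation = {("SYSTEM_X", "O"):  "OFFICE", ("SYSTEM_X", "S"):  "STORE", ("SYSTEM_X", "W"):  "WAREHSE",
                   ("SYSTEM_Y", "OF"): "OFFICE", ("SYSTEM_Y", "ST"): "STORE", ("SYSTEM_Y", "WH"): "WAREHSE"}
    reference = {"OFFICE": "Office", "STORE": "Retail Store", "WAREHSE": "Distribution Warehouse"}

    def translate(system, source_value):
        code = translation.get((system, source_value), "UNKNOWN")   # untranslatable values fall back
        return code, reference.get(code, "Unknown")

    print(translate("SYSTEM_Y", "ST"))   # -> ('STORE', 'Retail Store')
    print(translate("SYSTEM_X", "Z"))    # -> ('UNKNOWN', 'Unknown')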

Error handling is required when the data steward enters incorrect information for these mappings and needs to correct
it after data has been loaded. Correcting the above example could be complex (e.g., if the data steward entered
ST as translating to OFFICE by mistake). The only way to determine which rows should be changed is to restore
and reload source data from the first time the mistake was entered. Processes should be built to handle these types
of situations, including correction of the entire target data architecture.

Dimensional Data

New entities in dimensional data present a more complex issue. New entities in the target may include Locations
and Products, at a minimum. Dimensional data uses the same concept of translation as reference tables. These
translation tables map the source system value to the target value. For location, this is straightforward, but over
time, products may have multiple source system values that map to the same product in the target. (Other similar
translation issues may also exist, but Products serves as a good example for error handling.)

There are two possible methods for loading new dimensional entities. Either require the data steward to enter the
translation data before allowing the dimensional data into the target, or create the translation data through the ETL
process and force the data steward to review it. The first option requires the data steward to create the translation
for new entities, while the second lets the ETL process create the translation, but marks the record as ‘Pending
Verification’ until the data steward reviews it and changes the status to ‘Verified’ before any facts that reference it
can be loaded.

When the dimensional value is left as ‘Pending Verification’ however, facts may be rejected or allocated to dummy
values. This requires the data stewards to review the status of new values on a daily basis. A potential solution to
this issue is to generate an email each night if there are any translation table entries pending verification. The data
steward then opens a report that lists them.
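
The nightly check itself can be quite simple; the sketch below assumes a translation table with a status column, as described above, and composes the notification body (sending the email is left to the scheduler or workflow).

    # Sketch of the nightly "pending verification" check; table layout and statuses are assumptions.
    translation_entries = [
        {"source_value": "SKU-991", "target_id": 104, "status": "Pending Verification"},
        {"source_value": "SKU-880", "target_id": 105, "status": "Verified"},
    ]

    def pending_report(entries):
        pending = [e for e in entries if e["status"] == "Pending Verification"]
        if not pending:
            return None          # nothing to email tonight
        lines = [f"{e['source_value']} -> target id {e['target_id']}" for e in pending]
        return f"{len(pending)} translation entries await data steward review:\n" + "\n".join(lines)

    print(pending_report(translation_entries))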

A problem specific to Product occurs when a product treated as new is really just a changed SKU number. This causes
additional fact rows to be created, which produces an inaccurate view of the product when reporting. When this is
fixed, the fact rows for the various SKU numbers need to be merged and the original rows deleted. Profiles would
also have to be merged, requiring manual intervention.

The situation is more complicated when the opposite condition occurs (i.e., two products are mapped to the same
product, but really represent two different products). In this case, it is necessary to restore the source information for
all loads since the error was introduced. Affected records from the target should be deleted and then reloaded from
the restore to correctly split the data. Facts should be split to allocate the information correctly and dimensions split
to generate correct profile information.

Manual Updates

Over time, any system is likely to encounter errors that are not correctable using source systems. A method needs
to be established for manually entering fixed data and applying it correctly to the entire target data architecture,
including beginning and ending effective dates. These dates are useful for both profile and date event fixes. Further,
a log of these fixes should be maintained to enable identifying the source of the fixes as manual rather than part of
the normal load process.

Multiple Sources

The data stewards are also involved when multiple sources exist for the same data. This occurs when two sources
contain subsets of the required information. For example, one system may contain Warehouse and Store
information while another contains Store and Hub information. Because they share Store information, it is difficult to
decide which source contains the correct information.

When this happens, both sources have the ability to update the same row in the target. If both sources are allowed
to update the shared information, data accuracy and profile problems are likely to occur. If we update the shared
information on only one source system, the two systems then contain different information. If the changed system is
loaded into the target, it creates a new profile indicating the information changed. When the second system is
loaded, it compares its old unchanged value to the new profile, assumes a change occurred and creates another
new profile with the old, unchanged value. If the two systems remain different, the process causes two profiles to be
loaded every day until the two source systems are synchronized with the same information.

To avoid this type of situation, the business analysts and developers need to designate, at a field level, the source
that should be considered primary for the field. Then, only if the field changes on the primary source would it be
changed. While this sounds simple, it requires complex logic when creating Profiles, because multiple sources can
provide information toward the one profile record created for that day.

One solution to this problem is to develop a system of record for all sources. This allows developers to pull the
information from the system of record, knowing that there are no conflicts for multiple sources. Another solution is to
indicate, at the field level, a primary source where information can be shared from multiple sources. Developers can
use the field level information to update only the fields that are marked as primary. However, this requires additional
effort by the data stewards to mark the correct source fields as primary and by the data integration team to
customize the load process.
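
A hedged sketch of the second option, field-level primary-source merging (the field-to-source map below is an assumption standing in for the data stewards' designations):

    # Only fields for which the incoming system is marked primary are applied to the target row.
    PRIMARY_SOURCE = {"store_name": "SYS_A", "store_hours": "SYS_B", "region": "SYS_A"}

    def merge(current, incoming, source_system):
        """Apply only those incoming fields for which this source is designated primary."""
        merged = dict(current)
        for field, value in incoming.items():
            if PRIMARY_SOURCE.get(field) == source_system:
                merged[field] = value
        return merged

    target_row = {"store_name": "Main St", "store_hours": "9-5", "region": "West"}
    print(merge(target_row, {"store_name": "Main Street", "store_hours": "24hrs"}, "SYS_B"))
    # only store_hours changes; store_name waits for a change on the primary source SYS_A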

Last updated: 05-Jun-08 12:48

Error Handling Techniques - PowerCenter Mappings

Challenge

Identifying and capturing data errors using a mapping approach, and making such errors available for further processing or correction.

Description

Identifying errors and creating an error handling strategy is an essential part of a data integration project. In the
production environment, data must be checked and validated prior to entry into the target system. One strategy for catching data
errors is to use PowerCenter mappings and error logging capabilities to catch specific data validation errors and
unexpected transformation or database constraint errors.

Data Validation Errors

The first step in using a mapping to trap data validation errors is to understand and identify the error handling requirements.

Consider the following questions:

● What types of data errors are likely to be encountered?


● Of these errors, which ones should be captured?
● What process can capture the possible errors?
● Should errors be captured before they have a chance to be written to the target database?
● Will any of these errors need to be reloaded or corrected?
● How will the users know if errors are encountered?
● How will the errors be stored?
● Should descriptions be assigned for individual errors?
● Can a table be designed to store captured errors and the error descriptions?

Capturing data errors within a mapping and re-routing these errors to an error table facilitates analysis by end users and
improves performance. One practical application of the mapping approach is to capture foreign key constraint errors (e.g., executing
a lookup on a dimension table prior to loading a fact table). Referential integrity is assured by including this sort of functionality in
a mapping. While the database still enforces the foreign key constraints, erroneous data is not written to the target table;
constraint errors are captured within the mapping so that the PowerCenter server does not have to write them to the session log
and the reject/bad file, thus improving performance.

Data content errors can also be captured in a mapping. Mapping logic can identify content errors and attach descriptions to them.
This approach can be effective for many types of data content error, including: date conversion, null values intended for not null
target fields, and incorrect data formats or data types.

Sample Mapping Approach for Data Validation Errors

In the following example, customer data is to be checked to ensure that invalid null values are intercepted before being written to
not null columns in a target CUSTOMER table. Once a null value is identified, the row containing the error is to be separated from
the data flow and logged in an error table.

One solution is to implement a mapping similar to the one shown below:

An expression transformation can be employed to validate the source data, applying rules and flagging records with one or more errors.

A router transformation can then separate valid rows from those containing the errors. It is good practice to append error rows with
a unique key; this can be a composite consisting of a MAPPING_ID and ROW_ID, for example. The MAPPING_ID would refer to
the mapping name and the ROW_ID would be created by a sequence generator.

The composite key is designed to allow developers to trace rows written to the error tables that store information useful for
error reporting and investigation. In this example, two error tables are suggested, namely: CUSTOMER_ERR and ERR_DESC_TBL.

The table ERR_DESC_TBL, is designed to hold information about the error, such as the mapping name, the ROW_ID, and the
error description. This table can be used to hold all data validation error descriptions for all mappings, giving a single point of
reference for reporting.

The CUSTOMER_ERR table can be an exact copy of the target CUSTOMER table appended with two additional columns:
ROW_ID and MAPPING_ID. These columns allow the two error tables to be joined. The CUSTOMER_ERR table stores the entire
row that was rejected, enabling the user to trace the error rows back to the source and potentially build mappings to reprocess them.

The mapping logic must assign a unique description for each error in the rejected row. In this example, any null value intended for
a not null target field could generate an error message such as ‘NAME is NULL’ or ‘DOB is NULL’. This step can be done in
an expression transformation (e.g., EXP_VALIDATION in the sample mapping).

After the field descriptions are assigned, the error row can be split into several rows, one for each possible error using a
normalizer transformation. After a single source row is normalized, the resulting rows can be filtered to leave only errors that
are present (i.e., each record can have zero to many errors). For example, if a row has three errors, three error rows would
be generated with appropriate error descriptions (ERROR_DESC) in the table ERR_DESC_TBL.

The following table shows how the error data produced may look.

Table Name: CUSTOMER_ERR

NAME DOB ADDRESS ROW_ID MAPPING_ID

NULL NULL NULL 1 DIM_LOAD

Table Name: ERR_DESC_TBL

FOLDER_NAME MAPPING_ID ROW_ID ERROR_DESC LOAD_DATE SOURCE Target

CUST DIM_LOAD 1 Name is NULL 10/11/2006 CUSTOMER_FF CUSTOMER

CUST DIM_LOAD 1 DOB is NULL 10/11/2006 CUSTOMER_FF CUSTOMER

CUST DIM_LOAD 1 Address is NULL 10/11/2006 CUSTOMER_FF CUSTOMER
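
The following Python analogue (a sketch, not actual PowerCenter objects) mirrors the validate, route and normalize flow above: one rejected source row yields one CUSTOMER_ERR row plus one ERR_DESC_TBL row per failed field.

    # Sketch of the validation logic; the column names follow the example tables above.
    MAPPING_ID, FOLDER_NAME = "DIM_LOAD", "CUST"

    def validate(row, row_id, not_null_fields):
        errors = [f"{field} is NULL" for field in not_null_fields if row.get(field) is None]
        if not errors:
            return row, None, []                                  # clean row goes to CUSTOMER
        err_row = {**row, "ROW_ID": row_id, "MAPPING_ID": MAPPING_ID}
        err_descs = [{"FOLDER_NAME": FOLDER_NAME, "MAPPING_ID": MAPPING_ID, "ROW_ID": row_id,
                      "ERROR_DESC": desc, "SOURCE": "CUSTOMER_FF", "TARGET": "CUSTOMER"}
                     for desc in errors]
        return None, err_row, err_descs

    clean, err_row, err_descs = validate({"NAME": None, "DOB": None, "ADDRESS": None}, 1,
                                         ["NAME", "DOB", "ADDRESS"])
    print(err_row)          # -> the CUSTOMER_ERR row
    print(len(err_descs))   # -> 3 ERR_DESC_TBL rows, one per null field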

The efficiency of a mapping approach can be increased by employing reusable objects. Common logic should be placed in
mapplets, which can be shared by multiple mappings. This improves productivity in implementing and managing the capture of
data validation errors.

Data validation error handling can be extended by including mapping logic to grade error severity. For example, flagging data
validation errors as ‘soft’ or ‘hard’.

● A ‘hard’ error can be defined as one that would fail when being written to the database, such as a constraint error.
● A ‘soft’ error can be defined as a data content error.

A record flagged as ‘hard’ can be filtered from the target and written to the error tables, while a record flagged as ‘soft’ can be written
to both the target system and the error tables. This gives business analysts an opportunity to evaluate and correct data
imperfections while still allowing the records to be processed for end-user reporting.

Ultimately, business organizations need to decide if the analysts should fix the data in the reject table or in the source systems.
The advantage of the mapping approach is that all errors are identified as either data errors or constraint errors and can be
properly addressed. The mapping approach also reports errors based on projects or categories by identifying the mappings
that contain errors. The most important aspect of the mapping approach however, is its flexibility. Once an error type is identified,
the error handling logic can be placed anywhere within a mapping. By using the mapping approach to capture identified errors,
the operations team can effectively communicate data quality issues to the business users.

Constraint and Transformation Errors

Perfect data can never be guaranteed. In implementing the mapping approach described above to detect errors and log them to
an error table, how can we handle unexpected errors that arise in the load? For example, PowerCenter may apply the validated data
to the database; however the relational database management system (RDBMS) may reject it for some unexpected
reason. An RDBMS may, for example, reject data if constraints are violated. Ideally, we would like to detect these database-level
errors automatically and send them to the same error table used to store the soft errors caught by the mapping approach
described above.

In some cases, the ‘stop on errors’ session property can be set to ‘1’ to stop source data for which unhandled errors were
encountered from being loaded. In this case, the process will stop with a failure, the data must be corrected, and the entire source
may need to be reloaded or recovered. This is not always an acceptable approach.

An alternative might be to have the load process continue in the event of records being rejected, and then reprocess only the
records that were found to be in error. This can be achieved by configuring the ‘stop on errors’ property to 0 and switching on
relational error logging for a session. By default, the error-messages from the RDBMS and any un-caught transformation errors
are sent to the session log. Switching on relational error logging redirects these messages to a selected database in which four
tables are automatically created: PMERR_MSG, PMERR_DATA, PMERR_TRANS and PMERR_SESS.

The PowerCenter Workflow Administration Guide contains detailed information on the structure of these tables. In short,
the PMERR_MSG table stores the error messages that were encountered in a session. The following four columns of this table
allow us to retrieve any RDBMS errors:

● SESS_INST_ID: A unique identifier for the session. Joining this table with the Metadata Exchange (MX)
View REP_LOAD_SESSIONS in the repository allows the MAPPING_ID to be retrieved.

● TRANS_NAME: Name of the transformation where an error occurred. When an RDBMS error occurs, this is the name of the
target transformation.

● TRANS_ROW_ID: Specifies the row ID generated by the last active source. This field contains the row number at the target
when the error occurred.

● ERROR_MSG: Error message generated by the RDBMS.

With this information, all RDBMS errors can be extracted and stored in an applicable error table. A post-load session (i.e., an
additional PowerCenter session) can be implemented to read the PMERR_MSG table, join it with the MX View
REP_LOAD_SESSION in the repository, and insert the error details into ERR_DESC_TBL. When the post process
ends, ERR_DESC_TBL will contain both ‘soft’ errors and ‘hard’ errors.
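
A minimal sketch of this post-load step is shown below; a plain dictionary stands in for the lookup normally performed against the REP_LOAD_SESSIONS MX view, whose exact layout is not reproduced here.

    # RDBMS error rows (columns documented above) are reshaped into ERR_DESC_TBL-style rows.
    pmerr_msg_rows = [
        {"SESS_INST_ID": 3, "TRANS_NAME": "Customer_Table", "TRANS_ROW_ID": 2,
         "ERROR_MSG": "unique constraint violated"},
    ]
    sess_to_mapping = {3: "DIM_LOAD"}     # assumed lookup derived from REP_LOAD_SESSIONS

    err_desc_rows = [
        {"MAPPING_ID": sess_to_mapping.get(row["SESS_INST_ID"], "UNKNOWN"),
         "ROW_ID": row["TRANS_ROW_ID"],
         "ERROR_DESC": row["ERROR_MSG"],
         "TARGET": row["TRANS_NAME"]}
        for row in pmerr_msg_rows
    ]
    print(err_desc_rows)                  # 'hard' errors now sit alongside the 'soft' errors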

One problem with capturing RDBMS errors in this way is mapping them to the relevant source key to provide lineage. This can
be difficult when the source and target rows are not directly related (i.e., one source row can actually result in zero or more rows at
the target). In this case, the mapping that loads the source must write translation data to a staging table (including the source key
and target row number). The translation table can then be used by the post-load session to identify the source key by the target
row number retrieved from the error log. The source key stored in the translation table could be a row number in the case of a flat
file, or a primary key in the case of a relational data source.

Reprocessing

After the load and post-load sessions are complete, the error table (e.g., ERR_DESC_TBL) can be analyzed by members of
the business or operational teams. The rows listed in this table have not been loaded into the target database. The operations
team can, therefore, fix the data in the source that resulted in ‘soft’ errors and may be able to explain and remediate the ‘hard’ errors.

Once the errors have been fixed, the source data can be reloaded. Ideally, only the rows resulting in errors during the first run
should be reprocessed in the reload. This can be achieved by including a filter and a lookup in the original load mapping and using
a parameter to configure the mapping for an initial load or for a reprocess load. If the mapping is reprocessing, the lookup searches
for each source row number in the error table, while the filter removes source rows for which the lookup has not found errors. If
initial loading, all rows are passed through the filter, validated, and loaded.

With this approach, the same mapping can be used for initial and reprocess loads. During a reprocess run, the records
successfully loaded should be deleted (or marked for deletion) from the error table, while any new errors encountered should
be inserted as if an initial run. On completion, the post-load process is executed to capture any new RDBMS errors. This ensures
that reprocessing loads are repeatable and result in reducing numbers of records in the error table over time.
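
The filter logic driving a single mapping for both modes can be sketched as follows; the load-type parameter name and row-number convention are assumptions, not PowerCenter-defined settings.

    # Sketch: during a reprocess run only rows listed in the error table pass the filter.
    error_row_numbers = {2, 7}            # source rows currently in the error table

    def rows_to_process(source_rows, load_type):
        if load_type == "INITIAL":
            return list(source_rows)                                          # validate and load all
        return [r for r in source_rows if r["row_num"] in error_row_numbers]  # reprocess: errors only

    source = [{"row_num": n} for n in range(1, 10)]
    print(len(rows_to_process(source, "INITIAL")))     # 9
    print(rows_to_process(source, "REPROCESS"))        # rows 2 and 7 only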

Last updated: 01-Feb-07 18:53

Error Handling Techniques - PowerCenter Workflows and Data
Analyzer

Challenge

Implementing an efficient strategy to identify different types of errors in the ETL process, correct the errors, and
reprocess the corrected data.

Description

Identifying errors and creating an error handling strategy is an essential part of a data warehousing project. The errors in
an ETL process can be broadly categorized into two types: data errors in the load process, which are defined by the
standards of acceptable data quality; and process errors, which are driven by the stability of the process itself.

The first step in implementing an error handling strategy is to understand and define the error handling requirement.
Consider the following questions:

● What tools and methods can help in detecting all the possible errors?
● What tools and methods can help in correcting the errors?
● What is the best way to reconcile data across multiple systems?
● Where and how will the errors be stored? (i.e., relational tables or flat files)

A robust error handling strategy can be implemented using PowerCenter’s built-in error handling capabilities along with
Data Analyzer as follows:

● Process Errors: Configure an email task to notify the PowerCenter Administrator immediately of any process
failures.
● Data Errors: Set up the ETL process to:

❍ Use the Row Error Logging feature in PowerCenter to capture data errors in the PowerCenter error tables
for analysis, correction, and reprocessing.
❍ Set up Data Analyzer alerts to notify the PowerCenter Administrator in the event of any rejected rows.
❍ Set up customized Data Analyzer reports and dashboards at the project level to provide information on
failed sessions, sessions with failed rows, load time, etc.

Configuring an Email Task to Handle Process Failures

Configure all workflows to send an email to the PowerCenter Administrator, or any other designated recipient, in the
event of a session failure. Create a reusable email task and use it in the “On Failure Email” property settings in the
Components tab of the session, as shown in the following figure.

When you configure the subject and body of a post-session email, use email variables to include information about the
session run, such as session name, mapping name, status, total number of records loaded, and total number of records
rejected. The following table lists all the available email variables:

Email Variables for Post-Session Email

Email Variable Description

%s Session name.

%e Session status.

%b Session start time.

%c Session completion time.

%i Session elapsed time (session completion time-session start time).

%l Total rows loaded.

%r Total rows rejected.

%t Source and target table details, including read throughput in bytes per second and write throughput in rows per
second. The PowerCenter Server includes all information displayed in the session detail dialog box.

%m Name of the mapping used in the session.

%n Name of the folder containing the session.

%d Name of the repository containing the session.

%g Attach the session log to the message.

%a<filename> Attach the named file. The file must be local to the PowerCenter Server. The following are valid file
names: %a<c:\data\sales.txt> or %a</users/john/data/sales.txt>.

Note: The file name cannot include the greater than character (>) or a line break.

Note: The PowerCenter Server ignores %a, %g, or %t when you include them in the email subject. Include
these variables in the email message only.
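
For example, the body of the reusable failure-notification email task might be configured along the following lines (an illustrative template built only from the variables above, not a prescribed format):

    Session %s (mapping %m, folder %n) completed with status: %e
    Started: %b   Completed: %c   Elapsed: %i
    Rows loaded: %l   Rows rejected: %r
    Session log attached: %g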

Configuring Row Error Logging in PowerCenter

PowerCenter provides you with a set of four centralized error tables into which all data errors can be logged. Using these
tables to capture data errors greatly reduces the time and effort required to implement an error handling strategy when
compared with a custom error handling solution.

When you configure a session, you can choose to log row errors in this central location. When a row error occurs, the
PowerCenter Server logs error information that allows you to determine the cause and source of the error. The
PowerCenter Server logs information such as source name, row ID, current row data, transformation, timestamp, error
code, error message, repository name, folder name, session name, and mapping information. This error metadata is
logged for all row-level errors, including database errors, transformation errors, and errors raised through the ERROR()
function, such as business rule violations.

Logging row errors into relational tables rather than flat files enables you to report on and fix the errors easily. When you
enable error logging and chose the ‘Relational Database’ Error Log Type, the PowerCenter Server offers you the
following features:

● Generates the following tables to help you track row errors:

❍ PMERR_DATA. Stores data and metadata about a transformation row error and its corresponding source
row.
❍ PMERR_MSG. Stores metadata about an error and the error message.
❍ PMERR_SESS. Stores metadata about the session.
❍ PMERR_TRANS. Stores metadata about the source and transformation ports, such as name and datatype,
when a transformation error occurs.

■ Appends error data to the same tables cumulatively, if they already exist, for further runs of the
session.
■ Allows you to specify a prefix for the error tables. For instance, if you want all your EDW session errors
to go to one set of error tables, you can specify the prefix as ‘EDW_’.
■ Allows you to collect row errors from multiple sessions in a centralized set of four error tables. To do
this, you specify the same error log table name prefix for all sessions.

Example:

In the following figure, the session ‘s_m_Load_Customer’ loads Customer Data into the EDW Customer table. The
Customer Table in EDW has the following structure:

CUSTOMER_ID NOT NULL NUMBER (PRIMARY KEY)

CUSTOMER_NAME NULL VARCHAR2(30)

CUSTOMER_STATUS NULL VARCHAR2(10)

There is a primary key constraint on the column CUSTOMER_ID.

To take advantage of PowerCenter’s built-in error handling features, you would set the session properties as shown
below:

The session property ‘Error Log Type’ is set to ‘Relational Database’, and ‘Error Log DB Connection’ and ‘Table name
Prefix’ values are given accordingly.

When the PowerCenter server detects any rejected rows because of Primary Key Constraint violation, it writes
information into the Error Tables as shown below:

EDW_PMERR_DATA

WORKFLOW_RUN_ID  WORKLET_RUN_ID  SESS_INST_ID  TRANS_NAME      TRANS_ROW_ID  TRANS_ROW_DATA                             SOURCE_ROW_ID  SOURCE_ROW_TYPE  SOURCE_ROW_DATA  LINE_NO

8                0               3             Customer_Table  1             D:1001:000000000000|D:Elvis Pres|D:Valid   -1             -1               N/A              1

8                0               3             Customer_Table  2             D:1002:000000000000|D:James Bond|D:Valid   -1             -1               N/A              1

8                0               3             Customer_Table  3             D:1003:000000000000|D:Michael Ja|D:Valid   -1             -1               N/A              1

EDW_PMERR_MSG

WORKFLOW_RUN_ID  SESS_INST_ID  SESS_START_TIME  REPOSITORY_NAME  FOLDER_NAME  WORKFLOW_NAME  TASK_INST_PATH  MAPPING_NAME  LINE_NO

6                3             9/15/2004 18:31  pc711            Folder1      wf_test1       s_m_test1       m_test1       1

7                3             9/15/2004 18:33  pc711            Folder1      wf_test1       s_m_test1       m_test1       1

8                3             9/15/2004 18:34  pc711            Folder1      wf_test1       s_m_test1       m_test1       1

EDW_PMERR_SESS

WORKFLOW_RUN_ID  SESS_INST_ID  SESS_START_TIME  REPOSITORY_NAME  FOLDER_NAME  WORKFLOW_NAME  TASK_INST_PATH  MAPPING_NAME  LINE_NO

6                3             9/15/2004 18:31  pc711            Folder1      wf_test1       s_m_test1       m_test1       1

7                3             9/15/2004 18:33  pc711            Folder1      wf_test1       s_m_test1       m_test1       1

8                3             9/15/2004 18:34  pc711            Folder1      wf_test1       s_m_test1       m_test1       1

EDW_PMERR_TRANS

WORKFLOW_RUN_ID  SESS_INST_ID  TRANS_NAME      TRANS_GROUP  TRANS_ATTR                                            LINE_NO

8                3             Customer_Table  Input        Customer_Id:3, Customer_Name:12, Customer_Status:12  1

By looking at the workflow run ID and the other fields, you can analyze the rejected rows and reprocess them once the underlying errors have been fixed.

Error Detection and Notification using Data Analyzer

Informatica provides Data Analyzer for PowerCenter Repository Reports with every PowerCenter license. Data Analyzer
is Informatica’s powerful business intelligence tool that is used to provide insight into the PowerCenter repository
metadata.

You can use the Operations Dashboard provided with the repository reports as one central location to gain insight into
production environment ETL activities. In addition, the following capabilities of Data Analyzer are recommended best
practices:

● Configure alerts to send an email or a pager message to the PowerCenter Administrator whenever there is an
entry made into the error tables PMERR_DATA or PMERR_TRANS.
● Configure reports and dashboards to provide detailed session run information grouped by projects/PowerCenter
folders for easy analysis.
● Configure reports to provide detailed information of the row level errors for each session. This can be
accomplished by using the four error tables as sources of data for the reports

Data Reconciliation Using Data Analyzer

Business users often like to see certain metrics matching from one system to another (e.g., source system to ODS, ODS
to targets, etc.) to ascertain that the data has been processed accurately. This is frequently accomplished by writing
tedious queries, comparing two separately produced reports, or using constructs such as DBLinks.

Upgrading the Data Analyzer license from Repository Reports to a full license enables Data Analyzer to source your
company’s data (e.g., source systems, staging areas, ODS, data warehouse, and data marts) and provide a reliable and
reusable way to accomplish data reconciliation. Using Data Analyzer’s reporting capabilities, you can select data from
various data sources such as ODS, data marts, and data warehouses to compare key reconciliation metrics and numbers
through aggregate reports. You can further schedule the reports to run automatically every time the relevant
PowerCenter sessions complete, and set up alerts to notify the appropriate business or technical users in case of any
discrepancies.

For example, a report can be created to ensure that the same number of customers exists in the ODS in comparison to a
data warehouse and/or any downstream data marts. The reconciliation reports should be relevant to a business user by
comparing key metrics (e.g., customer counts, aggregated financial metrics, etc.) across data silos. Such reconciliation
reports can be run automatically after PowerCenter loads the data, or they can be run by technical or business users on
demand. This process allows users to verify the accuracy of data and builds confidence in the data warehouse solution.
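
The essence of such a reconciliation check can be sketched as follows; the counts would normally come from Data Analyzer reports sourced from each system, and the figures below are made up for illustration.

    # Sketch of a customer-count reconciliation across data silos.
    customer_counts = {"ODS": 125040, "EDW": 125040, "DATA_MART": 124998}

    baseline = customer_counts["ODS"]
    discrepancies = {name: n for name, n in customer_counts.items() if n != baseline}
    if discrepancies:
        print(f"Reconciliation alert: counts differ from ODS ({baseline}): {discrepancies}")
    else:
        print("Customer counts reconcile across all silos")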

Last updated: 09-Feb-07 14:22

Creating Inventories of Reusable Objects &
Mappings

Challenge

Successfully identify the need and scope of reusability. Create inventories of reusable
objects within a folder, of shortcuts across folders (local shortcuts), or of shortcuts across
repositories (global shortcuts).

Successfully identify and create inventories of mappings based on business rules.

Description
Reusable Objects

Prior to creating an inventory of reusable objects or shortcut objects, be sure to review
the business requirements and look for any common routines and/or modules that may
appear in more than one data movement. These common routines are excellent
candidates for reusable objects or shortcut objects. In PowerCenter, these objects can
be created as:

● single transformations (i.e., lookups, filters, etc.)


● a reusable mapping component (i.e., a group of transformations - mapplets)
● single tasks in workflow manager (i.e., command, email, or session)
● a reusable workflow component (i.e., a group of tasks in workflow manager -
worklets).

Please note that shortcuts are not supported for workflow level objects (Tasks).

Identify the need for reusable objects based on the following criteria:

● Is there enough usage and complexity to warrant the development of a common
object?
● Are the data types of the information passing through the reusable object the
same from case to case, or is it simply the same high-level steps with different
fields and data?
Identify the Scope based on the following criteria:

● Do these objects need to be shared within the same folder? If so, create re-
usable objects within the folder.
● Do these objects need to be shared in several other PowerCenter repository
folders? If so, create local shortcuts.
● Do these objects need to be shared across repositories? If so, create a
global repository and maintain these re-usable objects in the global repository.
Create global shortcuts to these reusable objects from the local repositories.
Note: Shortcuts cannot be created for workflow objects.

PowerCenter Designer Objects

Creating and testing common objects does not always save development time or
facilitate future maintenance. For example, if a simple calculation like subtracting a
current rate from a budget rate is going to be used in two different mappings,
carefully consider whether the effort to create, test, and document the common object is
worthwhile. Often, it is simpler to add the calculation to both mappings. However, if the
calculation were to be performed in a number of mappings, if it was very difficult, and if
all occurrences would be updated following any change or fix, then the calculation would
be an ideal case for a reusable object. When you add instances of a reusable
transformation to mappings, be careful that the changes do not invalidate the mapping or
generate unexpected data. The Designer stores each reusable transformation as
metadata, separate from any mapping that uses the transformation.

The second criterion for a reusable object concerns the data that will pass through the
reusable object. Developers often encounter situations where they may perform a certain
type of high-level process (i.e., a filter, expression, or update strategy) in two or more
mappings. For example, if you have several fact tables that require a series of dimension
keys, you can create a mapplet containing a series of lookup transformations to find each
dimension key. You can then use the mapplet in each fact table mapping, rather than
recreating the same lookup logic in each mapping. This seems like a great candidate for
a mapplet. However, after performing half of the mapplet work, the developers may
realize that the actual data or ports passing through the high-level logic are totally
different from case to case, thus making the use of a mapplet impractical. Consider
whether there is a practical way to generalize the common logic so that it can be
successfully applied to multiple cases. Remember, when creating a reusable object, the
actual object will be replicated in one to many mappings. Thus, in each mapping using
the mapplet or reusable transformation object, the same size and number of ports must
pass into and out of the mapping/reusable object.

Document the list of the reusable objects that pass these criteria, providing a high-level
description of what each object will accomplish. The detailed design will occur in a future
subtask, but at this point the intent is to identify the number and functionality of reusable
objects that will be built for the project. Keep in mind that it will be impossible to identify
one hundred percent of the reusable objects at this point; the goal here is to create an
inventory of as many as possible, and hopefully the most difficult ones. The remainder
will be discovered while building the data integration processes.

PowerCenter Workflow Manager Objects

In some cases, we may have to read data from different sources, pass it through the
same transformation logic, and write it to either one destination database or
multiple destination databases. Sometimes, depending on the availability of the
source, these loads also have to be scheduled at different times. This is the ideal
case for creating a re-usable session and overriding, at the session instance level,
the database connections, pre-session commands, and post-session commands.

Logging load statistics and evaluating failure and success criteria are common pieces of
logic executed for multiple loads in most projects. Some of these common
tasks include:

● Notification when the number of rows loaded is less than expected
● Notification when there are any reject rows, using email tasks and link conditions
● Successful completion notification based on success criteria, such as the number of rows
loaded, using email tasks and link conditions
● Failing the load based on failure criteria, such as load statistics or the status of some critical
session, using a control task
● Stopping/aborting a workflow based on some failure criteria using a control task
● Based on previous session completion times, calculating the amount of time a
downstream session has to wait before it can start, using worklet variables, a
timer task and an assignment task

Re-usable worklets can be developed to encapsulate the above-mentioned tasks and
can be used in multiple loads. By passing workflow variable values to the worklets and
assigning them to worklet variables, one can easily encapsulate common workflow logic.

Mappings

A mapping is a set of source and target definitions linked by transformation objects that
define the rules for data transformation. Mappings represent the data flow between
sources and targets. In a simple world, a single source table would populate a single
target table. However, in practice, this is usually not the case. Sometimes multiple
sources of data need to be combined to create a target table, and sometimes a single
source of data creates many target tables. The latter is especially true for mainframe
data sources where COBOL OCCURS statements litter the landscape. In a typical
warehouse or data mart model, each OCCURS statement decomposes to a separate
table.

The goal here is to create an inventory of the mappings needed for the project. For this
exercise, the challenge is to think in terms of individual components of data movement. While the
business may consider a fact table and its three related dimensions as a single ‘object’ in
the data mart or warehouse, five mappings may be needed to populate the
corresponding star schema with data (i.e., one for each of the dimension tables and two
for the fact table, each from a different source system).

Typically, when creating an inventory of mappings, the focus is on the target tables, with
an assumption that each target table has its own mapping, or sometimes multiple
mappings. While often true, if a single source of data populates multiple tables, this
approach yields multiple mappings. Efficiencies can sometimes be realized by loading
multiple tables from a single source. By simply focusing on the target tables, however,
these efficiencies can be overlooked.

A more comprehensive approach to creating the inventory of mappings is to create a
spreadsheet listing all of the target tables. Create a column with a number next to each
target table. For each of the target tables, in another column, list the source file or table
that will be used to populate the table. In the case of multiple source tables per target,
create two rows for the target, each with the same number, and list the additional
source(s) of data.

The table would look similar to the following:

Number   Target Table    Source

1        Customers       Cust_File
2        Products        Items
3        Customer_Type   Cust_File
4        Orders_Item     Tickets
4        Orders_Item     Ticket_Items

When completed, the spreadsheet can be sorted either by target table or source table.
Sorting by source table can help determine potential mappings that create multiple
targets.
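
Where the listing grows large, the same sort-and-group exercise can be automated. The following is a
minimal Python sketch, assuming the listing has been exported to a hypothetical CSV file named
mapping_inventory.csv with columns number, target, and source; adjust the names to match your own
spreadsheet.

# Minimal sketch: derive the mapping inventory from a target/source listing.
# Assumes "mapping_inventory.csv" with hypothetical columns: number, target, source.
import csv
from collections import defaultdict

with open("mapping_inventory.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Group by source to spot sources that feed more than one target --
# candidates for loading several tables in a single mapping.
targets_by_source = defaultdict(set)
for row in rows:
    targets_by_source[row["source"]].add(row["target"])
for source, targets in sorted(targets_by_source.items()):
    if len(targets) > 1:
        print(f"{source} feeds multiple targets: {', '.join(sorted(targets))}")

# Re-group by mapping number to produce the merged inventory:
# one row per mapping, listing all of its targets and all of its sources.
inventory = defaultdict(lambda: {"targets": set(), "sources": set()})
for row in rows:
    inventory[row["number"]]["targets"].add(row["target"])
    inventory[row["number"]]["sources"].add(row["source"])
for number in sorted(inventory, key=int):
    entry = inventory[number]
    print(number, ", ".join(sorted(entry["targets"])), ", ".join(sorted(entry["sources"])))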

When using a source to populate multiple tables at once for efficiency, be sure to keep
restartability and reloadability in mind. The mapping will always load two or more target
tables from the source, so there will be no easy way to rerun a single table. In this
example, potentially the Customers table and the Customer_Type tables can be loaded
in the same mapping.

When merging targets into one mapping in this manner, give both targets the same
number. Then, re-sort the spreadsheet by number. For the mappings with multiple
sources or targets, merge the data back into a single row to generate the inventory of
mappings, with each number representing a separate mapping.

The resulting inventory would look similar to the following:

Number Target Table Source


1 Customers, Customer_Type Cust_File
2 Products Items
4 Orders_Item Tickets, Ticket_Items

At this point, it is often helpful to record some additional information about each mapping
to help with planning and maintenance.

First, give each mapping a name. Apply the naming standards generated in 3.2 Design
Development Architecture. These names can then be used to distinguish mappings from
one another and can also be put on the project plan as individual tasks.

Next, determine for the project a threshold for a high, medium, or low number of target
rows. For example, in a warehouse where dimension tables are likely to number in the
thousands and fact tables in the hundred thousands, the following thresholds might apply:

● Low – 1 to 10,000 rows


● Medium – 10,000 to 100,000 rows
● High – 100,000 rows +

Assign a likely row volume (high, medium or low) to each of the mappings based on the
expected volume of data to pass through the mapping. These high level estimates will
help to determine how many mappings are of ‘high’ volume; these mappings will be the
first candidates for performance tuning.
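
For example, the threshold assignment can be expressed as a simple helper (a sketch only; the cut-off
values are the example thresholds above and should be replaced with your project's own):

def volume_category(expected_rows: int) -> str:
    """Classify expected target row volume using the example thresholds above."""
    if expected_rows > 100_000:
        return "High"      # first candidates for performance tuning
    if expected_rows > 10_000:
        return "Medium"
    return "Low"

print(volume_category(250_000))   # -> High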

Add any other columns of information that might be useful to capture about each
mapping, such as a high-level description of the mapping functionality, resource
(developer) assigned, initial estimate, actual completion time, or complexity rating.

Last updated: 05-Jun-08 13:10



Metadata Reporting and Sharing

Challenge

Using Informatica's suite of metadata tools effectively in the design of the end-user analysis application.

Description

The Informatica tool suite can capture extensive levels of metadata, but the amount of metadata that is entered depends on
the metadata strategy. Detailed information or metadata comments can be entered for all repository objects (e.g., mappings,
sources, targets, transformations, ports, etc.). All information about column size and scale, data types, and primary keys
is also stored in the repository. The decision on how much metadata to create is often driven by project timelines. While it may
be beneficial for a developer to enter detailed descriptions of each column, expression, variable, etc., it also requires extra
time and effort to do so. Once that information is in the Informatica repository, it can be retrieved at any time using
Metadata Reporter. Several out-of-the-box reports are available, and customized reports can also be created
to view that information. There are several options for exporting these reports (e.g., Excel spreadsheet, Adobe .pdf file,
etc.). Informatica offers two ways to access the repository metadata:

● Metadata Reporter, which is a web-based application that allows you to run reports against the repository metadata. This is a
very comprehensive tool that is powered by the functionality of Informatica’s BI reporting tool, Data Analyzer. It is included on the
PowerCenter CD.
● Because Informatica does not support or recommend direct reporting access to the repository, even for Select Only queries, the
second way of repository metadata reporting is through the use of views written using Metadata Exchange (MX).

Metadata Reporter

The need for the Informatica Metadata Reporter arose from the number of clients requesting custom and complete metadata
reports from their repositories. Metadata Reporter is based on the Data Analyzer and PowerCenter products. It provides Data
Analyzer dashboards and metadata reports to help you administer your day-to-day PowerCenter operations, reports to access
every Informatica object stored in the repository, and even reports to access objects in the Data Analyzer repository. The
architecture of the Metadata Reporter is web-based, with an Internet browser front end. Because Metadata Reporter runs on
Data Analyzer, you must have Data Analyzer installed and running before you proceed with Metadata Reporter setup.

Metadata Reporter setup includes the following .XML files to be imported from the PowerCenter CD in the same sequence as they
are listed below:

● Schemas.xml
● Schedule.xml
● GlobalVariables_Oracle.xml (This file is database-specific; Informatica provides GlobalVariables files for DB2, SQL Server,
Sybase and Teradata. Select the appropriate file based on your PowerCenter repository environment.)
● Reports.xml
● Dashboards.xml

Note: If you have set up a new instance of Data Analyzer exclusively for Metadata Reporter, you should have no problem
importing these files. However, if you are using an existing instance of Data Analyzer that is currently used for some other
reporting purpose, be careful while importing these files. Some of the files (e.g., global variables, schedules, etc.) may already exist
with the same name; you can rename the conflicting objects.

The following are the folders that are created in Data Analyzer when you import the above-listed files:

● Data Analyzer Metadata Reporting - contains reports for the Data Analyzer repository itself (e.g., Todays Logins, Reports
Accessed by Users Today, etc.)
● PowerCenter Metadata Reports - contains reports for the PowerCenter repository. To better organize reports based on their
functionality, these reports are further grouped into the following subfolders:
● Configuration Management - contains a set of reports that provide detailed information on configuration management, including
deployment and label details. This folder contains the following subfolders:

❍ Deployment
❍ Label



❍ Object Version

● Operations - contains a set of reports that enable users to analyze operational statistics including server load, connection usage,
run times, load times, number of runtime errors, etc. for workflows, worklets and sessions. This folder contains the following
subfolders:

❍ Session Execution
❍ Workflow Execution

● PowerCenter Objects - contains a set of reports that enable users to identify all types of PowerCenter objects, their properties,
and interdependencies on other objects within the repository. This folder contains the following subfolders:

❍ Mappings
❍ Mapplets
❍ Metadata Extension
❍ Server Grids
❍ Sessions
❍ Sources
❍ Target
❍ Transformations
❍ Workflows
❍ Worklets

● Security - contains a set of reports that provide detailed information on the users, groups and their association within the
repository.

Informatica recommends retaining this folder organization, adding new folders if necessary.

The Metadata Reporter provides 44 standard reports which can be customized with the use of parameters and wildcards.
Metadata Reporter is accessible from any computer with a browser that has access to the web server where the Metadata Reporter
is installed, even without the other Informatica client tools being installed on that computer. The Metadata Reporter connects to
the PowerCenter repository using JDBC drivers. Be sure the proper JDBC drivers are installed for your database platform.

(Note: You can also use the JDBC-ODBC bridge to connect to the repository; e.g., jdbc:odbc:<data_source_name>.)

● Metadata Reporter is comprehensive. You can run reports on any repository. The reports provide information about all types of
metadata objects.
● Metadata Reporter is easily accessible. Because the Metadata Reporter is web-based, you can generate reports from any
machine that has access to the web server. The reports in the Metadata Reporter are customizable. The Metadata Reporter
allows you to set parameters for the metadata objects to include in the report.
● The Metadata Reporter allows you to go easily from one report to another. The name of any metadata object that displays on a
report links to an associated report. As you view a report, you can generate reports for objects on which you need more
information.

The following lists show the reports provided by the Metadata Reporter, along with their locations and brief descriptions:

Reports For PowerCenter Repository

1. Deployment Group (Public Folders>PowerCenter Metadata Reports>Configuration Management>Deployment>Deployment Group): Displays deployment groups by repository.
2. Deployment Group History (Public Folders>PowerCenter Metadata Reports>Configuration Management>Deployment>Deployment Group History): Displays, by group, deployment groups and the dates they were deployed. It also displays the source and target repository names of the deployment group for all deployment dates. This is a primary report in an analytic workflow.
3. Labels (Public Folders>PowerCenter Metadata Reports>Configuration Management>Labels>Labels): Displays labels created in the repository for any versioned object by repository.
4. All Object Version History (Public Folders>PowerCenter Metadata Reports>Configuration Management>Object Version>All Object Version History): Displays all versions of an object by the date the object is saved in the repository. This is a standalone report.
5. Server Load by Day of Week (Public Folders>PowerCenter Metadata Reports>Operations>Session Execution>Server Load by Day of Week): Displays the total number of sessions that ran, and the total session run duration, for any day of week in any given month of the year by server by repository. For example, all Mondays in September are represented in one row if that month had 4 Mondays.
6. Session Run Details (Public Folders>PowerCenter Metadata Reports>Operations>Session Execution>Session Run Details): Displays session run details for any start date by repository by folder. This is a primary report in an analytic workflow.
7. Target Table Load Analysis (Last Month) (Public Folders>PowerCenter Metadata Reports>Operations>Session Execution>Target Table Load Analysis (Last Month)): Displays the load statistics for each table for last month by repository by folder. This is a primary report in an analytic workflow.
8. Workflow Run Details (Public Folders>PowerCenter Metadata Reports>Operations>Workflow Execution>Workflow Run Details): Displays the run statistics of all workflows by repository by folder. This is a primary report in an analytic workflow.
9. Worklet Run Details (Public Folders>PowerCenter Metadata Reports>Operations>Workflow Execution>Worklet Run Details): Displays the run statistics of all worklets by repository by folder. This is a primary report in an analytic workflow.
10. Mapping List (Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Mappings>Mapping List): Displays mappings by repository and folder. It also displays properties of the mapping such as the number of sources used in a mapping, the number of transformations, and the number of targets. This is a primary report in an analytic workflow.
11. Mapping Lookup Transformations (Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Mappings>Mapping Lookup Transformations): Displays Lookup transformations used in a mapping by repository and folder. This report is a standalone report and also the first node in the analytic workflow associated with the Mapping List primary report.
12. Mapping Shortcuts (Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Mappings>Mapping Shortcuts): Displays mappings defined as a shortcut by repository and folder.
13. Source to Target Dependency (Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Mappings>Source to Target Dependency): Displays the data flow from the source to the target by repository and folder. The report lists all the source and target ports, the mappings in which the ports are connected, and the transformation expression that shows how data for the target port is derived.
14. Mapplet List (Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Mapplets>Mapplet List): Displays mapplets available by repository and folder. It displays properties of the mapplet such as the number of sources used in a mapplet, the number of transformations, or the number of targets. This is a primary report in an analytic workflow.
15. Mapplet Lookup Transformations (Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Mapplets>Mapplet Lookup Transformations): Displays all Lookup transformations used in a mapplet by folder and repository. This report is a standalone report and also the first node in the analytic workflow associated with the Mapplet List primary report.
16. Mapplet Shortcuts (Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Mapplets>Mapplet Shortcuts): Displays mapplets defined as a shortcut by repository and folder.
17. Unused Mapplets in Mappings (Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Mapplets>Unused Mapplets in Mappings): Displays mapplets defined in a folder but not used in any mapping in that folder.
18. Metadata Extensions Usage (Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Metadata Extensions>Metadata Extensions Usage): Displays, by repository by folder, reusable metadata extensions used by any object. Also displays the counts of all objects using that metadata extension.
19. Server Grid List (Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Server Grid>Server Grid List): Displays all server grids and servers associated with each grid. Information includes host name, port number, and internet protocol address of the servers.
20. Session List (Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Sessions>Session List): Displays all sessions and their properties by repository by folder. This is a primary report in an analytic workflow.
21. Source List (Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Sources>Source List): Displays relational and non-relational sources by repository and folder. It also shows the source properties. This report is a primary report in an analytic workflow.
22. Source Shortcuts (Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Sources>Source Shortcuts): Displays sources that are defined as shortcuts by repository and folder.
23. Target List (Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Targets>Target List): Displays relational and non-relational targets available by repository and folder. It also displays the target properties. This is a primary report in an analytic workflow.
24. Target Shortcuts (Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Targets>Target Shortcuts): Displays targets that are defined as shortcuts by repository and folder.
25. Transformation List (Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Transformations>Transformation List): Displays transformations defined by repository and folder. This is a primary report in an analytic workflow.
26. Transformation Shortcuts (Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Transformations>Transformation Shortcuts): Displays transformations that are defined as shortcuts by repository and folder.
27. Scheduler (Reusable) List (Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Workflows>Scheduler (Reusable) List): Displays all the reusable schedulers defined in the repository and their description and properties by repository by folder. This is a primary report in an analytic workflow.
28. Workflow List (Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Workflows>Workflow List): Displays workflows and workflow properties by repository by folder. This report is a primary report in an analytic workflow.
29. Worklet List (Public Folders>PowerCenter Metadata Reports>PowerCenter Objects>Worklets>Worklet List): Displays worklets and worklet properties by repository by folder. This is a primary report in an analytic workflow.
30. Users By Group (Public Folders>PowerCenter Metadata Reports>Security>Users By Group): Displays users by repository and group.

Reports For Data Analyzer Repository

1. Bottom 10 Least Accessed Reports this Year (Public Folders>Data Analyzer Metadata Reporting>Bottom 10 Least Accessed Reports this Year): Displays the ten least accessed reports for the current year. It has an analytic workflow that provides access details such as user name and access time.
2. Report Activity Details (Public Folders>Data Analyzer Metadata Reporting>Report Activity Details): Part of the analytic workflows "Top 10 Most Accessed Reports This Year", "Bottom 10 Least Accessed Reports this Year" and "Usage by Login (Month To Date)".
3. Report Activity Details for Current Month (Public Folders>Data Analyzer Metadata Reporting>Report Activity Details for Current Month): Provides information about reports accessed in the current month up to the current date.
4. Report Refresh Schedule (Public Folders>Data Analyzer Metadata Reporting>Report Refresh Schedule): Provides information about the next scheduled update for scheduled reports. It can be used to decide schedule timing for various reports for optimum system performance.
5. Reports Accessed by Users Today (Public Folders>Data Analyzer Metadata Reporting>Reports Accessed by Users Today): Part of the analytic workflow for "Todays Logins". It provides detailed information on the reports accessed by users today. This can be used independently to get comprehensive information about today's report activity details.
6. Todays Logins (Public Folders>Data Analyzer Metadata Reporting>Todays Logins): Provides the login count and average login duration for users who logged in today.
7. Todays Report Usage by Hour (Public Folders>Data Analyzer Metadata Reporting>Todays Report Usage by Hour): Provides information about the number of reports accessed today for each hour. The analytic workflow attached to it provides more details on the reports accessed and the users who accessed them during the selected hour.
8. Top 10 Most Accessed Reports this Year (Public Folders>Data Analyzer Metadata Reporting>Top 10 Most Accessed Reports this Year): Shows the ten most accessed reports for the current year. It has an analytic workflow that provides access details such as user name and access time.
9. Top 5 Logins (Month To Date) (Public Folders>Data Analyzer Metadata Reporting>Top 5 Logins (Month To Date)): Provides information about users and their corresponding login count for the current month to date. The analytic workflow attached to it provides more details about the reports accessed by a selected user.
10. Top 5 Longest Running On-Demand Reports (Month To Date) (Public Folders>Data Analyzer Metadata Reporting>Top 5 Longest Running On-Demand Reports (Month To Date)): Shows the five longest running on-demand reports for the current month to date. It displays the average total response time, average DB response time, and the average Data Analyzer response time (all in seconds) for each report shown.
11. Top 5 Longest Running Scheduled Reports (Month To Date) (Public Folders>Data Analyzer Metadata Reporting>Top 5 Longest Running Scheduled Reports (Month To Date)): Shows the five longest running scheduled reports for the current month to date. It displays the average response time (in seconds) for each report shown.
12. Total Schedule Errors for Today (Public Folders>Data Analyzer Metadata Reporting>Total Schedule Errors for Today): Provides the number of errors encountered during execution of reports attached to schedules. The analytic workflow "Scheduled Report Error Details for Today" is attached to it.
13. User Logins (Month To Date) (Public Folders>Data Analyzer Metadata Reporting>User Logins (Month To Date)): Provides information about users and their corresponding login count for the current month to date. The analytic workflow attached to it provides more details about the reports accessed by a selected user.
14. Users Who Have Never Logged On (Public Folders>Data Analyzer Metadata Reporting>Users Who Have Never Logged On): Provides information about users who exist in the repository but have never logged in. This information can be used to make administrative decisions about disabling accounts.

Customizing a Report or Creating New Reports

Once you select a report, you can customize it by setting parameter values and/or creating new attributes or metrics.
Data Analyzer includes simple steps to create new reports or modify existing ones. Adding or modifying filters
offers tremendous reporting flexibility. Additionally, you can set up report templates and export them as Excel files, which can
be refreshed as necessary. For more information on the attributes, metrics, and schemas included with the Metadata Reporter,
consult the product documentation.

Wildcards

The Metadata Reporter supports two wildcard characters:

● Percent symbol (%) - represents any number of characters and spaces.


● Underscore (_) - represents one character or space.

You can use wildcards in any number and combination in the same parameter. Leaving a parameter blank returns all values and is
the same as using %. The following examples show how you can use the wildcards to set parameters.

Suppose you have the following values available to select:

items, items_in_promotions, order_items, promotions

The following list shows the return values for some wildcard combinations you can use:

Wildcard Combination Return Values

% items, items_in_promotions, order_items, promotions

<blank> items, items_in_promotions, order_items, promotions

%items items, order_items

item_ items

item% items, items_in_promotions

___m% items, items_in_promotions, promotions

%pr_mo% items_in_promotions, promotions
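
The wildcard semantics can be reproduced with a short Python sketch that translates a Metadata Reporter
pattern into a regular expression and applies it to the example values above (the function name is
illustrative, not part of any Informatica API):

import re

def wildcard_to_regex(pattern: str) -> str:
    """Translate Metadata Reporter wildcards: % = any run of characters, _ = exactly one."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return "^" + "".join(parts) + "$"

values = ["items", "items_in_promotions", "order_items", "promotions"]
for pattern in ["%", "%items", "item_", "item%", "___m%", "%pr_mo%"]:
    matches = [v for v in values if re.match(wildcard_to_regex(pattern), v)]
    print(f"{pattern:<10} -> {', '.join(matches)}")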

A printout of the mapping object flow is also useful for clarifying how objects are connected. To produce such a printout, arrange
the mapping in Designer so the full mapping appears on the screen, and then use Alt+PrtSc to copy the active window to the
clipboard. Use Ctrl+V to paste the copy into a Word document.



For a detailed description of how to run these reports, consult the Metadata Reporter Guide included in the
PowerCenter documentation.

Security Awareness for Metadata Reporter

Metadata Reporter uses Data Analyzer for reporting out of the PowerCenter/Data Analyzer repository. Data Analyzer has a
robust security mechanism that is inherited by Metadata Reporter. You can establish groups, roles, and/or privileges for users
based on their profiles. Since the information in the PowerCenter repository does not change often after it goes to production,
the Administrator can create some reports and export them to files that can be distributed to the user community. If the number
of users for Metadata Reporter is limited, you can implement security using report filters or the data restriction feature. For example, if
a user in the PowerCenter repository has access to certain folders, you can create a filter for those folders and apply it to the user's
profile. For more information on the ways in which you can implement security in Data Analyzer, refer to the Data
Analyzer documentation.

Metadata Exchange: the Second Generation (MX2)

The MX architecture was intended primarily for BI vendors who wanted to create a PowerCenter-based data warehouse and
display the warehouse metadata through their own products. The result was a set of relational views that encapsulated the
underlying repository tables while exposing the metadata in several categories that were more suitable for external parties.
Today, Informatica and several key vendors, including Brio, Business Objects, Cognos, and MicroStrategy are effectively using the
MX views to report and query the Informatica metadata.

Informatica currently supports the second generation of Metadata Exchange called MX2. Although the overall motivation for
creating the second generation of MX remains consistent with the original intent, the requirements and objectives of MX2
supersede those of MX.

The primary requirements and features of MX2 are:

Incorporation of object technology in a COM-based API. Although SQL provides a powerful mechanism for accessing
and manipulating records of data in a relational paradigm, it is not suitable for procedural programming tasks that can be achieved
by C, C++, Java, or Visual Basic. Furthermore, the increasing popularity and use of object-oriented software tools require
interfaces that can fully take advantage of the object technology. MX2 is implemented in C++ and offers an advanced object-based
API for accessing and manipulating the PowerCenter Repository from various programming languages.

Self-contained Software Development Kit (SDK). One of the key advantages of MX views is that they are part of the
repository database and thus can be used independent of any of the Informatica software products. The same requirement also
holds for MX2, thus leading to the development of a self-contained API Software Development Kit that can be used independently
of the client or server products.

Extensive metadata content, especially multidimensional models for OLAP. A number of BI tools and upstream data
warehouse modeling tools require complex multidimensional metadata, such as hierarchies, levels, and various relationships. This
type of metadata was specifically designed and implemented in the repository to accommodate the needs of the Informatica
partners by means of the new MX2 interfaces.

Ability to write (push) metadata into the repository. Because of the limitations associated with relational views, MX could not
be used for writing or updating metadata in the Informatica repository. As a result, such tasks could only be accomplished by
directly manipulating the repository's relational tables. The MX2 interfaces provide metadata write capabilities along with
the appropriate verification and validation features to ensure the integrity of the metadata in the repository.

Complete encapsulation of the underlying repository organization by means of an API. One of the main challenges with
MX views and the interfaces that access the repository tables is that they are directly exposed to any schema changes of
the underlying repository database. As a result, maintaining the MX views and direct interfaces requires a major effort with every
major upgrade of the repository. MX2 alleviates this problem by offering a set of object-based APIs that are abstracted away from
the details of the underlying relational tables, thus providing an easier mechanism for managing schema evolution.

Integration with third-party tools. MX2 offers the object-based interfaces needed to develop more sophisticated
procedural programs that can tightly integrate the repository with the third-party data warehouse modeling and query/reporting tools.

Synchronization of metadata based on changes from up-stream and down-stream tools. Given that metadata is likely to
reside in various databases and files in a distributed software environment, synchronizing changes and updates ensures the
validity and integrity of the metadata. The object-based technology used in MX2 provides the infrastructure needed to
implement automatic metadata synchronization and change propagation across different tools that access the PowerCenter Repository.



Interoperability with other COM-based programs and repository interfaces. MX2 interfaces comply with Microsoft's
Component Object Model (COM) interoperability protocol. Therefore, any existing or future program that is COM-compliant
can seamlessly interface with the PowerCenter Repository by means of MX2.

Last updated: 27-May-08 12:03



Repository Tables & Metadata Management

Challenge

Maintaining the repository for regular backup, quick response, and querying metadata for metadata reports.

Description

Regular actions such as taking backups, testing backup and restore procedures, and deleting unwanted information
from the repository keep the repository healthy and performing well.

Managing Repository

The PowerCenter Administrator plays a vital role in managing and maintaining the repository and metadata. The
role involves tasks such as securing the repository, managing the users and roles, maintaining backups, and
managing the repository through such activities as removing unwanted metadata, analyzing tables, and updating
statistics.

Repository backup

Repository backup can be performed using the client tool Repository Server Admin Console or the command line
program pmrep. Backups using pmrep can be automated and scheduled to run regularly.

A backup script that calls pmrep (see the sketch below) can be scheduled to run as a cron job for regular backups.
Alternatively, the script can be called from PowerCenter via a command task; the command task can be placed in a
workflow and scheduled to run daily.
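
For illustration, the sketch below drives pmrep from a small Python script that can be scheduled with cron or
called from a command task. The repository name, user, host, port, and paths are placeholders, and the exact
connect/backup options vary by PowerCenter version, so confirm them against the pmrep Command Reference
before use.

"""Nightly PowerCenter repository backup via pmrep (sketch only).
Placeholder connection details; verify pmrep option syntax for your version."""
import subprocess
from datetime import date

backup_file = f"/backup/infa/rep_prod_{date.today():%Y%m%d}.rep"   # placeholder path

def pmrep(*args):
    # check=True surfaces a non-zero exit code to the scheduler if pmrep fails.
    subprocess.run(["pmrep", *args], check=True)

pmrep("connect", "-r", "REP_PROD", "-n", "backup_user", "-x", "secret",
      "-h", "repo_host", "-o", "5001")          # placeholder repository and credentials
pmrep("backup", "-o", backup_file)

# Compress the backup file (as recommended below) before moving it off the server.
subprocess.run(["gzip", backup_file], check=True)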



The following paragraphs describe some useful practices for maintaining backups:

Frequency: Backup frequency depends on the activity in the repository. For production repositories, a backup is
recommended once a month or prior to a major release. For development repositories, a backup is recommended
once a week or once a day, depending upon the team size.

Backup file sizes: Because backup files can be very large, Informatica recommends compressing them using a
utility such as winzip or gzip.

Storage: For security reasons, Informatica recommends maintaining backups on a different physical device than
the repository itself.

Move backups offline: Review the backups on a regular basis to determine how long they need to remain
online. Any that are not required online should be moved offline, to tape, as soon as possible.

Restore repository

Although the repository restore function is used primarily as part of disaster recovery, it can also be useful for
testing the validity of the backup files and for testing the recovery process on a regular basis. Informatica
recommends testing the backup files and recovery process at least once each quarter. The repository can be
restored using the client tool, Repository Server Administrator Console, or the command line program
pmrepagent.

Restore folders

There is no easy way to restore only one particular folder from a backup. First, restore the backup into a new
repository; then use the client tool, Repository Manager, to copy the entire folder from the restored repository into
the target repository.

Remove older versions

Use the purge command to remove older versions of objects from repository. To purge a specific version of an
object, view the history of the object, select the version, and purge it.

Finding deleted objects and removing them from repository

If a PowerCenter repository is enabled for versioning through the use of the Team Based Development option,
objects that have been deleted from the repository are not visible in the client tools. To list or view deleted
objects, use the find checkouts command in the client tools, a query generated in Repository Manager, or a
specific query that you define.

After an object has been deleted from the repository, you cannot create another object with the same name
unless the deleted object has been completely removed from the repository. Use the purge command to
completely remove deleted objects from the repository. Keep in mind, however, that you must remove all versions
of a deleted object to completely remove it from repository.

Truncating Logs

You can truncate the log information (for sessions and workflows) stored in the repository either by using
repository manager or the pmrep command line program. Logs can be truncated for the entire repository or for a
particular folder.

Options allow truncating all log entries or selected entries based on date and time.



Repository Performance

Analyzing (or updating the statistics of) repository tables can help to improve repository performance.
Because this process should be carried out for all tables in the repository, a script offers the most efficient means
(see the sketch below). You can then schedule the script to run using either an external scheduler or a PowerCenter
workflow with a command task to call the script.
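
As an illustration, the following Python sketch refreshes Oracle optimizer statistics for the repository tables.
The connection details are placeholders, it assumes the standard OPB_ table prefix and the cx_Oracle driver,
and an equivalent script could just as easily issue ANALYZE statements from SQL*Plus.

"""Refresh optimizer statistics on PowerCenter repository tables (Oracle sketch).
Assumes the repository schema owns tables with the standard OPB_ prefix."""
import cx_Oracle

conn = cx_Oracle.connect("rep_owner", "secret", "orcl")    # placeholder credentials/DSN
cur = conn.cursor()

# Repository tables in the connected schema.
cur.execute("SELECT table_name FROM user_tables WHERE table_name LIKE 'OPB_%'")
tables = [row[0] for row in cur]

for table in tables:
    # DBMS_STATS is the supported replacement for ANALYZE ... COMPUTE STATISTICS.
    cur.callproc("dbms_stats.gather_table_stats", [conn.username.upper(), table])

print(f"Statistics gathered for {len(tables)} repository tables")
conn.close()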

Repository Agent and Repository Server performance

Factors such as team size, network, number of objects involved in a specific operation, number of old locks (on
repository objects), etc. may reduce the efficiency of the repository server (or agent). In such cases, the various
causes should be analyzed and the repository server (or agent) configuration file modified to improve
performance.

Managing Metadata

The following paragraphs list the queries that are most often used to report on PowerCenter metadata. The
queries are written for PowerCenter repositories on Oracle and are based on PowerCenter 6 and PowerCenter 7.
Minor changes in the queries may be required for PowerCenter repositories residing on other databases.

Failed Sessions

The following query lists the failed sessions in the last day. To make it work for the last ‘n’ days, replace
SYSDATE-1 with SYSDATE - n

SELECT Subject_Area AS Folder,

Session_Name,

Last_Error AS Error_Message,

DECODE (Run_Status_Code,3,'Failed',4,'Stopped',5,'Aborted') AS Status,

Actual_Start AS Start_Time,



Session_TimeStamp

FROM rep_sess_log

WHERE run_status_code != 1

AND TRUNC(Actual_Start) BETWEEN TRUNC(SYSDATE -1) AND TRUNC(SYSDATE)

Long running Sessions

The following query lists long running sessions in the last day. To make it work for the last ‘n’ days, replace
SYSDATE-1 with SYSDATE - n

SELECT Subject_Area AS Folder,

Session_Name,

Successful_Source_Rows AS Source_Rows,

Successful_Rows AS Target_Rows,

Actual_Start AS Start_Time,

Session_TimeStamp

FROM rep_sess_log

WHERE run_status_code = 1

AND TRUNC(Actual_Start) BETWEEN TRUNC(SYSDATE -1) AND TRUNC(SYSDATE)

AND (Session_TimeStamp - Actual_Start) > (10/(24*60))
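-- (10/(24*60)) of a day = 10 minutes, i.e., sessions that ran longer than 10 minutes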

ORDER BY Session_timeStamp

Invalid Tasks

The following query lists folder names and task name, version number, and last saved for all invalid tasks.

SELECT SUBJECT_AREA AS FOLDER_NAME,

DECODE(IS_REUSABLE,1,'Reusable',' ') || ' ' ||TASK_TYPE_NAME AS TASK_TYPE,

TASK_NAME AS OBJECT_NAME,

VERSION_NUMBER, -- comment out for V6

LAST_SAVED



FROM REP_ALL_TASKS

WHERE IS_VALID=0

AND IS_ENABLED=1

--AND CHECKOUT_USER_ID = 0 -- Comment out for V6

--AND is_visible=1 -- Comment out for V6

ORDER BY SUBJECT_AREA,TASK_NAME

Load Counts

The following query lists the load counts (number of rows loaded) for the successful sessions.

SELECT

subject_area,

workflow_name,

session_name,

DECODE (Run_Status_Code,1,'Succeeded',3,'Failed',4,'Stopped',5,'Aborted') AS Session_Status,

successful_rows,

failed_rows,

actual_start

FROM

REP_SESS_LOG

WHERE

TRUNC(Actual_Start) BETWEEN TRUNC(SYSDATE -1) AND TRUNC(SYSDATE)

ORDER BY

subject_area,

workflow_name,

session_name,

Session_status



Last updated: 27-May-08 12:04



Using Metadata Extensions

Challenge

To provide for efficient documentation and achieve extended metadata reporting


through the use of metadata extensions in repository objects.

Description

Metadata Extensions, as the name implies, help you to extend the metadata stored in
the repository by associating information with individual objects in the repository.

Informatica Client applications can contain two types of metadata extensions: vendor-
defined and user-defined.

● Vendor-defined. Third-party application vendors create vendor-defined


metadata extensions. You can view and change the values of vendor-defined
metadata extensions, but you cannot create, delete, or redefine them.
● User-defined. You create user-defined metadata extensions using
PowerCenter clients. You can create, edit, delete, and view user-defined
metadata extensions. You can also change the values of user-defined
extensions.

You can create reusable or non-reusable metadata extensions. You associate reusable
metadata extensions with all repository objects of a certain type. So, when you create a
reusable extension for a mapping, it is available for all mappings. Vendor-defined
metadata extensions are always reusable.

Non-reusable extensions are associated with a single repository object. Therefore, if


you edit a target and create a non-reusable extension for it, that extension is available
only for the target you edit. It is not available for other targets. You can promote a non-
reusable metadata extension to reusable, but you cannot change a reusable metadata
extension to non-reusable.

Metadata extensions can be created for the following repository objects:

● Source definitions
● Target definitions



● Transformations (Expressions, Filters, etc.)
● Mappings
● Mapplets
● Sessions
● Tasks
● Workflows
● Worklets

Metadata Extensions offer a very easy and efficient method of documenting important
information associated with repository objects. For example, when you create a
mapping, you can store the mapping owner's name and contact information with the
mapping, or when you create a source definition, you can enter the name of the
person who created or imported the source.

The power of metadata extensions is most evident in the reusable type. When you
create a reusable metadata extension for any type of repository object, that metadata
extension becomes part of the properties of that type of object. For example, suppose
you create a reusable metadata extension for source definitions called SourceCreator.
When you create or edit any source definition in the Designer, the SourceCreator
extension appears on the Metadata Extensions tab. Anyone who creates or edits a
source can enter the name of the person that created the source into this field.

You can create, edit, and delete non-reusable metadata extensions for sources,
targets, transformations, mappings, and mapplets in the Designer. You can create,
edit, and delete non-reusable metadata extensions for sessions, workflows, and
worklets in the Workflow Manager. You can also promote non-reusable metadata
extensions to reusable extensions using the Designer or the Workflow Manager. You
can also create reusable metadata extensions in the Workflow Manager or Designer.

You can create, edit, and delete reusable metadata extensions for all types of
repository objects using the Repository Manager. If you want to create, edit, or delete
metadata extensions for multiple objects at one time, use the Repository Manager.
When you edit a reusable metadata extension, you can modify the properties Default
Value, Permissions and Description.

Note: You cannot create non-reusable metadata extensions in the Repository


Manager. All metadata extensions created in the Repository Manager are reusable.
Reusable metadata extensions are repository wide.

You can also migrate Metadata Extensions from one environment to another. When
you do a copy folder operation, the Copy Folder Wizard copies the metadata extension
values associated with those objects to the target repository. A non-reusable metadata
extension will be copied as a non-reusable metadata extension in the target repository.
A reusable metadata extension is copied as reusable in the target repository, and the
object retains the individual values. You can edit and delete those extensions, as well
as modify the values.

Metadata Extensions provide for extended metadata reporting capabilities. Using


Informatica MX2 API, you can create useful reports on metadata extensions. For
example, you can create and view a report on all the mappings owned by a specific
team member. You can use various programming environments such as Visual Basic,
Visual C++, C++ and Java SDK to write API modules. The Informatica Metadata
Exchange SDK 6.0 installation CD includes sample Visual Basic and Visual C++
applications.

Additionally, Metadata Extensions can also be populated via data modeling tools such
as ERWin, Oracle Designer, and PowerDesigner via Informatica Metadata Exchange
for Data Models. With the Informatica Metadata Exchange for Data Models, the
Informatica Repository interface can retrieve and update the extended properties of
source and target definitions in PowerCenter repositories. Extended Properties are the
descriptive, user defined, and other properties derived from your Data Modeling tool
and you can map any of these properties to the metadata extensions that are already
defined in the source or target object in the Informatica repository.

Last updated: 27-May-08 12:04



Daily Operations

Challenge

Once the data warehouse has been moved to production, the most important task is keeping the system running and
available for the end users.

Description

In most organizations, the day-to-day operation of the data warehouse is the responsibility of a Production Support
team. This team is typically involved with the support of other systems and has expertise in database systems and
various operating systems. The Data Warehouse Development team becomes, in effect, a customer to the Production
Support team. To that end, the Production Support team needs two documents, a Service Level Agreement and an
Operations Manual, to help in the support of the production data warehouse.

Monitoring the System

Monitoring the system is useful for identifying any problems or outages before the users notice. The Production
Support team must know what failed, where it failed, when it failed, and who needs to be working on the
solution. Identifying outages and/or bottlenecks can help to identify trends associated with various technologies. The
goal of monitoring is to reduce downtime for the business user. Comparing the monitoring data against threshold
violations, service level agreements, and other organizational requirements helps to determine the effectiveness of the
data warehouse and any need for changes.

Service Level Agreement

The Service Level Agreement (SLA) outlines how the overall data warehouse system is to be maintained. This is a
high-level document that discusses system maintenance and the components of the system, and identifies the groups
responsible for monitoring the various components. The SLA should be measurable against key performance
indicators. At a minimum, it should contain the following information:

● Times when the system should be available to users.


● Scheduled maintenance window.
● Who is expected to monitor the operating system.
● Who is expected to monitor the database.
● Who is expected to monitor the PowerCenter sessions.
● How quickly the support team is expected to respond to notifications of system failures.
● Escalation procedures that include data warehouse team contacts in the event that the support team cannot
resolve the system failure.

Operations Manual

The Operations Manual is crucial to the Production Support team because it provides the information needed to
perform the data warehouse system maintenance. This manual should be self-contained, providing all of the
information necessary for a production support operator to maintain the system and resolve most problems that can
arise. This manual should contain information on how to maintain all data warehouse system components. At a
minimum, the Operations Manual should contain:

● Information on how to stop and re-start the various components of the system.
● IDs and passwords (or how to obtain passwords) for the system components.



● Information on how to re-start failed PowerCenter sessions and recovery procedures.
● A listing of all jobs that are run, their frequency (daily, weekly, monthly, etc.), and the average run times.
● Error handling strategies.
● Who to call in the event of a component failure that cannot be resolved by the Production Support team.

PowerExchange Operations Manual

The need to maintain archive logs and listener logs, use started tasks, perform recovery, and carry out other operational
functions on MVS presents challenges that need to be addressed in the Operations Manual. If listener logs are not cleaned up on a
regular basis, operations is likely to face space issues. Setting up archive logs on MVS requires datasets to be
allocated and sized. Recovery after failure requires operations intervention to restart workflows and set the restart
tokens. For Change Data Capture, operations is required to start the started tasks in a scheduler and/or after an
IPL. There are certain commands that need to be executed by operations.

The PowerExchange Reference Guide (8.1.1) and the related Adapter Guide provides detailed information on the
operation of PowerExchange Change Data Capture.

Archive/Listener Log Maintenance

The archive log should be controlled by using the Retention Period specified in the EDMUPARM ARCHIVE_OPTIONS
in parameter ARCHIVE_RTPD=. The default supplied in the Install (in RUNLIB member SETUPCC2) is 9999. This is
generally longer than most organizations need. To change it, just rerun the first step (and only the first step) in
SETUPCC2 after making the appropriate changes. Any new archive log datasets will be created with the new retention
period. This does not, however, fix the old archive datasets; to do that, use SMS to override the specification, removing
the need to change the EDMUPARM.

The default listener log is part of the job log of the running listener. If the listener job runs continuously, there is a
potential risk of the spool file reaching its maximum size and causing issues with the listener. If, for example, the listener
started task is scheduled to restart every weekend, the log is refreshed and a new spool file is created.

If necessary, change the started task listener jobs from //DTLLOG DD SYSOUT=* to //DTLLOG DD DSN=&HLQ..LOG;
this writes the log to the member LOG in the HLQ..RUNLIB.

Recovery After Failure

The last resort recovery procedure is to re-execute your initial extraction and load, and restart the CDC process from
the new initial load start point. Fortunately, there are other solutions; and if you need every change, re-initializing
may not even be an option.

Application ID

PowerExchange documentation talks about “consuming” applications – the processes that extract changes, whether
they are realtime or change (periodic batch extraction).

Each “consuming” application must identify itself to PowerExchange. Realistically, this means that each session must
have an application id parameter containing a unique “label”.

Restart Tokens

Power Exchange remembers each time that a consuming application successfully extracts changes. The end-point of
the extraction (Address in the database Log – RBA or SCN) is stored in a file on the server hosting the Listener that
reads the changed data. Each of these memorized end-points (i.e., Restart Tokens) is a potential restart point. It is
possible, using the Navigator interface directly, or by updating the restart file, to force the next extraction to restart from
any of these points. If you’re using the ODBC interface for PowerExchange, this is the best solution to implement.



If you are running periodic extractions of changes and everything finishes cleanly, the restart token history is a good
approach to recovering back to a previous extraction. You simply choose the recovery point from the list and re-use it.

There are more likely scenarios though. If you are running realtime extractions, potentially never-ending or until there’s
a failure, there are no end-points to memorize for restarts. If your batch extraction fails, you may already have
processed and committed many changes. You can’t afford to “miss” any changes and you don’t want to reapply the
same changes you’ve just processed, but the previous restart token does not correspond to the reality of what you’ve
processed.

If you are using the Power Exchange Client for PowerCenter (PWXPC), the best answer to the recovery problem lies
with PowerCenter, which has historically been able to deal with restarting this type of process – Guaranteed Message
Delivery. This functionality is applicable to both realtime and change CDC options.

The PowerExchange Client for PowerCenter stores the Restart Token of the last successful extraction run for each
Application Id in files on the PowerCenter Server. The directory and file name are required parameters when
configuring the PWXPC connection in the Workflow Manager. This functionality greatly simplifies recovery procedures
compared to using the ODBC interface to PowerExchange.

To enable recovery, select the Enable Recovery option in the Error Handling settings of the Configuration tab in the
session properties. During normal session execution, PowerCenter Server stores recovery information in cache files in
the directory specified for $PMCacheDir.

Normal CDC Execution

If the session ends "cleanly" (i.e., zero return code), PowerCenter writes tokens to the restart file, and the GMD cache
is purged.

If the session fails, you are left with unprocessed changes in the GMD cache and a Restart Token corresponding to
the point in time of the last of the unprocessed changes. This information is useful for recovery.

Recovery

If a CDC session fails, and it was executed with recovery enabled, you can restart it in recovery mode – either from the
PowerCenter Client interfaces or using the pmcmd command line instruction. Obviously, this assumes that you are
able to identify that the session failed previously.

1. Start from the point in time specified by the Restart Token in the GMD cache.
2. PowerCenter reads the change records from the GMD cache.
3. PowerCenter processes and commits the records to the target system(s).
4. Once the records in the GMD cache have been processed and committed, PowerCenter purges the records
from the GMD cache and writes a restart token to the restart file.
5. The PowerCenter session ends “cleanly”.

The CDC session is now ready for you to execute in normal mode again.

Recovery Using PWX ODBC Interface

You can, of course, successfully recover if you are using the ODBC connectivity to PowerCenter, but you have to build
in some things yourself – coping with processing all the changes from the last restart token, even if you’ve already
processed some of them.

When you re-execute a failed CDC session, you receive all the changed data since the last PowerExchange restart
token. Your session has to cope with processing some of the same changes you already processed at the start of the
failed execution – either using lookups/joins to the target to see if you've already applied the change you are
processing, or simply ignoring database error messages such as trying to delete a record you already deleted.

If you run DTLUAPPL to generate a restart token periodically during the execution of your CDC extraction and save
the results, you can use the generated restart token to force a recovery at a more recent point in time than the last
session-end restart token. This is especially useful if you are running realtime extractions using ODBC, otherwise you
may find yourself re-processing several days of changes you’ve already processed.

Finally, you can always re-initialize the target and the CDC processing:

● Take an image copy of the tablespace containing the table to be captured, with QUIESCE option.
● Monitor the EDMMSG output from the PowerExchange Logger job.
● Look for message DTLEDM172774I which identifies the PowerExchange Logger sequence number
corresponding to the QUIESCE event.
● The Logger output shows the detail in the following format:

DB2 QUIESCE of TABLESPACE TSNAME.TBNAME at DB2 RBA/LRSN 000849C56185


EDP Logger RBA . . . . . . . . . : D5D3D3D34040000000084E0000000000
Sequence number . . . . . . . . . : 000000084E0000000000
Edition number . . . . . . . . . : B93C4F9C2A79B000
Source EDMNAME(s) . . . . . . . . : DB2DSN1CAPTNAME1

● Take note of the log sequence number


● Repeat for all tables that form part of the same PowerExchange Application.
● Run the DTLUAPPL utility specifying the application name and the registration name for each table in the
application.
● Alter the SYSIN as follows:

MOD APPL REGDEMO DSN1 (where REGDEMO is Registration name on Navigator)


add RSTTKN CAPDEMO (where CAPDEMO is Capture name from Navigator)
SEQUENCE 000000084E0000000000000000084E0000000000
RESTART D5D3D3D34040000000084E0000000000
END APPL REGDEMO (where REGDEMO is Registration name from Navigator)

● Note how the sequence number is a repeated string from the sequence number found in the Logger
messages after the Copy/Quiesce.

Note that the Restart parameter specified in the DTLUAPPL job is the EDP Logger RBA generated in the same
message sequence. This sets the extraction start point on the PowerExchange Logger to the point at which the
QUIESCE was done above.

The image copy obtained above can be used for the initial materialization of the target tables.

PowerExchange Tasks: MVS Start and Stop Command Summary

Listener
Start command: /S DTLLST
Stop commands: /F DTLLST,CLOSE (preferred method); /F DTLLST,CLOSE,FORCE if CLOSE does not work; /P DTLLST if FORCE does not work; /C DTLLST if STOP does not work
Description: The PowerExchange Listener is used for bulk data movement and for registering sources for Change Data Capture.

Agent
Start command: /S DTLA
Stop command: /DTLA SHUTDOWN (the /DTLA DRAIN and SHUTDOWN COMPLETELY commands can be used only at the request of Informatica Support)
Description: The PowerExchange Agent, used to manage connections to the PowerExchange Logger and handle repository and other tasks. This must be started before the Logger.

Logger
Start command: /S DTLL (if you are installing, you need to run setup2 prior to starting the Logger)
Stop commands: /P DTLL; /F DTLL,STOP (/F DTLL,display is also available)
Description: The PowerExchange Logger, used to manage the linear datasets and hiperspace that hold change capture data.

ECCR (DB2)
Start command: /S DTLDB2EC
Stop commands: /F DTLDB2EC,STOP or /F DTLDB2EC,QUIESCE or /P DTLDB2EC. The STOP command just cancels the ECCR; QUIESCE waits for open UOWs to complete. /F DTLDB2EC,display publishes stats into the ECCR sysout.
Notes: There must be registrations present prior to bringing up most adaptor ECCRs.

Condense
Start command: /S DTLC
Stop command: /F DTLC,SHUTDOWN
Description: The PowerExchange Condenser, used to run condense jobs against the PowerExchange Logger. This is used with PowerExchange CHANGE to organize the data by table, allow for interval-based extraction, and optionally fully condense multiple changes to a single row.

Apply
Start command: Submit JCL or /S DTLAPP
Stop commands: (1) To identify all tasks running through a certain listener, issue F <Listener job>,D A. (2) Then, to stop the Apply, issue F DTLLST,STOPTASK name, where name = DBN2 (the apply name). If the CAPX access and apply is running locally, not through a listener, issue <Listener job>,CLOSE.
Description: The PowerExchange Apply process, used in situations where straight replication is required and the data is not moved through PowerCenter before landing in the target.


Notes:

1. /p is an MVS STOP command; /f is an MVS MODIFY command.


2. Remove the / if the command is issued from the console rather than from SDSF.

If you attempt to shut down the Logger before the ECCR(s), a message indicates that there are still active ECCRs and
that the Logger will come down only AFTER the ECCRs go away. Instead, shut the components down in the order described
below; you can shut the Listener and the ECCR(s) down at the same time.

The Listener:

1. F <Listener_job>,CLOSE
2. If this isn’t coming down fast enough for you, issue F <Listener_job>,CLOSE FORCE
3. If it still isn’t coming down fast enough, issue C <Listener_job>

Note that these commands are listed in the order of most to least desirable method for bringing the listener
down.

The DB2 ECCR:

1. F <DB2 ECCR>,QUIESCE - this waits for all OPEN UOWs to finish, which can take a while if a long-running batch job is active.
2. F <DB2 ECCR>,STOP - this terminates immediately
3. P <DB2 ECCR> - this also terminates immediately

Once the ECCR(s) are down, you can then bring the Logger down.

The Logger: P <Logger job_name>

The Agent: CMDPREFIX SHUTDOWN

If you know that you are headed for an IPL, you can issue all of these commands at the same time. The Listener and
ECCR(s) should start coming down; if you are looking for speed, issue F <Listener_job>,CLOSE FORCE to shut down the
Listener, then issue F <DB2 ECCR>,STOP to terminate the DB2 ECCR, and then shut down the Logger and the Agent.

Note: Bringing the Agent down before the ECCR(s) are down can result in a loss of captured data. If a new file, DB2
table, or IMS database is being updated during this shutdown process and the Agent is not available, the call to check whether the
source is registered returns a “Not being captured” answer. The update therefore occurs without being captured,
leaving your target in a broken state (which you won't know about until it is too late).

Sizing the Logger

When you install PWX-CHANGE, up to two active log data sets are allocated with minimum size requirements. The
information in this section can help to determine if you need to increase the size of the data sets, and if you should
allocate additional log data sets. When you define your active log data sets, consider your system’s capacity and your
changed data requirements, including archiving and performance issues.

After the PWX Logger is active, you can change the log data set configuration as necessary. In general,
remember that you must balance the following variables:

● Data set size



● Number of data sets
● Amount of archiving

The choices you make depend on the following factors:

● Resource availability requirements


● Performance requirements
● Whether you are running near-realtime or batch replication
● Data recovery requirements

An inverse relationship exists between the size of the log data sets and the frequency of archiving required. Larger
data sets need to be archived less often than smaller data sets.

Note: Although smaller data sets require more frequent archiving, the archiving process requires less time.

Use the following formulas to estimate the total space you need for each active log data set. For an example of the
calculated data set size, refer to the PowerExchange Reference Guide.

● active log data set size in bytes = (average size of captured change record * number of changes captured per hour * desired number of hours between archives) * (1 + overhead rate)
● active log data set size in tracks = active log data set size in bytes / number of usable bytes per track
● active log data set size in cylinders = active log data set size in tracks / number of tracks per cylinder

When determining the average size of your captured change records, note the following information:

● PWX Change Capture captures the full object that is changed. For example, if one field in an IMS
segment has changed, the product captures the entire segment.
● The PWX header adds overhead to the size of the change record. Per record, the overhead is
approximately 300 bytes plus the key length.
● The type of change transaction affects whether PWX Change Capture includes a before-image, after-
image, or both:

❍ DELETE includes a before-image.


❍ INSERT includes an after-image.
❍ UPDATE includes both.

Informatica suggests using an overhead rate of 5 to 10 percent, which includes the following factors:

● Overhead for control information


● Overhead for writing recovery-related information, such as system checkpoints.

You have some control over the frequency of system checkpoints when you define your PWX Logger parameters.
See CHKPT_FREQUENCY in the PowerExchange Reference Guide for more information about this parameter.

DASD Capacity Conversion Table

Space Information Model 3390 Model 3380



usable bytes per track 49,152 40,960

tracks per cylinder 15 15

This example is based on the following assumptions:

● estimated average size of a changed record = 600 bytes


● estimated rate of captured changes = 40,000 changes per hour
● desired number of hours between archives = 12
● overhead rate = 5 percent
● DASD model = 3390

The estimated size of each active log data set in bytes is calculated as follows:

600 * 40,000 * 12 * 1.05 = 302,400,000

The number of cylinders to allocate is calculated as follows:

302,400,000 / 49,152 = approximately 6152 tracks

6152 / 15 = approximately 410 cylinders

The following example shows an IDCAMS DEFINE statement that uses the above calculations:

DEFINE CLUSTER -
(NAME (HLQ.EDML.PRILOG.DS01) -
LINEAR -
VOLUMES(volser) -
SHAREOPTIONS(2,3) -
CYL(410) ) -
DATA -
(NAME(HLQ.EDML.PRILOG.DS01.DATA) )

The variable HLQ represents the high-level qualifier that you defined for the log data sets during installation.

Additional Logger Tips

The Logger format utility (EDMLUTL0) formats only the primary space allocation. This means that the Logger
does not use secondary allocation. This includes Candidate Volumes and Space, such as that allocated by SMS
when using a STORCLAS with the Guaranteed Space attribute. Logger active logs should be defined through
IDCAMS with:

● No secondary allocation.
● A single VOLSER in the VOLUME parameter.
● An SMS STORCLAS, if used, without GUARANTEED SPACE=YES.

PowerExchange Agent Commands



You can use commands from the MVS system to control certain aspects of PowerExchange Agent processing. To
issue a PowerExchange Agent command, enter the PowerExchange Agent command prefix (as specified by
CmdPrefix in your configuration parameters), followed by the command. For example, if CmdPrefix=AG01, issue
the following command to close the Agent's message log:

AG01 LOGCLOSE

The PowerExchange Agent intercepts agent commands issued on the MVS console and processes them in the
agent address space. If the PowerExchange Agent address space is inactive, MVS rejects any PowerExchange
Agent commands that you issue. If the PowerExchange Agent has not been started during the current IPL, or if
you issue the command with the wrong prefix, MVS generates the following message:

IEE305I command COMMAND INVALID

See PowerExchange Reference Guide (8.1.1) for detailed information on Agent commands.

PowerExchange Logger Commands

The PowerExchange Logger uses two types of commands: interactive and batch.

You run interactive commands from the MVS console when the PowerExchange logger is running. You can use
PowerExchange Logger interactive commands to:

● Display PowerExchange Logger log data sets, units of work (UOWs), and reader/writer connections.
● Resolve in-doubt UOWs.
● Stop a PowerExchange Logger.
● Print the contents of the PowerExchange active log file (in hexadecimal format).

You use batch commands primarily in batch change utility jobs to make changes to parameters and configurations
when the PowerExchange Logger is stopped. Use PowerExchange Logger batch commands to:

● Define PowerExchange Loggers and PowerExchange Logger options, including PowerExchange Logger
names, archive log options, buffer options, and mode (single or dual).
● Add log definitions to the restart data set.
● Delete data set records from the restart data set.
● Display log data sets, UOWs, and reader/writer connections.

See PowerExchange Reference Guide (8.1.1) for detailed information on Logger Commands (Chapter 4, Page 59)

Last updated: 05-Jun-08 14:43



Third Party Scheduler

Challenge

Successfully integrate a third-party scheduler with PowerCenter. This Best Practice describes the various levels at which such a scheduler can be integrated.

Description

Tasks such as getting server and session properties, session status, or starting or
stopping a workflow or a task can be performed either through the Workflow Monitor or
by integrating a third-party scheduler with PowerCenter. A third-party scheduler can be
integrated with PowerCenter at any of several levels. The level of integration depends
on the complexity of the workflow/schedule and the skill sets of production support
personnel.

Many companies want to automate the scheduling process by using scripts or third-
party schedulers. In some cases, they are using a standard scheduler and want to
continue using it to drive the scheduling process.

A third-party scheduler can start or stop a workflow or task, obtain session statistics,
and get server details using pmcmd commands. pmcmd is a command-line program used to
communicate with the PowerCenter server.

Third Party Scheduler Integration Levels

In general, there are three levels of integration between a third-party scheduler and
PowerCenter: Low, Medium, and High.

Low Level

Low-level integration refers to a third-party scheduler kicking off the initial PowerCenter
workflow. This process subsequently kicks off the rest of the tasks or sessions. The
PowerCenter scheduler handles all processes and dependencies after the third-party
scheduler has kicked off the initial workflow. In this level of integration, nearly all control
lies with the PowerCenter scheduler.

This type of integration is very simple to implement because the third-party scheduler



kicks off only one process. It is often used simply to satisfy a corporate mandate for a
standard scheduler. This type of integration also takes advantage of the robust
functionality offered by the Workflow Monitor.

Low-level integration requires production support personnel to have a thorough


knowledge of PowerCenter. Because Production Support personnel in many
companies are only knowledgeable about the company’s standard scheduler, one of
the main disadvantages of this level of integration is that if a batch fails at some point,
the Production Support personnel may not be able to determine the exact breakpoint.
Thus, the majority of the production support burden falls back on the Project
Development team.

Medium Level

With Medium-level integration, a third-party scheduler kicks off some, but not all,
workflows or tasks. Within the tasks, many sessions may be defined with
dependencies. PowerCenter controls the dependencies within the tasks.

With this level of integration, control is shared between PowerCenter and the third-party
scheduler, which requires more integration between the third-party scheduler and
PowerCenter. Medium-level integration requires Production Support personnel to have
a fairly good knowledge of PowerCenter and also of the scheduling tool. If they do not
have in-depth knowledge about the tool, they may be unable to fix problems that arise,
so the production support burden is shared between the Project Development team
and the Production Support team.

High Level

With High-level integration, the third-party scheduler has full control of scheduling and
kicks off all PowerCenter sessions. In this case, the third-party scheduler is responsible
for controlling all dependencies among the sessions. This type of integration is the
most complex to implement because there are many more interactions between the
third-party scheduler and PowerCenter.

Production Support personnel may have limited knowledge of PowerCenter but must
have thorough knowledge of the scheduling tool. Because Production Support
personnel in many companies are knowledgeable only about the company’s standard
scheduler, one of the main advantages of this level of integration is that if the batch
fails at some point, the Production Support personnel are usually able to determine the
exact breakpoint. Thus, the production support burden lies with the Production Support
team.



Sample Scheduler Script

There are many independent scheduling tools on the market. The following is an
example of an AutoSys script that can be used to start tasks; it is included here simply
as an illustration of how a scheduler can be implemented in the PowerCenter
environment. The script also captures return codes and aborts on error, returning
success or failure (with the associated return codes) to the command line or the
AutoSys GUI monitor.

# Name: jobname.job
# Author: Author Name
# Date: 01/03/2005
# Description:
# Schedule: Daily
#
# Modification History
# When Who Why
#
#------------------------------------------------------------------

. jobstart $0 $*

# set variables
ERR_DIR=/tmp

# Temporary file will be created to store all the Error Information


# The file format is TDDHHMISS<PROCESS-ID>.lst
CurDayTime=`date +%d%H%M%S`
FName=T$CurDayTime$$.lst

if [ $STEP -le 1 ]
then
    echo "Step 1: RUNNING wf_stg_tmp_product_xref_table..."

    cd /dbvol03/vendor/informatica/pmserver/

    # The pmcmd lines below need to be edited to include the name of the
    # workflow or the task that you are attempting to start, then uncommented.
    #pmcmd startworkflow -s ah-hp9:4001 -u Administrator -p informat01 wf_stg_tmp_product_xref_table
    #pmcmd starttask -s ah-hp9:4001 -u Administrator -p informat01 -f FINDW_SRC_STG -w WF_STG_TMP_PRODUCT_XREF_TABLE -wait s_M_STG_TMP_PRODUCT_XREF_TABLE

    # Checking whether to abort the Current Process or not
    RetVal=$?
    echo "Status = $RetVal"
    if [ $RetVal -ge 1 ]
    then
        jobend abnormal "Step 1: Failed wf_stg_tmp_product_xref_table...\n"
        exit 1
    fi
    echo "Step 1: Successful"
fi

jobend normal

exit 0

Last updated: 06-Dec-07 15:10



Determining Bottlenecks

Challenge

Because there are many variables involved in identifying and rectifying performance
bottlenecks, an efficient method for determining where bottlenecks exist is crucial to
good data warehouse management.

Description

The first step in performance tuning is to identify performance bottlenecks. Carefully
consider the following five areas to determine where bottlenecks exist, using a process
of elimination and investigating each area in the order indicated:

1. Target
2. Source
3. Mapping
4. Session
5. System

Best Practice Considerations

Use Thread Statistics to Identify Target, Source, and Mapping Bottlenecks

Use thread statistics to identify source, target or mapping (transformation) bottlenecks.


By default, an Integration Service uses one reader, one transformation, and one target
thread to process a session. Within each session log, the following thread statistics are
available:

● Run time – Amount of time the thread was running


● Idle time – Amount of time the thread was idle due to other threads within the
application or the Integration Service. This value does not include time the thread
is blocked by the operating system.
● Busy – Percentage of the overall run time the thread is not idle. This
percentage is calculated using the following formula:



(run time – idle time) / run time x 100

By analyzing the thread statistics found in an Integration Service session log, it is


possible to determine which thread is being used the most.

If a transformation thread is 100 percent busy and there are additional resources (e.g.,
CPU cycles and memory) available on the Integration Service server, add a partition
point in the segment.

If the reader or writer thread is 100 percent busy, consider using string data types in source
or target ports, since non-string ports require more processing.

Use the Swap Method to Test Changes in Isolation

Attempt to isolate performance problems by running test sessions. You should be able
to compare the session’s original performance with that of the tuned session’s performance.

The swap method is very useful for determining the most common bottlenecks. It
involves the following five steps:

1. Make a temporary copy of the mapping, session and/or workflow that is to be


tuned, then tune the copy before making changes to the original.
2. Implement only one change at a time and test for any performance
improvements to gauge which tuning methods work most effectively in the
environment.
3. Document the change made to the mapping, session and/or workflow and the
performance metrics achieved as a result of the change. The actual execution
time may be used as a performance metric.
4. Delete the temporary mapping, session and/or workflow upon completion of
performance tuning.
5. Make appropriate tuning changes to mappings, sessions and/or workflows.

Evaluating the Five Areas of Consideration

Target Bottlenecks

Relational Targets

The most common performance bottleneck occurs when the Integration Service writes
to a target database. This type of bottleneck can easily be identified with the following



procedure:

1. Make a copy of the original workflow.
2. Configure the session in the test workflow to write to a flat file and run the session.
3. Read the thread statistics in the session log.

If session performance increases significantly when writing to a flat file, you have a
write bottleneck. Consider performing the following tasks to improve performance:

● Drop indexes and key constraints


● Increase checkpoint intervals
● Use bulk loading
● Use external loading
● Minimize deadlocks
● Increase database network packet size
● Optimize target databases

Flat file targets

If the session targets a flat file, you probably do not have a write bottleneck. If the
session is writing to a SAN or a non-local file system, performance may be slower than
writing to a local file system. If possible, a session can be optimized by writing to a flat
file target local to the Integration Service. If the local flat file is very large, you can
optimize the write process by dividing it among several physical drives.

If the SAN or non-local file system is significantly slower than the local file system, work
with the appropriate network/storage group to determine if there are configuration
issues within the SAN.

Source Bottlenecks

Relational sources

If the session reads from a relational source, you can use a filter transformation, a read
test mapping, or a database query to identify source bottlenecks.

Using a Filter Transformation.



Add a filter transformation in the mapping after each source qualifier. Set the filter
condition to false so that no data is processed past the filter transformation. If the time it
takes to run the new session remains about the same, then you have a source
bottleneck.

Using a Read Test Session.

You can create a read test mapping to identify source bottlenecks. A read test mapping
isolates the read query by removing any transformation logic from the mapping. Use
the following steps to create a read test mapping:

1. Make a copy of the original mapping.


2. In the copied mapping, retain only the sources, source qualifiers, and any
custom joins or queries.
3. Remove all transformations.
4. Connect the source qualifiers to a file target.

Use the read test mapping in a test session. If the test session performance is similar to
the original session, you have a source bottleneck.

Using a Database Query

You can also identify source bottlenecks by executing a read query directly against the
source database. To do so, perform the following steps:

● Copy the read query directly from the session log.


● Run the query against the source database with a query tool such as SQL
Plus.
● Measure the query execution time and the time it takes for the query to return
the first row.

If there is a long delay between the two time measurements, you have a source
bottleneck.
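
As a rough illustration of this timing test (assuming an Oracle source and SQL*Plus; the table and column names below are hypothetical placeholders for the query copied from the session log):

SET TIMING ON

-- Time to first rows: fetch only a handful of rows.
SELECT * FROM (
    SELECT o.order_id, o.order_date, o.order_amt
    FROM   orders o
    WHERE  o.order_date >= TO_DATE('2007-01-01', 'YYYY-MM-DD')
) WHERE ROWNUM <= 10;

-- Total execution time: fetch the full result set without displaying it
-- (SET AUTOTRACE requires the PLUSTRACE role).
SET AUTOTRACE TRACEONLY STATISTICS
SELECT o.order_id, o.order_date, o.order_amt
FROM   orders o
WHERE  o.order_date >= TO_DATE('2007-01-01', 'YYYY-MM-DD');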

If your session reads from a relational source and is constrained by a source


bottleneck, review the following suggestions for improving performance:

● Optimize the query.


● Create tempdb as an in-memory database.



● Use conditional filters.
● Increase database network packet size.
● Connect to Oracle databases using IPC protocol.

Flat file sources

If your session reads from a flat file source, you probably do not have a read
bottleneck. Tuning the line sequential buffer length to a size large enough to hold
approximately four to eight rows of data at a time (for flat files) may improve
performance when reading flat file sources. Also, ensure the flat file source is local to
the Integration Service.

Mapping Bottlenecks

If you have eliminated the reading and writing of data as bottlenecks, you may have a
mapping bottleneck. Use the swap method to determine if the bottleneck is in the
mapping.

Begin by adding a Filter transformation in the mapping immediately before each target
definition. Set the filter condition to false so that no data is loaded into the target tables.
If the time it takes to run the new session is the same as the original session, you have
a mapping bottleneck. You can also use the performance details to identify mapping
bottlenecks: high Rowsinlookupcache and High Errorrows counters indicate mapping
bottlenecks.

Follow these steps to identify mapping bottlenecks:

Create a test mapping without transformations

1. Make a copy of the original mapping.


2. In the copied mapping, retain only the sources, source qualifiers, and any
custom joins or queries.
3. Remove all transformations.
4. Connect the source qualifiers to the target.

Check for High Rowsinlookupcache counters

Multiple lookups can slow the session. You may improve session performance by
locating the largest lookup tables and tuning those lookup expressions.



Check for High Errorrows counters

Transformation errors affect session performance. If a session has large numbers in


any of the Transformation_errorrows counters, you may improve performance by
eliminating the errors.

For further details on eliminating mapping bottlenecks, refer to the Best Practice:
Tuning Mappings for Better Performance

Session Bottlenecks

Session performance details can be used to flag other problem areas. Create
performance details by selecting “Collect Performance Data” in the session properties
before running the session.

View the performance details through the Workflow Monitor as the session runs, or
view the resulting file. The performance details provide counters about each source
qualifier, target definition, and individual transformation within the mapping to help you
understand session and mapping efficiency.

To view the performance details during the session run:

● Right-click the session in the Workflow Monitor.


● Choose Properties.
● Click the Properties tab in the details dialog box.

To view the resulting performance data file, look for the file session_name.perf in the
same directory as the session log and open the file in any text editor.

All transformations have basic counters that indicate the number of input rows, output
rows, and error rows. Source qualifiers, normalizers, and targets have additional
counters indicating the efficiency of data moving into and out of buffers. Some
transformations have counters specific to their functionality. When reading performance
details, the first column displays the transformation name as it appears in the mapping,
the second column contains the counter name, and the third column holds the resulting
number or efficiency percentage.

Low buffer input and buffer output counters



If the BufferInput_efficiency and BufferOutput_efficiency counters are low for all
sources and targets, increasing the session DTM buffer pool size may improve
performance.

Aggregator, Rank, and Joiner readfromdisk and writetodisk counters

If a session contains Aggregator, Rank, or Joiner transformations, examine each


Transformation_readfromdisk and Transformation_writetodisk counter. If these
counters display any number other than zero, you can improve session performance by
increasing the index and data cache sizes.

If the session performs incremental aggregation, the Aggregator_readfromdisk and
writetodisk counters display a number other than zero because the Integration Service
reads historical aggregate data from the local disk during the session and writes to disk
when saving historical data. Evaluate the incremental Aggregator_readfromdisk and
writetodisk counters during the session. If the counters show any numbers other than
zero during the session run, you can increase performance by tuning the index and
data cache sizes.

Note: PowerCenter versions 6.x and above include the ability to assign memory
allocation per object. In versions earlier than 6.x, aggregators, ranks, and joiners were
assigned at a global/session level.

For further details on eliminating session bottlenecks, refer to the Best Practice: Tuning
Sessions for Better Performance and Tuning SQL Overrides and Environment for
Better Performance.

System Bottlenecks

After tuning the source, target, mapping, and session, you may also consider tuning the
system hosting the Integration Service.

The Integration Service uses system resources to process transformations, session


execution, and the reading and writing of data. The Integration Service also uses
system memory for other data tasks such as creating aggregator, joiner, rank, and
lookup table caches.

You can use system performance monitoring tools to monitor the amount of system
resources the Server uses and identify system bottlenecks.

● Windows NT/2000. Use system tools such as the Performance and



Processes tab in the Task Manager to view CPU usage and total memory
usage. You can also view more detailed performance information by using the
Performance Monitor in the Administrative Tools on Windows.
● UNIX. Use the following system tools to monitor system performance
and identify system bottlenecks:
❍ lsattr -E -l sys0 - To view current system settings
❍ iostat - To monitor loading operation for every disk attached to the
database server
❍ vmstat or sar –w - To monitor disk swapping actions
❍ sar –u - To monitor CPU loading.

For further information regarding system tuning, refer to the Best


Practices: Performance Tuning UNIX Systems and Performance Tuning Windows
2000/2003 Systems.

Last updated: 01-Feb-07 18:54



Performance Tuning Databases (Oracle)

Challenge

Database tuning can result in a tremendous improvement in loading performance. This


Best Practice covers tips on tuning Oracle.

Description

Performance Tuning Tools

Oracle offers many tools for tuning an Oracle instance. Most DBAs are already familiar
with these tools, so we’ve included only a short description of some of the major ones
here.

V$ Views

V$ views are dynamic performance views that provide real-time information on


database activity, enabling the DBA to draw conclusions about database performance.
Because SYS is the owner of these views, only SYS can query them. Keep in mind that
querying these views impacts database performance, with each query having an
immediate cost. With this in mind, carefully consider which users should be granted the
privilege to query these views. You can grant viewing privileges with either the
‘SELECT’ privilege, which allows a user to view individual V$ views, or the ‘SELECT
ANY TABLE’ privilege, which allows the user to view all V$ views. Using the SELECT
ANY TABLE option requires the ‘O7_DICTIONARY_ACCESSIBILITY’ parameter to be
set to ‘TRUE’, which allows the ‘ANY’ keyword to apply to SYS-owned objects.

Explain Plan

Explain Plan, SQL Trace, and TKPROF are powerful tools for revealing bottlenecks
and developing a strategy to avoid them.

Explain Plan allows the DBA or developer to determine the execution path of a block of
SQL code. The SQL in a source qualifier or in a lookup that is running for a long time
should be generated and copied to SQL*PLUS or other SQL tool and tested to avoid
inefficient execution of these statements. Review the PowerCenter session log for long
initialization time (an indicator that the source qualifier may need tuning) and the time it
takes to build a lookup cache to determine if the SQL for these transformations should



be tested.
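
As a brief, hedged illustration (the tables are hypothetical; DBMS_XPLAN is available in Oracle 9i and later, while earlier releases query PLAN_TABLE directly, and PLAN_TABLE itself is created with utlxplan.sql):

-- Explain the SQL generated by a long-running source qualifier or lookup.
EXPLAIN PLAN FOR
SELECT c.customer_id, c.customer_name, o.order_amt
FROM   customers c, orders o
WHERE  c.customer_id = o.customer_id;

-- Display the execution plan.
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
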
SQL Trace

SQL Trace extends the functionality of Explain Plan by providing statistical information
about the SQL statements executed in a session that has tracing enabled. This utility is
run for a session with the ‘ALTER SESSION SET SQL_TRACE = TRUE’ statement.

TKPROF

The output of SQL Trace is provided in a dump file that is difficult to read. TKPROF
formats this dump file into a more understandable report.

UTLBSTAT & UTLESTAT

Executing ‘UTLBSTAT’ creates tables to store dynamic performance statistics and
begins the statistics collection process. Run this utility after the database has been up
and running (for hours or days). Accumulating statistics may take time, so you need to
run this utility for a long while and through several operations (i.e., both loading and
querying).

‘UTLESTAT’ ends the statistics collection process and generates an output file called
‘report.txt.’ This report should give the DBA a fairly complete idea about the level of
usage the database experiences and reveal areas that should be addressed.

Disk I/O

Disk I/O at the database level provides the highest level of performance gain in most
systems. Database files should be separated and identified. Rollback files should be
separated onto their own disks because they have significant disk I/O. Co-locate tables
that are heavily used with tables that are rarely used to help minimize disk contention.
Separate indexes so that when queries run indexes and tables, they are not fighting for
the same resource. Also be sure to implement disk striping; this, or RAID technology
can help immensely in reducing disk contention. While this type of planning is time
consuming, the payoff is well worth the effort in terms of performance gains.

Dynamic Sampling

Dynamic sampling enables the server to improve performance by:

● Estimating single-table predicate statistics where available statistics are


missing or may lead to bad estimations.
● Estimating statistics for tables and indexes with missing statistics.
● Estimating statistics for tables and indexes with out of date statistics.

Dynamic sampling is controlled by the OPTIMIZER_DYNAMIC_SAMPLING parameter,


which accepts values from "0" (off) to "10" (aggressive sampling) with a default value of
"2". At compile-time, Oracle determines if dynamic sampling can improve query



performance. If so, it issues recursive statements to estimate the necessary statistics.
Dynamic sampling can be beneficial when:

● The sample time is small compared to the overall query execution time.
● Dynamic sampling results in a better performing query.
● The query can be executed multiple times.
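
For illustration, dynamic sampling can be raised for a session or requested for a single table with a hint (the level shown and the SALES_FACT table are examples only):

-- Raise the dynamic sampling level for the current session (the default is 2).
ALTER SESSION SET OPTIMIZER_DYNAMIC_SAMPLING = 4;

-- Or request sampling for one table in one statement.
SELECT /*+ DYNAMIC_SAMPLING(s 4) */ COUNT(*)
FROM   sales_fact s
WHERE  load_date = TRUNC(SYSDATE);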

Automatic SQL Tuning in Oracle Database 10g


In its normal mode, the query optimizer needs to make decisions about execution plans
in a very short time. As a result, it may not always be able to obtain enough information
to make the best decision. Oracle 10g allows the optimizer to run in tuning mode,
where it can gather additional information and make recommendations about how
specific statements can be tuned further. This process may take several minutes for a
single statement so it is intended to be used on high-load, resource-intensive
statements.

In tuning mode, the optimizer performs the following analysis:


● Statistics Analysis. The optimizer recommends the gathering of statistics on
objects with missing or stale statistics. Additional statistics for these objects
are stored in an SQL profile.

● SQL Profiling. The optimizer may be able to improve performance by
gathering additional statistics and altering session-specific parameters such as
the OPTIMIZER_MODE. If such improvements are possible, the information is
stored in an SQL profile. If accepted, this information can then be used by the
optimizer when running in normal mode. Unlike a stored outline, which fixes
the execution plan, an SQL profile may still be of benefit when the contents of
the table alter drastically. Even so, it is sensible to update profiles periodically.
SQL profiling is not performed when the tuning optimizer is run in limited
mode.

● Access Path Analysis. The optimizer investigates the effect of new or
modified indexes on the access path. Because its index recommendations
relate to a specific statement, where practical it also suggests using the
SQL Access Advisor to check the impact of these indexes on a representative
SQL workload.

● SQL Structure Analysis. The optimizer suggests alternatives for SQL


statements that contain structures that may affect performance. Be aware that
implementing these suggestions requires human intervention to check their



validity.

TIP
The automatic SQL tuning features are accessible from Enterprise Manager on
the "Advisor Central" page
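
The same tuning-mode analysis can also be driven from SQL*Plus with the DBMS_SQLTUNE package; the following is a minimal sketch (the task name and SQL text are hypothetical):

-- Create and execute a tuning task for one resource-intensive statement.
DECLARE
    l_task VARCHAR2(64);
BEGIN
    l_task := DBMS_SQLTUNE.CREATE_TUNING_TASK(
                  sql_text   => 'SELECT * FROM customer_sales_fact WHERE order_date > SYSDATE - 1',
                  time_limit => 60,
                  task_name  => 'velocity_tune_task');
    DBMS_SQLTUNE.EXECUTE_TUNING_TASK(task_name => 'velocity_tune_task');
END;
/

-- Review the findings and recommendations.
-- (In SQL*Plus, SET LONG 10000 first to see the full report.)
SELECT DBMS_SQLTUNE.REPORT_TUNING_TASK('velocity_tune_task') FROM dual;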

Useful Views

Useful views related to automatic SQL tuning include:

● DBA_ADVISOR_TASKS
● DBA_ADVISOR_FINDINGS
● DBA_ADVISOR_RECOMMENDATIONS
● DBA_ADVISOR_RATIONALE
● DBA_SQLTUNE_STATISTICS
● DBA_SQLTUNE_BINDS
● DBA_SQLTUNE_PLANS
● DBA_SQLSET
● DBA_SQLSET_BINDS
● DBA_SQLSET_STATEMENTS
● DBA_SQLSET_REFERENCES
● DBA_SQL_PROFILES
● V$SQL
● V$SQLAREA
● V$ACTIVE_SESSION_HISTORY

Memory and Processing

Memory and processing configuration is performed in the init.ora file. Because


each database is different and requires an experienced DBA to analyze and tune it for
optimal performance, a standard set of parameters to optimize PowerCenter is not
practical and is not likely to ever exist.



TIP
Changes made in the init.ora file take effect after a restart of the instance.
Use svrmgr to issue the “shutdown” and “startup” commands (or, if necessary,
“shutdown immediate”) to the instance. Note that svrmgr is no longer
available as of Oracle 9i because Oracle is moving to a web-based Server
Manager in Oracle 10g. If you are using Oracle 9i, install Oracle client tools
and log onto Oracle Enterprise Manager. Some other tools like DBArtisan
expose the initialization parameters.

The settings presented here are those used on a four-CPU AIX server running Oracle
7.3.4, configured to use the parallel query option to facilitate parallel
processing of queries and indexes. We’ve also included Oracle’s descriptions and
documentation for each setting to help DBAs of other (i.e., non-Oracle)
systems understand what each command does in the Oracle environment and set
their native database commands and settings in a similar fashion.

HASH_AREA_SIZE = 16777216

● Default value: 2 times the value of SORT_AREA_SIZE


● Range of values: any integer
● This parameter specifies the maximum amount of memory, in bytes, to be
used for the hash join. If this parameter is not set, its value defaults to twice
the value of the SORT_AREA_SIZE parameter.
● The value of this parameter can be changed without shutting down the Oracle
instance by using the ALTER SESSION command; a brief example follows this
list. (Note: ALTER SESSION refers to the Database Administration command
issued at the svrmgr command prompt.)
● HASH_JOIN_ENABLED

❍ In Oracle 7 and Oracle 8 the hash_join_enabled parameter must be set to


true.
❍ In Oracle 8i and above hash_join_enabled=true is the default value

● HASH_MULTIBLOCK_IO_COUNT

❍ Allows multiblock reads against the TEMP tablespace


❍ It is advisable to set the NEXT extentsize to greater than the value for
hash_multiblock_io_count to reduce disk I/O
❍ This is the same behavior seen when setting the
db_file_multiblock_read_count parameter for data tablespaces except this
one applies only to multiblock access of segments of TEMP Tablespace



● STAR_TRANSFORMATION_ENABLED

❍ Determines whether a cost-based query transformation will be applied to


star queries
❍ When set to TRUE, the optimizer will consider performing a cost-based
query transformation on the n-way join table

● OPTIMIZER_INDEX_COST_ADJ

❍ Numeric parameter set between 0 and 1000 (default 1000)


❍ This parameter lets you tune the optimizer behavior for access path
selection to be more or less index friendly
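
As a simple illustration of the ALTER SESSION approach mentioned above (the values shown are examples only and should be sized for your environment):

-- Session-level overrides; they last only for the current session.
ALTER SESSION SET HASH_AREA_SIZE = 16777216;
ALTER SESSION SET OPTIMIZER_INDEX_COST_ADJ = 50;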

Optimizer_percent_parallel=33
This parameter defines the amount of parallelism that the optimizer uses in its cost
functions. The default of 0 means that the optimizer chooses the best serial plan. A
value of 100 means that the optimizer uses each object's degree of parallelism in
computing the cost of a full-table scan operation.
The value of this parameter can be changed without shutting down the Oracle instance
by using the ALTER SESSION command. Low values favor indexes, while high values
favor table scans.

Cost-based optimization is always used for queries that reference an object with a
nonzero degree of parallelism. For such queries, a RULE hint or optimizer mode or
goal is ignored. Use of a FIRST_ROWS hint or optimizer mode overrides a nonzero
setting of OPTIMIZER_PERCENT_PARALLEL.

parallel_max_servers=40

● Used to enable parallel query.


● Initially not set on Install.
● Maximum number of query servers or parallel recovery processes for an
instance.

Parallel_min_servers=8

● Used to enable parallel query.


● Initially not set on Install.
● Minimum number of query server processes for an instance. Also the number
of query-server processes Oracle creates when the instance is started.



SORT_AREA_SIZE=8388608

● Default value: operating system-dependent


● Minimum value: the value equivalent to two database blocks
● This parameter specifies the maximum amount, in bytes, of program global
area (PGA) memory to use for a sort. After the sort is complete, and all that
remains to do is to fetch the rows out, the memory is released down to the size
specified by SORT_AREA_RETAINED_SIZE. After the last row is fetched out,
all memory is freed. The memory is released back to the PGA, not to the
operating system.
● Increasing SORT_AREA_SIZE size improves the efficiency of large sorts.
Multiple allocations never exist; there is only one memory area of
SORT_AREA_SIZE for each user process at any time.
● The default is usually adequate for most database operations. However, if very
large indexes are created, this parameter may need to be adjusted. For
example, if one process is doing all database access, as in a full database
import, then an increased value for this parameter may speed the import,
particularly the CREATE INDEX statements.

Automatic Shared Memory Management in Oracle 10g

Automatic Shared Memory Management puts Oracle in control of allocating memory


within the SGA. The SGA_TARGET parameter sets the amount of memory available to
the SGA. This parameter can be altered dynamically up to a maximum of the
SGA_MAX_SIZE parameter value. Provided the STATISTICS_LEVEL is set to
TYPICAL or ALL, and the SGA_TARGET is set to a value other than "0", Oracle will
control the memory pools that would otherwise be controlled by the following
parameters:

● DB_CACHE_SIZE (default block size)


● SHARED_POOL_SIZE
● LARGE_POOL_SIZE
● JAVA_POOL_SIZE

If these parameters are set to a non-zero value, they represent the minimum size for
the pool. These minimum values may be necessary if you experience application errors
when certain pool sizes drop below a specific threshold.

The following parameters must be set manually and take memory from the quota



allocated by the SGA_TARGET parameter:

● DB_KEEP_CACHE_SIZE
● DB_RECYCLE_CACHE_SIZE
● DB_nK_CACHE_SIZE (non-default block size)
● STREAMS_POOL_SIZE
● LOG_BUFFER
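
A minimal sketch of enabling Automatic Shared Memory Management as described above (the sizes are examples only; an spfile is assumed so that SCOPE=BOTH persists the change):

-- Let Oracle manage the automatically tuned pools within a 1 GB SGA target.
ALTER SYSTEM SET SGA_TARGET = 1024M SCOPE = BOTH;

-- A value of 0 hands this pool entirely to automatic tuning;
-- a non-zero value acts as a minimum size for the pool.
ALTER SYSTEM SET SHARED_POOL_SIZE = 0 SCOPE = BOTH;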

IPC as an Alternative to TCP/IP on UNIX

On an HP/UX server with Oracle as a target (i.e., PMServer and Oracle target on same
box), using an IPC connection can significantly reduce the time it takes to build a
lookup cache. In one case, a fact mapping that was using a lookup to get five columns
(including a foreign key) and about 500,000 rows from a table was taking 19 minutes.
Changing the connection type to IPC reduced this to 45 seconds. In another mapping,
the total time decreased from 24 minutes to 8 minutes for ~120-130 bytes/row, 500,000
row write (array inserts), and primary key with unique index in place. Performance went
from about 2Mb/min (280 rows/sec) to about 10Mb/min (1360 rows/sec).

A normal tcp (network tcp/ip) connection in tnsnames.ora would look like this:

DW.armafix =
(DESCRIPTION =
(ADDRESS_LIST =
(ADDRESS =
(PROTOCOL =TCP)
(HOST = armafix)
(PORT = 1526)
)
)
(CONNECT_DATA=(SID=DW)
)
)

Make a new entry in the tnsnames like this, and use it for connection to the local
Oracle instance:

DWIPC.armafix =
(DESCRIPTION =
(ADDRESS =
(PROTOCOL=ipc)



(KEY=DW)
)
(CONNECT_DATA=(SID=DW))
)

Improving Data Load Performance


Alternative to Dropping and Reloading Indexes

Experts often recommend dropping and reloading indexes during very large loads to a
data warehouse but there is no easy way to do this. For example, writing a SQL
statement to drop each index, then writing another SQL statement to rebuild it, can be
a very tedious process.

Oracle 7 (and above) offers an alternative to dropping and rebuilding indexes by


allowing you to disable and re-enable existing indexes. Oracle stores the name of each
index in a table that can be queried. With this in mind, it is an easy matter to write a
SQL statement that queries this table and then generates SQL statements as output to
disable and enable these indexes.

Run the following to generate output to disable the foreign keys in the data warehouse:

SELECT 'ALTER TABLE ' || OWNER || '.' || TABLE_NAME || ' DISABLE


CONSTRAINT ' || CONSTRAINT_NAME || ' ;'

FROM USER_CONSTRAINTS

WHERE (TABLE_NAME LIKE '%DIM' OR TABLE_NAME LIKE '%FACT')

AND CONSTRAINT_TYPE = 'R'

This produces output that looks like:

ALTER TABLE MDDB_DEV.AGREEMENT_DIM DISABLE CONSTRAINT


SYS_C0011077 ;

ALTER TABLE MDDB_DEV.AGREEMENT_DIM DISABLE CONSTRAINT


SYS_C0011075 ;

ALTER TABLE MDDB_DEV.CUSTOMER_DIM DISABLE CONSTRAINT


SYS_C0011060 ;



ALTER TABLE MDDB_DEV.CUSTOMER_DIM DISABLE CONSTRAINT
SYS_C0011059 ;

ALTER TABLE MDDB_DEV.CUSTOMER_SALES_FACT DISABLE


CONSTRAINT SYS_C0011133 ;

ALTER TABLE MDDB_DEV.CUSTOMER_SALES_FACT DISABLE


CONSTRAINT SYS_C0011134 ;

ALTER TABLE MDDB_DEV.CUSTOMER_SALES_FACT DISABLE


CONSTRAINT SYS_C0011131 ;

Dropping or disabling primary keys also speeds loads. Run the results of this SQL
statement after disabling the foreign key constraints:

SELECT 'ALTER TABLE ' || OWNER || '.' || TABLE_NAME || ' DISABLE


PRIMARY KEY ;'

FROM USER_CONSTRAINTS

WHERE (TABLE_NAME LIKE '%DIM' OR TABLE_NAME LIKE '%FACT')

AND CONSTRAINT_TYPE = 'P'

This produces output that looks like:

ALTER TABLE MDDB_DEV.AGREEMENT_DIM DISABLE PRIMARY KEY ;

ALTER TABLE MDDB_DEV.CUSTOMER_DIM DISABLE PRIMARY KEY ;

ALTER TABLE MDDB_DEV.CUSTOMER_SALES_FACT DISABLE PRIMARY


KEY ;

Finally, disable any unique constraints with the following:

SELECT 'ALTER TABLE ' || OWNER || '.' || TABLE_NAME || ' DISABLE
CONSTRAINT ' || CONSTRAINT_NAME || ' ;'

FROM USER_CONSTRAINTS

WHERE (TABLE_NAME LIKE '%DIM' OR TABLE_NAME LIKE '%FACT')

AND CONSTRAINT_TYPE = 'U'

This produces output that looks like:

ALTER TABLE MDDB_DEV.CUSTOMER_DIM DISABLE CONSTRAINT


SYS_C0011070 ;

ALTER TABLE MDDB_DEV.CUSTOMER_SALES_FACT DISABLE


CONSTRAINT SYS_C0011071 ;

Save the results in a single file and name it something like ‘DISABLE.SQL’

To re-enable the indexes, rerun these queries after replacing ‘DISABLE’ with
‘ENABLE.’ Save the results in another file with a name such as ‘ENABLE.SQL’ and run
it as a post-session command.

Re-enable constraints in the reverse order that you disabled them. Re-enable the
unique constraints first, and re-enable primary keys before foreign keys.

TIP
Dropping or disabling foreign keys often boosts loading, but also slows queries
(such as lookups) and updates. If you do not use lookups or updates on your
target tables, you should get a boost by using this SQL statement to generate
scripts. If you use lookups and updates (especially on large tables), you can
exclude the index that will be used for the lookup from your script. You may
want to experiment to determine which method is faster.

Optimizing Query Performance


Oracle Bitmap Indexing

With version 7.3.x, Oracle added bitmap indexing to supplement the traditional b-tree
index. A b-tree index can greatly improve query performance on data that has high
cardinality or contains mostly unique values, but is not much help for low cardinality/
highly-duplicated data and may even increase query time. A typical example of a low
cardinality field is gender – it is either male or female (or possibly unknown). This kind
of data is an excellent candidate for a bitmap index, and can significantly improve query
performance.



Keep in mind however, that b-tree indexing is still the Oracle default. If you don’t
specify an index type when creating an index, Oracle defaults to b-tree. Also note that
for certain columns, bitmaps are likely to be smaller and faster to create than a b-tree
index on the same column.

Bitmap indexes are suited to data warehousing because of their performance, size, and
ability to create and drop very quickly. Since most dimension tables in a warehouse
have nearly every column indexed, the space savings is dramatic. But it is important to
note that when a bitmap-indexed column is updated, every row associated with that
bitmap entry is locked, making bit-map indexing a poor choice for OLTP database
tables with constant insert and update traffic. Also, bitmap indexes are rebuilt after
each DML statement (e.g., inserts and updates), which can make loads very slow. For
this reason, it is a good idea to drop or disable bitmap indexes prior to the load and re-
create or re-enable them after the load.

The relationship between Fact and Dimension keys is another example of low
cardinality. With a b-tree index on the Fact table, a query is processed by joining all the
Dimension tables in a Cartesian product based on the WHERE clause and then joining back
to the Fact table. With a bitmapped index on the Fact table, a ‘star query’ may be
created that accesses the Fact table first followed by the Dimension table joins,
avoiding a Cartesian product of all possible Dimension attributes. This ‘star query’
access method is only used if the STAR_TRANSFORMATION_ENABLED parameter is
equal to TRUE in the init.ora file and if there are single column bitmapped indexes
on the fact table foreign keys. Creating bitmap indexes is similar to creating b-tree
indexes. To specify a bitmap index, add the word ‘bitmap’ between ‘create’ and ‘index’.
All other syntax is identical.

Bitmap Indexes

drop index emp_active_bit;

drop index emp_gender_bit;

create bitmap index emp_active_bit on emp (active_flag);

create bitmap index emp_gender_bit on emp (gender);

B-tree Indexes

drop index emp_active;



drop index emp_gender;

create index emp_active on emp (active_flag);

create index emp_gender on emp (gender);

Information for bitmap indexes is stored in the data dictionary in dba_indexes,


all_indexes, and user_indexes with the word ‘BITMAP’ in the Uniqueness column
rather than the word ‘UNIQUE.’ Bitmap indexes cannot be unique.

To enable bitmap indexes, you must set the following items in the instance initialization
file:

● compatible = 7.3.2.0.0 # or higher


● event = "10111 trace name context forever"
● event = "10112 trace name context forever"
● event = "10114 trace name context forever"

Also note that the parallel query option must be installed in order to create bitmap
indexes. If you try to create bitmap indexes without the parallel query option, a syntax
error appears in the SQL statement; the keyword ‘bitmap’ won't be recognized.

TIP
To check if the parallel query option is installed, start and log into SQL*Plus. If
the parallel query option is installed, the word ‘parallel’ appears in the banner
text.

Index Statistics

Table method

Index statistics are used by Oracle to determine the best method to access tables and
should be updated periodically as part of normal DBA procedures. The following should
improve query results on Fact and Dimension tables (including appending and updating
records) by updating the table and index statistics for the data warehouse:

The following SQL statement can be used to analyze the tables in the database:



SELECT 'ANALYZE TABLE ' || TABLE_NAME || ' COMPUTE STATISTICS;'

FROM USER_TABLES

WHERE (TABLE_NAME LIKE '%DIM' OR TABLE_NAME LIKE '%FACT')

This generates the following result:

ANALYZE TABLE CUSTOMER_DIM COMPUTE STATISTICS;

ANALYZE TABLE MARKET_DIM COMPUTE STATISTICS;

ANALYZE TABLE VENDOR_DIM COMPUTE STATISTICS;

The following SQL statement can be used to analyze the indexes in the database:

SELECT 'ANALYZE INDEX ' || INDEX_NAME || ' COMPUTE STATISTICS;'

FROM USER_INDEXES

WHERE (TABLE_NAME LIKE '%DIM' OR TABLE_NAME LIKE '%FACT')

This generates the following results:

ANALYZE INDEX SYS_C0011125 COMPUTE STATISTICS;

ANALYZE INDEX SYS_C0011119 COMPUTE STATISTICS;

ANALYZE INDEX SYS_C0011105 COMPUTE STATISTICS;

Save these results as a SQL script to be executed before or after a load.

Schema method

Another way to update index statistics is to compute indexes by schema rather than by
table. If data warehouse indexes are the only indexes located in a single schema, you
can use the following command to update the statistics:

EXECUTE SYS.DBMS_UTILITY.Analyze_Schema ('BDB', 'compute');



In this example, BDB is the schema for which the statistics should be updated. Note
that the DBA must grant the execution privilege for dbms_utility to the database user
executing this command.

TIP
These SQL statements can be very resource intensive, especially for very large
tables. For this reason, Informatica recommends running them at off-peak
times when no other process is using the database. If you find the exact
computation of the statistics consumes too much time, it is often acceptable to
estimate the statistics rather than compute them. Use ‘estimate’ instead of
‘compute’ in the above examples.
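
On Oracle 8i and later, the DBMS_STATS package is an alternative to the ANALYZE statements shown above; the following is a hedged sketch for the same BDB schema:

BEGIN
   -- Gathers table statistics for the schema; cascade also gathers index statistics.
   DBMS_STATS.GATHER_SCHEMA_STATS(ownname => 'BDB', cascade => TRUE);
END;
/
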
Parallelism

Parallel execution can be implemented at the SQL statement, database object, or


instance level for many SQL operations. The degree of parallelism should be identified
based on the number of processors and disk drives on the server, with the number of
processors being the minimum degree.

SQL Level Parallelism

Hints are used to define parallelism at the SQL statement level. The following examples
demonstrate how to utilize four processors:

SELECT /*+ PARALLEL(order_fact,4) */ …;

SELECT /*+ PARALLEL_INDEX(order_fact, order_fact_ixl,4) */ …;

TIP
When using a table alias in the SQL Statement, be sure to use this alias in the
hint. Otherwise, the hint will not be used, and you will not receive an error
message.

Example of improper use of alias:

SELECT /*+PARALLEL (EMP, 4) */ EMPNO, ENAME

FROM EMP A



Here, the parallel hint will not be used because the table EMP is aliased as "A", but the
hint refers to EMP rather than the alias. The correct way is:

SELECT /*+PARALLEL (A, 4) */ EMPNO, ENAME


FROM EMP A

Table Level Parallelism

Parallelism can also be defined at the table and index level. The following example
demonstrates how to set a table’s degree of parallelism to four for all eligible SQL
statements on this table:

ALTER TABLE order_fact PARALLEL 4;

Ensure that Oracle is not contending with other processes for these resources or you
may end up with degraded performance due to resource contention.

Additional Tips

Executing Oracle SQL Scripts as Pre- and Post-Session Commands


on UNIX

You can execute queries as both pre- and post-session commands. For a UNIX
environment, the format of the command is:

sqlplus –s user_id/password@database @ script_name.sql

For example, to execute the ENABLE.SQL file created earlier (assuming the data
warehouse is on a database named ‘infadb’), you would execute the following as a post-
session command:

sqlplus –s user_id/password@infadb @ enable.sql

In some environments, this may be a security issue since both username and
password are hard-coded and unencrypted. To avoid this, use the operating system’s
authentication to log onto the database instance.

In the following example, the Informatica id “pmuser” is used to log onto the Oracle
database. Create the Oracle user “pmuser” with the following SQL statement:



CREATE USER PMUSER IDENTIFIED EXTERNALLY
DEFAULT TABLESPACE . . .
TEMPORARY TABLESPACE . . .

In the following pre-session command, “pmuser” (the id Informatica is logged onto the
operating system as) is automatically passed from the operating system to the
database and used to execute the script:

sqlplus -s /@infadb @/informatica/powercenter/Scripts/ENABLE.SQL

You may want to use the init.ora parameter “os_authent_prefix” to distinguish between
“normal” Oracle users and externally-identified ones.

DRIVING_SITE ‘Hint’

If the source and target are on separate instances, the Source Qualifier transformation
should be executed on the target instance.

For example, you want to join two source tables (A and B) together, which may reduce
the number of selected rows. However, Oracle fetches all of the data from both tables,
moves the data across the network to the target instance, then processes everything
on the target instance. If either data source is large, this causes a great deal of network
traffic. To force the Oracle optimizer to process the join on the source instance, use the
‘Generate SQL’ option in the source qualifier and include the ‘driving_site’ hint in the
SQL statement as:

SELECT /*+ DRIVING_SITE */ …;

Last updated: 01-Feb-07 18:54



Performance Tuning Databases (SQL Server)

Challenge

Database tuning can result in tremendous improvement in loading performance. This
Best Practice offers tips on tuning SQL Server.

Description

Proper tuning of the source and target database is a very important consideration in the
scalability and usability of a business data integration environment. Managing
performance on an SQL Server involves the following points.

● Manage system memory usage (RAM caching).


● Create and maintain good indexes.
● Partition large data sets and indexes.
● Monitor disk I/O subsystem performance.
● Tune applications and queries.
● Optimize active data.

Taking advantage of grid computing is another option for improving overall SQL
Server performance. In a SQL Server cluster environment, the databases are split
among the nodes, which provides the ability to distribute the load across multiple
nodes. To achieve high performance, Informatica recommends using a fibre-attached
SAN device for shared storage.

Manage RAM Caching

Managing RAM buffer cache is a major consideration in any database server
environment. Accessing data in RAM cache is much faster than accessing the same
information from disk. If database I/O can be reduced to the minimal required set of
data and index pages, the pages stay in RAM longer. Too much unnecessary data and
index information flowing into buffer cache quickly pushes out valuable pages. The
primary goal of performance tuning is to reduce I/O so that buffer cache is used
effectively.



Several settings in SQL Server can be adjusted to take advantage of SQL Server RAM
usage:

● Max async I/O is used to specify the number of simultaneous disk I/O
operations that SQL Server can submit to the operating system. Note that this
setting is automated in SQL Server 2000.
● SQL Server allows several selectable models for database recovery; these
include:

❍ Full Recovery
❍ Bulk-Logged Recovery
❍ Simple Recovery
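
For bulk loading, the recovery model is commonly switched around the load window.
The following is a minimal sketch assuming a hypothetical database named infadw;
choose the model according to your point-in-time recovery requirements:

-- Before the bulk load: minimize logging of bulk operations
ALTER DATABASE infadw SET RECOVERY BULK_LOGGED;

-- After the load completes and a fresh backup is taken: return to full logging
ALTER DATABASE infadw SET RECOVERY FULL;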

Create and Maintain Good Indexes

Creating and maintaining good indexes is key to maintaining minimal I/O for all
database queries.

Partition Large Data Sets and Indexes

To reduce overall I/O contention and improve parallel operations, consider partitioning
table data and indexes. Multiple techniques for achieving and managing partitions
using SQL Server 2000 are addressed in this document.

Tune Applications and Queries

Tuning applications and queries is especially important when a database server is
likely to be servicing requests from hundreds or thousands of connections through a
given application. Because applications typically determine the SQL queries that are
executed on a database server, it is very important for application developers to
understand SQL Server architectural basics and know how to take full advantage of
SQL Server indexes to minimize I/O.

Partitioning for Performance

The simplest technique for creating disk I/O parallelism is to use hardware partitioning
and create a single "pool of drives" that serves all SQL Server database files except
transaction log files, which should always be stored on physically-separate disk drives
dedicated to log files. (See Microsoft documentation for installation procedures.)



Objects For Partitioning Consideration

The following areas of SQL Server activity can be separated across different hard
drives, RAID controllers, and PCI channels (or combinations of the three):

● Transaction logs
● Tempdb
● Database
● Tables
● Nonclustered Indexes

Note: In SQL Server 2000, Microsoft introduced enhancements to distributed
partitioned views that enable the creation of federated databases (commonly referred
to as scale-out), which spread resource load and I/O activity across multiple servers.
Federated databases are appropriate for some high-end online transaction processing
(OLTP) applications, but this approach is not recommended for addressing the needs
of a data warehouse.

Segregating the Transaction Log

Transaction log files should be maintained on a storage device that is physically
separate from devices that contain data files. Depending on your database recovery
model setting, most update activity generates both data device activity and log activity.
If both are set up to share the same device, the operations to be performed compete
for the same limited resources. Most installations benefit from separating these
competing I/O activities.

Segregating tempdb

SQL Server creates a database, tempdb, on every server instance to be used by the
server as a shared working area for various activities, including temporary tables,
sorting, processing subqueries, building aggregates to support GROUP BY or ORDER
BY clauses, queries using DISTINCT (temporary worktables have to be created to
remove duplicate rows), cursors, and hash joins.

To move the tempdb database, use the ALTER DATABASE command to change the
physical file location of the SQL Server logical file name associated with tempdb. For
example, to move tempdb and its associated log to the new file locations E:\mssql7 and
C:\temp, use the following commands:



ALTER DATABASE tempdb MODIFY FILE (NAME = 'tempdev',
FILENAME = 'e:\mssql7\tempnew_location.mdf')

ALTER DATABASE tempdb MODIFY FILE (NAME = 'templog',
FILENAME = 'c:\temp\tempnew_loglocation.ldf')

The master database, msdb, and model databases are not used much during
production (as compared to user databases), so it is generally not necessary to
consider them in I/O performance tuning. The master database is usually used only
for adding new logins, databases, devices, and other system objects.

Database Partitioning

Databases can be partitioned using files and/or filegroups. A filegroup is simply a
named collection of individual files grouped together for administration purposes. A file
cannot be a member of more than one filegroup. Tables, indexes, text, ntext, and
image data can all be associated with a specific filegroup. This means that all their
pages are allocated from the files in that filegroup. The three types of filegroups are:

● Primary filegroup. Contains the primary data file and any other files not
placed into another filegroup. All pages for the system tables are allocated
from the primary filegroup.
● User-defined filegroup. Any filegroup specified using the FILEGROUP
keyword in a CREATE DATABASE or ALTER DATABASE statement, or on
the Properties dialog box within SQL Server Enterprise Manager.
● Default filegroup. Contains the pages for all tables and indexes that do not
have a filegroup specified when they are created. In each database, only one
filegroup at a time can be the default filegroup. If no default filegroup is
specified, the default is the primary filegroup.

Files and filegroups are useful for controlling the placement of data and indexes
and eliminating device contention. Quite a few installations also leverage files and
filegroups as a mechanism that is more granular than a database in order to exercise
more control over their database backup/recovery strategy.
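
As an illustration, the following sketch places a large fact table on its own filegroup and
disk; the database, filegroup, file path, and table names are all hypothetical:

-- Add a filegroup and a data file on a separate physical drive
ALTER DATABASE infadw ADD FILEGROUP fg_facts;
ALTER DATABASE infadw
ADD FILE (NAME = infadw_facts1, FILENAME = 'e:\mssql\data\infadw_facts1.ndf', SIZE = 500MB)
TO FILEGROUP fg_facts;

-- Create the fact table on the new filegroup so its pages are allocated there
CREATE TABLE dbo.order_fact (
    order_id   INT      NOT NULL,
    order_date DATETIME NOT NULL,
    amount     MONEY
) ON fg_facts;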

Horizontal Partitioning (Table)



Horizontal partitioning segments a table into multiple tables, each containing the same
number of columns but fewer rows. Determining how to partition tables horizontally
depends on how data is analyzed. A general rule of thumb is to partition tables so
queries reference as few tables as possible. Otherwise, excessive UNION queries,
used to merge the tables logically at query time, can impair performance.

When you partition data across multiple tables or multiple servers, queries accessing
only a fraction of the data can run faster because there is less data to scan. If the
tables are located on different servers, or on a computer with multiple processors, each
table involved in the query can also be scanned in parallel, thereby improving query
performance. Additionally, maintenance tasks, such as rebuilding indexes or backing
up a table, can execute more quickly.

By using a partitioned view, the data still appears as a single table and can be queried
as such without having to reference the correct underlying table manually.
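
The following is a minimal sketch of a local partitioned view; the member tables, the
partitioning column sale_year, and the CHECK constraints are hypothetical. The
constraints let the optimizer skip member tables that cannot satisfy a query's WHERE clause:

CREATE TABLE sales_2005 (
    sale_id   INT NOT NULL,
    sale_year INT NOT NULL CHECK (sale_year = 2005),
    amount    MONEY,
    CONSTRAINT pk_sales_2005 PRIMARY KEY (sale_id, sale_year)
);
CREATE TABLE sales_2006 (
    sale_id   INT NOT NULL,
    sale_year INT NOT NULL CHECK (sale_year = 2006),
    amount    MONEY,
    CONSTRAINT pk_sales_2006 PRIMARY KEY (sale_id, sale_year)
);
GO

-- The view presents the member tables as one logical table
CREATE VIEW sales AS
SELECT sale_id, sale_year, amount FROM sales_2005
UNION ALL
SELECT sale_id, sale_year, amount FROM sales_2006;
GO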

Cost Threshold for Parallelism Option

Use this option to specify the threshold where SQL Server creates and executes
parallel plans. SQL Server creates and executes a parallel plan for a query only when
the estimated cost to execute a serial plan for the same query is higher than the value
set in cost threshold for parallelism. The cost refers to an estimated elapsed time in
seconds required to execute the serial plan on a specific hardware configuration. Only
set cost threshold for parallelism on symmetric multiprocessors (SMP).

Max Degree of Parallelism Option

Use this option to limit the number of processors (from a maximum of 32) to use in
parallel plan execution. The default value is zero, which uses the actual number of
available CPUs. Set this option to one to suppress parallel plan generation. Set the
value to a number greater than one to restrict the maximum number of processors
used by a single query execution.

Priority Boost Option

Use this option to specify whether SQL Server should run at a higher scheduling
priority than other processes on the same computer. If you set this option to one, SQL
Server runs at a priority base of 13. The default is zero, which is a priority base of
seven.

Set Working Set Size Option



Use this option to reserve physical memory space for SQL Server that is equal to the
server memory setting. The server memory setting is configured automatically by SQL
Server based on workload and available resources; it can vary dynamically between
minimum server memory and maximum server memory. Setting ‘set working set size’
means the operating system does not attempt to swap out SQL Server pages, even if
they can be used more readily by another process when SQL Server is idle.
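
The server options described in the preceding sections are changed with sp_configure;
the parallelism options are advanced options. The values below are purely illustrative
and should be derived from testing on your own hardware:

EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'cost threshold for parallelism', 10;
EXEC sp_configure 'max degree of parallelism', 4;
RECONFIGURE;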

Optimizing Disk I/O Performance

When configuring a SQL Server that contains only a few gigabytes of data and does
not sustain heavy read or write activity, you need not be particularly concerned with
disk I/O or with balancing SQL Server I/O activity across hard drives. To build larger
SQL Server databases however, which can contain hundreds of gigabytes or even
terabytes of data and/or sustain heavy read/write activity (as in a DSS application), it is
necessary to configure the system to maximize SQL Server disk I/O performance by
load-balancing across multiple hard drives.

Partitioning for Performance

For SQL Server databases that are stored on multiple disk drives, performance can be
improved by partitioning the data to increase the amount of disk I/O parallelism.

Partitioning can be performed using a variety of techniques. Methods for creating and
managing partitions include configuring the storage subsystem (i.e., disk, RAID
partitioning) and applying various data configuration mechanisms in SQL Server such
as files, file groups, tables and views. Some possible candidates for partitioning include:

● Transaction log
● Tempdb
● Database
● Tables
● Non-clustered indexes

Using bcp and BULK INSERT

Two mechanisms exist inside SQL Server to address the need for bulk movement of
data: the bcp utility and the BULK INSERT statement.



● Bcp is a command prompt utility that copies data into or out of SQL Server.
● BULK INSERT is a Transact-SQL statement that can be executed from within
the database environment. Unlike bcp, BULK INSERT can only pull data into
SQL Server. An advantage of using BULK INSERT is that it can copy data
into instances of SQL Server using a Transact-SQL statement, rather than
having to shell out to the command prompt.

TIP
Both of these mechanisms enable you to exercise control over the batch size.
Unless you are working with small volumes of data, it is good to get in the habit
of specifying a batch size for recoverability reasons. If none is specified, SQL
Server commits all rows to be loaded as a single batch. For example, you
attempt to load 1,000,000 rows of new data into a table. The server suddenly
loses power just as it finishes processing row number 999,999. When the
server recovers, those 999,999 rows will need to be rolled back out of the
database before you attempt to reload the data. By specifying a batch size of
10,000 you could have saved significant recovery time, because SQL Server
would have only had to roll back 9,999 rows instead of 999,999.
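
As a sketch of the batch-size control described above, the following BULK INSERT
assumes a pipe-delimited flat file; the table and path names are hypothetical:

BULK INSERT dbo.order_fact
FROM 'c:\load\order_fact.dat'
WITH (
    FIELDTERMINATOR = '|',
    ROWTERMINATOR = '\n',
    BATCHSIZE = 10000,  -- commit every 10,000 rows for recoverability
    TABLOCK             -- table-level lock speeds up the bulk load
);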

General Guidelines for Initial Data Loads

While loading data:

● Remove indexes.
● Use Bulk INSERT or bcp.
● Parallel load using partitioned data files into partitioned tables.
● Run one load stream for each available CPU.
● Set Bulk-Logged or Simple Recovery model.
● Use the TABLOCK option.
● Create indexes.
● Switch to the appropriate recovery model.
● Perform backups

General Guidelines for Incremental Data Loads

● Load data with indexes in place.



● Use performance and concurrency requirements to determine locking
granularity (sp_indexoption; see the sketch below).
● Change from Full to Bulk-Logged Recovery mode unless there is an overriding
need to preserve point-in-time recovery, such as online users modifying the
database during bulk loads. Read operations should not affect bulk loads.
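
The following is a minimal sketch of the sp_indexoption call referenced above; the table
name is hypothetical, and the option names should be verified against your SQL Server
version. Disabling row and page locks coarsens lock granularity for the load (fewer locks
to manage) at the cost of concurrency:

-- Applies to all indexes on the table; revert to 'TRUE' after the load
-- if online queries need finer-grained locking
EXEC sp_indexoption 'dbo.order_fact', 'AllowRowLocks', 'FALSE';
EXEC sp_indexoption 'dbo.order_fact', 'AllowPageLocks', 'FALSE';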

Last updated: 01-Feb-07 18:54



Performance Tuning Databases (Teradata)

Challenge

Database tuning can result in tremendous improvement in loading performance. This
Best Practice provides tips on tuning Teradata.

Description

Teradata offers several bulk load utilities including:

● MultiLoad, which supports inserts, updates, deletes, and “upserts” to any
table.
● FastExport, which is a high-performance bulk export utility.
● BTEQ, which allows you to export data to a flat file but is suitable for smaller
volumes than FastExport.
● FastLoad, which is used for loading inserts into an empty table.
● TPump, which is a light-weight utility that does not lock the table that is being
loaded.

Tuning MultiLoad

There are many aspects to tuning a Teradata database. Several aspects of tuning can
be controlled by setting MultiLoad parameters to maximize write throughput. Other
areas to analyze when performing a MultiLoad job include estimating space
requirements and monitoring MultiLoad performance.

MultiLoad parameters

Below are the MultiLoad-specific parameters that are available in PowerCenter:

● TDPID. A client-based operand that is part of the logon string.
● Date Format. Ensure that the date format used in your target flat file is
equivalent to the date format parameter in your MultiLoad script. Also validate
that your date format is compatible with the date format specified in the
Teradata database.
● Checkpoint. A checkpoint interval is similar to a commit interval for other
databases. When you set the checkpoint value to less than 60, it represents
the interval in minutes between checkpoint operations. If the checkpoint is set
to a value greater than 60, it represents the number of records to write before
performing a checkpoint operation. To maximize write speed to the database,
try to limit the number of checkpoint operations that are performed.
● Tenacity. Interval in hours between MultiLoad attempts to log on to the
database when the maximum number of sessions are already running.
● Load Mode. Available load methods include Insert, Update, Delete, and
Upsert. Consider creating separate external loader connections for each
method, selecting the one that will be most efficient for each target table.
● Drop Error Tables. Allows you to specify whether to drop or retain the three
error tables for a MultiLoad session. Set this parameter to 1 to drop error
tables or 0 to retain error tables.
● Max Sessions. This parameter specifies the maximum number of sessions
that are allowed to log on to the database. This value should not exceed one
per working AMP (Access Module Processor).
● Sleep. This parameter specifies the number of minutes that MultiLoad waits
before retrying a logon operation.

Estimating Space Requirements for MultiLoad Jobs

Always estimate the final size of your MultiLoad target tables and make sure the
destination has enough space to complete your MultiLoad job. In addition to the space
that may be required by target tables, each MultiLoad job needs permanent space for:

● Work tables
● Error tables
● Restart Log table

Note: Spool space cannot be used for MultiLoad work tables, error tables, or the
restart log table. Spool space is freed at each restart. By using permanent space for the
MultiLoad tables, data is preserved for restart operations after a system failure. Work
tables, in particular, require a lot of extra permanent space. Also remember to account
for the size of error tables since error tables are generated for each target table.

Use the following formula to prepare the preliminary space estimate for one target
table, assuming no fallback protection, no journals, and no non-unique secondary
indexes:



PERM = (using data size + 38) x (number of rows processed) x (number of
apply conditions satisfied) x (number of Teradata SQL statements within the
applied DML)
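
For example, assuming a USING data size of 100 bytes, 1,000,000 rows processed, one
satisfied apply condition, and one Teradata SQL statement in the applied DML, the
preliminary estimate would be (100 + 38) x 1,000,000 x 1 x 1 = 138,000,000 bytes, or
roughly 132 MB of permanent space for that target table.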

Make adjustments to your preliminary space estimates according to the requirements
and expectations of your MultiLoad job.

Monitoring MultiLoad Performance

Below are tips for analyzing MultiLoad performance:

1. Determine which phase of the MultiLoad job is causing poor performance.

● If the performance bottleneck is during the acquisition phase, as data is
acquired from the client system, then the issue may be with the client system.
If it is during the application phase, as data is applied to the target tables, then
the issue is not likely to be with the client system.
● The MultiLoad job output lists the job phases and other useful information.
Save these listings for evaluation.

2. Use the Teradata RDBMS Query Session utility to monitor the progress of the
MultiLoad job.
3. Check for locks on the MultiLoad target tables and error tables.
4. Check the DBC.Resusage table for problem areas, such as data bus or CPU
capacities at or near 100 percent for one or more processors.
5. Determine whether the target tables have non-unique secondary indexes
(NUSIs). NUSIs degrade MultiLoad performance because the utility builds a
separate NUSI change row to be applied to each NUSI sub-table after all of the
rows have been applied to the primary table.
6. Check the size of the error tables. Write operations to the fallback error tables
are performed at normal SQL speed, which is much slower than normal
MultiLoad tasks.
7. Verify that the primary index is unique. Non-unique primary indexes can cause
severe MultiLoad performance problems
8. Poor performance can happen when the input data is skewed with respect to
the Primary Index of the database. Teradata depends upon random and well
distributed data for data input and retrieval. For example, a file containing a
million rows with a single value 'AAAAAA' for the Primary Index will take an
infinite time to load.
9. One common tool used for determining load issues/skewed data/locks is
Performance Monitor (PMON). PMON requires MONITOR access on the
Teradata system. If you do not have Monitor access, then the DBA can help
you to look at the system.
10. SQL against the system catalog can also be used to determine any
performance bottlenecks. The following query is used to see if the load is
inserting data into the system. Spool space (a type of work space) is built up
as data is transferred to the database. So if the load is going well, the
spool will be built rapidly in the database. Use the following query to check:

SELECT SUM(currentspool) FROM dbc.diskspace
WHERE databasename = '<userid loading the database>';

After the spool has reached its peak, it will fall rapidly as data is inserted from
spool into the table. If the spool grows slowly, then the input data is probably
skewed.

FastExport

FastExport is a bulk export Teradata utility. One way to pull up data for Lookups/
Sources is by using ODBC, since there is no native connectivity to Teradata. However,
ODBC is slow. For higher performance, use FastExport if the number of rows to be
pulled is on the order of a million rows. FastExport writes to a file. The lookup or source
qualifier then reads this file. FastExport is integrated with PowerCenter.

BTEQ

BTEQ is a SQL executor utility similar to SQL*Plus. Like FastExport, BTEQ allows you
to export data to a flat file, but it is suitable for smaller volumes of data. This provides
faster performance than ODBC but doesn't tax Teradata system resources the way
FastExport can. A possible use for BTEQ with PowerCenter is to export smaller
volumes of data to a flat file (i.e., less than 1 million rows). The flat file is then read by
PowerCenter. BTEQ is not integrated with PowerCenter but can be called from a pre-
session script.

TPump

TPump is a load utility primarily intended for streaming data (think of loading bundles
of messages arriving from MQ using PowerCenter Real Time). TPump can also load
from a file or a named pipe.

While FastLoad and MultiLoad are bulk load utilities, TPump is a lightweight utility.
Another important difference between MultiLoad and TPump is that TPump locks at the
row-hash level instead of the table level, thus providing users read access to fresher
data. Teradata states that it has improved the speed of TPump for loading files to be
comparable with that of MultiLoad, so try a test load using TPump first. Also, be
cautious with the use of TPump to load streaming data if the data throughput is large.

Push Down Optimization

PowerCenter embeds a powerful engine with its own memory management system and
optimized algorithms for performing transformation operations such as aggregation,
sorting, joining, and lookup. This is typically referred to as an ETL architecture, where
Extracts, Transformations and Loads are performed: data is extracted from the data
source to the PowerCenter engine (which can be on the same machine as the source or
a separate machine), where all the transformations are applied, and is then pushed to
the target. Some of the performance considerations for this type of architecture are:

● Is the network fast enough and tuned effectively to support the necessary
data transfer?
● Is the hardware on which PowerCenter is running sufficiently robust, with high
processing capability and high memory capacity?

ELT (Extract, Load, Transform) is a relatively new design or runtime paradigm that
became popular with the advent of high-performance RDBMS platforms supporting
both DSS and OLTP workloads. Because Teradata typically runs on well-tuned
operating systems and well-tuned hardware, the ELT paradigm tries to push as much
of the transformation logic as possible onto the Teradata system.

The ELT design paradigm can be achieved through the Pushdown Optimization option
offered with PowerCenter.

ETL or ELT

Because many database vendors and consultants advocate using ELT (Extract, Load
and Transform) over ETL (Extract, Transform and Load), the use of Pushdown
Optimization can be somewhat controversial. Informatica advocates using Pushdown
Optimization as an option to solve specific performance situations rather than as the
default design of a mapping.



The following scenarios can help in deciding on when to use ETL with PowerCenter
and when to use ELT (i.e., Pushdown Optimization):

1. When the load needs to look up only dimension tables then there may be no
need to use Pushdown Optimization. In this context, PowerCenter's ability to
build dynamic, persistent caching is significant. If a daily load involves 10s or
100s of fact files to be loaded throughout the day, then dimension surrogate
keys can be easily obtained from PowerCenter's cache in memory. Compare
this with the cost of running the same dimension lookup queries on the
database.
2. In many cases large Teradata systems contain only a small amount of data. In
such cases there may be no need to push down.
3. When only simple filters or expressions need to be applied on the data then
there may be no need to push down. The special case is that of applying filters
or expression logic to non-unique columns in incoming data in PowerCenter.
Compare this to loading the same data into the database and then applying a
WHERE clause on a non-unique column, which is highly inefficient for a large
table.

The principle here is: Filter and resolve the data AS it gets loaded instead of
loading it into a database, querying the RDBMS to filter/resolve and re-loading it
into the database. In other words, ETL instead of ELT.
4. Pushdown Optimization needs to be considered only if a large set of data
needs to be merged or queried to produce the final load set.

Maximizing Performance using Pushdown Optimization

You can push transformation logic to either the source or target database using
pushdown optimization. The amount of work you can push to the database depends on
the pushdown optimization configuration, the transformation logic, and the mapping
and session configuration.

When you run a session configured for pushdown optimization, the Integration Service
analyzes the mapping and writes one or more SQL statements based on the mapping
transformation logic. The Integration Service analyzes the transformation logic,
mapping, and session configuration to determine the transformation logic it can push to
the database. At run time, the Integration Service executes any SQL statement
generated against the source or target tables, and processes any transformation logic
that it cannot push to the database.

Use the Pushdown Optimization Viewer to preview the SQL statements and mapping
logic that the Integration Service can push to the source or target database. You can
also use the Pushdown Optimization Viewer to view the messages related to
Pushdown Optimization.

Known Issues with Teradata

You may encounter the following problems using ODBC drivers with a Teradata
database:

● Teradata sessions fail if the session requires a conversion to a numeric data


type and the precision is greater than 18.
● Teradata sessions fail when you use full pushdown optimization for a session
containing a Sorter transformation.
● A sort on a distinct key may give inconsistent results if the sort is not case
sensitive and one port is a character port.
● A session containing an Aggregator transformation may produce different
results from PowerCenter if the group by port is a string data type and it is not
case-sensitive.
● A session containing a Lookup transformation fails if it is configured for target-
side pushdown optimization.
● A session that requires type casting fails if the casting is from x to date/time.
● A session that contains a date to string conversion fails

Working with SQL Overrides

You can configure the Integration Service to perform an SQL override with Pushdown
Optimization. To perform an SQL override, you configure the session to create a view.
When you use a SQL override for a Source Qualifier transformation in a session
configured for source or full Pushdown Optimization with a view, the Integration Service
creates a view in the source database based on the override. After it creates the view
in the database, the Integration Service generates a SQL query that it can push to the
database. The Integration Service runs the SQL query against the view to perform
Pushdown Optimization.

Note: To use an SQL override with pushdown optimization, you must configure the
session for pushdown optimization with a view.

Running a Query

If the Integration Service did not successfully drop the view, you can run a query
against the source database to search for the views generated by the Integration
Service. When the Integration Service creates a view, it uses a prefix of PM_V. You
can search for views with this prefix to locate the views created during pushdown
optimization.

Teradata specific SQL:

SELECT TableName FROM DBC.Tables

WHERE CreatorName = USER

AND TableKind ='V'

AND TableName LIKE 'PM\_V%' ESCAPE '\'
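
Any leftover views can then be removed with DROP VIEW statements. As a convenience,
the following sketch (built on the query above) generates the DROP statements for the
current user's leftover pushdown views:

SELECT 'DROP VIEW ' || TRIM(TableName) || ';'
FROM DBC.Tables
WHERE CreatorName = USER
AND TableKind = 'V'
AND TableName LIKE 'PM\_V%' ESCAPE '\';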

Rules and Guidelines for SQL Override

Use the following rules and guidelines when you configure pushdown optimization for a
session containing an SQL override:

Last updated: 01-Feb-07 18:54



Performance Tuning in a Real-Time Environment

Challenge

As Data Integration becomes a broader and more service-oriented Information Technology initiative, real-time and
right-time solutions will become critical to the success of the overall architecture. Tuning real-time processes is often
different than tuning batch processes.

Description

To remain agile and flexible in increasingly competitive environments, today’s companies are dealing with
sophisticated operational scenarios such as consolidation of customer data in real time to support a call center or
the delivery of precise forecasts for supply chain operation optimization. To support such highly demanding
operational environments, data integration platforms must do more than serve analytical data needs. They must
also support real-time, 24x7, mission-critical operations that involve live or current information available across the
enterprise and beyond. They must access, cleanse, integrate and deliver data in real time to ensure up-to-the-
second information availability. Also, data integration platforms must intelligently scale to meet both increasing data
volumes and increasing numbers of concurrent requests that are typical of shared services Integration
Competency Center (ICC) environments. The data integration platforms must also be extremely reliable, providing
high availability to minimize outages and ensure seamless failover and recovery as every minute of downtime can
lead to huge impacts on business operations.

PowerCenter can be used to process data in real time. Real-time processing is on-demand processing of data from
real-time sources. A real-time session reads, processes and writes data to targets continuously. By default, a
session reads and writes bulk data at scheduled intervals unless it is configured for real-time processing.

To process data in real time, the data must originate from a real-time source. Real-time sources include JMS,
WebSphere MQ, TIBCO, webMethods, MSMQ, SAP, and web services. Real-time processing can also be used for
processes that require immediate access to dynamic data (i.e., financial data).

Latency Impact on Performance

Use the Real-time Flush Latency session condition to control the target commit latency when running in real-time
mode. PWXPC commits source data to the target at the end of the specified maximum latency period. This
parameter requires a valid value and has a valid default value.

When the session runs, PWXPC begins to read data from the source. After data is provided to the source qualifier,
the Real-Time Flush Latency interval begins. At the end of each Real-Time Flush Latency interval and an end-UOW
boundary is reached, PWXPC issues a commit to the target. The following message appears in the session log to
indicate that this has occurred:

[PWXPC_10082] [INFO] [CDCDispatcher] raising real-time flush with restart tokens [restart1_token],
[restart2_token] because Real-time Flush Latency [RTF_millisecs] occurred

Only complete UOWs are committed during real-time flush processing.

The commit to the target when reading CDC data is not strictly controlled by the Real-Time Flush Latency
specification. The UOW Count and the Commit Threshold values also determine the commit frequency.

The value specified for Real-Time Flush Latency also controls the PowerExchange Consumer API (CAPI) interface
timeout value (PowerExchange latency) on the source platform. The CAPI interface timeout value is displayed in the
following PowerExchange message on the source platform (and in the session log if “Retrieve PWX Log Entries” is
specified in the Connection Attributes):

PWX-09957 CAPI i/f: Read times out after <n> seconds

The CAPI interface timeout also affects latency as it will affect how quickly changes are returned to the PWXPC
reader by PowerExchange. PowerExchange will ensure that it returns control back to PWXPC at least once every
CAPI interface timeout period. This allows the PWXPC to regain control and, if necessary, perform the real-time
flush of data returned. A high RTF Latency specification will also impact the speed with which stop requests from
PowerCenter are handled as the PWXPC CDC Reader must wait for PowerExchange to return control before it can
handle the stop request.

TIP
Use the PowerExchange STOPTASK command to shutdown more quickly when using a high RTF Latency value.

For example, if the value for Real-Time Flush Latency is 10 seconds, PWXPC will issue a commit for all data read
after 10 seconds have elapsed and the next end-UOW boundary is received. The lower the value is set, the faster
the data is committed to the target. If the lowest possible latency is required for the application of changes to the
target, specify a low Real-Time Flush Latency value.

Warning: When you specify a low Real-Time Flush Latency interval, the session might consume more system
resources on the source and target platforms. This is because:

● The session will commit to the target more frequently, therefore consuming more target resources.
● PowerExchange will return more frequently to the PWXPC reader, thereby passing fewer rows on each
iteration and consuming more resources on the source PowerExchange platform.

Balance performance and resource consumption with latency requirements when choosing the UOW Count and
Real-Time Flush Latency values.

Commit Interval Impact on Performance

Commit Threshold is only applicable to Real-Time CDC sessions. Use the Commit Threshold session condition to
cause commits before reaching the end of the UOW when processing large UOWs. This parameter requires a valid
value and has a valid default value

Commit Threshold can be used to cause a commit before the end of a UOW is received, a process also referred to
as sub-packet commit. The value specified in the Commit Threshold is the number of records within a source UOW
to process before inserting a commit into the change stream. This attribute is different from the UOW Count attribute
in that it is a count records within a UOW rather than complete UOWs. The Commit Threshold counter is reset when
either the number of records specified or the end of the UOW is reached.

This attribute is useful when there are extremely large UOWs in the change stream that might cause locking issues
on the target database or resource issues on the PowerCenter Integration Server.

The Commit Threshold count is cumulative across all sources in the group. This means that sub-packet commits are
inserted into the change stream when the count specified is reached regardless of the number of sources to which
the changes actually apply. For example, a UOW contains 900 changes for one source followed by 100 changes for
a second source and then 500 changes for the first source. If the Commit Threshold is set to 1000, the commit
record is inserted after the 1000th change record which is after the 100 changes for the second source.

Warning: A UOW may contain changes for multiple source tables. Using Commit Threshold can cause commits to
be generated at points in the change stream where the relationship between these tables is inconsistent. This may
then result in target commit failures.

If 0 or no value is specified, commits will occur on UOW boundaries only. Otherwise, the value specified is used to
insert commit records into the change stream between UOW boundaries, where applicable.

The value of this attribute overrides the value specified in the PowerExchange DBMOVER configuration file
parameter SUBCOMMIT_THRESHOLD. For more information on this PowerExchange parameter, refer to the
PowerExchange Reference Manual.

The commit to the target when reading CDC data is not strictly controlled by the Commit Threshold specification.
The commit records inserted into the change stream as a result of the Commit Threshold value affect the UOW
Count counter. The UOW Count and the Real-Time Flush Latency values determine the target commit frequency.

For example, a UOW contains 1,000 change records (any combination of inserts, updates, and deletes). If 100 is
specified for the Commit Threshold and 5 for the UOW Count, then a commit record will be inserted after each 100
records and a target commit will be issued after every 500 records.

Last updated: 29-May-08 18:40



Performance Tuning UNIX Systems

Challenge

Identify opportunities for performance improvement within the complexities of the UNIX
operating environment.

Description

This section provides an overview of the subject area, followed by discussion of the use
of specific tools.

Overview

All system performance issues are fundamentally resource contention issues. In any
computer system, there are three essential resources: CPU, memory, and I/O - namely
disk and network I/O. From this standpoint, performance tuning for PowerCenter
means ensuring that PowerCenter and its sub-processes have adequate resources
to execute in a timely and efficient manner.

Each resource has its own particular set of problems. Resource problems are
complicated because all resources interact with each other. Performance tuning is
about identifying bottlenecks and making trade-offs to improve the situation. The best
approach is to take a baseline measurement first and obtain a good understanding of
how the system behaves, then evaluate the bottlenecks revealed on each system
resource during the load window and remove whichever resource contention offers
the greatest opportunity for performance enhancement.

Here is a summary of each system resource area and the problems it can have.

CPU

● On any multiprocessing and multi-user system, many processes want to use
the CPUs at the same time. The UNIX kernel is responsible for allocation of a
finite number of CPU cycles across all running processes. If the total demand
on the CPU exceeds its finite capacity, then all processing is likely to reflect a
negative impact on performance; the system scheduler puts each process in a
queue to wait for CPU availability.



● An average of the count of active processes in the system for the last 1, 5, and
15 minutes is reported as load average when you execute the command
uptime. The load average provides you a basic indicator of the number of
contenders for CPU time. Likewise vmstat command provides an average
usage of all the CPUs along with the number of processes contending for
CPU (the value under the r column).
● On SMP (symmetric multiprocessing) architecture servers, watch the even
utilization of all the CPUs. How well all the CPUs are utilized depends on how
well an application can be parallelized, If a process is incurring a high degree
of involuntary context switch by the kernel; binding the process to a specific
CPU may improve performance.

Memory

● Memory contention arises when the memory requirements of the active
processes exceed the physical memory available on the system; at this point,
the system is out of memory. To handle this lack of memory, the system starts
paging, or moving portions of active processes to disk in order to reclaim
physical memory. When this happens, performance decreases dramatically.
Paging is distinguished from swapping, which means moving entire processes
to disk and reclaiming their space. Paging and excessive swapping indicate
that the system can't provide enough memory for the processes that are
currently running.
● Commands such as vmstat and pstat show whether the system is paging; ps,
prstat and sar can report the memory requirements of each process.

Disk I/O

● The I/O subsystem is a common source of resource contention problems. A
finite amount of I/O bandwidth must be shared by all the programs (including
the UNIX kernel) that currently run. The system's I/O buses can transfer only
so many megabytes per second; individual devices are even more limited.
Each type of device has its own peculiarities and, therefore, its own problems.
● Tools are available to evaluate specific parts of the I/O subsystem

❍ iostat can give you information about the transfer rates for each disk
drive. ps and vmstat can give some information about how many
processes are blocked waiting for I/O.
❍ sar can provide voluminous information about I/O efficiency.
❍ sadp can give detailed information about disk access patterns.



Network I/O

● The source data, the target data, or both the source and target data are likely
to be connected through an Ethernet channel to the system where
PowerCenter resides. Be sure to consider the number of Ethernet channels
and bandwidth available to avoid congestion.

❍ netstat shows packet activity on a network; watch for a high collision rate of
output packets on each interface.
❍ nfsstat monitors NFS traffic; execute nfsstat -c from a client machine (not
from the NFS server); watch for a high timeout rate relative to total calls and
for “not responding” messages.

Given that these issues all boil down to access to some computing resource, mitigation
of each issue consists of making some adjustment to the environment to provide more
(or preferential) access to the resource; for instance:

● Adjusting execution schedules to allow leverage of low usage times may
improve availability of memory, disk, network bandwidth, CPU cycles, etc.
● Migrating other applications to other hardware is likely to reduce demand on
the hardware hosting PowerCenter.
● For CPU intensive sessions, raising CPU priority (or lowering priority for
competing processes) provides more CPU time to the PowerCenter sessions.
● Adding hardware resources, such as adding memory, can make more
resource available to all processes.
● Re-configuring existing resources may provide for more efficient usage, such
as assigning different disk devices for input and output, striping disk devices,
or adjusting network packet sizes.

Detailed Usage

The following tips have proven useful in performance tuning UNIX-based machines.
While some of these tips are likely to be more helpful than others in a particular
environment, all are worthy of consideration.

Availability, syntax and format of each varies across UNIX versions.

Running ps -axu



Run ps -axu to check for the following items:

● Are there any processes waiting for disk access or for paging? If so check the I/
O and memory subsystems.
● What processes are using most of the CPU? This may help to distribute the
workload better.
● What processes are using most of the memory? This may help to distribute the
workload better.
● Does ps show that your system is running many memory-intensive jobs? Look
for jobs with a large resident set size (RSS) or a high storage integral.

Identifying and Resolving Memory Issues

Use vmstat or sar to check for paging/swapping actions. Check the system to
ensure that excessive paging/swapping does not occur at any time during the session
processing. By using sar 5 10 or vmstat 1 10, you can get a snapshot of paging/
swapping. If paging or excessive swapping does occur at any time, increase memory to
prevent it. Paging/swapping, on any database system, causes a major performance
decrease and increased I/O. On a memory-starved and I/O-bound server, this can
effectively shut down the PowerCenter process and any databases running on the
server.

Some swapping may occur normally regardless of the tuning settings. This occurs
because some processes use the swap space by their design. To check swap space
availability, use pstat and swap. If the swap space is too small for the intended
applications, it should be increased.

Run vmstat 5 (sar -wpgr) for SunOS, or vmstat -S 5, to detect and confirm memory
problems, and check for the following:

● Are page-outs occurring consistently? If so, you are short of memory.


● Are there a high number of address translation faults? (System V only). This
suggests a memory shortage.
● Are swap-outs occurring consistently? If so, you are extremely short of
memory. Occasional swap-outs are normal; BSD systems swap-out inactive
jobs. Long bursts of swap-outs mean that active jobs are probably falling victim
and indicate extreme memory shortage. If you don't have vmstat -S, look at the
w and de fields of vmstat. These should always be zero.

If memory seems to be the bottleneck, try the following remedial steps:



● Reduce the size of the buffer cache (if your system has one) by decreasing
BUFPAGES.
● If you have statically allocated STREAMS buffers, reduce the number of large
(e.g., 2048- and 4096-byte) buffers. This may reduce network performance,
but netstat -m should give you an idea of how many buffers you really need.
● Reduce the size of your kernel's tables. This may limit the system's capacity
(i.e., number of files, number of processes, etc.).
● Try running jobs requiring a lot of memory at night. This may not help the
memory problems, but you may not care about them as much.
● Try running jobs requiring a lot of memory in a batch queue. If only one
memory-intensive job is running at a time, your system may perform
satisfactorily.
● Try to limit the time spent running sendmail, which is a memory hog.
● If you don't see any significant improvement, add more memory.

Identifying and Resolving Disk I/O Issues

Use iostat to check I/O load and utilization as well as CPU load. Iostat can be used
to monitor the I/O load on the disks on the UNIX server. Using iostat permits monitoring
the load on specific disks. Take notice of how evenly disk activity is distributed among
the system disks. If it is not, are the most active disks also the fastest disks?

Run sadp to get a seek histogram of disk activity. Is activity concentrated in one area
of the disk (good), spread evenly across the disk (tolerable), or in two well-defined
peaks at opposite ends (bad)?

● Reorganize your file systems and disks to distribute I/O activity as evenly as
possible.
● Using symbolic links helps to keep the directory structure the same throughout
while still moving the data files that are causing I/O contention.
● Use your fastest disk drive and controller for your root file system; this almost
certainly has the heaviest activity. Alternatively, if single-file throughput is
important, put performance-critical files into one file system and use the fastest
drive for that file system.
● Put performance-critical files on a file system with a large block size: 16KB or
32KB (BSD).
● Increase the size of the buffer cache by increasing BUFPAGES (BSD). This
may hurt your system's memory performance.



● Rebuild your file systems periodically to eliminate fragmentation (i.e., backup,
build a new file system, and restore).
● If you are using NFS and using remote files, look at your network situation.
You don’t have local disk I/O problems.
● Check memory statistics again by running vmstat 5 (sar -rwpg). If your system
is paging or swapping consistently, you have memory problems; fix the memory
problem first. Swapping makes performance worse.

If your system has a disk capacity problem and is constantly running out of disk space,
try the following actions:

● Write a find script that detects old core dumps, editor backup and auto-save
files, and other trash and deletes it automatically. Run the script through cron.
● Use the disk quota system (if your system has one) to prevent individual users
from gathering too much storage.
● Use a smaller block size on file systems that are mostly small files (e.g.,
source code files, object modules, and small data files).

Identifying and Resolving CPU Overload Issues

Use uptime or sar -u to check for CPU loading. sar provides more detail, including
%usr (user), %sys (system), %wio (waiting on I/O), and %idle (% of idle time). A target
goal should be %usr + %sys = 80 and %wio = 10, leaving %idle at 10.

If %wio is higher, the disk and I/O contention should be investigated to eliminate I/O
bottleneck on the UNIX server. If the system shows a heavy load of %sys, and %usr
has a high %idle, this is indicative of memory contention and swapping/paging
problems. In this case, it is necessary to make memory changes to reduce the load on
the system server.

When you run iostat 5, also watch for CPU idle time. Is the idle time always 0, without
letup? It is good for the CPU to be busy, but if it is always busy 100 percent of the
time, work must be piling up somewhere. This points to CPU overload.

● Eliminate unnecessary daemon processes. rwhod and routed are particularly


likely to be performance problems, but any savings will help.
● Get users to run jobs at night with at or any queuing system that's available.
You may not care if the CPU (or the memory or I/O system) is overloaded at
night, provided the work is done in the morning.



● Using nice to lower the priority of CPU-bound jobs improves interactive
performance. Also, using nice to raise the priority of CPU-bound
jobs expedites them but may hurt interactive performance. In general though,
using nice is really only a temporary solution. If your workload grows, it will
soon become insufficient. Consider upgrading your system, replacing it, or
buying another system to share the load.

Identifying and Resolving Network I/O Issues

Suspect problems with network capacity or with data integrity if users experience
slow performance when they are using rlogin or when they are accessing files via NFS.

Look at netstat -i. If the number of collisions is large, suspect an overloaded network. If
the number of input or output errors is large, suspect hardware problems. A large
number of input errors indicate problems somewhere on the network. A large number
of output errors suggests problems with your system and its interface to the network.

If collisions and network hardware are not a problem, figure out which system
appears to be slow. Use spray to send a large burst of packets to the slow system. If
the number of dropped packets is large, the remote system most likely cannot respond
to incoming data fast enough. Look to see if there are CPU, memory or disk I/O
problems on the remote system. If not, the system may just not be able to tolerate
heavy network workloads. Try to reorganize the network so that this system isn’t a file
server.

A large number of dropped packets may also indicate data corruption. Run netstat -s
on the remote system, then spray the remote system from the local system and run
netstat -s again. If the increase of UDP socket full drops (as indicated by netstat) is
equal to or greater than the number of dropped packets that spray reports, the remote
system is a slow network server. If the increase of socket full drops is less than the
number of dropped packets, look for network errors.

Run nfsstat and look at the client RPC data. If the retrans field is more than 5 percent
of calls, the network or an NFS server is overloaded. If timeout is high, at least one
NFS server is overloaded, the network may be faulty, or one or more servers may have
crashed. If badxid is roughly equal to timeout, at least one NFS server is overloaded. If
timeout and retrans are high, but badxid is low, some part of the network between the
NFS client and server is overloaded and dropping packets.

Try to prevent users from running I/O-intensive programs across the network.
The grep utility is a good example of an I/O-intensive program. Instead, have users log
into the remote system to do their work.



Reorganize the computers and disks on your network so that as many users as
possible can do as much work as possible on a local system.

Use systems with good network performance as file servers.

lsattr -E -l sys0 is used to determine some current settings on some UNIX
environments. (In Solaris, you execute prtenv.) Of particular interest is maxuproc, the
setting that determines the maximum number of user background processes. On most
UNIX environments, this is defaulted to 40, but should be increased to 250 on most
systems.

Choose a file system. Be sure to check the database vendor documentation to
determine the best file system for the specific machine. Typical choices include: s5, the
UNIX System V file system; ufs, the UNIX file system derived from Berkeley (BSD);
vxfs, the Veritas file system; and lastly raw devices which, in reality, are not a file system
at all. Additionally, for the PowerCenter Enterprise Grid Option, cluster file system
(CFS) products such as GFS for RedHat Linux, Veritas CFS, and GPFS for IBM AIX
are some of the available choices.

Cluster File System Tuning

In order to take full advantage of the PowerCenter Enterprise Grid Option, a cluster file
system (CFS) is recommended. The PowerCenter Grid option requires the directories
for each Integration Service to be shared with other servers. This allows Integration
Services to share files such as cache files between different session runs. CFS
performance is a result of tuning parameters and tuning the infrastructure. Therefore,
using the parameters recommended by each CFS vendor is the best approach for CFS
tuning.

PowerCenter Options

The Integration Service Monitor is available to display system resource usage
information about associated nodes. The window displays resource usage information
about the running tasks, including CPU%, memory, and swap usage.

The PowerCenter 64-bit option can allocate more memory to sessions and achieve
higher throughputs compared to 32-bit version of PowerCenter.

Last updated: 06-Dec-07 15:16



Performance Tuning Windows 2000/2003 Systems

Challenge

Windows Server is designed as a self-tuning operating system. Standard installation
of Windows Server provides good performance out-of-the-box, but optimal performance
can be achieved by tuning.

Note: Tuning is essentially the same for both Windows 2000 and 2003-based systems.

Description

The following tips have proven useful in performance-tuning Windows Servers. While
some are likely to be more helpful than others in any particular environment, all are
worthy of consideration.

The two places to begin tuning a Windows server are:

● Performance Monitor.
● Performance tab (hit ctrl+alt+del, choose task manager, and click on the
Performance tab).

Although the Performance Monitor can be tracked in real-time, creating a result-set
representative of a full day is more likely to render an accurate view of system
performance.

Resolving Typical Windows Server Problems

The following paragraphs describe some common performance problems in a Windows
Server environment and suggest tuning solutions.

Server Load: Assume that some software will not be well coded and that some
background processes (e.g., a mail server or web server) running on the same machine
can potentially starve the machine's CPUs. In this situation, off-loading the CPU hogs
may be the only recourse.

Device Drivers: The device drivers for some types of hardware are notorious for
wasting CPU clock cycles. Be sure to obtain the latest drivers from the hardware
vendor to minimize this problem.

Memory and services: Although adding memory to Windows Server is always a good
solution, it is also expensive and usually must be planned in advance. Before adding
memory, check the Services in Control Panel because many background applications
do not uninstall the old service when installing a new version. Thus, both the unused
old service and the new service may be consuming valuable CPU and memory resources.

I/O Optimization: This is, by far, the best tuning option for database applications in
the Windows Server environment. If necessary, level the load across the disk devices
by moving files. In situations where there are multiple controllers, be sure to level the
load across the controllers too.

Using electrostatic devices and fast-wide SCSI can also help to increase performance.
Further, fragmentation can usually be eliminated by using a Windows Server disk
defragmentation product.

Finally, on Windows Servers, be sure to implement disk striping to split single data files
across multiple disk drives and take advantage of RAID (Redundant Arrays of
Inexpensive Disks) technology. Also increase the priority of the disk devices on the
Windows Server. Windows Server, by default, sets the disk device priority low.

Monitoring System Performance in Windows Server

In Windows Server, PowerCenter uses system resources for transformation processing,
session execution, and reading and writing of data. The PowerCenter Integration
Service also uses system memory for other data such as aggregate, joiner, rank, and
cached lookup tables. With Windows Server, you can use the System Monitor in the
Performance console of the Administrative Tools, or the system tools in Task Manager,
to monitor the amount of system resources used by PowerCenter and to identify
system bottlenecks.

Windows Server provides the following tools (accessible under Control Panel >
Administrative Tools > Performance) for monitoring resource usage on your computer:

● System Monitor
● Performance Logs and Alerts

These Windows Server monitoring tools enable you to analyze usage and detect
bottlenecks at the disk, memory, processor, and network level.

System Monitor

The System Monitor displays a flexible, configurable graph. You can copy
counter paths and settings from the System Monitor display to the Clipboard and paste
counter paths from Web pages or other sources into the System Monitor display.
Because the System Monitor is portable, it is useful for monitoring other systems that
require administration.

Performance Logs and Alerts

The Performance Logs and Alerts tool provides two types of performance-related logs—
counter logs and trace logs—and an alerting function.

Counter logs record sampled data about hardware resources and system services
based on performance objects and counters in the same manner as System Monitor.
They can, therefore, be viewed in System Monitor. Data in counter logs can be saved
as comma-separated or tab-separated files that are easily viewed with Excel.

Trace logs collect event traces that measure performance statistics associated with
events such as disk and file I/O, page faults, or thread activity. The alerting function
allows you to define a counter value that will trigger actions such as sending a network
message, running a program, or starting a log. Alerts are useful if you are not actively
monitoring a particular counter threshold value but want to be notified when it exceeds
or falls below a specified value so that you can investigate and determine the cause of
the change. You may want to set alerts based on established performance baseline
values for your system.

Note: You must have Full Control access to a subkey in the registry in order to create
or modify a log configuration. (The subkey is HKEY_LOCAL_MACHINE\SYSTEM
\CurrentControlSet\Services\SysmonLog\Log_Queries).

The predefined log settings under Counter Logs (i.e., System Overview) are configured
to create a binary log that, after manual start-up, updates every 15 seconds and logs
continuously until it reaches a maximum size. If you start logging with the default
settings, data is saved to the Perflogs folder on the root directory and includes the
counters: Memory\ Pages/sec, PhysicalDisk(_Total)\Avg. Disk Queue Length, and
Processor(_Total)\ % Processor Time.
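
On Windows Server 2003, the same counters can also be captured from the command line with the typeperf utility. A minimal sketch (the interval and output path are arbitrary choices, not recommendations):

typeperf "\Memory\Pages/sec" "\PhysicalDisk(_Total)\Avg. Disk Queue Length" "\Processor(_Total)\% Processor Time" -si 15 -o C:\Perflogs\baseline.csv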

To create your own log settings, right-click one of the log types.

PowerCenter Options

The Integration Service Monitor is available to display system resource usage
information about associated nodes. The window displays resource usage information
about running tasks, including CPU%, memory, and swap usage.

PowerCenter's 64-bit option running on Intel Itanium processor-based machines and 64-
bit Windows Server 2003 can allocate more memory to sessions and achieve higher
throughputs than the 32-bit version of PowerCenter on Windows Server.

Using the PowerCenter Grid Option on Windows Server enables distribution of a session or
sessions in a workflow to multiple servers and reduces the processing load window.
The PowerCenter Grid Option requires that the directories for each Integration Service
be shared with other servers. This allows Integration Services to share files such as
cache files among various session runs. With a Cluster File System (CFS), Integration
Services running on various servers can perform concurrent reads and writes to the
same block of data.

Last updated: 01-Feb-07 18:54

Recommended Performance Tuning Procedures

Challenge

To optimize PowerCenter load times by employing a series of performance tuning
procedures.

Description

When a PowerCenter session or workflow is not performing at the expected or desired
speed, there is a methodology that can help to diagnose problems that may be
adversely affecting various components of the data integration architecture. While
PowerCenter has its own performance settings that can be tuned, you must consider
the entire data integration architecture, including the UNIX/Windows servers, network,
disk array, and the source and target databases to achieve optimal performance. More
often than not, an issue external to PowerCenter is the cause of the performance
problem. In order to correctly and scientifically determine the most logical cause of the
performance problem, you need to execute the performance tuning steps in a specific
order. This enables you to methodically rule out individual pieces and narrow down the
specific areas on which to focus your tuning efforts.

1. Perform Benchmarking

You should always have a baseline of current load times for a given workflow or
session with a similar row count. Perhaps you are not achieving your required load
window, or you simply suspect your processes could run more efficiently based on
comparison with other similar tasks that run faster. Use the benchmark to estimate your
desired performance goal and tune to that goal. Begin with the problem
mapping that you created, along with a session and workflow that use all default
settings. This helps to identify which changes have a positive impact on performance.

2. Identify the Performance Bottleneck Area

This step helps to narrow down the areas on which to focus further. Follow the areas
and sequence below when attempting to identify the bottleneck:

● Target
● Source
● Mapping
● Session/Workflow
● System

The methodology steps you through a series of tests using PowerCenter to identify
trends that point to where to focus next. Go through these tests in a scientific
manner: run them multiple times before reaching any conclusion,
and always keep in mind that fixing one bottleneck area may create a different
bottleneck. For more information, see Determining Bottlenecks.

3. "Inside" or "Outside" PowerCenter

Depending on the results of the bottleneck tests, optimize “inside” or “outside”
PowerCenter. Be sure to perform the bottleneck tests in the order prescribed
in Determining Bottlenecks, since this is also the order in which you should make any
performance changes.

Problems “outside” PowerCenter refer to anything that indicates the source of the
performance problem is external to PowerCenter. The most common performance
problems “outside” PowerCenter are source or target database problems, network
bottlenecks, and server or operating system problems.

● For source database related bottlenecks, refer to Tuning SQL Overrides and
Environment for Better Performance
● For target database related problems, refer to Performance Tuning Databases
- Oracle, SQL Server, or Teradata
● For operating system problems, refer to Performance Tuning UNIX Systems
or Performance Tuning Windows 2000/2003 Systems for more information.

Problems “inside” PowerCenter refer to anything that PowerCenter controls, such as
the actual transformation logic and the PowerCenter workflow/session settings. The session
settings contain quite a few memory settings and partitioning options that can greatly
improve performance. Refer to Tuning Sessions for Better Performance for more
information.

Although there are certain procedures to follow to optimize mappings, keep in mind
that, in most cases, the mapping design is dictated by business logic; there may be a
more efficient way to perform the business logic within the mapping, but you cannot
ignore the necessary business logic to improve performance. Refer to Tuning
Mappings for Better Performance for more information.

4. Re-Execute the Problem Workflow or Session

After you have completed the recommended steps for each relevant performance
bottleneck, re-run the problem workflow or session and compare load performance
against the benchmark baseline. This step is iterative,
and should be performed after any performance-based setting is changed. You are
trying to answer the question, “Did the performance change have a positive impact?” If
so, move on to the next bottleneck. Be sure to prepare detailed documentation at every
step along the way so you have a clear record of what was and wasn't tried.

While it may seem like there are an enormous number of areas where a performance
problem can arise, if you follow the steps for finding the bottleneck(s) and apply the
tuning techniques specific to each, you are likely to improve performance and achieve your
desired goals.

Last updated: 01-Feb-07 18:54

Tuning and Configuring Data Analyzer and Data Analyzer Reports

Challenge

A Data Analyzer report that is slow to return data means lag time to a manager or business analyst. It can be a
crucial point of failure in the acceptance of a data warehouse. This Best Practice offers some suggestions for tuning
Data Analyzer and Data Analyzer reports.

Description

Performance tuning of reports occurs at both the environment level and the report level. Often, report performance
can be enhanced by looking closely at the objective of the report rather than its suggested appearance. The
following guidelines should help with tuning the environment and the report itself.

1. Perform Benchmarking. Benchmark the reports to determine an expected rate of return. Perform
benchmarks at various points throughout the day and evening hours to account for inconsistencies in
network traffic, database server load, and application server load. This provides a baseline to measure
changes against.
2. Review Report. Confirm that all data elements are required in the report. Eliminate any unnecessary data
elements, filters, and calculations. Also be sure to remove any extraneous charts or graphs. Consider if the
report can be broken into multiple reports or presented at a higher level. These are often ways to create
more visually appealing reports and allow for linked detail reports or drill down to detail level.
3. Scheduling of Reports. If the report is on-demand but can be changed to a scheduled report, schedule the
report to run during hours when the system use is minimized. Consider scheduling large numbers of reports
to run overnight. If mid-day updates are required, test the performance at lunch hours and consider
scheduling for that time period. Reports that require filters by users can often be copied and filters pre-
created to allow for scheduling of the report.
4. Evaluate Database. Database tuning occurs on multiple levels. Begin by reviewing the tables used in the
report. Ensure that indexes have been created on dimension keys. If filters are used on attributes, test the
creation of secondary indices to improve the efficiency of the query. Next, execute reports while a DBA
monitors the database environment. This provides the DBA the opportunity to tune the database for
querying. Finally, look into changes in database settings. Increasing the database memory in the initialization
file often improves Data Analyzer performance significantly.
5. Investigate Network. Reports are simply database queries, which can be viewed by clicking the "View SQL"
button on the report. Run the report's query against the database using a client tool on the server on which
the database resides. One caveat is that even the database tool on the server may contact the
outside network. Work with the DBA during this test to use a local database connection (e.g., Bequeath/
IPC, Oracle's local database communication protocol) and monitor the database throughout this process.
This test may pinpoint whether the bottleneck is occurring on the network or in the database (see the timing
sketch that follows these guidelines). If, for instance, the query performs well regardless of where it is
executed, but the report continues to be slow, this indicates an application server bottleneck. Common
locations for network bottlenecks include router tables, web server demand, and server input/output.
Informatica recommends installing Data Analyzer on a dedicated application server.
6. Tune the Schema. Having tuned the environment and minimized the report requirements, the final level of
tuning involves changes to the database tables. Review the underperforming reports.

Can any of these be generated from aggregate tables instead of from base tables? Data Analyzer makes
efficient use of linked aggregate tables by determining on a report-by-report basis if the report can utilize an
aggregate table. By studying the existing reports and future requirements, you can determine what key
aggregates can be created in the ETL tool and stored in the database.

Calculated metrics can also be created in an ETL tool and stored in the database instead of being created in Data
Analyzer. Each time a calculation must be done in Data Analyzer, it is performed as part of the query
process. To determine if a query can be improved by building these elements in the database, try removing
them from the report and comparing report performance. Consider whether these elements appear in a
multitude of reports or only in a few.

7. Database Queries. As a last resort for under-performing reports, you may want to edit the actual report
query. To determine if the query is the bottleneck, select the View SQL button on the report. Next, copy the
SQL into a query utility and execute it. (DBA assistance may be beneficial here.) If the query appears to be
the bottleneck, revisit Steps 2 and 6 above to ensure that no additional report changes are possible. Once
you have confirmed that the report is as required, work to edit the query while continuing to re-test it in a
query utility. Additional options include utilizing database views to cache data prior to report generation;
reports are then built based on the view.

Note: Editing the report query requires query editing for each report change and may require editing during
migrations. Be aware that this is a time-consuming process and a difficult-to-maintain method of performance tuning.

The Data Analyzer repository database should be tuned for an OLTP workload.
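
As a minimal timing sketch for the network test in guideline 5 (this assumes an Oracle data source reachable through SQL*Plus; the credentials, service name, and script name are hypothetical, and the query itself can be pasted from the report's View SQL button):

# On the database server, using a local Bequeath/IPC connection (assumes ORACLE_SID is set; no network involved):
time sqlplus -s report_user/report_pwd @report_query.sql
# From the Data Analyzer application server, over the network, and compare the elapsed times:
time sqlplus -s report_user/report_pwd@DW_SERVICE @report_query.sql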

Tuning Java Virtual Machine (JVM)

JVM Layout

The Java Virtual Machine (JVM) heap is the repository for all live objects, dead objects, and free memory. The JVM
has the following primary jobs:

● Execute code
● Manage memory
● Remove garbage objects

The size of the JVM heap determines how often and how long garbage collection runs.

The JVM parameters can be set in "startWebLogic.cmd" or "startWebLogic.sh" if you are using the WebLogic application
server.

Parameters of the JVM

1. The -Xms and -Xmx parameters define the minimum and maximum heap size; for large applications like Data
Analyzer, the values should be set equal to each other.
2. Start with -Xms512m -Xmx512m; as needed, increase the heap by 128m or 256m to reduce garbage collection.
3. The permanent generation holds the JVM's class and method objects; the -XX:MaxPermSize command-line
parameter controls the permanent generation's size.
4. The "NewSize" and "MaxNewSize" parameters control the new generation's minimum and maximum size.
5. -XX:NewRatio=5 divides the old-to-new generations in a ratio of 5:1 (i.e., the old generation occupies 5/6 of the
heap while the new generation occupies 1/6 of the heap).

When the new generation fills up, it triggers a minor collection, in which surviving
objects are moved to the old generation.

When the old generation fills up, it triggers a major collection, which involves the entire
object heap. This is more expensive in terms of resources than a minor collection.

6. If you increase the new generation size, the old generation size decreases. Minor collections occur less
often, but the frequency of major collections increases.
7. If you decrease the new generation size, the old generation size increases. Minor collections occur more
often, but the frequency of major collections decreases.
8. As a general rule, keep the new generation smaller than half the heap size (i.e., 1/4 or 1/3 of the heap size).
9. Enable additional JVMs if you expect large numbers of users. Informatica typically recommends two to three
CPUs per JVM.
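
As an illustration only, a minimal sketch of how these flags might be combined in startWebLogic.sh (the variable name and any existing options vary by WebLogic version and installation; the values shown are starting points, not fixed recommendations):

# Equal -Xms/-Xmx avoids heap resizing; MaxPermSize and NewRatio are illustrative starting points.
JAVA_OPTIONS="-Xms512m -Xmx512m -XX:MaxPermSize=128m -XX:NewRatio=5 -verbose:gc $JAVA_OPTIONS"
export JAVA_OPTIONS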

Other Areas to Tune

Execute Threads

Threads available to process simultaneous operations in WebLogic.

Too few threads means CPUs are under-utilized and jobs are waiting for threads to become
available.

Too many threads means the system wastes resources managing threads, and the OS performs
unnecessary context switching.

The default is 15 threads. Informatica recommends using the default value, but you may need
to experiment to determine the optimal value for your environment.

Connection Pooling

The application borrows a connection from the pool, uses it, and then returns it to the pool by closing it.

● Initial capacity = 15
● Maximum capacity = 15
● The sum of connections across all pools should equal the number of execute threads.

Connection pooling avoids the overhead of growing and shrinking the pool size dynamically by setting the initial and
maximum pool size to the same level.

Performance packs use platform-optimized (i.e., native) sockets to improve server performance. They are available
on: Windows NT/2000 (installed by default), Solaris 2.6/2.7, AIX 4.3, HP/UX, and Linux.

● Check Enable Native I/O on the server attribute tab.
● This adds <NativeIOEnabled> to config.xml with a value of true.

For WebSphere, use the Performance Tuner to modify the configurable parameters.

For optimal configuration, place the application server, the data warehouse database, and the repository
database on separate dedicated machines.

Application Server-Specific Tuning Details

JBoss Application Server

Web Container. Tune the web container by modifying the following configuration file so that it accepts a reasonable
number of HTTP requests, as required by the Data Analyzer installation. Ensure that the web container has an
optimal number of threads available so that it can accept and process more HTTP requests.

<JBOSS_HOME>/server/informatica/deploy/jbossweb-tomcat.sar/META-INF/jboss-service.xml

The following is a typical configuration:

<!-- A HTTP/1.1 Connector on port 8080 -->
<Connector className="org.apache.coyote.tomcat4.CoyoteConnector" port="8080" minProcessors="10"
maxProcessors="100" enableLookups="true" acceptCount="20" debug="0" tcpNoDelay="true"
bufferSize="2048" connectionLinger="-1" connectionTimeout="20000" />

The following parameters may need tuning:

minProcessors. Number of threads created initially in the pool.


maxProcessors. Maximum number of threads that can ever be created in the pool.

acceptCount. Controls the length of the queue of waiting requests when no more threads are
available from the pool to process the request.

connectionTimeout. Amount of time to wait before a URI is received from the stream.
The default is 20 seconds. This avoids problems where a client opens a connection and does not
send any data.

tcpNoDelay. Set to true when data should be sent to the client without waiting for the buffer
to be full. This reduces latency at the cost of more packets being sent over the network. The
default is true.

enableLookups. Determines whether a reverse DNS lookup is performed. This can be enabled
to help prevent IP spoofing, but enabling this parameter can cause problems when a DNS server is
misbehaving. The enableLookups parameter can be turned off when you implicitly trust all
clients.

connectionLinger. How long connections should linger after they are closed. Informatica
recommends using the default value: -1 (no linger).

In the Data Analyzer application, each web page can potentially generate more than one request to the application
server. Hence, maxProcessors should always be higher than the actual number of concurrent users. For an
installation with 20 concurrent users, a minProcessors of 5 and a maxProcessors of 100 are suitable values.

If the number of threads is too low, the following message may appear in the log files:

ERROR [ThreadPool] All threads are busy, waiting. Please increase maxThreads

JSP Optimization. To avoid having the application server compile JSP scripts when they are executed for the first
time, Informatica ships Data Analyzer with pre-compiled JSPs.

The JSP servlet is configured in the following file:

<JBOSS_HOME>/server/informatica/deploy/jbossweb-tomcat.sar/web.xml

The following is a typical configuration:

<servlet>
<servlet-name>jsp</servlet-name>
<servlet-class>org.apache.jasper.servlet.JspServlet</servlet-class>
<init-param>
<param-name>logVerbosityLevel</param-name>
<param-value>WARNING</param-value>
</init-param>
<init-param>
<param-name>development</param-name>
<param-value>false</param-value>
</init-param>
<load-on-startup>3</load-on-startup>
</servlet>

The following parameter may need tuning:

Set the development parameter to false in a production installation.

Database Connection Pool. Data Analyzer accesses the repository database to retrieve metadata information.
When it runs reports, it accesses the data sources to get the report information. Data Analyzer keeps a pool of
database connections for the repository. It also keeps a separate database connection pool for each data source. To
optimize Data Analyzer database connections, you can tune the database connection pools.

Repository Database Connection Pool. To optimize the repository database connection pool, modify the JBoss
configuration file:

<JBOSS_HOME>/server/informatica/deploy/<DB_Type>_ds.xml

The name of the file includes the database type. <DB_Type> can be Oracle, DB2, or other databases. For example,
for an Oracle repository, the configuration file name is oracle_ds.xml. With some versions of Data Analyzer, the
configuration file may simply be named DataAnalyzer-ds.xml.

The following is a typical configuration:

<datasources>
<local-tx-datasource>
<jndi-name>jdbc/IASDataSource</jndi-name>
<connection-url> jdbc:informatica:oracle://aries:1521;SID=prfbase8</connection-url>
<driver-class>com.informatica.jdbc.oracle.OracleDriver</driver-class>
<user-name>powera</user-name>
<password>powera</password>
<exception-sorter-class-name>org.jboss.resource.adapter.jdbc.vendor.OracleExceptionSorter
</exception-sorter-class-name>
<min-pool-size>5</min-pool-size>
<max-pool-size>50</max-pool-size>
<blocking-timeout-millis>5000</blocking-timeout-millis>
<idle-timeout-minutes>1500</idle-timeout-minutes>
</local-tx-datasource>
</datasources>

The following parameters may need tuning:

min-pool-size. The minimum number of connections in the pool. (The pool is lazily
constructed, that is, it will be empty until it is first accessed. Once used, it will always have at
least the min-pool-size connections.)

max-pool-size. The strict maximum size of the connection pool.

blocking-timeout-millis. The maximum time in milliseconds that a caller waits to get a
connection when no more free connections are available in the pool.

idle-timeout-minutes. The length of time an idle connection remains in the pool before it is
closed and removed from the pool.

The max-pool-size value should be at least five more than the maximum number of concurrent users,
because there may be several scheduled reports running in the background and each of them needs a database
connection.

A higher value is recommended for idle-timeout-minutes. Because Data Analyzer accesses the repository very
frequently, it is inefficient to spend resources on checking for idle connections and cleaning them out. Checking for
idle connections may block other threads that require new connections.

Data Source Database Connection Pool. Similar to the repository database connection pools, the data source
also has a pool of connections that Data Analyzer dynamically creates as soon as the first client requests a
connection.

The tuning parameters for these dynamic pools are present in the following file:

<JBOSS_HOME>/bin/IAS.properties

The following is a typical configuration:

#
# Datasource definition
#
dynapool.initialCapacity=5
dynapool.maxCapacity=50
dynapool.capacityIncrement=2
dynapool.allowShrinking=true
dynapool.shrinkPeriodMins=20
dynapool.waitForConnection=true
dynapool.waitSec=1
dynapool.poolNamePrefix=IAS_
dynapool.refreshTestMinutes=60
datamart.defaultRowPrefetch=20

The following JBoss-specific parameters may need tuning:

dynapool.initialCapacity. The minimum number of initial connections in the data source pool.

dynapool.maxCapacity. The maximum number of connections that the data source pool may
grow to.

dynapool.poolNamePrefix. This parameter is a prefix added to the dynamic JDBC pool name
for identification purposes.

dynapool.waitSec. The maximum amount of time (in seconds) a client will wait to grab a
connection from the pool if none is readily available.

dynapool.refreshTestMinutes. This parameter determines the frequency at which a health
check is performed on the idle connections in the pool. This should not be performed too
frequently because it locks up the connection pool and may prevent other clients from
grabbing connections from the pool.

dynapool.shrinkPeriodMins. This parameter determines the amount of time (in minutes) an
idle connection is allowed to be in the pool. After this period, the number of connections in the
pool shrinks back to the value of its initialCapacity parameter. This is done only if the
allowShrinking parameter is set to true.

EJB Container

Data Analyzer uses EJBs extensively. It has more than 50 stateless session beans (SLSB) and more than 60 entity
beans (EB). In addition, there are six message-driven beans (MDBs) that are used for the scheduling and real-time
functionalities.

Stateless Session Beans (SLSB). For SLSBs, the most important tuning parameter is the EJB pool. You can tune
the EJB pool parameters in the following file:

<JBOSS_HOME>/server/Informatica/conf/standardjboss.xml.

The following is a typical configuration:

<container-configuration>
<container-name> Standard Stateless SessionBean</container-name>
<call-logging>false</call-logging>
<invoker-proxy-binding-name>
stateless-rmi-invoker</invoker-proxy-binding-name>
<container-interceptors>
<interceptor>org.jboss.ejb.plugins.ProxyFactoryFinderInterceptor
</interceptor>
<interceptor> org.jboss.ejb.plugins.LogInterceptor</interceptor>

<interceptor>
org.jboss.ejb.plugins.SecurityInterceptor</interceptor>
<!-- CMT -->
<interceptor transaction="Container">
org.jboss.ejb.plugins.TxInterceptorCMT</interceptor>
<interceptor transaction="Container" metricsEnabled="true">
org.jboss.ejb.plugins.MetricsInterceptor</interceptor>
<interceptor transaction="Container">
org.jboss.ejb.plugins.StatelessSessionInstanceInterceptor
</interceptor>
<!-- BMT -->
<interceptor transaction="Bean">
org.jboss.ejb.plugins.StatelessSessionInstanceInterceptor
</interceptor>
<interceptor transaction="Bean">
org.jboss.ejb.plugins.TxInterceptorBMT</interceptor>
<interceptor transaction="Bean" metricsEnabled="true">
org.jboss.ejb.plugins.MetricsInterceptor</interceptor>
<interceptor>
org.jboss.resource.connectionmanager.CachedConnectionInterceptor
</interceptor>
</container-interceptors>
<instance-pool>
org.jboss.ejb.plugins.StatelessSessionInstancePool</instance-pool>
<instance-cache></instance-cache>
<persistence-manager></persistence-manager>
<container-pool-conf>
<MaximumSize>100</MaximumSize>
</container-pool-conf>
</container-configuration>

The following parameter may need tuning:

MaximumSize. Represents the maximum number of objects in the pool. If
<strictMaximumSize> is set to true, then <MaximumSize> is a strict upper limit for the
number of objects that can be created. If <strictMaximumSize> is set to false, the number of
active objects can exceed the <MaximumSize> if there are requests for more objects.
However, only the <MaximumSize> number of objects can be returned to the pool.

Additionally, there are two other parameters that you can set to fine tune the EJB pool. These
two parameters are not set by default in Data Analyzer. They can be tuned after you have
performed proper iterative testing in Data Analyzer to increase the throughput for high-
concurrency installations.

strictMaximumSize. When the value is set to true, the <strictMaximumSize> enforces a rule
that only <MaximumSize> number of objects can be active. Any subsequent requests must
wait for an object to be returned to the pool.

strictTimeout. If you set <strictMaximumSize> to true, then <strictTimeout> is the amount
of time that requests wait for an object to be made available in the pool.

Message-Driven Beans (MDB). MDB tuning parameters are very similar to stateless bean tuning parameters. The
main difference is that MDBs are not invoked by clients. Instead, the messaging system delivers messages to the
MDB when they are available.

To tune the MDB parameters, modify the following configuration file:

<JBOSS_HOME>/server/informatica/conf/standardjboss.xml

The following is a typical configuration:

<container-configuration>
<container-name>Standard Message Driven Bean</container-name>
<call-logging>false</call-logging>
<invoker-proxy-binding-name>message-driven-bean
</invoker-proxy-binding-name>
<container-interceptors>
<interceptor>org.jboss.ejb.plugins.ProxyFactoryFinderInterceptor
</interceptor>
<interceptor>org.jboss.ejb.plugins.LogInterceptor</interceptor>
<interceptor>org.jboss.ejb.plugins.RunAsSecurityInterceptor
</interceptor>
<!-- CMT -->
<interceptor transaction="Container">
org.jboss.ejb.plugins.TxInterceptorCMT</interceptor>
<interceptor transaction="Container" metricsEnabled="true">
org.jboss.ejb.plugins.MetricsInterceptor
</interceptor>
<interceptor transaction="Container">
org.jboss.ejb.plugins.MessageDrivenInstanceInterceptor
</interceptor>
<!-- BMT -->
<interceptor transaction="Bean">
org.jboss.ejb.plugins.MessageDrivenInstanceInterceptor
</interceptor>
<interceptor transaction="Bean">
org.jboss.ejb.plugins.MessageDrivenTxInterceptorBMT
</interceptor>
<interceptor transaction="Bean" metricsEnabled="true">
org.jboss.ejb.plugins.MetricsInterceptor</interceptor>
<interceptor>
org.jboss.resource.connectionmanager.CachedConnectionInterceptor
</interceptor>
</container-interceptors>
<instance-pool>org.jboss.ejb.plugins.MessageDrivenInstancePool
</instance-pool>
<instance-cache></instance-cache>
<persistence-manager></persistence-manager>
<container-pool-conf>
<MaximumSize>100</MaximumSize>
</container-pool-conf>
</container-configuration>

The following parameter may need tuning:

MaximumSize. Represents the maximum number of objects in the pool. If <strictMaximumSize> is set to true, then
<MaximumSize> is a strict upper limit for the number of objects that can be created. Otherwise, if
<strictMaximumSize> is set to false, the number of active objects can exceed the <MaximumSize> if there are
requests for more objects. However, only the <MaximumSize> number of objects can be returned to the pool.

Additionally, there are two other parameters that you can set to fine tune the EJB pool. These two parameters are
not set by default in Data Analyzer. They can be tuned after you have performed proper iterative testing in Data
Analyzer to increase the throughput for high-concurrency installations.

strictMaximumSize. When the value is set to true, the <strictMaximumSize> parameter
enforces a rule that only <MaximumSize> number of objects will be active. Any subsequent
requests must wait for an object to be returned to the pool.

strictTimeout. If you set <strictMaximumSize> to true, then <strictTimeout> is the amount
of time that requests wait for an object to be made available in the pool.

Enterprise Java Beans (EJB). Data Analyzer EJBs use BMP (bean-managed persistence) as opposed to CMP
(container-managed persistence). The EJB tuning parameters are very similar to the stateless bean tuning
parameters.

The EJB tuning parameters are in the following configuration file:

<JBOSS_HOME>/server/informatica/conf/standardjboss.xml.

The following is a typical configuration:

<container-configuration>
<container-name>Standard BMP EntityBean</container-name>
<call-logging>false</call-logging>
<invoker-proxy-binding-name>entity-rmi-invoker
</invoker-proxy-binding-name>
<sync-on-commit-only>false</sync-on-commit-only>
<container-interceptors>
<interceptor>org.jboss.ejb.plugins.ProxyFactoryFinderInterceptor
</interceptor>
<interceptor>org.jboss.ejb.plugins.LogInterceptor</interceptor>
<interceptor>org.jboss.ejb.plugins.SecurityInterceptor
</interceptor>
<interceptor>org.jboss.ejb.plugins.TxInterceptorCMT
</interceptor>
<interceptor metricsEnabled="true">
org.jboss.ejb.plugins.MetricsInterceptor</interceptor>
<interceptor>org.jboss.ejb.plugins.EntityCreationInterceptor
</interceptor>
<interceptor>org.jboss.ejb.plugins.EntityLockInterceptor
</interceptor>
<interceptor>org.jboss.ejb.plugins.EntityInstanceInterceptor
</interceptor>
<interceptor>org.jboss.ejb.plugins.EntityReentranceInterceptor
</interceptor>
<interceptor>
org.jboss.resource.connectionmanager.CachedConnectionInterceptor
</interceptor>
<interceptor>
org.jboss.ejb.plugins.EntitySynchronizationInterceptor
</interceptor>
</container-interceptors>
<instance-pool>org.jboss.ejb.plugins.EntityInstancePool
</instance-pool>
<instance-cache>org.jboss.ejb.plugins.EntityInstanceCache
</instance-cache>
<persistence-manager>org.jboss.ejb.plugins.BMPPersistenceManager
</persistence-manager>
<locking-policy>org.jboss.ejb.plugins.lock.QueuedPessimisticEJBLock
</locking-policy>
<container-cache-conf>
<cache-policy>org.jboss.ejb.plugins.LRUEnterpriseContextCachePolicy
</cache-policy>
<cache-policy-conf>
<min-capacity>50</min-capacity>
<max-capacity>1000000</max-capacity>
<overager-period>300</overager-period>
<max-bean-age>600</max-bean-age>
<resizer-period>400</resizer-period>
<max-cache-miss-period>60</max-cache-miss-period>
<min-cache-miss-period>1</min-cache-miss-period>
<cache-load-factor>0.75</cache-load-factor>
</cache-policy-conf>
</container-cache-conf>
<container-pool-conf>
<MaximumSize>100</MaximumSize>
</container-pool-conf>
<commit-option>A</commit-option>
</container-configuration>

The following parameter may need tuning:

MaximumSize. Represents the maximum number of objects in the pool. If <strictMaximumSize> is set to true, then
<MaximumSize> is a strict upper limit for the number of objects that can be created. Otherwise, if
<strictMaximumSize> is set to false, the number of active objects can exceed the <MaximumSize> if there are
requests for more objects. However, only the <MaximumSize> number of objects are returned to the pool.

Additionally, there are two other parameters that you can set to fine tune the EJB pool. These two parameters are
not set by default in Data Analyzer. They can be tuned after you have performed proper iterative testing in Data
Analyzer to increase the throughput for high-concurrency installations.

strictMaximumSize. When the value is set to true, the <strictMaximumSize> parameter
enforces a rule that only <MaximumSize> number of objects can be active. Any subsequent
requests must wait for an object to be returned to the pool.

strictTimeout. If you set <strictMaximumSize> to true, then <strictTimeout> is the amount
of time that requests will wait for an object to be made available in the pool.

RMI Pool

The JBoss Application Server can be configured to have a pool of threads to accept connections from clients for
remote method invocation (RMI). If you use the Java RMI protocol to access the Data Analyzer API from other
custom applications, you can optimize the RMI thread pool parameters.

To optimize the RMI pool, modify the following configuration file:

<JBOSS_HOME>/server/informatica/conf/jboss-service.xml

The following is a typical configuration:

<mbean code="org.jboss.invocation.pooled.server.PooledInvoker" name="jboss:service=invoker,type=pooled">
<attribute name="NumAcceptThreads">1</attribute>
<attribute name="MaxPoolSize">300</attribute>
<attribute name="ClientMaxPoolSize">300</attribute>
<attribute name="SocketTimeout">60000</attribute>
<attribute name="ServerBindAddress"></attribute>
<attribute name="ServerBindPort">0</attribute>
<attribute name="ClientConnectAddress"></attribute>
<attribute name="ClientConnectPort">0</attribute>
<attribute name="EnableTcpNoDelay">false</attribute>
<depends optional-attribute-name="TransactionManagerService">
jboss:service=TransactionManager
</depends>
</mbean>

The following parameters may need tuning:

NumAcceptThreads. The controlling threads used to accept connections from the client.

MaxPoolSize. A strict maximum size for the pool of threads to service requests on the server.

ClientMaxPoolSize. A strict maximum size for the pool of threads to service requests on the
client.

Backlog. The number of requests held in the queue when all the processing threads are in use.

EnableTcpNoDelay. Indicates whether information should be sent before the buffer is full. Setting it to true
may increase network traffic because more packets are sent across the network.

WebSphere Application Server 5.1. The Tivoli Performance Viewer can be used to observe the behavior of some
of the parameters and arrive at good settings.

Web Container

Navigate to “Application Servers > [your_server_instance] > Web Container > Thread Pool” to tune the following
parameters.

● Minimum Size: Specifies the minimum number of threads to allow in the pool. The default
value of 10 is appropriate.
● Maximum Size: Specifies the maximum number of threads to allow in the pool. For a highly concurrent
usage scenario (with a 3 VM load-balanced configuration), a value of 50-60 has been determined to be
optimal.
● Thread Inactivity Timeout: Specifies the number of milliseconds of inactivity that should elapse before a
thread is reclaimed. The default of 3500ms is considered optimal.
● Is Growable: Specifies whether the number of threads can increase beyond the maximum size
configured for the thread pool. Be sure to leave this option unchecked so that the maximum
number of threads is hard-limited to the value given in “Maximum Size”.

Note: In a load-balanced environment, there is likely to be more than one server instance that may be spread across
multiple machines. In such a scenario, be sure that the changes have been properly propagated to all of the server
instances.

Transaction Services

Total transaction lifetime timeout: In certain circumstances (e.g., import of large XML files), the default value of 120
seconds may not be sufficient and should be increased. This parameter can also be modified at runtime.

Diagnostic Trace Services

Disable the trace in a production environment.

Navigate to “Application Servers > [your_server_instance] > Administration Services >
Diagnostic Trace Service” and make sure “Enable Tracing” is not checked.

Debugging Services

Ensure that the tracing is disabled in a production environment.

Navigate to “Application Servers > [your_server_instance] > Logging and Tracing > Diagnostic Trace Service >
Debugging Service “ and make sure “Startup” is not checked.

Performance Monitoring Services

This set of parameters is for monitoring the health of the Application Server. This monitoring service tries to ping the
application server after a certain interval; if the server is found to be dead, it then tries to restart the server.

Navigate to “Application Servers > [your_server_instance] > Process Definition > MonitoringPolicy “ and tune the
parameters according to a policy determined for each Data Analyzer installation.

Note: The parameter “Ping Timeout” determines the time after which a no-response from the server implies that it is
faulty. The monitoring service then attempts to kill the server and restart it if “Automatic restart” is checked. Take
care that “Ping Timeout” is not set to too small a value.

Process Definitions (JVM Parameters)

For a Data Analyzer installation with a high number of concurrent users, Informatica recommends that the minimum
and the maximum heap size be set to the same values. This avoids the heap allocation-reallocation expense during
a high-concurrency scenario. Also, for a high-concurrency scenario, Informatica recommends setting the values of
minimum heap and maximum heap size to at least 1000MB. Further tuning of this heap-size is recommended after
carefully studying the garbage collection behavior by turning on the verbosegc option.

The following is a list of java parameters (for IBM JVM 1.4.1) that should not be modified from the default values for
Data Analyzer installation:

-Xnocompactgc. This parameter switches off heap compaction altogether. Switching off heap
compaction results in heap fragmentation. Since Data Analyzer frequently allocates large
objects, heap fragmentation can result in OutOfMemory exceptions.

-Xcompactgc. Using this parameter leads to each garbage collection cycle carrying out
compaction, regardless of whether it is useful.

-Xgcthreads. This controls the number of garbage collection helper threads created by the
JVM during startup. The default is N-1 threads for an N-processor machine. These threads
provide the parallelism in parallel mark and parallel sweep modes, which reduces the pause
time during garbage collection.

-Xclassnogc. This disables collection of class objects.


● -Xinitsh. This sets the initial size of the application-class system heap. The system heap is expanded as
needed and is never garbage collected.

You may want to alter the following parameters after carefully examining the application server processes:

● Navigate to “Application Servers > [your_server_instance] > Process Definition > Java Virtual Machine"

Verbose garbage collection. Check this option to turn on verbose garbage collection. This can
help in understanding the behavior of the garbage collection for the application. It has a very
low overhead on performance and can be turned on even in the production environment.

Initial heap size. This is the –ms value. Only the numeric value (without MB) needs to be
specified. For concurrent usage, the initial heap size should start at 1000 and,
depending on the garbage collection behavior, can potentially be increased up to 2000. A value
beyond 2000 may actually reduce throughput because the garbage collection cycles take
more time to go through the large heap, even though the cycles may occur less
frequently.

Maximum heap size. This is the –mx value. It should be equal to the “Initial heap size” value.

RunHProf. This should remain unchecked in production mode, because it slows down the VM
considerably.

Debug Mode. This should remain unchecked in production mode, because it slows down the VM
considerably.

Disable JIT. This should remain unchecked (i.e., JIT should never be disabled).

Performance Monitoring Services

Be sure that performance monitoring services are not enabled in a production environment.

Navigate to “Application Servers > [your_server_instance] > Performance Monitoring Services“ and be sure “Startup”
is not checked.

Database Connection Pool

The repository database connection pool can be configured by navigating to “JDBC Providers > User-defined JDBC
Provider > Data Sources > IASDataSource > Connection Pools”

The various parameters that may need tuning are:

● Connection Timeout. The default value of 180 seconds should be good. This implies that after 180 seconds,
the request to grab a connection from the pool will time out. After it times out, Data Analyzer throws an
exception; in that case, the pool size may need to be increased.

● Max Connections. The maximum number of connections in the pool. Informatica recommends a value of 50
for this.
● Min Connections. The minimum number of connections in the pool. Informatica recommends a value of 10
for this.
● Reap Time. This specifies the interval of the pool maintenance thread. Maintenance should not run too
frequently because, while the pool maintenance thread is running, it blocks the whole pool and no process can
grab a new connection from the pool. If the database and the network are reliable, this should have a very high
value (e.g., 1000).
● Unused Timeout. This specifies the time in seconds after which an unused connection is discarded
until the pool size reaches the minimum size. In highly concurrent usage, this should be a high value. The
default of 1800 seconds should be fine.
● Aged Timeout. Specifies the interval in seconds before a physical connection is discarded. If the database
and the network are stable, there should not be a reason for an age timeout. The default is 0 (i.e., connections
do not age). If the database or the network connection to the repository database frequently comes down
(compared to the life of the AppServer), this can be used to age out stale connections.

Much like the repository database connection pools, the data source or data warehouse databases also have a pool
of connections that are created dynamically by Data Analyzer as soon as the first client makes a request.

The tuning parameters for these dynamic pools are present in the <WebSphere_Home>/AppServer/IAS.properties file.

The following is a typical configuration:

# Datasource definition

dynapool.initialCapacity=5

dynapool.maxCapacity=50

dynapool.capacityIncrement=2

dynapool.allowShrinking=true

dynapool.shrinkPeriodMins=20

dynapool.waitForConnection=true

dynapool.waitSec=1

dynapool.poolNamePrefix=IAS_

dynapool.refreshTestMinutes=60

datamart.defaultRowPrefetch=20

The various parameters that may need tuning are:

● dynapool.initialCapacity - the minimum number of initial connections in the data-source pool.


● dynapool.maxCapacity - the maximum number of connections that the data-source pool may grow up to.
● dynapool.poolNamePrefix - a prefix added to the dynamic JDBC pool name for identification purposes.
● dynapool.waitSec - the maximum amount of time (in seconds) that a client will wait to grab a connection
from the pool if none is readily available.
● dynapool.refreshTestMinutes - determines the frequency at which a health check on the idle connections in
the pool is performed. Such checks should not be performed too frequently because they lock up the
connection pool and may prevent other clients from grabbing connections from the pool.
● dynapool.shrinkPeriodMins - determines the amount of time (in minutes) an idle connection is allowed to be
in the pool. After this period, the number of connections in the pool decreases (to its initialCapacity). This is
done only if allowShrinking is set to true.

Message Listener Services

To process scheduled reports, Data Analyzer uses message-driven beans. It is possible to run multiple reports
within one schedule in parallel by increasing the number of instances of the MDB catering to the Scheduler
(InfScheduleMDB). Take care, however, not to increase this to an arbitrarily high value: each report
consumes considerable resources (e.g., database connections and CPU processing at both the application-server
and database-server levels), and setting it too high may actually be detrimental to the whole system.

Navigate to “Application Servers > [your_server_instance] > Message Listener Service > Listener Ports >
IAS_ScheduleMDB_ListenerPort” .

The parameters that can be tuned are:

● Maximum sessions. The default value is one. In a highly concurrent user scenario, Informatica does not
recommend going beyond five.
● Maximum messages. This should remain one, which implies that each report in a schedule is
executed in a separate transaction instead of as a batch. Setting it to more than one may have unwanted
effects such as transaction timeouts, and the failure of one report may cause all the reports in the batch to fail.

Plug-in Retry Intervals and Connect Timeouts

When Data Analyzer is set up in a clustered WebSphere environment, a plug-in is normally used to perform the load-
balancing between each server in the cluster. The proxy http-server sends the request to the plug-in and the plug-in
then routes the request to the proper application-server.

The plug-in file can be generated automatically by navigating to
“Environment > Update web server plugin configuration”.

The default plug-in file contains ConnectTimeOut=0, which means that it relies on the TCP timeout setting of the
server. It is possible to have different timeout settings for different servers in the cluster. The timeout setting implies
that if the server does not respond within the given number of seconds, it is marked as down and the request is
sent over to the next available member of the cluster.

The RetryInterval parameter allows you to specify how long to wait before retrying a server that is marked as down.
The default value is 10 seconds. This means if a cluster member is marked as down, the server does not try to send
a request to the same member for 10 seconds.

Last updated: 13-Feb-07 17:59

Tuning Mappings for Better Performance

Challenge

In general, mapping-level optimization takes time to implement, but can significantly boost performance.
Sometimes the mapping is the biggest bottleneck in the load process because business rules determine
the number and complexity of transformations in a mapping.

Before deciding on the best route to optimize the mapping architecture, you need to resolve some basic
issues. Tuning mappings is a grouped approach. The first group of techniques can be of assistance almost
universally, bringing about a performance increase in all scenarios. The second group of tuning processes may yield
only a small performance increase, or can be of significant value, depending on the situation.

Some factors to consider when choosing tuning processes at the mapping level include the specific
environment, software/hardware limitations, and the number of rows going through a mapping. This Best
Practice offers some guidelines for tuning mappings.

Description

Analyze mappings for tuning only after you have tuned the target and source for peak performance. To
optimize mappings, you generally reduce the number of transformations in the mapping and delete
unnecessary links between transformations.

For transformations that use data cache (such as Aggregator, Joiner, Rank, and Lookup transformations),
limit connected input/output or output ports. Doing so can reduce the amount of data the transformations
store in the data cache. Having too many Lookups and Aggregators can encumber performance because
each requires index cache and data cache. Since both are fighting for memory space, decreasing the
number of these transformations in a mapping can help improve speed. Splitting them up into different
mappings is another option.

Limit the number of Aggregators in a mapping. A high number of Aggregators can increase I/O activity on
the cache directory. Unless the seek/access time is fast on the directory itself, having too many
Aggregators can cause a bottleneck. Similarly, too many Lookups in a mapping causes contention of disk
and memory, which can lead to thrashing, leaving insufficient memory to run a mapping efficiently.

Consider Single-Pass Reading

If several mappings use the same data source, consider a single-pass reading. If you have several
sessions that use the same sources, consolidate the separate mappings with either a single Source
Qualifier Transformation or one set of Source Qualifier Transformations as the data source for the
separate data flows.

Similarly, if a function is used in several mappings, a single-pass reading reduces the number of times that
function is called in the session. For example, if you need to subtract a percentage from the PRICE ports for
both the Aggregator and Rank transformations, you can minimize work by subtracting the percentage
before splitting the pipeline.

Optimize SQL Overrides

When SQL overrides are required in a Source Qualifier, a Lookup transformation, or in the update override
of a target object, be sure the SQL statement is tuned. The extent to which and how SQL can be tuned
depends on the underlying source or target database system. See Tuning SQL Overrides and
Environment for Better Performance for more information.

Scrutinize Datatype Conversions

PowerCenter Server automatically makes conversions between compatible datatypes. When these
conversions are performed unnecessarily, performance slows. For example, if a mapping moves data from
an integer port to a decimal port, then back to an integer port, the conversion may be unnecessary.

In some instances however, datatype conversions can help improve performance. This is especially true
when integer values are used in place of other datatypes for performing comparisons using Lookup and
Filter transformations.

Eliminate Transformation Errors

Large numbers of evaluation errors significantly slow performance of the PowerCenter Server. During
transformation errors, the PowerCenter Server engine pauses to determine the cause of the error,
removes the row causing the error from the data flow, and logs the error in the session log.

Transformation errors can be caused by many things including: conversion errors, conflicting mapping
logic, any condition that is specifically set up as an error, and so on. The session log can help point out the
cause of these errors. If errors recur consistently for certain transformations, re-evaluate the constraints for
these transformations. If you need to run a session that generates a large number of transformation errors,
you might improve performance by setting a lower tracing level. However, this is not a long-term response
to transformation errors. Any source of errors should be traced and eliminated.

Optimize Lookup Transformations

There are several ways to optimize Lookup transformations that are set up in a mapping.

When to Cache Lookups

Cache small lookup tables. When caching is enabled, the PowerCenter Server caches the lookup table
and queries the lookup cache during the session. When this option is not enabled, the PowerCenter
Server queries the lookup table on a row-by-row basis.

Note: All of the tuning options mentioned in this Best Practice assume that memory and cache sizing for
lookups are sufficient to ensure that caches will not page to disk. Information regarding memory and
cache sizing for Lookup transformations is covered in the Best Practice: Tuning Sessions for Better
Performance.

A better rule of thumb than memory size is to determine the size of the potential lookup cache with regard
to the number of rows expected to be processed, as the following example illustrates.

In Mapping X, the source and lookup contain the following number of records:

ITEMS (source): 5000 records

MANUFACTURER: 200 records

DIM_ITEMS: 100000 records

Number of Disk Reads

                           Cached Lookup    Un-cached Lookup
LKP_Manufacturer
  Build Cache                        200                   0
  Read Source Records               5000                5000
  Execute Lookup                       0                5000
  Total # of Disk Reads             5200               10000

LKP_DIM_ITEMS
  Build Cache                     100000                   0
  Read Source Records               5000                5000
  Execute Lookup                       0                5000
  Total # of Disk Reads           105000               10000

Consider the case where MANUFACTURER is the lookup table. If the lookup table is cached, it will take a
total of 5,200 disk reads to build the cache and execute the lookup. If the lookup table is not cached, then it
will take a total of 10,000 disk reads to execute the lookup. In this case, the number of records in the
lookup table is small in comparison with the number of times the lookup is executed. So this lookup should
be cached. This is the more likely scenario.

Consider the case where DIM_ITEMS is the lookup table. If the lookup table is cached, it will result in
105,000 total disk reads to build and execute the lookup. If the lookup table is not cached, then the disk
reads would total 10,000. In this case the number of records in the lookup table is not small in comparison
with the number of times the lookup will be executed. Thus, the lookup should not be cached.

Use the following eight-step method to determine whether a lookup should be cached:

1. Code the lookup into the mapping.


2. Select a standard set of data from the source. For example, add a "where" clause on a relational
source to load a sample of 10,000 rows.
3. Run the mapping with caching turned off and save the log.
4. Run the mapping with caching turned on and save the log to a different name than the log created
in step 3.
5. Look in the cached lookup log and determine how long it takes to cache the lookup object. Note
this time in seconds: LOOKUP TIME IN SECONDS = LS.
6. In the non-cached log, take the time from the last lookup cache to the end of the load in seconds
and divide it into the number of rows being processed: NON-CACHED ROWS PER SECOND =
NRS.
7. In the cached log, take the time from the last lookup cache to the end of the load in seconds and
divide it into the number of rows being processed: CACHED ROWS PER SECOND = CRS.
8. Use the following formula to find the breakeven row point:

(LS*NRS*CRS)/(CRS-NRS) = X

Where X is the breakeven point. If the expected number of source records is less than X, it is better
not to cache the lookup. If it is more than X, it is better to cache the lookup.

For example:

Assume the lookup takes 166 seconds to cache (LS=166).


Assume with a cached lookup the load is 232 rows per second (CRS=232).
Assume with a non-cached lookup the load is 147 rows per second (NRS = 147).

The formula would result in: (166*147*232)/(232-147) = 66,603.

Thus, if the source has less than 66,603 records, the lookup should not be cached. If it has more
than 66,603 records, then the lookup should be cached.

Sharing Lookup Caches

There are a number of methods for sharing lookup caches:

● Within a specific session run for a mapping, if the same lookup is used multiple times in a
mapping, the PowerCenter Server will re-use the cache for the multiple instances of the lookup.
Using the same lookup multiple times in the mapping will be more resource intensive with each
successive instance. If multiple cached lookups are from the same table but are expected to
return different columns of data, it may be better to set up the multiple lookups to bring back the
same columns even though not all return ports are used in all lookups. Bringing back a common
set of columns may reduce the number of disk reads.
● Across sessions of the same mapping, the use of an unnamed persistent cache allows multiple
runs to use an existing cache file stored on the PowerCenter Server. If the option of creating a
persistent cache is set in the lookup properties, the memory cache created for the lookup during
the initial run is saved to the PowerCenter Server. This can improve performance because the
Server builds the memory cache from cache files instead of the database. This feature should
only be used when the lookup table is not expected to change between session runs.
● Across different mappings and sessions, the use of a named persistent cache allows sharing an
existing cache file.

Reducing the Number of Cached Rows

There is an option to use a SQL override in the creation of a lookup cache. Conditions can be added to the
WHERE clause to reduce the set of records included in the resulting cache.

Note: If you use a SQL override in a lookup, the lookup must be cached.
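
As a hedged sketch (the table and column names below are illustrative only, not from this document), a
lookup SQL override might restrict the cache to current rows:

SELECT CUSTOMER_ID, CUSTOMER_NAME
FROM   CUSTOMER_DIM
WHERE  CURRENT_FLAG = 'Y'

Only rows that satisfy the WHERE clause are read into the lookup cache, which reduces both the time to
build the cache and the cache size.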

Optimizing the Lookup Condition

In the case where a lookup uses more than one lookup condition, set the conditions with an equal sign first
in order to optimize lookup performance.

Indexing the Lookup Table

The PowerCenter Server must query, sort, and compare values in the lookup condition columns. As a
result, indexes on the database table should include every column used in a lookup condition. This can
improve performance for both cached and un-cached lookups.

In the case of a cached lookup, an ORDER BY condition is issued in the SQL statement
used to create the cache. Columns used in the ORDER BY condition should be indexed.
The session log will contain the ORDER BY statement.

In the case of an un-cached lookup, since a SQL statement is created for each row
passing into the lookup transformation, performance can be helped by indexing
columns in the lookup condition.
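
For example, a minimal sketch (with illustrative table and column names) of indexing the lookup condition
columns:

CREATE INDEX IDX_CUSTOMER_DIM_LKP ON CUSTOMER_DIM (CUSTOMER_ID, SOURCE_SYSTEM_CODE)

For a cached lookup, this supports the ORDER BY issued while building the cache; for an un-cached
lookup, it speeds the query issued for each input row.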

Use a Persistent Lookup Cache for Static Lookups

If the lookup source does not change between sessions, configure the Lookup transformation to use a
persistent lookup cache. The PowerCenter Server then saves and reuses cache files from session to
session, eliminating the time required to read the lookup source.

Optimize Filter and Router Transformations

Filtering data as early as possible in the data flow improves the efficiency of a mapping. Instead of
using a Filter Transformation to remove a sizeable number of rows in the middle or end of a mapping, use
a filter on the Source Qualifier or a Filter Transformation immediately after the source qualifier to improve
performance.

Avoid complex expressions when creating the filter condition. Filter transformations are most
effective when a simple integer or TRUE/FALSE expression is used in the filter condition.
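
For instance (a hedged illustration using hypothetical port names), rather than evaluating a complex string
expression directly in the filter condition, compute an integer flag in an upstream Expression transformation
and filter on the flag:

In the Expression transformation:  FLAG_KEEP = IIF(STATUS = 'ACTIVE' AND AMOUNT > 0, 1, 0)
In the Filter transformation, the filter condition is simply:  FLAG_KEEP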

Filters or routers should also be used to drop rejected rows from an Update Strategy transformation if
rejected rows do not need to be saved.

Replace multiple filter transformations with a router transformation. This reduces the number of
transformations in the mapping and makes the mapping easier to follow.

Optimize Aggregator Transformations

Aggregator Transformations often slow performance because they must group data before processing it.

Use simple columns in the group by condition to make the Aggregator Transformation more efficient.
When possible, use numbers instead of strings or dates in the GROUP BY columns. Also avoid complex
expressions in the Aggregator expressions, especially in GROUP BY ports.

Use the Sorted Input option in the Aggregator. This option requires that data sent to the Aggregator be
sorted in the order in which the ports are used in the Aggregator's group by. The Sorted Input option
decreases the use of aggregate caches. When it is used, the PowerCenter Server assumes all data is
sorted by group and, as a group is passed through an Aggregator, calculations can be performed and
information passed on to the next transformation. Without sorted input, the Server must wait for all rows of
data before processing aggregate calculations. Use of the Sorted Input option is usually accompanied by
a Source Qualifier that uses the Number of Sorted Ports option.

Use an Expression and Update Strategy instead of an Aggregator Transformation. This technique can
only be used if the source data can be sorted; it effectively replaces an Aggregator that would use the
Sorted Input option. In the Expression Transformation, the use of variable ports is
required to hold data from the previous row of data processed. The premise is to use the previous row of
data to determine whether the current row is a part of the current group or is the beginning of a new group.
Thus, if the row is a part of the current group, then its data would be used to continue calculating the
current group function. An Update Strategy Transformation would follow the Expression Transformation
and set the first row of a new group to insert, and the following rows to update.
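
A minimal sketch of this pattern, using hypothetical port names and assuming the data is sorted by
GROUP_KEY (variable ports are evaluated in the order shown, so the comparison happens before the
previous key is overwritten):

Expression transformation:
    V_NEW_GROUP   = IIF( ISNULL(V_PREV_KEY) OR GROUP_KEY != V_PREV_KEY, 1, 0 )
    V_RUNNING_AMT = IIF( V_NEW_GROUP = 1, AMOUNT, V_RUNNING_AMT + AMOUNT )
    V_PREV_KEY    = GROUP_KEY
    O_NEW_GROUP   = V_NEW_GROUP
    O_RUNNING_AMT = V_RUNNING_AMT

Update Strategy expression:
    IIF( O_NEW_GROUP = 1, DD_INSERT, DD_UPDATE )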

Use incremental aggregation if the changes you can capture from the source affect less than half of the
target. When using incremental aggregation, you apply captured changes in the source to aggregate
calculations in a session. The PowerCenter Server updates your target incrementally, rather than
processing the entire source and recalculating the same calculations every time you run the session.

Joiner Transformation

Joining Data from the Same Source

You can join data from the same source in the following ways:

● Join two branches of the same pipeline.


● Create two instances of the same source and join pipelines from these source instances.

You may want to join data from the same source if you want to perform a calculation on part of the data
and join the transformed data with the original data. When you join the data using this method, you can
maintain the original data and transform parts of that data within one mapping.

When you join data from the same source, you can create two branches of the pipeline. When you branch
a pipeline, you must add a transformation between the Source Qualifier and the Joiner transformation in at
least one branch of the pipeline. You must join sorted data and configure the Joiner transformation for
sorted input.

If you want to join unsorted data, you must create two instances of the same source and join the pipelines.

For example, you may have a source with the following ports:

● Employee
● Department
● Total Sales

In the target table, you want to view the employees who generated sales that were greater than the
average sales for their respective departments. To accomplish this, you create a mapping with the
following transformations:

● Sorter transformation. Sort the data.


● Sorted Aggregator transformation. Average the sales data and group by department. When
you perform this aggregation, you lose the data for individual employees. To maintain employee
data, you must pass a branch of the pipeline to the Aggregator transformation and pass a branch
with the same data to the Joiner transformation to maintain the original data. When you join both
branches of the pipeline, you join the aggregated data with the original data.
● Sorted Joiner transformation. Use a sorted Joiner transformation to join the sorted aggregated
data with the original data.
● Filter transformation. Compare the average sales data against the sales data for each employee
and filter out employees whose sales are not above the department average.

Note: You can also join data from output groups of the same transformation, such as the Custom
transformation or XML Source Qualifier transformations. Place a Sorter transformation between each
output group and the Joiner transformation and configure the Joiner transformation to receive sorted input.

Joining two branches can affect performance if the Joiner transformation receives data from one branch
much later than the other branch. The Joiner transformation caches all the data from the first branch, and
writes the cache to disk if the cache fills. The Joiner transformation must then read the data from disk
when it receives the data from the second branch. This can slow processing.

You can also join same source data by creating a second instance of the source. After you create the
second source instance, you can join the pipelines from the two source instances.

Note: When you join data using this method, the PowerCenter Server reads the source data for each
source instance, so performance can be slower than joining two branches of a pipeline.

Use the following guidelines when deciding whether to join branches of a pipeline or join two instances of a
source:

● Join two branches of a pipeline when you have a large source or if you can read the source data
only once. For example, you can only read source data from a message queue once.
● Join two branches of a pipeline when you use sorted data. If the source data is unsorted and you
use a Sorter transformation to sort the data, branch the pipeline after you sort the data.
● Join two instances of a source when you need to add a blocking transformation to the pipeline
between the source and the Joiner transformation.
● Join two instances of a source if one pipeline may process much more slowly than the other
pipeline.

Performance Tips

Use the database to do the join when sourcing data from the same database schema. Database
systems usually can perform the join more quickly than the PowerCenter Server, so a SQL override or a
join condition should be used when joining multiple tables from the same database schema.
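
As a hedged sketch (the table names are illustrative only), the join can be expressed in the Source Qualifier
SQL override rather than in a Joiner transformation:

SELECT O.ORDER_ID, O.ORDER_DATE, C.CUSTOMER_NAME
FROM   ORDERS O
       INNER JOIN CUSTOMERS C ON C.CUSTOMER_ID = O.CUSTOMER_ID

This lets the database perform the join and sends only the joined result set to PowerCenter.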

Use Normal joins whenever possible. Normal joins are faster than outer joins and the resulting set of
data is also smaller.

Join sorted data when possible. You can improve session performance by configuring the Joiner
transformation to use sorted input. When you configure the Joiner transformation to use sorted data, the
PowerCenter Server improves performance by minimizing disk input and output. You see the greatest
performance improvement when you work with large data sets.

For an unsorted Joiner transformation, designate as the master source the source with fewer rows.
For optimal performance and disk storage, designate the master source as the source with fewer rows.
During a session, the Joiner transformation compares each row of the master source against the detail
source. The fewer unique rows in the master, the fewer iterations of the join comparison occur, which
speeds the join process.

For a sorted Joiner transformation, designate as the master source the source with fewer duplicate
key values. For optimal performance and disk storage, designate the master source as the source with
fewer duplicate key values. When the PowerCenter Server processes a sorted Joiner transformation, it
caches rows for one hundred keys at a time. If the master source contains many rows with the same key
value, the PowerCenter Server must cache more rows, and performance can be slowed.

Optimizing sorted joiner transformations with partitions. When you use partitions with a sorted Joiner
transformation, you may optimize performance by grouping data and using n:n partitions.

Add a hash auto-keys partition upstream of the sort origin

To obtain expected results and get best performance when partitioning a sorted Joiner transformation, you
must group and sort data. To group data, ensure that rows with the same key value are routed to the same
partition. The best way to ensure that data is grouped and distributed evenly among partitions is to add a
hash auto-keys or key-range partition point before the sort origin. Placing the partition point before you sort
the data ensures that you maintain grouping and sort the data within each group.

Use n:n partitions

You may be able to improve performance for a sorted Joiner transformation by using n:n partitions. When
you use n:n partitions, the Joiner transformation reads master and detail rows concurrently and does not
need to cache all of the master data. This reduces memory usage and speeds processing. When you use
1:n partitions, the Joiner transformation caches all the data from the master pipeline and writes the cache
to disk if the memory cache fills. When the Joiner transformation receives the data from the detail pipeline,
it must then read the data from disk to compare the master and detail pipelines.

Optimize Sequence Generator Transformations

Sequence Generator transformations need to determine the next available sequence number; thus,
increasing the Number of Cached Values property can increase performance. This property determines
the number of values the PowerCenter Server caches at one time. If it is set to cache no values, then the
PowerCenter Server must query the repository each time to determine the next number to be used. You
may consider configuring the Number of Cached Values to a value greater than 1,000. Note that any
cached values not used in the course of a session are lost, because the Sequence Generator's current
value in the repository is advanced each time a new block of cached values is reserved.

Avoid External Procedure Transformations

For the most part, making calls to external procedures slows a session. If possible, avoid the use of these
Transformations, which include Stored Procedures, External Procedures, and Advanced External
Procedures.

Field-Level Transformation Optimization

As a final step in the tuning process, you can tune expressions used in transformations. When examining
expressions, focus on complex expressions and try to simplify them when possible.

To help isolate slow expressions, do the following:

1. Time the session with the original expression.


2. Copy the mapping and replace half the complex expressions with a constant.
3. Run and time the edited session.
4. Make another copy of the mapping and replace the other half of the complex expressions with a
constant.
5. Run and time the edited session.

Processing field level transformations takes time. If the transformation expressions are complex, then
processing is even slower. It’s often possible to get a 10 to 20 percent performance improvement by
optimizing complex field level transformations. Use the target table mapping reports or the Metadata
Reporter to examine the transformations. Likely candidates for optimization are the fields with the most
complex expressions. Keep in mind that there may be more than one field causing performance problems.

Factoring Out Common Logic

Factoring out common logic can reduce the number of times a mapping performs the same logic. If a
mapping performs the same logic multiple times, moving the task upstream in the mapping may allow the
logic to be performed just once. For example, a mapping has five target tables. Each target requires a
Social Security Number lookup. Instead of performing the lookup right before each target, move the lookup
to a position before the data flow splits.

Minimize Function Calls

Anytime a function is called it takes resources to process. There are several common examples where
function calls can be reduced or eliminated.

Aggregate function calls can sometimes be reduced. In the case of each aggregate function call, the
PowerCenter Server must search and group the data. Thus, the following expression:

SUM(Column A) + SUM(Column B)

Can be optimized to:

SUM(Column A + Column B)

In general, operators are faster than functions, so operators should be used whenever possible. For
example if you have an expression which involves a CONCAT function such as:

CONCAT(CONCAT(FIRST_NAME, ' '), LAST_NAME)

It can be optimized to:

FIRST_NAME || ' ' || LAST_NAME

Remember that IIF() is a function that returns a value, not just a logical test. This allows many logical
statements to be written in a more compact fashion. For example:

IIF(FLG_A = 'Y' and FLG_B = 'Y' and FLG_C = 'Y', VAL_A + VAL_B + VAL_C,

IIF(FLG_A = 'Y' and FLG_B = 'Y' and FLG_C = 'N', VAL_A + VAL_B,

IIF(FLG_A = 'Y' and FLG_B = 'N' and FLG_C = 'Y', VAL_A + VAL_C,

IIF(FLG_A = 'Y' and FLG_B = 'N' and FLG_C = 'N', VAL_A,

IIF(FLG_A = 'N' and FLG_B = 'Y' and FLG_C = 'Y', VAL_B + VAL_C,

IIF(FLG_A = 'N' and FLG_B = 'Y' and FLG_C = 'N', VAL_B,

IIF(FLG_A = 'N' and FLG_B = 'N' and FLG_C = 'Y', VAL_C,

IIF(FLG_A = 'N' and FLG_B = 'N' and FLG_C = 'N', 0.0))))))))

Can be optimized to:

IIF(FLG_A = 'Y', VAL_A, 0.0) + IIF(FLG_B = 'Y', VAL_B, 0.0) + IIF(FLG_C = 'Y', VAL_C, 0.0)

The original expression had 8 IIFs, 16 ANDs and 24 comparisons. The optimized expression results in
three IIFs, three comparisons, and two additions.

Be creative in making expressions more efficient. The following is an example of reworking an expression
to reduce three comparisons to one (this rework assumes X can only take the values being tested, since
MOD(X, 4) = 1 would also match other values such as 13):

IIF(X=1 OR X=5 OR X=9, 'yes', 'no')

Can be optimized to:

IIF(MOD(X, 4) = 1, 'yes', 'no')

Calculate Once, Use Many Times

Avoid calculating or testing the same value multiple times. If the same sub-expression is used several
times in a transformation, consider making the sub-expression a local variable. The local variable can be
used only within the transformation in which it was created. Calculating the variable only once and then
referencing the variable in following sub-expressions improves performance.

Choose Numeric vs. String Operations

The PowerCenter Server processes numeric operations faster than string operations. For example, if a
lookup is performed on a large amount of data on two columns, EMPLOYEE_NAME and EMPLOYEE_ID,
configuring the lookup around EMPLOYEE_ID improves performance.

Optimizing Char-Char and Char-Varchar Comparisons

When the PowerCenter Server performs comparisons between CHAR and VARCHAR columns, it slows
each time it finds trailing blank spaces in the row. To resolve this, enable the Treat CHAR as CHAR On
Read option in the PowerCenter Server setup so that the server does not trim trailing spaces from the end
of CHAR source fields.

Use DECODE Instead of LOOKUP

When a LOOKUP function is used, the PowerCenter Server must lookup a table in the database. When a
DECODE function is used, the lookup values are incorporated into the expression itself so the server does
not need to lookup a separate table. Thus, when looking up a small set of unchanging values, using
DECODE may improve performance.
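
For instance (a hedged sketch with hypothetical codes and descriptions), a small, static code set can be
resolved inline:

DECODE(COUNTRY_CODE, 'US', 'United States', 'CA', 'Canada', 'MX', 'Mexico', 'Unknown')

This avoids the per-row database query (or the separate cache) that a LOOKUP against a reference table
would require.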

Reduce the Number of Transformations in a Mapping

Because there is always overhead involved in moving data among transformations, try, whenever
possible, to reduce the number of transformations. Also, remove unnecessary links between
transformations to minimize the amount of data moved. This is especially important with data being pulled
from the Source Qualifier Transformation.

Use Pre- and Post-Session SQL Commands

You can specify pre- and post-session SQL commands in the Properties tab of the Source Qualifier
transformation and in the Properties tab of the target instance in a mapping. To increase the load speed,
use these commands to drop indexes on the target before the session runs, then recreate them when the
session completes.
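
A minimal sketch (the index and table names are illustrative only; the exact DROP INDEX and CREATE
INDEX syntax depends on the target database):

Pre-session SQL command:   DROP INDEX IDX_SALES_FACT_DATE
Post-session SQL command:  CREATE INDEX IDX_SALES_FACT_DATE ON SALES_FACT (SALE_DATE)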

Apply the following guidelines when using SQL statements:

● You can use any command that is valid for the database type. However, the PowerCenter Server
does not allow nested comments, even though the database may.
● You can use mapping parameters and variables in SQL executed against the source, but not
against the target.
● Use a semi-colon (;) to separate multiple statements.
● The PowerCenter Server ignores semi-colons within single quotes, double quotes, or within /* ...*/.
● If you need to use a semi-colon outside of quotes or comments, you can escape it with a
backslash (\).
● The Workflow Manager does not validate the SQL.

Use Environmental SQL

For relational databases, you can execute SQL commands in the database environment when connecting
to the database. You can use this for source, target, lookup, and stored procedure connections. For
instance, you can set isolation levels on the source and target systems to avoid deadlocks. Follow the
guidelines listed above for using the SQL statements.
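
For example (a hedged sketch; the statement below is Oracle-specific and illustrative), an environment SQL
command for a relational connection might set the transaction isolation level:

ALTER SESSION SET ISOLATION_LEVEL = READ COMMITTED

The command runs when the connection is established, before the session issues its own SQL.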

Use Local Variables

You can use local variables in Aggregator, Expression, and Rank transformations.

Temporarily Store Data and Simplify Complex Expressions

Rather than parsing and validating the same expression each time, you can define these components as
variables. This also allows you to simplify complex expressions. For example, the following expressions:

AVG( SALARY, ( ( JOB_STATUS = 'Full-time' ) AND ( OFFICE_ID = 1000 ) ) )

SUM( SALARY, ( ( JOB_STATUS = 'Full-time' ) AND ( OFFICE_ID = 1000 ) ) )

can use variables to simplify complex expressions and temporarily store data:

Port           Value
V_CONDITION1   JOB_STATUS = 'Full-time'
V_CONDITION2   OFFICE_ID = 1000
AVG_SALARY     AVG( SALARY, V_CONDITION1 AND V_CONDITION2 )
SUM_SALARY     SUM( SALARY, V_CONDITION1 AND V_CONDITION2 )

Store Values Across Rows

You can use variables to store data from prior rows. This can help you perform procedural calculations.
To compare the previous state to the state just read:

IIF( PREVIOUS_STATE = STATE, STATE_COUNTER + 1, 1 )
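
A minimal sketch of the surrounding variable ports (hypothetical port names; ports are evaluated in the
order shown, so the comparison uses the previous row's value before it is overwritten):

V_STATE_COUNTER  = IIF( ISNULL(V_PREVIOUS_STATE) OR V_PREVIOUS_STATE != STATE, 1, V_STATE_COUNTER + 1 )
V_PREVIOUS_STATE = STATE
O_STATE_COUNTER  = V_STATE_COUNTER    (output port)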

Capture Values from Stored Procedures

Variables also provide a way to capture multiple columns of return values from stored procedures.

Last updated: 13-Feb-07 17:43

Tuning Sessions for Better Performance

Challenge

Running sessions is where the rubber meets the road. A common misconception is that
this is the area where most tuning should occur. While it is true that various specific
session options can be modified to improve performance, PowerCenter 8 also provides
the PowerCenter Enterprise Grid Option and Pushdown Optimization, which can improve
performance tremendously.

Description

Once you have optimized the source database, target database, and mapping, you can focus on
optimizing the session. The greatest area for improvement at the session level usually
involves tweaking memory cache settings. The Aggregator (without sorted ports),
Joiner, Rank, Sorter and Lookup transformations (with caching enabled) use caches.

The PowerCenter Server uses index and data caches for each of these
transformations. If the allocated data or index cache is not large enough to store the
data, the PowerCenter Server stores the data in a temporary disk file as it processes
the session data. Each time the PowerCenter Server pages to the temporary file,
performance slows.

You can see when the PowerCenter Server pages to the temporary file by examining
the performance details. The transformation_readfromdisk or
transformation_writetodisk counters for any Aggregator, Rank, Lookup, Sorter, or
Joiner transformation indicate the number of times the PowerCenter Server must page
to disk to process the transformation. Index and data caches should both be sized
according to the requirements of the individual transformation. The sizing can be done using
the estimation tools provided in the Transformation Guide, or through observation of
actual cache sizes in the session caching directory.

The PowerCenter Server creates the index and data cache files by default in the
PowerCenter Server variable directory, $PMCacheDir. The naming convention used by
the PowerCenter Server for these files is PM [type of transformation] [generated session instance id
number]_[transformation instance id number]_[partition index].dat or .idx. For example, an aggregate data
cache file would be named PMAGG31_19.dat. The cache directory may be changed, however, if disk
space is a constraint.
Informatica recommends that the cache directory be local to the PowerCenter
Server. A RAID 0 arrangement that gives maximum performance with no redundancy is
recommended for volatile cache file directories (i.e., no persistent caches).

If the PowerCenter Server requires more memory than the configured cache size, it
stores the overflow values in these cache files. Since paging to disk can slow session
performance, the RAM allocated needs to be available on the server. If the server
doesn’t have available RAM and uses paged memory, your session is again accessing
the hard disk. In this case, it is more efficient to allow PowerCenter to page the data
rather than the operating system. Adding additional memory to the server is, of course,
the best solution.

Refer to Session Caches in the Workflow Administration Guide for detailed information
on determining cache sizes.

The PowerCenter Server writes to the index and data cache files during a session in
the following cases:

● The mapping contains one or more Aggregator transformations, and the


session is configured for incremental aggregation.
● The mapping contains a Lookup transformation that is configured to use a
persistent lookup cache, and the PowerCenter Server runs the session for the
first time.
● The mapping contains a Lookup transformation that is configured to initialize
the persistent lookup cache.
● The Data Transformation Manager (DTM) process in a session runs out of
cache memory and pages to the local cache files. The DTM may create
multiple files when processing large amounts of data. The session fails if the
local directory runs out of disk space.

When a session is running, the PowerCenter Server writes a message in the session
log indicating the cache file name and the transformation name. When a session
completes, the DTM generally deletes the overflow index and data cache files.
However, index and data files may exist in the cache directory if the session is
configured for either incremental aggregation or to use a persistent lookup cache.
Cache files may also remain if the session does not complete successfully.

Configuring Automatic Memory Settings

PowerCenter 8 allows you to configure the amount of cache memory. Alternatively, you
can configure the Integration Service to automatically calculate cache memory settings
at run time. When you run a session, the Integration Service allocates buffer memory to
the session to move the data from the source to the target. It also creates session
caches in memory. Session caches include index and data caches for the Aggregator,
Rank, Joiner, and Lookup transformations, as well as Sorter and XML target caches.
The values stored in the data and index caches depend upon the requirements of the
transformation. For example, the Aggregator index cache stores group values as
configured in the group by ports, and the data cache stores calculations based on the
group by ports. When the Integration Service processes a Sorter transformation or
writes data to an XML target, it also creates a cache.

Configuring Session Cache Memory

The Integration Service can determine cache memory requirements for the Lookup,
Aggregator, Rank, Joiner, and Sorter transformations and for XML targets.

You can set the index and data cache sizes to Auto in the transformation properties
or on the Mapping tab of the session properties.

Max Memory Limits

Configuring maximum memory limits allows you to ensure that you reserve a
designated amount or percentage of memory for other processes. You can configure
the memory limit as a numeric value and as a percent of total memory. Because
available memory varies, the Integration Service bases the percentage value on the
total memory on the Integration Service process machine.

For example, you configure automatic caching for three Lookup transformations in a
session. Then, you configure a maximum memory limit of 500MB for the session. When
you run the session, the Integration Service divides the 500MB of allocated memory
among the index and data caches for the Lookup transformations.

When you configure a maximum memory value, the Integration Service divides
memory among transformation caches based on the transformation type.

When you configure both a numeric value and a percentage, the Integration Service
compares the two values and uses the lower one as the maximum memory limit.

When you configure automatic memory settings, the Integration Service specifies a
minimum memory allocation for the index and data caches. The Integration Service
allocates 1,000,000 bytes to the index cache and 2,000,000 bytes to the data cache for
each transformation instance. If you configure a maximum memory limit that is less
than the minimum value for an index or data cache, the Integration Service overrides
this value. For example, if you configure a maximum memory value of 500 bytes for a
session containing a Lookup transformation, the Integration Service overrides or
disables the automatic memory settings and uses the default values.

When you run a session on a grid and you configure Maximum Memory Allowed for
Auto Memory Attributes, the Integration Service divides the allocated memory among
all the nodes in the grid. When you configure Maximum Percentage of Total Memory
Allowed for Auto Memory Attributes, the Integration Service allocates the specified
percentage of memory on each node in the grid.

Aggregator Caches

Keep the following items in mind when configuring the aggregate memory cache sizes:

● Allocate enough space to hold at least one row in each aggregate
group.
● Remember that you only need to configure cache memory for an Aggregator
transformation that does not use sorted ports. The PowerCenter Server uses
Session Process memory to process an Aggregator transformation with sorted
ports, not cache memory.
● Incremental aggregation can improve session performance. When it is used,
the PowerCenter Server saves index and data cache information to disk at the
end of the session. The next time the session runs, the PowerCenter Server
uses this historical information to perform the incremental aggregation. The
PowerCenter Server names these files PMAGG*.dat and PMAGG*.idx and
saves them to the cache directory. Mappings that have sessions which use
incremental aggregation should be set up so that only new detail records are
read with each subsequent run.
● When configuring Aggregate data cache size, remember that the data cache
holds row data for variable ports and connected output ports only. As a result,
the data cache is generally larger than the index cache. To reduce the data
cache size, connect only the necessary output ports to subsequent
transformations.

Joiner Caches

When a session is run with a Joiner transformation, the PowerCenter Server reads
from master and detail sources concurrently and builds index and data caches based
on the master rows. The PowerCenter Server then performs the join based on the
detail source data and the cache data.

The number of rows the PowerCenter Server stores in the cache depends on the
partitioning scheme, the data in the master source, and whether or not you use sorted
input.

After the memory caches are built, the PowerCenter Server reads the rows from the
detail source and performs the joins. The PowerCenter Server uses the index cache to
test the join condition. When it finds source data and cache data that match, it retrieves
row values from the data cache.

Lookup Caches

Several options can be explored when dealing with Lookup transformation caches.

● Persistent caches should be used when lookup data is not expected to change
often. Lookup cache files are saved after a session with a persistent cache
lookup is run for the first time. These files are reused for subsequent runs,
bypassing the querying of the database for the lookup. If the lookup table
changes, you must be sure to set the Recache from Database option to
ensure that the lookup cache files are rebuilt. You can also delete the cache
files before the session run to force the session to rebuild the caches.
● Lookup caching should be enabled for relatively small tables. Refer to the Best
Practice Tuning Mappings for Better Performance to determine when lookups
should be cached. When the Lookup transformation is not configured for
caching, the PowerCenter Server queries the lookup table for each input row.
The result of the lookup query and processing is the same, regardless of
whether the lookup table is cached or not. However, when the transformation
is configured to not cache, the PowerCenter Server queries the lookup table
instead of the lookup cache. Using a lookup cache can usually increase
session performance.
● Just as for a Joiner, the PowerCenter Server aligns all data for lookup caches
on an eight-byte boundary, which helps increase the performance of the
lookup.

Allocating Buffer Memory

The Integration Service can determine the memory requirements for the buffer memory:

● DTM Buffer Size


● Default Buffer Block Size

You can also configure DTM buffer size and the default buffer block size in the session
properties. When the PowerCenter Server initializes a session, it allocates blocks of
memory to hold source and target data. Sessions that use a large number of sources
and targets may require additional memory blocks.

To configure these settings, first determine the number of memory blocks the
PowerCenter Server requires to initialize the session. Then you can calculate the buffer
size and/or the buffer block size based on the default settings, to create the required
number of session blocks.

If there are XML sources or targets in the mappings, use the number of groups in the
XML source or target in the total calculation for the total number of sources and targets.
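
As a hedged arithmetic sketch (the exact formulas should be verified against the Workflow Administration
Guide for your PowerCenter version): a commonly cited guideline is that a session needs roughly
(number of sources + number of targets) * 2 buffer blocks, and that the available blocks are approximately
0.9 * (DTM Buffer Size) / (Default Buffer Block Size) * (number of partitions). For a single-partition session
with 6 sources and 4 targets, that is (6 + 4) * 2 = 20 blocks; with an assumed 64,000-byte buffer block size,
the DTM buffer size would need to be at least about 20 * 64,000 / 0.9, or roughly 1.4 MB.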

Increasing the DTM Buffer Pool Size

The DTM Buffer Pool Size setting specifies the amount of memory the PowerCenter
Server uses as DTM buffer memory. The PowerCenter Server uses DTM buffer
memory to create the internal data structures and buffer blocks used to bring data into
and out of the server. When the DTM buffer memory is increased, the PowerCenter
Server creates more buffer blocks, which can improve performance during momentary
slowdowns.

If a session's performance details show low numbers for your source and target
BufferInput_efficiency and BufferOutput_efficiency counters, increasing the DTM buffer
pool size may improve performance.

Using DTM buffer memory allocation generally causes performance to improve initially
and then level off. (Conversely, it may have no impact on source or target-bottlenecked
sessions at all and may not have an impact on DTM bottlenecked sessions). When the
DTM buffer memory allocation is increased, you need to evaluate the total memory
available on the PowerCenter Server. If a session is part of a concurrent batch, the
combined DTM buffer memory allocated for the sessions or batches must not exceed
the total memory for the PowerCenter Server system. You can increase the DTM buffer
size in the Performance settings of the Properties tab.

Running Workflows and Sessions Concurrently

The PowerCenter Server can process multiple sessions in parallel and can also
process multiple partitions of a pipeline within a session. If you have a symmetric multi-
processing (SMP) platform, you can use multiple CPUs to concurrently process session
data or partitions of data. This provides improved performance since true parallelism is
achieved. On a single processor platform, these tasks share the CPU, so there is no
parallelism.

To achieve better performance, you can create a workflow that runs several sessions in
parallel on one PowerCenter Server. This technique should only be employed on
servers with multiple CPUs available.

Partitioning Sessions

Performance can be improved by processing data in parallel in a single session by


creating multiple partitions of the pipeline. If you have PowerCenter partitioning
available, you can increase the number of partitions in a pipeline to improve session
performance. Increasing the number of partitions allows the PowerCenter Server to
create multiple connections to sources and process partitions of source data
concurrently.

When you create or edit a session, you can change the partitioning information for each
pipeline in a mapping. If the mapping contains multiple pipelines, you can specify
multiple partitions in some pipelines and single partitions in others. Keep the following
attributes in mind when specifying partitioning information for a pipeline:

● Location of partition points. The PowerCenter Server sets partition points at


several transformations in a pipeline by default. If you have PowerCenter
partitioning available, you can define other partition points. Select those
transformations where you think redistributing the rows in a different way is
likely to increase the performance considerably.
● Number of partitions. By default, the PowerCenter Server sets the number of
partitions to one. You can generally define up to 64 partitions at any partition
point. When you increase the number of partitions, you increase the number of
processing threads, which can improve session performance. Increasing the
number of partitions or partition points also increases the load on the server. If
the server contains ample CPU bandwidth, processing rows of data in a
session concurrently can increase session performance. However, if you
create a large number of partitions or partition points in a session that
processes large amounts of data, you can overload the system. You can also
overload source and target systems, so that is another consideration.
● Partition types. The partition type determines how the PowerCenter Server
redistributes data across partition points. The Workflow Manager allows you to
specify the following partition types:

1. Round-robin partitioning. PowerCenter distributes rows of data evenly


to all partitions. Each partition processes approximately the same
number of rows. In a pipeline that reads data from file sources of
different sizes, you can use round-robin partitioning to ensure that each
partition receives approximately the same number of rows.
2. Hash keys. The PowerCenter Server uses a hash function to group
rows of data among partitions. The Server groups the data based on a
partition key. There are two types of hash partitioning:

❍ Hash auto-keys. The PowerCenter Server uses all grouped or


sorted ports as a compound partition key. You can use hash
auto-keys partitioning at or before Rank, Sorter, and unsorted
Aggregator transformations to ensure that rows are grouped
properly before they enter these transformations.
❍ Hash user keys. The PowerCenter Server uses a hash
function to group rows of data among partitions based on a
user-defined partition key. You choose the ports that define the
partition key.

3. Key range. The PowerCenter Server distributes rows of data based on


a port or set of ports that you specify as the partition key. For each port,
you define a range of values. The PowerCenter Server uses the key and
ranges to send rows to the appropriate partition. Choose key range
partitioning where the sources or targets in the pipeline are partitioned
by key range.
4. Pass-through partitioning. The PowerCenter Server processes data
without redistributing rows among partitions. Therefore, all rows in a
single partition stay in that partition after crossing a pass-through
partition point.
5. Database partitioning. You can optimize session
performance by using the database partitioning partition type instead of
the pass-through partition type for IBM DB2 targets.

If you find that your system is under-utilized after you have tuned the application,
databases, and system for maximum single-partition performance, you can reconfigure
your session to have two or more partitions to make your session utilize more of the
hardware. Use the following tips when you add partitions to a session:

● Add one partition at a time. To best monitor performance, add one partition
at a time, and note your session settings before you add each partition.
● Set DTM buffer memory. For a session with n partitions, this value should be
at least n times the value for the session with one partition.
● Set cached values for Sequence Generator. For a session with n partitions,
there should be no need to use the number of cached values property of the
Sequence Generator transformation. If you must set this value to a value
greater than zero, make sure it is at least n times the original value for the
session with one partition.
● Partition the source data evenly. Configure each partition to extract the
same number of rows. Or redistribute the data among partitions early using a
partition point with round-robin. This is actually a good way to prevent
hammering of the source system. You could have a session with multiple
partitions where one partition returns all the data and the override SQL in the
other partitions is set to return zero rows (a where 1 = 2 predicate in the WHERE
clause prevents any rows from being returned; see the sketch after this list). Some source systems react better
to multiple concurrent SQL queries; others prefer smaller numbers of queries.
● Monitor the system while running the session. If there are CPU cycles
available (twenty percent or more idle time), then performance may improve
for this session by adding a partition.
● Monitor the system after adding a partition. If the CPU utilization does not
go up, the wait for I/O time goes up, or the total data transformation rate goes
down, then there is probably a hardware or software bottleneck. If the wait for I/
O time goes up a significant amount, then check the system for hardware
bottlenecks. Otherwise, check the database configuration.
● Tune databases and system. Make sure that your databases are tuned
properly for parallel ETL and that your system has no bottlenecks.
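
As a hedged sketch of the zero-row override mentioned above (the ORDERS table and columns are
illustrative only), the SQL overrides for a two-partition session might look like this:

Partition 1 override:  SELECT ORDER_ID, ORDER_AMT FROM ORDERS
Partition 2 override:  SELECT ORDER_ID, ORDER_AMT FROM ORDERS WHERE 1 = 2

All rows are returned by the first partition's query, while the second partition's query returns no rows, so the
source system sees only a single heavy query; a round-robin partition point downstream then redistributes
the rows evenly.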

Increasing the Target Commit Interval

One method of resolving target database bottlenecks is to increase the commit interval.
Each time the target database commits, performance slows. If you increase the commit
interval, the number of times the PowerCenter Server commits decreases and
performance may improve.

When increasing the commit interval at the session level, you must remember to
increase the size of the database rollback segments to accommodate the larger
number of rows. One of the major reasons that Informatica set the default commit
interval to 10,000 is to accommodate the default rollback segment / extent size of most
databases. If you increase both the commit interval and the database rollback
segments, you should see an increase in performance. In some cases though, just
increasing the commit interval without making the appropriate database changes may
cause the session to fail part way through (i.e., you may get a database error like
"unable to extend rollback segments" in Oracle).

Disabling High Precision

If a session runs with high precision enabled, disabling high precision may improve
session performance.

The Decimal datatype is a numeric datatype with a maximum precision of 28. To use a
high-precision Decimal datatype in a session, you must configure it so that the
PowerCenter Server recognizes this datatype by selecting Enable High Precision in the
session property sheet. However, since reading and manipulating a high-precision
datatype (i.e., those with a precision of greater than 28) can slow the PowerCenter
Server down, session performance may be improved by disabling decimal arithmetic.
When you disable high precision, the PowerCenter Server reverts to using a datatype of
Double.

Reducing Error Tracking

If a session contains a large number of transformation errors, you may be able to


improve performance by reducing the amount of data the PowerCenter Server writes to
the session log.

To reduce the amount of time spent writing to the session log file, set the tracing level
to Terse. At this tracing level, the PowerCenter Server does not write error messages
or row-level information for reject data. However, if terse is not an acceptable level of
detail, you may want to consider leaving the tracing level at Normal and focus your
efforts on reducing the number of transformation errors. Note that the tracing level must
be set to Normal in order to use the reject loading utility.

As an additional debug option (beyond the PowerCenter Debugger), you may set the
tracing level to verbose initialization or verbose data.

● Verbose initialization logs initialization details in addition to normal tracing, the names of
index and data files used, and detailed transformation statistics.
● Verbose data logs each row that passes into the mapping. It also notes where
the PowerCenter Server truncates string data to fit the precision of a column
and provides detailed transformation statistics. When you configure the tracing
level to verbose data, the PowerCenter Server writes row data for all rows in a
block when it processes a transformation.

However, the verbose initialization and verbose data logging options significantly affect
the session performance. Do not use Verbose tracing options except when testing
sessions. Always remember to switch tracing back to Normal after the testing is
complete.

The session tracing level overrides any transformation-specific tracing levels within the
mapping. Informatica does not recommend reducing error tracing as a long-term
response to high levels of transformation errors. Because there are only a handful of
reasons why transformation errors occur, it makes sense to fix and prevent any
recurring transformation errors. PowerCenter uses the mapping tracing level when the
session tracing level is set to none.

Pushdown Optimization

You can push transformation logic to the source or target database using pushdown
optimization. The amount of work you can push to the database depends on the
pushdown optimization configuration, the transformation logic, and the mapping and
session configuration.

When you run a session configured for pushdown optimization, the Integration Service
analyzes the mapping and writes one or more SQL statements based on the mapping
transformation logic. The Integration Service analyzes the transformation logic,
mapping, and session configuration to determine the transformation logic it can push to
the database. At run time, the Integration Service executes any SQL statement
generated against the source or target tables, and it processes any transformation logic
that it cannot push to the database.

Use the Pushdown Optimization Viewer to preview the SQL statements and mapping
logic that the Integration Service can push to the source or target database. You can
also use the Pushdown Optimization Viewer to view the messages related to
Pushdown Optimization.

Source-Side Pushdown Optimization Sessions

In source-side pushdown optimization, the Integration Service analyzes the mapping


from the source to the target until it reaches a downstream transformation that cannot
be pushed to the database.

The Integration Service generates a SELECT statement based on the transformation
logic up to the transformation it can push to the database. The Integration Service
pushes all transformation logic that is valid to push to the database by executing the
generated SQL statement at run time. Then, it reads the results of this SQL statement
and continues to run the session. Similarly, if the source qualifier contains a SQL
override, the Integration Service creates a view for the SQL override, generates a
SELECT statement, and runs the SELECT statement against this view. When
the session completes, the Integration Service drops the view from the database.

Target-Side Pushdown Optimization Sessions

When you run a session configured for target-side pushdown optimization, the
Integration Service analyzes the mapping from the target to the source or until it
reaches an upstream transformation it cannot push to the database. It generates an
INSERT, DELETE, or UPDATE statement based on the transformation logic for each
transformation it can push to the database, starting with the first transformation in the
pipeline it can push to the database. The Integration Service processes the
transformation logic up to the point that it can push the transformation logic to the target
database. Then, it executes the generated SQL.

Full Pushdown Optimization Sessions

To use full pushdown optimization, the source and target must be on the same
database. When you run a session configured for full pushdown optimization, the
Integration Service analyzes the mapping from source to target and analyzes each
transformation in the pipeline until it reaches the target. It then generates and executes
the SQL against the source and target databases.

When you run a session for full pushdown optimization, the database must run a long
transaction if the session contains a large quantity of data. Consider the following
database performance issues when you generate a long transaction:

● A long transaction uses more database resources.


● A long transaction locks the database for longer periods of time, and thereby
reduces the database concurrency and increases the likelihood of deadlock.
● A long transaction can increase the likelihood that an unexpected event may
occur.

For example, suppose a mapping contains a Source Qualifier, an Aggregator, a Rank,
an Expression, and a target, in that order. The Rank transformation cannot be pushed
to the database. If you configure the session for full pushdown optimization, the
Integration Service pushes the Source Qualifier transformation and the Aggregator
transformation to the source. It pushes the Expression transformation and target to the
target database, and it processes the Rank transformation itself. The Integration Service
does not fail the session if it can push only part of the transformation logic to the
database and the session is configured for full optimization.

Using a Grid

You can use a grid to increase session and workflow performance. A grid is an alias
assigned to a group of nodes that allows you to automate the distribution of workflows
and sessions across nodes.



When you use a grid, the Integration Service distributes workflow tasks and session
threads across multiple nodes. Running workflows and sessions on the nodes of a grid
provides the following performance gains:

● Balances the Integration Service workload.


● Processes concurrent sessions faster.
● Processes partitions faster.

When you run a session on a grid, you improve scalability and performance by
distributing session threads to multiple DTM processes running on nodes in the grid.

To run a workflow or session on a grid, you assign resources to nodes, create and
configure the grid, and configure the Integration Service to run on a grid.

Running a Session on Grid

When you run a session on a grid, the master service process runs the workflow and
workflow tasks, including the Scheduler. Because it runs on the master service process
node, the Scheduler uses the date and time for the master service process node to
start scheduled workflows. The Load Balancer distributes Command tasks as it does
when you run a workflow on a grid. In addition, when the Load Balancer dispatches a
Session task, it distributes the session threads to separate DTM processes.

The master service process starts a temporary preparer DTM process that fetches the
session and prepares it to run. After the preparer DTM process prepares the session, it
acts as the master DTM process, which monitors the DTM processes running on other
nodes.

The worker service processes start the worker DTM processes on other nodes. The
worker DTM runs the session. Multiple worker DTM processes running on a node might
be running multiple sessions or multiple partition groups from a single session
depending on the session configuration.

For example, you run a workflow on a grid that contains one Session task and one
Command task. You also configure the session to run on the grid.

When the Integration Service process runs the session on a grid, it performs the
following tasks:

● On Node 1, the master service process runs workflow tasks. It also starts a temporary preparer DTM
process, which becomes the master DTM process. The Load Balancer dispatches the Command task and
session threads to nodes in the grid.
● On Node 2, the worker service process runs the Command task and starts the
worker DTM processes that run the session threads.
● On Node 3, the worker service process starts the worker DTM processes that
run the session threads.

For information about configuring and managing a grid, refer to the PowerCenter
Administrator Guide and to the best practice PowerCenter Enterprise Grid Option.

For information about how the DTM distributes session threads into partition groups,
see "Running Workflows and Sessions on a Grid" in the Workflow Administration Guide.

Last updated: 06-Dec-07 15:20



Tuning SQL Overrides and Environment for Better Performance

Challenge

Tuning SQL Overrides and SQL queries within the source qualifier objects can improve performance in selecting data from
source database tables, which positively impacts the overall session performance. This Best Practice explores ways to
optimize a SQL query within the source qualifier object. The tips here can be applied to any PowerCenter mapping. While
the SQL discussed here is executed in Oracle 8 and above, the techniques are generally applicable, but specifics for other
RDBMS products (e.g., SQL Server, Sybase, etc.) are not included.

Description

SQL Queries Performing Data Extractions

Optimizing SQL queries is perhaps the most complex portion of performance tuning. When tuning SQL, the developer must
look at the type of execution being forced by hints, the execution plan, the indexes on the tables referenced by the query, the
logic of the SQL statement itself, and the SQL syntax. The following paragraphs discuss each of these areas in more detail.

DB2 Coalesce and Oracle NVL

When examining data with NULLs, it is often necessary to substitute a value to make comparisons and joins work. In
Oracle, the NVL function is used, while in DB2, the COALESCE function is used.

Here is an example of the Oracle NVL function:

SELECT DISTINCT bio.experiment_group_id, bio.database_site_code

FROM exp.exp_bio_result bio, sar.sar_data_load_log log

WHERE bio.update_date BETWEEN log.start_time AND log.end_time

AND NVL(bio.species_type_code, 'X') IN ('mice', 'rats', 'X')

AND log.seq_no = (SELECT MAX(seq_no) FROM sar.sar_data_load_log

WHERE load_status = 'P')

Here is the same query in DB2:

SELECT DISTINCT bio.experiment_group_id, bio.database_site_code

FROM bio_result bio, data_load_log log

WHERE bio.update_date BETWEEN log.start_time AND log.end_time

AND COALESCE(bio.species_type_code, 'X') IN ('mice', 'rats', 'X')

AND log.seq_no = (SELECT MAX(seq_no) FROM data_load_log

WHERE load_status = 'P')



Surmounting the Single SQL Statement Limitation in Oracle or DB2: In-line Views

In source qualifiers and lookup objects, you are limited to a single SQL statement. There are several ways to get around
this limitation.

You can create views in the database and use them as you would tables, either as source tables, or in the FROM clause of
the SELECT statement. This can simplify the SQL and make it easier to understand, but it also makes it harder to maintain.
The logic is now in two places: in an Informatica mapping and in a database view.

You can use in-line views, which are SELECT statements in the FROM or WHERE clause. This can help focus the query on
a subset of data in the table and work more efficiently than using a traditional join. Here is an example of an in-line view in
the FROM clause:

SELECT N.DOSE_REGIMEN_TEXT as DOSE_REGIMEN_TEXT,

N.DOSE_REGIMEN_COMMENT as DOSE_REGIMEN_COMMENT,

N.DOSE_VEHICLE_BATCH_NUMBER as DOSE_VEHICLE_BATCH_NUMBER,

N.DOSE_REGIMEN_ID as DOSE_REGIMEN_ID

FROM DOSE_REGIMEN N,

(SELECT DISTINCT R.DOSE_REGIMEN_ID as DOSE_REGIMEN_ID

FROM EXPERIMENT_PARAMETER R,

NEW_GROUP_TMP TMP

WHERE R.EXPERIMENT_PARAMETERS_ID = TMP.EXPERIMENT_PARAMETERS_ID

AND R.SCREEN_PROTOCOL_ID = TMP.BDS_PROTOCOL_ID

)X

WHERE N.DOSE_REGIMEN_ID = X.DOSE_REGIMEN_ID

ORDER BY N.DOSE_REGIMEN_ID

Surmounting the Single SQL Statement Limitation in DB2: Common Table Expressions and the WITH Clause

A Common Table Expression (CTE) stores data in a temporary result set during the execution of the SQL statement. The
WITH clause lets you assign a name to a CTE block. You can then reference the CTE block in multiple places in the query
by specifying the query name. For example:

WITH maxseq AS (SELECT MAX(seq_no) as seq_no FROM data_load_log WHERE load_status = 'P')

SELECT DISTINCT bio.experiment_group_id, bio.database_site_code

FROM bio_result bio, data_load_log log, maxseq

WHERE bio.update_date BETWEEN log.start_time AND log.end_time



AND COALESCE(bio.species_type_code, 'X') IN ('mice', 'rats', 'X')

AND log.seq_no = maxseq.seq_no

Here is another example using a WITH clause that uses recursive SQL:

WITH PERSON_TEMP (PERSON_ID, NAME, PARENT_ID, LVL) AS

(SELECT PERSON_ID, NAME, PARENT_ID, 1

FROM PARENT_CHILD

WHERE NAME IN ('FRED', 'SALLY', 'JIM')

UNION ALL

SELECT C.PERSON_ID, C.NAME, C.PARENT_ID, RECURS.LVL + 1

FROM PARENT_CHILD C, PERSON_TEMP RECURS

WHERE C.PARENT_ID = RECURS.PERSON_ID

AND RECURS.LVL < 5)

SELECT * FROM PERSON_TEMP

The PARENT_ID in any particular row refers to the PERSON_ID of the parent. The model is simplified, since everyone
actually has two parents, but it illustrates the technique. The LVL counter in the recursive member prevents infinite recursion.

CASE (DB2) vs. DECODE (Oracle)

The CASE syntax is allowed in Oracle, but you are much more likely to see DECODE logic, even for a single case, since it
was the only legal way to test a condition in earlier versions. Note that DECODE tests only for equality, so range
comparisons must be combined with a function such as SIGN.

DECODE is not allowed in DB2.

In Oracle:

SELECT EMPLOYEE, FNAME, LNAME,

DECODE(SIGN(SALARY - 10000), -1, 'NEED RAISE',

DECODE(SIGN(SALARY - 1000000), 1, 'OVERPAID',

'THE REST OF US')) AS SALARY_COMMENT

FROM EMPLOYEE

In DB2:



SELECT EMPLOYEE, FNAME, LNAME,

CASE

WHEN SALARY < 10000 THEN 'NEED RAISE'

WHEN SALARY > 1000000 THEN 'OVERPAID'

ELSE 'THE REST OF US'

END AS SALARY_COMMENT

FROM EMPLOYEE

Debugging Tip: Obtaining a Sample Subset

It is often useful to get a small sample of the data from a long-running query that returns a large set of data. The sampling
logic can be commented out or removed after the query is put into general use.

DB2 uses the FETCH FIRST n ROWS ONLY clause to do this as follows:

SELECT EMPLOYEE, FNAME, LNAME

FROM EMPLOYEE

WHERE JOB_TITLE = 'WORKERBEE'

FETCH FIRST 12 ROWS ONLY

Oracle does it this way using the ROWNUM pseudo-column:

SELECT EMPLOYEE, FNAME, LNAME

FROM EMPLOYEE

WHERE JOB_TITLE = 'WORKERBEE'

AND ROWNUM <= 12

INTERSECT, INTERSECT ALL, UNION, UNION ALL

Remember that both the UNION and INTERSECT operators return distinct rows, while UNION ALL and INTERSECT ALL
return all rows.

System Dates in Oracle and DB2

Oracle uses the system variable SYSDATE for the current time and date, and allows you to display either the time and/or
the date however you want with date functions.

Here is an example that returns yesterday's date in Oracle (the display format depends on the NLS_DATE_FORMAT setting):

SELECT TRUNC(SYSDATE) - 1 FROM DUAL



DB2 uses system variables, called special registers: CURRENT DATE, CURRENT TIME, and CURRENT TIMESTAMP.

Here is an example for DB2:

SELECT FNAME, LNAME, CURRENT DATE AS TODAY

FROM EMPLOYEE

Oracle: Using Hints

Hints affect the way a query or sub-query is executed and can, therefore, provide a significant performance increase in
queries. Hints cause the database engine to relinquish some control over how a query is executed, giving the developer
control over the execution. Hints are generally honored unless the requested execution is not possible. Because the
database engine does not evaluate whether the hint makes sense, developers must be careful in implementing hints. Oracle
has many types of hints: optimizer hints, access method hints, join order hints, join operation hints, and parallel execution
hints. Optimizer and access method hints are the most common.

In the latest versions of Oracle, cost-based query analysis is built in and rule-based analysis is no longer available. It was in
rule-based Oracle systems that hints naming specific indexes were most helpful. In Oracle version 9.2, however, the use of
/*+ INDEX */ hints may actually decrease performance significantly in many cases. If you are using older versions of Oracle,
however, the use of proper INDEX hints should help performance.

The optimizer hint allows the developer to change the optimizer's goals when creating the execution plan. The table below
provides a partial list of optimizer hints and descriptions.

Optimizer hints: Choosing the best join method

Sort/merge and hash joins are in the same group, but nested loop joins are very different. Sort/merge involves two sorts
while the nested loop involves no sorts. The hash join also requires memory to build the hash table.

Hash joins are most effective when the amount of data is large and one table is much larger than the other.

Here is an example of a select that performs best as a hash join:

SELECT COUNT(*) FROM CUSTOMERS C, MANAGERS M

WHERE C.CUST_ID = M.MANAGER_ID

Considerations Join Type

Better throughput Sort/Merge

Better response time Nested loop

Large subsets of data Sort/Merge

Index available to support join Nested loop

Limited memory and CPU available for sorting Nested loop

Parallel execution Sort/Merge or Hash



Joining all or most of the rows of large tables Sort/Merge or Hash

Joining small sub-sets of data and index available Nested loop

Hint Description

ALL_ROWS The database engine creates an execution plan that optimizes for throughput.
Favors full table scans. Optimizer favors Sort/Merge

FIRST_ROWS The database engine creates an execution plan that optimizes for response time.
It returns the first row of data as quickly as possible. Favors index lookups.
Optimizer favors Nested-loops

CHOOSE The database engine creates an execution plan that uses cost-based execution if
statistics have been run on the tables. If statistics have not been run, the engine
uses rule-based execution. If statistics have been run on empty tables, the
engine still uses cost-based execution, but performance is extremely poor.

RULE The database engine creates an execution plan based on a fixed set of rules.

USE_NL Use nested loops

USE_MERGE Use sort merge joins

HASH The database engine performs a hash scan of the table. This hint is ignored if the
table is not clustered.

Access method hints

Access method hints control how data is accessed. These hints are used to force the database engine to use indexes,
hash scans, or row id scans. The following table provides a partial list of access method hints.

Hint Description

ROWID The database engine performs a scan of the table based on ROWIDS.

INDEX DO NOT USE in Oracle 9.2 and above. The database engine performs an index
scan of a specific table, but in 9.2 and above, the optimizer does not use any
indexes other than those mentioned.

USE_CONCAT The database engine converts a query with an OR condition into two or more
queries joined by a UNION ALL statement.

The syntax for using a hint in a SQL statement is as follows:

Select /*+ FIRST_ROWS */ empno, ename

From emp;



Select /*+ USE_CONCAT */ empno, ename

From emp;

SQL Execution and Explain Plan

The simplest change is forcing the SQL to choose either rule-based or cost-based execution. This change can be
accomplished without changing the logic of the SQL query. While cost-based execution is typically considered the best
SQL execution, it relies upon optimization of the Oracle parameters and updated database statistics. If these statistics are
not maintained, cost-based query execution can suffer over time. When that happens, rule-based execution can actually
provide better execution time.

The developer can determine which type of execution is being used by running an explain plan on the SQL query in
question. Note that the step in the explain plan that is indented the most is the statement that is executed first. The results
of that statement are then used as input by the next level statement.

Typically, the developer should attempt to eliminate any full table scans and index range scans whenever possible. Full
table scans cause degradation in performance.
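
As a minimal sketch of generating an explain plan in Oracle (the EMP table and predicate are assumptions; DBMS_XPLAN
is available in Oracle 9i and later, while earlier releases can query the PLAN_TABLE directly):

EXPLAIN PLAN FOR
SELECT E.EMPNO, E.ENAME
FROM EMP E
WHERE E.DEPTNO = 10;

-- Display the plan (Oracle 9i and later)
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);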

Information provided by the Explain Plan can be enhanced using the SQL Trace Utility. This utility provides additional
information, including:

● The number of executions


● The elapsed time of the statement execution
● The CPU time used to execute the statement

The SQL Trace Utility adds value because it definitively shows the statements that are using the most resources, and can
immediately show the change in resource consumption after the statement has been tuned and a new explain plan has
been run.
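
A minimal sketch of enabling SQL Trace for the current session follows; the trace file name and tkprof options shown in the
comments are assumptions and vary by installation:

ALTER SESSION SET TIMED_STATISTICS = TRUE;
ALTER SESSION SET SQL_TRACE = TRUE;

-- Run the statement being tuned, then disable tracing
ALTER SESSION SET SQL_TRACE = FALSE;

-- Format the trace file written to the user_dump_dest directory, for example:
-- tkprof ora_12345.trc tuning_report.txt sys=no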

Using Indexes

The explain plan also shows whether indexes are being used to facilitate execution. The data warehouse team should
compare the indexes being used to those available. If necessary, the administrative staff should identify new indexes that
are needed to improve execution and ask the database administration team to add them to the appropriate tables. Once
implemented, the explain plan should be executed again to ensure that the indexes are being used. If an index is not being
used, it is possible to force the query to use it by using an access method hint, as described earlier.

Reviewing SQL Logic

The final step in SQL optimization involves reviewing the SQL logic itself. The purpose of this review is to determine
whether the logic is efficiently capturing the data needed for processing. Review of the logic may uncover the need for
additional filters to select only certain data, as well as the need to restructure the where clause to use indexes. In extreme
cases, the entire SQL statement may need to be re-written to become more efficient.

Reviewing SQL Syntax

SQL Syntax can also have a great impact on query performance. Certain operators can slow performance, for example:

● EXISTS clauses are almost always used in correlated sub-queries. They are executed for each row of the parent
query and cannot take advantage of indexes, while the IN clause is executed once and does use indexes, and
may be translated to a JOIN by the optimizer. If possible, replace EXISTS with an IN clause. For example:

SELECT * FROM DEPARTMENTS WHERE DEPT_ID IN



(SELECT DISTINCT DEPT_ID FROM MANAGERS) -- Faster

SELECT * FROM DEPARTMENTS D WHERE EXISTS

(SELECT * FROM MANAGERS M WHERE M.DEPT_ID = D.DEPT_ID)

Situation Exists In

Index supports sub-query Yes Yes

No index to support sub-query No (table scan per parent row) Yes (table scan once)

Sub-query returns many rows Probably not Yes

Sub-query returns one or a few rows Yes Yes

Most of the sub-query rows are eliminated by the parent query No Yes

Index in parent that matches sub-query columns Possibly not, since EXISTS cannot use the index Yes, IN uses the index

● Where possible, use the EXISTS clause instead of the INTERSECT clause. Simply modifying the query in this way
can improve performance by more than 100 percent.
● Where possible, limit the use of outer joins on tables. Remove the outer joins from the query and create lookup
objects within the mapping to fill in the optional information.

Choosing the Best Join Order

Place the smallest table first in the join order. This is often a staging table holding the IDs identifying the data in the
incremental ETL load.

Always put the small table column on the right side of the join. Use the driving table first in the WHERE clause, and work
from it outward. In other words, be consistent and orderly about placing columns in the WHERE clause.
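
As a hedged illustration of this guidance (the table names are assumptions), a small staging table of incremental IDs is
listed first in the join order and its column is kept on the right side of the join predicate:

SELECT ORD.ORDER_ID, ORD.ORDER_DATE, ORD.ORDER_AMOUNT
FROM NEW_ORDER_IDS STG, SALES_ORDERS ORD
WHERE ORD.ORDER_ID = STG.ORDER_ID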

Outer joins limit the join order that the optimizer can use. Don’t use them needlessly.

Anti-join with NOT IN, NOT EXISTS, MINUS or EXCEPT, OUTER JOIN

● Avoid use of the NOT IN clause. This clause causes the database engine to perform a full table scan. While this
may not be a problem on small tables, it can become a performance drain on large tables.

SELECT NAME_ID FROM CUSTOMERS

WHERE NAME_ID NOT IN

(SELECT NAME_ID FROM EMPLOYEES)



● Avoid use of the NOT EXISTS clause. This clause is better than the NOT IN, but still may cause a full table scan.

SELECT C.NAME_ID FROM CUSTOMERS C

WHERE NOT EXISTS

(SELECT * FROM EMPLOYEES E

WHERE C.NAME_ID = E.NAME_ID)

● In Oracle, use the MINUS operator to do the anti-join, if possible. In DB2, use the equivalent EXCEPT operator.

SELECT C.NAME_ID FROM CUSTOMERS C

MINUS

SELECT E.NAME_ID FROM EMPLOYEES E

● Also consider using outer joins with IS NULL conditions for anti-joins.

SELECT C.NAME_ID FROM CUSTOMERS C, EMPLOYEES E

WHERE C.NAME_ID = E.NAME_ID (+)

AND E.NAME_ID IS NULL

Review the database SQL manuals to determine the cost benefits or liabilities of certain SQL clauses as they may change
based on the database engine.

● In lookups from large tables, try to limit the rows returned to the set of rows matching the set in the source
qualifier. Add the WHERE clause conditions to the lookup. For example, if the source qualifier selects sales orders
entered into the system since the previous load of the database, then, in the product information lookup, only
select the products that match the distinct product IDs in the incremental sales orders.
● Avoid range lookups. A range lookup is a SELECT that uses a BETWEEN in the WHERE clause with limit values
retrieved from another table. Here is an example:

SELECT

R.BATCH_TRACKING_NO,

R.SUPPLIER_DESC,

R.SUPPLIER_REG_NO,

R.SUPPLIER_REF_CODE,

R.GCW_LOAD_DATE

FROM CDS_SUPPLIER R,

(SELECT L.LOAD_DATE_PREV AS LOAD_DATE_PREV,

L.LOAD_DATE AS LOAD_DATE

FROM ETL_AUDIT_LOG L

WHERE L.LOAD_DATE_PREV IN

(SELECT MAX(Y.LOAD_DATE_PREV) AS LOAD_DATE_PREV

FROM ETL_AUDIT_LOG Y)

)Z

WHERE

R.LOAD_DATE BETWEEN Z.LOAD_DATE_PREV AND Z.LOAD_DATE

The work-around is to use an in-line view to get the lower range in the FROM clause and join it to the main query, which
limits the higher date range in its WHERE clause. Use an ORDER BY on the lower limit in the in-line view. This is likely to
reduce the throughput time from hours to seconds.

Here is the improved SQL:

SELECT

R.BATCH_TRACKING_NO,

R.SUPPLIER_DESC,

R.SUPPLIER_REG_NO,

R.SUPPLIER_REF_CODE,

R.LOAD_DATE

FROM

/* In-line view for lower limit */

(SELECT

R1.BATCH_TRACKING_NO,

R1.SUPPLIER_DESC,

R1.SUPPLIER_REG_NO,

R1.SUPPLIER_REF_CODE,

R1.LOAD_DATE

FROM CDS_SUPPLIER R1,

(SELECT MAX(Y.LOAD_DATE_PREV) AS LOAD_DATE_PREV



FROM ETL_AUDIT_LOG Y) Z

WHERE R1.LOAD_DATE >= Z.LOAD_DATE_PREV

ORDER BY R1.LOAD_DATE) R,

/* end in-line view for lower limit */

(SELECT MAX(D.LOAD_DATE) AS LOAD_DATE

FROM ETL_AUDIT_LOG D) A /* upper limit */

WHERE R.LOAD_DATE <= A.LOAD_DATE

Tuning System Architecture

Use the following steps to improve the performance of any system:

1. Establish performance boundaries (baseline).


2. Define performance objectives.
3. Develop a performance monitoring plan.
4. Execute the plan.
5. Analyze measurements to determine whether the results meet the objectives. If objectives are met, consider
reducing the number of measurements because performance monitoring itself uses system resources. Otherwise
continue with Step 6.
6. Determine the major constraints in the system.
7. Decide where the team can afford to make trade-offs and which resources can bear additional load.
8. Adjust the configuration of the system. If it is feasible to change more than one tuning option, implement one at a
time. If there are no options left at any level, this indicates that the system has reached its limits and hardware
upgrades may be advisable.
9. Return to Step 4 and continue to monitor the system.
10. Return to Step 1.
11. Re-examine outlined objectives and indicators.
12. Refine monitoring and tuning strategy.

System Resources

The PowerCenter Server uses the following system resources:

● CPU
● Load Manager shared memory
● DTM buffer memory
● Cache memory

When tuning the system, evaluate the following considerations during the implementation process.

● Determine if the network is running at an optimal speed. Recommended best practice is to minimize the number of
network hops between the PowerCenter Server and the databases.
● Use multiple PowerCenter Servers on separate systems to potentially improve session performance.
● When all character data processed by the PowerCenter Server is US-ASCII or EBCDIC, configure the
PowerCenter Server for ASCII data movement mode. In ASCII mode, the PowerCenter Server uses one byte to
store each character. In Unicode mode, the PowerCenter Server uses two bytes for each character, which can
potentially slow session performance
● Check hard disks on related machines. Slow disk access on source and target databases, source and target file
systems, as well as the PowerCenter Server and repository machines can slow session performance.
● When an operating system runs out of physical memory, it starts paging to disk to free physical memory.
Configure the physical memory for the PowerCenter Server machine to minimize paging to disk. Increase system
memory when sessions use large cached lookups or sessions have many partitions.
● In a multi-processor UNIX environment, the PowerCenter Server may use a large amount of system resources.
Use processor binding to control processor usage by the PowerCenter Server.
● In a Sun Solaris environment, use the psrset command to create and manage a processor set. After creating a
processor set, use the pbind command to bind the PowerCenter Server to the processor set so that the processor
set only runs the PowerCenter Server. For details, see project system administrator and Sun Solaris
documentation.
● In an HP-UX environment, use the Process Resource Manager utility to control CPU usage in the system. The
Process Resource Manager allocates minimum system resources and uses a maximum cap of resources. For
details, see project system administrator and HP-UX documentation.
● In an AIX environment, use the Workload Manager in AIX 5L to manage system resources during peak demands.
The Workload Manager can allocate resources and manage CPU, memory, and disk I/O bandwidth. For details,
see project system administrator and AIX documentation.

Database Performance Features

Nearly everything is a trade-off in the physical database implementation. Work with the DBA in determining which of the
many available alternatives is the best implementation choice for the particular database. The project team must have a
thorough understanding of the data, database, and desired use of the database by the end-user community prior to
beginning the physical implementation process. Evaluate the following considerations during the implementation process.

● Denormalization. The DBA can use denormalization to improve performance by eliminating the constraints and
primary key to foreign key relationships, and also eliminating join tables.
● Indexes. Proper indexing can significantly improve query response time. The trade-off of heavy indexing is a
degradation of the time required to load data rows in to the target tables. Carefully written pre-session scripts are
recommended to drop indexes before the load and rebuilding them after the load using post-session scripts.
● Constraints. Avoid constraints if possible and enforce integrity by incorporating that additional logic in the
mappings.
● Rollback and Temporary Segments. Rollback and temporary segments are primarily used to store data for
queries (temporary) and INSERTs and UPDATES (rollback). The rollback area must be large enough to hold all
the data prior to a COMMIT. Proper sizing can be crucial to ensuring successful completion of load sessions,
particularly on initial loads.
● OS Priority. The priority of background processes is an often-overlooked problem that can be difficult to
determine after the fact. DBAs must work with the System Administrator to ensure all the database processes
have the same priority.
● Striping. Database performance can be increased significantly by implementing either RAID 0 (striping) or RAID 5
(striping with parity) to improve disk I/O throughput.
● Disk Controllers. Although expensive, striping and RAID 5 can be further enhanced by separating the disk
controllers.

Last updated: 13-Feb-07 17:47



Advanced Client Configuration Options

Challenge

Setting the Registry to ensure consistent client installations, resolve potential missing or invalid
license key issues, and change the Server Manager Session Log Editor to your preferred editor.

Description
Ensuring Consistent Data Source Names

To ensure the use of consistent data source names for the same data sources across the domain,
the Administrator can create a single "official" set of data sources, then use the Repository
Manager to export that connection information to a file. You can then distribute this file and import
the connection information for each client machine.

Solution:

● From Repository Manager, choose Export Registry from the Tools drop-down menu.
● For all subsequent client installs, simply choose Import Registry from the Tools drop-down
menu.

Resolving Missing or Invalid License Keys

The “missing or invalid license key” error occurs when attempting to install PowerCenter Client
tools on NT 4.0 or Windows 2000 with a userid other than Administrator.

This problem also occurs when the client software tools are installed under the Administrator
account, and a user with a non-administrator ID subsequently attempts to run the tools. The user
who attempts to log in using the normal ‘non-administrator’ userid will be unable to start the
PowerCenter Client tools. Instead, the software displays the message indicating that the license
key is missing or invalid.

Solution:

● While logged in as the installation user with administrator authority, use regedt32 to edit
the registry.
● Under HKEY_LOCAL_MACHINE open Software/Informatica/PowerMart Client Tools/.
● From the menu bar, select Security/Permissions, and grant read access to the users that
should be permitted to use the PowerMart Client. (Note that the registry entries for both
PowerMart and PowerCenter Server and client tools are stored as PowerMart Server and
PowerMart Client tools.)

Changing the Session Log Editor



In PowerCenter versions 6.0 to 7.1.2, the session and workflow log editor defaults to Wordpad
within the workflow monitor client tool. To choose a different editor, just select Tools>Options in the
workflow monitor. Then browse for the editor that you want on the General tab.

For PowerCenter versions earlier than 6.0, the editor does not default to Wordpad unless the
wordpad.exe can be found in the path statement. Instead, a window appears the first time a
session log is viewed from the PowerCenter Server Manager prompting the user to enter the full
path name of the editor to be used to view the logs. Users often set this parameter incorrectly and
must access the registry to change it.

Solution:

● While logged in as the installation user with administrator authority, use regedt32 to go into
the registry.
● Move to registry path location: HKEY_CURRENT_USER Software\Informatica\PowerMart
Client Tools\[CLIENT VERSION]\Server Manager\Session Files. From the menu bar,
select View Tree and Data.
● Select the Log File Editor entry by double clicking on it.
● Replace the entry with the appropriate editor entry (i.e., typically WordPad.exe or Write.
exe).
● Select Registry --> Exit from the menu bar to save the entry.

For PowerCenter version 7.1 and above, you should set the log editor option in the Workflow
Monitor.

The following figure shows the Workflow Monitor Options Dialog box to use for setting the editor for
workflow and session logs.



Adding a New Command Under Tools Menu

Other tools, in addition to the PowerCenter client tools, are often needed during development and
testing. For example, you may need a tool such as Enterprise manager (SQL Server) or Toad
(Oracle) to query the database. You can add shortcuts to executable programs from any client
tool’s ‘Tools’ drop-down menu to provide quick access to these programs.

Solution:

Choose ‘Customize’ under the Tools menu and add a new item. Once it is added, browse to find
the executable it is going to call (as shown below).



When this is done once, you can easily call another program from your PowerCenter client tools.

In the following example, TOAD can be called quickly from the Repository Manager tool.

Changing Target Load Type

In PowerCenter versions 6.0 and earlier, each time a session was created, it defaulted to be of type
‘bulk’, although this was not necessarily what was desired and could cause the session to fail under
certain conditions if not changed. In versions 7.0 and above, you can set a property in Workflow
Manager to choose the default load type to be either 'bulk' or 'normal'.



Solution:

● In the Workflow Manager tool, choose Tools > Options and go to the Miscellaneous tab.
● Click the button for either 'normal' or 'bulk', as desired.
● Click OK, then close and open the Workflow Manager tool.

After this, every time a session is created, the target load type for all relational targets will default to
your choice.

Resolving Undocked Explorer Windows

The Repository Navigator window sometimes becomes undocked. Docking it again can be
frustrating because double clicking on the window header does not put it back in place.

Solution:

● To get the Window correctly docked, right-click in the white space of the Navigator
window.
● Make sure that ‘Allow Docking’ option is checked. If it is checked, double-click on the title
bar of the Navigator Window.



Resolving Client Tool Window Display Issues

If one of the windows (e.g., Navigator or Output) in a PowerCenter 7.x or later client tool (e.g., Designer)
disappears, try the following solutions to recover it:

● Clicking View > Navigator


● Toggling the menu bar
● Uninstalling and reinstalling Client tools

Note: If none of the above solutions resolve the problem, you may want to try the following solution
using the Registry Editor. Be aware, however, that using the Registry Editor incorrectly can cause
serious problems that may require reinstalling the operating system. Informatica does not
guarantee that any problems caused by using Registry Editor incorrectly can be resolved. Use the
Registry Editor at your own risk.

Solution:

Starting with PowerCenter 7.x, the settings for the client tools are in the registry. Display issues can
often be resolved as follows:

● Close the client tool.


● Go to Start > Run and type "regedit".
● Go to the key HKEY_CURRENT_USER\Software\Informatica\PowerMart Client Tools\x.y.z
Where x.y.z is the version and maintenance release level of the PowerCenter client as
follows:

PowerCenter Version Folder Name
7.1 7.1
7.1.1 7.1.1
7.1.2 7.1.1
7.1.3 7.1.1
7.1.4 7.1.1
8.1 8.1

● Open the key of the affected tool (for the Repository Manager open Repository Manager
Options).
● Export all of the Toolbars sub-folders and rename them.
● Re-open the client tool.



Enhancing the Look of the Client Tools

The PowerCenter client tools allow you to customize the look and feel of the display. Here are a
few examples of what you can do.

Designer

● From the Menu bar, select Tools > Options.


● In the dialog box, choose the Format tab.
● Select the feature that you want to modify (i.e., workspace colors, caption colors, or fonts).

Changing the background workspace colors can help identify which workspace is currently
open. For example, changing the Source Analyzer workspace color to green or the Target Designer
workspace to purple to match their respective metadata definitions helps to identify the workspace.

Alternatively, click the Select Theme button to choose a color theme, which displays background
colors based on predefined themes.



Workflow Manager

You can modify the Workflow Manager using the same approach as the Designer tool.

From the Menu bar, select Tools > Options and click the Format tab. Select a color theme or
customize each element individually.

Workflow Monitor

You can modify the colors in the Gantt Chart view to represent the various states of a task. You can
also select two colors for one task to give it a dimensional appearance; this can be helpful in distinguishing
between running tasks, succeeded tasks, etc.

To modify the Gantt chart appearance, go to the Menu bar and select Tools > Options and Gantt
Chart.

Using Macros in Data Stencil

Data Stencil contains unsigned macros. Set the security level in Visio to Medium so you can enable
macros when you start Data Stencil. If the security level for Visio is set to High or Very High, you cannot run
the Data Stencil macros.

To set the security level for Visio, select Tools > Macros > Security from the menu. On the Security Level tab,
select Medium.

When you start Data Stencil, Visio displays a security warning about viruses in macros. Click
Enable Macros to enable the macros for Data Stencil.

Last updated: 19-Mar-08 19:00



Advanced Server Configuration Options

Challenge

Correctly configuring Advanced Integration Service properties, Integration Service process variables, and automatic memory settings;
using custom properties to write service logs to files; and adjusting semaphore and shared memory settings in the UNIX environment.

Description
Configuring Advanced Integration Service Properties

Use the Administration Console to configure the advanced properties, such as the character set of the Integration Service logs. To
edit the advanced properties, select the Integration Service in the Navigator, and click the Properties tab > Advanced Properties >
Edit.

The following Advanced properties are included:

Limit on Resilience Timeouts (Optional). Maximum amount of time (in seconds) that the service holds on to resources for
resilience purposes. This property places a restriction on clients that connect to the service. Any resilience timeouts that
exceed the limit are cut off at the limit. If the value of this property is blank, the value is derived from the domain-level
settings. Valid values are between 0 and 2592000, inclusive. Default is blank.

Resilience Timeout (Optional). Period of time (in seconds) that the service tries to establish or reestablish a connection to
another service. If blank, the value is derived from the domain-level settings. Valid values are between 0 and 2592000,
inclusive. Default is blank.

Configuring Integration Service Process Variables

One configuration best practice is to properly configure and leverage the Integration service (IS) process variables. The benefits
include:

● Ease of deployment across environments (DEV > TEST > PRD)


● Ease of switching sessions from one IS to another without manually editing all the sessions to change directory paths.
● All the variables are related to directory paths used by a given Integration Service.

You must specify the paths for Integration Service files for each Integration Service process. Examples of Integration Service files
include run-time files, state of operation files, and session log files.

Each Integration Service process uses run-time files to process workflows and sessions. If you configure an Integration Service to run
on a grid or to run on backup nodes, the run-time files must be stored in a shared location. Each node must have access to the
run-time files used to process a session or workflow. This includes files such as parameter files, cache files, input files, and output files.

State of operation files must be accessible by all Integration Service processes. When you enable an Integration Service, it creates
files to store the state of operations for the service. The state of operations includes information such as the active service requests,
scheduled tasks, and completed and running processes. If the service fails, the Integration Service can restore the state and recover
operations from the point of interruption.

All Integration Service processes associated with an Integration Service must use the same shared location. However, each
Integration Service can use a separate location.



By default, the installation program creates a set of Integration Service directories in the server\infa_shared directory. You can set
the shared location for these directories by configuring the process variable $PMRootDir to point to the same location for each
Integration Service process.

You must specify the directory path for each type of file. You specify the following directories using service process variables:

Each registered server has its own set of variables. The list is fixed, not user-extensible.

Service Process Variable Value

$PMRootDir (no default – user must insert a path)

$PMSessionLogDir $PMRootDir/SessLogs

$PMBadFileDir $PMRootDir/BadFiles

$PMCacheDir $PMRootDir/Cache

$PMTargetFileDir $PMRootDir/TargetFiles

$PMSourceFileDir $PMRootDir/SourceFiles

$PMExtProcDir $PMRootDir/ExtProc

$PMTempDir $PMRootDir/Temp

$PMSuccessEmailUser (no default – user must insert a path)

$PMFailureEmailUser (no default – user must insert a path)

$PMSessionLogCount 0

$PMSessionErrorThreshold 0

$PMWorkflowLogCount 0

$PMWorkflowLogDir $PMRootDir/WorkflowLogs

$PMLookupFileDir $PMRootDir/LkpFiles

$PMStorageDir $PMRootDir/Storage

Writing PowerCenter 8 Service Logs to Files

Starting with PowerCenter 8, all logging for services and sessions uses the Log service and can only be viewed through
the PowerCenter Administration Console. However, it is still possible to have this information logged to a file, similar to
previous versions.

To write all Integration Service logs (session, workflow, server, etc.) to files:

1. Log in to the Admin Console.


2. Select the Integration Service



3. Add a Custom property called UseFileLog and set its value to "Yes".
4. Add a Custom property called LogFileName and set its value to the desired file name.
5. Restart the service.

Integration Service Custom Properties (undocumented server parameters) can be entered here as well:

1. At the bottom of the list enter the Name and Value of the custom property
2. Click OK.

Adjusting Semaphore Settings on UNIX Platforms

When PowerCenter runs on a UNIX platform, it uses operating system semaphores to keep processes synchronized and to prevent
collisions when accessing shared data structures. You may need to increase these semaphore settings before installing the server.

Seven semaphores are required to run a session. Most installations require between 64 and 128 available semaphores, depending
on the number of sessions the server runs concurrently. This is in addition to any semaphores required by other software, such as
database servers.

The total number of available operating system semaphores is an operating system configuration parameter, with a limit per user and
system. The method used to change the parameter depends on the operating system:

● HP/UX: Use sam (1M) to change the parameters.


● Solaris: Use admintool or edit /etc/system to change the parameters.
● AIX: Use smit to change the parameters.

Setting Shared Memory and Semaphore Parameters on UNIX Platforms

Informatica recommends setting the following parameters as high as possible for the UNIX operating system. However, if you set
these parameters too high, the machine may not boot. Always refer to the operating system documentation for parameter limits. Note
that different UNIX operating systems set these variables in different ways or may be self tuning. Always reboot the system after
configuring the UNIX kernel.

HP-UX

For HP-UX release 11i the CDLIMIT and NOFILES parameters are not implemented. In some versions, SEMMSL is hard-coded to
500. NCALL is referred to as NCALLOUT.

Use the HP System V IPC Shared-Memory Subsystem to update parameters.

To change a value, perform the following steps:

1. Enter the /usr/sbin/sam command to start the System Administration Manager (SAM) program.
2. Double click the Kernel Configuration icon.
3. Double click the Configurable Parameters icon.
4. Double click the parameter you want to change and enter the new value in the Formula/Value field.
5. Click OK.
6. Repeat these steps for all kernel configuration parameters that you want to change.
7. When you are finished setting all of the kernel configuration parameters, select Process New Kernel from the Action menu.

The HP-UX operating system automatically reboots after you change the values for the kernel configuration parameters.

IBM AIX

None of the listed parameters requires tuning because each is dynamically adjusted as needed by the kernel.

SUN Solaris

Keep the following points in mind when configuring and tuning the SUN Solaris platform:



1. Edit the /etc/system file and add the following variables to increase shared memory segments:

set shmsys:shminfo_shmmax=value
set shmsys:shminfo_shmmin=value
set shmsys:shminfo_shmmni=value
set shmsys:shminfo_shmseg=value
set semsys:seminfo_semmap=value
set semsys:seminfo_semmni=value
set semsys:seminfo_semmns=value
set semsys:seminfo_semmsl=value
set semsys:seminfo_semmnu=value
set semsys:seminfo_semume=value

2. Verify the shared memory value changes:

# grep shmsys /etc/system

3. Restart the system:

# init 6

Red Hat Linux

The default shared memory limit (shmmax) on Linux platforms is 32MB. This value can be changed in the proc file system without a
restart.
For example, to allow 128MB, type the following command:

$ echo 134217728 >/proc/sys/kernel/shmmax

You can put this command into a script run at startup.

Alternatively, you can use sysctl(8), if available, to control this parameter. Look for a file called /etc/sysctl.conf and add a line similar
to the following:

kernel.shmmax = 134217728

This file is usually processed at startup, but sysctl can also be called explicitly later.

To view the values of other parameters, look in the files /usr/src/linux/include/asm-xxx/shmparam.h and /usr/src/linux/include/linux/
sem.h.

SuSE Linux

The default shared memory limits (shmmax and shmall) on SuSE Linux platforms can be changed in the proc file system without a
restart. For example, to allow 512MB, type the following commands:

#sets shmall and shmmax shared memory

echo 536870912 >/proc/sys/kernel/shmall #Sets shmall to 512 MB

echo 536870912 >/proc/sys/kernel/shmmax #Sets shmmax to 512 MB

You can also put these commands into a script run at startup.

Also change the settings for the system memory user limits by modifying a file called /etc/profile. Add lines similar to the following:

#sets user limits (ulimit) for system memory resources

ulimit -v 512000 #set virtual (swap) memory to 512 MB



ulimit -m 512000 #set physical memory to 512 MB

Configuring Automatic Memory Settings

With Informatica PowerCenter 8, you can configure the Integration Service to determine buffer memory size and session cache size
at runtime. When you run a session, the Integration Service allocates buffer memory to the session to move the data from the source
to the target. It also creates session caches in memory. Session caches include index and data caches for the Aggregator, Rank,
Joiner, and Lookup transformations, as well as Sorter and XML target caches.

Configure buffer memory and cache memory settings in the Transformation and Session Properties. When you configure buffer
memory and cache memory settings, consider the overall memory usage for best performance.

Enable automatic memory settings by configuring a value for the Maximum Memory Allowed for Auto Memory Attributes or the
Maximum Percentage of Total Memory Allowed for Auto Memory Attributes. If the value is set to zero for either of these attributes, the
Integration Service disables automatic memory settings and uses default values.

Last updated: 01-Feb-07 18:54



Organizing and Maintaining Parameter Files &
Variables

Challenge

Organizing variables and parameters in Parameter files and maintaining Parameter files for ease of use.

Description

Parameter files are a means of providing run-time values for parameters and variables defined in a
workflow, worklet, session, mapplet, or mapping. A parameter file can have values for multiple
workflows, sessions, and mappings, and can be created using a text editor such as Notepad or vi, or
generated by a shell script or an Informatica mapping.

Variable values are stored in the repository and can be changed within mappings. However, variable
values specified in parameter files supersede values stored in the repository. The values stored in the
repository can be cleared or reset using workflow manager.

Parameter File Contents

A Parameter File contains the values for variables and parameters. Although a parameter file can
contain values for more than one workflow (or session), it is advisable to build a parameter file to contain
values for a single workflow or a logical group of workflows for ease of administration. When using the
command line mode to execute workflows, multiple parameter files can also be configured and used for a
single workflow if the same workflow needs to be run with different parameters.

Types of Parameters and Variables

A parameter file contains the following types of parameters and variables:

● Service Variable. Defines a service variable for an Integration Service.


● Service Process Variable. Defines a service process variable for an Integration Service that
runs on a specific node.
● Workflow Variable. References values and records information in a workflow. For example,
use a workflow variable in a decision task to determine whether the previous task ran properly.
● Worklet Variable. References values and records information in a worklet. You can use
predefined worklet variables in a parent workflow, but cannot use workflow variables from the
parent workflow in a worklet.
● Session Parameter. Defines a value that can change from session to session, such as a
database connection or file name.
● Mapping Parameter. Defines a value that remains constant throughout a session, such as a
state sales tax rate.
● Mapping Variable. Defines a value that can change during the session. The Integration
Service saves the value of a mapping variable to the repository at the end of each successful



session run and uses that value the next time the session runs.

Configuring Resources with Parameter File

If a session uses a parameter file, it must run on a node that has access to the file. You create a
resource for the parameter file and make it available to one or more nodes. When you configure the
session, you assign the parameter file resource as a required resource. The Load Balancer dispatches
the Session task to a node that has the parameter file resource. If no node has the parameter file
resource available, the session fails.

Configuring Pushdown Optimization with Parameter File

Depending on the database workload, you may want to use source-side, target-side, or full pushdown
optimization at different times. For example, you may want to use partial pushdown optimization during
the database's peak hours and full pushdown optimization when activity is low. Use the $$PushdownConfig
mapping parameter to use different pushdown optimization configurations at different times. The parameter
lets you run the same session using the different types of pushdown optimization.

When you configure the session, choose $$PushdownConfig for the Pushdown Optimization attribute.

Define the parameter in the parameter file. Enter one of the following values for $$PushdownConfig in
the parameter file:

● None. The Integration Service processes all transformation logic for the session.
● Source. The Integration Service pushes part of the transformation logic to the source database.
● Source with View. The Integration Service creates a view to represent the SQL override value,
and runs an SQL statement against this view to push part of the transformation logic to the
source database.
● Target. The Integration Service pushes part of the transformation logic to the target database.
● Full. The Integration Service pushes all transformation logic to the database.
● Full with View. The Integration Service creates a view to represent the SQL override value,
and runs an SQL statement against this view to push part of the transformation logic to the
source database. The Integration Service pushes any remaining transformation logic to the
target database.
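
For example, a parameter file entry for a hypothetical folder, workflow, and session might look like the
following (the folder, workflow, and session names are assumptions):

[MyFolder.WF:wf_daily_load.ST:s_m_load_sales]

$$PushdownConfig=Source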

Parameter File Name

Informatica recommends giving the Parameter File the same name as the workflow with a suffix of ".par".
This helps in identifying and linking the parameter file to a workflow.

Parameter File: Order of Precedence

While it is possible to assign Parameter Files to a session and a workflow, it is important to note that a
file specified at the workflow level always supersedes files specified at session levels.

Parameter File Location



Each Integration Service process uses run-time files to process workflows and sessions. If you configure
an Integration Service to run on a grid or to run on backup nodes, the run-time files must be stored in a
shared location. Each node must have access to the run-time files used to process a session or
workflow. This includes files such as parameter files, cache files, input files, and output files.

Place the Parameter Files in a directory that can be accessed using a server variable. This helps to
move the sessions and workflows to a different server without modifying workflow or session properties.
You can override the location and name of the parameter file specified in the session or workflow while
executing workflows via the pmcmd command.

The following points apply to both parameter and variable files; however, they are more relevant to
parameters and parameter files and are therefore detailed accordingly.

Multiple Parameter Files for a Workflow

To run a workflow with different sets of parameter values during every run:

1. Create multiple parameter files with unique names.


2. Change the parameter file name (to match the parameter file name defined in the session or
workflow properties). You can do this manually or by using a pre-session shell (or batch) script.
3. Run the workflow.

Alternatively, run the workflow using pmcmd with the -paramfile option in place of steps 2 and 3.
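
As a sketch, the command line might look like the following (the domain, service, folder, and workflow names
and the file path are assumptions; verify the option names against the pmcmd reference for your PowerCenter version):

pmcmd startworkflow -sv IntSvc_Dev -d Domain_Dev -u Administrator -p <password> -f PROJ_DP -paramfile /app/params/wf_monthly_load.par wf_monthly_load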

Generating Parameter Files

Based on requirements, you can obtain the values for certain parameters from relational tables or
generate them programmatically. In such cases, the parameter files can be generated dynamically using
shell (or batch) scripts or using Informatica mappings and sessions.

Consider a case where a session has to be executed only on specific dates (e.g., the last working day of
every month), which are listed in a table. You can create the parameter file containing the next run date
(extracted from the table) in more than one way.

Method 1:

1. The workflow is configured to use a parameter file.


2. The workflow has a decision task before running the session: comparing the Current System
date against the date in the parameter file.
3. Use a shell (or batch) script to create a parameter file. Use an SQL query to extract a single
date, which is greater than the System Date (today) from the table and write it to a file with
required format.
4. The shell script uses pmcmd to run the workflow.
5. The shell script is scheduled using cron or an external scheduler to run daily.

The following figure shows the use of a shell script to generate a parameter file.



The following figure shows a generated parameter file.
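
As an illustration of Method 1, the script might run a query like the following and write the result into the
parameter file (the table, column, workflow, and parameter names are assumptions):

SELECT MIN(RUN_DATE) AS NEXT_RUN_DATE
FROM ETL_RUN_CALENDAR
WHERE RUN_DATE > TRUNC(SYSDATE);

-- The script then writes an entry such as:
-- [PROJ_DP.WF:wf_monthly_load]
-- $$NextRunDate=02/28/2005 00:00:00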

Method 2:

1. The Workflow is configured to use a parameter file.


2. The initial value for the date parameter is the first date on which the workflow is to run.
3. The workflow has a decision task before running the session: comparing the Current System
date against the date in the parameter file
4. The last task in the workflow generates the parameter file for the next run of the workflow (using
a command task calling a shell script, or a session task that uses a mapping). This task
extracts a date that is greater than the system date (today) from the table and writes it into the
parameter file in the required format.
5. Schedule the workflow using the Scheduler to run daily (as shown in the following figure).

Parameter File Templates

In some other cases, the parameter values change between runs, but the change can be incorporated
into the parameter files programmatically. There is no need to maintain separate parameter files for
each run.

Consider, for example, a service provider who gets the source data for each client from flat files located
in client-specific directories and writes processed data into a global database. The source data structure,
target data structure, and processing logic are all the same. The log file for each client run has to be
preserved in a client-specific directory. The directory names include the client ID as part of the directory
structure (e.g., /app/data/Client_ID/).

You can complete the work for all clients using a set of mappings, sessions, and a workflow, with one
parameter file per client. However, the number of parameter files may become cumbersome to manage
when the number of clients increases.



In such cases, a parameter file template (i.e., a parameter file containing values for some parameters
and placeholders for others) may prove useful. Use a shell (or batch) script at run time to create the actual
parameter file for a specific client, replacing the placeholders with actual values, and then execute the
workflow using pmcmd.

[PROJ_DP.WF:Client_Data]

$InputFile_1=/app/data/Client_ID/input/client_info.dat

$LogFile=/app/data/Client_ID/logfile/wfl_client_data_curdate.log

Using a script, replace “Client_ID” and “curdate” with actual values before executing the workflow.
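A hedged sketch of such a substitution script follows; the template and parameter file locations, connection details, and Integration Service name are assumptions, and error handling is omitted.

#!/bin/ksh
# Build a client-specific parameter file from the template and run the workflow.
CLIENT_ID=$1
CURDATE=$(date +%Y%m%d)

# Replace the placeholders in the template with the actual client id and run date.
sed -e "s/Client_ID/${CLIENT_ID}/g" -e "s/curdate/${CURDATE}/g" \
  /app/params/wf_Client_Data_template.txt > /app/params/wf_Client_Data_${CLIENT_ID}.txt

# Run the shared workflow with the generated, client-specific parameter file.
pmcmd startworkflow -sv INT_SVC -d DOMAIN_DEV -u Administrator -p password \
  -f PROJ_DP -paramfile /app/params/wf_Client_Data_${CLIENT_ID}.txt Client_Data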

The following text is an excerpt from a parameter file that contains service variables for one Integration
Service and parameters for four workflows:

[Service:IntSvs_01]

$PMSuccessEmailUser=pcadmin@mail.com

$PMFailureEmailUser=pcadmin@mail.com

[HET_TGTS.WF:wf_TCOMMIT_INST_ALIAS]

$$platform=unix

[HET_TGTS.WF:wf_TGTS_ASC_ORDR.ST:s_TGTS_ASC_ORDR]

$$platform=unix

$DBConnection_ora=qasrvrk2_hp817

[ORDERS.WF:wf_PARAM_FILE.WT:WL_PARAM_Lvl_1]

$$DT_WL_lvl_1=02/01/2005 01:05:11

$$Double_WL_lvl_1=2.2

[ORDERS.WF:wf_PARAM_FILE.WT:WL_PARAM_Lvl_1.WT:NWL_PARAM_Lvl_2]

$$DT_WL_lvl_2=03/01/2005 01:01:01

$$Int_WL_lvl_2=3

$$String_WL_lvl_2=ccccc



Use Case 1: Fiscal Calendar-Based Processing

Some financial and retail organizations use a fiscal calendar for accounting purposes. Use mapping
parameters to process the correct fiscal period.

For example, create a calendar table in the database with the mapping between the Gregorian calendar
and fiscal calendar. Create mapping parameters in the mappings for the starting and ending dates.
Create another mapping with the logic to create a parameter file. Run the parameter file creation
session before running the main session.

The calendar table can be joined directly with the main table, but the performance may not be good in
some databases, depending upon how the indexes are defined. Using a parameter file avoids this join and
can result in better performance.
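As an illustration, the parameter-file-creation session might produce a file such as the following for the current fiscal period; the folder, workflow, session, and parameter names are assumptions.

[FIN_DW.WF:wf_load_gl.ST:s_load_gl]
$$FISCAL_START_DATE=01/29/2007 00:00:00
$$FISCAL_END_DATE=02/25/2007 23:59:59

The Source Qualifier filter in the main mapping can then compare the transaction date against $$FISCAL_START_DATE and $$FISCAL_END_DATE instead of joining to the calendar table.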

Use Case 2: Incremental Data Extraction

Mapping parameters and variables can be used to extract inserted/updated data since the previous extract.
Use the mapping parameters or variables in the source qualifier to determine the beginning timestamp
and the end timestamp for extraction.

For example, create a user-defined mapping variable $$PREVIOUS_RUN_DATE_TIME that saves the
timestamp of the last row the Integration Service read in the previous session. Use this variable for the
beginning timestamp and the built-in variable $$$SessStartTime for the end timestamp in the source
filter.

Use the following filter to incrementally extract data from the database:

LOAN.record_update_timestamp > TO_DATE(‘$$PREVIOUS_RUN_DATE_TIME’) and

LOAN.record_update_timestamp <= TO_DATE(‘$$$SessStartTime’)

Use Case 3: Multi-Purpose Mapping

Mapping parameters can be used to extract data from different tables using a single mapping. In some
cases the table name is the only difference between extracts.

For example, there are two similar extracts from tables FUTURE_ISSUER and EQUITY_ISSUER; the
column names and data types within the tables are the same. Use a mapping parameter $$TABLE_NAME in
the Source Qualifier SQL override and create two parameter files, one for each table name. Run the workflow
using the pmcmd command with the corresponding parameter file, or create two sessions, each with its
corresponding parameter file.
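A hedged illustration follows; the folder, workflow, and session names and the column list are assumptions.

Parameter file for the FUTURE_ISSUER extract:

[ISSUERS.WF:wf_load_issuer.ST:s_load_issuer]
$$TABLE_NAME=FUTURE_ISSUER

Parameter file for the EQUITY_ISSUER extract:

[ISSUERS.WF:wf_load_issuer.ST:s_load_issuer]
$$TABLE_NAME=EQUITY_ISSUER

Source Qualifier SQL override:

SELECT issuer_id, issuer_name, issue_date FROM $$TABLE_NAME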

Use Case 4: Using Workflow Variables

You can create variables within a workflow. When you create a variable in a workflow, it is valid only in



that workflow. Use the variable in tasks within that workflow. You can edit and delete user-defined
workflow variables.

Use user-defined variables when you need to make a workflow decision based on criteria you specify.
For example, you create a workflow to load data to an orders database nightly. You also need to load a
subset of this data to headquarters periodically, every tenth time you update the local orders database.
Create separate sessions to update the local database and the one at headquarters. Use a user-defined
variable to determine when to run the session that updates the orders database at headquarters.

To configure user-defined workflow variables, set up the workflow as follows:

Create a persistent workflow variable, $$WorkflowCount, to represent the number of times the workflow
has run. Add a Start task and both sessions to the workflow. Place a Decision task after the session that
updates the local orders database. Set up the decision condition to check to see if the number of
workflow runs is evenly divisible by 10. Use the modulus (MOD) function to do this. Create an
Assignment task to increment the $$WorkflowCount variable by one.

Link the Decision task to the session that updates the database at headquarters when the decision
condition evaluates to true. Link it to the Assignment task when the decision condition evaluates to false.

When you configure workflow variables using conditions, the session that updates the local database
runs every time the workflow runs. The session that updates the database at headquarters runs every
10th time the workflow runs.
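A hedged sketch of the expressions involved follows; the Decision task name is an assumption and the exact syntax should be confirmed in Workflow Manager.

Assignment task expression (increments the run counter):
$$WorkflowCount = $$WorkflowCount + 1

Decision task condition (true on every tenth run):
MOD($$WorkflowCount, 10) = 0

Link condition from the Decision task to the headquarters session:
$Decision_HQ.Condition = TRUE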

Last updated: 09-Feb-07 16:20



Platform Sizing

Challenge

Determining the appropriate platform size to support the PowerCenter environment


based on customer infrastructure and requirements.

Description

The main factors that affect the sizing estimate are the input parameters that are based
on the requirements and the constraints imposed by the existing infrastructure and
budget. Other important factors include choice of Grid/High Availability Option, future
growth estimates and real time versus batch load requirements.

The required platform size to support PowerCenter depends upon each customer’s
unique infrastructure and processing requirements. The Integration Service allocates
resources for individual extraction, transformation and load (ETL) jobs or sessions.
Each session has its own resource requirement. The resources required for the
Integration Service depend upon the number of sessions, the complexity of each
session (i.e., what it does while moving data) and how many sessions run concurrently.
This Best Practice discusses the relevant questions pertinent to estimating the platform
requirements.

TIP
An important concept regarding platform sizing is not to size your environment
too soon in the project lifecycle. A common mistake is to size the servers
before any ETL is designed or developed, and in many cases these platforms
are too small for the resulting system. Thus, it is better to analyze sizing
requirements after the data transformation processes have been well defined
during the design and development phases.

Environment Questions

To determine platform size, consider the following questions regarding your


environment:



● What sources do you plan to access?
● How do you currently access those sources?
● Have you decided on the target environment (e.g., database, hardware,
operating system)? If so, what is it?
● Have you decided on the PowerCenter environment (e.g., hardware, operating
system, 32/64-bit processing)?
● Is it possible for the PowerCenter services to be on the same server as the
target?
● How do you plan to access your information (e.g., cube, ad-hoc query tool)
and what tools will you use to do this?
● What other applications or services, if any, run on the PowerCenter server?

● What are the latency requirements for the PowerCenter loads?


PowerCenter Sizing Questions

To determine server size, consider the following questions:

● Is the overall ETL task currently being performed? If so, how is it being done,
and how long does it take?
● What is the total volume of data to move?
● What is the largest table (i.e., bytes and rows)? Is there any key on this table
that can be used to partition load sessions, if needed?
● How often does the refresh occur?
● Will refresh be scheduled at a certain time, or driven by external events?
● Is there a "modified" timestamp on the source table rows?
● What is the batch window available for the load?
● Are you doing a load of detail data, aggregations, or both?
● If you are doing aggregations, what is the ratio of source/target rows for the
largest result set? How large is the result set (bytes and rows)?

The answers to these questions provide an approximation guide to the factors that
affect PowerCenter's resource requirements. To simplify the analysis, focus on large
jobs that drive the resource requirement.

PowerCenter Resource Consumption



The following sections summarize some recommendations for PowerCenter resource
consumption.

Processor

1 to 1.5 CPUs per concurrent non-partitioned session or transformation job.

Note: A virtual CPU is counted as 0.75 of a physical CPU. For example, a machine with 4 CPUs of 4
cores each (16 cores) can be treated as having the capacity of 12 virtual CPUs for sizing purposes.

Memory

● 20 to 30MB of memory for the Integration Service for session coordination.
● 20 to 30MB of memory per session, if there are no aggregations, lookups, or heterogeneous data joins. Note that 32-bit systems have an operating system limitation of 2GB per session.
● Caches for aggregation, lookups or joins use additional memory:
    ● Lookup tables are cached in full; the memory consumed depends on the size of the tables and the selected data ports.
    ● Aggregate caches store the individual groups; more memory is used if there are more groups. Sorting the input to aggregations greatly reduces the need for memory.
    ● Joins cache the master table in a join; the memory consumed depends on the size of the master.
● Full Pushdown Optimization uses much fewer resources on the PowerCenter server in comparison to partial (source/target) pushdown optimization.
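As a rough worked example only, assume ten concurrent non-partitioned sessions, three of which build lookup caches of about 500MB each. Applying the guidelines above:

Processors: 10 sessions * 1 to 1.5 CPUs = 10 to 15 CPUs
Memory: 30MB (Integration Service) + 10 * 30MB (session overhead) + 3 * 500MB (lookup caches), or approximately 1.8GB

The workload figures are illustrative assumptions; the actual estimate should be driven by the largest jobs identified through the sizing questions.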

System Recommendations

PowerCenter has a service-oriented architecture that provides the ability to scale


services and share resources across multiple servers using the Grid Option. The Grid



Option allows for adding capacity at a low cost while providing implicit High Availability
with the active/active Integration Service configuration. Below are the
recommendations for a single node PowerCenter server.

Minimum Server

1 Node, 4 CPUs and 16GB of memory (instead of the minimal requirement of 4GB
RAM) and 6 GB storage for PowerCenter binaries. A separate file system is
recommended for the infa_shared working file directory and it can be sized depending
on the work load profile.

Disk Space

Disk space is not a factor if the machine is used only for PowerCenter services, unless
the following conditions exist:

● Data is staged to flat files on the PowerCenter machine.


● Data is stored in incremental aggregation files for adding data to aggregates.
The space consumed is about the size of the data aggregated.
● Temporary space is needed for paging for transformations that require large
caches that cannot be entirely cached by system memory
● Session logs are saved by timestamp

If any of these factors is true, additional storage should be allocated for the file system
used by the infa_shared directory. Typically, Informatica customers allocate a minimum
of 100 to 200 GB for this file system. Informatica recommends monitoring disk
space on a regular basis or maintaining a script to purge unused files.

Sizing Analysis

The basic goal is to size the server so that all jobs can complete within the specified
load window. Consider the answers to the questions in the "Environment Questions"
and "PowerCenter Sizing Questions" sections to estimate the required number of sessions,
the volume of data that each session moves, and its lookup table, aggregation, and
heterogeneous join caching requirements. Use these estimates with the
recommendations in the "PowerCenter Resource Consumption" section to determine
the number of processors, amount of memory, and disk space required to meet the
load window. PowerCenter provides an advanced level of automatic memory
configuration, with the option of using manual configuration. The minimum required
cache memory for each active transformation in a mapping can be calculated
and accumulated for concurrent jobs.

You can use the “Cache Calculator” feature for Aggregator, Joiner, Rank, and Lookup
transformations.

Note that the deployment environment often creates performance constraints that
hardware capacity cannot overcome. The Integration Service throughput is usually
constrained by one or more of the environmental factors addressed by the questions in
the "Environment" section. For example, if the data sources and target are both remote
from the PowerCenter server, the network is often the constraining factor. At some
point, additional sessions, processors, and memory may not yield faster execution
because the network (not the PowerCenter services) imposes the performance limit.
The hardware sizing analysis is highly dependent on the environment in which the
server is deployed. You need to understand the performance characteristics of the
environment before making any sizing conclusions.

It is also vitally important to remember that other applications (in addition to


PowerCenter) are likely to use the platform. PowerCenter often runs on a server with a
database engine and query/analysis tools. In fact, in an environment where
PowerCenter, the target database, and query/analysis tools all run on the same server,
the query/analysis tool often drives the hardware requirements. However, if the loading
is performed after business hours, the query/analysis tools requirements may not be a
sizing limitation.

Last updated: 27-May-08 14:44



PowerExchange for Oracle CDC

Challenge

Configure the Oracle environment for optimal performance when using


PowerExchange Change Data Capture (CDC) in a production environment.

Description

There are two performance types that need to be considered when dealing with Oracle
CDC: latency of the data and restartability of the environment. Some of the factors that
impact these performance types are configurable within PowerExchange, while others
are not. These two performance types are addressed separately in this Best Practice.

Data Latency Performance

The objective of latency performance is to minimize the amount of time that it takes for
a change made to the source database to appear in the target database. Some of the
factors that can affect latency performance are discussed below.

Location of PowerExchange CDC

The optimal location for installing PowerExchange CDC is on the server that contains
the Oracle source database. This eliminates the need to use the network to pass data
between Oracle’s LogMiner and PowerExchange. It also eliminates the need to use
SQL*Net for this process and it minimizes the amount of data being moved across the
network. For best results, install the PowerExchange Listener on the same server as
the source database server.

Volume of Data

The volume of data that the Oracle Log Miner has to process in order to provide
changed data to PowerExchange can have a significant impact on performance. Bear
in mind that in addition to the changed data rows, other processes may be writing large
volumes of data to the Oracle redo logs. These include, but are not limited to:

● Oracle catalog dumps



● Oracle workload monitor customizations
● Other (non-Oracle) tools that use the redo logs to provide proprietary
information

In order to optimize PowerExchange’s CDC performance, the amount of data these


processes write to the Oracle redo logs needs to be minimized (both in terms of volume
and frequency). This includes minimizing the invocations of the LogMiner to just a
single occurrence. Review the processes that are actively writing data to the Oracle
redo logs and tune them within the context of a production environment. Monitoring the
redo log switches and the creation of archived log files is one way to determine how
busy the source database is. The size of the archived log files and how often they are
being created over a day will give a good idea about performance implications.

Server Workload

Optimize the performance of the Oracle database server by reducing the number of
unnecessary tasks it is performing concurrently with the PowerExchange CDC
components. This may include a full review of the backup and restore schedules,
Oracle import and export processing and other application software utilized within the
production server environment.

PowerCenter also contributes to the workload on the server where PowerExchange


CDC is running, so it is important to optimize these workload tasks. This can be
accomplished through mapping design. If possible, include all of the processing of
PowerExchange CDC sources within the same mapping. This will minimize the number
of tasks generated and will ensure that all of the required data from either the Oracle
archive log (i.e., near real time) or the CDC files (i.e., CAPXRT, condense) process
within a single pass of the logs or CDC files.

Condense Option Considerations

The condense option for Oracle CDC provides only the required data by reducing the
collected data based on the Unit of Work information. This can prevent the transfer of
unnecessary data and save CPU and memory resources. In order to properly allocate
space for the files created by the condense process it is necessary to perform capacity
planning.

In determining the space required for the CDC data files it is important to know whether
before and after images (or just after images) are required. Also, the retention period
for these files must be considered. The retention period is defined in the
COND_CDCT_RET_P parameter in the dtlca.cfg file. The value that appears for this



parameter specifies the retention period in days. The general algorithms for calculating
this space are outlined below.

After Image Only –

Estimated condense file disk space for Table A =
((The width of Table A in bytes * Estimated number of data changes for Table A per 24-hour period) + 700 bytes for the six fields added to each CDC record) * the value of the COND_CDCT_RET_P parameter

Before/After Image –

Estimated condense file disk space for Table A =
(((The width of Table A in bytes * Estimated number of data changes for Table A per 24-hour period) * 2) + 700 bytes for the six fields added to each CDC record) * the value of the COND_CDCT_RET_P parameter
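Applying the After Image formula literally to assumed values (a 500-byte row, 100,000 changes per 24-hour period, and COND_CDCT_RET_P=5) gives ((500 * 100,000) + 700) * 5 = 250,003,500 bytes, or roughly 240 MB; with before and after images the same assumptions roughly double that figure.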

Accurate capacity planning can be accomplished by running sample condense jobs for
a given number of source changes to determine the storage required. The size of files
created by the condense process can be used for projecting the actual storage required
in a production environment.

Continuous Capture Extract Option Considerations

When Continuous Capture Extract is used for Oracle CDC, condense files can be
consumed with CAPXRT processing. Since the PowerCenter session waits for the
creation of new condense files (rather than stopping and restarting) the CPU and
memory impact of real-time processing is reduced. Similar to the Condense option,
there is a need to perform proper capacity planning for the files created as a result of
using the Continuous Capture Extract option.

PowerExchange CDC Restart Performance

The amount of time required to restart the PowerExchange CDC process should be
considered when determining performance. The PowerExchange CDC process will
need to be restarted whenever any of the following events occur:

● A schema change is made to a table.



● An existing Change Registration is amended.
● The PowerExchange service pack is applied or a configuration file is changed.
● An Oracle patch or bug fix is applied.
● An Operating System patch or upgrade is applied.

A copy of the Oracle catalog must be placed on the archive log in order for LogMiner to
function correctly. The frequency of these copies is very site specific and it can impact
the amount of time that it takes the CDC process to restart.

There are several parameters that appear in the dbmover.cfg configuration file that can
assist in optimizing restart performance. These parameters are:

RSTRADV: The RSTRADV parameter specifies the number of seconds to wait after
receiving a Unit of Work (UOW) for a source table before advancing the restart tokens
by returning an “empty” UOW. This parameter is very beneficial in cases where the
frequency of updates on some tables is low in comparison to other tables.

CATINT: The CATINT parameter specifies the frequency in which the Oracle catalog is
copied to the archive logs. Since LogMiner needs a copy of the catalog on the archive
log to become operational, this is an important parameter as it will have an impact on
which archive log is used to restart the CDC process. When Oracle places a catalog
copy on the archive log, it will first flush all of the online redo logs to the archive logs
prior to writing out the catalog.

CATBEGIN: The CATBEGIN parameter specifies the time of day that the Oracle
catalog copy process should begin. The time of day that is specified in this parameter is
based on a 24 hour clock.

CATEND: The CATEND parameter specifies the time of day that the Oracle catalog
copy process should end. The time of day that is specified in this parameter is based
on a 24 hour clock.

It is important to carefully code these parameters as it will impact the amount of time it
takes to restart the PowerExchange CDC process.

Sample of the dbmover.cfg parameters that affect the Oracle CDC process.

/********************************************************************/
/* Change Data Capture Connection Specifications
/********************************************************************/



/*
CAPT_PATH=/mountpoint/infapwx/v851/chgreg
CAPT_XTRA=/mountpoint/infapwx/v851/chgreg/camaps
/*
CAPI_SRC_DFLT=(ORA,CAPIUOWC)
CAPI_SRC_DFLT= (CAPX,CAPICAPX)
/*
/********************************************************************/
/* Oracle Change Data Capture Parameters
/********************************************************************/
/* see Oracle Adapter Guide
/* Chapter 3 – Preparing for Oracle CDC
/* see Reference Guide
/* Chapter 9 - Configuration File Parameters
/* see Readme_ORACAPT.txt
/********************************************************************/
/*
/********************************************************************/
/*************** Oracle - Change Data Capture **************/
/********************************************************************/
ORACLEID=(ORACAPT,oracle_sid,connect_string,capture_connect_string)
CAPI_CONNECTION=(NAME=CAPIUOWC,TYPE=(UOWC,CAPINAME=CAPIORA,
RSTRADV=60))
CAPI_CONNECTION=(NAME=CAPIORA,DLLTRACE=ABC,TYPE=(ORCL,
CATINT=30,
CATBEGIN=00:01,CATEND=23:59, COMMITINT=5,
REPNODE=local,BYPASSUF=Y,ORACOLL=ORACAPT))
/*
/****************** Oracle - Continuous CAPX ***************/
/*
/*CAPI_CONNECTION=(NAME=CAPICAPX,TYPE=(CAPX,DFLTINST=ORACAPT))
/*
Sample of the dtlca.cfg parameters that control the Oracle CDC condense
process.
/********************************************************************/
/* PowerExchange Condense Configuration File
/* See Oracle Adapter Guide
/* Chapter 3 – Preparing for Oracle CDC
/* Chapter 6 – Condensing Changed Data
/********************************************************************/
/* The value for the DBID parameter must match the Collection-ID
/* contained in the ORACLE-ID statement in the dbmover.cfg file.
/********************************************************************/
/*
DBID=ORACAPT



DB_TYPE=ORA
/*
EXT_CAPT_MASK=/mountpoint/infapwx/v851/condense/condense
CHKPT_BASENAME=/mountpoint/infapwx/v851/condense/condense.CHKPT
CHKPT_NUM=10
COND_CDCT_RET_P=5
/*
/********************************************************************/
/* COLL_END_LOG equal to 1 means BATCH MODE
/* COLL_END_LOG equal to 0 means CONTINUOUS MODE
/********************************************************************/
/*
COLL_END_LOG=0
NO_DATA_WAIT=2
NO_DATA_WAIT2=60
/*
/********************************************************************/
/* FILE_SWITCH_CRIT of M means minutes
/* FILE_SWITCH_CRIT of R means records
/********************************************************************/
/*
FILE_SWITCH_CRIT=M
FILE_SWITCH_VAL=15
/*
/********************************************************************/
/* CAPT_IMAGE of AI means AFTER IMAGE
/* CAPT_IMAGE of BA means BEFORE and AFTER IMAGE
/********************************************************************/
/*
CAPT_IMAGE=AI
/*
UID=Database User Id
PWD=Database User Id Password
/*
SIGNALLING=Y
/*
/********************************************************************/
/* The following parameters are only used during a cold start and force
/* the cold start to use the most recent catalog copy. Without these parameters,
/* if the v_$transaction and v_$archive_log views are out of sync, there is
/* a very good chance that the most recent catalog copy will not
/* be used for the cold start.
/********************************************************************/
/*
SEQUENCE_TOKEN=0



RESTART_TOKEN=0

Last updated: 27-May-08 15:07



PowerExchange for SQL Server CDC

Challenge

Install, configure, and performance tune PowerExchange for MS SQL Server Change Data Capture (CDC).

Description

PowerExchange Real-Time for MS SQL Server uses SQL Server publication technology to capture changed
data. To use this feature, Distribution must be enabled. The publisher database handles replication while the
distribution database transfers the replicated data to PowerExchange, which is installed on the distribution
database server.

The following figure depicts a typical high-level architecture:

When looking at the architecture for SQL Server capture, we see that PowerExchange treats the SQL Server
Publication process as a “virtual” change stream. By turning the standard SQL Server publication process on,
SQL Server publishes changes to the SQL Server Distribution database. PowerExchange then reads the
changes from the Distribution database.

When Publication is used and the Distribution function is enabled, support for capturing changes for a table of
interest is dynamically activated through the registration of a source in the PowerExchange Navigator GUI (i.e.,
PowerExchange makes the appropriate calls to SQL Server automatically, via SQL DMO objects).

Key Setup Steps



The key steps involved in setting up the change capture process are:

1. Modify the PowerExchange dbmover.cfg file on the server.

Example statements that must be added:

CAPI_CONN_NAME=CAPIMSSC
CAPI_CONNECTION=(NAME=CAPIMSSC,
TYPE=(MSQL,DISTSRV=SDMS052,DISTDB=distribution,repnode=SDMS052))

2. Configure MS SQLServer replication.

Microsoft SQL Server Replication must be enabled using the Microsoft SQL Server Publication
technology. Informatica recommends enabling distribution through the SQL Server Management Console.

Multiple SQL Servers can use a single Distribution database. However, Informatica recommends using a
single Distribution database for Production and a separate one for Development/Test. In addition, for a
busy environment, placing the Distribution database on a separate server is advisable. Also, configure
the Distribution database for a retention period of 10 to 14 days.

3. Ensure that the MS SQL Server Agent Service is running.
4. Register sources using the PowerExchange Navigator.

Source tables must have a primary key. Note that system admin authority is required to register source
tables.

Performance Tuning Tips

If you plan to capture large numbers of transaction updates, consider using a dedicated server as the
host of the distribution database. This avoids contention for CPU and disk storage with a production instance.

SQL Server CDC performance can sometimes be slow: it can take approximately ten seconds for changes made at
the source to take effect at the target, particularly when data arrives in low volumes.

You can alter the following parameters to enhance this performance:

● POLWAIT
● PollingInterval

POLWAIT

This parameter specifies the number of seconds to wait between polls for new data after the end of the current
data has been reached.

● Specify this parameter in the dbmover.cfg file of the Microsoft SQL Distribution database machine.
● The default is ten seconds. Reducing this value to one or two seconds can improve the performance.
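As a hedged illustration only, POLWAIT might be added to the MSQL CAPI_CONNECTION statement shown earlier; the exact placement should be verified against the PowerExchange Reference Guide.

CAPI_CONNECTION=(NAME=CAPIMSSC,
TYPE=(MSQL,DISTSRV=SDMS052,DISTDB=distribution,POLWAIT=2,repnode=SDMS052))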

PollingInterval



You can also decrease the polling interval parameter of the Log Reader Agent in Microsoft SQL Server. Reducing
this value reduces the delay in polling for new records.

● Modify this parameter using the SQL Server Enterprise Manager.


● The default value for this parameter is 10 seconds.

Be aware, however, that the trade-off with the above options is, to some extent, increased overhead and
frequency of access to the source distribution database. To minimize overhead and frequency of access to the
database, increase the delay between the time an update is performed and the time it is extracted.

Increasing the value of POLWAIT in the dbmover.cfg file reduces the frequency with which the source distribution
database is accessed. In addition, increasing the value of Real-Time Flush Latency in the PowerCenter
Application Connection can also reduce the frequency of access to the source.

Last updated: 27-May-08 12:39



PowerExchange Installation (for
Mainframe)

Challenge

Installing and configuring a PowerExchange listener on a mainframe, ensuring that the


process is both efficient and effective.

Description

PowerExchange installation is very straight-forward and can generally be accomplished


in a timely fashion. When considering a PowerExchange installation, be sure that the
appropriate resources are available. These include, but are not limited to:

● MVS systems operator


● Appropriate database administrator; this depends on what (if any) databases
are going to be sources and/or targets (e.g., IMS, IDMS, etc.).
● MVS Security resources

Be sure to adhere to the sequence of the following steps to successfully install


PowerExchange. Note that in this very typical scenario, the mainframe source data is
going to be “pulled” across to a server box.

1. Complete the PowerExchange pre-install checklist and obtain valid license keys.
2. Install PowerExchange on the mainframe.
3. Start the PowerExchange jobs/tasks on the mainframe.
4. Install the PowerExchange client (Navigator) on a workstation.
5. Test connectivity to the mainframe from the workstation.
6. Install Navigator on the UNIX/NT server.
7. Test connectivity to the mainframe from the server.

Complete the PowerExchange Pre-install Checklist and Obtain Valid


License Keys

Reviewing the environment and recording the information in a detailed checklist


facilitates the PowerExchange install. The checklist (which is a prerequisite) is installed
in the Documentation Folder when the PowerExchange software is installed. It is also



available within the client from the PowerExchange Program Group. Be sure to
complete all relevant sections.

You will need a valid license key in order to run any of the PowerExchange
components. This is a 44 or 64-byte key that uses hyphens every 4 bytes. For example:

1234-ABCD-1234-EF01-5678-A9B2-E1E2-E3E4-A5F1

The key is not case-sensitive and uses hexadecimal digits and letters (0-9 and A-F).
Keys are valid for a specific time period and are also linked to an exact or generic TCP/
IP address. They also control access to certain databases. You cannot successfully
install PowerExchange without a valid key for all required components.

Note: When copying software from one machine to another, you may encounter
license key problems since the license key is IP specific. Be prepared to deal with this
eventuality, especially if you are going to a backup site for disaster recovery testing. In
the case of such an event Informatica Product Shipping or Support can generate a
temporary key very quickly.

Install PowerExchange on the Mainframe

Step 1: Create a folder c:\PWX on the workstation. Copy the file with a naming
convention similar to PWXOS26.Vxxx.EXE from the PowerExchange CD or from
the extract of the zip file downloaded to this directory. Double click the file to
unzip its contents into this directory.

Step 2: Create the PDS “HLQ.PWXVxxx.RUNLIB” and “HLQ.PWXVxxx.BINLIB” on the
mainframe with fixed-block record format and a record length of 80 in
order to pre-allocate the needed libraries. Ensure sufficient space for the
required jobs/tasks by allocating 150 cylinders and 50 directory blocks.

Step 3: Run the “MVS_Install” file. This displays the MVS Install Assistant.
Configure the IP Address, Logon ID, Password, HLQ, and Default volume
setting on the display screen. Also, enter the license key.

Click the Custom buttons to configure the desired data sources.

Be sure that the HLQ on this screen matches the HLQ of the allocated
RUNLIB (from step 2).

Save these settings and click Process. This creates the JCL libraries



and opens the following screen to FTP these libraries to MVS. Click
XMIT to complete the FTP process.

Note: A new installer GUI was added as of PowerExchange 8.5.


Simply follow the installation screens in the GUI for this step.

Step 4: Edit JOBCARD in RUNLIB and configure as per the environment (e.g.,
execution class, message class, etc.)

Step 5: Edit the SETUPBLK member in RUNLIB. Copy in the JOBCARD and
SUBMIT. This process can submit from 5 to 24 jobs. All jobs should end with
return code 0 (success) or 1, and a list of the needed installation jobs can be
found in the XJOBS member.

Start The PowerExchange Jobs/Tasks on the Mainframe

The installed PowerExchange Listener can be run as a normal batch job or as a started
task. Informatica recommends that it initially be submitted as a batch job: RUNLIB
(STARTLST). If it will be run as a started task then copy the PSTRTLST member in
runlib to the started task proclib.

It should return: DTL-00607 Listener VRM x.x.x Build Vxxx_P0x started.

If implementing change capture, start the PowerExchange Agent (as a started task):

/S DTLA

It should return: DTLEDMI1722561: EDM Agent DTLA has completed initialization.

Note: The load libraries must be APF authorized prior to starting the Agent.

Install The PowerExchange Client (Navigator) on a Workstation

Step 1: Run the Windows or UNIX installation file in the software folder on the
installation CD and follow the prompts.

Step 2: Enter the license key.

Step 3: Follow the wizard to complete the install and reboot the machine.



Step 4: Add a node entry to the configuration file “\Program Files\Informatica
\Informatica Power Exchange\dbmover.cfg” to point to the Listener on the
mainframe.

node = (mainframe location name, TCPIP, mainframe IP address, 2480)

Test Connectivity to the Mainframe from the Workstation

Ensure communication to the PowerExchange Listener on the mainframe by entering


the following in DOS on the workstation:

DTLREXE PROG=PING LOC=mainframe location or nodename in dbmover.cfg

It should return: DLT-00755 DTLREXE Command OK!

Install PowerExchange on the UNIX Server

Step 1: Create a user for the PowerExchange installation on the UNIX box.

Step 2: Create a UNIX directory “/opt/inform/pwxvxxxp0x”.

Step 3: FTP the file “\software\Unix\dtlxxx_vxxx.tar” on the installation CD to


the pwx installation directory on UNIX.

Step 4: Use the UNIX tar command to extract the files. The command is “tar -xvf pwxxxx_vxxx.tar”.

Step 5: Update the logon profile with the correct path, library path, and
home environment variables.

Step 6: Update the license key file on the server.

Step 7: Update the configuration file on the server (dbmover.cfg) by adding a


node entry to point to the Listener on the mainframe.

Step 8: If using an ETL tool in conjunction with PowerExchange, via ODBC,


update the odbc.ini file on the server by adding data source entries that point to
PowerExchange-accessed data:



[pwx_mvs_db2]

DRIVER=<install dir>/libdtlodbc.so

DESCRIPTION=MVS DB2

DBTYPE=db2

LOCATION=mvs1

DBQUAL1=DB2T

Test Connectivity to the Mainframe from the Server

Ensure communication to the PowerExchange Listener on the mainframe by entering


the following on the UNIX server:

DTLREXE PROG=PING LOC=mainframe location

It should return: DLT-00755 DTLREXE Command OK!

Changed Data Capture

There is a separate manual for each type of change data capture option; each manual
contains the specifics of the following general steps. You will need
to understand the appropriate options guide to ensure success.

Step 1: APF authorize the .LOAD and the .LOADLIB libraries. This is required
for external security.

Step 2: Copy the Agent from the PowerExchange PROCLIB to the system site
PROCLIB.

Step 3: After the Agent has been started, run job SETUP2.

Step 4: Create an active registration in Navigator for a table/segment/record
that is set up for change capture.



Step 5: Start the ECCR.

Step 6: Issue a change to the table/segment/record that you registered in


Navigator.

Step 7: Perform an extraction map row test in Navigator.

Last updated: 10-Jun-08 15:40



Assessing the Business Case

Challenge

Assessing the business case for a project must consider both the tangible and
intangible potential benefits. The assessment should also validate the benefits and
ensure they appear realistic to the Project Sponsor and Key Stakeholders in order to
secure project funding.

Description

A Business Case should include both qualitative and quantitative measures of potential
benefits.

The Qualitative Assessment portion of the Business Case is based on the Statement
of Problem/Need and the Statement of Project Goals and Objectives (both generated in
Subtask 1.1.1 Establish Business Project Scope) and focuses on discussions with the
project beneficiaries regarding the expected benefits in terms of problem alleviation,
cost savings or controls, and increased efficiencies and opportunities.

Many qualitative items are intangible, but you may be able to cite examples of the
potential costs or risks if the system is not implemented. An example may be the cost
of bad data quality resulting in the loss of a key customer or an invalid analysis
resulting in bad business decisions. Risk factors may be classified as business,
technical, or execution in nature. Examples of these risks are uncertainty of value or
the unreliability of collected information, new technology employed, or a major change
in business thinking for personnel executing change.

It is important to identify an estimated value added or cost eliminated to strengthen the


business case. The better the definition of the factors, the better the value to the business
case.

The Quantitative Assessment portion of the Business Case provides specific


measurable details of the proposed project, such as the estimated ROI. This may
involve the following calculations:

● Cash flow analysis- Projects positive and negative cash flows for the
anticipated life of the project. Typically, ROI measurements use the cash flow
formula to depict results.



● Net present value - Evaluates cash flow according to the long-term value of
current investment. Net present value shows how much capital needs to be
invested currently, at an assumed interest rate, in order to create a stream of
payments over time. For instance, to generate an income stream of $500 per
month over six months at an interest rate of eight percent would require an
investment (i.e., a net present value) of $2,311.44 (a worked check of this figure follows this list).
● Return on investment - Calculates net present value of total incremental cost
savings and revenue divided by the net present value of total costs multiplied
by 100. This type of ROI calculation is frequently referred to as return-on-
equity or return-on-capital.
● Payback Period - Determines how much time must pass before an initial
capital investment is recovered.
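As a worked check of the net present value example above, treating the eight percent as the rate per monthly period:

PV = P * (1 - (1 + r)^-n) / r = 500 * (1 - 1.08^-6) / 0.08 = 2,311.44

The same formula can be applied to any projected stream of benefits or costs when building the quantitative case.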

The following are steps to calculate the quantitative business case or ROI:

Step 1 – Develop Enterprise Deployment Map. This is a model of the project phases
over a timeline, estimating as specifically as possible participants, requirements, and
systems involved. A data integration or migration initiative or amendment may require
estimating customer participation (e.g., by department and location), subject area and
type of information/analysis, numbers of users, numbers and complexity of target data
systems (data marts or operational databases, for example) and data sources, types of
sources, and size of data set. A data migration project may require customer
participation, legacy system migrations, and retirement procedures. The types of
estimations vary by project types and goals. It is important to note that the more details
you have for estimations, the more precise your phased solutions are likely to be. The
scope of the project should also be made known in the deployment map.

Step 2 – Analyze Potential Benefits. Discussions with representative managers and


users or the Project Sponsor should reveal the tangible and intangible benefits of the
project. The most effective format for presenting this analysis is often a "before" and
"after" format that compares the current situation to the project expectations, Include in
this step, costs that can be avoided by the deployment of this project.

Step 3 – Calculate Net Present Value for all Benefits. Information gathered in this
step should help the customer representatives to understand how the expected
benefits are going to be allocated throughout the organization over time, using the
enterprise deployment map as a guide.

Step 4 – Define Overall Costs. Customers need specific cost information in order to
assess the dollar impact of the project. Cost estimates should address the following
fundamental cost components:



● Hardware
● Networks
● RDBMS software
● Back-end tools
● Query/reporting tools
● Internal labor
● External labor
● Ongoing support
● Training

Step 5 – Calculate Net Present Value for all Costs. Use either actual cost estimates
or percentage-of-cost values (based on cost allocation assumptions) to calculate costs
for each cost component, projected over the timeline of the enterprise deployment map.
Actual cost estimates are more accurate than percentage-of-cost allocations, but much
more time-consuming. The percentage-of-cost allocation process may be valuable for
initial ROI snapshots until costs can be more clearly predicted.

Step 6 – Assess Risk, Adjust Costs and Benefits Accordingly. Review potential
risks to the project and make corresponding adjustments to the costs and/or benefits.
Some of the major risks to consider are:

● Scope creep, which can be mitigated by thorough planning and tight project
scope.
● Integration complexity, which may be reduced by standardizing on vendors
with integrated product sets or open architectures.
● Architectural strategy that is inappropriate.
● Current support infrastructure may not meet the needs of the project.
● Conflicting priorities may impact resource availability.
● Other miscellaneous risks from management or end users who may withhold
project support; from the entanglements of internal politics; and from
technologies that don't function as promised.
● Unexpected data quality, complexity, or definition issues often are discovered
late in the course of the project and can adversely affect effort, cost, and
schedule. This can be somewhat mitigated by early source analysis.

Step 7 – Determine Overall ROI. When all other portions of the business case are
complete, calculate the project's "bottom line". Determining the overall ROI is simply a
matter of subtracting net present value of total costs from net present value of (total



incremental revenue plus cost savings).

Final Deliverable

The final deliverable of this phase of development is a complete business case that
documents both tangible (quantified) and intangible (non-quantified, but estimated)
benefits and risks, to be presented to the Project Sponsor and Key Stakeholders. This
allows them to review the Business Case in order to justify the development effort.

If your organization has the concept of a Project Office, which provides the governance
for projects and priorities, many times this is part of the original Project Charter, which
states items like scope, initial high level requirements, and key project stakeholders.
However, developing a full Business Case can validate any initial analysis and provide
additional justification. Additionally, the Project Office should provide guidance in
building and communicating the Business Case.

Once completed, the Project Manager is responsible for scheduling the review and
socialization of the Business Case.

Last updated: 01-Feb-07 18:54



Defining and Prioritizing Requirements

Challenge

Defining and prioritizing business and functional requirements is often accomplished


through a combination of interviews and facilitated meetings (i.e., workshops) between
the Project Sponsor and beneficiaries and the Project Manager and Business Analyst.

Requirements need to be gathered from business users who currently use and/or have
the potential to use the information being assessed. All input is important since the
assessment should encompass an enterprise view of the data rather than a limited
functional, departmental, or line-of-business view.

Types of specific detailed data requirements gathered include:

● Data names to be assessed


● Data definitions
● Data formats and physical attributes
● Required business rules including allowed values
● Data usage
● Expected quality levels

By gathering and documenting some of the key detailed data requirements, a solid
understanding of the business rules involved is reached. Certainly, not all elements can be
analyzed in detail, but this helps in getting to the heart of the business system so you are
better prepared when speaking with business and technical users.

Description

The following steps are key for successfully defining and prioritizing requirements:

Step 1: Discovery

Gathering business requirements is one of the most important stages of any data
integration project. Business requirements affect virtually every aspect of the data
integration project starting from Project Planning and Management to End-User



Application Specification. They are like a hub that sits in the middle and touches the
various stages (spokes) of the data integration project. There are two basic techniques
for gathering requirements and investigating the underlying operational data: interviews
and facilitated sessions.

Data Profiling

Informatica Data Explorer (IDE) is an automated data profiling and analysis software
product that can be extremely beneficial in defining and prioritizing requirements. It
provides a detailed description of data content, structure, rules, and quality by profiling
the actual data that is loaded into the product.

Some industry examples of why data profiling is crucial prior to beginning the
development process are:

● The cost of poor data quality is 15 to 25 percent of operating profit.
● Poor data management is costing global business $1.4 billion a year.
● 37 percent of projects are cancelled; 50 percent are completed but with 20 percent overruns, leaving only 13 percent completed on time and within budget.
● Using a data profiling tool can lower the risk and the cost of the project and increase the chances of success.
● Data profiling reports can be posted to a central location where all team members can review results and track accuracy.

IDE provides the ability to promote collaboration through tags, notes, action items,
transformations and rules. By profiling the information, the framework is set to have an
effective interview process with business and technical users.

Interviews

By conducting interview research before starting the requirements gathering process,


interviewees can be categorized into functional business management and Information
Technology (IT) management. This, in conjunction with effective data profiling, helps
to establish a comprehensive set of business requirements.



Business Interviewees. Depending on the needs of the project, even though you may
be focused on a single primary business area, it is always beneficial to interview
horizontally to achieve a good cross-functional perspective of the enterprise. This also
provides insight into how extensible your project is across the enterprise.

Before you interview, be sure to develop an interview questionnaire based upon


profiling results, as well as business questions; schedule the interview time and place;
and prepare the interviewees by sending a sample agenda. When interviewing
business people, it is always important to start with the upper echelons of management
so as to understand the overall vision, assuming you have the business background,
confidence and credibility to converse at those levels.

If not adequately prepared, the safer approach is to interview middle management. If


you are interviewing across multiple teams, you might want to scramble interviews
among teams. This way if you hear different perspectives from finance and marketing,
you can resolve the discrepancies with a scrambled interview schedule. A note to keep
in mind is that the business is sponsoring the data integration project and will be the
end users of the application. The business will decide the success criteria of your data
integration project and determine future sponsorship. Questioning during these
sessions should include the following:

● Who are the stakeholders for this milestone delivery (IT, field business
analysts, executive management)?
● What are the target business functions, roles, and responsibilities?
● What are the key relevant business strategies, decisions, and processes (in
brief)?
● What information is important to drive, support, and measure success for
those strategies/processes? What key metrics? What dimensions for those
metrics?
● What current reporting and analysis is applicable? Who provides it? How is it
presented? How is it used? How can it be improved?

IT interviewees. The IT interviewees have a different flavor than the business user
community. Interviewing the IT team is generally very beneficial because it is
composed of data gurus who deal with the data on a daily basis. They can provide
great insight into data quality issues, help in systematic exploration of legacy source
systems, and understanding business user needs around critical reports. If you are
developing a prototype, they can help get things done quickly and address important
business reports. Questioning during these sessions should include the following:

● Request an overview of existing legacy source systems. How does data



current flow from these systems to the users?
● What day-to-day maintenance issues does the operations team encounter with
these systems?
● Ask for their insight into data quality issues.
● What business users do they support? What reports are generated on a daily,
weekly, or monthly basis? What are the current service level agreements for
these reports?
● How can the DI project support the IS department needs?
● Review data profiling reports and analyze the anomalies in the data. Note and
record each of the comments from the more detailed analysis. What are the
key business rules involved in each item?

Facilitated Sessions

Facilitated sessions - known sometimes as JAD (Joint Application Development) or


RAD (Rapid Application Development) - are ways to work as a group of business and
technical users to capture the requirements. This can be very valuable in gathering
comprehensive requirements and building the project team. The difficulty is the amount
of preparation and planning required to make the session a pleasant, and
worthwhile experience.

Facilitated sessions provide quick feedback by gathering all the people from the various
teams into a meeting and initiating the requirements process. You need a facilitator
who is experienced in these meetings to ensure that all the participants get a chance to
speak and provide feedback. During individual (or small group) interviews with high-
level management, there is often focus and clarity of vision that may be hindered in
large meetings. Thus, it is extremely important to encourage all attendees to
participate and to prevent a small number of them from dominating the requirements process.

A challenge of facilitated sessions is matching everyone’s busy schedules and actually


getting them into a meeting room. However, this part of the process must be focused
and brief or it can become unwieldy with too much time expended just trying to
coordinate calendars among worthy forum participants. Set a time period and target list
of participants with the Project Sponsor, but avoid lengthening the process if some
participants aren't available. Questions asked during facilitated sessions are similar to
the questions asked to business and IS interviewees.

Step 2: Validation and Prioritization

The Business Analyst, with the help of the Project Architect, documents the findings of the discovery
process after interviewing the business and IT management. The next step is to define the business
requirements specification. The resulting Business Requirements Specification includes a matrix
linking the specific business requirements to their functional requirements.

Defining the business requirements is a time-consuming process and should be facilitated by forming
a working group. A working group usually consists of business users, business analysts, the project
manager, and other individuals who can help to define the business requirements. The working group
should meet weekly to define and finalize the business requirements. The working group helps to:

● Design the current state and future state


● Identify supply format and transport mechanism
● Identify required message types
● Develop Service Level Agreement(s), including timings
● Identify supply management and control requirements
● Identify common verifications, validations, business validations and
transformation rules
● Identify common reference data requirements
● Identify common exceptions
● Produce the physical message specification

At this time, the Architect also develops the Information Requirements Specification to
clearly represent the structure of the information requirements. This document, based
on the business requirements findings, can facilitate discussion of informational details
and provide the starting point for the target model definition.

The detailed business requirements and information requirements should be reviewed with the
project beneficiaries and prioritized based on business need and the stated project objectives
and scope.

Step 3: The Incremental Roadmap

Concurrent with the validation of the business requirements, the Architect begins the
Functional Requirements Specification, which provides details on the technical requirements
for the project.

As general technical feasibility is compared to the prioritization from Step 2, the Project
Manager, Business Analyst, and Architect develop consensus on a project "phasing"
approach. Items of secondary priority and those with poor near-term feasibility are
relegated to subsequent phases of the project. Thus, they develop a phased, or
incremental, "roadmap" for the project (Project Roadmap).

Final Deliverable

The final deliverable of this phase of development is a complete list of business requirements, a
diagram of the current and future state, and a list of the high-level business rules affected by the
requirements that will drive the change from the current state to the future state. This provides the
development team with much of the information needed to begin the design effort for the system
modifications. Once completed, the Project Manager is responsible for scheduling the review and
socialization of the requirements and plan to achieve sign-off on the deliverable.

This is presented to the Project Sponsor for approval and becomes the first "increment"
or starting point for the Project Plan.

Last updated: 01-Feb-07 18:54



Developing a Work Breakdown Structure (WBS)

Challenge

Developing a comprehensive work breakdown structure (WBS) is crucial for capturing all the tasks required
for a data integration project. Many times, underestimating or omitting items such as full analysis, testing,
or even specification development can create a sense of false optimism for the project. The WBS clearly
depicts all of the various tasks and subtasks required to complete a project. Most project time and resource
estimates are supported by the WBS. A thorough, accurate WBS is critical for effective monitoring and also
facilitates communication with project sponsors and key stakeholders.

Description

The WBS is a deliverable-oriented hierarchical tree that allows large tasks to be visualized as a group of
related smaller, more manageable subtasks. These tasks and subtasks can then be assigned to various
resources, which helps to identify accountability and is invaluable for tracking progress. The WBS serves
as a starting point as well as a monitoring tool for the project.

One challenge in developing a thorough WBS is obtaining the correct balance between sufficient detail and
too much detail. The WBS shouldn't include every minor detail in the project, but it does need to break the
tasks down to a manageable level of detail. One general guideline is to keep task detail to a duration of at
least a day. It is also important to maintain consistency across the project in the level of detail.

A well-designed WBS can be extracted at a higher level to communicate overall project progress, as shown
in the following sample. The actual WBS used by the project manager may, for example, be a level of detail
deeper than the overall project WBS to ensure that all steps are completed, but the communication can roll
up a level or two to make things clearer.

Plan                                                       % Complete   Budget Hours   Actual Hours

Architecture - Set up of Informatica Environment               82%           167            137
  Develop analytic solution architecture                       46%            28             13
  Design development architecture                              59%            32             19
  Customize and implement Iterative Framework
    Data Profiling                                            100%            32             32
    Legacy Stage                                              150%            10             15
    Pre-Load Stage                                            150%            10             15
    Reference Data                                            128%            18             23
    Reusable Objects                                           56%            27             15
  Review and signoff of Architecture                           50%            10              5

Analysis - Target-to-Source Data Mapping                       48%          1000            479
  Customer (9 tables)                                          87%           135            117
  Product (7 tables)                                           98%           215            210
  Inventory (3 tables)                                          0%            60              0
  Shipping (3 tables)                                           0%            60              0
  Invoicing (7 tables)                                          0%           140              0
  Orders (13 tables)                                           37%           380            140
  Review and signoff of Functional Specification                0%            10              0

Total Architecture and Analysis                                52%          1167            602
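
In the sample above, the percent-complete figures are simple arithmetic: at each level, actual hours are divided by budgeted hours, and summary rows roll up the detail tasks beneath them. The short sketch below illustrates that rollup; it is not part of the methodology itself, and the task list, field layout, and phase names are hypothetical.

    # Minimal sketch (illustrative only): roll detail-level WBS entries up to
    # phase-level percent complete, computed as actual hours / budgeted hours.
    from collections import defaultdict

    # (phase, task, budget_hours, actual_hours) -- hypothetical detail entries
    wbs_detail = [
        ("Architecture", "Develop analytic solution architecture", 28, 13),
        ("Architecture", "Design development architecture",        32, 19),
        ("Architecture", "Data Profiling",                         32, 32),
        ("Architecture", "Review and signoff of Architecture",     10,  5),
        ("Analysis",     "Customer (9 tables)",                   135, 117),
        ("Analysis",     "Product (7 tables)",                    215, 210),
    ]

    def rollup(entries):
        """Summarize budget vs. actual hours per phase for sponsor reporting."""
        totals = defaultdict(lambda: [0, 0])            # phase -> [budget, actual]
        for phase, _task, budget, actual in entries:
            totals[phase][0] += budget
            totals[phase][1] += actual
        for phase, (budget, actual) in totals.items():
            pct = 100 * actual / budget if budget else 0
            print(f"{phase:<14} {pct:4.0f}%   budget {budget:5}   actual {actual:5}")

    rollup(wbs_detail)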

A fundamental question is whether to include "activities" as part of a WBS. The following statements are
generally true for most projects, most of the time, and therefore are appropriate as the basis for resolving
this question.

● The project manager should have the right to decompose the WBS to whatever level of detail he or
she requires to effectively plan and manage the project. The WBS is a project management tool
that can be used in different ways, depending upon the needs of the project manager.

● The lowest level of the WBS can be activities.


● The hierarchical structure should be organized by deliverables and milestones, with process steps
detailed within it. Alternatively, the WBS can be structured on a process or life-cycle basis (i.e., the
accepted concept of Phases), with non-deliverables detailed within it.

● At the lowest level in the WBS, an individual should be identified and held accountable for the result.
This person should be an individual contributor, creating the deliverable personally, or a manager who
will in turn create a set of tasks to plan and manage the results.

● The WBS is not necessarily a sequential document. Tasks in the hierarchy are often completed in
parallel. In part, the goal is to list every task that must be completed; it is not necessary to determine
the critical path for completing these tasks.
❍ For example, consider multiple subtasks under a task (e.g., 4.3.1 through 4.3.7 under task 4.3).
Subtasks 4.3.1 through 4.3.4 may have sequential requirements that force them to be
completed in order, while subtasks 4.3.5 through 4.3.7 can - and should - be completed in
parallel if they do not have sequential requirements.
❍ It is important to remember that a task is not complete until all of its corresponding subtasks
are completed - whether sequentially or in parallel. For example, the Build Phase is not
complete until tasks 4.1 through 4.7 are complete, but some work can (and should) begin
for the Deploy Phase long before the Build Phase is complete.

The Project Plan provides a starting point for further development of the project WBS. This sample is a
Microsoft Project file that has been "pre-loaded" with the phases, tasks, and subtasks that make up the
Informatica methodology. The Project Manager can use this WBS as a starting point, but should review it to
ensure that it corresponds to the specific development effort, removing any steps that aren’t relevant or
adding steps as necessary. Many projects require the addition of detailed steps to accurately represent the
development effort.

If the Project Manager chooses not to use Microsoft Project, an Excel version of the Work Breakdown
Structure is also available. The phases, tasks, and subtasks can be exported from Excel into many other
project management tools, simplifying the effort of developing the WBS.
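
If the Excel route is taken, moving the WBS into another tool can also be scripted. The following is a minimal sketch only, assuming a hypothetical worksheet named "WBS" with Phase, Task, Subtask, Owner, and Estimate (hrs) columns (not a Velocity-supplied layout), and using the pandas library to produce a CSV that most project management tools can import.

    # Minimal sketch: read a hypothetical WBS workbook and write a CSV for import
    # into a project management tool. File and column names are assumptions.
    import pandas as pd

    wbs = pd.read_excel("wbs.xlsx", sheet_name="WBS")
    required = ["Phase", "Task", "Subtask", "Owner", "Estimate (hrs)"]
    missing = [col for col in required if col not in wbs.columns]
    if missing:
        raise ValueError(f"WBS sheet is missing columns: {missing}")

    # At the lowest level an individual owner should be accountable, so flag
    # detail rows that have no owner assigned before handing the plan over.
    unassigned = wbs[wbs["Subtask"].notna() & wbs["Owner"].isna()]
    print(f"{len(unassigned)} detail task(s) have no owner assigned")

    wbs.to_csv("wbs_import.csv", index=False)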

Sometimes it is best to build an initial task list and timeline with the project team using a facilitator. The
project manager can act as the facilitator or can appoint one, freeing up the project manager and enabling
team members to focus on determining the actual tasks and effort needed.

Depending on the size and scope of the project, sub-projects may be beneficial, with multiple project teams
creating their own project plans. The overall project manager then brings the plans together into a master
project plan. This group of projects can be defined as a program and the project manager and project
architect manage the interaction among the various development teams.

Caution: Do not expect plans to be set in stone. Plans inevitably change as the project progresses;
new information becomes available; scope, resources and priorities change; deliverables are (or are not)
completed on time, etc. The process of estimating and modifying the plan should be repeated many times
throughout the project. Even initial planning is likely to take several iterations to gather enough information.
Significant changes to the project plan become the basis to communicate with the project sponsor(s) and/
or key stakeholders with regard to decisions to be made and priorities rearranged. The goal of the project
manager is to be non-biased toward any decision, but to place the responsibility with the sponsor to shape
direction.

Approaches to Building WBS Structures: Waterfall vs. Iterative

Data integration projects differ somewhat from other types of development projects, although they also
share some key attributes. The following list summarizes some unique aspects of data integration projects:

● Business requirements are less tangible and predictable than in OLTP (online transactional
processing) projects.
● Database queries are very data intensive, involving few or many tables, but with many, many rows.
In OLTP, transactions are data selective, involving few or many tables and comparatively few rows.

● Metadata is important, but in OLTP the meaning of fields is predetermined on a screen or report. In a
data integration project (e.g., a data warehouse or common data management project), metadata and
traceability are much more critical.

Data integration projects, like all development projects, must be managed. To manage
them, they must follow a clear plan. Data integration project managers often have a
more difficult job than those managing OLTP projects because there are so many
pieces and sources to manage.

Two purposes of the WBS are to manage work and to ensure success. Although this is the same as for any
project, data integration projects are unlike typical waterfall projects in that they are based on an iterative
approach. Three of the main principles of iteration are as follows:

● Iteration. Division of work into small "chunks" of effort, using lessons learned from earlier iterations.

● Time boxing. Delivery of capability in short intervals, with the first release typically requiring from
three to nine months (depending on complexity) and quarterly releases thereafter.

● Prototyping. Early delivery of a prototype, with a working database delivered approximately one-third
of the way through.

Incidentally, most iterative projects follow an essentially waterfall process within a given increment. The
danger is that projects can iterate or spiral out of control.

The three principles listed above are very important because even the best data integration plans are
likely to invite failure if these principles are ignored. An example of a failure waiting to happen, even with a
fully detailed plan, is a large common data management project that gathers all requirements upfront and
delivers the application all-at-once after three years. It is not the "large" that is the problem, but the "all
requirements upfront" and the "all-at-once in three years."

Even enterprise data warehouses are delivered piece-by-piece using these three (and other) principles. The
feedback you can gather from increment to increment is critical to the success of the future increments. The
benefit is that such incremental deliveries establish patterns for development that can be used and
leveraged for future deliveries.

What is the Correct Development Approach?

The correct development approach is usually dictated by corporate standards and by departments such as
the Project Management Office (PMO). Regardless of the development approach chosen, high-level phases
typically include planning the project; gathering data requirements; developing data models; designing and
developing the physical database(s); sourcing, profiling, and mapping the data; and extracting,
transforming, and loading the data. Lower-level planning details are typically carried out by the project
manager and project team leads.

Preparing the WBS



The WBS can be prepared using manual or automated techniques, or a combination of the two.

In many cases, a manual technique is used to identify and record the high-level phases and tasks, and the
information is then transferred to project tracking software such as Microsoft Project. Project team members
typically begin by identifying the high-level phases and tasks, writing the relevant information on large sticky
notes or index cards, and then mounting the notes or cards on a wall or white board. Use one sticky note or card
per phase or task so that you can easily rearrange them as the project order evolves. As the project plan
progresses, you can add information to the cards or notes to flesh out the details, such as task owner, time
estimates, and dependencies. This information can then be fed into the project tracking software.

Once you have a fairly detailed methodology, you can enter the phase and task information into your project
tracking software. When the project team is assembled, you can enter additional tasks and details directly
into the software. Be aware, however, that the project team can better understand a project and its various
components if they actually participate in the high-level development activities, as they do in the manual
approach. Using software alone, without input from relevant project team members, to designate phases,
tasks, dependencies, and timelines can be difficult and prone to errors and omissions.

Benefits of developing the project timeline manually, with input from team members, include:

● Tasks, effort, and dependencies are visible to all team members.
● The team has a greater understanding of, and commitment to, the project.
● Team members have an opportunity to work with each other and set the foundation. This is
particularly important if the team is geographically dispersed and cannot work face-to-face
throughout much of the project.

How Much Descriptive Information is Needed?

The project plan should incorporate a thorough description of the project and its goals. Be sure to review the
business objectives, constraints, and high-level phases, but keep the descriptions as short and simple as
possible. In many cases, a verb-noun form works well (e.g., interview users, document requirements, etc.).
After you have described the project at a high level, identify the tasks needed to complete each phase. It is
often helpful to use the notes section in the tracking software (e.g., Microsoft Project) to provide a narrative for
each task or subtask. In general, decompose the tasks until they have a rough duration of two to 20 days.

Remember to break down the tasks only to the level of detail that you are willing to track. Include key
checkpoints or milestones as tasks to be completed. Again, a noun-verb form works well for milestones
(e.g., requirements completed, data model completed, etc.).

Assigning and Delegating Responsibility

Identify a single owner for each task in the project plan. Although other resources may help to complete the
task, the individual who is designated as the owner is ultimately responsible for ensuring that the task, and
any associated deliverables, are completed on time.

After the WBS is loaded into the selected project tracking software and refined for the specific project
requirements, the Project Manager can begin to estimate the level of effort involved in completing each of
the steps. When the estimate is complete, the project manager can assign individual resources and prepare
a project schedule. The end result is the Project Plan. Refer to Developing and Maintaining the Project Plan
for further information about the project plan.

Use your project plan to track progress. Be sure to review and modify estimates and keep the project plan
updated throughout the project.

Last updated: 09-Feb-07 16:29



Developing and Maintaining the Project Plan

Challenge

The challenge of developing and maintaining a project plan is to incorporate all of the necessary components
while retaining the flexibility necessary to accommodate change.

A two-fold approach is required to meet the challenge:

1. A project that is clear in scope contains the following elements:

● A designated begin and end date.
● Well-defined business and technical requirements.
● Adequate resources assigned.

Without these components, the project is subject to slippage and to incorrect expectations being set with the
Project Sponsor.

2. Project Plans are subject to revision and change throughout the project. It is imperative to establish a
communication plan with the Project Sponsor; such communication may involve a weekly status report of
accomplishments and/or a report on issues and plans for the following week. This type of forum is very helpful in
involving the Project Sponsor in actively making decisions regarding changes in scope or timeframes.

If your organization has the concept of a Project Office that provides governance for the project and priorities, look for
a Project Charter that contains items like scope, initial high-level requirements, and key project stakeholders. Additionally,
the Project Office should provide guidance in funding and resource allocation for key projects.

Projects using Informatica's PowerCenter and Data Quality are not exempt from this project planning process. The
purpose here is to provide some key elements that can be used to develop and maintain a data integration, data
migration, or data quality project.

Description
Use the following steps as a guide for developing the initial project plan:

1. Define major milestones based on the project scope. (Be sure to list all key items such as analysis, design,
development, and testing.)
2. Break the milestones down into major tasks and activities. The Project Plan should be helpful as a starting point
or for recommending tasks for inclusion.
3. Continue the detail breakdown, if possible, to a level at which logical “chunks” of work can be completed
and assigned to resources for accountability purposes. This level provides satisfactory detail to facilitate
estimation, assignment of resources, and tracking of progress. If the detail tasks are too broad in scope, such as
requiring multiple resources, estimates are much less likely to be accurate and resource accountability becomes
difficult to maintain.
4. Confer with technical personnel to review the task definitions and effort estimates (or even to help define them, if
applicable). This helps to build commitment for the project plan.
5. Establish the dependencies among tasks, where one task cannot be started until another is completed (or must
start or complete concurrently with another); a small scheduling sketch based on such dependencies follows this list.
6. Define the resources based on the role definitions and estimated number of resources needed for each role.
7. Assign resources to each task. If a resource will only be part-time on a task, indicate this in the plan.
8. Ensure that the project plan follows your organization’s system development methodology.
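
As noted in step 5, the dependencies and effort estimates are what turn a task list into a schedule. The sketch below is illustrative only: the task names, durations, and predecessor links are hypothetical, and real plans should live in the project tracking tool, but it shows how earliest start and finish dates fall out of the dependencies and estimates.

    # Minimal sketch: compute earliest start/finish (in project days) for tasks
    # with effort estimates and predecessor dependencies. Data is hypothetical.
    from functools import lru_cache

    durations = {                     # task -> estimated duration in days
        "Gather requirements": 10,
        "Design data model": 8,
        "Build mappings": 15,
        "System test": 7,
    }
    predecessors = {                  # task -> tasks that must finish first
        "Design data model": ["Gather requirements"],
        "Build mappings": ["Design data model"],
        "System test": ["Build mappings"],
    }

    @lru_cache(maxsize=None)
    def earliest_start(task):
        """Earliest start day when every task begins as soon as its predecessors finish."""
        return max(
            (earliest_start(p) + durations[p] for p in predecessors.get(task, [])),
            default=0,
        )

    for task in durations:
        start = earliest_start(task)
        print(f"{task:<22} day {start:3} - day {start + durations[task]:3}")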

Note: Informatica Professional Services has found success in projects that blend the “waterfall” method with the
“iterative” method. The “waterfall” method works well in the early stages of a project, such as analysis and initial
design. The “iterative” method works well in accelerating development and testing, where feedback from extensive
testing validates the design of the system.

At this point, especially when using Microsoft Project, it is advisable to create dependencies (i.e., predecessor
relationships) between tasks assigned to the same resource in order to indicate the sequence of that person's activities.
Set the constraint type to “As Soon As Possible” and avoid setting a constraint date. Use the Effort-Driven approach so
that the Project Plan can be easily modified as adjustments are made.

Based on the initial definition of tasks and efforts, the resulting schedule should provide a realistic picture of the
project, unfettered by concerns about ideal user-requested completion dates. In other words, be as realistic as possible in
your initial estimations, even if the resulting schedule is likely to miss Project Sponsor expectations. This helps to
establish good communications with your Project Sponsor so you can begin to negotiate scope and resources in good
faith.

This initial schedule becomes a starting point. Expect to review and rework it, perhaps several times. Look for
opportunities for parallel activities, perhaps adding resources if necessary, to improve the schedule.

When a satisfactory initial plan is complete, review it with the Project Sponsor and discuss the assumptions,
dependencies, assignments, milestone dates, etc. Expect to modify the plan as a result of this review.

Reviewing and Revising the Project Plan

Once the Project Sponsor and Key Stakeholders agree to the initial plan, it becomes the basis for assigning tasks
and setting expectations regarding delivery dates. The planning activity then shifts to tracking tasks against the schedule
and updating the plan based on status and changes to assumptions.

One of the key communication methods is to establish a weekly or bi-weekly Project Sponsor meeting. Attendance at
this meeting should include the Project Sponsor, Key Stakeholders, Lead Developers, and the Project Manager.

Elements of a Project Sponsor meeting should include: a) Key Accomplishments (milestones and events, at a high
level), b) Progress to Date against the initial plan, c) Actual Hours vs. Budgeted Hours, d) Key Issues, and e) Plans for
Next Period.

Key Accomplishments

Listing key accomplishments provides an audit trail of activities completed for comparison against the initial plan. This is
an opportunity to bring in the lead developers and have them report to management on what they have accomplished;
it also provides them with an opportunity to raise concerns, which is very good from a motivation perspective since they
own their work and are accountable to management.

Keep accomplishments at a high level and coach the team members to be brief, keeping their presentations to a five- to
ten-minute maximum during this portion of the meeting.

Progress against Initial Plan

The following matrix shows progress on relevant stages of the project. Roll up tasks to a management level so the report
is readable to the Project Sponsor (see the sample below).

Plan                                                         % Complete   Budget Hours

Architecture - Set up of Informatica Migration Environment                     167
  Develop data integration solution architecture                 10%            28
  Design development architecture                                28%            32
  Customize and implement Iterative Migration Framework
    Data Profiling                                               80%            32
    Legacy Stage                                                100%            10
    Pre-Load Stage                                              100%            10
    Reference Data                                               83%            18
    Reusable Objects                                             19%            27
  Review and signoff of Architecture                              0%            10

Analysis - Target-to-Source Data Mapping                                      1000
  Customer (9 tables)                                            90%           135
  Product (6 tables)                                             90%           215
  Inventory (3 tables)                                            0%            60
  Shipping (3 tables)                                             0%            60
  Invoicing (7 tables)                                           57%           140
  Orders (19 tables)                                             40%           380
  Review and signoff of Functional Specification                  0%            10

Budget versus Actual

A key measure to be aware of is budgeted vs. actual cost of the project. The Project Sponsor needs to know if additional
funding is required; forecasting actual hours against budgeted hours allows the Project Sponsor to determine when
additional funding or a change in scope is required.

Many projects are cancelled because of cost overruns, so it is the Project Manager’s job to keep expenditures under
control. The following example shows how a budgeted vs. actual report may look.

                   10-Apr  17-Apr  24-Apr  1-May  8-May  15-May  22-May  29-May   Total
Resource A 28 40 24 40 40 40 40 32 284
Resource B 10 40 40 40 40 32 202
Resource C 40 36 40 40 32 188
Resource D 24 40 36 40 40 32 212
Project Manager 12 8 8 16 32 76
Total (all resources)                                                                962
Budgeted hours        110     160      97    160    160     160     160     160    1167
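
A simple way to keep this measure current is to compare cumulative actual hours to the cumulative budget and project a completion figure from the burn rate to date. The sketch below is illustrative only: the weekly budget figures are taken from the sample above, while the weekly actuals and the straight run-rate forecast are assumptions for the example.

    # Minimal sketch: budgeted vs. actual hours to date, plus a naive run-rate
    # forecast at completion. Weekly actuals below are hypothetical.
    weekly_budget = [110, 160, 97, 160, 160, 160, 160, 160]   # hours per week (1,167 total)
    weekly_actual = [98, 152, 110, 170, 150]                  # hypothetical actuals so far

    total_budget = sum(weekly_budget)
    budget_to_date = sum(weekly_budget[:len(weekly_actual)])
    actual_to_date = sum(weekly_actual)

    burn_rate = actual_to_date / budget_to_date               # > 1.0 means running hot
    forecast_at_completion = round(total_budget * burn_rate)

    print(f"Actual {actual_to_date} vs. budget {budget_to_date} hours to date "
          f"(burn rate {burn_rate:.2f})")
    print(f"Run-rate forecast at completion: {forecast_at_completion} "
          f"of {total_budget} budgeted hours")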

Key Issues

This is the most important part of the meeting. Presenting key issues such as resource commitments, user roadblocks,
key design concerns, etc., to the Project Sponsor and Key Stakeholders as they occur allows them to make immediate
decisions and minimizes the risk of impact to the project.

Plans for Next Period

This communicates back to the Project Sponsor where the resources are to be deployed. If key issues dictate a change,
this is an opportunity to redirect the resources and use them correctly.

Be sure to evaluate any changes to scope (see 1.2.4 Manage Project and the Scope Change Assessment Sample
Deliverable), or changes in priority or approach, as they arise to determine if they affect the plan. It may be necessary to
revise the plan if changes in scope or priority require rearranging task assignments or delivery sequences, or if they add
new tasks or postpone existing ones.



Tracking Changes

One approach is to establish a baseline schedule (and budget, if applicable) and then track changes against it. With
Microsoft Project, this involves creating a "Baseline" that remains static as changes are applied to the schedule. If
company and project management do not require tracking against a baseline, simply maintain the plan through updates
without a baseline. Maintain all records of Project Sponsor meetings and recap changes in scope after the meeting is
completed.

Summary

Managing a data integration, data migration, or data quality project requires good project planning and
communications. Many data integration projects fail because of issues such as poor data quality or the complexity of
integration. However, good communication and expectation setting with the Project Sponsor can prevent such
issues from causing a project to fail.

Last updated: 01-Feb-07 18:54



Developing the Business Case

Challenge

The challenge is identifying the departments and individuals that are likely to benefit directly from the
project implementation. Understanding these individuals, and their business information requirements,
is key to defining and scoping the project.

Description

The following four steps summarize business case development and lay a good
foundation for proceeding into detailed business requirements for the project.

1. One of the first steps in establishing the business scope is identifying the project
beneficiaries and understanding their business roles and project participation. In many
cases, the Project Sponsor can help to identify the beneficiaries and the various
departments they represent. This information can then be summarized in an
organization chart that is useful for ensuring that all project team members understand
the corporate/business organization.

● Activity - Interview project sponsor to identify beneficiaries, define their business roles and project
participation.
● Deliverable - Organization chart of corporate beneficiaries and participants.

2. The next step in establishing the business scope is to understand the business
problem or need that the project addresses. This information should be clearly defined
in a Problem/Needs Statement, using business terms to describe the problem. For
example, the problem may be expressed as "a lack of information" rather than "a lack
of technology" and should detail the business decisions or analysis that is required to
resolve the lack of information. The best way to gather this type of information is by
interviewing the Project Sponsor and/or the project beneficiaries.

● Activity - Interview (individually or in forum) Project Sponsor and/or beneficiaries regarding problems
and needs related to the project.
● Deliverable - Problem/Need Statement

3. The next step in creating the project scope is defining the business goals and objectives for the project
and detailing them in a comprehensive Statement of Project Goals and Objectives. This statement should be
a high-level expression of the desired business solution (e.g., what strategic or tactical benefits does the
business expect to gain from the project?) and should avoid any technical considerations at this point.
Again, the Project Sponsor and beneficiaries are the best sources for this type of
information. It may be practical to combine information gathering for the needs
assessment and goals definition, using individual interviews or general meetings to
elicit the information.

● Activity - Interview (individually or in forum) Project Sponsor and/or beneficiaries regarding business
goals and objectives for the project.
● Deliverable - Statement of Project Goals and Objectives

4. The final step is creating a Project Scope and Assumptions statement that clearly defines the boundaries
of the project based on the Statement of Project Goals and Objectives and the associated project
assumptions. This statement should focus on the type of information or analysis that will be included in the
project rather than what will not.

The assumptions statements are optional and may include qualifiers on the scope,
such as assumptions of feasibility, specific roles and responsibilities, or availability of
resources or data.

● Activity - Business Analyst develops the Project Scope and Assumptions statement for presentation to
the Project Sponsor.
● Deliverable - Project Scope and Assumptions statement

Last updated: 01-Feb-07 18:54



Managing the Project Lifecycle

Challenge

To provide an effective communications plan for on-going management throughout the project lifecycle
and to keep the Project Sponsor informed of the status of the project.

Description

The quality of a project can be directly correlated to the amount of review that occurs
during its lifecycle and the involvement of the Project Sponsor and Key Stakeholders.

Project Status Reports

In addition to the initial project plan review with the Project Sponsor, it is critical to
schedule regular status meetings with the sponsor and project team to review status,
issues, scope changes and schedule updates. This is known as the project sponsor
meeting.

Gather status, issues, and schedule update information from the team one day before the status meeting
in order to compile and distribute the Project Status Report. In addition, make sure lead developers of
major assignments are present to report on status and issues, if applicable.

Project Management Review

The Project Manager should coordinate, if not facilitate, reviews of requirements, plans
and deliverables with company management, including business requirements reviews
with business personnel and technical reviews with project technical personnel.

Set a process in place beforehand to ensure appropriate personnel are invited, any
relevant documents are distributed at least 24 hours in advance, and that reviews focus
on questions and issues (rather than a laborious "reading of the code").

Reviews may include:



● Project scope and business case review.
● Business requirements review.
● Source analysis and business rules reviews.
● Data architecture review.
● Technical infrastructure review (hardware and software capacity and
configuration planning).
● Data integration logic review (source to target mappings, cleansing and
transformation logic, etc.).
● Source extraction process review.
● Operations review (operations and maintenance of load sessions, etc.).
● Reviews of operations plan, QA plan, deployment and support plan.

Project Sponsor Meetings

A project sponsor meeting should be held weekly or bi-weekly to communicate progress to the Project
Sponsor and Key Stakeholders. The purpose is to keep key user management involved and engaged in
the process. It is also an opportunity to communicate any changes to the initial plan and to have these
stakeholders weigh in on the decision process.

Elements of the meeting include:

● Key Accomplishments.
● Activities Next Week.
● Tracking of Progress to-Date (Budget vs. Actual).
● Key Issues / Roadblocks.

It is the Project Manager’s role to stay neutral to any issue and to effectively state facts
and allow the Project Sponsor or other key executives to make decisions. Many times
this process builds the partnership necessary for success.

Change in Scope

Directly address and evaluate any changes to the planned project activities, priorities,
or staffing as they arise, or are proposed, in terms of their impact on the project plan.

The Project Manager should institute a change management process in response to any issue or request
that appears to add or alter expected activities and has the potential to affect the plan.

● Use the Scope Change Assessment to record the background problem or requirement and the
recommended resolution that constitutes the potential scope change. Note that such a
change-in-scope document helps capture key documentation that is particularly useful if the
project overruns or fails to deliver upon Project Sponsor expectations.
● Review each potential change with the technical team to assess its impact on
the project, evaluating the effect in terms of schedule, budget, staffing
requirements, and so forth.
● Present the Scope Change Assessment to the Project Sponsor for acceptance
(with formal sign-off, if applicable). Discuss the assumptions involved in the
impact estimate and any potential risks to the project.

Even if there is no evident effect on the schedule, it is important to document these changes because
they may affect project direction and it may become necessary, later in the project cycle, to justify
these changes to management.

Management of Issues

Any questions, problems, or issues that arise and are not immediately resolved should be tracked to
ensure that someone is accountable for resolving them and so that their effect remains visible.

Use the Issues Tracking template, or something similar, to track issues, their owner,
and dates of entry and resolution as well as the details of the issue and of its solution.

Significant or "showstopper" issues should also be mentioned on the status report and
communicated through the weekly project sponsor meeting. This way, the Project
Sponsor has the opportunity to resolve and cure a potential issue.

Project Acceptance and Close

A formal project acceptance and close helps document the final status of the project.
Rather than simply walking away from a project when it seems complete, this explicit
close procedure both documents and helps finalize the project with the Project Sponsor.

For most projects this involves a meeting where the Project Sponsor and/or department
managers acknowledge completion or sign a statement of satisfactory completion.



● Even for relatively short projects, use the Project Close Report to finalize the
project with a final status report detailing:

❍ What was accomplished.


❍ Any justification for tasks expected but not completed.
❍ Recommendations.

● Prepare for the close by considering what the project team has learned about
the environments, procedures, data integration design, data architecture, and
other project plans.
● Formulate the recommendations based on issues or problems that need to be
addressed. Succinctly describe each problem or recommendation and if
applicable, briefly describe a recommended approach.

Last updated: 01-Feb-07 18:54



Using Interviews to Determine Corporate
Data Integration Requirements

Challenge

Data warehousing projects are usually initiated out of a business need for a certain type of report
(e.g., "we need consistent reporting of revenue, bookings and backlog"). Except in the case of
narrowly-focused, departmental data marts, however, this is not enough guidance to drive a full data
integration solution. Further, a successful, single-purpose data mart can build a reputation such that,
after a relatively brief period of proving its value to users, business management floods the technical
group with requests for more data marts in other areas. The only way to avoid silos of data marts is to
think bigger at the beginning and canvass the enterprise (or at least the department, if that's your limit
of scope) for a broad analysis of data integration requirements.

Description

Determining the data integration requirements in satisfactory detail and clarity is a difficult task, however,
especially while ensuring that the requirements are representative of all the potential stakeholders. This
Best Practice summarizes the recommended interview and prioritization process for this requirements
analysis.

Process Steps

The first step in the process is to identify and interview "all" major sponsors and stakeholders. This
typically includes the executive staff and CFO, since they are likely to be the key decision makers who
will depend on the data integration. At a minimum, figure on 10 to 20 interview sessions.

The next step in the process is to interview representative information providers. These
individuals include the decision makers who provide the strategic perspective on what
information to pursue, as well as details on that information, and how it is currently
used (i.e., reported and/or analyzed). Be sure to provide feedback to all of the sponsors
and stakeholders regarding the findings of the interviews and the recommended
subject areas and information profiles. It is often helpful to facilitate a Prioritization
Workshop with the major stakeholders, sponsors, and information providers in order to
set priorities on the subject areas.



Conduct Interviews

The following paragraphs offer some tips on the actual interviewing process. Two
sections at the end of this document provide sample interview outlines for the executive
staff and information providers.

Remember to keep executive interviews brief (i.e., an hour or less) and to the point. A
focused, consistent interview format is desirable. Don't feel bound to the script,
however, since interviewees are likely to raise some interesting points that may not be
included in the original interview format. Pursue these subjects as they come up,
asking detailed questions. This approach often leads to “discoveries” of strategic uses
for information that may be exciting to the client and provide sparkle and focus to the
project.

Questions to the "executives" or decision-makers should focus on what business strategies and decisions
need information to support or monitor them. (Refer to the Outline for Executive Interviews at the end of
this document.) Coverage here is critical: if key managers are left out, you may miss a critical viewpoint
and an important buy-in.

Interviews of information providers are secondary but can be very useful. These are the business
analyst types who report to decision-makers and who currently consolidate data from more than one
source, using Excel, Lotus, or a database program, to provide regular and ad hoc reports or to conduct
sophisticated analysis. In subsequent phases of the project, you must identify all of these individuals,
learn what information they access, and learn how they process it. At this stage, however, you should
focus on the basics, building a foundation for the project and discovering what tools are currently in use
and where gaps may exist in the analysis and reporting functions.

Be sure to take detailed notes throughout the interview process. If there are a lot of
interviews, you may want the interviewer to partner with someone who can take good
notes, perhaps on a laptop to save note transcription time later. It is important to take
down the details of what each person says because, at this stage, it is difficult to know
what is likely to be important. While some interviewees may want to see detailed notes
from their interviews, this is not very efficient since it takes time to clean up the notes
for review. The most efficient approach is to simply consolidate the interview notes into
a summary format following the interviews.

Be sure to review previous interviews as you go through the interviewing process. You can often use
information from earlier interviews to pursue topics in later interviews in more detail and with varying
perspectives.



The executive interviews must be carried out in “business terms.” There can be no
mention of the data warehouse or systems of record or particular source data entities
or issues related to sourcing, cleansing or transformation. It is strictly forbidden to use
any technical language. It can be valuable to have an industry expert prepare and even
accompany the interviewer to provide business terminology and focus. If the interview
falls into “technical details,” for example, into a discussion of whether certain
information is currently available or could be integrated into the data warehouse, it is up
to the interviewer to re-focus immediately on business needs. If this focus is not
maintained, the opportunity for brainstorming is likely to be lost, which will reduce the
quality and breadth of the business drivers.

Because of the above caution, it is rarely acceptable to have IS resources present at the executive
interviews. These resources are likely to engage the executive (or vice versa) in a discussion of current
reporting problems or technical issues and thereby destroy the interview opportunity.

Keep the interview groups small. One or two Professional Services personnel should
suffice with at most one client project person. Especially for executive interviews, there
should be one interviewee. There is sometimes a need to interview a group of middle
managers together, but if there are more than two or three, you are likely to get much
less input from the participants.

Distribute Interview Findings and Recommended Subject Areas

At the completion of the interviews, compile the interview notes and consolidate the content into a
summary. This summary should help to break out the input into departments or other groupings
significant to the client. Use this content and your interview experience, along with "best practices" or
industry experience, to recommend specific, well-defined subject areas.

Remember that this is a critical opportunity to position the project to the decision-
makers by accurately representing their interests while adding enough creativity to
capture their imagination. Provide them with models or profiles of the sort of information
that could be included in a subject area so they can visualize its utility. This sort of
“visionary concept” of their strategic information needs is crucial to drive their
awareness and is often suggested during interviews of the more strategic thinkers. Tie
descriptions of the information directly to stated business drivers (e.g., key processes
and decisions) to further accentuate the “business solution.”

A typical table of contents in the initial Findings and Recommendations document might
look like this:



I. Introduction
II. Executive Summary
A. Objectives for the Data Warehouse
B. Summary of Requirements
C. High Priority Information Categories
D. Issues
III. Recommendations
A. Strategic Information Requirements
B. Issues Related to Availability of Data
C. Suggested Initial Increments
D. Data Warehouse Model
IV. Summary of Findings
A. Description of Process Used
B. Key Business Strategies (this includes descriptions of processes,
decisions, and other drivers)
C. Key Departmental Strategies and Measurements
D. Existing Sources of Information
E. How Information is Used
F. Issues Related to Information Access
V. Appendices
A. Organizational structure and departmental roles
B. Departmental responsibilities and relationships

Conduct Prioritization Workshop

This is a critical workshop for consensus on the business drivers. Key executives and
decision-makers should attend, along with some key information providers. It is
advisable to schedule this workshop offsite to assure attendance and attention, but the
workshop must be efficient — typically confined to a half-day.

Be sure to announce the workshop well enough in advance to ensure that key
attendees can put it on their schedules. Sending the announcement of the workshop
may coincide with the initial distribution of the interview findings.

The workshop agenda should include the following items:

● Agenda and Introductions


● Project Background and Objectives
● Validate Interview Findings: Key Issues
● Validate Information Needs
● Reality Check: Feasibility



● Prioritize Information Needs
● Data Integration Plan
● Wrap-up and Next Steps

Keep the presentation as simple and concise as possible, and avoid technical
discussions or detailed sidetracks.

Validate information needs

Key business drivers should be determined well in advance of the workshop, using
information gathered during the interviewing process. Prior to the workshop, these
business drivers should be written out, preferably in display format on flipcharts or
similar presentation media, along with relevant comments or additions from the
interviewees and/or workshop attendees.

During the validation segment of the workshop, attendees need to review and discuss
the specific types of information that have been identified as important for triggering or
monitoring the business drivers. At this point, it is advisable to compile as complete a
list as possible; it can be refined and prioritized in subsequent phases of the project.
As much as possible, categorize the information needs by function, maybe even by
specific driver (i.e., a strategic process or decision). Considering the information needs
on a function by function basis fosters discussion of how the information is used and by
whom.

Reality check: feasibility

With the results of brainstorming over business drivers and information needs listed (all
over the walls, presumably), take a brief detour into reality before prioritizing and
planning. You need to consider overall feasibility before establishing the first priority
information area(s) and setting a plan to implement the data warehousing solution with
initial increments to address those first priorities.

Briefly describe the current state of the likely information sources (SORs). What
information is currently accessible with a reasonable likelihood of the quality and
content necessary for the high priority information areas? If there is likely to be a high
degree of complexity or technical difficulty in obtaining the source information, you may
need to reduce the priority of that information area (i.e., tackle it after some successes
in other areas).

Avoid getting into too much detail or technical issues. Describe the general types of information that will
be needed (e.g., sales revenue, service costs, customer descriptive information, etc.), focusing on what
you expect will be needed for the highest priority information needs.

Data Integration Plan

The project sponsors, stakeholders, and users should all understand that the process
of implementing the data warehousing solution is incremental. Develop a high-level
plan for implementing the project, focusing on increments that are both high-value and
high-feasibility. Implementing these increments first provides an opportunity to build
credibility for the project. The objective during this step is to obtain buy-in for your
implementation plan and to begin to set expectations in terms of timing. Be practical
though; don't establish too rigorous a timeline!

Wrap-up and next steps

At the close of the workshop, review the group's decisions (in 30 seconds or less),
schedule the delivery of notes and findings to the attendees, and discuss the next steps
of the data warehousing project.

Document the Roadmap

As soon as possible after the workshop, provide the attendees and other project
stakeholders with the results:

● Definitions of each subject area, categorized by functional area


● Within each subject area, descriptions of the business drivers and information
metrics
● Lists of the feasibility issues
● The subject area priorities and the implementation timeline.

Outline for Executive Interviews

I. Introductions
II. General description of information strategy process
A. Purpose and goals
B. Overview of steps and deliverables
● Interviews to understand business information strategies and expectations
● Document strategy findings



● Consensus-building meeting to prioritize information
requirements and identify “quick hits”
● Model strategic subject areas
● Produce multi-phase Business Intelligence strategy
III. Goals for this meeting
A. Description of business vision, strategies
B. Perspective on strategic business issues and how they drive information
needs
● Information needed to support or achieve business goals

● How success is measured


IV. Briefly describe your roles and responsibilities.
● The interviewee may provide this information before the actual interview. In this case,
simply review with the interviewee and ask if there is anything to add.
A. What are your key business strategies and objectives?
● How do corporate strategic initiatives impact your group?

● These may include "MBOs" (personal performance objectives), and workgroup objectives
or strategies.
B. What do you see as the Critical Success Factors for an Enterprise
Information Strategy?
● What are its potential obstacles or pitfalls?

C. What information do you need to achieve or support key decisions related to your business
objectives?
D. How will your organization's progress and final success be measured (e.g., metrics, critical
success factors)?
E. What information or decisions from other groups affect your success?
F. What are other valuable information sources (i.e., computer reports,
industry reports, email, key people, meetings, phone)?
G. Do you have regular strategy meetings? What information is shared as
you develop your strategy?
H. If it is difficult for the interviewee to brainstorm about information needs,
try asking the question this way: "When you return from a two-week
vacation, what information do you want to know first?"
I. Of all the information you now receive, what is the most valuable?
J. What information do you need that is not now readily available?
K. How accurate is the information you are now getting?
L. To whom do you provide information?
M. Who provides information to you?
N. Who would you recommend be involved in the cross-functional
Consensus Workshop?



Outline for Information Provider Interviews

I. Introductions
II. General description of information strategy process
A. Purpose and goals
B. Overview of steps and deliverables
● Interviews to understand business information strategies and expectations
● Document strategy findings and model the strategic subject
areas
● Consensus-building meeting to prioritize information
requirements and identify “quick hits”
● Produce multi-phase Business Intelligence strategy
III. Goals for this meeting
1. Understanding of how business issues drive information needs
2. High-level understanding of what information is currently provided to
whom
● Where does it come from

● How is it processed
● What are its quality or access issues
IV. Briefly describe your roles and responsibilities.
● The interviewee may provide this information before the actual interview. In this case,
simply review with the interviewee and ask if there is anything to add.
A. Who do you provide information to?
B. What information do you provide to help support or measure the
progress/success of their key business decisions?
C. Of all the information you now provide, what is the most requested or
most widely used?
D. What are your sources for the information (both in terms of systems and
personnel)?
E. What types of analysis do you regularly perform (i.e., trends,
investigating problems)? How do you provide these analyses (e.g.,
charts, graphs, spreadsheets)?
F. How do you change/add value to the information?
G. Are there quality or usability problems with the information you work
with? How accurate is it?

Last updated: 05-Jun-08 15:16



Sample Deliverables

● Business Requirements
Specification
● Change Request Form
● Data Migration Communication Plan
● Data Quality Plan Design
● Database Sizing Model
● Functional Requirements Specification
● Information Requirements Specification
● Issues Tracking
● Mapping Inventory
● Mapping Specifications
● Metadata Inventory
● Migration Request Checklist
● Operations Manual
● Physical Data Model Review Agenda
● Project Definition
● Project Plan
● Project Roadmap
● Project Role Matrix
● Prototype Feedback
● Restartability Matrix
● Scope Change Assessment
● Source Availability Matrix
● System Test Plan
● Target-Source Matrix
● Technology Evaluation Checklist
● Test Case List
● Test Condition Results
● Unit Test Plan
● Work Breakdown Structure
VELOCITY
SAMPLE DELIVERABLE

Business Requirements
Specification
DOCUMENT AUTHOR:
DOCUMENT OWNER:
DATE CREATED:
LAST UPDATED:
PROJECT:
COMPANY:
BUSINESS REQUIREMENTS SPECIFICATION

DOCUMENT OVERVIEW

This document presents a brief description of the business of CompanyX and specific business
requirements applicable to the project. Any subsequent changes, additions, or deletions are not part of
this document and will be submitted to CompanyX separately for acceptance and inclusion as an
additional requirement for the project.

1.1 BUSINESS OVERVIEW

<Describe high-level view of business environment, strategy, reason for system implementation, etc>

1.2 BUSINESS REQUIREMENTS MATRIX

Business Requirement / Functional Requirement                                              Priority

1.  Data from log files and the application database will be extracted and                    1
    distributed into a central repository.
    1a. The central repository will be a flat file database.                                   1
    1b. The data in the central repository will be partitioned by cluster, time                1
        period, portal and merchant such that the data will be readily accessible
        by reading the fewest number of rows.
    1c. Data from the log files will be gathered from each cluster machine and                 1
        distributed into the central repository.
    1d. Data from the Oracle database will be extracted and distributed into the               1
        central repository.
    1e. The data will be kept in 9 types of data files: open connection log, close             1
        connection log, command log, search command log, assistant log and product
        listing log, ats referral stats, ats order confirmations, ats order items.
    1f. The data will be aggregated into 9 types of aggregate files, one for each              1
        type of data file.

2.  The data will be aggregated by merchant for each time period.                              1
    2a. PowerCenter will be used to generate the aggregated data files.                        2
    2b. The data will be aggregated on at least a monthly basis; if possible, on a             1
        daily or hourly basis.
    2c. The data will be aggregated by merchant/portal combination by hour, day,               1
        month, year and cumulative.

3.  A flat file will be extracted from the central repository for loading into the             1
    billing system.
    3a. The billing extract will be performed on (at least) a monthly basis.                   1
    3b. Data file(s) will be formatted for the billing system. These requirements              1
        will be defined at a later time.

4.  The architecture will support the simultaneous use of multiple versions of the             2
    log files.
    4a. The system will handle extracting from and loading to multiple versions of             2
        the log files.

5.  FUTURE REQ: Users will be able to build sessions to extract selected data from             2
    the flat file database.
    5a. The architecture will support the generation of required data files for                2
        reporting needs.

1.3 BUSINESS REQUIREMENTS DETAIL

1.3.1 REQUIREMENT 1 DETAIL

Business Requirement: Data from log files and the application database will be extracted and distributed into a central repository.

Constraints:
● The log files will be transported from each cluster machine and placed in the repository.
● The application server must be available for data extract.

Inputs:
● Hourly log files from the cluster processing machines.
● Tables from the application database server.

Outputs:
● Flat file database built from the log files and tables using a hierarchical directory/file structure.

Dependencies:
● Each processing machine in each cluster will correctly generate the log files.
● The log files will be transferred to the repository machine intact.
● The database is operational.

Hardware / Software Requirements:
● Central Repository Machine: SUN E4500, 4 CPU, 4 GB RAM, 70 GB HDD, Sun Solaris 2.6.
● PowerCenter will be used to extract data from the application database.
● PERL scripts will be used to extract data from the log files.
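To make the hierarchical directory/file structure concrete, the sketch below shows one possible way to build a partition path by cluster, time period, portal and merchant. It is a Python illustration only (the requirement itself calls for PERL scripts and PowerCenter); the root path, level ordering and file-type names are assumptions rather than part of the specification.

    from pathlib import Path

    # Illustrative sketch only: one possible hierarchical layout for the central
    # flat-file repository, partitioned by cluster, time period, portal and
    # merchant. The root path, ordering of levels and file-type names are
    # assumptions, not part of the requirement.
    REPOSITORY_ROOT = Path("/data/central_repository")

    def partition_path(cluster, period, portal, merchant, file_type):
        """Return the file path for one partition of one data-file type."""
        return REPOSITORY_ROOT / cluster / period / portal / merchant / f"{file_type}.dat"

    # Example: hourly command-log data for one merchant on one cluster machine.
    print(partition_path("cluster01", "2001-03-15T10", "portalA", "merchant42", "command_log"))
    # -> /data/central_repository/cluster01/2001-03-15T10/portalA/merchant42/command_log.dat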

1.3.2 REQUIREMENT 2 DETAIL

Business
Requirement

Constraints

Inputs

Outputs

Dependencies
Hardware /
Software
Requirements

1.3.3 REQUIREMENT 3 DETAIL

Business
Requirement

Constraints

Inputs

Outputs

Dependencies
Hardware /
Software
Requirements



VELOCITY
SAMPLE DELIVERABLE

Change Request Form


DOCUMENT AUTHOR:
DOCUMENT OWNER:
DATE CREATED:
LAST UPDATED:
PROJECT:
COMPANY:
CHANGE REQUEST FORM

REQUESTOR INFORMATION:
Name: Phone Number: Date:

REQUEST TYPE: (Check appropriate box)


Change: Migration: Both:

REQUEST INCLUDES: (Check appropriate box)


Source Mapping Transformation Worklet
Target Mapplet Session Other
Data Map DQ Plan Workflow

DESCRIPTION OF OTHER:

REQUEST INFORMATION: (Enter all applicable information for current environment)


Object Name:
Source Type: Source Location:
Target Type: Target Location:
If a relational source or target change, has the change been applied to the physical database? YES NO
If a change to a flat file or copybook, has the change been applied to the file? YES NO
Shared Folder: YES NO
Reusable Object: YES NO
From Folder: From Repository:
To Folder: To Repository:
Current Source Database Connection: Current Target Database Connection:

MIGRATION INFORMATION:
Migrate From: DEV TEST QA PRODUCTION
Migrate To: DEV TEST QA PRODUCTION
Deployment Group:
Label:

WORKFLOW/SESSION DETAILS: (Include any special details about the session configuration,
automatic memory configuration, recovery options, load strategy, etc.)

SPECIFICATIONS: (Document the detailed requirements of this request)

IMPLEMENTATION INFORMATION:
Reviewed/Approved By: Implemented By:
Date Received: Date Implemented:
Comments:



VELOCITY
SAMPLE DELIVERABLE

Communication Plan
DOCUMENT AUTHOR:
DOCUMENT OWNER:
DATE CREATED:
LAST UPDATED:
PROJECT:
COMPANY:
COMMUNICATION PLAN

TABLE OF CONTENTS

INTRODUCTION ........................................................................................ 3

OVERALL OWNERSHIP .............................................................................. 3

CONTACT INFORMATION ........................................................................... 3

TEAM INFORMATION ................................................................................. 3

CONFERENCE CALL DETAILS ..................................................................... 3

TASK LEVEL START NOTIFICATION .............................................................. 3

TASK LEVEL COMPLETION NOTIFICATION .................................................... 3

TASK LEVEL LACK OF START NOTIFICATION ................................................. 4

GO/NO-GO PROCEDURE ........................................................................... 4

PUNCH LIST – MASTER LOCATION .............................................................. 4


INTRODUCTION
[Provide a brief introduction to the Communication Plan -- specify the purpose of the document.]

OVERALL OWNERSHIP
[Provide verbiage that identifies the owner of the migration project. This person is the decision-maker for all issues and points that require clarification (e.g., Joe Blob, Implementation Architect, will own the plan and make all final decisions). Responsibilities include setting up ad-hoc calls, consulting with the PMO and Project Manager, and relaying information about appropriate options.]

CONTACT INFORMATION
Name Team Cell Phone Number Home Phone Number

TEAM INFORMATION
Team Name Role Manager

CONFERENCE CALL DETAILS


There will be planned project conference status calls as shown below:
Call Topic Date Time Required/Optional

Additional calls will be scheduled as needed. Required attendees will receive notification of ad-hoc conference calls
via cell phone and Email invitations.

Teleconference Access:
Phone Number: 1-800-123-4567
Conference Code: 1234

TASK LEVEL START NOTIFICATION


[Identify the communications that should occur upon the start of any given task on the punch list.]

TASK LEVEL COMPLETION NOTIFICATION


[Identify the communications that should occur upon completion of any given task on the punch list.]


TASK LEVEL LACK OF START NOTIFICATION


[Identify what actions should occur if appropriate communications do not occur at the start of any given task on the
punch list.]

GO/NO-GO PROCEDURE
[Provide details on how a Go/No-Go decision will be determined and communicated.]

PUNCH LIST – MASTER LOCATION


[The master Punch List will be located at <<location>> or identify who will hold the master list (a centralized location
is recommended.)]



VELOCITY
SAMPLE DELIVERABLE

Data Quality Plan Design


DOCUMENT AUTHOR:
DOCUMENT OWNER:
DATE CREATED:
LAST UPDATED:
PROJECT:
COMPANY:
DATA QUALITY PLAN DESIGN

DATA QUALITY PLAN DESIGN SAMPLE DELIVERABLE


A Data Quality Plan Design describes the design and operation of one or more data quality plans within a
consultancy project in a manner that business users associated with the project can understand.

The document should serve as a plan handover document for business users and be written in a manner that a user
trained in IDQ can understand and update the plan design unaided.

We recommend that you build a plan design document as you build the plan.

A Plan Design document should contain the following sections:

● Introduction
● Document scope and readership
● Document history
● Plan heading [plan name].pln
● Overview
● Inputs
● Component descriptions
● Dictionaries
● Outputs
● Next steps

THE INTRODUCTION
The introduction should describe the data quality objectives of the plan and its relationship to the parent project.
When writing the introduction, consider these questions:

● What is the name of the plan?
● What project is the plan part of? Where does the plan fit in the overall project?
● What particular aspect of the project does the plan address?
● What are the objectives of the plan?
● What issues, if any, apply to the plan or its data?
● What business rules are used in the plan? What is the origin of these rules?
● What department or group uses the plan output?
● What are the before and after states of the plan data?
● Where are the plans located (include machine details and folder location) and when were the plans executed?
● What steps were taken or should be taken following plan execution?

PROJECT NAME/PLAN NAME


This section describes the Informatica Data Quality plan in technical detail. Include the path to the plan within the IDQ
Project Manager or on the file system in the heading or in the first paragraph below the heading. Note that the plan
name may contain a suffix such as .pln or .xml if it has been saved out from the Data Quality Repository.

This section has the following main sub-sections:


OVERVIEW

The overview provides the following information:

● The plan type (e.g., standardization, matching).
● The data or business objective of the plan.
● Who ran (or should run) the plan, and when.
● The version of IDQ in which the plan was designed, the Informatica application that will run the plan (e.g., Data Quality Server), and the platform on which the plan will run.
● A screengrab of the plan layout in the Workbench user interface.
● Any other relevant information.

INPUTS

This section identifies the source data for the plan. Consider these questions:

● To what data file/table do the plan’s source components connect?
● Where is the source file located? What are the format and origin of the database table?
● What data source(s) are used?
● If a data file, what type? Are any parameters set at this stage (e.g., Unicode)?
● If a table, what operations are performed at data source level? Provide SQL statements if appropriate.
● Is the source data an output from another Informatica Data Quality plan, and if so, which one?

COMPONENT DESCRIPTIONS

This section describes at a low level the operational components and business rules used in the plan. Where
possible, these should be listed in order of their interaction with the data. How much detail you go into depends on
the audience for the document and what their needs are. It also depends on whether the business rules are
documented elsewhere.

Component functionality can be described at a high level as shown in the examples below:

Search Replace component takes Addr Line1 from CSV Source and removes spaces anywhere and full
stops from end.

Output from Search Replace component is put through the Word Manager and Addr Line 1 is standardized
using ‘Address Prefix’ and ‘Address Suffix’ dictionaries.
Output from Word Manager is used as input to Token Labeller and profiled using the following dictionaries in
this order: A.dic, Bb.dic, C.dic.

Continue stepping through each component in this manner.
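As an illustration of the level of detail intended, the first rule above (remove spaces anywhere in Addr Line1 and strip full stops from the end) could be sketched as follows. This is only a plain-Python illustration of the rule's effect, not the IDQ Search Replace component itself:

    # Minimal sketch of the standardization rule described above for Addr Line1:
    # remove spaces anywhere in the value and strip full stops from the end.
    # It illustrates the rule only; it is not the IDQ Search Replace component.
    def standardize_addr_line1(value: str) -> str:
        return value.replace(" ", "").rstrip(".")

    print(standardize_addr_line1("12 Main St."))   # -> 12MainSt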

For fine-grained plan description, consider these questions:

● What is the component name?
● What instances are defined, and how are they named?
● For each instance:
● What input fields are selected?


● What parameters are set?
● What filters are defined?
● What reference dictionaries are applied?
● What business rules are defined? (Provide the logical statements if appropriate.)
● Are the dictionaries/business rules specified by the client?
● What are the outputs for the instance, and how are they named?

DICTIONARIES

List dictionaries and other reference content used, and their file locations.

Cross-refer each dictionary to the component(s) that use it.

OUTPUTS
In this section, describe the plan output and identify its file or database destination. Consider these questions:

● What is the sink name?
● Where is the sink output written: report, database table, file?
● What output fields are selected for the sink?
● Are there exception files? If so, where are they written to?

Provide SQL statements if appropriate.

[PLAN NAME.PLN]
This section is optional; it can be used in the same manner as the previous plan section and its subsections, above, if
another plan is described in this document.

NEXT STEPS
This section is relevant if there are other actions dependent on the plan and if the plan output is to be used elsewhere
in the project, as is typically the case. Consider these questions:

● What is the next step in the project?
● Will the plan(s) be re-used?
● Who receives the plan output data, and what actions will they take?
● What steps, if any, will the Informatica consultant take in connection with these plans?



DATABASE SIZING MODEL

Table Name | Description | # Cols | Row Width | PCTFREE | Row Count Estimate (Month 0 / 12 / 24 / 36) | Table Size Estimate (Month 0 / 12 / 24 / 36)
TABLE_1 | Sample 1 | 3 | 69 | 10% | 10,000 / 100,000 / 200,000 / 1,000,000 | 759,000 / 7,590,000 / 15,180,000 / 75,900,000
TABLE_2 | Sample 2 (Static) | 2 | 44 | 0% | 220 / 220 / 220 / 220 | 9,680 / 9,680 / 9,680 / 9,680
TABLE_3 | Sample 3 | 9 | 268 | 20% | 1,000,000 / 2,000,000 / 3,000,000 / 4,000,000 | 321,600,000 / 643,200,000 / 964,800,000 / 1,286,400,000

Data Size (Month 0 / 12 / 24 / 36): 322,368,680 / 650,799,680 / 979,989,680 / 1,362,309,680
Index %: 30% for all periods
Total: 419,079,284 / 846,039,584 / 1,273,986,584 / 1,771,002,584
MB: 399.7 / 806.8 / 1,215.0 / 1,689.0
GB: 0.39 / 0.79 / 1.19 / 1.65
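The figures above follow a simple pattern: each table's size is rows x row width x (1 + PCTFREE), the per-table sizes are summed, a 30% index allowance is added on top, and MB/GB use binary units. A minimal sketch reproducing the Month 0 column (the formulas are inferred from the sample numbers):

    # Minimal sketch of the sizing arithmetic in the model above (Month 0 column).
    # Assumed formulas: table size = rows * row width * (1 + PCTFREE); the 30%
    # index allowance is added to the summed data size; MB and GB are binary units.
    tables = [
        # (name, row width in bytes, PCTFREE, estimated rows at Month 0)
        ("TABLE_1", 69, 0.10, 10_000),
        ("TABLE_2", 44, 0.00, 220),
        ("TABLE_3", 268, 0.20, 1_000_000),
    ]

    data_size = sum(rows * width * (1 + pctfree) for _, width, pctfree, rows in tables)
    total = data_size * 1.30                     # add 30% for indexes

    print(f"Data size: {data_size:,.0f}")        # 322,368,680
    print(f"Total:     {total:,.0f}")            # 419,079,284
    print(f"MB:        {total / 2**20:,.1f}")    # 399.7
    print(f"GB:        {total / 2**30:.2f}")     # 0.39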



VELOCITY
SAMPLE DELIVERABLE

Functional Requirements
Specification
DOCUMENT AUTHOR:
DOCUMENT OWNER:
DATE CREATED:
LAST UPDATED:
PROJECT:
COMPANY:
FUNCTIONAL REQUIREMENTS SPECIFICATION

1.1 BUSINESS OVERVIEW


<Describe business environment>

1.2 FUNCTIONAL REQUIREMENTS MATRIX


Functional Requirement / Priority / Page Reference

1a. The central repository will be a flat file database. (Priority 1)
1b. The data in the central repository will be partitioned by cluster, time period, portal and merchant such that the data will be readily accessible by reading the fewest number of rows. (Priority 1)
1c. Data from the log files will be gathered from each cluster machine and distributed into the central repository. (Priority 1)
1d. Data from the application database will be extracted and distributed into the central repository. (Priority 1)
1e. The data will be kept in 9 types of data files: open connection log, close connection log, command log, search command log, assistant log, product listing log, referral stats, order confirmations, and order items. (Priority 1)
1f. The data will be aggregated into 9 types of aggregate files, one for each type of data file. (Priority 1)
2a. PowerCenter will be used to generate the aggregated data files. (Priority 2)
2b. The data will be aggregated on at least a monthly basis and, if possible, on a daily or hourly basis. (Priority 1)
2c. The data will be aggregated by merchant/portal combination by hour, day, month, year and cumulative; a sketch of this roll-up follows the matrix. (Priority 1)
3a. The billing extract will be performed on (at least) a monthly basis. (Priority 1)
3b. Data file(s) will be formatted as per requirements specified by the billing system. These requirements will be defined at a later time. (Priority 1)
4a. The system will handle extracting from and loading to multiple versions of the log files. (Priority 2)
5a. The architecture will support the generation of required data files for reporting needs. (Priority 2)
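Requirement 2c calls for aggregation by merchant/portal combination at hourly, daily, monthly, yearly and cumulative grains. The sketch below illustrates that roll-up on a small in-memory list of records; the record layout, field names and sample values are assumptions for illustration only, and the production aggregation is specified to run in PowerCenter (requirement 2a):

    from collections import defaultdict
    from datetime import datetime

    # Minimal sketch of the merchant/portal roll-up in requirement 2c.
    # Record layout, field names and sample values are illustrative assumptions.
    records = [
        {"merchant": "m1", "portal": "p1", "ts": "2001-03-15 10:12:00", "amount": 25.0},
        {"merchant": "m1", "portal": "p1", "ts": "2001-03-15 11:40:00", "amount": 10.0},
        {"merchant": "m2", "portal": "p1", "ts": "2001-04-02 09:05:00", "amount": 7.5},
    ]

    GRAINS = {"hour": "%Y-%m-%d %H", "day": "%Y-%m-%d",
              "month": "%Y-%m", "year": "%Y", "cumulative": None}

    def aggregate(rows, grain):
        """Sum amounts by merchant/portal combination at the requested time grain."""
        fmt = GRAINS[grain]
        totals = defaultdict(float)
        for r in rows:
            ts = datetime.strptime(r["ts"], "%Y-%m-%d %H:%M:%S")
            period = ts.strftime(fmt) if fmt else "cumulative"
            totals[(r["merchant"], r["portal"], period)] += r["amount"]
        return dict(totals)

    print(aggregate(records, "month"))
    # {('m1', 'p1', '2001-03'): 35.0, ('m2', 'p1', '2001-04'): 7.5}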

1.3 FUNCTIONAL REQUIREMENTS DETAIL


1.3.1 FUNCTIONAL REQUIREMENT 1A

Functional Requirement: The central repository will be a flat file database.

Constraints:

Inputs: Source data from the shopping logs and application database.

Outputs: The central repository flat file database.

Dependencies:
● Ability to extract data from the shopping engine processing clusters.
● Ability to extract data from the ATS database.
● Ability to create the required flat file database.

Hardware / Software Requirements:
● SUN E4500
● 4 CPUs
● 4 GB RAM
● 70 GB HDD (RAID Level 5)

1.3.2 FUNCTIONAL REQUIREMENT 1B

Functional Requirement

Constraints

Inputs

Outputs

Dependencies

Hardware / Software
Requirements



VELOCITY
SAMPLE DELIVERABLE

Information Requirements
Specification
DOCUMENT AUTHOR:
DOCUMENT OWNER:
DATE CREATED:
LAST UPDATED:
PROJECT:
COMPANY:
INFORMATION REQUIREMENTS SPECIFICATION

INFORMATION REQUIREMENTS DEFINITIONS

Net Revenue: Revenue after subtracting operating costs.

Dimensional Views: Below are the views in which Net Revenue can be analyzed.

Time: Year, Quarter, Month, Fiscal Year, Fiscal Qtr
Geography: Region, District, City
Customer

SUBJECT AREA DATA SETS

Subject Area: Sales Analysis
Dimension Set: Time, Geography, Customer
Metrics: Net Revenue, Gross Margin
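The definitions above can also be captured as simple data structures for reference; the sketch below is one illustrative representation (the structure itself is an assumption, while the dimension, level and metric names are taken from the tables above):

    # Illustrative sketch only: the dimensional views and subject-area data set
    # above expressed as simple Python structures. The representation is an
    # assumption; the dimension, level and metric names come from the tables.
    DIMENSIONAL_VIEWS = {
        "Time": ["Year", "Quarter", "Month", "Fiscal Year", "Fiscal Qtr"],
        "Geography": ["Region", "District", "City"],
        "Customer": [],  # no levels listed in the sample
    }

    SUBJECT_AREAS = {
        "Sales Analysis": {
            "dimensions": ["Time", "Geography", "Customer"],
            "metrics": ["Net Revenue", "Gross Margin"],
        },
    }

    # Example check: metrics available in the Sales Analysis subject area.
    print(SUBJECT_AREAS["Sales Analysis"]["metrics"])   # ['Net Revenue', 'Gross Margin']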



ISSUES TRACKING

# | Short Description | Assign To | Status | Priority | Severity | Date ID'd | ID'd By | Issue Resolution: Date | Description | Work Around | Investigation | Solution

10

11

12

13

14

15



MAPPING INVENTORY

Map ID | Mapping Name | Target Table(s) | Source(s) | Volume | Complexity | Assigned | Estimate | Actual | Issues
1 | m_DM_Customer_Dimension | DataMart.CUSTOMER | Staging.RS_CUSTOMER, Staging.RS_CUSTOMER_ADDRESS | Med | Low | <Developer> | 5 days | 4 days | Need to determine how to merge with second source system, i.e., based on what fields can the two files be joined?
2

10

11

12

13

14

15

16

17

18

19

20



VELOCITY
SAMPLE DELIVERABLE

Mapping Specifications
DOCUMENT AUTHOR:
DOCUMENT OWNER:
DATE CREATED:
LAST UPDATED:
PROJECT:
COMPANY:
MAPPING SPECIFICATIONS

Mapping Name:

Source System(s):

Target System(s):

Initial Rows: Rows/Load:

Short Description:

Load Frequency:

Preprocessing:

Post Processing:

Error Strategy:

Reload Strategy:

Unique Source
Fields (PK):

Dependent
Objects

SOURCES

Tables
Table Name System/Schema/Owner Selection/Filter

Files
File Name File Location Fixed/Delimited Additional File Info

TARGETS

Tables Schema Owner


Table Name Update Delete Insert Unique Key


Files
File Name File Location Fixed/Delimited Additional File Info

LOOKUPS

Lookup Name
Table Location
Match Condition(s)

Persistent / Dynamic

Filter/SQL Override

HIGH LEVEL PROCESS OVERVIEW


<Insert high level diagram of process flow from source to target to show sequence of events>

PROCESSING DESCRIPTION (DETAIL)


<Describe processing logic contained in mapping >

SOURCE TO TARGET FIELD MATRIX

Target System/Table | Target Column | Data Type | Source System/Table | Source Column | Data Type | Expression | Default Value if Null | Issues/Data Quality



METADATA INVENTORY

METADATA INVENTORY - SUMMARY


PROJECT NAME: PROJECT STAGE:
No. | Source Name | Source Type/Format | Priority* | Source Content | Cross-references | Investigation By | Investigation Date | Required Reporting
1 | BO Finance Universe | BO/Oracle 9 | H | Finance Reporting Universe | BO Universe Descriptions v3.1 | DJP | 15-Dec-04 | Lineage
2 | INFA_FIN_REP | INFA/Oracle 9 | H | Finance ETL inputs, outputs and transformations | ETL Design, Finance v2.5 | EWK | 21-Dec-04 | Lineage, source-target audit
3 | General Ledger | Designer/Oracle | H | Company General Ledger | ? | | | Lineage, source audit
4 | Foreign Credits Universe | DB | M | Foreign currency credits (summary posted ...) | ? | | | Lineage, source audit
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
* Priority: H (High), M (Medium), L (Low)
** X-connect availability: One of the standard X-connects, or custom-built



VELOCITY
SAMPLE DELIVERABLE

Migration Request Checklist


DOCUMENT AUTHOR:
DOCUMENT OWNER:
DATE CREATED:
LAST UPDATED:
PROJECT:
COMPANY:
MIGRATION REQUEST CHECKLIST

REQUESTOR INFORMATION:
Name: Phone Number:

Completion Date:

PRIORITY: (Check appropriate box)


Emergency: High: Normal:

DESCRIPTION:

INFORMATICA OBJECTS:
From Repository To Repository From Folder To Folder

DATABASE OBJECTS:
From Database To Database From Schema To Schema

CODE REVIEW:
Reviewer Approval Date Code Review Comments

FUNCTIONAL TEST:
Functional User Approval Date

GOVERNANCE CHECKLIST:
Are workflows set with ‘if previous task completed successfully’? Yes No N/A

Are flat file details set up correctly? Yes No N/A

Are ftp file details set up correctly? Yes No N/A

Are database target details set up correctly? Yes No N/A

Are target file locations set correctly? Yes No N/A

Are source file locations set correctly? Yes No N/A

Is the parameter file location set correctly? Yes No N/A


MIGRATION LIST:

INFORMATICA OBJECTS:

Sources

SOURCE NAME NEW MODIFIED SHARE

Targets

TARGET NAME NEW MODIFIED SHARE

Re-Usable Transformations & Mapplets

TRANSFORMATION NAME NEW MODIFIED SHARE

Mappings

MAPPING NAME NEW MODIFIED SHARE

Workflow

WORKFLOW NAME NEW MODIFIED SHARE

Are there any sequences that need to be reset? Yes No N/A

DATABASE OBJECTS: (e.g., Tables, Stored Procedures, Triggers, Views, Sequences, Functions…)

OBJECT NAME OBJECT TYPE SCRIPT NAME FOR CREATION

Flat File Source Information (If the source is a flat file, provide this information.)


SOURCE FILE NAME DIRECTORY PATH FILE TYPE (Fixed, Delimited)

Flat File Target Information (If the target is a flat file, provide this information.)

TARGET FILE NAME DIRECTORY PATH FILE TYPE (Fixed, Delimited)

Pre-Session Information

SCRIPT NAME DIRECTORY PATH COMMENTS

Post-Session Information

SCRIPT NAME DIRECTORY PATH COMMENTS

MIGRATION SPECIAL INSTRUCTIONS:



VELOCITY
SAMPLE DELIVERABLE

Operations Manual
DOCUMENT AUTHOR:
DOCUMENT OWNER:
DATE CREATED:
LAST UPDATED:
PROJECT:
COMPANY:
OPERATIONS MANUAL

TABLE OF CONTENTS

INTRODUCTION .................................................................................................................. 3
INFRASTRUCTURE .............................................................................................................. 3
POWERCENTER INFRASTRUCTURE ....................................................................................... 3
STEPS TO STOP/RESTART INFORMATICA COMPONENTS....................................................... 3
HIGH AVAILABILITY CONFIGURATION .................................................................................. 3
POWERCENTER – WORKFLOW MANAGER ........................................................................... 3
DATA EXTRACTION.............................................................................................................. 3
DATA TRANSFORMATION ..................................................................................................... 4
POWERCENTER TRANSFORMATIONS ................................................................................. 4
PERFORM DATA TRANSFORMATIONS ................................................................................. 4
LOAD THE TARGET TABLE ............................................................................................. 4
REPROCESS CORRECTED DATA .................................................................................... 4
SQL SCRIPTS .................................................................................................................. 4
STORED PROCEDURES .................................................................................................... 4
DATA LOAD ........................................................................................................................ 5
SUBJECT AREA LOAD ORDER ............................................................................................ 5
MAPPING LOAD ORDER AND RECOVERY ............................................................................. 5
WORKFLOW/SESSIONS A .............................................................................................. 5
RESTART STEPS:......................................................................................................... 6
RECOVERY/ROLLBACK PROCEDURES: ........................................................................... 6
ERROR HANDLING .............................................................................................................. 6
ERROR REPROCESSING STEPS:........................................................................................ 6
METADATA ......................................................................................................................... 6
[OTHER SOFTWARE] DESCRIPTION ....................................................................................... 7
[OTHER SOFTWARE] PROCEDURES ....................................................................................... 7
[OTHER SOFTWARE] OPERATIONS ........................................................................................ 7
SUPPORT/MAINTENANCE ..................................................................................................... 7
SERVICE LEVEL AGREEMENT ............................................................................................ 7
CONTACT INTERNAL SUPPORT PERSONNEL ....................................................................... 7
CONTACT EXTERNAL VENDOR CUSTOMER SUPPORT ........................................................... 7
INFORMATICA GLOBAL CUSTOMER SUPPORT .................................................................. 7
COMMUNICATIONS WITH OTHER TEAMS ............................................................................. 8
MAINTENANCE/OUTAGE SCHEDULE ................................................................................... 8
APPENDIX A - REFERENCES ................................................................................................. 8

TABLE OF FIGURES
[List of any figures or diagrams used in this document]


INTRODUCTION
[Provide a brief introduction for the Operation Manual. Specify the purpose of the document.]

INFRASTRUCTURE
[Describe the project’s infrastructure. If possible, provide a diagram of the system.]

POWERCENTER INFRASTRUCTURE
[Describe the PowerCenter infrastructure. Include setup and location of Informatica servers.]

STEPS TO STOP/RESTART INFORMATICA COMPONENTS

[Outline steps to perform a graceful shutdown and restart of the PowerCenter domain, PowerCenter node,
PowerCenter service, Repository service, Data Analyzer web service providers and PowerCenter Repository. Include
the locations and names of any startup and shutdown scripts.]

HIGH AVAILABILITY CONFIGURATION

[Outline the high availability configuration for the domain gateway, the PowerCenter services and other service
components.]

POWERCENTER – WORKFLOW MANAGER

[Discuss the set-up within Workflow Manager. Include the setup of the database connections and an explanation of the
Workflow Manager variables.

Outline steps to stop/restart workflows or sessions within Workflow Manager.]

DATA EXTRACTION
[Describe the source data that are being processed in this system.]

The data feeds received are:

Subject Area Description Feed Names Source(s) Load Frequency


Subject Area Description Feed Names Source(s) Load Frequency

DATA TRANSFORMATION
[Provide an overview of how data is loaded into the data warehouse/mart tables.]

POWERCENTER TRANSFORMATIONS

[Describe the overall data transformation process. Indicate how data flows from the source to the target(s). If
necessary, discuss the error handling process. If possible, include a diagram that illustrates the data flows and
describe each stage in the process.]

PERFORM DATA TRANSFORMATIONS

[Describe how the data is processed in the mappings.]

LOAD THE TARGET TABLE

[Discuss how and when data is loaded into the target database. If there are intermediate steps, describe these as
well. Describe the distinctions between the initial target table load and incremental load, if any.]

REPROCESS CORRECTED DATA

[If applicable, discuss how error records are reprocessed.]

SQL SCRIPTS

[Describe any SQL scripts that are used in the process. Include where scripts are located and when in the load
process they are executed.]

STORED PROCEDURES

[Describe any stored procedures that must be used throughout the process. Include where the procedures are
located and any special permissions needed to execute them.]

Schema/Stored Procedure Description


DATA LOAD
[Describe the data load order.]

SUBJECT AREA LOAD ORDER

[List the source data load order.]

Level Subject Area Load Frequency Dependencies

MAPPING LOAD ORDER AND RECOVERY

[Discuss the individual mappings and their load orders. If applicable, discuss the steps necessary to recover from a
failure. List each workflow and/or session involved in this system.]

WORKFLOW/SESSIONS A

The workflow hierarchy is as follows:

wf_A :
s_Session1
s_Session2
cmd_Task1

Target tables loaded during this workflow are:


Workflow Name | Session Name | Mapping Name | Stored Procedures | Tables Loaded | Load Frequency | Average Run Time
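Where workflows or subject areas depend on one another, the load order discussed above can be derived mechanically from a dependency list. The sketch below is illustrative only; the workflow names and dependencies are assumptions, not part of this template:

    from graphlib import TopologicalSorter  # requires Python 3.9+

    # Minimal sketch (not part of the template): deriving a load order from a
    # dependency list like the subject-area/workflow dependencies listed above.
    # Workflow names and dependencies are illustrative assumptions.
    dependencies = {
        "wf_A": [],               # no upstream dependencies
        "wf_B": ["wf_A"],         # wf_B loads after wf_A
        "wf_C": ["wf_A", "wf_B"],
    }

    print(list(TopologicalSorter(dependencies).static_order()))
    # -> ['wf_A', 'wf_B', 'wf_C']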


RESTART STEPS:

[Outline steps to restart/terminate load operations.]

RECOVERY/ROLLBACK PROCEDURES:

[Refer to workflow monitor, session logs and database audit trail to determine failed sessions and point of failure.
Outline steps to rollback/recover load operations.

List actions to be taken if jobs significantly overrun]

RUNNING DATA ANALYZER REPORTS

[Identify and describe when to run Data Analyzer scheduled reports manually, for example where down time has
resulted in the scheduled time passing.]

ERROR HANDLING
[Provide a detailed description of the error handling strategies used in this system. Be sure to include any error
tables or files that are created in the process. Also mention any scripting that is executed or emails that are sent to
notify either operations or development staff of session failures.]

ERROR REPROCESSING STEPS:

[If applicable, describe how error reprocessing is to take place. Be sure to include any procedures that the operations
staff must perform as well as any automated procedures.

Outline escalation procedures for errors that consistently reoccur.]

METADATA
[If applicable, describe the metadata strategy in use in this system. If possible, list the metadata elements that will
be most important for the operations and development staff to track for error handling, reprocessing, and general
volume estimating purposes.]

Entity Name Attribute Name Attribute Description Source


[OTHER SOFTWARE] DESCRIPTION


[Describe any other software in the system. Duplicate this section as necessary to describe the other software that
will be maintained by the operations group.]

[OTHER SOFTWARE] PROCEDURES


[Describe the installation and maintenance of any other software in the system. Duplicate this section as necessary
to describe the other software that will be maintained by the operations group.]

[OTHER SOFTWARE] OPERATIONS


[Describe the day-to-day functionality of the software. Include steps necessary to perform critical operations. Be sure
to include any restart steps, error handling, etc.]

SUPPORT/MAINTENANCE

SERVICE LEVEL AGREEMENT

[Outline agreed service level information, including a formal Service Level Agreement.]

CONTACT INTERNAL SUPPORT PERSONNEL

[List organizational contacts in the event of a component failure that cannot be resolved by the Production Support
team. Include type of contact, contact name, department, telephone number, and e-mail address (if applicable).]

CONTACT EXTERNAL VENDOR CUSTOMER SUPPORT

INFORMATICA GLOBAL CUSTOMER SUPPORT

[Include details of named Informatica support contacts and support contract (and project id) with Informatica in case
they need to be contacted.]


COMMUNICATIONS WITH OTHER TEAMS

[Provide a list of teams (e.g., SysAdmin, DBAdmin) that require coordination with operations and their specific
support functions (e.g., system, database maintenance, security, etc.). Describe regular communications and include
a schedule for coordination activities.]

MAINTENANCE/OUTAGE SCHEDULE

[Describe the process for maintenance/planned outage scheduling and list the potential impact on system reliability.
List the system maintenance/outage schedule.]

APPENDIX A - REFERENCES
[List any documents used as reference in creating this document.]



VELOCITY
SAMPLE DELIVERABLE

Physical Data Model


Review Agenda
DOCUMENT AUTHOR:
DOCUMENT OWNER:
DATE CREATED:
LAST UPDATED:
PROJECT:
COMPANY:
PHYSICAL DATA MODEL REVIEW AGENDA

REVIEW LOGICAL DATA MODEL


Provide a hard copy of the logical data model used to create the physical model. Briefly review the relationships
between tables.

REVIEW PHYSICAL MODEL


Provide a hard copy of the physical data model.

● Review Source-to-Target Relationships – discuss the preliminary analysis performed regarding how the physical model is different from the logical.

● Perform Physical Model Walk-through – Using a selection of 20 user-query scenarios, step through the tables and attributes required to answer each query.

● Review Source-to-Target Relationships – Discuss the preliminary analysis performed regarding how data will be integrated from source systems to target systems and what data transformations are expected to be required.



VELOCITY
SAMPLE DELIVERABLE

PROJECT DEFINITION
DOCUMENT AUTHOR:
DOCUMENT OWNER:
DATE CREATED:
LAST UPDATED:
PROJECT:
COMPANY:
PROJECT DEFINITION

PROJECT OBJECTIVES
These are the key business “drivers” for the project—business-focused goals and objectives.

● <Objective 1>
● <Objective 2>
● <Objective 3> …

PROJECT TIMING
Key Milestone or Deliverable Target Date(s)

TECHNICAL ENVIRONMENT
Informatica Products, versions
Platforms / OS
Source systems / data characteristics
Target systems / DBMS
Target architectures

PROJECT BACKGROUND AND STATUS


What has been accomplished on the project?
What documents have been completed? (Requirements, design, models, etc.)
What source analyses, what target models have been completed?

Document / Model Description Name / Location

EXPECTATIONS FOR CONSULTANT INVOLVEMENT


Role Primary Activities


PROJECT PERSONNEL
Personnel that you are likely to interact with (e.g., DBA, Business Analyst, System Administrator, etc.)?
Name Phone(s) E-mail Role

PRIMARY TASKS, ACTIVITIES AND DELIVERABLES


Task/Deliverable | Description | Effort (days) | Target Date(s)



Velocity Project Plan

ID Task Name Duration Start Finish
1 1 Manage 1 day? Mon 1/1/07 Mon 1/1/07
2 1.1 Define Project 1 day? Mon 1/1/07 Mon 1/1/07
3 1.1.1 Establish Business Project Scope 1 day? Mon 1/1/07 Mon 1/1/07
4 1.1.2 Build Business Case 1 day? Mon 1/1/07 Mon 1/1/07
5 1.1.3 Assess Centralized Resources 1 day? Mon 1/1/07 Mon 1/1/07
6 1.2 Plan and Manage Project 1 day? Mon 1/1/07 Mon 1/1/07
7 1.2.1 Establish Project Roles 1 day? Mon 1/1/07 Mon 1/1/07
8 1.2.2 Develop Project Estimate 1 day? Mon 1/1/07 Mon 1/1/07
9 1.2.3 Develop Project Plan 1 day? Mon 1/1/07 Mon 1/1/07
10 1.2.4 Manage Project 1 day? Mon 1/1/07 Mon 1/1/07
11 1.3 Perform Project Close 1 day? Mon 1/1/07 Mon 1/1/07
12 2 Analyze 1 day? Mon 1/1/07 Mon 1/1/07
13 2.1 Define Business Drivers, Objectives and Goals 1 day? Mon 1/1/07 Mon 1/1/07
14 2.2 Define Business Requirements 1 day? Mon 1/1/07 Mon 1/1/07
15 2.2.1 Define Business Rules and Definitions 1 day Mon 1/1/07 Mon 1/1/07
16 2.2.2 Establish Data Stewardship 1 day? Mon 1/1/07 Mon 1/1/07
17 2.3 Define Business Scope 1 day? Mon 1/1/07 Mon 1/1/07
18 2.3.1 Identify Source Data Systems 1 day? Mon 1/1/07 Mon 1/1/07
19 2.3.2 Determine Sourcing Feasibility 1 day? Mon 1/1/07 Mon 1/1/07
20 2.3.3 Determine Target Requirements 1 day? Mon 1/1/07 Mon 1/1/07
21 2.3.4 Determine Business Process Data Flows 1 day? Mon 1/1/07 Mon 1/1/07
22 2.3.5 Build Roadmap for Incremental Delivery 1 day? Mon 1/1/07 Mon 1/1/07
23 2.4 Define Functional Requirements 1 day? Mon 1/1/07 Mon 1/1/07
24 2.5 Define Metadata Requirements 1 day? Mon 1/1/07 Mon 1/1/07
25 2.5.1 Establish Inventory of Technical Metadata 1 day? Mon 1/1/07 Mon 1/1/07
26 2.5.2 Review Metadata Sourcing Requirements 1 day? Mon 1/1/07 Mon 1/1/07
27 2.5.3 Assess Technical Strategies and Policies 1 day? Mon 1/1/07 Mon 1/1/07
28 2.6 Determine Technical Readiness 1 day? Mon 1/1/07 Mon 1/1/07
29 2.7 Determine Regulatory Requirements 1 day? Mon 1/1/07 Mon 1/1/07

[Gantt chart legend and timeline omitted. Project file: Velocity_Project_Plan.mpp; plan date: Mon 7/28/08.]
30 2.8 Perform Data Quality Audit 1 day? Mon 1/1/07 Mon 1/1/07
31 2.8.1 Perform Data Analysis of Source Data 1 day? Mon 1/1/07 Mon 1/1/07
32 2.8.2. Report Analysis Results to the Business 1 day? Mon 1/1/07 Mon 1/1/07
33 3 Architect 1 day? Mon 1/1/07 Mon 1/1/07
34 3.1 Develop Solution Architecture 1 day? Mon 1/1/07 Mon 1/1/07
35 3.1.1 Define Technical Requirements 1 day? Mon 1/1/07 Mon 1/1/07
36 3.1.2 Develop Architecture Logical View 1 day? Mon 1/1/07 Mon 1/1/07
37 3.1.3 Develop Configuration Recommendations 1 day? Mon 1/1/07 Mon 1/1/07
38 3.1.4 Develop Architecture Physical View 1 day? Mon 1/1/07 Mon 1/1/07
39 3.1.5 Estimate Volume Requirements 1 day? Mon 1/1/07 Mon 1/1/07
40 3.2 Design Development Architecture 1 day? Mon 1/1/07 Mon 1/1/07
41 3.2.1 Develop Quality Assurance Strategy 1 day? Mon 1/1/07 Mon 1/1/07
42 3.2.2 Define Development Environments 1 day? Mon 1/1/07 Mon 1/1/07
43 3.2.3 Develop Change Control Procedures 1 day? Mon 1/1/07 Mon 1/1/07
44 3.2.4 Determine Metadata Strategy 1 day? Mon 1/1/07 Mon 1/1/07
45 3.2.5 Develop Change Management Process 1 day? Mon 1/1/07 Mon 1/1/07
46 3.3 Implement Technical Architecture 1 day? Mon 1/1/07 Mon 1/1/07
47 3.3.1 Procure Hardware and Software 1 day? Mon 1/1/07 Mon 1/1/07
48 3.3.2 Install/Configure Software 1 day? Mon 1/1/07 Mon 1/1/07
49 4 Design 1 day? Mon 1/1/07 Mon 1/1/07
50 4.1 Develop Data Model(s) 1 day? Mon 1/1/07 Mon 1/1/07
51 4.1.1 Develop Enterprise Data Warehouse Model 1 day? Mon 1/1/07 Mon 1/1/07
52 4.1.2 Develop Data Mart Model(s) 1 day? Mon 1/1/07 Mon 1/1/07
53 4.2 Analyze Data Sources 1 day? Mon 1/1/07 Mon 1/1/07
54 4.2.1 Develop Source to Target Relationships 1 day? Mon 1/1/07 Mon 1/1/07
55 4.2.2 Determine Source Availability 1 day? Mon 1/1/07 Mon 1/1/07
56 4.3 Design Physical Database 1 day? Mon 1/1/07 Mon 1/1/07
57 4.3.1 Develop Physical Database Design 1 day? Mon 1/1/07 Mon 1/1/07
58 4.4 Design Presentation Layer 1 day? Mon 1/1/07 Mon 1/1/07

59 4.4.1 Design Presentation Layer Prototype 1 day? Mon 1/1/07 Mon 1/1/07
60 4.4.2 Present Prototype to Business Analysts 1 day? Mon 1/1/07 Mon 1/1/07
61 4.4.3 Develop Presentation Layout Design 1 day? Mon 1/1/07 Mon 1/1/07
62 5 Build 1 day? Mon 1/1/07 Mon 1/1/07
63 5.1 Launch Build Phase 1 day? Mon 1/1/07 Mon 1/1/07
64 5.1.1 Review Project Scope and Plan 1 day? Mon 1/1/07 Mon 1/1/07
65 5.1.2 Review Physical Model 1 day? Mon 1/1/07 Mon 1/1/07
66 5.1.3 Define Defect Tracking Process 1 day? Mon 1/1/07 Mon 1/1/07
67 5.2 Implement Physical Database 1 day? Mon 1/1/07 Mon 1/1/07
68 5.3 Design and Build Data Quality Process 1 day? Mon 1/1/07 Mon 1/1/07
69 5.3.1 Design Data Quality Technical Rules 1 day? Mon 1/1/07 Mon 1/1/07
70 5.3.2 Determine Dictionary and Reference Data Requirements 1 day? Mon 1/1/07 Mon 1/1/07
71 5.3.3 Design and Execute Data Enhancement Processes 1 day? Mon 1/1/07 Mon 1/1/07
72 5.3.4 Design Run-time and Real-Time Processes for Operate Phase Execution 1 day? Mon 1/1/07 Mon 1/1/07
73 5.3.5 Develop Inventory of Data Quality Processes 1 day? Mon 1/1/07 Mon 1/1/07
74 5.3.6 Review and Package Data Transformation Specification Processes and Documents 1 day? Mon 1/1/07 Mon 1/1/07
75 5.4 Design and Develop Data Integration Processes 1 day? Mon 1/1/07 Mon 1/1/07
76 5.4.1 Design High Level Load Process 1 day? Mon 1/1/07 Mon 1/1/07
77 5.4.2 Develop Error Handling Strategy 1 day? Mon 1/1/07 Mon 1/1/07
78 5.4.3 Plan Restartability Process 1 day? Mon 1/1/07 Mon 1/1/07
79 5.4.4 Develop Inventory of Mappings & Reusable Objects 1 day? Mon 1/1/07 Mon 1/1/07
80 5.4.5 Design Individual Mappings & Reusable Objects 1 day? Mon 1/1/07 Mon 1/1/07
81 5.4.6 Build Mappings & Reusable Objects 1 day? Mon 1/1/07 Mon 1/1/07
82 5.4.7 Perform Unit Test 1 day? Mon 1/1/07 Mon 1/1/07
83 5.4.8 Conduct Peer Reviews 1 day? Mon 1/1/07 Mon 1/1/07
84 5.5 Populate and Validate Database 1 day? Mon 1/1/07 Mon 1/1/07
85 5.5.1 Build Load Process 1 day? Mon 1/1/07 Mon 1/1/07
86 5.5.2 Perform Integrated ETL Testing 1 day? Mon 1/1/07 Mon 1/1/07
87 5.6 Build Presentation Layer 1 day? Mon 1/1/07 Mon 1/1/07

88 5.6.1 Develop Presentation Layer 1 day? Mon 1/1/07 Mon 1/1/07
89 5.6.2 Demonstrate Presentation Layer to Business Analysts 1 day? Mon 1/1/07 Mon 1/1/07
90 6 Test 1 day? Mon 1/1/07 Mon 1/1/07
91 6.1 Define Overall Test Strategy 1 day? Mon 1/1/07 Mon 1/1/07
92 6.1.1 Define Test Data Strategy 1 day? Mon 1/1/07 Mon 1/1/07
93 6.1.2 Define Unit Test Plan 1 day? Mon 1/1/07 Mon 1/1/07
94 6.1.3 Define System Test Plan 1 day? Mon 1/1/07 Mon 1/1/07
95 6.1.4 Define User Acceptance Test Plan 1 day? Mon 1/1/07 Mon 1/1/07
96 6.1.5 Define Test Scenarios 1 day? Mon 1/1/07 Mon 1/1/07
97 6.1.6 Build/Maintain Test Source Data Set 1 day? Mon 1/1/07 Mon 1/1/07
98 6.2 Prepare for Testing Process 1 day? Mon 1/1/07 Mon 1/1/07
99 6.2.1 Prepare Environments 1 day? Mon 1/1/07 Mon 1/1/07
100 6.2.2 Prepare Defect Management Processes 1 day? Mon 1/1/07 Mon 1/1/07
101 6.3 Execute System Test 1 day? Mon 1/1/07 Mon 1/1/07
102 6.3.1 Prepare for System Test 1 day? Mon 1/1/07 Mon 1/1/07
103 6.3.2 Execute Complete System Test 1 day? Mon 1/1/07 Mon 1/1/07
104 6.3.3 Perform Data Validation 1 day? Mon 1/1/07 Mon 1/1/07
105 6.3.4 Conduct Disaster Recovery Testing 1 day? Mon 1/1/07 Mon 1/1/07
106 6.3.5 Conduct Volume Testing 1 day? Mon 1/1/07 Mon 1/1/07
107 6.4 Conduct User Acceptance Testing 1 day? Mon 1/1/07 Mon 1/1/07
108 6.5 Tune System Performance 1 day? Mon 1/1/07 Mon 1/1/07
109 6.5.1 Benchmark 1 day? Mon 1/1/07 Mon 1/1/07
110 6.5.2 Identify Areas for Improvement 1 day? Mon 1/1/07 Mon 1/1/07
111 6.5.3 Tune Data Integration Performance 1 day? Mon 1/1/07 Mon 1/1/07
112 6.5.4 Tune Reporting Performance 1 day? Mon 1/1/07 Mon 1/1/07
113 7 Deploy 1 day? Mon 1/1/07 Mon 1/1/07
114 7.1 Plan Deployment 1 day? Mon 1/1/07 Mon 1/1/07
115 7.1.1 Plan User Training 1 day? Mon 1/1/07 Mon 1/1/07
116 7.1.2 Plan Metadata Documentation and Rollout 1 day? Mon 1/1/07 Mon 1/1/07

117 7.1.3 Plan User Documentation Rollout 1 day? Mon 1/1/07 Mon 1/1/07
118 7.1.4 Develop Punch List 1 day? Mon 1/1/07 Mon 1/1/07
119 7.1.5 Develop Communication Plan 1 day? Mon 1/1/07 Mon 1/1/07
120 7.1.6 Develop Run Book 1 day? Mon 1/1/07 Mon 1/1/07
121 7.2 Deploy Solution 1 day? Mon 1/1/07 Mon 1/1/07
122 7.2.1 Train Users 1 day? Mon 1/1/07 Mon 1/1/07
123 7.2.2 Migrate Development to Production 1 day? Mon 1/1/07 Mon 1/1/07
124 7.2.3 Package Documentation 1 day? Mon 1/1/07 Mon 1/1/07
125 8 Operate 1 day? Mon 1/1/07 Mon 1/1/07
126 8.1 Define Production Support Procedures 1 day? Mon 1/1/07 Mon 1/1/07
127 8.1.1 Develop Operations Manual 1 day? Mon 1/1/07 Mon 1/1/07
128 8.2 Operate Solution 1 day? Mon 1/1/07 Mon 1/1/07
129 8.2.1 Execute First Production Run 1 day? Mon 1/1/07 Mon 1/1/07
130 8.2.2 Monitor Load Volume 1 day? Mon 1/1/07 Mon 1/1/07
131 8.2.3 Monitor Load Processes 1 day? Mon 1/1/07 Mon 1/1/07
132 8.2.4 Track Change Control Requests 1 day? Mon 1/1/07 Mon 1/1/07
133 8.2.5 Monitor Usage 1 day? Mon 1/1/07 Mon 1/1/07
134 8.2.6 Monitor Data Quality 1 day? Mon 1/1/07 Mon 1/1/07
135 8.3 Maintain and Upgrade Environment 1 day? Mon 1/1/07 Mon 1/1/07
136 8.3.1 Maintain Repository 1 day? Mon 1/1/07 Mon 1/1/07
137 8.3.2 Upgrade Software 1 day? Mon 1/1/07 Mon 1/1/07



VELOCITY
SAMPLE DELIVERABLE

Project Roadmap
DOCUMENT AUTHOR:
DOCUMENT OWNER:
DATE CREATED:
LAST UPDATED:
PROJECT:
COMPANY:
PROJECT ROADMAP

TIMELINE

Q1 2001: Sales Analysis v.1 – all priority 1 Functional Requirements
Q3 2001: Sales Analysis v.2 – all priority 2 and 3 Functional Requirements
Q4 2001: Full Bookings, Billings and Backlog Data Mart

INCREMENT DESCRIPTIONS
The initial project roadmap above is based on the priorities set forth in the Functional Requirements Specification
document, together with estimated feasibility and time-to-deliver. The timeline is an approximate, high-level plan
for this project.

INCREMENT 1: SALES ANALYSIS V.1

<Describe increment functionality>

INCREMENT 2: SALES ANALYSIS V.2

<Describe increment functionality>

INCREMENT 3: FULL BOOKINGS, BILLINGS AND BACKLOG DATA MART

<Describe increment functionality>



Project Role Matrix

Key:
P = Primary participant(s)
S = Secondary participant(s)
R = Review only
A = Approve

[Column headings: project role names, printed vertically in the original matrix.]
1 Manage P P S S S S A P A P P
1.1 Define Project P P P
1.1.1 Establish Business Project Scope P R P
1.1.2 Build Business Case P S
1.1.3 Assess Centralized Resources P
1.2 Plan and Manage Project P S S S A P P
1.2.1 Establish Project Roles P A P
1.2.2 Develop Project Estimate P S S S S A S S
1.2.3 Develop Project Plan P A S
1.2.4 Manage Project P R P
1.3 Perform Project Close P A A A A
2 Analyze P P P P P P P P P P S P P P P
2.1 Define Business Drivers, Objectives and Goals P R R
2.2 Define Business Requirements S P S P A P A A
2.2.1 Define Business Rules and Definitions S P A P A
2.2.2 Establish Data Stewardship S P S A
2.3 Define Business Scope P P P P P P S P S P P
2.3.1 Identify Source Data Systems P P P P P
2.3.2 Determine Sourcing Feasibility P P P P P
2.3.3 Determine Target Requirements S P P S S P P
2.3.4 Determine Business Process Data Flows P P P P P
2.3.5 Build Roadmap for Incremental Delivery S P P P S S P P
2.4 Define Functional Requirements P R
2.5 Define Metadata Requirements R P P P P P P P
2.5.1 Establish Inventory of Technical Metadata R P P P P
2.5.2 Review Metadata Sourcing Requirements P R P R
2.5.3 Assess Technical Strategies and Policies R P P P P
2.6 Determine Technical Readiness P P P P
2.7 Determine Regulatory Requirements P R P P
2.8 Perform Data Quality Audit S S P P S
2.8.1 Perform Data Quality Analysis of Source Data S P S
2.8.2 Report Analysis Results to the Business S S P P S
3 Architect P P P S P R P P S A P P S P P P
3.1 Develop Solution Architecture P P P R P P P R
3.1.1 Define Technical Requirements P S P R
3.1.2 Develop Architecture Logical View S P
3.1.3 Develop Configuration Recommendations S P


3.1.4 Develop Architecture Physical View A P P R


3.1.5 Estimate Volume Requirements P P P P S
3.2 Design Development Architecture P S S P P S R P P S P P P
3.2.1 Develop Quality Assurance Strategy P S P
3.2.2 Define Development Environments P P P P R
3.2.3 Develop Change Control Procedures S S S A S S P
3.2.4 Determine Metadata Strategy P P A
3.2.5 Develop Change Management Process P R P
3.3 Implement Technical Architecture S A P P P P
3.3.1 Procure Hardware and Software S A S S P P
3.3.2 Install/Configure Software P P P R R
4 Design P P P P P P P P R
4.1 Develop Data Model(s) S P P R
4.1.1 Develop Enterprise Data Warehouse Model S P R
4.1.2 Develop Data Mart Model(s) S P R
4.2 Analyze Data Sources P P P P S P R
4.2.1 Develop Source to Target Relationships S P P P R
4.2.2 Determine Source Availability P S S P R
4.3 Design Physical Database S P P R R
4.3.1 Develop Physical Database Design S P P R R
4.4 Design Presentation Layer P P
4.4.1 Design Presentation Layer Prototype P P
4.4.2 Present Prototype to Business Analysts P P
4.4.3 Develop Presentation Layout Design S P
5 Build P S P P P S S P P A P S S P P
5.1 Launch Build Phase S P S S S P P R R P P
5.1.1 Review Project Scope and Plan R R R R R R R P
5.1.2 Review Physical Model S P S S S P R R R
5.1.3 Define Defect Tracking Process R R R R P R R P P
5.2 Implement Physical Database S R P S S
5.3 Design and Build Data Quality Process P S S P S A
5.3.1 Design Data Quality Technical Rules P P S
5.3.2 Determine Dictionary and Reference Data Requirements S S P S
5.3.3 Design and Execute Data Enhancement Processes P A
5.3.4 Design Run-time and Real-time Processes for Operate Phase Execution R P R
5.3.5 Develop Inventory of Data Quality Processes P


5.3.6 Review and Package Data Transformation Specification Processes and Documents S P R
5.4 Design and Develop Data Integration Processes S P S P P R
5.4.1 Design High Level Load Process R P S P A R
5.4.2 Develop Error Handling Strategy R P S A R
5.4.3 Plan Restartability Process P S A R
5.4.4 Develop Inventory of Mappings & Reusable Objects P
5.4.5 Design Individual Mappings & Reusable Objects S P
5.4.6 Build Mappings & Reusable Objects P S
5.4.7 Perform Unit Test R P
5.4.8 Conduct Peer Reviews S P
5.5 Populate and Validate Database S P R A
5.5.1 Build Load Process R P R
5.5.2 Perform Integrated ETL Testing S P R A
5.6 Build Presentation Layer P P A R
5.6.1 Develop Presentation Layer P
5.6.2 Demonstrate Presentation Layer to Business Analysts P P A R
6 Test P P P P P P P R P P P P S P P
6.1 Define Overall Test Strategy P P P P A A
6.1.1 Define Test Data Strategy P S P P A A P
6.1.2 Define Unit Test Plan S P P R
6.1.3 Define System Test Plan R P
6.1.4 Define User Acceptance Test Plan S P A P
6.1.5 Define Test Scenarios S P A P
6.1.6 Build/Maintain Test Source Data Set P P
6.2 Prepare for Testing Process S P S P P P P
6.2.1 Prepare Environments S P S P P P
6.2.2 Prepare Defect Management Processes P P
6.3 Execute System Test P P P P S S R R S P R P
6.3.1 Prepare for System Test S S S P
6.3.2 Execute Complete System Test S S R R S R R R R P
6.3.3 Perform Data Validation P R S R R R P
6.3.4 Conduct Disaster Recovery Testing S P P S R S P P
6.3.5 Conduct Volume Testing S P S S P
6.4 Conduct User Acceptance Testing P P P
6.5 Tune System Performance P P P P P R P P P R P
6.5.1 Benchmark P P P P P P P P


6.5.2 Identify Areas for Improvement P P P P P P P P


6.5.3 Tune Data Integration Performance P P P R P P R P
6.5.4 Tune Reporting Performance P S P R P P R P
7 Deploy P P S P P P S P P A A P P S R
7.1 Plan Deployment S R S P S P P R S S
7.1.1 Plan User Training S
7.1.2 Plan Metadata Documentation and Rollout P P S R
7.1.3 Plan User Documentation Rollout R R
7.1.4 Develop Punch List S S S P R S S
7.1.5 Develop Communication Plan S S S P R S S
7.1.6 Develop Run Book S S S P R S S
7.2 Deploy Solution P P S P P P P A A P P S A
7.2.1 Train Users P P P S P R
7.2.2 Migrate Development to Production P P A A P P A
7.2.3 Package Documentation A P S P S S P A P R
8 Operate P S P S P S P P P R
8.1 Define Production Support Procedures S P R
8.1.1 Develop Operations Manual S P R
8.2 Operate Solution P P S P S P R P P R
8.2.1 Execute First Production Run P P P P R
8.2.2 Monitor Load Volume S P
8.2.3 Monitor Load Processes S P
8.2.4 Track Change Control Requests P P
8.2.5 Monitor Usage P S P P R R A R
8.2.6 Monitor Data Quality P S
8.3 Maintain and Upgrade Environment P P S
8.3.1 Maintain Repository S P S
8.3.2 Upgrade Software S P S

4 of 4 Informatica Velocity - Sample Deliverable


VELOCITY
SAMPLE DELIVERABLE

Prototype Feedback
DOCUMENT AUTHOR:
DOCUMENT OWNER:
DATE CREATED:
LAST UPDATED:
PROJECT:
COMPANY:
PROTOTYPE FEEDBACK

ADMINISTRATIVE INFORMATION
This section should record the date and time of the meeting and a list of attendees.

INTRODUCTION
This section should describe the overall effort, including the business objectives of the end product. It should
explicitly describe which parts of the prototype are demonstrated for feedback and those that are not included in the
feedback.

END USER REQUIREMENTS


This section should describe the end-user requirements defined in the requirements gathering task.

It is important to restate the requirements here so that the final product can be measured against the end users' requests. This
makes it possible to determine whether the final product is likely to satisfy the users' needs.

FUNCTIONAL REQUIREMENTS
This section should specify the functional requirements as defined by the end users. Functional requirements can
include such capabilities as: drill down and up, alert functionality, and access via the web (including both dynamic
and static reporting), as well as information display capabilities such as graphs, bar charts, pie charts, etc.

DATA REQUIREMENTS
This section should specify the data requirements defined by the end users. Data requirements include specific
information requests such as inventory amounts, revenues, organizational data, locations, personnel, etc. If desired,
data requirements can be further broken down by fact data and dimension data.

HARDWARE/SOFTWARE REQUIREMENTS
This section should contain the specific hardware and software requirements necessary to install and run the end
user application. Examples of requirements include minimum memory requirements, software requirements such as
ODBC drivers, database connections, web browsers, or any other required software.

DESCRIPTION OF INTERFACES
This section should contain a detailed description of the interfaces that end users will use to access the data,
including how the interfaces are organized (e.g., by subject area, organizational constraints, etc.). It should
also provide screen captures of the actual system to be developed.

2 of 3 Informatica Velocity – Sample Deliverable


PROTOTYPE FEEDBACK

PREDEFINED REPORTS
This section should contain a detailed description of each predefined report planned for development. Each report
description should include the report name and all of the attributes, including any attributes calculated by the end-user
analysis application. The description should also detail how the information will be presented (e.g., tabular, bar graph,
line chart, pie chart, etc.).

USER FEEDBACK
This section should describe the results of the user review of the prototype. Any comments or suggestions that
impact the design/implementation should be noted here, and may result in a Scope Change Assessment.

Informatica Velocity – Sample Deliverable 3 of 3


VELOCITY
SAMPLE DELIVERABLE

Restartability Matrix
DOCUMENT AUTHOR:
DOCUMENT OWNER:
DATE CREATED:
LAST UPDATED:
PROJECT:
COMPANY:
RESTARTABILITY MATRIX

PURPOSE
To identify issues that may have an impact on the data integration team’s ability to restart or recover a failed session
and maintain the integrity of data in the data warehouse.

Issue | Steps to Mitigate Impact on Restartability | Party Responsible for Ensuring Steps are Completed | Notes

Data in source table changes frequently | Append source data with a date stamp, and store a snapshot of source data in a backup schema until the session has completed successfully | Database Administrator (creates backup schema in repository); Data Integration Developer (ensures that session calls backup schema when session recovery is performed) | Backup schema created on xx/xx/xxxx

Mappings in certain sessions are dependent on data produced by mappings in other sessions | Arrange sessions in a workflow; configure sessions to run only if previous sessions are completed successfully | Data Integration Developer |

Session uses the Bulk Loading parameter | If sessions fail frequently due to external problems (e.g., network downtime), reconfigure the session to normal load. Bulk loading bypasses the database log, making the session unrecoverable | Data Integration Developer |

Only the Informatica Administrator can recover or restart sessions | Configure the session to send an email to the Informatica administrator when a session fails | Data Integration Developer |

Multiple sessions within a workflow fail | Work with the database administrator to determine when failed sessions should be recovered, and when targets should be truncated and the entire session run again | Data Integration Developer; Database Administrator |
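The decision logic implied by these mitigations can also be captured in operational tooling. The following is a minimal Python sketch, under assumed session attributes and thresholds (it is not an Informatica API), of how a support process might choose between recovering a failed session and truncating the target for a full rerun.

# Illustrative sketch only -- the session attributes and thresholds below are
# assumptions, not part of any Informatica API.
from dataclasses import dataclass

@dataclass
class FailedSession:
    name: str
    uses_bulk_load: bool      # bulk loads bypass the database log and are not recoverable
    failure_count: int        # recent failures, e.g., due to network downtime
    rows_committed: int       # rows already written before the failure

def restart_action(session: FailedSession, failure_threshold: int = 3) -> str:
    """Return a recommended action for a failed session."""
    if session.uses_bulk_load:
        # A bulk-loaded target cannot be recovered from the database log; truncate and
        # rerun, and consider switching to normal load if the session fails frequently.
        if session.failure_count >= failure_threshold:
            return "switch session to normal load, truncate target, rerun"
        return "truncate target and rerun full session"
    if session.rows_committed > 0:
        # A recoverable session can be restarted from the last commit point.
        return "recover session from last commit"
    return "restart session from the beginning"

if __name__ == "__main__":
    s = FailedSession("s_load_orders", uses_bulk_load=True,
                      failure_count=4, rows_committed=250_000)
    print(f"{s.name}: {restart_action(s)}")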

2 of 2 Informatica Velocity – Sample Deliverable


VELOCITY
SAMPLE DELIVERABLE

Scope Change Assessment


DOCUMENT AUTHOR:
DOCUMENT OWNER:
DATE CREATED:
LAST UPDATED:
PROJECT:
COMPANY:
SCOPE CHANGE ASSESSMENT

ISSUE OR REQUIREMENT
<Description of issue to be resolved or new requirement>

PROPOSED RESOLUTION
<Description of approach and reasoning (and alternatives if applicable)>

IMPACT TO PLAN
<Estimated change to schedule, budget, staffing, or other costs>

ASSUMPTIONS AND RISKS


<Assumptions involved in this approach and the associated risks. Risks might include the impact on the project
as a whole>

2 of 2 Informatica Velocity – Sample Deliverable


SOURCE AVAILABILITY MATRIX
System | AM: 12 1 2 3 4 5 6 7 8 9 10 11 | PM: 12 1 2 3 4 5 6 7 8 9 10 11
ERP
Accounting
Manufacturing
Asia Accounting
HR
Europe Manufacturing

= In Use
= Available for Extraction

1 of 2 Weekday Schedule Informatica Velocity - Sample Deliverable


SOURCE AVAILABILITY MATRIX
System | AM: 12 1 2 3 4 5 6 7 8 9 10 11 | PM: 12 1 2 3 4 5 6 7 8 9 10 11
ERP
Accounting
Manufacturing
Asia Accounting
HR
Europe Manufacturing

= In Use
= Available for Extraction
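As a hedged illustration of how the availability windows above might be encoded for scheduling checks, the sketch below represents each system's extraction window as a set of hours; the system names and hours are placeholders, not values read from the matrices.

# Hypothetical encoding of a source availability matrix; the hours below are
# placeholders and should be replaced with the windows agreed with each source owner.
EXTRACTION_WINDOWS = {
    "ERP Accounting":       set(range(20, 24)) | set(range(0, 5)),  # 8 PM through 4 AM
    "Asia Accounting":      set(range(13, 18)),                     # 1 PM through 5 PM
    "HR":                   set(range(0, 24)),                      # always available
    "Europe Manufacturing": set(range(2, 7)),                       # 2 AM through 6 AM
}

def can_extract(system: str, hour: int) -> bool:
    """Return True if the given system is available for extraction at this hour (0-23)."""
    return hour in EXTRACTION_WINDOWS.get(system, set())

if __name__ == "__main__":
    for system in EXTRACTION_WINDOWS:
        print(system, "available at 03:00?", can_extract(system, 3))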

2 of 2 Weekend Schedule Informatica Velocity - Sample Deliverable


VELOCITY
SAMPLE DELIVERABLE

System Test Plan - System Name
DOCUMENT AUTHOR:
DOCUMENT OWNER:
DATE CREATED:
LAST UPDATED:
PROJECT:
COMPANY:
SYSTEM TEST PLAN – SYSTEM NAME

TABLE OF CONTENTS
1.0 SCOPE ...................................................................................................................................................... 3
1.1 INTRODUCTION ...................................................................................................................................... 3
1.2 PURPOSE .............................................................................................................................................. 3
1.3 LIMITATIONS .......................................................................................................................................... 3
1.4 ROLES, RESPONSIBILITIES AND SUPPORT AGENCIES ................................................................................ 3
1.4.1 POINTS OF CONTACT ...................................................................................................................... 3
1.4.2 TESTING ORGANIZATION DIAGRAM ................................................................................................... 3
1.4.3 SUPPORT SYSTEMS ........................................................................................................................ 3
1.5 SYSTEM OVERVIEW ................................................................................................................................ 3
1.6 SYSTEM CONFIGURATION ....................................................................................................................... 3
1.6.1 DATA SOURCES .............................................................................................................................. 3
1.6.2 DATA WAREHOUSE ......................................................................................................................... 3
1.6.3 DATA STORES/TARGETS .................................................................................................................. 4
1.6.4 DATA MODELING ............................................................................................................................. 4
1.6.5 DATA INTEGRATION ADMINISTRATION ............................................................................................... 4
1.6.6 DATA STAGING ............................................................................................................................... 4
1.6.7 CLIENT INTERFACE ......................................................................................................................... 4
1.6.8 NETWORK ...................................................................................................................................... 4
1.6.9 SYSTEM OVERVIEW DIAGRAM .......................................................................................................... 4
1.7 RELATIONSHIP TO OTHER PLANS ............................................................................................................. 4
2.0 REFERENCES ............................................................................................................................................. 4
3.0 SYSTEM TEST ENVIRONMENT ...................................................................................................................... 5
3.1 TEST AREAS .......................................................................................................................................... 5
3.1.1 ENVIRONMENT (HARDWARE AND SOFTWARE) ................................................................................... 5
3.1.2 ENVIRONMENT (HARDWARE AND SOFTWARE) ................................................................................... 5
3.1.3 OTHER MATERIALS ......................................................................................................................... 6
3.1.4 PROPRIETARY NATURE, ACQUIRER'S RIGHTS AND LICENSING............................................................. 6
3.1.5 INSTALLATION, TESTING AND CONTROL ............................................................................................ 6
3.1.6 TEST PERSONNEL ........................................................................................................................... 6
4.0 TEST IDENTIFICATION ................................................................................................................................. 6
4.1 TEST CASE DESCRIPTION........................................................................................................................ 7
4.1.1 TEST CASE OBJECTIVE .................................................................................................................... 7
4.1.2 TEST LEVELS .................................................................................................................................. 7
4.1.3 TEST TYPES ................................................................................................................................... 7
4.1.4 CRITICAL TECHNICAL PARAMETERS (CTPS)....................................................................................... 7
4.1.5 TEST CONDITION REQUIREMENTS (TCRS) ......................................................................................... 7
4.1.6 TEST EXECUTION AND PROGRESSION .............................................................................................. 7
4.1.7 TEST SCHEDULE ............................................................................................................................. 7
4.2 PHASED TESTING BREAKDOWN DIAGRAM ................................................................................................. 7
5.0 DATA RECORDING, ANALYSIS, AND REPORTING ............................................................................................ 8
5.1 RECORDING........................................................................................................................................... 8
5.2 ANALYSIS .............................................................................................................................................. 8
5.3 REPORTING ........................................................................................................................................... 8

2 of 8 Informatica Velocity – Sample Deliverable


SYSTEM TEST PLAN – SYSTEM NAME

1.0 SCOPE
1.1 INTRODUCTION
[Discuss the System Test Plan, its strategy, and intended use.]

1.2 PURPOSE
[Describe the purpose of this document.]

1.3 LIMITATIONS
[Include any disclaimers or issues that can make this test plan less effective.]

1.4 ROLES, RESPONSIBILITIES AND SUPPORT AGENCIES


1.4.1 POINTS OF CONTACT

[Include all points of contact involved in any phase of the testing process. Also include the group or affiliation of each
POC (e.g., DBA, developer, etc.).]

1.4.2 TESTING ORGANIZATION DIAGRAM

[Provide the organization chart of the testing organization. If an outside testing group is used, provide information
and contacts within that group.]

1.4.3 SUPPORT SYSTEMS

[Discuss any other organizations that will be supporting the testing effort. Include contacts and organization charts
where necessary.]

1.5 SYSTEM OVERVIEW


[Provide an overview of the system under test. What is the purpose of this system? How frequently will the jobs be
run? How will the resulting data be used?]

1.6 SYSTEM CONFIGURATION


1.6.1 DATA SOURCES

[Describe the data sources used within the system.]

1.6.2 DATA WAREHOUSE

[Briefly describe the data warehouse, if applicable.]

Informatica Velocity – Sample Deliverable 3 of 8


SYSTEM TEST PLAN – SYSTEM NAME

1.6.3 DATA STORES/TARGETS

[Briefly describe any additional data stores or targets, if applicable.]

1.6.4 DATA MODELING

[Briefly describe the data modeling tool used. If desired, include a reference to the data model of the system under
test.]

1.6.5 DATA INTEGRATION ADMINISTRATION

[Describe how the data integration processes will be monitored.]

1.6.6 DATA STAGING

[Describe how and where source, intermediate, and/or target files will be stored. If possible, include a reference to
any documentation that describes the process in greater detail.]

1.6.7 CLIENT INTERFACE

[Describe the client interface used to view and/or query the resulting data.]

1.6.8 NETWORK

[Describe the network architecture of the system under test. If possible, discuss the implication of the battery of tests
on the network.]

1.6.9 SYSTEM OVERVIEW DIAGRAM

[Provide a diagram of the overall system architecture and data flow.]

1.7 RELATIONSHIP TO OTHER PLANS


[List any other test plans or other documentation that may be used during the execution of this test plan.]

2.0 REFERENCES
[List any documents referenced within this test plan.]
Document Name Source Date

4 of 8 Informatica Velocity – Sample Deliverable


SYSTEM TEST PLAN – SYSTEM NAME

3.0 SYSTEM TEST ENVIRONMENT


3.1 TEST AREAS
[Describe the venue(s) where testing will occur.]

3.1.1 ENVIRONMENT (HARDWARE AND SOFTWARE)

[List all hardware and software under test. Also provide a point-of-contact (POC) for each item, indicating who will be
available to support the testing effort.]
Software | Version | Purpose | POC

Hardware Component | Purpose | POC(s)

3.1.2 ENVIRONMENT (HARDWARE AND SOFTWARE)

[Describe ALL hardware and software in the system.]


Software Version
Software

Compilers

Application
Database

Interfaces

COTS

Hardware
Component
Server

Informatica Velocity – Sample Deliverable 5 of 8


SYSTEM TEST PLAN – SYSTEM NAME

Hard Drive
PCs

Printers

UPS

Tape Drive
Switch

Controller

Drivers

Communications

3.1.3 OTHER MATERIALS

[Describe all other materials including manuals, installation software, documentation and procedures that will be
supplied on an as needed basis.]

3.1.4 PROPRIETARY NATURE, ACQUIRER'S RIGHTS AND LICENSING

[Describe any licensing agreements if necessary.]

3.1.5 INSTALLATION, TESTING AND CONTROL

[Describe who will install, maintain, and control the testing environment.]

3.1.6 TEST PERSONNEL

[List the testing personnel involved in this testing effort.]

4.0 TEST IDENTIFICATION


[Describe how the individual test cases will be documented and executed.]

6 of 8 Informatica Velocity – Sample Deliverable


SYSTEM TEST PLAN – SYSTEM NAME

4.1 TEST CASE DESCRIPTION


[Provide information on the test case description to be included in each test case.]

4.1.1 TEST CASE OBJECTIVE

[Summarize the reasoning for establishing the specific test case, answering in the simplest possible terms the
question: “Why are we doing this?”]

4.1.2 TEST LEVELS

[Describe the various testing levels that will be performed within the test plan. Examples of levels are hardware,
software, data, application, etc.]

4.1.3 TEST TYPES

[Describe the types of testing that will be performed. Discuss whether the testing will consist of black box testing, white box
testing, or a combination of both. (Black box testing exercises functionality without knowledge of the internal structure,
while white box testing examines the internal logic and structure of each level.)]

4.1.4 CRITICAL TECHNICAL PARAMETERS (CTPS)

[Each CTP will define specific functional units that will be tested. This should include any specific data items,
component, or functional parts that will be tested. ]

4.1.5 TEST CONDITION REQUIREMENTS (TCRS)

[TCR scripts will be developed to satisfy all identified CTPs. Personnel identified by the Technical System Manager
will develop these TCRs. All TCRs will be assigned a numeric designation and will include the test objective, list of
any prerequisites, test steps, actual results, expected results and identification of tester, the current date, and the
current iteration of the test. ]

4.1.6 TEST EXECUTION AND PROGRESSION

[Provide a set of control procedures for executing a test such as special conditions and processes for returning a
TCR to a technical developer in the event that it fails.]

4.1.7 TEST SCHEDULE

[Provide the schedule for executing the test plan.]

4.2 PHASED TESTING BREAKDOWN DIAGRAM


[Provide a diagram for the testing process.]

Informatica Velocity – Sample Deliverable 7 of 8


SYSTEM TEST PLAN – SYSTEM NAME

5.0 DATA RECORDING, ANALYSIS, AND REPORTING


5.1 RECORDING
[Describe how test results will be recorded.]

5.2 ANALYSIS
[Describe how the test results will be gathered, reviewed, and analyzed. In addition, describe who will receive the
results of this analysis.]

5.3 REPORTING
[Describe the test case summary report provided for each test case. Provide detail about the items that will be included
in this test case summary.]

8 of 8 Informatica Velocity – Sample Deliverable


TARGET-SOURCE MATRIX

Warehouse Table: <Table Name>


Reference # | Target Attribute | Target Datatype | Default Value | Source Database | Operating System | Source File / Table | Mapped to Field Name | Source Datatype | Data Provider | Source Data Description | Comments/Detailed Transformation Rules | Date Changed (if not in orig. load)
1

10

11

12

13

14

15
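One way to keep this specification machine-readable is to model each matrix row as a small record. The sketch below uses a Python dataclass whose fields follow the column headers above; the example values are invented.

# Sketch: one Target-Source Matrix row as a dataclass; example values are invented.
from dataclasses import dataclass
from typing import Optional

@dataclass
class TargetSourceRow:
    reference_no: int
    target_attribute: str
    target_datatype: str
    default_value: Optional[str]
    source_database: str
    operating_system: str
    source_file_or_table: str
    mapped_field_name: str
    source_datatype: str
    data_provider: str
    source_data_description: str
    transformation_rules: str
    date_changed: Optional[str] = None   # populated only if not in the original load

row = TargetSourceRow(
    reference_no=1,
    target_attribute="CUSTOMER_NAME",
    target_datatype="VARCHAR(100)",
    default_value=None,
    source_database="CRMPROD",
    operating_system="Linux",
    source_file_or_table="CUSTOMER_MASTER",
    mapped_field_name="CUST_NM",
    source_datatype="VARCHAR2(100)",
    data_provider="CRM team",
    source_data_description="Legal customer name",
    transformation_rules="Trim whitespace; title-case",
)
print(row.target_attribute, "<-", row.mapped_field_name)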

1 of 1 Informatica Velocity - Sample Deliverable


VELOCITY
SAMPLE DELIVERABLE

Data Governance
Technology Evaluation Checklist
DOCUMENT AUTHOR:
DOCUMENT OWNER:
DATE CREATED:
LAST UPDATED:
PROJECT:
COMPANY:
TECHNOLOGY EVALUATION CHECKLIST

IT organizations can use this evaluation criteria checklist to ensure that the data integration platform that
they have (or select) offers the comprehensive set of capabilities that are required for a robust data
governance program. It is critical that a unified platform supply these capabilities to ensure consistency,
reuse, and uniform process and policy controls.

1. DATA ACCESSIBILITY
The platform should ensure that all enterprise data can be accessed, regardless of its source or structure.

□ Pre-built connectivity. Does the platform have pre-built connectivity to a wide variety of
systems, including multiple mainframe formats, messaging systems, and numerous applications?
□ Input/output data validation. Does the platform validate input/output data?
□ Event logging. Does the platform provide failed session statistics, error messages, metadata
statistics and lineage that help assess the exceptions and failures related to accessing data?
□ Federated access. Does the platform provide both physical and virtual/federated access to data
in one common tool?
□ Cross-firewall access. Does the platform support secure, high performance data movement
across firewalls?
□ Supported data types. Is the platform able to access the following data types with one tool while
leveraging common metadata: mainframe data; structured data; unstructured data (e.g., Microsoft
Word documents and Excel spreadsheets); XML and EDI data; relational data; application data;
and message queue data?

2. DATA AVAILABILITY
The platform should ensure that data is available to users and applications—when, where, and how
needed.

□ Throughput. Does the platform make it easy to configure multiple performance enhancement
options including pipelining, dynamic partitioning, and smart parallelism?
□ Scalability. Does the platform take advantage of 64-bit, thread-based parallel processing and
grid deployment for near-linear scalability?
□ Automatic failover and recovery. Does the platform feature automatic failover and recovery
capabilities? Does it provide a graphical status on the grid, as well as other key indicators/alerts?
□ High availability. Does the platform enable you to easily configure high availability? Does it
include built-in resiliency, failover and recovery? Does it support multi-node/grid deployment?
□ Volume and timing. Does the platform allow data volumes and latencies (e.g., large volume
batch vs. message-based real-time) to be configured to meet business needs, without any
recoding?
□ Breadth of delivery protocols. Can the platform be easily configured to deliver data via different
protocols and methods, including loading physical databases for SQL-based access, creating
virtual data views (EII), publishing to a message bus or queue, and publishing Web Services?
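To make the throughput and partitioning questions above concrete, here is a minimal, platform-neutral Python sketch of splitting a workload into partitions and processing them in parallel; it illustrates the general concept only and is not the platform's own partitioning feature.

# Generic illustration of partitioned, parallel processing -- not an Informatica feature.
from concurrent.futures import ThreadPoolExecutor

def partition(rows, num_partitions):
    """Split a list of rows into roughly equal partitions."""
    return [rows[i::num_partitions] for i in range(num_partitions)]

def load_partition(part):
    """Stand-in for loading one partition into a target; returns rows processed."""
    return len(part)

if __name__ == "__main__":
    rows = list(range(1_000_000))            # stand-in for extracted source rows
    parts = partition(rows, num_partitions=4)
    with ThreadPoolExecutor(max_workers=4) as pool:
        loaded = sum(pool.map(load_partition, parts))
    print(f"loaded {loaded} rows across {len(parts)} partitions")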

2 of 4 Informatica Velocity – Sample Deliverable


TECHNOLOGY EVALUATION CHECKLIST

3. DATA QUALITY
The platform should ensure the accuracy and validity of data.

□ Profiling. Does the platform include tools to automatically profile data sources to understand the
data and flag potential issues? Is that tool integrated with the rest of the data quality and data
integration platform?
□ Monitoring and measurement. Does the platform enable you to establish key data quality
metrics, monitor them on an ongoing basis, and receive alerts on items that fall out of acceptable
ranges?
□ Cleansing and remediation. Does the platform allow you to define business rules to address
data quality issues on an automated basis? Does it provide historical statistics, which help root
cause analysis on data quality issues, including accuracy, completeness, conformity, consistency,
referential integrity, and duplication?
□ Breadth of data. Does the platform include data quality capabilities that address all key data
types—customer, product/service, financial, employee, etc.— not just a single data type such as
customer contact information?
□ Ease of use. Does the platform provide an easy-to-use interface to enable both business users
(e.g., business analysts and data stewards) and IT users to visualize and address data quality
issues?
□ Integrated metadata. Does the platform automatically capture the metadata from your data
quality processes? Is the metadata seamlessly incorporated as part of the overall data integration
lifecycle?
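As a simple illustration of the monitoring-and-measurement idea above, the sketch below computes a few basic data quality metrics (completeness, conformity, duplication) over in-memory records and flags any metric that falls below an assumed threshold; the field names, sample data, and threshold are illustrative.

# Minimal data quality metric sketch; field names, sample data, and threshold are assumptions.
import re

RECORDS = [
    {"customer_id": "C001", "email": "a@example.com"},
    {"customer_id": "C002", "email": ""},
    {"customer_id": "C002", "email": "not-an-email"},
]

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def completeness(records, field):
    """Share of records with a non-empty value for the field."""
    return sum(1 for r in records if r.get(field)) / len(records)

def conformity(records, field, pattern):
    """Share of non-empty values matching the expected pattern."""
    values = [r[field] for r in records if r.get(field)]
    return sum(1 for v in values if pattern.match(v)) / len(values) if values else 1.0

def duplication(records, key):
    """Share of records whose key value appears more than once."""
    counts = {}
    for r in records:
        counts[r[key]] = counts.get(r[key], 0) + 1
    return sum(1 for r in records if counts[r[key]] > 1) / len(records)

if __name__ == "__main__":
    threshold = 0.95   # illustrative acceptable-quality threshold
    metrics = {
        "email completeness": completeness(RECORDS, "email"),
        "email conformity": conformity(RECORDS, "email", EMAIL_RE),
        "customer_id uniqueness": 1 - duplication(RECORDS, "customer_id"),
    }
    for name, value in metrics.items():
        flag = "OK" if value >= threshold else "ALERT"
        print(f"{name}: {value:.2f} [{flag}]")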

4. DATA CONSISTENCY
The platform should ensure that the value, structure, and meaning of data is consistent and reconciled
across systems, processes, and organizations.

□ Validation. Does the platform provide an integrated design and mapping tool that automatically
validates the data model on-the-fly?
□ Transformation. Does the platform feature robust transformation capabilities that address not
only syntactic issues, but also structural and semantic variances across different systems?
□ Logical design and workflow. Does the platform capture all design and business rules at the
logical level as metadata, abstracting them from the physical layer?
□ Reusability. Are you able to capture all data integration and data quality logic and workflows as
metadata via one platform? Does the platform enable sharing at both local and global levels?
□ Cataloging. Does the platform allow you to easily search, filter, define, and modify data
dictionaries and business rules?
□ Data synchronization. Does the platform easily interoperate with enterprise application
integration (EAI) and messaging technologies to help synchronize the context and meaning of
data, as well as data values, across operational systems?
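To illustrate capturing transformation logic as reusable, logical-level rules rather than hard-coding it per system, the sketch below defines rules as metadata and applies them uniformly to records; the rule names and record fields are assumptions made for the example.

# Sketch of transformation rules captured as metadata and reused across feeds;
# rule names and record fields are illustrative assumptions.
RULES = [
    {"field": "country", "rule": "uppercase"},
    {"field": "amount",  "rule": "to_float"},
]

TRANSFORMS = {
    "uppercase": lambda v: str(v).strip().upper(),
    "to_float":  lambda v: float(v),
}

def apply_rules(record, rules=RULES):
    """Apply each metadata-driven rule to its field, returning a new record."""
    out = dict(record)
    for r in rules:
        out[r["field"]] = TRANSFORMS[r["rule"]](out[r["field"]])
    return out

if __name__ == "__main__":
    print(apply_rules({"country": " us ", "amount": "1200.50"}))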

Informatica Velocity – Sample Deliverable 3 of 4


TECHNOLOGY EVALUATION CHECKLIST

5. DATA AUDITABILITY
The platform should ensure that there is an audit trail on the data and that internal controls have been
appropriately implemented.

□ Lineage. Can the platform provide a visual lineage of data across multiple systems and
applications, including both backward and forward tracking? Does it provide drill-down
capabilities?
□ Impact Assessment. Can the platform automatically assess the impact of changes across
applications and systems? Does it provide reports on port details, metadata extensions and
usage, and mapping dependencies across connected systems?
□ Workflow. Does the platform include robust workflow orchestration capabilities including support
for grid deployments and global, cross-team collaboration?
□ Dashboard. Does the platform provide a dashboard with a high-level summary of workflows,
processes and status? Does it include the ability to easily drill down into the details?
□ Testing. Does the platform include an integrated test environment that not only detects mapping
and session errors, but also helps identify the root causes of invalid mapping and session errors?
□ Version control. Does the platform feature robust, granular version management and
deployment capabilities?
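As a simple illustration of backward/forward lineage and impact assessment, the sketch below models data flow as a directed graph of hypothetical system names and walks it in both directions; it is not the platform's lineage feature.

# Toy lineage graph; system names are hypothetical.
LINEAGE = {                         # edge: source -> downstream targets
    "crm.customers":   ["stage.customers"],
    "erp.orders":      ["stage.orders"],
    "stage.customers": ["dw.dim_customer"],
    "stage.orders":    ["dw.fact_orders"],
    "dw.dim_customer": ["report.customer_360"],
    "dw.fact_orders":  ["report.customer_360"],
}

def downstream(node, graph=LINEAGE):
    """Forward lineage / impact: everything fed (directly or indirectly) by node."""
    found, stack = set(), [node]
    while stack:
        for child in graph.get(stack.pop(), []):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found

def upstream(node, graph=LINEAGE):
    """Backward lineage: everything that (directly or indirectly) feeds node."""
    reverse = {}
    for src, targets in graph.items():
        for t in targets:
            reverse.setdefault(t, []).append(src)
    return downstream(node, reverse)

if __name__ == "__main__":
    print("Impact of stage.customers:", downstream("stage.customers"))
    print("Lineage of report.customer_360:", upstream("report.customer_360"))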

6. DATA SECURITY
The platform should ensure secure access to the data.

□ Data classification. Does the platform support an enterprise-wide strategy on information
classification, enabling data to be easily grouped and classified?
□ Segregation of duty. Does the platform enable granular segregation of duties, as well as
reporting on the interdependencies between different tasks?
□ Privilege management. Does the platform have robust privilege management capabilities for
managing and reporting on granular privileges such as copy object, maintain labels and change
object status?
□ Client authentication. Can the platform integrate tightly with Lightweight Directory Access
Protocol (LDAP)? Does it have a repository to track real-time authentication status?
□ Web Services security. Does the platform support the latest Web Services security standards at
both the message and transport layers for authentication, encryption and authorization?
□ Encryption. Can the platform support data encryption? Does it enable secure synchronization
across firewalls, leveraging built-in encryption and compression capabilities?

4 of 4 Informatica Velocity – Sample Deliverable


VELOCITY
SAMPLE DELIVERABLE

Test Case List


DOCUMENT AUTHOR:
DOCUMENT OWNER:
DATE CREATED:
LAST UPDATED:
PROJECT:
COMPANY:
TEST CASE LIST

SCHEDULE

# | Test Case | TCD Location | Start Date | End Date | Technical Contact | Functional Contact
1

10

11

12

13

14

15

16

17

18

19

20

2 of 2 Informatica Velocity – Sample Deliverable


VELOCITY
SAMPLE DELIVERABLE

Test Condition Results


DOCUMENT AUTHOR:
DOCUMENT OWNER:
DATE CREATED:
LAST UPDATED:
PROJECT:
COMPANY:
TEST CONDITION RESULTS

# | Test Case Description | Total TCs | TCs Passed | Total % Passed
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15

Total TCs Passed out of Total TCs


Total % System TCs Passed

2 of 2 Informatica Velocity – Sample Deliverable


VELOCITY
SAMPLE DELIVERABLE

Unit Test Plan


DOCUMENT AUTHOR:
DOCUMENT OWNER:
DATE CREATED:
LAST UPDATED:
PROJECT:
COMPANY:
UNIT TEST PLAN

Tester Name: Test Date:

Transformation Session Name:


Version:
Developer:

Source Tables:
Target Tables:

TEST CASE DEFINITION


Creation Date: Test Case ID:

System/Subsystem:

System Function:

Business Scenario:

Test Case Description:

Expected Results:

Error Code:
Actual Occurrences:
Expected Occurrences:

2 of 5 Informatica Velocity – Sample Deliverable


UNIT TEST PLAN

MAPPING OVERVIEW
<Description of the function(s) performed by the mapping.>

SOURCE – TARGET TRANSFORMATION

FILTERS

Transformation: FILTER NAME Pass/Fail


[Enter specific tests for the transformation object.]
Transformation: FILTER NAME Pass/Fail

LOOKUPS

Transformation: LOOKUP NAME Pass/Fail


[Enter specific tests for the transformation object.]

EXPRESSIONS

Transformation: EXPRESSION NAME Pass/Fail


[Enter specific tests for the transformation object.]

Transformation: EXPRESSION NAME Pass/Fail

[OTHER TRANSFORMATION OBJECTS]

Transformation: OBJECT NAME Pass/Fail


[Enter specific tests for the transformation object.]
Transformation: OBJECT NAME Pass/Fail

Informatica Velocity – Sample Deliverable 3 of 5


UNIT TEST PLAN

ERROR HANDLING

Does the error record count in the error table/file(s) match the number of errors that exist in the source table?
Is the .bad file free of error messages?
Was the error table built correctly?

Error Cases
• Refer to the expected results section.
• Were the error cases caught?
• Were the error records written to the appropriate file/table?
• Was the record layout for the error record correct?
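The error-handling checks above lend themselves to automation. Below is a hedged, pytest-style sketch that compares an error-table row count with the number of invalid source rows and verifies that the reject (.bad) file is empty; the validation rule, table names, and file name are placeholders.

# Illustrative unit-test sketch (pytest style); table and file names are placeholders.
import sqlite3

def is_valid_amount(value):
    """Validation rule assumed for this example: amount must parse as a number."""
    try:
        float(value)
        return True
    except ValueError:
        return False

def test_error_table_matches_bad_source_rows(tmp_path):
    # Stand-in database: a source table with two invalid rows and the error table
    # that the load process is expected to populate for them.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE src (id INTEGER, amount TEXT)")
    conn.executemany("INSERT INTO src VALUES (?, ?)",
                     [(1, "10.0"), (2, "bad"), (3, "oops")])
    conn.execute("CREATE TABLE err (id INTEGER, reason TEXT)")
    conn.executemany("INSERT INTO err VALUES (?, ?)",
                     [(2, "not numeric"), (3, "not numeric")])

    bad_source_rows = sum(
        1 for (_, amount) in conn.execute("SELECT id, amount FROM src")
        if not is_valid_amount(amount)
    )
    err_rows = conn.execute("SELECT COUNT(*) FROM err").fetchone()[0]
    assert err_rows == bad_source_rows

    # The session's reject (.bad) file should be free of error records.
    bad_file = tmp_path / "s_load_orders.bad"
    bad_file.write_text("")                       # placeholder for the real reject file
    assert bad_file.read_text() == ""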

PRE/POST SESSION COMMAND


Pre Session
[Enter specific tests for the pre-session script.]

Post Session (All commands are executed only if the session is successful)
[Enter specific tests for the post-session script.]

NOTE: ADD ADDITIONAL TESTS IF NECESSARY

LOAD STATISTICS

Is the load time acceptable? Load Time: # Files: # Rows: Date:

Is the load time acceptable? Load Time: # Files: # Rows: Date:

4 of 5 Informatica Velocity – Sample Deliverable


UNIT TEST PLAN

SOURCE RECORD(S) TESTED

Column Name Record 1 Record 2 Record 3 Record 4


<Column 1>
<Column 2>
<Column 3>

Informatica Velocity – Sample Deliverable 5 of 5


Work Breakdown Structure
Phase/Task/Subtask
1 Manage
1.1 Define Project
1.1.1 Establish Business Project Scope
1.1.2 Build Business Case
1.1.3 Assess Centralized Resources
1.2 Plan and Manage Project
1.2.1 Establish Project Roles
1.2.2 Develop Project Estimate
1.2.3 Develop Project Plan
1.2.4 Manage Project
1.3 Perform Project Close

2 Analyze
2.1 Define Business Drivers, Objectives and Goals
2.2 Define Business Requirements
2.2.1 Define Business Rules and Definitions
2.2.2 Establish Data Stewardship
2.3 Define Business Scope
2.3.1 Identify Source Data Systems
2.3.2 Determine Sourcing Feasibility
2.3.3 Determine Target Requirements
2.3.4 Determine Business Process Data Flows
2.3.5 Build Roadmap for Incremental Delivery
2.4 Define Functional Requirements
2.5 Define Metadata Requirements
2.5.1 Establish Inventory of Technical Metadata
2.5.2 Review Metadata Sourcing Requirements
2.5.3 Assess Technical Strategies and Policies
2.6 Determine Technical Readiness
2.7 Determine Regulatory Requirements
2.8 Perform Data Quality Audit
2.8.1 Perform Data Quality Analysis of Source Data
2.8.2 Report Analysis Results to the Business

3 Architect
3.1 Develop Solution Architecture
3.1.1 Define Technical Requirements

1 of 4 Informatica Velocity -Sample Deliverable


Work Breakdown Structure
Phase/Task/Subtask
3.1.2 Develop Architecture Logical View
3.1.3 Develop Configuration Recommendations
3.1.4 Develop Architecture Physical View
3.1.5 Estimate Volume Requirements
3.2 Design Development Architecture
3.2.1 Develop Quality Assurance Strategy
3.2.2 Define Development Environments
3.2.3 Develop Change Control Procedures
3.2.4 Determine Metadata Strategy
3.2.5 Develop Change Management Process
3.3 Implement Technical Architecture
3.3.1 Procure Hardware and Software
3.3.2 Install/Configure Software

4 Design
5 Build
5.1 Launch Build Phase
5.1.1 Review Project Scope and Plan
5.1.2 Review Physical Model
5.1.3 Define Defect Tracking Process
5.2 Implement Physical Database
5.3 Design and Build Data Quality Process
5.3.1 Design Data Quality Technical Rules
5.3.2 Determine Dictionary and Reference Data Requirements
5.3.3 Design and Execute Data Enhancement Processes
5.3.4 Design Run-time and Real-time Processes for Operate Phase Execution
5.3.5 Develop Inventory of Data Quality Processes
5.3.6 Review and Package Data Transformation Specification Processes and Documents
5.4 Design and Develop Data Integration Processes
5.4.1 Design High Level Load Process
5.4.2 Develop Error Handling Strategy
5.4.3 Plan Restartability Process
5.4.4 Develop Inventory of Mappings & Reusable Objects
5.4.5 Design Individual Mappings & Reusable Objects
5.4.6 Build Mappings & Reusable Objects
5.4.7 Perform Unit Test
5.4.8 Conduct Peer Reviews

2 of 4 Informatica Velocity -Sample Deliverable


Work Breakdown Structure
Phase/Task/Subtask
5.5 Populate and Validate Database
5.5.1 Build Load Process
5.5.2 Perform Integrated ETL Testing
5.6 Build Presentation Layer
5.6.1 Develop Presentation Layer
5.6.2 Demonstrate Presentation Layer to Business Analysts

6 Test
6.1 Define Overall Test Strategy
6.1.1 Define Test Data Strategy
6.1.2 Define Unit Test Plan
6.1.3 Define System Test Plan
6.1.4 Define User Acceptance Test Plan
6.1.5 Define Test Scenarios
6.1.6 Build/Maintain Test Source Data Set
6.2 Prepare for Testing Process
6.2.1 Prepare Environments
6.2.2 Prepare Defect Management Processes
6.3 Execute System Test
6.3.1 Prepare for System Test
6.3.2 Execute Complete System Test
6.3.3 Perform Data Validation
6.3.4 Conduct Disaster Recovery Testing
6.3.5 Conduct Volume Testing
6.4 Conduct User Acceptance Testing
6.5 Tune System Performance
6.5.1 Benchmark
6.5.2 Identify Areas for Improvement
6.5.3 Tune Data Integration Performance
6.5.4 Tune Reporting Performance

7 Deploy
7.1 Plan Deployment
7.1.1 Plan User Training
7.1.2 Plan Metadata Documentation and Rollout
7.1.3 Plan User Documentation Rollout
7.1.4 Develop Punch List

3 of 4 Informatica Velocity -Sample Deliverable


Work Breakdown Structure
Phase/Task/Subtask
7.1.5 Develop Communication Plan
7.1.6 Develop Run Book
7.2 Deploy Solution
7.2.1 Train Users
7.2.2 Migrate Development to Production
7.2.3 Package Documentation

8 Operate
8.1 Define Production Support Procedures
8.1.1 Develop Operations Manual
8.2 Operate Solution
8.2.1 Execute First Production Run
8.2.2 Monitor Load Volume
8.2.3 Monitor Load Processes
8.2.4 Track Change Control Requests
8.2.5 Monitor Usage
8.2.6 Monitor Data Quality
8.3 Maintain and Upgrade Environment
8.3.1 Maintain Repository
8.3.2 Upgrade Software

4 of 4 Informatica Velocity -Sample Deliverable
