You are on page 1of 18

__ sorting the tables in ascending or descending and

SORTER
aslo to obtain Distinct records.
RANK __ Top or bottom 'N' analysis .

__ Join two different sources cmng from different and


JOINER same location .
FILTER __ filters the rows that do not meet the condition.

ROUTER __ It is useful to test multiple conditions .

__ To perform group calculation such as count ,


AGGREGATOR max , min , sum , avg (mainly to perform
calculation or multiple rows or group)
__ Reads cobol files ( denormalized format).
NORMALIZER
Split a single row into multiple rows.
__ It performs many tasks such as override default
sql query , filtering records , join data from two or
SOURCE
more table etc
QUALIFIER
Represents the flatfile or relational data.

UNION __ It merges data from multiple sources similar to


the UNION ALL SQL statement to combine the
results from two or more SQL statements. Similar
to the UNION ALL statement, the Union
transformation does not remove duplicate rows.
__ You can use the Expression transformation to
EXPRESSION calculate values in a single row before you write
to the target.
__ Use a Lookup transformation in a mapping to look
LOOK UP up data in a flat file or a relational table, view, or
synonym.
__ stored procedures to automate tasks that are too
STORED complicated for standard SQL statements.You can
PROCEDURE call by using Stored Procedure Transformation.

__ When you add an XML source definition to a


XML SOURCE
mapping, you need to connect it to an XML Source
QUALIFIER
Qualifier transformation.
UPDATE __ To flag rows for insert, delete, update, or reject..
STRATEGY
ARCHITECTURE
Data Warehouse Architecture

ST
AR
SC
HE
MA

Star schema architecture is the simplest


data warehouse design. The main feature
of a star schema is a table at the center,
called the fact table and the dimension
tables which allow browsing of specific
categories, summarizing, drill-downs and
specifying criteria.
Typically, most of the fact tables in a star
schema are in database third normal form,
while dimensional tables are de-normalized
(second normal form).

Fact table
The fact table is not a typical relational
database table as it is de-normalized on
purpose - to enhance query response
times. The fact table typically contains
records that are ready to explore, usually
with ad hoc queries. Records in the fact
table are often referred to as events, due
to the time-variant nature of a data
warehouse environment.
The primary key for the fact table is a
composite of all the columns except
numeric values / scores (like QUANTITY,
TURNOVER, exact invoice date and time).
Typical fact tables in a global enterprise
data warehouse are (apart for those, there
may be some company or business specific
fact tables):

sales fact table - contains all details


regarding sales
orders fact table - in some cases the table
can be split into open orders and historical
orders. Sometimes the values for historical
orders are stored in a sales fact table.
budget fact table - usually grouped by
month and loaded once at the end of a
year.
forecast fact table - usually grouped by
month and loaded daily, weekly or
monthly.
inventory fact table - report stocks, usually
refreshed daily

Dimension table
Nearly all of the information in a typical
fact table is also present in one or more
dimension tables. The main purpose of
maintaining Dimension Tables is to allow
browsing the categories quickly and easily.
The primary keys of each of the dimension
tables are linked together to form the
composite primary key of the fact table. In
a star schema design, there is only one de-
normalized table for a given dimension.
Typical dimension tables in a data
warehouse are:

time dimension table


customers dimension table
products dimension table
key account managers (KAM) dimension
table
sales office dimension table
Star schema example
An example of a star schema architecture
is depicted below.

SNOWFLAKE SCHEMA

Snowflake schema architecture is a more complex variation of a star


schema design. The main difference is that dimensional tables in a
snowflake schema are normalized, so they have a typical relational
database design.
Snowflake schemas are generally used when a dimensional table becomes
very big and when a star schema can’t represent the complexity of a data
structure. For example if a PRODUCT dimension table contains millions of
rows, the use of snowflake schemas should significantly improve
performance by moving out some data to other table (with BRANDS for
instance).
The problem is that the more normalized the dimension table is, the more
complicated SQL joins must be issued to query them. This is because in
order for a query to be answered, many tables need to be joined and
aggregates generated.

An example of a snowflake schema architecture is depicted below.


GALAXY SCHEMA
For each star schema or snowflake schema it is possible to construct a
fact constellation schema.
This schema is more complex than star or snowflake architecture, which is
because it contains multiple fact tables. This allows dimension tables to be
shared amongst many fact tables.
That solution is very flexible, however it may be hard to manage and
support.
The main disadvantage of the fact constellation schema is a more
complicated design because many variants of aggregation must be
considered.
In a fact constellation schema, different fact tables are explicitly assigned
to the dimensions, which are for given facts relevant. This may be useful
in cases when some facts are associated with a given dimension level and
other facts with a deeper dimension level.

Use of that model should be reasonable when for example, there is a sales
fact table (with details down to the exact date and invoice header id) and
a fact table with sales forecast which is calculated based on month, client
id and product id.

In that case using two different fact tables on a different level of grouping
is realized through a fact constellation model.
Source System
A database, application, file, or other
storage facility from which the data in a
data warehouse is derived.
Mapping
The definition of the relationship and data
flow between source and target objects.
Meta data
Data that describes data and other
structures, such as objects, business rules,
and processes. For example, the schema
design of a data warehouse is typically
stored in a repository as meta data, which
is used to generate scripts used to build
and populate the data warehouse. A
repository contains meta data.
Staging Area
A place where data is processed before
entering the warehouse.
Cleansing
The process of resolving inconsistencies
and fixing the anomalies in source data,
typically as part of the ETL process.
Transformation
The process of manipulating data. Any
manipulation beyond copying is a
transformation. Examples include
cleansing, aggregating, and integrating
data from multiple sources.
Transportation
The process of moving copied or
transformed data from a source to a data
warehouse.
Target System
A database, application, file, or other
storage facility to which the "transformed
source data" is loaded in a data warehouse.
Figure 1.12 : Sample ETL Process Flow
Informatica
Informatica is a powerful ETL tool from
Informatica Corporation, a leading provider
of enterprise data integration software and
ETL softwares.
The important Informatica Components
are:
• Power Exchange
• Power Center
• Power Center Connect
• Power Channel
• Metadata Exchange
• Power Analyzer
• Super Glue
In Informatica, all the Metadata information
about source systems, target systems and
transformations are stored in the
Informatica repository. Informatica's Power
Center Client and Repository Server access
this repository to store and retrieve
metadata.
Note: To know more about Metadata and
its significance, please click here.
Source and Target
Consider a Bank that has got many
branches throughout the world. In each
branch data may be stored in different
source systems like oracle, sql server,
terradata, etc. When the Bank decides to
integrate its data from several sources for
its management decisions, it may choose
one or more systems like oracle, sql server,
terradata, etc. as its data warehouse
target. Many organisations prefer
Informatica to do that ETL process,
because Informatica is more powerful in
designing and building data warehouses. It
can connect to several sources and targets
to extract meta data from sources and
targets, transform and load the data into
target systems.
Guidelines to work with Informatica Power
Center
• Repository: This is where all the
metadata information is stored in the
Informatica suite. The Power Center
Client and the Repository Server
would access this repository to
retrieve, store and manage
metadata.
• Power Center Client: Informatica
client is used for managing users,
identifiying source and target
systems definitions, creating
mapping and mapplets, creating
sessions and run workflows etc.
• Repository Server: This repository
server takes care of all the
connections between the repository
and the Power Center Client.
• Power Center Server: Power Center
server does the extraction from
source and then loading data into
targets.
• Designer: Source Analyzer, Mapping
Designer and Warehouse Designer
are tools reside within the Designer
wizard. Source Analyzer is used for
extracting metadata from source
systems.
Mapping Designer is used to create
mapping between sources and targets.
Mapping is a pictorial representation about
the flow of data from source to target.
Warehouse Designer is used for extracting
metadata from target systems or metadata
can be created in the Designer itself.
• Data Cleansing: The PowerCenter's
data cleansing technology improves
data quality by validating, correctly
naming and standardization of
address data. A person's address
may not be same in all source
systems because of typos and postal
code, city name may not match with
address. These errors can be
corrected by using data cleansing
process and standardized data can
be loaded in target systems (data
warehouse).
• Transformation: Transformations help
to transform the source data
according to the requirements of
target system. Sorting, Filtering,
Aggregation, Joining are some of the
examples of transformation.
Transformations ensure the quality of
the data being loaded into target and
this is done during the mapping
process from source to target.
• Workflow Manager: Workflow helps to
load the data from source to target in
a sequential manner. For example, if
the fact tables are loaded before the
lookup tables, then the target system
will pop up an error message since
the fact table is violating the foreign
key validation. To avoid this,
workflows can be created to ensure
the correct flow of data from source
to target.
• Workflow Monitor: This monitor is
helpful in monitoring and tracking the
workflows created in each Power
Center Server.
• Power Center Connect: This
component helps to extract data and
metadata from ERP systems like
IBM's MQSeries, Peoplesoft, SAP,
Siebel etc. and other third party
applications.
• Power Center Exchange: This
component helps to extract data and
metadata from ERP systems like
IBM's MQSeries, Peoplesoft, SAP,
Siebel etc. and other third party
applications.
Informatica
Power Exchange
Informatica Power Exchange as a stand
alone service or along with Power Center,
helps organizations leverage data by
avoiding manual coding of data extraction
programs. Power Exchange supports batch,
real time and changed data capture
options in main frame(DB2, VSAM, IMS
etc.,), mid range (AS400 DB2 etc.,), and for
relational databases (oracle, sql server,
db2 etc) and flat files in unix, linux and
windows systems.
Power Channel
This helps to transfer large amount of
encrypted and compressed data over LAN,
WAN, through Firewalls, tranfer files over
FTP, etc.
Meta Data Exchange
Metadata Exchange enables organizations
to take advantage of the time and effort
already invested in defining data structures
within their IT environment when used with
Power Center. For example, an organization
may be using data modeling tools, such as
Erwin, Embarcadero, Oracle designer,
Sybase Power Designer etc for developing
data models. Functional and technical team
should have spent much time and effort in
creating the data model's data
structures(tables, columns, data types,
procedures, functions, triggers etc). By
using meta deta exchange, these data
structures can be imported into power
center to identifiy source and target
mappings which leverages time and effort.
There is no need for informatica developer
to create these data structures once again.
Power Analyzer
Power Analyzer provides organizations with
reporting facilities. PowerAnalyzer makes
accessing, analyzing, and sharing
enterprise data simple and easily available
to decision makers. PowerAnalyzer enables
to gain insight into business processes and
develop business intelligence.
With PowerAnalyzer, an organization can
extract, filter, format, and analyze
corporate information from data stored in a
data warehouse, data mart, operational
data store, or otherdata storage models.
PowerAnalyzer is best with a dimensional
data warehouse in a relational database. It
can also run reports on data in any table in
a relational database that do not conform
to the dimensional model.
Super Glue
Superglue is used for loading metadata in a
centralized place from several sources.
Reports can be run against this superglue
to analyze meta data.
Power Mart
Power Mart is a departmental version of
Informatica for building, deploying, and
managing data warehouses and data
marts. Power center is used for corporate
enterprise data warehouse and power mart
is used for departmental data warehouses
like data marts. Power Center supports
global repositories and networked
repositories and it can be connected to
several sources. Power Mart supports
single repository and it can be connected
to fewer sources when compared to Power
Center. Power Mart can extensibily grow to
an enterprise implementation and it is easy
for developer productivity through a
codeless environment.
Note:This is not a complete tutorial on
Informatica. We will add more Tips and
Guidelines on Informatica in near future.
Informatica - Transformations
In Informatica, Transformations help to
transform the source data according to the
requirements of target system and it
ensures the quality of the data being
loaded into target.
Transformations are of two types: Active
and Passive.
Active Transformation
An active transformation can change the
number of rows that pass through it from
source to target i.e it eliminates rows that
do not meet the condition in
transformation.
Passive Transformation
A passive transformation does not change
the number of rows that pass through it i.e
it passes all rows through the
transformation.
Transformations can be Connected or
UnConnected.
Connected Transformation
Connected transformation is connected to
other transformations or directly to target
table in the mapping.
UnConnected Transformation
An unconnected transformation is not
connected to other transformations in the
mapping. It is called within another
transformation, and returns a value to that
transformation.
List of Transformations
Following are the list of Transformations
available in
PowerCenter:
• Aggregator Transformation
• Expression Transformation
• Filter Transformation
• Joiner Transformation
• Lookup Transformation
• Normalizer Transformation
• Rank Transformation
• Router Transformation
• Sequence Generator Transformation
• Stored Procedure Transformation
• Sorter Transformation
• Update Strategy Transformation
• XML Source Qualifier Transformation
• Advanced External Procedure
Transformation
• External Transformation
• Union Transformation
In the following pages, we will explain all
the above Informatica Transformations and
their significances in the ETL process in
detail.
Aggregator Transformation
Aggregator transformation is an Active and
Connected transformation. This
transformation is useful to perform
calculations such as averages and sums
(mainly to perform calculations on multiple
rows or groups). For example, to calculate
total of daily sales or to calculate average
of monthly or yearly sales. Aggregate
functions such as AVG, FIRST, COUNT,
PERCENTILE, MAX, SUM etc. can be used in
aggregate transformation.
Expression Transformation
Expression transformation is a Passive and
Connected transformation. This can be
used to calculate values in a single row
before writing to the target. For example,
to calculate discount of each product or to
concatenate first and last names or to
convert date to a string field.
Filter Transformation
Filter transformation is an Active and
Connected transformation. This can be
used to filter rows in a mapping that do not
meet the condition. For example, to know
all the employees who are working in
Department 10 or to find out the products
that falls between the rate category $500
and $1000.
Joiner Transformation
Joiner Transformation is an Active and
Connected transformation. This can be
used to join two sources coming from two
different locations or from same location.
For example, to join a flat file and a
relational source or to join two flat files or
to join a relational source and a XML
source. In order to join two sources, there
must be at least one matching port. While
joining two sources it is a must to specify
one source as master and the other as
detail. The Joiner transformation supports
the following types of joins:
• Normal
• Master Outer
• Detail Outer
• Full Outer
Normal join discards all the rows of data
from the master and detail source that do
not match, based on the condition.
Master outer join discards all the
unmatched rows from the master source
and keeps all the rows from the detail
source and the matching rows from the
master source.
Detail outer join keeps all rows of data from
the master source and the matching rows
from the detail source. It discards the
unmatched rows from the detail source.
Full outer join keeps all rows of data from
both the master and detail sources.
Lookup Transformation
Lookup transformation is Passive and it can
be both Connected and UnConnected as
well. It is used to look up data in a
relational table, view, or synonym. Lookup
definition can be imported either from
source or from target tables.
For example, if we want to retrieve all the
sales of a product with an ID 10 and
assume that the sales data resides in
another table. Here instead of using the
sales table as one more source, use Lookup
transformation to lookup the data for the
product, with ID 10 in sales table.
Difference between Connected and
UnConnected Lookup Transformation:
Connected lookup receives input values
directly from mapping pipeline whereas
UnConnected lookup receives values from:
LKP expression from another
transformation.
Connected lookup returns multiple columns
from the same row whereas UnConnected
lookup has one return port and returns one
column from each row.
Connected lookup supports user-defined
default values whereas UnConnected
lookup does not support user defined
values..
Normalizer Transformation
Normalizer Transformation is an Active and
Connected transformation. It is used mainly
with COBOL sources where most of the
time data is stored in de-normalized
format. Also, Normalizer transformation
can be used to create multiple rows from a
single row of data.
Rank Transformation
Rank transformation is an Active and
Connected transformation. It is used to
select the top or bottom rank of data. For
example, to select top 10 Regions where
the sales volume was very high or to select
10 lowest priced products.
Router Transformation
Router is an Active and Connected
transformation. It is similar to filter
transformation. The only difference is, filter
transformation drops the data that do not
meet the condition whereas router has an
option to capture the data that do not meet
the condition. It is useful to test multiple
conditions. It has input, output and default
groups. For example, if we want to filter
data like where State=Michigan,
State=California, State=New York and all
other States. It’s easy to route data to
different tables..
Sequence Generator Transformation
Sequence Generator transformation is a
Passive and Connected transformation. It is
used to create unique primary key values
or cycle through a sequential range of
numbers or to replace missing keys.
It has two output ports to connect
transformations. By default it has two fields
CURRVAL and NEXTVAL(You cannot add
ports to this transformation). NEXTVAL port
generates a sequence of numbers by
connecting it to a transformation or target.
CURRVAL is the NEXTVAL value plus one or
NEXTVAL plus the Increment By value.
Stored Procedure Transformation
Stored Procedure transformation is a
Passive and Connected & UnConnected
transformation. It is useful to automate
time-consuming tasks and it is also used in
error handling, to drop and recreate
indexes and to determine the space in
database, a specialized calculation etc.
The stored procedure must exist in the
database before creating a Stored
Procedure transformation, and the stored
procedure can exist in a source, target, or
any database with a valid connection to the
Informatica Server. Stored Procedure is an
executable script with SQL statements and
control statements, user-defined variables
and conditional statements. In case of
stored procedure transformation procedure
will be compiled and executed in a
relational data source. You need data base
connection to import the stored procedure
in to your maping
Sorter Transformation
Sorter transformation is a Connected and
an Active transformation. It allows to sort
data either in ascending or descending
order according to a specified field. Also
used to configure for case-sensitive sorting,
and specify whether the output rows
should be distinct.
Source Qualifier Transformation
Source Qualifier transformation is an Active
and Connected transformation. When
adding a relational or a flat file source
definition to a mapping, it is must to
connect it to a Source Qualifier
transformation. The Source Qualifier
performs the various tasks such as
overriding default SQL query, filtering
records; join data from two or more tables
etc.
Update Strategy Transformation
Update strategy transformation is an Active
and Connected transformation. It is used to
update data in target table, either to
maintain history of data or recent changes.
You can specify how to treat source rows in
table, insert, update, delete or data driven.
XML Source Qualifier Transformation
XML Source Qualifier is a Passive and
Connected transformation. XML Source
Qualifier is used only with an XML source
definition. It represents the data elements
that the Informatica Server reads when it
executes a session with XML sources.
Advanced External Procedure
Transformation
Advanced External Procedure
transformation is an Active and Connected
transformation. It operates in conjunction
with procedures, which are created outside
of the Designer interface to extend
PowerCenter/PowerMart functionality. It is
useful in creating external transformation
applications, such as sorting and
aggregation, which require all input rows to
be processed before emitting any output
rows.
Union Transformation
The union transformation is used to merge
multiple datasets from various streams or
pipelines into one dataset. This
transformation works similar to the UNION
ALL, it does not remove any duplicate rows.
It is recommended to use aggregator to
remove duplicates are not expected at the
target.
External Procedure Transformation
External Procedure transformation is an
Active and Connected/UnConnected
transformations. Sometimes, the standard
transformations such as Expression
transformation may not provide the
functionality that you want. In such cases
External procedure is useful to develop
complex functions within a dynamic link
library (DLL) or UNIX shared library, instead
of creating the necessary Expression
transformations in a mapping.
Differences between Advanced External
Procedure and External Procedure
Transformations:
External Procedure returns single value,
where as Advanced External Procedure
returns multiple values.
External Procedure supports COM and
Informatica procedures where as AEP
supports only Informatica Procedures

You might also like