What is a Data Warehouse?
According to Bill Inmon, the author of several well-known data warehouse books, "A data warehouse is a subject-oriented, integrated, time-variant, non-volatile collection of data in support of management's decision-making process."
Example: Over the years, the application designers in each branch have made their own decisions about how an application and its database should be built, so the source systems differ in naming conventions, units of measurement, encoding structures, and the physical attributes of data. Consider a bank with branches in several countries, millions of customers, and lines of business such as savings and loans. The following example shows how data is integrated from source systems into a target system.
In such an example, attribute names, column names, data types, and values differ entirely from one source system to another. This inconsistency can be avoided by integrating the data into a data warehouse under good standards. In the target data, attribute names, column names, and data types are consistent throughout the target system. This is how data from various source systems is integrated and accurately stored in the data warehouse.
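The kind of source-to-target standardization described above can be sketched in Python. All records, column names, and code mappings below are hypothetical; the point is only that each source system's conventions are mapped onto one agreed target standard:

```python
# Hypothetical records from two branch systems that encode the same
# customer attribute (gender) under different column names and codes.
source_a = [{"cust_name": "Alice", "sex": "F"}]
source_b = [{"CUSTOMER": "Bob", "gender_cd": "1"}]

# Standardization rules agreed for the target data warehouse.
GENDER_MAP = {"F": "FEMALE", "M": "MALE", "0": "FEMALE", "1": "MALE"}

def integrate(records, name_col, gender_col):
    """Map one source system's naming and encoding to the target standard."""
    return [
        {"customer_name": r[name_col], "gender": GENDER_MAP[r[gender_col]]}
        for r in records
    ]

target = integrate(source_a, "cust_name", "sex") + \
         integrate(source_b, "CUSTOMER", "gender_cd")
# Both sources now share one column naming and encoding standard.
```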
Figure 1.12 : Data Warehouse Architecture
1. Enterprise Data Warehouse - An enterprise data warehouse provides a central database for
decision support throughout the enterprise.
2. ODS (Operational Data Store) - This has a broad, enterprise-wide scope, but unlike the enterprise data warehouse, its data is refreshed in near real time and used for routine business activity.
3. Data Mart - A data mart is a subset of the data warehouse that supports a particular region, business unit, or business function.
Data warehouses and data marts are built on dimensional data modeling, in which fact tables are connected to dimension tables. This is convenient for users accessing the data, since the database can be visualized as a cube with several dimensions, and a data warehouse allows that cube to be sliced and diced along each of its dimensions.
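Slicing and dicing can be illustrated with a toy fact set. The data, dimension order, and function names below are hypothetical; a slice fixes one dimension to a single value, while a dice restricts several dimensions to pick out a sub-cube:

```python
# Toy fact data: each row is (product, location, year, sales_dollar).
facts = [
    ("TV",    "US", 2004, 250.0),
    ("TV",    "UK", 2004, 120.0),
    ("Radio", "US", 2004,  80.0),
    ("Radio", "UK", 2005,  60.0),
]

def slice_cube(rows, dim_index, value):
    """Slice: fix one dimension to a single value."""
    return [r for r in rows if r[dim_index] == value]

def dice_cube(rows, product_vals, location_vals):
    """Dice: pick a sub-cube by restricting several dimensions at once."""
    return [r for r in rows if r[0] in product_vals and r[1] in location_vals]

us_sales = slice_cube(facts, 1, "US")     # every row for the US location
tv_us = dice_cube(facts, {"TV"}, {"US"})  # the TV-in-US sub-cube
```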
Data Mart: A data mart is a subset of the data warehouse designed for a particular line of business, such as sales, marketing, or finance. In a dependent data mart, data is derived from an enterprise-wide data warehouse; in an independent data mart, data is collected directly from the sources.
General Information
In general, an organization is started to earn money by selling a product or by providing a service. An organization may operate from a single place or have several branches.
Consider the example of an organization selling products throughout the world. The four major dimensions are product, location, time, and organization. Dimension tables are explained in detail under the section Dimensions. Using this example, we will explain the STAR SCHEMA in detail.
• Identify the dimensions for the facts (product dimension, location dimension, time dimension, organization dimension).
• List the columns that describe each dimension (e.g., region name, branch name).
• In a star schema, a dimension table will not have any parent table.
• In a snowflake schema, a dimension table will have one or more parent tables.
• In a star schema, hierarchies for the dimensions are stored in the dimension table itself.
• In a snowflake schema, hierarchies are broken out into separate tables; these hierarchies help to drill down through the data from the topmost level to the lowermost level.
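The contrast in the bullets above can be sketched with hypothetical product-dimension rows. In the star form the hierarchy level (category) lives inside the dimension row itself; in the snowflake form it is broken out into a lookup table reached by a foreign key:

```python
# Star schema: the product hierarchy (category -> product) is kept
# denormalized inside the single PRODUCT dimension table.
star_product_dim = [
    {"product_id": 1, "product_name": "TV",    "category_name": "Electronics"},
    {"product_id": 2, "product_name": "Radio", "category_name": "Electronics"},
]

# Snowflake schema: the hierarchy level is broken out into its own
# lookup table and referenced by a foreign key.
snow_category_lookup = {10: {"category_name": "Electronics"}}
snow_product_dim = [
    {"product_id": 1, "product_name": "TV",    "category_id": 10},
    {"product_id": 2, "product_name": "Radio", "category_id": 10},
]

def category_of(product_row):
    """Resolving a category needs an extra join-style lookup in snowflake."""
    return snow_category_lookup[product_row["category_id"]]["category_name"]
```

The extra lookup is exactly the additional join a snowflake schema introduces at query time.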
Glossary:
Hierarchy
A logical structure that uses ordered levels as a means of organizing data. A hierarchy can be
used to define data aggregation; for example, in a time dimension, a hierarchy might be used to
aggregate data from the Month level to the Quarter level, from the Quarter level to the Year
level. A hierarchy can also be used to define a navigational drill path, regardless of whether the
levels in the hierarchy represent aggregated totals or not.
Level
A position in a hierarchy. For example, a time dimension might have a hierarchy that represents
data at the Month, Quarter, and Year levels.
Fact Table
A table in a star schema that contains facts and is connected to the dimension tables. A fact table typically
has two types of columns: those that contain facts and those that are foreign keys to dimension
tables. The primary key of a fact table is usually a composite key that is made up of all of its
foreign keys.
A fact table might contain either detail level facts or facts that have been aggregated (fact tables
that contain aggregated facts are often instead called summary tables). A fact table usually
contains facts with the same level of aggregation.
Example of Star Schema: Figure 1.6
In the example in figure 1.6, the sales fact table is connected to the location, product, time, and organization dimensions. Data can be sliced across any dimension and aggregated across multiple dimensions. "Sales Dollar" in the sales fact table can be calculated across all dimensions independently or in combination, for example:
• Sales Dollar value for a product in a year within a location, sold or serviced by an employee
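Aggregating a measure at different combinations of dimensions can be sketched as follows; the fact rows and the `rollup` helper are hypothetical, but the group-and-sum logic is what a star-schema query performs:

```python
from collections import defaultdict

# Toy sales fact rows: (product, location, year, employee, sales_dollar).
sales_fact = [
    ("TV", "US", 2004, "E1", 100.0),
    ("TV", "US", 2004, "E2", 150.0),
    ("TV", "UK", 2004, "E3",  50.0),
]

def rollup(rows, *dim_indexes):
    """Sum Sales Dollar at any chosen combination of dimensions."""
    totals = defaultdict(float)
    for row in rows:
        key = tuple(row[i] for i in dim_indexes)
        totals[key] += row[-1]   # last field is the measure
    return dict(totals)

by_product = rollup(sales_fact, 0)            # one dimension
by_product_location = rollup(sales_fact, 0, 1)  # combined dimensions
```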
Snowflake Schema
A snowflake schema is a star schema structure normalized through the use of outrigger tables; i.e., dimension table hierarchies are broken out into simpler tables. In the star schema example we had four dimensions (location, product, time, organization) and a fact table (sales). In the snowflake schema, the example diagram shown below has four dimension tables, four lookup tables, and one fact table, because the hierarchy levels (category, branch, state, and month) are broken out of the dimension tables (PRODUCT, ORGANIZATION, LOCATION, and TIME respectively) and stored separately. In OLAP, this snowflake approach increases the number of joins and degrades the performance of data retrieval. Some organizations normalize their dimension tables to save space; however, since dimension tables usually hold comparatively little data, the snowflake approach is often avoided.
Example of Snowflake Schema: Figure 1.7
Fact Table
The centralized table in a star schema is called the FACT table. A fact table typically has two types
of columns: those that contain facts and those that are foreign keys to dimension tables. The
primary key of a fact table is usually a composite key that is made up of all of its foreign keys.
In the example in fig 1.6, "Sales Dollar" is a fact (measure), and it can be added across several dimensions. Fact tables store different types of measures: additive, semi-additive, and non-additive.
Measure Types
• Additive - Measures that can be added across all dimensions.
• Semi Additive - Measures that can be added across some dimensions but not others.
• Non Additive - Measures that cannot be added across any dimension.
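A classic semi-additive measure is an account balance: it can be summed across the account dimension but not across time. A small sketch with hypothetical balance data:

```python
# (account, month) -> balance at month end. Summing balances across
# accounts for one month is valid; summing the same account across
# months is not (Jan + Feb is not a meaningful balance).
balances = {
    ("ACC1", "Jan"): 100.0,
    ("ACC1", "Feb"): 120.0,
    ("ACC2", "Jan"): 200.0,
    ("ACC2", "Feb"): 180.0,
}

def total_across_accounts(month):
    """Valid: add balances of all accounts for one month."""
    return sum(v for (acct, m), v in balances.items() if m == month)

def ending_balance(account, months):
    """Across time, take the last period's balance instead of summing."""
    return balances[(account, months[-1])]

jan_total = total_across_accounts("Jan")           # 100 + 200
acc1_feb = ending_balance("ACC1", ["Jan", "Feb"])  # last period only
```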
A fact table might contain either detail level facts or facts that have been aggregated (fact tables
that contain aggregated facts are often instead called summary tables).
In the real world, it is possible to have a fact table that contains no measures or facts. Such tables are called factless fact tables.
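A factless fact table records only that an event occurred, so analysis counts rows instead of summing a measure. The student-attendance rows below are a hypothetical illustration:

```python
from collections import Counter

# A factless fact table: each row is just foreign keys marking an event
# (a student attended a class on a date); there is no numeric measure.
attendance_fact = [
    {"student_id": 1, "class_id": 10, "date_id": 20040101},
    {"student_id": 2, "class_id": 10, "date_id": 20040101},
    {"student_id": 1, "class_id": 10, "date_id": 20040102},
]

# Analysis is done by counting rows rather than summing a fact column.
attendance_per_class_day = Counter(
    (row["class_id"], row["date_id"]) for row in attendance_fact
)
```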
Steps in designing Fact Table
• Identify the dimensions for the facts (product dimension, location dimension, time dimension, organization dimension).
• List the columns that describe each dimension (e.g., region name, branch name).
Example of a Fact Table with an Additive Measure in Star Schema: Figure 1.6
In the example in figure 1.6, the sales fact table is connected to the location, product, time, and organization dimensions. The measure "Sales Dollar" in the sales fact table can be added across all dimensions independently or in combination, for example:
• Sales Dollar value for a product in a year within a location sold or serviced by an employee
Database - RDBMS
There are a number of relational databases available to store data. A relational database contains normalized data stored in tables; tables contain records (rows) and columns. An RDBMS makes it easy to work with individual records: each row contains a unique instance of data for the categories defined by the columns.
RDBMSs are used very frequently in OLTP applications (e.g., ATM transactions), and sometimes a data warehouse may also use a relational database. Please refer to the section on relational data modeling for details on how data from a source system is normalized and stored in an RDBMS.
Popular RDBMS Databases
RDBMS Name Company Name
Oracle Oracle Corporation
IBM DB2 UDB IBM Corporation
IBM Informix IBM Corporation
Microsoft SQL Server Microsoft
Sybase Sybase Corporation
Teradata NCR
ETL tools extract, transform, and load data into the data warehouse for decision making. Before the evolution of ETL tools, this ETL process was done manually, using SQL code written by programmers. That task was tedious and cumbersome in many cases, since it involved many resources, complex coding, and many work hours; on top of that, maintaining the code posed a great challenge.
ETL tools eliminate these difficulties. Compared with the old method, they are very powerful and offer many advantages in all stages of the ETL process: extraction, data cleansing, data profiling, transformation, debugging, and loading into the data warehouse. A number of ETL tools are available in the market to carry out the ETL process according to business and technical requirements.
ETL Concepts
ETL stands for extraction, transformation, and loading: the methods involved in accessing and manipulating source data and loading it into the target database.
The first step in the ETL process is mapping the data between the source systems and the target database (data warehouse or data mart). The second step is cleansing the source data in the staging area. The third step is transforming the cleansed source data and loading it into the target system.
Note that ETT (extraction, transformation, transportation) and ETM (extraction, transformation,
move) are sometimes used instead of ETL.
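The three steps above can be sketched as a minimal Python pipeline. The source rows, column names, and cleansing rules are all hypothetical; a real ETL tool generalizes exactly this flow:

```python
def extract():
    """Pull raw rows from a source system (hard-coded here for the sketch)."""
    return [
        {"cust": " alice ", "amount": "250.50"},
        {"cust": "BOB",     "amount": None},   # anomaly to be cleansed out
    ]

def cleanse(rows):
    """Resolve inconsistencies and anomalies in the staging area."""
    cleaned = []
    for r in rows:
        if r["amount"] is None:                # drop rows missing the measure
            continue
        cleaned.append({"cust": r["cust"].strip().title(),
                        "amount": r["amount"]})
    return cleaned

def transform(rows):
    """Map cleansed source columns onto the target schema."""
    return [{"customer_name": r["cust"], "sales_dollar": float(r["amount"])}
            for r in rows]

def load(rows, target):
    """Load transformed rows into the target system (a plain list here)."""
    target.extend(rows)

warehouse = []
load(transform(cleanse(extract())), warehouse)
```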
Source System
A database, application, file, or other storage facility from which the data in a data warehouse is
derived.
Mapping
The definition of the relationship and data flow between source and target objects.
Metadata
Data that describes data and other structures, such as objects, business rules, and processes. For
example, the schema design of a data warehouse is typically stored in a repository as metadata,
which is used to generate scripts used to build and populate the data warehouse. A repository
contains metadata.
Staging Area
A place where data is processed before entering the warehouse.
Cleansing
The process of resolving inconsistencies and fixing the anomalies in source data, typically as part of
the ETL process.
Transformation
The process of manipulating data. Any manipulation beyond copying is a transformation. Examples
include cleansing, aggregating, and integrating data from multiple sources.
Transportation
The process of moving copied or transformed data from a source to a data warehouse.
Target System
A database, application, file, or other storage facility into which the transformed source data is loaded in a data warehouse.
Informatica is a powerful ETL tool from Informatica Corporation, a leading provider of enterprise data integration and ETL software.
The important Informatica Components are:
• Power Exchange
• Power Center
• Power Channel
• Metadata Exchange
• Power Analyzer
• Super Glue
In Informatica, all the metadata about source systems, target systems, and transformations is stored in the Informatica repository. Informatica's Power Center Client and Repository Server access this repository to store and retrieve metadata.
• Repository: This is where all the metadata information is stored in the Informatica suite. The
Power Center Client and the Repository Server would access this repository to retrieve, store and
manage metadata.
• Power Center Client: The Informatica client is used for managing users, identifying source and target system definitions, creating mappings and mapplets, creating sessions, running workflows, etc.
• Repository Server: This repository server takes care of all the connections between the
repository and the Power Center Client.
• Power Center Server: The Power Center Server performs the extraction from the sources and loads the data into the targets.
• Designer: Source Analyzer, Mapping Designer, and Warehouse Designer are tools that reside within the Designer wizard. Source Analyzer is used for extracting metadata from source systems.
Mapping Designer is used to create mappings between sources and targets; a mapping is a pictorial representation of the flow of data from source to target.
Warehouse Designer is used for extracting metadata from target systems; metadata can also be created in the Designer itself.
• Data Cleansing: PowerCenter's data cleansing technology improves data quality by validating, correctly naming, and standardizing address data. A person's address may not be the same in all source systems because of typos, or the postal code and city name may not match the rest of the address. Such errors can be corrected by the data cleansing process, and the standardized data can then be loaded into the target systems (data warehouse).
• Workflow Manager: A workflow helps to load data from source to target in the correct sequence. For example, if the fact tables were loaded before the lookup tables, the target system would raise an error because the fact table would violate foreign key validation. To avoid this, workflows are created to ensure the correct flow of data from source to target.
• Workflow Monitor: This monitor is helpful in monitoring and tracking the workflows created in
each Power Center Server.
• Power Center Connect: This component helps to extract data and metadata from ERP systems such as PeopleSoft, SAP, and Siebel, and from other third-party applications and messaging systems such as IBM's MQSeries.
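The load-order rule enforced by workflows above can be sketched as follows; the table names and rows are hypothetical, but the foreign-key check is the reason dimensions must be loaded before facts:

```python
# Target-side tables (plain Python structures standing in for the DBMS).
product_dim = {}   # product_id -> dimension row
fact_rows = []

def load_dimension(rows):
    """Load dimension rows first so their keys exist for the facts."""
    for r in rows:
        product_dim[r["product_id"]] = r

def load_fact(rows):
    """Reject fact rows whose foreign key has no matching dimension row."""
    for r in rows:
        if r["product_id"] not in product_dim:
            raise ValueError("foreign key violation: load dimensions first")
        fact_rows.append(r)

# Correct workflow order: dimension first, then fact.
load_dimension([{"product_id": 1, "product_name": "TV"}])
load_fact([{"product_id": 1, "sales_dollar": 100.0}])
```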
Informatica
Power Exchange:
Informatica PowerExchange, as a standalone service or along with Power Center, helps organizations leverage data by avoiding the manual coding of data extraction programs. PowerExchange supports batch, real-time, and changed data capture options for mainframe systems (DB2, VSAM, IMS, etc.), midrange systems (AS/400 DB2, etc.), relational databases (Oracle, SQL Server, DB2, etc.), and flat files on UNIX, Linux, and Windows systems.
Power Channel:
This helps to transfer large amounts of encrypted and compressed data over a LAN or WAN, through firewalls, to transfer files over FTP, etc.
Power Analyzer:
PowerAnalyzer provides organizations with reporting facilities. It makes accessing, analyzing, and sharing enterprise data simple and easily available to decision makers, enabling them to gain insight into business processes and develop business intelligence. With PowerAnalyzer, an organization can extract, filter, format, and analyze corporate information from data stored in a data warehouse, data mart, operational data store, or other data storage model. PowerAnalyzer works best with a dimensional data warehouse in a relational database, but it can also run reports on data in any relational table, even one that does not conform to the dimensional model.
Super Glue:
SuperGlue is used for loading metadata from several sources into a centralized place. Reports can be run against SuperGlue to analyze the metadata.
Power Mart:
Power Mart is a departmental version of Informatica for building, deploying, and managing data warehouses and data marts. Power Center is used for corporate enterprise data warehouses, while Power Mart is used for departmental data warehouses such as data marts. Power Center supports global and networked repositories and can be connected to several sources; Power Mart supports a single repository and can be connected to fewer sources than Power Center. Power Mart can grow into an enterprise implementation, and its codeless environment eases developer productivity.
Note: This is not a complete tutorial on Informatica; more tips and guidelines will be added in the near future. To learn more about Informatica, visit its official website, www.informatica.com.