
DATA WAREHOUSING FUNDAMENTALS

Definition of Data Warehouse (Inmon): A data warehouse is a subject-oriented, integrated, non-volatile, and time-variant collection of data in support of management's decisions.
OR

The data warehouse is an informational environment that:

- Provides an integrated and total view of the enterprise
- Makes the enterprise's current and historical information easily available for decision making
- Makes decision-support transactions possible without hindering operational systems
- Renders the organization's information consistent
- Presents a flexible and interactive source of strategic information

OR

A copy of the transactional data specially structured for reporting and analysis

Organizations' Use of Data Warehousing

- Retail: customer loyalty, market planning
- Financial: risk management, fraud detection
- Manufacturing: cost reduction, logistics management
- Utilities: asset management, resource management
- Airlines: route profitability, yield management

Data Warehouse - Subject Oriented


- Organized around major subjects, such as Customer, Sales, and Account
- Focuses on the modeling and analysis of data for decision makers, not on daily operations or transaction processing
- Provides a simple and concise view around particular subject issues by excluding data that is not useful in the decision-support process

[Diagram: application-oriented operational systems (Customer Billing, Order Processing, Accounts Receivable) are reorganized in the data warehouse around subjects such as Customer Data, Account, Sales, and REG Data.]

Data Warehouse - Integrated


- Constructed by integrating multiple, heterogeneous data sources: relational or other databases, flat files, and external data
- Data cleaning and data integration techniques are applied
- Ensures consistency in naming conventions, encoding structures, attribute measures, etc. among the different data sources
- When data is moved to the warehouse, it is converted


[Diagram: operational systems for Savings Account, Loans Account, and Checking Account are integrated in the data warehouse under a single subject = Account.]

Data Warehouse - Non-Volatile


- A physically separate store of data, transformed from the operational environment
- Operational updates of data do not occur in the data warehouse environment
- Does not require transaction processing, recovery, or concurrency control mechanisms
- Requires only two operations: loading of data and access to data

[Diagram: operational systems (e.g., Order Processing) create, insert, update, and delete data, while the data warehouse only loads and provides access to Sales Data.]

Data Warehouse - Time Variant


- The time horizon for the data warehouse is significantly longer than that of operational systems
- Operational database: current-value data
- Data warehouse data: provides information from a historical perspective (e.g., the past 5-10 years)
- Every key structure in the data warehouse contains an element of time, whereas the key of operational data may or may not contain a time element

[Diagram: an operational Deposit System keeps Customer Data for 60-90 days, while the data warehouse keeps it for 5-10 years.]

Data Warehouse - OLTP vs. OLAP

OLTP (On-line Transaction Processing)

- holds current data
- useful for end users
- stores detailed data
- data is dynamic
- repetitive processing (one record processed at a time)
- high level of transaction throughput
- predictable pattern of usage
- transaction driven
- application oriented
- supports day-to-day decisions
- response time is very quick
- serves a large number of operational users

OLAP (On-line Analytical Processing)

- holds historic and integrated data
- useful for EIS and DSS
- stores detailed and summarized data
- data is largely static
- ad hoc, unstructured, and heuristic processing (a group of records processed in a batch)
- medium or low level of transaction throughput
- unpredictable pattern of usage
- analysis driven
- subject oriented
- supports strategic decisions
- response time is optimum
- serves a relatively small number of managerial-level users

Data Warehouse Architecture

[Diagram: data warehouse architecture, with source data passing through a staging area into the data warehouse.]

Data Warehouse vs. Data Mart

Data Warehouse:
- Corporate/enterprise-wide
- Union of all data marts
- Data received from the staging area
- Structured for a corporate view of the data
- Queries on the presentation resource
- Organized on an E-R model

Data Mart:
- Departmental
- A single business process
- Star join (facts & dimensions)
- Structured to view the departmental view of the data
- Technology optimal for data access and analysis
- Structured to suit the departmental view of the data

Meeting Requirements within the Data Warehouse

- The data is organized differently in the data warehouse (e.g., multidimensionally): star schema, snowflake schema
- The data is viewed differently
- The data is stored differently: vector (array) storage
- The data is indexed differently: bitmap indexes, join indexes

Star Schema

Star Schema: A modeling technique used to map multidimensional decision-support data into a relational database for the purpose of performing advanced data analysis.
OR
A relational database schema organized around a central table (the fact table) joined to a few smaller tables (dimension tables) using foreign key references.

Types of star schema:
1) Basic star schema, or star schema
2) Extended star schema, or snowflake schema

Multidimensional Modeling

Multidimensional modeling is based on the concept of the star schema. A star schema consists of two types of tables:
1) Fact table
2) Dimension table

Fact Table: A fact table contains the transactional data generated by business transactions.
Dimension Table: A dimension table contains master data or reference data used to analyze the transactional data.

A fact table contains two types of columns:
1) Measures
2) Key section

A data warehouse supports three types of measures:
1) Additive measures
2) Non-additive measures
3) Semi-additive measures

Fact Table (example):

Key section: Date, Prod_id, Cust_id
Measures: Sales_revenue, Tot_quantity, Unit_cost, Sale_price

Additive measures: Measures that can participate in calculations in order to derive new measures.
Non-additive measures: Measures that cannot participate in the calculations.
Semi-additive measures: Measures whose participation in calculations depends on the context; they can be added across some dimensions but not across others.
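As a rough illustration of the three measure types, here is a small Python sketch; the sample fact rows and the balance column are made up for the example:

```python
# Illustrative fact rows: sales_revenue is additive, unit_cost is non-additive,
# and balance (a point-in-time amount) is semi-additive.
fact_rows = [
    {"date": "2024-01-01", "prod_id": "P1", "sales_revenue": 100.0, "unit_cost": 10.0, "balance": 500.0},
    {"date": "2024-01-01", "prod_id": "P2", "sales_revenue": 200.0, "unit_cost": 20.0, "balance": 300.0},
    {"date": "2024-01-02", "prod_id": "P1", "sales_revenue": 150.0, "unit_cost": 10.0, "balance": 550.0},
]

# Additive: can be summed across every dimension (product, date, ...).
total_revenue = sum(r["sales_revenue"] for r in fact_rows)

# Non-additive: summing unit costs is meaningless; an average is more typical.
avg_unit_cost = sum(r["unit_cost"] for r in fact_rows) / len(fact_rows)

# Semi-additive: balances can be summed across products for a single date,
# but summing them across dates would double-count the same amounts.
balance_on_jan_1 = sum(r["balance"] for r in fact_rows if r["date"] == "2024-01-01")

print(total_revenue, avg_unit_cost, balance_on_jan_1)
```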

Types of Star Schema

A data warehouse supports 2 types of star schemas:
1) Basic star schema, or star schema
2) Extended star schema, or snowflake schema

Star Schema: Fact tables exist in normalized format, whereas dimension tables exist in denormalized format.
Snowflake Schema: Both fact and dimension tables exist in normalized format.
Factless fact table (coverage table): Transaction events can occur without measures, resulting in a fact table without measures.

Example of Star Schema
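A minimal sketch of a basic star schema, built here with Python's standard sqlite3 module; the table and column names (sales_fact, product_dim, customer_dim) are illustrative assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Denormalized dimension tables joined to a central fact table via foreign keys.
conn.executescript("""
CREATE TABLE product_dim  (prod_id INTEGER PRIMARY KEY, prod_name TEXT, category TEXT);
CREATE TABLE customer_dim (cust_id INTEGER PRIMARY KEY, cust_name TEXT, region TEXT);
CREATE TABLE sales_fact (
    sale_date     TEXT,
    prod_id       INTEGER REFERENCES product_dim(prod_id),
    cust_id       INTEGER REFERENCES customer_dim(cust_id),
    sales_revenue REAL,
    tot_quantity  INTEGER
);
""")

conn.execute("INSERT INTO product_dim  VALUES (1, 'Product1', 'Electronics')")
conn.execute("INSERT INTO customer_dim VALUES (1, 'Customer1', 'North')")
conn.execute("INSERT INTO sales_fact   VALUES ('2024-01-01', 1, 1, 150.0, 3)")

# A typical star-join query: measures from the fact table analyzed by dimension attributes.
for row in conn.execute("""
    SELECT p.category, c.region, SUM(f.sales_revenue)
    FROM sales_fact f
    JOIN product_dim  p ON f.prod_id = p.prod_id
    JOIN customer_dim c ON f.cust_id = c.cust_id
    GROUP BY p.category, c.region
"""):
    print(row)
```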

Example Of Snow Flake Schema
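For contrast, a minimal sketch of a snowflake variant: the product dimension is normalized by splitting its category attributes into a separate table (again, all names are assumptions for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Snowflaked (normalized) dimension: category attributes live in their own table.
conn.executescript("""
CREATE TABLE category_dim (category_id INTEGER PRIMARY KEY, category_name TEXT);
CREATE TABLE product_dim  (prod_id INTEGER PRIMARY KEY, prod_name TEXT,
                           category_id INTEGER REFERENCES category_dim(category_id));
CREATE TABLE sales_fact   (sale_date TEXT,
                           prod_id INTEGER REFERENCES product_dim(prod_id),
                           sales_revenue REAL);
""")

conn.execute("INSERT INTO category_dim VALUES (10, 'Electronics')")
conn.execute("INSERT INTO product_dim  VALUES (1, 'Product1', 10)")
conn.execute("INSERT INTO sales_fact   VALUES ('2024-01-01', 1, 150.0)")

# The same analysis now needs an extra join through the normalized dimension.
for row in conn.execute("""
    SELECT cat.category_name, SUM(f.sales_revenue)
    FROM sales_fact f
    JOIN product_dim  p   ON f.prod_id = p.prod_id
    JOIN category_dim cat ON p.category_id = cat.category_id
    GROUP BY cat.category_name
"""):
    print(row)
```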

Data Warehouse - Slowly Changing Dimensions

Slowly Changing Dimensions: Dimensions that change over time are called slowly changing dimensions. For instance, a product's price changes over time, people change their names for various reasons, and country and state names may change over time. These are a few examples of slowly changing dimensions, since changes happen to them over a period of time.

Type 1: Overwriting the old values
Type 2: Creating an additional record
Type 3: Creating new fields

SCD Type 1

Type 1: Overwriting the old values.

Product price in 2004:

Product ID (PK) | Year | Prod Name | Price
1               | 2004 | Product1  | 150

In the year 2005, if the price of the product changes to $250, then the old values of the "Year" and "Price" columns have to be updated and replaced with the new values. With Type 1, there is no way to find out the 2004 price of "Product1", since the table now contains only the new price and year information.

Product

Product ID (PK) | Year | Prod Name | Price
1               | 2005 | Product1  | 250
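As a rough sketch of Type 1 handling, the following Python snippet overwrites the row in place (an in-memory dict stands in for the dimension table):

```python
# SCD Type 1: overwrite the changed attributes in place -- the old values are lost.
product = {1: {"year": 2004, "prod_name": "Product1", "price": 150}}

def scd_type1_update(table, product_id, year, price):
    """Overwrite the existing row; the previous year/price are not retained."""
    table[product_id]["year"] = year
    table[product_id]["price"] = price

scd_type1_update(product, 1, 2005, 250)
print(product)  # {1: {'year': 2005, 'prod_name': 'Product1', 'price': 250}}
```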

SCD Type 2

Type 2: Creating an additional record.

PRODUCT

Product ID (PK) | Effective Datetime (PK) | Year | Product Name | Price | Expiry Datetime
1               | 01-01-2004 12:00 AM     | 2004 | Product1     | 150   | 12-31-2004 11:59 PM
1               | 01-01-2005 12:00 AM     | 2005 | Product1     | 250   |
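A minimal Python sketch of Type 2: instead of overwriting, the currently open row is expired and a new version row is appended with its own effective date (the open-ended expiry sentinel is an assumption of this sketch):

```python
from datetime import datetime

OPEN_END = None  # sentinel meaning "current row, not yet expired"

# In-memory stand-in for the PRODUCT dimension table above.
product_history = [
    {"product_id": 1, "effective": datetime(2004, 1, 1), "expiry": OPEN_END,
     "year": 2004, "product_name": "Product1", "price": 150},
]

def scd_type2_update(history, product_id, product_name, year, price, change_time):
    """Expire the current row for this product and append a new version row."""
    for row in history:
        if row["product_id"] == product_id and row["expiry"] is OPEN_END:
            row["expiry"] = change_time
    history.append({"product_id": product_id, "effective": change_time,
                    "expiry": OPEN_END, "year": year,
                    "product_name": product_name, "price": price})

scd_type2_update(product_history, 1, "Product1", 2005, 250, datetime(2005, 1, 1))
for row in product_history:
    print(row)
```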

SCD Type 3

Type 3: Creating new fields.

With Type 3, only the latest change to the values can be seen. The example below illustrates how adding new columns keeps track of the change; from it, we can see both the current price and the previous price of the product Product1.

Product ID (PK) | Current Year | Product Name | Current Product Price | Old Product Price | Old Year
1               | 2005         | Product1     | 250                   | 150               | 2004

The problem with the Type 3 approach is that, over the years, if the product price keeps changing, the complete history is not stored; only the latest change is kept. For example, if in 2006 Product1's price changes to $350, we can no longer see the 2004 price, since the old columns have been overwritten with the 2005 information.

Product ID (PK) | Year | Product Name | Product Price | Old Product Price | Old Year
1               | 2006 | Product1     | 350           | 250               | 2005
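A minimal Python sketch of Type 3: the current values are pushed into the dedicated "old" columns on each update, so anything older than the previous value is discarded:

```python
# SCD Type 3: keep only the previous value in dedicated "old" columns.
product = {"product_id": 1, "year": 2005, "product_name": "Product1",
           "price": 250, "old_price": 150, "old_year": 2004}

def scd_type3_update(row, new_year, new_price):
    """Move the current values into the 'old' columns, then overwrite them."""
    row["old_price"] = row["price"]
    row["old_year"] = row["year"]
    row["price"] = new_price
    row["year"] = new_year

scd_type3_update(product, 2006, 350)
print(product)  # price=350, old_price=250, old_year=2005; the 2004 values are gone
```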

Extract, transform, and load (ETL) is a process in database usage, and especially in data warehousing, that involves:

- Extracting data from outside sources
- Transforming it to fit operational needs (which can include quality levels)
- Loading it into the end target (database or data warehouse)

Extract: The first part of an ETL process involves extracting the data from the source systems. Most data warehousing projects consolidate data from different source systems. Common data source formats are relational databases and flat files, but they may also include non-relational database structures such as Information Management System (IMS), other data structures such as Virtual Storage Access Method (VSAM) or Indexed Sequential Access Method (ISAM), or even data fetched from outside sources through web spidering or screen scraping. Extraction converts the data into a format suitable for transformation processing.
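A minimal Python sketch of an extract step that pulls rows from a flat file and a relational source into one common in-memory format; the file name, database path, and the orders table are assumptions of the sketch:

```python
import csv
import sqlite3

def extract_from_flat_file(path):
    """Read a delimited flat file into a list of dicts (one dict per record)."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def extract_from_database(db_path):
    """Read rows from a relational source table into the same dict format."""
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row
    rows = conn.execute("SELECT * FROM orders")  # 'orders' is an assumed source table
    return [dict(r) for r in rows]

# Example usage (assuming these sources exist):
# extracted = extract_from_flat_file("customers.csv") + extract_from_database("source.db")
```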

Transform: The transform stage applies a series of rules or functions to the extracted data in order to derive the data that will be loaded into the end target. Some data sources require very little or even no manipulation. In other cases, one or more of the following transformation types may be required to meet the business and technical needs of the target database (see the sketch below):

- Generating surrogate-key values
- Transposing or pivoting (turning multiple columns into multiple rows, or vice versa)
- Splitting a column into multiple columns (e.g., turning a comma-separated list stored as a string in one column into individual values in different columns)
- Disaggregating repeating columns into a separate detail table (e.g., moving a series of addresses in one record into individual address records in a linked address table)
- Looking up and validating the relevant data from tables or reference files for slowly changing dimensions
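A sketch of two of the transformation types listed above, surrogate-key generation and splitting a comma-separated column; the column names (customer_sk, phones) are illustrative:

```python
import itertools

# A surrogate key is a meaningless, monotonically increasing integer assigned by the ETL process.
surrogate_keys = itertools.count(start=1)

def transform(record):
    """Apply example transformations to one extracted record."""
    out = dict(record)
    out["customer_sk"] = next(surrogate_keys)  # surrogate-key generation

    # Split a comma-separated "phones" value into individual columns.
    phones = [p.strip() for p in record.get("phones", "").split(",") if p.strip()]
    for i, phone in enumerate(phones, start=1):
        out[f"phone_{i}"] = phone
    out.pop("phones", None)
    return out

print(transform({"cust_name": "Customer1", "phones": "555-0100, 555-0101"}))
```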

Load: The load phase loads the data into the end target, usually the data warehouse (DW). Depending on the requirements of the organization, this process varies widely. Some data warehouses overwrite existing information with cumulative data, refreshing the extracted data on a daily, weekly, or monthly basis, while other DWs (or even other parts of the same DW) add new data in a historicized form, for example hourly. As the load phase interacts with a database, the constraints defined in the database schema, as well as triggers activated upon data load, apply (for example uniqueness, referential integrity, mandatory fields), and these also contribute to the overall data quality of the ETL process.
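A minimal sketch of a load step into a target table whose schema constraints (here a primary key and a NOT NULL column) are checked as the rows are loaded; the target table name is an assumption:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE customer_dim (
    customer_sk INTEGER PRIMARY KEY,   -- uniqueness enforced during the load
    cust_name   TEXT NOT NULL          -- mandatory field enforced during the load
)""")

def load(conn, records):
    """Insert transformed records in one transaction; constraint violations raise errors."""
    with conn:
        conn.executemany(
            "INSERT INTO customer_dim (customer_sk, cust_name) "
            "VALUES (:customer_sk, :cust_name)",
            records,
        )

load(conn, [{"customer_sk": 1, "cust_name": "Customer1"}])
print(conn.execute("SELECT * FROM customer_dim").fetchall())
```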

Real-life ETL cycle: The typical real-life ETL cycle consists of the following execution steps:

1) Cycle initiation
2) Build reference data
3) Extract (from sources)
4) Validate
5) Transform (clean, apply business rules, check for data integrity, create aggregates or disaggregates)
6) Stage (load into staging tables, if used)
7) Audit reports (for example, on compliance with business rules; in case of failure, they also help to diagnose and repair)
8) Publish (to target tables)
9) Archive
10) Clean up
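One way to read the cycle is as a fixed sequence of steps run by a scheduler. The sketch below simply wires hypothetical step functions together in that order (the step bodies are placeholders):

```python
def run_etl_cycle(steps):
    """Run each named step of the ETL cycle in order."""
    for name, step in steps:
        print(f"running: {name}")
        step()

# Placeholder step functions standing in for the real work of each phase.
steps = [
    ("cycle initiation",     lambda: None),
    ("build reference data", lambda: None),
    ("extract",              lambda: None),
    ("validate",             lambda: None),
    ("transform",            lambda: None),
    ("stage",                lambda: None),
    ("audit reports",        lambda: None),
    ("publish",              lambda: None),
    ("archive",              lambda: None),
    ("clean up",             lambda: None),
]

run_etl_cycle(steps)
```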

A recent development in ETL software is the implementation of parallel processing. This has enabled a number of methods to improve the overall performance of ETL processes when dealing with large volumes of data. ETL applications implement three main types of parallelism:

- Data: splitting a single sequential file into smaller data files to provide parallel access (see the sketch below)
- Pipeline: allowing several components to run simultaneously on the same data stream, for example looking up a value on record 1 at the same time as adding two fields on record 2
- Component: running multiple processes simultaneously on different data streams in the same job, for example sorting one input file while removing duplicates on another file
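A rough sketch of the data-parallelism idea: one sequential input is split into partitions and the same transformation runs on each partition in parallel (Python's multiprocessing stands in here for the partitioned execution an ETL engine would manage):

```python
from multiprocessing import Pool

def transform_partition(lines):
    """Run the same transformation over one partition of the data."""
    return [line.upper() for line in lines]

def split_into_partitions(lines, n):
    """Split one sequential input into n roughly equal partitions."""
    return [lines[i::n] for i in range(n)]

if __name__ == "__main__":
    data = [f"record-{i}" for i in range(1000)]
    partitions = split_into_partitions(data, n=4)
    with Pool(processes=4) as pool:
        results = pool.map(transform_partition, partitions)  # partitions processed concurrently
    merged = [row for part in results for row in part]       # re-gather the partitioned output
    print(len(merged))
```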

AB INITIO INTRODUCTION

- Data processing tool from Ab Initio Software Corporation (http://www.abinitio.com)
- "Ab initio" is Latin for "from the beginning"
- Designed to support the largest and most complex business applications
- Graphical, intuitive, and fits the way your business works

Importance of Ab Initio Compared to Other ETL Tools

1) Able to process huge amounts of data in a short span of time.
2) Easy to write complex and custom ETL logic, especially for banking and financial applications (e.g., amortization).
3) Ab Initio supports all three types of parallelism that an ETL tool needs to handle.
4) The data parallelism of Ab Initio is one feature that makes it distinct from other ETL tools.
5) When handling complex logic, you can write custom code, as it is Pro C-based code.
