
DATA WAREHOUSING FUNDAMENTALS

Definition of Data Warehouse (Inmon): A data warehouse is a subject-oriented, integrated, non-volatile, and time-variant collection of data in support of management's decisions.
OR

The data warehouse is an informational environment that:

- Provides an integrated and total view of the enterprise
- Makes the enterprise's current and historical information easily available for decision making
- Makes decision-support transactions possible without hindering operational systems
- Renders the organization's information consistent
- Presents a flexible and interactive source of strategic information

OR

A copy of the transactional data specially structured for reporting and analysis

Organizations' Use of Data Warehousing

- Retail: customer loyalty, market planning
- Financial: risk management, fraud detection
- Manufacturing: cost reduction, logistics management
- Utilities: asset management, resource management
- Airlines: route profitability, yield management

Data Warehouse - Subject Oriented


- Organized around major subjects, such as Customer, Sales, and Account
- Focuses on the modeling and analysis of data for decision makers, not on daily operations or transaction processing
- Provides a simple and concise view around particular subject issues by excluding data that is not useful in the decision-support process

[Diagram: application-oriented operational systems (Customer Billing, Order Processing, Accounts Receivable) are reorganized in the data warehouse around subjects such as Customer Data, Account, Sales, and REG Data.]

Data Warehouse - Integrated


- Constructed by integrating multiple, heterogeneous data sources: relational or other databases, flat files, and external data
- Data cleaning and data integration techniques are applied
- Ensures consistency in naming conventions, encoding structures, attribute measures, etc. among the different data sources
- When data is moved to the warehouse, it is converted


[Diagram: operational systems for Savings Account, Loans Account, and Checking Account are integrated in the data warehouse under a single subject = Account.]

Data Warehouse - Non-Volatile


- A physically separate store of data, transformed from the operational environment
- Operational updates of data do not occur in the data warehouse environment
- Does not require transaction processing, recovery, or concurrency control mechanisms
- Requires only two operations: loading of data and access to data

[Diagram: operational systems (e.g., Order Processing) create, insert, update, and delete data, while the data warehouse only loads and provides access to Sales Data.]

Data Warehouse - Time Variant


- The time horizon for the data warehouse is significantly longer than that of operational systems
- Operational database: current-value data
- Data warehouse data: provides information from a historical perspective (e.g., the past 5-10 years)
- Every key structure in the data warehouse contains an element of time, whereas the key of operational data may or may not contain a time element

[Diagram: an operational Deposit System keeps Customer Data for 60-90 days, while the data warehouse keeps it for 5-10 years.]

Data Warehouse - OLTP vs. OLAP

OLTP (On-line Transaction Processing)

- holds current data
- useful for end users
- stores detailed data
- data is dynamic
- repetitive processing (one record processed at a time)
- high level of transaction throughput
- predictable pattern of usage
- transaction driven
- application oriented
- supports day-to-day decisions
- response time is very quick
- serves a large number of operational users

OLAP (On-line Analytical Processing)

- holds historic and integrated data
- useful for EIS and DSS
- stores detailed and summarized data
- data is largely static
- ad hoc, unstructured, and heuristic processing (a group of records processed in a batch)
- medium or low level of transaction throughput
- unpredictable pattern of usage
- analysis driven
- subject oriented
- supports strategic decisions
- response time is optimum
- serves a relatively small number of managerial-level users

Data Warehouse Architecture

[Diagram: data warehouse architecture, with source data passing through a staging area into the data warehouse.]

Data Warehouse vs. Data Mart

Data Warehouse:
- Corporate/enterprise-wide
- Union of all data marts
- Data received from the staging area
- Structured for a corporate view of the data
- Queries on the presentation resource
- Organized on an E-R model

Data Mart:
- Departmental
- A single business process
- Star join (facts & dimensions)
- Structured to view the departmental view of the data
- Technology optimal for data access and analysis
- Structured to suit the departmental view of the data

Meeting Requirements within the Data Warehouse

- The data is organized differently in the data warehouse (e.g., multidimensionally): star schema, snowflake schema
- The data is viewed differently
- The data is stored differently: vector (array) storage
- The data is indexed differently: bitmap indexes, join indexes

Star Schema

Star Schema: A modeling technique used to map multidimensional decision-support data into a relational database for the purpose of performing advanced data analysis.
OR
A relational database schema organized around a central table (the fact table) joined to a few smaller tables (dimension tables) using foreign key references.

Types of star schema:
1) Basic star schema, or star schema
2) Extended star schema, or snowflake schema

Multidimensional Modeling

Multidimensional modeling is based on the concept of the star schema. A star schema consists of two types of tables:
1) Fact table
2) Dimension table

Fact Table: A fact table contains the transactional data generated by business transactions.
Dimension Table: A dimension table contains master data or reference data used to analyze the transactional data.

A fact table contains two types of columns:
1) Measures
2) Key section

A data warehouse supports three types of measures:
1) Additive measures
2) Non-additive measures
3) Semi-additive measures

Fact Table (example):

Key section: Date, Prod_id, Cust_id
Measures: Sales_revenue, Tot_quantity, Unit_cost, Sale_price

Additive measures: Measures that can participate in calculations in order to derive new measures.
Non-additive measures: Measures that cannot participate in the calculations.
Semi-additive measures: Measures whose participation in calculations depends on the context; they can be added across some dimensions but not across others.
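As a rough illustration of the three measure types, here is a small Python sketch; the sample fact rows and the balance column are made up for the example:

```python
# Illustrative fact rows: sales_revenue is additive, unit_cost is non-additive,
# and balance (a point-in-time amount) is semi-additive.
fact_rows = [
    {"date": "2024-01-01", "prod_id": "P1", "sales_revenue": 100.0, "unit_cost": 10.0, "balance": 500.0},
    {"date": "2024-01-01", "prod_id": "P2", "sales_revenue": 200.0, "unit_cost": 20.0, "balance": 300.0},
    {"date": "2024-01-02", "prod_id": "P1", "sales_revenue": 150.0, "unit_cost": 10.0, "balance": 550.0},
]

# Additive: can be summed across every dimension (product, date, ...).
total_revenue = sum(r["sales_revenue"] for r in fact_rows)

# Non-additive: summing unit costs is meaningless; an average is more typical.
avg_unit_cost = sum(r["unit_cost"] for r in fact_rows) / len(fact_rows)

# Semi-additive: balances can be summed across products for a single date,
# but summing them across dates would double-count the same amounts.
balance_on_jan_1 = sum(r["balance"] for r in fact_rows if r["date"] == "2024-01-01")

print(total_revenue, avg_unit_cost, balance_on_jan_1)
```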

Types of Star Schema

A data warehouse supports 2 types of star schemas:
1) Basic star schema, or star schema
2) Extended star schema, or snowflake schema

Star Schema: Fact tables exist in normalized format, whereas dimension tables exist in denormalized format.
Snowflake Schema: Both fact and dimension tables exist in normalized format.
Factless fact table (coverage table): Transaction events can occur without measures, resulting in a fact table without measures.

Example of Star Schema
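A minimal sketch of a basic star schema, built here with Python's standard sqlite3 module; the table and column names (sales_fact, product_dim, customer_dim) are illustrative assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Denormalized dimension tables joined to a central fact table via foreign keys.
conn.executescript("""
CREATE TABLE product_dim  (prod_id INTEGER PRIMARY KEY, prod_name TEXT, category TEXT);
CREATE TABLE customer_dim (cust_id INTEGER PRIMARY KEY, cust_name TEXT, region TEXT);
CREATE TABLE sales_fact (
    sale_date     TEXT,
    prod_id       INTEGER REFERENCES product_dim(prod_id),
    cust_id       INTEGER REFERENCES customer_dim(cust_id),
    sales_revenue REAL,
    tot_quantity  INTEGER
);
""")

conn.execute("INSERT INTO product_dim  VALUES (1, 'Product1', 'Electronics')")
conn.execute("INSERT INTO customer_dim VALUES (1, 'Customer1', 'North')")
conn.execute("INSERT INTO sales_fact   VALUES ('2024-01-01', 1, 1, 150.0, 3)")

# A typical star-join query: measures from the fact table analyzed by dimension attributes.
for row in conn.execute("""
    SELECT p.category, c.region, SUM(f.sales_revenue)
    FROM sales_fact f
    JOIN product_dim  p ON f.prod_id = p.prod_id
    JOIN customer_dim c ON f.cust_id = c.cust_id
    GROUP BY p.category, c.region
"""):
    print(row)
```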

Example Of Snow Flake Schema
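For contrast, a minimal sketch of a snowflake variant: the product dimension is normalized by splitting its category attributes into a separate table (again, all names are assumptions for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Snowflaked (normalized) dimension: category attributes live in their own table.
conn.executescript("""
CREATE TABLE category_dim (category_id INTEGER PRIMARY KEY, category_name TEXT);
CREATE TABLE product_dim  (prod_id INTEGER PRIMARY KEY, prod_name TEXT,
                           category_id INTEGER REFERENCES category_dim(category_id));
CREATE TABLE sales_fact   (sale_date TEXT,
                           prod_id INTEGER REFERENCES product_dim(prod_id),
                           sales_revenue REAL);
""")

conn.execute("INSERT INTO category_dim VALUES (10, 'Electronics')")
conn.execute("INSERT INTO product_dim  VALUES (1, 'Product1', 10)")
conn.execute("INSERT INTO sales_fact   VALUES ('2024-01-01', 1, 150.0)")

# The same analysis now needs an extra join through the normalized dimension.
for row in conn.execute("""
    SELECT cat.category_name, SUM(f.sales_revenue)
    FROM sales_fact f
    JOIN product_dim  p   ON f.prod_id = p.prod_id
    JOIN category_dim cat ON p.category_id = cat.category_id
    GROUP BY cat.category_name
"""):
    print(row)
```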

Data Warehouse - Slowly Changing Dimensions

Slowly Changing Dimensions: Dimensions that change over time are called slowly changing dimensions. For instance, a product's price changes over time, people change their names for various reasons, and country and state names may change over time. These are a few examples of slowly changing dimensions, since changes happen to them over a period of time.

Type 1: Overwriting the old values
Type 2: Creating an additional record
Type 3: Creating new fields

SCD Type 1

Type 1: Overwriting the old values.

Product price in 2004:

Product ID (PK) | Year | Prod Name | Price
1               | 2004 | Product1  | 150

In the year 2005, if the price of the product changes to $250, then the old values of the "Year" and "Price" columns have to be updated and replaced with the new values. With Type 1, there is no way to find out the 2004 price of "Product1", since the table now contains only the new price and year information.

Product

Product ID (PK) | Year | Prod Name | Price
1               | 2005 | Product1  | 250
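As a rough sketch of Type 1 handling, the following Python snippet overwrites the row in place (an in-memory dict stands in for the dimension table):

```python
# SCD Type 1: overwrite the changed attributes in place -- the old values are lost.
product = {1: {"year": 2004, "prod_name": "Product1", "price": 150}}

def scd_type1_update(table, product_id, year, price):
    """Overwrite the existing row; the previous year/price are not retained."""
    table[product_id]["year"] = year
    table[product_id]["price"] = price

scd_type1_update(product, 1, 2005, 250)
print(product)  # {1: {'year': 2005, 'prod_name': 'Product1', 'price': 250}}
```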

SCD Type 2

Type 2: Creating an additional record.

PRODUCT

Product ID (PK) | Effective Datetime (PK) | Year | Product Name | Price | Expiry Datetime
1               | 01-01-2004 12:00 AM     | 2004 | Product1     | 150   | 12-31-2004 11:59 PM
1               | 01-01-2005 12:00 AM     | 2005 | Product1     | 250   |
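A minimal Python sketch of Type 2: instead of overwriting, the currently open row is expired and a new version row is appended with its own effective date (the open-ended expiry sentinel is an assumption of this sketch):

```python
from datetime import datetime

OPEN_END = None  # sentinel meaning "current row, not yet expired"

# In-memory stand-in for the PRODUCT dimension table above.
product_history = [
    {"product_id": 1, "effective": datetime(2004, 1, 1), "expiry": OPEN_END,
     "year": 2004, "product_name": "Product1", "price": 150},
]

def scd_type2_update(history, product_id, product_name, year, price, change_time):
    """Expire the current row for this product and append a new version row."""
    for row in history:
        if row["product_id"] == product_id and row["expiry"] is OPEN_END:
            row["expiry"] = change_time
    history.append({"product_id": product_id, "effective": change_time,
                    "expiry": OPEN_END, "year": year,
                    "product_name": product_name, "price": price})

scd_type2_update(product_history, 1, "Product1", 2005, 250, datetime(2005, 1, 1))
for row in product_history:
    print(row)
```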

SCD Type 3

Type 3: Creating new fields.

With Type 3, only the latest change to the values can be seen. The example below illustrates how adding new columns keeps track of the change; from it, we can see both the current price and the previous price of the product Product1.

Product ID (PK) | Current Year | Product Name | Current Product Price | Old Product Price | Old Year
1               | 2005         | Product1     | 250                   | 150               | 2004

The problem with the Type 3 approach is that, over the years, if the product price keeps changing, the complete history is not stored; only the latest change is kept. For example, if in 2006 Product1's price changes to $350, we can no longer see the 2004 price, since the old columns have been overwritten with the 2005 information.

Product ID (PK) | Year | Product Name | Product Price | Old Product Price | Old Year
1               | 2006 | Product1     | 350           | 250               | 2005
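A minimal Python sketch of Type 3: the current values are pushed into the dedicated "old" columns on each update, so anything older than the previous value is discarded:

```python
# SCD Type 3: keep only the previous value in dedicated "old" columns.
product = {"product_id": 1, "year": 2005, "product_name": "Product1",
           "price": 250, "old_price": 150, "old_year": 2004}

def scd_type3_update(row, new_year, new_price):
    """Move the current values into the 'old' columns, then overwrite them."""
    row["old_price"] = row["price"]
    row["old_year"] = row["year"]
    row["price"] = new_price
    row["year"] = new_year

scd_type3_update(product, 2006, 350)
print(product)  # price=350, old_price=250, old_year=2005; the 2004 values are gone
```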

Extract, transform, and load (ETL) is a process in database usage, and especially in data warehousing, that involves:

- Extracting data from outside sources
- Transforming it to fit operational needs (which can include quality levels)
- Loading it into the end target (database or data warehouse)

Extract: The first part of an ETL process involves extracting the data from the source systems. Most data warehousing projects consolidate data from different source systems. Common data source formats are relational databases and flat files, but they may also include non-relational database structures such as Information Management System (IMS), other data structures such as Virtual Storage Access Method (VSAM) or Indexed Sequential Access Method (ISAM), or even data fetched from outside sources through web spidering or screen scraping. Extraction converts the data into a format suitable for transformation processing.
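A minimal Python sketch of an extract step that pulls rows from a flat file and a relational source into one common in-memory format; the file name, database path, and the orders table are assumptions of the sketch:

```python
import csv
import sqlite3

def extract_from_flat_file(path):
    """Read a delimited flat file into a list of dicts (one dict per record)."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def extract_from_database(db_path):
    """Read rows from a relational source table into the same dict format."""
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row
    rows = conn.execute("SELECT * FROM orders")  # 'orders' is an assumed source table
    return [dict(r) for r in rows]

# Example usage (assuming these sources exist):
# extracted = extract_from_flat_file("customers.csv") + extract_from_database("source.db")
```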

Transform: The transform stage applies a series of rules or functions to the extracted data in order to derive the data that will be loaded into the end target. Some data sources require very little or even no manipulation. In other cases, one or more of the following transformation types may be required to meet the business and technical needs of the target database (see the sketch below):

- Generating surrogate-key values
- Transposing or pivoting (turning multiple columns into multiple rows, or vice versa)
- Splitting a column into multiple columns (e.g., turning a comma-separated list stored as a string in one column into individual values in different columns)
- Disaggregating repeating columns into a separate detail table (e.g., moving a series of addresses in one record into individual address records in a linked address table)
- Looking up and validating the relevant data from tables or reference files for slowly changing dimensions
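A sketch of two of the transformation types listed above, surrogate-key generation and splitting a comma-separated column; the column names (customer_sk, phones) are illustrative:

```python
import itertools

# A surrogate key is a meaningless, monotonically increasing integer assigned by the ETL process.
surrogate_keys = itertools.count(start=1)

def transform(record):
    """Apply example transformations to one extracted record."""
    out = dict(record)
    out["customer_sk"] = next(surrogate_keys)  # surrogate-key generation

    # Split a comma-separated "phones" value into individual columns.
    phones = [p.strip() for p in record.get("phones", "").split(",") if p.strip()]
    for i, phone in enumerate(phones, start=1):
        out[f"phone_{i}"] = phone
    out.pop("phones", None)
    return out

print(transform({"cust_name": "Customer1", "phones": "555-0100, 555-0101"}))
```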

Load: The load phase loads the data into the end target, usually the data warehouse (DW). Depending on the requirements of the organization, this process varies widely. Some data warehouses overwrite existing information with cumulative data, refreshing the extracted data on a daily, weekly, or monthly basis, while other DWs (or even other parts of the same DW) add new data in a historicized form, for example hourly. As the load phase interacts with a database, the constraints defined in the database schema, as well as triggers activated upon data load, apply (for example uniqueness, referential integrity, mandatory fields), and these also contribute to the overall data quality of the ETL process.
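A minimal sketch of a load step into a target table whose schema constraints (here a primary key and a NOT NULL column) are checked as the rows are loaded; the target table name is an assumption:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE customer_dim (
    customer_sk INTEGER PRIMARY KEY,   -- uniqueness enforced during the load
    cust_name   TEXT NOT NULL          -- mandatory field enforced during the load
)""")

def load(conn, records):
    """Insert transformed records in one transaction; constraint violations raise errors."""
    with conn:
        conn.executemany(
            "INSERT INTO customer_dim (customer_sk, cust_name) "
            "VALUES (:customer_sk, :cust_name)",
            records,
        )

load(conn, [{"customer_sk": 1, "cust_name": "Customer1"}])
print(conn.execute("SELECT * FROM customer_dim").fetchall())
```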

Real-life ETL cycle: The typical real-life ETL cycle consists of the following execution steps:

1) Cycle initiation
2) Build reference data
3) Extract (from sources)
4) Validate
5) Transform (clean, apply business rules, check for data integrity, create aggregates or disaggregates)
6) Stage (load into staging tables, if used)
7) Audit reports (for example, on compliance with business rules; in case of failure, they also help to diagnose and repair)
8) Publish (to target tables)
9) Archive
10) Clean up
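One way to read the cycle is as a fixed sequence of steps run by a scheduler. The sketch below simply wires hypothetical step functions together in that order (the step bodies are placeholders):

```python
def run_etl_cycle(steps):
    """Run each named step of the ETL cycle in order."""
    for name, step in steps:
        print(f"running: {name}")
        step()

# Placeholder step functions standing in for the real work of each phase.
steps = [
    ("cycle initiation",     lambda: None),
    ("build reference data", lambda: None),
    ("extract",              lambda: None),
    ("validate",             lambda: None),
    ("transform",            lambda: None),
    ("stage",                lambda: None),
    ("audit reports",        lambda: None),
    ("publish",              lambda: None),
    ("archive",              lambda: None),
    ("clean up",             lambda: None),
]

run_etl_cycle(steps)
```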

A recent development in ETL software is the implementation of parallel processing. This has enabled a number of methods to improve the overall performance of ETL processes when dealing with large volumes of data. ETL applications implement three main types of parallelism:

- Data: splitting a single sequential file into smaller data files to provide parallel access (see the sketch below)
- Pipeline: allowing several components to run simultaneously on the same data stream, for example looking up a value on record 1 at the same time as adding two fields on record 2
- Component: running multiple processes simultaneously on different data streams in the same job, for example sorting one input file while removing duplicates on another file
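A rough sketch of the data-parallelism idea: one sequential input is split into partitions and the same transformation runs on each partition in parallel (Python's multiprocessing stands in here for the partitioned execution an ETL engine would manage):

```python
from multiprocessing import Pool

def transform_partition(lines):
    """Run the same transformation over one partition of the data."""
    return [line.upper() for line in lines]

def split_into_partitions(lines, n):
    """Split one sequential input into n roughly equal partitions."""
    return [lines[i::n] for i in range(n)]

if __name__ == "__main__":
    data = [f"record-{i}" for i in range(1000)]
    partitions = split_into_partitions(data, n=4)
    with Pool(processes=4) as pool:
        results = pool.map(transform_partition, partitions)  # partitions processed concurrently
    merged = [row for part in results for row in part]       # re-gather the partitioned output
    print(len(merged))
```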

AB INITIO INTRODUCTION

- Data processing tool from Ab Initio Software Corporation (http://www.abinitio.com)
- "Ab initio" is Latin for "from the beginning"
- Designed to support the largest and most complex business applications
- Graphical, intuitive, and fits the way your business works

Importance of Ab Initio Compared to Other ETL Tools

1) Able to process huge amounts of data in a short span of time.
2) Easy to write complex and custom ETL logic, especially for banking and financial applications (e.g., amortization).
3) Ab Initio supports all three types of parallelism that an ETL tool needs to handle.
4) The data parallelism of Ab Initio is one feature that makes it distinct from other ETL tools.
5) When handling complex logic, you can write custom code, as it is Pro C-based code.
