
Data Warehousing Concepts

Contents
1. Introduction to Data Warehousing
2. Data Warehouse Architecture
3. Dimensional Modeling
4. OLAP
5. Data Warehousing Tools
6. iGATE DW Capabilities
7. Reference Material

1. Introduction to Data Warehousing


1. Data Warehousing Definition
2. Online Transaction Processing (OLTP) System
3. Data Warehousing System
4. Difference between OLTP and DW System
5. Reasons for Building a Data Warehouse
6. Benefits of Data Warehousing

1.1 What is a Data Warehouse?


A data warehouse:
- is primarily a centralized repository of an organization's data.
- holds a large amount of data, including historical information.
- is designed to support efficient data analysis and reporting.

1.2 OLTP Systems


Focus:
- Designed to get data in quickly and to analyze current events.
- Transaction oriented.
- Organized around business processes such as order entry, purchasing, campaign management, and trading.
- Emphasizes avoidance of data duplication, maintainability, etc.

Characteristics:
- Process oriented.
- Normalized data.
- Current data.
- Volatile data.
- Real-time updates.

1.3 Data Warehousing Systems


Focus:
- Designed to get data out and analyze it quickly.
- Concerned with subjects such as customer and product rather than processes such as order entry and campaign management.
- Focused on easy data access.
- Contains slices of data across different periods of time.
- Historical data supports trending, forecasting, and time-based performance reporting.

Characteristics:
- Subject oriented rather than process oriented.
- Integrated across subjects and the entire enterprise.
- De-normalized data.
- Time-variant.
- Historical data.
- Non-volatile.
- Atomic and summary data.

1.4 OLTP Vs Data Warehouse


OLTP Systems                  Data Warehouse
Normalized data               De-normalized data
Used to run the business      Used to analyze the business
Real-time data update         Updated on a predefined schedule
Volatile data                 Non-volatile data
Current data                  Historical data
Wider audience                Limited audience
Transaction throughput        Fast query response
Small to large database       Large to very large database

1.5 Why Build a Data Warehouse?


- No single version of the truth.
- Lack of standardized data across the enterprise for easy understanding and further decision making.
- Absence of historical data for the purpose of analysis and decision making.

1.6 Benefits of Data Warehousing


- Rapid access to data.
- Integrated data.
- Reliable reporting.
- Better decision making.

2. Data Warehouse Architecture


- Logical Architecture
- Elements of a Data Warehouse

Data Warehouse - Logical Architecture

Elements of a Data Warehouse

[Figure: data flows from the source systems through ETL and staging into data storage and on to the reporting layer. The diagram also carries a METADATA label.]
1. ETL & Staging: an ETL tool or process moves data from the source systems (SAP, Quality, Accounts, CRM, Inventory, Finance, Marketing, Manufacturing) into staging. The ETL tool interfaces with all the sources in the enterprise and extracts data in a batch cycle or in real time.
2. Data Storage: operational data storage and the data warehouse, with secured access. Enterprise information is stored in the warehouse structure.
3. Reporting Layer: BI tools and portals. BI tools interface with the databases to generate reports.

70% of the effort in a data warehousing solution goes into developing a successful ETL strategy.

Extracting
- The extract step is the first step in getting data into the data warehouse environment.
- Extracting means reading and understanding the source data, and copying the parts that are needed to the data staging area for further work.
- Extraction needs to be done carefully so as not to affect production environments.


Transforming
Once the data is extracted into the data staging area, there are many possible transformation steps, including:
- Cleaning the data by correcting misspellings, resolving domain conflicts (such as a city name that is incompatible with a postal code), dealing with missing data elements, and parsing into standard formats.
- Purging selected fields from the legacy data that are not useful for the data warehouse.
- Combining data sources, by matching exactly on key values or by performing fuzzy matches on non-key attributes, including looking up textual equivalents of legacy system codes.
- Creating surrogate keys for each dimension record in order to avoid a dependence on legacy-defined keys, where the surrogate key generation process enforces referential integrity between the dimension tables and the fact tables.
- Building aggregates to boost the performance of common queries.
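The cleaning, purging, and combining steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the rows, the field names (cust_code, city, state, legacy_flag), and the "last record wins" rule are all hypothetical, and a real ETL tool would do far more.

```python
# Hypothetical rows as they might sit in the staging area after extraction.
staged_rows = [
    {"cust_code": "C-001", "city": " new york ", "state": "ny", "legacy_flag": "x"},
    {"cust_code": "C-002", "city": "boston", "state": None, "legacy_flag": "y"},
    {"cust_code": "C-001", "city": "New York", "state": "NY", "legacy_flag": "x"},
]


def clean(row):
    """Parse into standard formats, fill missing elements, purge unused fields."""
    return {
        "cust_code": row["cust_code"].strip(),
        "city": (row["city"] or "Unknown").strip().title(),
        "state": (row["state"] or "Unknown").strip().upper(),
        # "legacy_flag" is deliberately dropped: assumed not useful downstream.
    }


def combine(rows, key):
    """Combine records by matching exactly on a key value (last record wins)."""
    merged = {}
    for row in rows:
        merged[row[key]] = row
    return list(merged.values())


customers = combine([clean(r) for r in staged_rows], "cust_code")
```

After this runs, customers holds one cleaned record per customer code, which is the shape a dimension load would expect.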


Staging Area
- A storage area and set of processes that clean, transform, combine, de-duplicate, household, archive, and prepare source data for use in the data warehouse.
- The data staging area is everything in between the source system and the presentation server.
- The data staging area is not part of the physical data warehouse.
- The staging area is dominated by the simple activities of sorting and sequential processing.
- The data staging area does not need to be based on relational technology.
- The data staging area does not provide query and presentation services.


Loading Data
- At the end of the transformation process, the data is ready to be loaded into the target warehouse.
- A first-time bulk load brings the historical data into the data warehouse.
- Periodic incremental loads bring in modified data.
- Loading in the data warehouse environment usually takes the form of inserting data into dimension tables and fact tables. These are the tables that users and tools typically query when executing reports.
- Bulk loading is a very important capability, to be contrasted with record-at-a-time loading, which is far slower and can push load times into the 10+ hour range.
- It may be necessary to drop and recreate indexes on the target warehouse structure each time data is loaded.
- Of late there is a move to real-time data integration, in which data is moved from the source system to the warehouse in a trickle-feed manner throughout the day.
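As a rough illustration of the loading pattern described above, the sketch below uses Python's built-in sqlite3 module as a stand-in for the target warehouse. The table name sales_fact, the index name idx_sales_time, and the sample rows are assumptions made for the example; production warehouses normally use the database vendor's own bulk loader.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales_fact (product_id INTEGER, time_id INTEGER, amount_sold REAL)"
)
conn.execute("CREATE INDEX idx_sales_time ON sales_fact (time_id)")


def bulk_load(conn, rows):
    """Load many rows in one transaction, dropping and recreating the index."""
    with conn:  # commits on success, rolls back on error
        conn.execute("DROP INDEX IF EXISTS idx_sales_time")
        conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)", rows)
        conn.execute("CREATE INDEX idx_sales_time ON sales_fact (time_id)")


# First-time load of (hypothetical) historical data.
historical = [(1, 20240101, 10.0), (2, 20240101, 5.5), (1, 20240102, 7.0)]
bulk_load(conn, historical)

# Periodic incremental load bringing in new or modified data.
bulk_load(conn, [(2, 20240103, 3.0)])

row_count = conn.execute("SELECT COUNT(*) FROM sales_fact").fetchone()[0]
```

Passing the whole batch to executemany inside a single transaction is the contrast with record-at-a-time loading, where each row would be inserted and committed on its own.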


The Data Warehouse


- The data warehouse is the queryable presentation resource for an enterprise's data.
- The data warehouse is the centralized repository of historical information covering every subject area within the organization.
- This presentation resource is not organized around an entity-relation model; using entity-relation modeling would sacrifice understandability and performance.
- The data warehouse is nothing more than the union of all the constituent data marts.
- A data warehouse is fed from the data staging area.
- The data warehouse manager is responsible both for the data warehouse and for the data staging area.


The Data Marts


- A data mart is a logical subset of the complete data warehouse.
- A data mart is a complete pie-wedge of the overall data warehouse pie.
- Data marts are subject specific, such as Finance, HR, or Marketing.
- A data mart represents a project that can be brought to completion, rather than an impossible galactic undertaking such as building the entire warehouse at once.
- A data warehouse is made up of the union of all its data marts (a "virtual warehouse").
- Every data mart must be represented by a dimensional model and, within a single data warehouse, all such data marts must be built from conformed dimensions and conformed facts.


3. Dimensional Modeling
- What is Dimensional Modeling?
- ER Model Vs Dimensional Model
- Facts
- Dimensions
- Surrogate Key
- Snowflake
- Examples: Star Schema, Snowflake Schema
- Slowly Changing Dimensions


What is a Dimensional Model?

- A data model design technique that directly reflects the way managers look at the business.
- Easier to understand than an ER model.
- Facilitates efficient data analysis.
- Consists of facts and dimensions.
- Commonly referred to as a star schema.


ER Model Vs Dimensional Model


ER Model
[Figure: entity-relationship diagram of a movie rental business. Entities: CUSTOMER, CUSTOMER CREDIT, PAYMENT (with CHECK, E-PAYMENT, and CREDIT CARD keyed on the payment transaction number), STORE, EMPLOYEE, MOVIE, and MOVIE RENTAL RECORD, linked by relationships such as "makes / is made by", "rents under / identifies", and "employs / is employed by".]

Dimensional Model
[Figure: a central FACT table holding KEY 1 to KEY 4 and MEASURE 1 to MEASURE 3, surrounded by DIMENSION 1 to DIMENSION 4, each holding its own key and ATTRIBUTE 1 to ATTRIBUTE 3.]


Fact Tables
- The core table in a dimensional model, where the numeric performance measurements of the business are stored.
- The most useful facts are numeric and additive.
- Each measurement is taken at the intersection of all the dimensions.
- Fact tables tend to be deep in terms of number of rows but narrow in terms of number of columns.
- They have a composite primary key that consists of the foreign keys of all the referenced dimensions.


Dimension Table
- Contains textual descriptors of the business.
- Has fewer rows but more columns than a fact table.
- Linked to the fact table through its surrogate key, which the fact table carries as a foreign key.
- Dimension attributes serve as the primary source of query constraints, groupings, and report labels.
- Minimize the use of codes by replacing them with verbose text.
- A concatenated piece of text serving as a code should be broken into its constituent pieces of information.
- Contains hierarchical information.
- Data is stored in a de-normalized form.


Surrogate Key
- Integers that are assigned sequentially as needed to populate a dimension.
- Serve to join the dimension to the fact table.
- It is better to use a surrogate key than a natural key:
  - Surrogate keys buffer the DW environment from operational changes.
  - Operational codes or natural keys might get reassigned in the operational systems.
  - Natural keys might not be unique across the business.
  - Surrogate keys are better for performance; natural keys might be bulky alphanumeric character strings.
  - There might not be a natural key available in the source system.
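Sequential surrogate key assignment can be sketched as follows. The class name SurrogateKeyGenerator, the source system names, and the customer codes are hypothetical, chosen only to show how the same natural key arriving from two systems receives two different surrogate keys.

```python
from itertools import count


class SurrogateKeyGenerator:
    """Hand out sequential integers, one per (source system, natural key)."""

    def __init__(self):
        self._next = count(1)
        self._by_natural = {}

    def key_for(self, source_system, natural_key):
        lookup = (source_system, natural_key)
        if lookup not in self._by_natural:
            self._by_natural[lookup] = next(self._next)
        return self._by_natural[lookup]


gen = SurrogateKeyGenerator()
k1 = gen.key_for("crm", "CUST-42")
k2 = gen.key_for("billing", "CUST-42")  # same code, different system
k3 = gen.key_for("crm", "CUST-42")      # seen before, so the key is reused
```

Because the fact table stores only the small integer, the warehouse is insulated from natural keys that are bulky, reassigned, or not unique across the business.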


Snowflake
- Dimension normalization: a dimension is divided into parent and child dimension tables.
- The aim is to reduce the total amount of storage needed for a dimension.

When to Snowflake
- Very large dimensions.
- Some attributes are not common to all the records.

Advantages
- Reduces disk space usage.
- Easy to maintain.

Disadvantages
- The presentation layer becomes complicated.
- Data retrieval time increases.
- Might not save much disk space, considering that dimensions take less space than facts.


Star Schema
- A data modeling technique.
- Intuitive for the user community to understand; there is no added complexity in the model.
- Performance will be good since there are very few joins.
- Space is no longer a constraint, since storage media costs have dropped.

[Figure: star schema with a central Sales Fact table joined to the Product, Time, Store, and Customer dimensions.]
Sales Fact: Product_ID, Customer_ID, Time_ID, Store_ID, Amount_Sold
Product Dim: Product ID, Product Name, Prod. Category, Unit Price
Time Dimension: Time ID, Full Date, Year, Month, Week, Day of month
Store Dim: Store ID, Store Name, Region, Store Address
Customer Dim: Customer ID, Customer Name, Cust Address, Cust telephone
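A star schema along these lines can be sketched with Python's built-in sqlite3 module. Only the Product and Store dimensions are included to keep the example short, and the sample rows and snake_case column names are assumptions. The report needs just one join per dimension it uses.

```python
import sqlite3

star = sqlite3.connect(":memory:")
star.executescript("""
CREATE TABLE product_dim (
    product_id INTEGER PRIMARY KEY,
    product_name TEXT,
    prod_category TEXT,
    unit_price REAL
);
CREATE TABLE store_dim (
    store_id INTEGER PRIMARY KEY,
    store_name TEXT,
    region TEXT
);
CREATE TABLE sales_fact (
    product_id INTEGER REFERENCES product_dim(product_id),
    store_id INTEGER REFERENCES store_dim(store_id),
    amount_sold REAL
);
INSERT INTO product_dim VALUES (1, 'DVD player', 'Electronics', 80.0);
INSERT INTO product_dim VALUES (2, 'Desk', 'Furniture', 150.0);
INSERT INTO store_dim VALUES (1, 'Downtown', 'East');
INSERT INTO store_dim VALUES (2, 'Airport', 'West');
INSERT INTO sales_fact VALUES (1, 1, 160.0);
INSERT INTO sales_fact VALUES (1, 2, 80.0);
INSERT INTO sales_fact VALUES (2, 1, 150.0);
""")

# One join from the fact table to the dimension that supplies the grouping.
sales_by_category = star.execute("""
    SELECT p.prod_category, SUM(f.amount_sold)
    FROM sales_fact AS f
    JOIN product_dim AS p ON p.product_id = f.product_id
    GROUP BY p.prod_category
    ORDER BY p.prod_category
""").fetchall()
```

The dimension attribute (prod_category) provides the grouping and report label, while the additive fact (amount_sold) is what gets summed.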



Snowflake Schema
- Dimensions are further normalized in snowflake schemas.
- The hierarchies of the data are understood by looking at the model instead of being embedded in the data itself.
- Too many joins have to be done to get the data.

[Figure: the Sales Fact table and its four dimensions, with each dimension further normalized into a "flake" table.]
Sales Fact: Product_ID, Customer_ID, Time_ID, Store_ID, Amount_Sold
Product Dim: Product ID, Product Name, Prod. Category, Unit Price (Prod Flake: Sub-category)
Time Dimension: Time ID, Full Date, Year, Month, Week, Day of month (Time Flake: Holiday)
Store Dim: Store ID, Store Name, Region, Store Address (Store Flake: Store Mgr)
Customer Dim: Customer ID, Customer Name, Cust Address, Cust telephone (Cust Flake: Cust. Profile)
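The extra join that snowflaking introduces can be seen in a small sqlite3 sketch. The table name category_flake and the sample data are hypothetical; the point is only that the category name now lives one table further away from the fact table.

```python
import sqlite3

flake = sqlite3.connect(":memory:")
flake.executescript("""
CREATE TABLE category_flake (
    category_id INTEGER PRIMARY KEY,
    category_name TEXT
);
CREATE TABLE product_dim (
    product_id INTEGER PRIMARY KEY,
    product_name TEXT,
    category_id INTEGER REFERENCES category_flake(category_id)
);
CREATE TABLE sales_fact (
    product_id INTEGER REFERENCES product_dim(product_id),
    amount_sold REAL
);
INSERT INTO category_flake VALUES (1, 'Electronics');
INSERT INTO category_flake VALUES (2, 'Furniture');
INSERT INTO product_dim VALUES (10, 'DVD player', 1);
INSERT INTO product_dim VALUES (11, 'PC', 1);
INSERT INTO product_dim VALUES (12, 'Desk', 2);
INSERT INTO sales_fact VALUES (10, 100.0);
INSERT INTO sales_fact VALUES (11, 900.0);
INSERT INTO sales_fact VALUES (12, 150.0);
""")

# Two joins are needed: fact -> dimension -> flake.
snowflake_report = flake.execute("""
    SELECT c.category_name, SUM(f.amount_sold)
    FROM sales_fact AS f
    JOIN product_dim AS p ON p.product_id = f.product_id
    JOIN category_flake AS c ON c.category_id = p.category_id
    GROUP BY c.category_name
    ORDER BY c.category_name
""").fetchall()
```

The category hierarchy is visible in the model itself, at the cost of one more join for every report that groups by category.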

Slowly Changing Dimensions


- Business users might want to track the impact of each and every attribute change.
- There are 3 basic techniques for maintaining SCDs:
  - Type 1. The value is overwritten. No history is maintained.
  - Type 2. A new record is inserted. History is maintained.
  - Type 3. Makes use of two columns, a Previous Value column and a Current Value column. Limited history is maintained (only the previous value).


SCD - Type 1 Example

Example: Project is changed from Schwab to UBoC

Before:
Employee Id   Employee Name   Project
513300        Srinivas        Charles Schwab

After:
Employee Id   Employee Name   Project
513300        Srinivas        UBoC
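A Type 1 change amounts to an overwrite in place. The sketch below models the dimension as a plain Python dict; the function name apply_type1 and the field names are illustrative only.

```python
# Dimension keyed by employee id; the layout is assumed for the example.
employee_dim = {
    513300: {"employee_name": "Srinivas", "project": "Charles Schwab"},
}


def apply_type1(dim, employee_id, new_project):
    """Overwrite the attribute in place; the old value is lost."""
    dim[employee_id]["project"] = new_project


apply_type1(employee_dim, 513300, "UBoC")
```

After the call there is no trace of the earlier project, which is why Type 1 keeps no history.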


SCD - Type 2 Example

Example: Project is changed from Schwab to UBoC

Before:
Employee Id   Employee Name   Project   Current Flag
513300        Srinivas        Schwab    Y

After:
Employee Id   Employee Name   Project   Current Flag
513300        Srinivas        Schwab    N
513300        Srinivas        UBoC      Y
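The Type 2 technique can be sketched as "expire the current row, then append a new current row". The list-of-dicts layout and the function name apply_type2 are assumptions for illustration; real implementations usually also carry effective dates and a surrogate key per row version.

```python
employee_rows = [
    {
        "employee_id": 513300,
        "employee_name": "Srinivas",
        "project": "Schwab",
        "current_flag": "Y",
    },
]


def apply_type2(rows, employee_id, new_project):
    """Flag the current row as old and append a new current row."""
    current = next(
        r for r in rows
        if r["employee_id"] == employee_id and r["current_flag"] == "Y"
    )
    if current["project"] == new_project:
        return  # nothing changed, so no new version is needed
    current["current_flag"] = "N"
    rows.append({**current, "project": new_project, "current_flag": "Y"})


apply_type2(employee_rows, 513300, "UBoC")
```

Both versions of the employee remain queryable, and the flag identifies which one is current.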


SCD - Type 3 Example

Example: Project is changed from Schwab to UBoC

Before:
Employee Id   Employee Name   Prev Project   Current Project
513300        Srinivas        (blank)        Schwab

After:
Employee Id   Employee Name   Prev Project   Current Project
513300        Srinivas        Schwab         UBoC
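A Type 3 change shifts the current value into the previous-value column before overwriting it. The field names and the function apply_type3 below are illustrative assumptions.

```python
employee = {
    "employee_id": 513300,
    "employee_name": "Srinivas",
    "prev_project": None,
    "current_project": "Schwab",
}


def apply_type3(row, new_project):
    """Move the current value to the previous column, then store the new one."""
    row["prev_project"] = row["current_project"]
    row["current_project"] = new_project


apply_type3(employee, "UBoC")
```

Only one prior value survives: a second change would push "UBoC" into prev_project and discard "Schwab".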


4. OLAP
- Introduction
- Flavors of OLAP


Introduction
- The general activity of querying and presenting text and number data from data warehouses in a dimensional format is known as OLAP.
- OLAP vendors' technology is non-relational and is almost always based on an explicit multidimensional cube of data.
- OLAP databases are also known as multidimensional databases, or MDDBs.
- OLAP installations would be classified as small, individual data marts when viewed against the full range of data warehouse applications.

[Figure: SALES CUBE. Sales values are laid out along two dimensions, Customer (A to D) and Product (1 to 5).]


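A typical cube with summaries

[Figure: a sales cube with three dimensions: Product (DVD, PC, VCR, sum), Quarter (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), and Region (America, Europe, Asia, sum). The highlighted cell is the total annual sales of DVDs in America.]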
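A cube with "sum" cells along every dimension can be sketched in plain Python. The sales figures are made up for the example; the dimension values and the "sum" label follow the figure.

```python
from collections import defaultdict
from itertools import product as cartesian

# Hypothetical sales records: (product, quarter, region, amount).
sales = [
    ("DVD", "1Qtr", "America", 10),
    ("DVD", "2Qtr", "America", 20),
    ("DVD", "1Qtr", "Europe", 5),
    ("PC", "1Qtr", "America", 7),
]


def build_cube(records):
    """Aggregate every record into its detail cell and all of its summary cells."""
    cube = defaultdict(int)
    for prod, quarter, region, amount in records:
        # Each dimension is either kept or rolled up to "sum": 2**3 cells per record.
        for cell in cartesian((prod, "sum"), (quarter, "sum"), (region, "sum")):
            cube[cell] += amount
    return cube


cube = build_cube(sales)
annual_dvd_america = cube[("DVD", "sum", "America")]
grand_total = cube[("sum", "sum", "sum")]
```

Looking up ("DVD", "sum", "America") returns the total annual sales of DVDs in America without scanning the detail records again.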

MOLAP Vs ROLAP
ROLAP
This methodology relies on manipulating the data stored in the relational database to give the appearance of traditional OLAP's slicing and dicing functionality.
Advantages
- Can handle large amounts of data; the limit is the database size.
- Can leverage functionality inherent in the relational database, such as materialized views.
Disadvantages
- Performance can be slow, because each ROLAP report is essentially a SQL query (or multiple SQL queries).
- Limited by SQL functionality, because ROLAP technology mainly relies on generating SQL.

MOLAP
In MOLAP, data is stored in a multidimensional cube. The storage is not in the relational database, but in proprietary formats.
Advantages
- Excellent performance: optimal for slicing and dicing operations.
- Can perform complex calculations: all calculations are pre-generated when the cube is created.
Disadvantages
- Limited in the amount of data it can handle.
- Requires additional investment: cube technology is often proprietary and does not already exist in the organization.
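The difference between the two flavors can be sketched in Python, with sqlite3 standing in for the relational database and a dict standing in for a proprietary cube store. All names and figures are hypothetical; real OLAP engines are far more elaborate.

```python
import sqlite3

rows = [("DVD", "America", 10), ("DVD", "Europe", 5), ("PC", "America", 7)]

# ROLAP style: data stays in relational tables and each report runs SQL.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales (product TEXT, region TEXT, amount INTEGER)")
db.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)


def rolap_total(product):
    """Compute the total at query time with a generated SQL statement."""
    query = "SELECT SUM(amount) FROM sales WHERE product = ?"
    return db.execute(query, (product,)).fetchone()[0]


# MOLAP style: totals are pre-generated once, when the cube is built.
molap_cube = {}
for product, region, amount in rows:
    molap_cube[product] = molap_cube.get(product, 0) + amount


def molap_total(product):
    """Answer from the pre-built cube with a simple lookup."""
    return molap_cube[product]
```

Both functions return the same totals. The ROLAP version pays the aggregation cost on every report, while the MOLAP version pays it once up front and must hold every pre-computed cell.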


5. DW Products
- Database
- ETL Tools
- Reporting Tools


Overview of Products

[Figure: products grouped by category and arranged by complexity of implementation (Small, Medium, Large).]
BI Tools: Cognos ReportNet, WebFocus, Hummingbird BI, Cleverpath Reporter, Business Objects
ETL Tools: PowerCenterRT, WebFocus ETL, Ab Initio, Hummingbird ETL, BO Data Integrator, SAGENT DataFlow, OWB
Databases: Oracle 10g, IBM DB2, SQL Server 2000, Teradata


6. iGATE's DW Offering
- Data Pyramid
- DW CoE


The iGATE Data Pyramid

iGATE's Integrated Service Offering Enables Business Process Efficiency

Data Analytics
- Customer acquisition
- Customer retention
- Process optimization
- Loss mitigation
- BI maturity assessment using iGATE's BI Maturity Model

BI Solutions: strategic measurement and analysis
- Query and reporting
- Decision support systems
- Data mining

Data Warehousing Solutions: technology-enabled data warehouse design and development
- Architecture planning
- Custom development
- Maintenance & support
- Reengineering
- ETL design

Data Quality Management: integrated, workflow-enabled data management solution
- Data migration
- Data conversion
- Data enhancements
- Offshore data cleansing
- ePartner and Data Quality Manager, iGATE's proprietary tools for data quality management


DW/BI CoE

[Figure: the iGATE DW/BI CoE shown at the center of the areas listed below. The diagram also carries the labels Business Landscape, Key Offerings in the Making, and Consulting.]

Evolving landscape
- Move to real time or active reporting
- EIP and BI integration
- Enterprise Performance Management
- IS re-engineering
- BIMM Framework
- Evaluation frameworks
- Product best practices

Key activities
- Technology consulting
- Pre-sales assistance
- Best practices/design guidelines
- Architecture definition
- Roadmap / strategy definition
- Vendor/tool evaluation
- Frameworks/methodologies

iGATE Differentiators
- Integrated Information Pyramid offering
- E-Partner
- Alliances with leading vendors
- Product certified professionals

7. Reference Material
Books
- Essential Oracle 8i Data Warehousing, by Gary Dodge and Tim Gorman.
- The Data Warehouse Lifecycle Toolkit, by Ralph Kimball.
- The Data Warehouse Toolkit, by Ralph Kimball.
- Data Warehouse Design Solutions, by Christopher Adamson and Michael Venerable.

Websites
- www.tdwi.org
- www.olapreport.com


Thank You
