Contents
1. Introduction to Data Warehousing
2. Data Warehouse Architecture
3. Dimensional Modeling
4. OLAP
5. Data Warehousing Tools
6. iGate DW Capabilities
7. Reference Material
Characteristics (operational systems):
- Process oriented.
- Normalized data.
- Current data.
- Volatile data.
- Real-time updates.
Characteristics (data warehouse):
- Subject oriented rather than process oriented.
- Integrated across subjects and the entire enterprise.
- De-normalized data.
- Time-variant.
- Historical data.
- Non-volatile.
- Atomic and summary data.
Data Warehouse
- De-normalized data
- Used to analyze the business
- Updated on a predefined schedule
- Non-volatile data
- Historical data
- Limited audience
- Fast query response
- Large to very large database
[Figure: data warehouse architecture. Labels: source systems (Accounts, Quality, CRM, Inventory, Finance, Marketing, Manufacturing); ETL Tool; Data Storage (Staging, Operational Data Storage, Data Warehouse); Reporting Layer (BI Tools, Portals); Secured Access; Metadata.]
Enterprise information is stored in the warehouse structure.
The ETL tool interfaces with all the sources in the enterprise and extracts data in a batch cycle or in real time.
Extracting
- The extract step is the first step in getting data into the data warehouse environment.
- Extracting means reading and understanding the source data, and copying the parts that are needed to the data staging area for further work.
- Extraction must be done carefully so as not to affect production environments.
Transforming
Once the data is extracted into the data staging area, there are many possible transformation steps, including:
- Cleaning the data by correcting misspellings, resolving domain conflicts (such as a city name that is incompatible with a postal code), dealing with missing data elements, and parsing into standard formats.
- Purging selected fields from the legacy data that are not useful for the data warehouse.
- Combining data sources, by matching exactly on key values or by performing fuzzy matches on non-key attributes, including looking up textual equivalents of legacy system codes.
- Creating surrogate keys for each dimension record in order to avoid a dependence on legacy-defined keys, where the surrogate key generation process enforces referential integrity between the dimension tables and the fact tables.
- Building aggregates for boosting the performance of common queries.
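A few of the transformation steps above (cleaning, handling missing elements, purging unused fields, looking up textual equivalents of codes, and building an aggregate) can be sketched in a few lines of Python. This is a minimal illustration, not a real ETL tool: the field names, the CODE_TEXT lookup, and the sample rows are all hypothetical.

```python
from collections import defaultdict

# Hypothetical lookup from legacy codes to verbose text (illustrative values).
CODE_TEXT = {"M": "Male", "F": "Female"}
# Only these fields are carried forward; anything else is purged.
KEEP_FIELDS = ("customer_code", "city", "gender", "amount")

def clean_record(raw):
    """Return a cleaned copy of one extracted source row."""
    rec = {k: raw.get(k) for k in KEEP_FIELDS}
    rec["customer_code"] = (rec["customer_code"] or "").strip().upper()
    # Missing city becomes an explicit "Unknown" member.
    rec["city"] = (rec["city"] or "Unknown").strip().title()
    # Replace the legacy code with its textual equivalent.
    rec["gender"] = CODE_TEXT.get((rec["gender"] or "").strip().upper(), "Unknown")
    rec["amount"] = float(rec["amount"] or 0)
    return rec

def build_aggregate(records, group_field, measure):
    """Pre-compute a total of `measure` per value of `group_field`."""
    totals = defaultdict(float)
    for rec in records:
        totals[rec[group_field]] += rec[measure]
    return dict(totals)

extracted = [
    {"customer_code": " c001 ", "city": "new york", "gender": "m",
     "amount": "10.5", "legacy_flag": "X"},
    {"customer_code": "C002", "city": None, "gender": "", "amount": "4"},
    {"customer_code": "c001", "city": " NEW YORK", "gender": "M", "amount": "2.5"},
]
cleaned = [clean_record(r) for r in extracted]
sales_by_city = build_aggregate(cleaned, "city", "amount")
```

Fuzzy matching and surrogate key generation are left out here to keep the sketch short.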
Staging Area
- A storage area and set of processes that clean, transform, combine, de-duplicate, household, archive, and prepare source data for use in the data warehouse.
- The data staging area is everything in between the source system and the presentation server.
- The data staging area is not part of the physical data warehouse.
- The staging area is dominated by the simple activities of sorting and sequential processing.
- The data staging area does not need to be based on relational technology.
- The data staging area does not provide query and presentation services.
Loading Data
- At the end of the transformation process, the data is ready to be loaded into the target warehouse.
- A first-time bulk load brings the historical data into the data warehouse.
- Periodic incremental loads bring in modified data.
- Loading in the data warehouse environment usually takes the form of inserting data into dimension tables and fact tables. These are the tables typically queried by users and tools when running reports.
- Bulk loading is a very important capability, to be contrasted with record-at-a-time loading, which is far slower and can push load times into the 10+ hour range.
- It may be necessary to drop and recreate indexes on the target warehouse structure each time data is loaded.
- Of late there is a move to real-time data integration, in which data is moved from the source system to the warehouse in a trickle-feed manner throughout the day.
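The difference between the first-time bulk load and the periodic incremental load can be sketched with Python's built-in sqlite3 module. The table layout, the modified_at column, and the sample rows are assumptions made for illustration. A production warehouse would use its database's own bulk loader, and replacing a changed row in place is a simplification.

```python
import sqlite3

COLUMNS = "(sale_id, product_id, amount, modified_at)"

def initial_bulk_load(conn, rows):
    """First-time load: insert the full history in one batch."""
    conn.executemany(f"INSERT INTO sales_fact {COLUMNS} VALUES (?, ?, ?, ?)", rows)
    conn.commit()

def incremental_load(conn, source_rows, last_load_time):
    """Periodic load: apply only rows modified since the previous load."""
    changed = [r for r in source_rows if r[3] > last_load_time]
    conn.executemany(
        f"INSERT OR REPLACE INTO sales_fact {COLUMNS} VALUES (?, ?, ?, ?)", changed
    )
    conn.commit()
    return len(changed)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales_fact ("
    "sale_id INTEGER PRIMARY KEY, product_id INTEGER, amount REAL, modified_at TEXT)"
)

# ISO-formatted dates compare correctly as plain strings.
history = [(1, 10, 5.0, "2024-01-01"), (2, 11, 7.5, "2024-01-02")]
initial_bulk_load(conn, history)

source_now = history + [(2, 11, 8.0, "2024-01-05"), (3, 12, 1.0, "2024-01-06")]
loaded = incremental_load(conn, source_now, last_load_time="2024-01-02")
```

Both functions pass the whole batch to executemany rather than issuing one statement per record, which is the small-scale analogue of preferring bulk loading over record-at-a-time loading.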
The data warehouse is the centralized repository of historical information covering every subject area within the organization.
- This presentation resource is not organized around an entity-relation model; using entity-relation modeling loses understandability and performance.
- The data warehouse is nothing more than the union of all the constituent data marts.
- The data warehouse is fed from the data staging area.
- The data warehouse manager is responsible for both the data warehouse and the data staging area.
3. Dimensional Modeling
- What is Dimensional Modeling?
- ER Model vs. Dimensional Model
- Facts
- Dimensions
- Surrogate Key
- Snowflake
- Examples: Star Schema, Snowflake Schema
- Slowly Changing Dimensions
Dimensional Model
[Figure: an ER model compared with a dimensional model. The ER example shows the entities MOVIE, EMPLOYEE, CHECK, E-PAYMENT, and CREDIT CARD with their attributes and foreign keys. The dimensional model shows a FACT table (KEY 1 to KEY 4, MEASURE 1 to MEASURE 3) and DIMENSION 1, DIMENSION 2, and DIMENSION 3; for example, DIMENSION 2 has KEY 2 and ATTRIBUTE 1 to ATTRIBUTE 3.]
Fact Tables
- The core table in a dimensional model, where the numeric performance measurements of the business are stored.
- The most useful facts are numeric and additive.
- Each measurement is taken at the intersection of all the dimensions.
- Tend to be deep in terms of number of rows but narrow in terms of number of columns.
- Have a composite primary key that consists of the foreign keys of all referenced dimensions.
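As a rough illustration of these points, the sketch below uses Python's sqlite3 module to define a narrow fact table whose primary key is the combination of its dimension keys, and then sums the additive measure along one dimension. The column names follow the Sales Fact example used elsewhere in this deck; the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE sales_fact (
        product_id  INTEGER NOT NULL,
        customer_id INTEGER NOT NULL,
        time_id     INTEGER NOT NULL,
        store_id    INTEGER NOT NULL,
        amount_sold REAL    NOT NULL,  -- numeric, additive measure
        PRIMARY KEY (product_id, customer_id, time_id, store_id)
    )
    """
)
conn.executemany(
    "INSERT INTO sales_fact VALUES (?, ?, ?, ?, ?)",
    [
        (1, 100, 20240101, 7, 25.0),
        (1, 101, 20240101, 7, 10.0),
        (2, 100, 20240102, 8, 40.0),
    ],
)

def total_by(column):
    """Sum the measure along one dimension key (checked against a whitelist)."""
    allowed = {"product_id", "customer_id", "time_id", "store_id"}
    if column not in allowed:
        raise ValueError(f"unknown dimension key: {column}")
    sql = (
        f"SELECT {column}, SUM(amount_sold) FROM sales_fact "
        f"GROUP BY {column} ORDER BY {column}"
    )
    return conn.execute(sql).fetchall()

by_product = total_by("product_id")
by_store = total_by("store_id")
```

Because amount_sold is additive, the same measure can be summed along any of the four keys. A second row at the same intersection of dimension keys would be rejected by the composite primary key.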
Dimension Tables
- Contain textual descriptors of the business.
- Have fewer rows but more columns than fact tables.
- Linked to the fact table through a foreign key in the fact table that references the dimension's surrogate key.
- Dimension attributes serve as the primary source of query constraints, groupings, and report labels.
- Minimize the use of codes by replacing them with verbose text.
- A concatenated piece of text serving as a code should be broken into its constituent pieces of information.
- Contain hierarchical information.
- Data is stored in a de-normalized form.
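The role of dimension attributes as query constraints, groupings, and report labels can be illustrated with a small star join. The sketch below uses Python's sqlite3 module; the Product Dim columns mirror the example schema in this deck, while the rows and the sales_by_category helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE product_dim (
        product_id    INTEGER PRIMARY KEY,  -- surrogate key
        product_name  TEXT,
        prod_category TEXT,
        unit_price    REAL
    );
    CREATE TABLE sales_fact (
        product_id  INTEGER REFERENCES product_dim(product_id),
        amount_sold REAL
    );
    INSERT INTO product_dim VALUES (1, 'Laptop', 'Computers', 900.0),
                                   (2, 'Mouse', 'Computers', 20.0),
                                   (3, 'DVD Player', 'Video', 80.0);
    INSERT INTO sales_fact VALUES (1, 900.0), (2, 40.0), (3, 160.0), (3, 80.0);
    """
)

def sales_by_category(min_unit_price=0.0):
    """Constrain on one dimension attribute and group by another."""
    return conn.execute(
        """
        SELECT d.prod_category, SUM(f.amount_sold)
        FROM sales_fact AS f
        JOIN product_dim AS d ON d.product_id = f.product_id
        WHERE d.unit_price >= ?      -- dimension attribute as a constraint
        GROUP BY d.prod_category     -- dimension attribute as grouping and label
        ORDER BY d.prod_category
        """,
        (min_unit_price,),
    ).fetchall()

all_sales = sales_by_category()
premium_sales = sales_by_category(min_unit_price=50.0)
```

The fact table supplies only the numbers; every filter and every label in the result comes from the dimension table.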
Surrogate Key
Integers that are assigned sequentially as needed to populate a dimension
string
There might not be a Natural Key available in the source system
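A minimal sketch of sequential surrogate key assignment, assuming an in-memory map from natural key to surrogate key. The class name and the sample natural keys are hypothetical; real ETL tools and databases usually provide sequences or identity columns for this purpose.

```python
from itertools import count

class SurrogateKeyGenerator:
    """Hand out sequential integer keys for dimension rows."""

    def __init__(self):
        self._next = count(1)
        self._by_natural_key = {}

    def key_for(self, natural_key):
        """Return the existing key for a natural key, or assign the next integer."""
        if natural_key not in self._by_natural_key:
            self._by_natural_key[natural_key] = next(self._next)
        return self._by_natural_key[natural_key]

    def new_key(self):
        """Assign a key to a row that has no natural key in the source system."""
        return next(self._next)

gen = SurrogateKeyGenerator()
first = gen.key_for("CUST-00017-NY")
second = gen.key_for("CUST-00042-CA")
repeat = gen.key_for("CUST-00017-NY")  # same natural key, same surrogate key
orphan = gen.new_key()
```

The warehouse then joins on the small integers only, so a change to the source system's key format does not ripple into the fact table.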
Snowflake
- Snowflaking is dimension normalization: a dimension is divided into parent and child dimension tables.
- The aim is to reduce the total amount of storage needed for a dimension.
When to Snowflake
- Very large dimensions
- Some attributes are not common to all the records
Advantages
- Reduces disk space usage
- Easy to maintain
Disadvantages
- The presentation layer becomes complicated
- Data retrieval time increases
- Might not save much disk space, since dimensions take up less space than facts
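The parent/child split can be sketched in plain Python: repeated attribute values are moved out of a de-normalized dimension into a child table, and the parent keeps only a key pointing at it. The snowflake function, the category_id key, and the sample product rows are assumptions made for illustration.

```python
def snowflake(dimension_rows, child_attrs, child_key_name):
    """Split a de-normalized dimension into parent rows and child rows."""
    child_keys = {}  # tuple of child attribute values -> child key
    parent_rows = []
    for row in dimension_rows:
        values = tuple(row[a] for a in child_attrs)
        if values not in child_keys:
            child_keys[values] = len(child_keys) + 1
        parent = {k: v for k, v in row.items() if k not in child_attrs}
        parent[child_key_name] = child_keys[values]
        parent_rows.append(parent)
    child_rows = [
        {child_key_name: key, **dict(zip(child_attrs, values))}
        for values, key in child_keys.items()
    ]
    return parent_rows, child_rows

product_dim = [
    {"product_id": 1, "product_name": "Laptop",
     "prod_category": "Computers", "sub_category": "Portable"},
    {"product_id": 2, "product_name": "Tablet",
     "prod_category": "Computers", "sub_category": "Portable"},
    {"product_id": 3, "product_name": "DVD Player",
     "prod_category": "Video", "sub_category": "Players"},
]
products, categories = snowflake(
    product_dim, ("prod_category", "sub_category"), "category_id"
)
```

The category text is now stored once per distinct value rather than once per product, which is the storage saving. The cost is that reading a product's category requires an extra lookup from products to categories.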
Star Schema
[Figure: star schema. Dimensions shown include Time, Store, and Customer. Time Dimension columns: Time ID, Full Date, Year, Month, Week, Day of month.]
The performance will be good since there are very few joins.
Snowflake Schema
[Figure: snowflake schema.
Sales Fact: Product_ID, Customer_ID, Time_ID, Store_ID, Amount_Sold.
Product Dim: Product ID, Product Name, Prod. Category, Unit Price; Prod Flake: Sub ctgry.
Time Dimension: Time ID, Full Date, Year, Month, Week, Day of month; Time Flake: Holiday.
Store Dim: Store ID; Store Flake: Store Mgr.
Customer Dim: Customer ID, Customer Name, Cust Address, Cust telephone; Cust Flake: Cust. Profile.]
Dimensions are further normalized in snowflake schemas.
Too many joins have to be done to get the data.
The hierarchies of the data are understood by looking at the model instead of being embedded in the data itself
4. OLAP
- Introduction
- Flavors of OLAP
Introduction
- The general activity of querying and presenting text and number data from data warehouses in a dimensional format is known as OLAP.
- The OLAP vendors' technology is non-relational and is almost always based on an explicit multidimensional cube of data.
- OLAP databases are also known as multidimensional databases, or MDDBs.
- OLAP installations would be classified as small, individual data marts when viewed against the full range of data warehouse applications.
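A multidimensional cube can be pictured as a mapping from one member per dimension to an aggregated value, with an extra "sum" member on each dimension for roll-ups. The sketch below builds such a mapping in plain Python. The build_cube function, the region names, and the sales figures are illustrative assumptions, and real OLAP engines use far more compact storage.

```python
from collections import defaultdict

def build_cube(facts, dimensions, measure):
    """Aggregate facts into cells keyed by one member per dimension.

    The member "sum" stands for all members of that dimension.
    """
    cube = defaultdict(float)
    n = len(dimensions)
    for fact in facts:
        members = [fact[d] for d in dimensions]
        # Each fact contributes to every cell whose coordinates are either
        # its own member or the "sum" roll-up, for each dimension.
        for mask in range(2 ** n):
            cell = tuple(
                "sum" if mask & (1 << i) else members[i] for i in range(n)
            )
            cube[cell] += fact[measure]
    return dict(cube)

facts = [
    {"quarter": "1Qtr", "product": "DVD", "region": "East", "sales": 11.0},
    {"quarter": "1Qtr", "product": "PC", "region": "East", "sales": 33.0},
    {"quarter": "2Qtr", "product": "DVD", "region": "West", "sales": 59.0},
    {"quarter": "2Qtr", "product": "VCR", "region": "East", "sales": 9.0},
]
cube = build_cube(facts, ("quarter", "product", "region"), "sales")

# Slice: fix one dimension (region = East) and keep the other two.
east_slice = {cell[:2]: value for cell, value in cube.items() if cell[2] == "East"}
```

Once the cube is built, any total is a single dictionary lookup, for example cube[("1Qtr", "sum", "sum")] for all first-quarter sales.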
[Figure: SALES CUBE, a grid of sales values with Customer and Product as dimensions.]
[Figure: cube with Quarter (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), product (DVD, PC, VCR, sum), and Region (sum) as axes.]
MOLAP Vs ROLAP
ROLAP
This methodology relies on manipulating the data stored in the relational database to give the appearance of traditional OLAP's slicing and dicing functionality.
Advantages
- Can handle large amounts of data; the limit is the size of the database.
- Can leverage functionality inherent in the relational database, such as materialized views.
Disadvantages
- Performance can be slow, because each ROLAP report is essentially a SQL query (or multiple SQL queries).
- Limited by SQL functionality, because ROLAP technology mainly relies on generating SQL.
MOLAP
In MOLAP, data is stored in a multidimensional cube. The storage is not in the relational database, but in proprietary formats.
Advantages
- Excellent performance: optimal for slicing and dicing operations.
- Can perform complex calculations: all calculations have been pre-generated when the cube is created.
Disadvantages
- Limited in the amount of data it can handle.
- Requires additional investment: cube technology is often proprietary and does not already exist in the organization.
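The trade-off between the two flavors can be sketched with Python's sqlite3 module. Here rolap_total generates and runs SQL against the relational table on every request, while molap_total reads from aggregates computed once at build time. The table, the sample rows, and the use of a Python dict as a stand-in for a proprietary cube format are all assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (quarter TEXT, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("1Qtr", "DVD", 11.0), ("1Qtr", "PC", 33.0), ("2Qtr", "DVD", 59.0)],
)

def rolap_total(quarter):
    """ROLAP style: run SQL against the relational table for each request."""
    row = conn.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM sales WHERE quarter = ?", (quarter,)
    ).fetchone()
    return row[0]

def build_molap_store():
    """MOLAP style: pre-compute the aggregates once, when the cube is built."""
    rows = conn.execute("SELECT quarter, SUM(amount) FROM sales GROUP BY quarter")
    return dict(rows)

molap_store = build_molap_store()

def molap_total(quarter):
    """Answer from the pre-computed store without touching the base table."""
    return molap_store.get(quarter, 0)

before = (rolap_total("1Qtr"), molap_total("1Qtr"))

# New data arrives in the relational table after the cube was built.
conn.execute("INSERT INTO sales VALUES ('1Qtr', 'VCR', 9.0)")
after = (rolap_total("1Qtr"), molap_total("1Qtr"))
```

The ROLAP path sees the new row immediately but pays for a query each time. The MOLAP path answers with a lookup but stays at its build-time value until the store is rebuilt.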
5. DW Products
- Database
- ETL Tools
- Reporting Tools
Overview of Products
[Figure: products grouped by category (BI Tools, ETL Tools, Databases) and by complexity of implementation (Small, Medium, Large). ETL tools named: PowerCenterRT, WebFocus ETL, Ab Initio, Hummingbird ETL, BO Data Integrator, SAGENT DataFlow, OWB.]
6. iGATE's DW Offering
- Data Pyramid
- DW CoE
- Architecture Planning
- Custom Development
- Maintenance & Support
- Reengineering
- ETL Design
- ePartner, Data Quality Manager: iGATE's proprietary tools for data quality management
- Data Quality Management: integrated, workflow-enabled data management solution
iGATE DW/BI CoE
Evolving landscape
- Move to real time or active reporting
- EIP and BI integration
- Enterprise Performance Management
- IS re-engineering
- BIMM Framework
- Evaluation frameworks
- Product best practices
Key activities
- Technology consulting
- Pre-sales assistance
- Best practices/design guidelines
- Architecture definition
- Roadmap / strategy definition
- Vendor/tool evaluation
- Frameworks/methodologies
iGATE Differentiators
- Integrated Information Pyramid offering
- E-Partner
- Alliances with leading vendors
- Product certified professionals
Other figure labels: Business Landscape, Consulting.
7. Reference Material
Books
- Essential Oracle 8i Data Warehousing, by Gary Dodge and Tim Gorman.
- The Data Warehouse Lifecycle Toolkit, by Ralph Kimball.
- The Data Warehouse Toolkit, by Ralph Kimball.
- Data Warehouse Design Solutions, by Christopher Adamson and Michael Venerable.
Websites
- www.tdwi.org
- www.olapreport.com
Thank You