
Index

What are the source systems?
ETL process
EDW (Enterprise Data Warehouse)
DM (Data Mart)
OLAP (Online Analytical Processing)
Dimensional modeling
Topology (all data marts, dependent, independent)
Audience

Data Warehousing.
Data Warehouse basic concepts
Data Warehouse Approach
Data Warehouse Implementation
OLAP (Online Analytical Processing)
Next steps in Data Warehousing

By V.S.Rajesh Kumar November 2004

Data Warehouse- Concepts


Module 1 Data Warehouse basic concepts

What is DSS?
Decision Support System.
Used mainly by the business to take strategic decisions based on trends (for example, comparing the current fiscal year to the previous one) and to project numbers based on history and some parameters.
Not used to run the business: OLTP systems take care of the day-to-day activities of a business. Example: SAP Order Management takes care of the orders the organization receives.
In the DSS we collect all the data in order to do the analysis.

OLTP
Online Transaction Processing system.
Examples of OLTP systems are order management, TERA etc.
The database is usually designed in 3rd normal form.
All the DML types are active.
Deals with specific data (customer x, product z etc.).

OLTP vs DSS

OLTP
More DML operations (updates, deletes, inserts)
Point queries; very specific while issuing queries
Less history (approximately 6 months to 1 year)
Used for day-to-day activities (a must to run the business)

DSS
No change in the data (no updates and deletes)
Queries based on time period, set of products, set of customers etc.
Maintains the history
Used mainly for analytics (trend analysis, customer behavior etc.)
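The difference in query style can be sketched as follows. This is only an illustration using Python's built-in sqlite3 module; the `orders` table, its columns and its rows are made up for the example and do not come from any real system.

```python
import sqlite3

# Hypothetical orders table, held in memory for the example.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT, "
    "product TEXT, order_date TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?, ?)",
    [
        (1, "customer_x", "product_z", "2004-01-15", 100.0),
        (2, "customer_y", "product_z", "2004-02-10", 250.0),
        (3, "customer_x", "product_w", "2004-02-20", 75.0),
        (4, "customer_y", "product_w", "2003-02-05", 60.0),
    ],
)

# OLTP style: a point query that touches one specific order.
point_row = conn.execute(
    "SELECT amount FROM orders WHERE order_id = ?", (2,)
).fetchone()

# DSS style: totals per product over a time period.
period_totals = conn.execute(
    "SELECT product, SUM(amount) FROM orders "
    "WHERE order_date BETWEEN ? AND ? "
    "GROUP BY product ORDER BY product",
    ("2004-01-01", "2004-12-31"),
).fetchall()
```

The point query returns a single row, while the period query scans and aggregates many rows, which is why the two workloads are usually kept on separate systems.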

General DSS Architecture

[Diagram: Source Data (OLTP 1, OLTP 2, Market Place, Web clicks) -> ETL (tool or T-SQL), with a Staging DB and an ODS -> Data Warehouse Database -> OLAP Cubes Database -> Pre-defined Reports, Ad hoc Reporting and Data Mining. Close the loop: write the findings from the DSS back to the OLTP systems.]

Architecture Diagram

[Diagram: Source Data (HR Data, Finance, Payroll, Project) -> ET&L with Microsoft DTS (Data Transformation Services) & Stored Procedures -> Data Warehouse Database (SQL Server Database) -> OLAP Cubes -> Pre-defined Reports and Ad hoc Reporting.]

Example for a DSS

[Diagram: OLTP 1, OLTP 2, OLTP 3 and OLTP 4 feed the Data Warehouse, which feeds OLAP; Reporting and Analytics run on top.]

DSS Categories
ODS
Operational Data Store
Support for: Consolidated and reconciled operational data capture and access
Detailed, lightly summarized
Process oriented, subject oriented, integrated
Volatile (updateable)
Current
Short; business process life (30 to 90 days of history), purge

EDW
Enterprise Data Warehouse
Support for: Single source of consistent, integrated, cross-functional data for access and distribution
Detailed atomic record of events, reference and dimension masters, derived, summarized
Subject oriented, integrated
Non-volatile; periodic loads, read only
Time variant
Long; institutional memory (2 years or more of history), archive

RDM
Relational Data Mart
Support for: Subset of integrated data, separated for autonomous processing, optimized for access
Aggregated, summarized, specialized
Subject oriented, integrated
Non-volatile; periodic load, can contain separate updateable structures for OLTP support
Time variant
Variable retention; some archive

OLAP
Online Analytical Processing
Support for: Subset of integrated data, separated for autonomous processing, optimized for access
Aggregated, summarized, specialized
Subject oriented, integrated
Non-volatile; periodic load, can contain separate updateable structures for OLTP support
Time variant
Variable retention; some archive

ETL (E: Extract)
Extract: getting data out of the source systems. This may be just a DTS package which pulls the data, or an export of a table to a flat file in the source system.
In Teradata, the FastExport utility exports data to a flat file.
In Oracle, data can be spooled to a flat file from SQL*Plus (SQL*Loader works in the other direction: it loads flat files into tables).
In SQL Server, a DTS package can do the same job.
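As a rough sketch of the "export a table to a flat file" idea, the snippet below uses Python's sqlite3 and csv modules in place of a vendor utility. The `reseller` table, the pipe delimiter and the function name `extract_to_flat_file` are assumptions made for the example.

```python
import csv
import io
import sqlite3

def extract_to_flat_file(conn, table, out):
    """Write every row of `table` to `out` as pipe-delimited text, header first."""
    # The table name is interpolated directly, so it must come from trusted code.
    cursor = conn.execute(f"SELECT * FROM {table}")
    writer = csv.writer(out, delimiter="|", lineterminator="\n")
    writer.writerow([column[0] for column in cursor.description])
    rows_written = 0
    for row in cursor:
        writer.writerow(row)
        rows_written += 1
    return rows_written

# Hypothetical source system with a small reseller table.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE reseller (reseller_id INTEGER, name TEXT, country TEXT)")
source.executemany(
    "INSERT INTO reseller VALUES (?, ?, ?)",
    [(1, "Acme", "US"), (2, "Globex", "IN")],
)

flat_file = io.StringIO()  # stands in for a real file on disk
count = extract_to_flat_file(source, "reseller", flat_file)
```

In practice `out` would be a file opened with `open(path, "w", newline="")`; an in-memory buffer is used here only to keep the example self-contained.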

ETL (T: Transform)
Transform: the source and the destination do not need to share the same data model. When the destination data model differs from the source, the source data has to be modified to fit the destination's data model. This process is called transformation.
Example: when we receive reseller information from various distributors, it does not include the geo information. So the transformation logic has some code which assigns the respective geo based on the country the data comes from. This is a simple example of a transformation.
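The geo assignment described above could look roughly like this. The country-to-geo table and the geo names are invented for illustration; a real warehouse would take them from its own reference data.

```python
# Hypothetical lookup from country code to geo; not taken from any real system.
COUNTRY_TO_GEO = {
    "US": "Americas",
    "BR": "Americas",
    "DE": "EMEA",
    "IN": "APAC",
}

def assign_geo(record, default="UNKNOWN"):
    """Return a copy of the reseller record with a geo derived from its country."""
    country = record["country"].strip().upper()
    return {**record, "geo": COUNTRY_TO_GEO.get(country, default)}

resellers = [
    {"name": "Acme", "country": "us"},
    {"name": "Globex", "country": "IN "},
    {"name": "Initech", "country": "ZZ"},
]
transformed = [assign_geo(r) for r in resellers]
```

Countries missing from the lookup get a default geo instead of failing the load, so they can be reviewed later.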

ETL (L: Load)

Load: loading the transformed data into the destination data model (the data warehouse). Just as each RDBMS has export functionality, it also has a utility to import data into the database:
Teradata: FastLoad
Oracle: SQL*Loader
Sybase: bcp
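A minimal sketch of the load step, again with Python's sqlite3 and csv modules standing in for a bulk-load utility. The `reseller_dim` table, the pipe-delimited layout and the function name `load_flat_file` are assumptions for the example.

```python
import csv
import io
import sqlite3

def load_flat_file(conn, table, flat_file):
    """Insert the rows of a pipe-delimited file (header first) into `table`."""
    reader = csv.reader(flat_file, delimiter="|")
    header = next(reader)
    rows = list(reader)
    columns = ", ".join(header)
    placeholders = ", ".join("?" for _ in header)
    # Table and column names are interpolated, so the file must be trusted.
    with conn:  # commits on success, rolls back if any insert fails
        conn.executemany(
            f"INSERT INTO {table} ({columns}) VALUES ({placeholders})", rows
        )
    return len(rows)

# Hypothetical warehouse table and transformed flat file.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE reseller_dim (reseller_id INTEGER, name TEXT, country TEXT, geo TEXT)"
)
flat_file = io.StringIO(
    "reseller_id|name|country|geo\n1|Acme|US|Americas\n2|Globex|IN|APAC\n"
)
loaded = load_flat_file(warehouse, "reseller_dim", flat_file)
loaded_rows = warehouse.execute(
    "SELECT reseller_id, name, geo FROM reseller_dim ORDER BY reseller_id"
).fetchall()
```

Wrapping the inserts in one transaction means a failed load leaves the target table unchanged, which makes it safe to rerun the job.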

Data Modeling for OLTP


Usually 3rd normal form.
Advantages: flexibility to modify the model for changes; no redundancy of data in the model.
Disadvantages: complex queries to generate reports, as the number of tables to join is usually high.

Dimensional Modeling for DSS


Star schema, snowflake schema.
Which type of model suits better depends on the RDBMS. Example: Teradata is a parallel-processing database engine that can return results in reasonable time, so the enterprise data model can be designed in 3rd normal form. We can't take the same approach for SQL Server or Oracle; there we should think of denormalizing the data model.
A star schema makes queries run faster, as there are fewer tables to join. In a star schema all the hierarchies defined for a dimension are stored in a single table, so data redundancy is high. In a snowflake schema the hierarchy can be moved out into one more table. That is the difference between the star schema and the snowflake schema.

Star Schema

A star schema is optimized for queries. A data model based on a star schema carries redundant data.

Snowflake
A snowflake schema does not have much redundant data, as most of the dimensions have a lookup table. This way the number of joins between tables increases. Both have advantages and disadvantages, so analyze the end users' requirements and the space constraints to pick the best.
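The trade-off between the two schemas can be seen in a small sketch. All table names, column names and rows below are invented for the example, and SQLite (through Python's sqlite3 module) is used only because it needs no setup.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales_fact (product_id INTEGER, amount REAL);

-- Star: the category hierarchy is stored inside the dimension table,
-- so the category name is repeated for every product.
CREATE TABLE product_dim_star (
    product_id INTEGER PRIMARY KEY, product_name TEXT, category_name TEXT);

-- Snowflake: the hierarchy is moved out into a lookup table.
CREATE TABLE category_lookup (category_id INTEGER PRIMARY KEY, category_name TEXT);
CREATE TABLE product_dim_snow (
    product_id INTEGER PRIMARY KEY, product_name TEXT, category_id INTEGER);

INSERT INTO sales_fact VALUES (1, 10.0), (2, 5.0), (3, 4.0), (1, 2.0);
INSERT INTO product_dim_star VALUES
    (1, 'Apple', 'Fruit'), (2, 'Pear', 'Fruit'), (3, 'Milk', 'Dairy');
INSERT INTO category_lookup VALUES (1, 'Fruit'), (2, 'Dairy');
INSERT INTO product_dim_snow VALUES (1, 'Apple', 1), (2, 'Pear', 1), (3, 'Milk', 2);
""")

# Star schema: one join from the fact table to the dimension.
star_totals = conn.execute("""
    SELECT d.category_name, SUM(f.amount)
    FROM sales_fact f
    JOIN product_dim_star d ON d.product_id = f.product_id
    GROUP BY d.category_name ORDER BY d.category_name
""").fetchall()

# Snowflake schema: an extra join to reach the category name.
snow_totals = conn.execute("""
    SELECT c.category_name, SUM(f.amount)
    FROM sales_fact f
    JOIN product_dim_snow d ON d.product_id = f.product_id
    JOIN category_lookup c ON c.category_id = d.category_id
    GROUP BY c.category_name ORDER BY c.category_name
""").fetchall()
```

Both queries return the same totals; the star version repeats 'Fruit' in the dimension table, while the snowflake version pays for the saved space with one more join.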

Data Refresh in DSS


We have to refresh the data in the DSS from the various source systems in a timely manner. While doing so, we either do a full refresh of a particular table or capture only the changed data (this process is called a delta load).
Usually we go for a delta refresh on fact tables and a full refresh on dimension tables. As the environment grows bigger and bigger, almost all the tables become delta loads.
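The two refresh styles can be sketched as below. This assumes each source row carries an `updated_at` value that can be compared with the time of the last load; that column, the row layout and the function names are assumptions for the example, and real change capture is often more involved (for instance, this sketch does not detect deleted rows).

```python
def full_refresh(target, source_rows):
    """Replace the whole target table with the current source rows."""
    target.clear()
    for row in source_rows:
        target[row["id"]] = dict(row)
    return len(target)

def delta_refresh(target, source_rows, last_load):
    """Apply only the rows changed since the last load (insert or overwrite)."""
    changed = [row for row in source_rows if row["updated_at"] > last_load]
    for row in changed:
        target[row["id"]] = dict(row)
    return len(changed)

# Hypothetical source rows; ISO dates compare correctly as strings.
source = [
    {"id": 1, "amount": 100, "updated_at": "2004-11-01"},
    {"id": 2, "amount": 250, "updated_at": "2004-11-03"},
    {"id": 3, "amount": 75, "updated_at": "2004-11-04"},
]
# Warehouse state after the previous load on 2004-11-02.
warehouse = {
    1: {"id": 1, "amount": 100, "updated_at": "2004-11-01"},
    2: {"id": 2, "amount": 200, "updated_at": "2004-10-30"},
}
applied = delta_refresh(warehouse, source, last_load="2004-11-02")
```

Here only rows 2 and 3 are touched, while a full refresh would rewrite all three rows regardless of whether they changed.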

Advantages of DSS

Safeway, a grocery store chain in the US, gives various information from its DSS directly to store managers. Example: the system can predict a particular stock outage in the store. Based on history, the system knows that a particular item should sell every 3 hours; if the DSS has not seen a transaction in the last 2 hours, it sends an SMS to the current shift manager's mobile. That is the level you can reach with a DSS. It takes time to get there.
Walmart does customer profiling, store sales analysis etc. on their data warehouse, which is implemented on Teradata.
FedEx uses Teradata, Ab Initio and MicroStrategy as their DSS tools.
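The stock-outage check in the example above boils down to comparing the time since the last sale with an allowed gap. The sketch below is an assumption about how such a rule might be written; the function name, the dates and the 2-hour gap are illustrative only and say nothing about how any retailer's system actually works.

```python
from datetime import datetime, timedelta

def needs_alert(last_sale, now, max_gap):
    """True when no sale has been seen for longer than the allowed gap."""
    return now - last_sale > max_gap

now = datetime(2004, 11, 1, 15, 0)
max_gap = timedelta(hours=2)  # hypothetical threshold derived from sales history

# Item last sold 2.5 hours ago: candidate for a stock-outage alert.
quiet_item = needs_alert(datetime(2004, 11, 1, 12, 30), now, max_gap)
# Item last sold 45 minutes ago: no alert.
active_item = needs_alert(datetime(2004, 11, 1, 14, 15), now, max_gap)
```

A real system would derive the allowed gap per item and per store from the history, and would hand the alert to a messaging service rather than return a flag.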



Module 2 Data Warehouse Approach

Distributed Approach

Various departments can start creating their own data marts. Each can work independently and see the ROI in a short span.
In the long run, integrating this data adds complexity, and the cost will be higher as there are more systems to maintain.

Distributed Approach to DSS


Gives only part of the answer
Requires time and effort to put the pieces together
No guarantee it's the right answer

How We Are Different

Centralized Approach

A centralized data warehouse holds the data in one place, which makes it easy to answer any business question. In the long run this has a cost advantage over a non-centralized data warehouse.
It is not very easy to implement, as it needs more time and resources. The ROI won't be seen until the implementation is completed.
So the recommended way to implement a centralized data warehouse is to start with one subject area and keep adding one subject area at a time. This way the organization gets to see the ROI at various stages.

Centralized Approach to DSS

Delivers one version of the truth for increased confidence and speed in decision-making




Module 3 Data Warehouse Implementation Steps

Typical Approach
Data Modeling is a cyclic process involving the following steps:
Requirement Gathering
Requirement Analysis
Requirement Validation
Logical Modeling
Physical Design
Implementation
Validation
The above cycle repeats for any upgrades or enhancements.

Requirement Gathering
Identify the business objectives
Identify the reporting requirements
Identify the frequency of report generation
Granularity of information
Business rules

Requirement Analysis

Study the requirements captured
Identify the subject areas
Identify the measures and criteria fields
Identify the granularity of information required

Requirement Validation

Validate the analysis with the customer
Document
Sign off

Logical Modeling

Identify facts and dimensions
Create the logical model

Physical Design

Analyze source systems with respect to the logical model
Data quality analysis
Physical design: data types, indexes, partitioning, database creation etc.
Source-to-target mapping
Capture transformation rules
Capture derivation rules for derived fields
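A source-to-target mapping with transformation and derivation rules can be captured as plain data. The sketch below is one possible shape; every column name and rule in it is hypothetical.

```python
# Hypothetical mapping: target column -> (source column, transformation rule).
MAPPING = {
    "customer_name": ("cust_nm", str.strip),
    "country_code": ("ctry", str.upper),
    "quantity": ("qty", int),
    "unit_price": ("price", float),
}

# Derivation rules compute target fields that have no direct source column.
DERIVATIONS = {
    "line_total": lambda row: row["quantity"] * row["unit_price"],
}

def map_row(source_row):
    """Apply the mapping first, then the derivation rules, to one source row."""
    target = {
        column: rule(source_row[source_column])
        for column, (source_column, rule) in MAPPING.items()
    }
    for column, derive in DERIVATIONS.items():
        target[column] = derive(target)
    return target

mapped = map_row({"cust_nm": " Acme ", "ctry": "us", "qty": "3", "price": "2.50"})
```

Keeping the rules in one table-like structure makes the mapping document and the ETL code easier to keep in step.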

Implementation

Database creation
Staging design (design extraction jobs)
Develop ETL jobs
Unit testing of ETL jobs
Schedule jobs
Test load
Data validation
Performance monitoring
ETL job tuning
Test database performance tuning
Final loading of data from source to target



Module 4 OLAP (Online Analytical Processing)


What is OLAP?
Online Analytical Processing. Viewing data in a multidimensional way.

Why OLAP?
Slice and dice for the data warehouse.
An RDBMS stores and presents data in a two-dimensional way (rows and columns).
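Slicing and dicing can be illustrated with a tiny in-memory cube. This is only a sketch: the dimensions (product, region, quarter), the unit counts and the helper names `slice_cube` and `roll_up` are invented, and a real OLAP engine stores and aggregates cubes far more efficiently.

```python
from collections import defaultdict

DIMENSIONS = ("product", "region", "quarter")

# Each cell of the cube: (product, region, quarter) -> units sold.
cube = {
    ("laptop", "east", "Q1"): 10,
    ("laptop", "west", "Q1"): 7,
    ("laptop", "east", "Q2"): 12,
    ("phone", "east", "Q1"): 20,
    ("phone", "west", "Q2"): 15,
}

def slice_cube(cube, **fixed):
    """Keep the cells matching the fixed dimension values, e.g. quarter='Q1'.

    Fixing one dimension is a slice; fixing several is a dice.
    """
    positions = {DIMENSIONS.index(name): value for name, value in fixed.items()}
    return {
        key: units
        for key, units in cube.items()
        if all(key[i] == value for i, value in positions.items())
    }

def roll_up(cube, dimension):
    """Total the measure for each value of one dimension."""
    i = DIMENSIONS.index(dimension)
    totals = defaultdict(int)
    for key, units in cube.items():
        totals[key[i]] += units
    return dict(totals)

q1 = slice_cube(cube, quarter="Q1")
q1_by_product = roll_up(q1, "product")
all_by_region = roll_up(cube, "region")
```

The same cube answers "units by product in Q1" and "units by region overall" without writing a new join for each question, which is the convenience OLAP tools offer over two-dimensional tables.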

Types of OLAP
There are three types of OLAP in the industry:
1. MOLAP: Multidimensional OLAP (e.g. MSOLAP, Essbase, Cognos)
2. ROLAP: Relational OLAP (e.g. Business Objects, MicroStrategy)
3. HOLAP: Hybrid OLAP



Module 5 Next steps in Data Warehousing

Data Mining

OLAP is like fishing (one trend at a time); data mining is like fishing with a net.
Mining tools provide sophisticated algorithms to find specific trends in the available data. Example: MS Analysis Server provides algorithms such as clustering.
Mainly used to identify sets of customers who think alike, for fraud detection etc.

Business Activity Monitoring (BAM)


BAM is the technology used to actively monitor the DW or an OLTP system for certain values. When it finds an exception, the system can run a set of processes and send the information to the relevant owners so they can take action.
Based on the findings, the relevant OLTP system can be updated immediately (conceptually this is called closing the loop between the DSS and OLTP).
Example: INFORAY is a BAM tool which you can use on the DW.
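The monitor-and-notify idea can be sketched as a list of rules checked against current values. The metric names, thresholds and owners below are hypothetical, and the code does not reflect how any particular BAM product works.

```python
def monitor(metrics, rules):
    """Check every rule against the current metrics; act on each exception."""
    alerts = []
    for rule in rules:
        value = metrics.get(rule["metric"])
        if value is not None and rule["is_exception"](value):
            alerts.append(rule["action"](rule["metric"], value))
    return alerts

def notify_owner(owner):
    """Build an action that reports the exception to the given owner."""
    def action(metric, value):
        # A real system would send a message here; the sketch returns the text.
        return f"notify {owner}: {metric}={value}"
    return action

# Hypothetical rules: which value to watch, when it is an exception, who to tell.
rules = [
    {"metric": "failed_orders", "is_exception": lambda v: v > 10,
     "action": notify_owner("order_desk")},
    {"metric": "stock_level", "is_exception": lambda v: v < 5,
     "action": notify_owner("store_manager")},
]

alerts = monitor({"failed_orders": 3, "stock_level": 2}, rules)
```

Closing the loop would mean replacing the notification action with one that writes a correction or a reorder request back to the OLTP system.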
