
Data, Data Everywhere, yet...

I can't find the data that I need
  Data scattered all across the network
  Data stored in disparate formats

I can't understand the data that I see
  How to interpret it?
  Need someone to translate

I don't get the data when it matters
  Data comes in very late
  Data collection is very time consuming

Brain Works Technologies 2013. All rights reserved


What the users want

Data should be integrated across the enterprise

Data reporting should be uniform irrespective of how it is stored

Data should be available when we want it

Historical data holds the key to understanding data over time

Can we clean, merge and enrich the data?

Enter the Data Warehouse...


[Figure: three stacked levels, from top to bottom: Knowledge, Information, Data]


Data Warehouse

A single, complete and consistent store of data obtained
from a variety of different sources, made available to end
users in a format that they can understand and use in a
business context

Definition: an integrated, subject-oriented, time-variant,
nonvolatile database that provides support for decision
making


Data Warehousing as a process

A technique for assembling and managing data from various
sources for the purpose of answering business questions,
thus enabling decisions that were previously not possible

Creating a decision support database maintained separately
from the organization's operational database


Goals of a Data Warehouse

It must make an organization's information more accessible

It must make the organization's information consistent

It must serve as a foundation for improved decision making


Integrated

The data warehouse is a centralized, consolidated database that
integrates data derived from the entire organization

  Multiple sources
  Diverse formats


Subject-Oriented

Data is arranged and optimized to provide answers to questions
from diverse functional areas

Data is organized and summarized by topic
  Sales / Marketing / Finance / etc.


Time-Variant

The data warehouse represents the flow of data through time

Can contain projected data from statistical models

Data is periodically uploaded, then time-dependent data is
recomputed


Non-volatile

Once data is entered it is NEVER removed

Represents the company's entire history

Always growing

Must support terabyte databases and multiprocessors

Read-only database for data analysis and query processing


Characteristics of Data Warehouse

Subject oriented: Data is organized based on how the users refer to it
Integrated: All inconsistencies in naming conventions and value
representations are removed
Non-volatile: Data is stored in read-only format and does not change over
time
Time variant: Data is not just current but normally a time series
Summarized: Operational data is mapped into a decision-usable format
Large volume: Time series data sets are normally quite large
Not normalized: DW data can be, and often is, redundant
Metadata: Data about the data is stored
Data sources: Data comes from both internal and external sources


Advantages of Warehousing

High query performance

Queries not visible outside warehouse

Local processing at sources unaffected

Can operate when sources unavailable

Can query data not stored in a DBMS

Extra information at warehouse
  Modify, summarize (store aggregates)
  Add historical information


Database Normalization

Database normalization is the process of removing redundant
data from your tables in order to improve storage efficiency,
data integrity, and scalability.

In the relational model, methods exist for quantifying how
efficient a database is. These classifications are called
normal forms (or NF), and there are algorithms for converting
a given database between them.

Normalization generally involves splitting existing tables
into multiple ones, which must be re-joined or linked each
time a query is issued.


Database Normalization Contd

Normal Form
Edgar F. Codd originally established three normal forms:
1NF, 2NF and 3NF. There are now others that are
generally accepted, but 3NF is widely considered to be
sufficient for most applications. Most tables when
reaching 3NF are also in BCNF (Boyce-Codd Normal
Form).



Database Normalization Contd

Table A

Title                      Author1               Author2         ISBN        Subject           Pages  Publisher
Database System Concepts   Abraham Silberschatz  Henry F. Korth  0072958863  MySQL, Computers  1168   McGraw-Hill
Operating System Concepts  Abraham Silberschatz  Henry F. Korth  0471694665  Computers         944    McGraw-Hill

Table problems:
  This table is not very efficient with storage
  This design does not protect data integrity
  This table does not scale well


Database Normalization Contd

1NF Rules:

Each table cell should contain a single value.

Each record needs to be unique.

In our Table A, we have two violations of First Normal Form:

  First, we have more than one author field.

  Second, our subject field contains more than one piece of information.
  With more than one value in a single field, it would be very difficult
  to search for all books on a given subject.


Database Normalization Contd

2NF Rules:

Rule 1 - Be in 1NF
Rule 2 - No partial dependency: every non-key attribute depends on the
whole primary key (always true when the primary key is a single column)

Subject Table
Subject_ID  Subject
1           MySQL
2           Computers

Author Table
Author_ID  Last Name     First Name
1          Silberschatz  Abraham
2          Korth         Henry

Book Table
ISBN        Title                      Pages  Publisher
0072958863  Database System Concepts   1168   McGraw-Hill
0471694665  Operating System Concepts  944    McGraw-Hill


Database Normalization Contd

Here we have a one-to-many relationship between the book table and the
publisher. A book has only one publisher, and a publisher will publish
many books. When we have a one-to-many relationship, we place a foreign
key in the Book Table, pointing to the primary key of the Publisher Table.

2NF covers the case of multi-column primary keys.

Publisher Table
Publisher_ID  Publisher Name
1             McGraw-Hill

Subject Table
Subject_ID  Subject
1           MySQL
2           Computers

Author Table
Author_ID  Last Name     First Name
1          Silberschatz  Abraham
2          Korth         Henry

Book Table
ISBN        Title                      Pages  Publisher_ID
0072958863  Database System Concepts   1168   1
0471694665  Operating System Concepts  944    1
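The Book/Publisher split above can be sketched with Python's built-in sqlite3 module. This is only an illustration under assumptions: the lower-case table and column names are chosen here, and SQLite stands in for whatever database the course actually uses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE publisher (
    publisher_id INTEGER PRIMARY KEY,
    name         TEXT NOT NULL
);
CREATE TABLE book (
    isbn         TEXT PRIMARY KEY,
    title        TEXT NOT NULL,
    pages        INTEGER,
    -- "many" side of the one-to-many relationship
    publisher_id INTEGER NOT NULL REFERENCES publisher(publisher_id)
);
INSERT INTO publisher VALUES (1, 'McGraw-Hill');
INSERT INTO book VALUES ('0072958863', 'Database System Concepts', 1168, 1);
INSERT INTO book VALUES ('0471694665', 'Operating System Concepts', 944, 1);
""")

# Re-joining the split tables gives back the original wide rows.
books_with_publisher = conn.execute("""
    SELECT b.title, p.name
    FROM book AS b
    JOIN publisher AS p ON p.publisher_id = b.publisher_id
    ORDER BY b.title
""").fetchall()

# A book that points at a non-existent publisher is rejected.
try:
    conn.execute("INSERT INTO book VALUES ('0000000000', 'Orphan', 1, 99)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
```

The publisher name is now stored once, and the foreign key protects integrity, which addresses two of the problems listed for Table A.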


Database Normalization Contd

Before: the key is (OrderId, ItemId), but OrderDate depends on OrderId alone

OrderId (PK)  ItemId (PK)  OrderDate
1             100          2009-01-01
1             101          2009-01-01

After:

Orders
OrderId (PK)  OrderDate
1             2009-01-01

Order_Items
OrderId (PK)  ItemId (PK)
1             100
1             101


Database Normalization Contd

First Normal Form (or 1NF) deals with redundancy of data across a horizontal row

Second Normal Form (or 2NF) deals with redundancy of data in vertical columns


Database Normalization Contd

3NF Rules:

Rule 1 - Be in 2NF
Rule 2 - Have no transitive functional dependencies (no non-key
attribute depends on another non-key attribute)

To move our 2NF table into 3NF we again need to divide our table.

Example: tax depends on price, not on item


Database Normalization Contd

Practice



Relationships

Relationships are created between tables using the primary key field and a
foreign key field

One to One Relationship
  One record in a table relates to one record in another table

One to Many Relationship
  One record in a table can relate to many records in another table

Many to Many Relationship
  Many records in one table can relate to many records in another table


Data Warehouse Architecture

[Figure: architecture diagram. Recoverable labels: transactional systems
and operational databases (Financial data, Marketing data, HR/ERP data,
Sales/CRM data, Legacy DB); ETL; Staging Layer; ODS; Centralized Data
Warehouse; Data Marts; OLAP server]

Source Systems

Sources:
  OLTP systems
  Range from flat files to RDBMS
  External/legacy systems


Extraction Transformation Loading

Extraction
  Capture of data from source systems
  Important to decide the frequency of extraction

Merging
  Bringing data together from different operational sources
  Choosing information from each functional system to populate
  the single occurrence of the data item in the warehouse


Extraction Transformation Loading

Conditioning
  The conversion of data types from the source to the target data
  store (the warehouse), which is typically a relational database
  E.g. an OLTP date is stored as text (DDMMYY), while the DW format
  is the Oracle DATE type

Scrubbing
  Ensuring all data meets the input validation rules which should
  have been in place when the data was captured by the operational
  system
  E.g. the country of the customer should have been entered in the
  Country field but was entered in one of the address fields


Extraction Transformation Loading

Enrichment
  Bringing in data from external sources to augment/enrich
  operational data
  E.g. currency conversion rates brought in from external sources

Validating
  The process of ensuring that the data captured is accurate and
  that the transformation process is correct
  E.g. the date of birth of a customer should not be later than
  today's date
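The conditioning and validating examples above (DDMMYY text dates, date of birth not in the future) might look roughly like this in Python. The function names, the row layout, and the idea of returning a rejection list are assumptions made for illustration, not part of any particular ETL tool.

```python
from datetime import date, datetime


def condition_ddmmyy(text):
    """Conditioning: turn a DDMMYY text value into a real date.

    Note: Python maps two-digit years 69-99 to 1969-1999 and 00-68 to 2000-2068.
    """
    return datetime.strptime(text, "%d%m%y").date()


def validate_date_of_birth(dob, today):
    """Validating: a date of birth must not be later than today's date."""
    return dob <= today


def transform(rows, today):
    """Split rows into accepted rows (with a typed date) and rejected rows."""
    accepted, rejected = [], []
    for row in rows:
        try:
            dob = condition_ddmmyy(row["dob"])
        except ValueError:
            rejected.append((row["id"], "unparseable date"))
            continue
        if not validate_date_of_birth(dob, today):
            rejected.append((row["id"], "date of birth in the future"))
            continue
        accepted.append({**row, "dob": dob})
    return accepted, rejected


# Hypothetical source rows, invented for this sketch.
rows = [
    {"id": 1, "dob": "150397"},  # 15 March 1997
    {"id": 2, "dob": "310297"},  # 31 February does not exist
    {"id": 3, "dob": "010150"},  # parsed as 1 January 2050
]
accepted, rejected = transform(rows, today=date(2013, 6, 1))
```

Keeping the rejected rows together with a reason, rather than dropping them silently, is one way to feed an error management process during loading.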


Extraction Transformation Loading

Loading

Loading the extracted and transformed data into the Staging Area
or the Data Warehouse

First-time bulk load to get the historical data into the Data
Warehouse

Periodic incremental loads to bring in modified data

The loading window should be as small as possible

Should be combined with a strong error management process to
capture the failures or rejections in the loading process

[Figure: a vendor/tool ETL engine between the Source DB and the Target DB]


ETL Process Issues & Challenges

Consumes 70-80% of project time

Heterogeneous source systems

Little or no control over source systems

Scattered source systems working in different time zones and
with different currencies

Data not captured by OLTP systems

Data quality


Incremental Load vs. Complete Refresh

Complete refresh is required when the data is being loaded into
the DW for the first time

Subsequent to that, the DW should be refreshed with incremental
loads

Some master data might require only a one-time load into the DW

Data extraction window
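A minimal sketch of the difference between the two load styles, assuming each source row carries an `updated_at` timestamp and that the warehouse remembers the time of the last extraction. Both are assumptions for illustration; real sources may need change data capture instead.

```python
from datetime import datetime


def complete_refresh(source_rows):
    """First load: take everything from the source."""
    return list(source_rows)


def incremental_load(source_rows, last_extraction):
    """Later loads: take only rows modified after the last extraction.

    Returns the changed rows and the new watermark to remember.
    """
    changed = [row for row in source_rows if row["updated_at"] > last_extraction]
    new_watermark = max(
        (row["updated_at"] for row in changed), default=last_extraction
    )
    return changed, new_watermark


# Hypothetical source rows, invented for this sketch.
source = [
    {"id": 1, "updated_at": datetime(2013, 1, 10)},
    {"id": 2, "updated_at": datetime(2013, 2, 5)},
    {"id": 3, "updated_at": datetime(2013, 3, 1)},
]
first_load = complete_refresh(source)
delta, watermark = incremental_load(source, last_extraction=datetime(2013, 2, 1))
empty_delta, same_watermark = incremental_load(source, last_extraction=watermark)
```

Because only rows newer than the watermark are moved, the extraction window shrinks with the volume of change rather than with the size of the source.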


Staging Area

An intermediate area between the operational source systems
and the data presentation area

Accessible only to skilled personnel; no user access

The structure is closer to the operational systems than to
the DW

Data arriving at different points in time is merged and then
loaded into the DW

Usually does not maintain history; only a temporary area


Why do we need a Staging Area during ETL Load?

Data must be extracted based on conditions that require you
to join two or more different systems together

Various source systems have different allotted timings for
data extraction

The ETL process involves complex data transformations that
require extra space to temporarily stage the data

Trade-off: data in the staging area occupies extra space


When to Refresh?

Periodically (e.g., every night, every week) or after
significant events

Refresh policy set by administrator based on user needs
and traffic

Different strategies might be required for different sources


Data Marts

Small Data Stores
  More manageable data sets
  Targeted to meet the needs of small groups within the organization

----------------------------------------------------

Small, single-subject data warehouse subset that provides decision
support to a small group of people
  Covers one part of the organization,
  e.g., marketing (customers, products, sales)


Data Marts Contd

Dependent Data Mart
  A data mart whose source is the data warehouse
  All dependent data marts are loaded from the same source:
  the data warehouse

Independent Data Mart
  A data mart whose source is the legacy application environment
  Each independent data mart is fed uniquely and separately by
  the individual source systems


Critical Features of an ETL framework

In a very broad sense, here are a few of the features that we
feel are critical in any ETL framework

Support for change data capture (also called delta loading or
incremental loading)

Metadata logging

Handling of multiple source formats

Restartability support


Report Samples



Bill Inmon & Ralph Kimball

Bill Inmon is known as the father of the data warehouse

Co-creator of the Corporate Information Factory

He has 35 years of experience in database technology management
and data warehouse design

He has written about a variety of topics on the building, usage,
and maintenance of the data warehouse and the Corporate Information
Factory

Ralph Kimball is known as the father of business intelligence


OLTP vs. OLAP

OLTP System: Online Transaction Processing (Operational System)
OLAP System: Online Analytical Processing (Data Warehouse)

Source of data
  OLTP: Operational data; OLTPs are the original source of the data
  OLAP: Consolidation data; OLAP data comes from the various OLTP databases

Purpose of data
  OLTP: To control and run fundamental business tasks
  OLAP: To help with planning, problem solving, and decision support

Inserts and updates
  OLTP: Short and fast inserts and updates initiated by end users
  OLAP: Periodic long-running batch jobs refresh the data

Queries
  OLTP: Relatively standardized and simple queries returning relatively
        few records
  OLAP: Often complex queries involving aggregations

Processing speed
  OLTP: Typically very fast
  OLAP: Depends on the amount of data involved; batch data refreshes and
        complex queries may take many hours; query speed can be improved
        by creating indexes

Space requirements
  OLTP: Can be relatively small if historical data is archived
  OLAP: Larger due to the existence of aggregation structures and history
        data; requires more indexes than OLTP

Database design
  OLTP: Highly normalized with many tables
  OLAP: Typically de-normalized with fewer tables; use of star and/or
        snowflake schemas

Backup and recovery
  OLTP: Backup religiously; operational data is critical to run the
        business, and data loss is likely to entail significant monetary
        loss and legal liability
  OLAP: Instead of regular backups, some environments may consider simply
        reloading the OLTP data as a recovery method


Data Warehouse Design

Design of the DW must directly reflect the way the managers
look at the business

Should capture the important measurements along with the
parameters by which these measurements are viewed

It must facilitate data analysis

The methodology by which the DW is designed is called
Dimensional Modelling (different from ER Modelling)


Dimensional Modeling Examples



Dimensional Modeling

Represents data in a standard framework

Framework is easily understandable by the end users

Contains the same information as the ER model

Facilitates data retrieval and analysis

Entities are called facts and dimensions

A generic representation of a dimensional model in which a fact table
is joined to a number of dimensions is called a star schema


Data Warehouse Models

Data Models
Relations
Stars & Snowflakes
Cubes

Operators
Slice & Dice
Roll-up, Drill down
Pivoting
Other



Data Warehouse Models Contd

Star schema: A fact table in the middle connected to a set of
dimension tables

Snowflake schema: A refinement of the star schema where some
dimensional hierarchy is normalized into a set of smaller
dimension tables, forming a shape similar to a snowflake

Fact constellations: Multiple fact tables share dimension
tables, viewed as a collection of stars, therefore called a
galaxy schema or fact constellation


Fact Data

Fact data records information on factual events that occurred
in the business: POS transactions, phone calls, banking transactions

Typically 70% of warehouse data is fact data

Important to identify and define the structure right in the first
place, as restructuring is an expensive process

The detail content of a fact is derived from the business
requirement

Recorded facts do not change, as they are events of the past


Dimension Data

Information that is used for analysing the elemental data, for
example, product hierarchy, time periods, customers, stores

It is the reference data used for analysis of facts

Organizing the information in separate reference tables offers
better query performance

It differs from fact data in that it changes over time, due to
changes in the business or reorganization

It should be structured to permit rapid changes


Compare Fact and Dimension

Fact                           Dimension
Millions to billions of rows   Tens to millions of rows
Multiple foreign keys          One primary key
Numeric                        Textual description
Does not change                Frequently modified


Dimension Table

Contains textual descriptors of the business

Fewer rows but more columns than the fact table

Linked to the fact table through its surrogate key, which appears
in the fact table as a foreign key

Dimension attributes serve as the primary source of query
constraints, groupings and report labels

Contains hierarchical information

Data is stored in a denormalized form


Dimension Table Contd

Client Dimension (CLIENT KEY is the surrogate key; CLIENT ID is the natural key)

CLIENT KEY  CLIENT ID  CLIENT NAME  CLIENT GROUP CODE  CLIENT GROUP NAME  CLIENT AREA
1           100        ABC LTD.     1234               XYZ LTD.           A1
2           200        DEF LTD.     6789               RST LTD.           A1
3           300        GHI LTD.     1234               XYZ LTD.           A2

Client Fact

CLIENT KEY  DEBTOR KEY  TIME KEY  CURRENCY KEY  AMOUNT INVESTED  AMOUNT EARNED
1           5           1         100           10,000           3,000
2           6           1         100           20,000           7,000
3           5           1         100           15,000           6,000


Dimension Table Contd

Surrogate Key

Integers that are assigned sequentially as needed to populate a dimension

Serve to join the dimension to the fact table

Better to use a surrogate key instead of a natural key

They buffer the DW environment from operational changes

  Operational codes or natural keys might get reassigned in the operational system
  Granularity of the dimension might be different from the natural key
  Natural keys might not be unique across the business
  Better for performance; natural keys might be bulky alphanumeric character strings
  There might not be a natural key available in the source system
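Sequential surrogate key assignment can be sketched as follows. The class name and the choice to key on a (source system, natural key) pair are assumptions made for this example; the pair is one way to cope with natural keys that are not unique across the business.

```python
class SurrogateKeyGenerator:
    """Hands out sequential integer keys, one per distinct natural key."""

    def __init__(self):
        self._by_natural_key = {}
        self._next_key = 1

    def key_for(self, source_system, natural_key):
        lookup = (source_system, natural_key)
        if lookup not in self._by_natural_key:
            # First time this natural key is seen: assign the next integer.
            self._by_natural_key[lookup] = self._next_key
            self._next_key += 1
        return self._by_natural_key[lookup]


keys = SurrogateKeyGenerator()
abc = keys.key_for("billing", 100)
billing_200 = keys.key_for("billing", 200)
crm_200 = keys.key_for("crm", 200)    # same code 200, but a different system
abc_again = keys.key_for("billing", 100)
```

The fact table then stores only the small integer, so a reassigned or reformatted operational code does not ripple into the facts.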


Star Schema

The star schema is a data-modeling technique used to map multidimensional
decision support data into a relational database.

Star schemas yield an easily implemented model for multidimensional data
analysis while still preserving the relational structure of the operational
database.

Four Components:
  Facts
  Dimensions
  Attributes
  Attribute hierarchies


Simple Star Schema



Identifying Facts and Dimensions

Identify the elemental transaction

Determine the key dimensions

Check if a fact is actually a dimension

Check if a dimension is actually a fact


Simple Star Schema

product  prodId  name  price
         p1      bolt  10
         p2      nut   5

store  storeId  city
       c1       nyc
       c2       sfo
       c3       la

sale  orderId  date    custId  prodId  storeId  qty  amt
      o100     1/7/97  53      p1      c1       1    12
      o102     2/7/97  53      p2      c1       2    11
      o105     3/8/97  111     p1      c3       5    50

Measures: qty, amt

customer  custId  name   address    city
          53      joe    10 main    sfo
          81      fred   12 main    sfo
          111     sally  80 willow  la
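As a rough sketch, the star schema above can be built and queried with Python's sqlite3 module. The column types and the spelling `orderId` are assumptions; the rows are the ones shown on the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables
CREATE TABLE product  (prodId TEXT PRIMARY KEY, name TEXT, price INTEGER);
CREATE TABLE store    (storeId TEXT PRIMARY KEY, city TEXT);
CREATE TABLE customer (custId INTEGER PRIMARY KEY, name TEXT, address TEXT, city TEXT);
-- Fact table: foreign keys to each dimension plus the measures qty and amt
CREATE TABLE sale (
    orderId TEXT PRIMARY KEY,
    date    TEXT,
    custId  INTEGER REFERENCES customer(custId),
    prodId  TEXT REFERENCES product(prodId),
    storeId TEXT REFERENCES store(storeId),
    qty     INTEGER,
    amt     INTEGER
);
INSERT INTO product VALUES ('p1', 'bolt', 10), ('p2', 'nut', 5);
INSERT INTO store VALUES ('c1', 'nyc'), ('c2', 'sfo'), ('c3', 'la');
INSERT INTO customer VALUES
    (53, 'joe', '10 main', 'sfo'),
    (81, 'fred', '12 main', 'sfo'),
    (111, 'sally', '80 willow', 'la');
INSERT INTO sale VALUES
    ('o100', '1/7/97', 53, 'p1', 'c1', 1, 12),
    ('o102', '2/7/97', 53, 'p2', 'c1', 2, 11),
    ('o105', '3/8/97', 111, 'p1', 'c3', 5, 50);
""")

# Dimension attributes (store city, product name) constrain and group the facts.
amount_by_store_city = conn.execute("""
    SELECT st.city, SUM(s.amt)
    FROM sale AS s
    JOIN store AS st ON st.storeId = s.storeId
    GROUP BY st.city
    ORDER BY st.city
""").fetchall()

amount_by_product = conn.execute("""
    SELECT p.name, SUM(s.amt)
    FROM sale AS s
    JOIN product AS p ON p.prodId = s.prodId
    GROUP BY p.name
    ORDER BY p.name
""").fetchall()
```

Each query joins the fact table to exactly one dimension, which is the typical shape of a star-schema query.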


Attributes

Each dimension table contains attributes. Attributes are
often used to search, filter, or classify facts.
Dimensions provide descriptive characteristics about the
facts through their attributes.


Three Dimensional View Of Sales



2D Cube Example

Fact table view:


Multi-dimensional cube:

Dimensions = 2



3D Cube Example

Fact table view:

sale  prodId  storeId  date  amt
      p1      c1       1     12
      p2      c1       1     11
      p1      c3       1     50
      p2      c2       1     8
      p1      c1       2     44
      p1      c2       2     4

Multi-dimensional cube:

day 1      c1   c2   c3
     p1    12        50
     p2    11   8

day 2      c1   c2   c3
     p1    44   4
     p2

Dimensions = 3


OLAP Operations on Dimensional Model

Aggregation
  Add up amounts by day
  In SQL: SELECT date, sum(amt) FROM SALE GROUP BY date

sale  prodId  storeId  date  amt
      p1      c1       1     12
      p2      c1       1     11
      p1      c3       1     50
      p2      c2       1     8
      p1      c1       2     44
      p1      c2       2     4

ans   date  sum
      1     81
      2     48


OLAP Operations on Dimensional Model...Contd

Aggregation
  Add up amounts by day, product
  In SQL: SELECT prodId, date, sum(amt) FROM SALE GROUP BY date, prodId

sale  prodId  storeId  date  amt
      p1      c1       1     12
      p2      c1       1     11
      p1      c3       1     50
      p2      c2       1     8
      p1      c1       2     44
      p1      c2       2     4

ans   prodId  date  amt
      p1      1     62
      p2      1     19
      p1      2     48

Roll-up: move from the (day, product) totals to the coarser per-day totals
Drill-down: move from the per-day totals to the finer (day, product) totals
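Both aggregation queries can be checked against the sample rows with Python's sqlite3 module. This is a small sketch; the column types are assumed, and ORDER BY clauses are added only to make the output order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sale (prodId TEXT, storeId TEXT, date INTEGER, amt INTEGER)")
conn.executemany(
    "INSERT INTO sale VALUES (?, ?, ?, ?)",
    [
        ("p1", "c1", 1, 12),
        ("p2", "c1", 1, 11),
        ("p1", "c3", 1, 50),
        ("p2", "c2", 1, 8),
        ("p1", "c1", 2, 44),
        ("p1", "c2", 2, 4),
    ],
)

# Coarser level: totals per day (the roll-up result).
by_day = conn.execute(
    "SELECT date, SUM(amt) FROM sale GROUP BY date ORDER BY date"
).fetchall()

# Finer level: totals per day and product (the drill-down result).
by_day_product = conn.execute(
    "SELECT prodId, date, SUM(amt) FROM sale "
    "GROUP BY date, prodId ORDER BY date, prodId"
).fetchall()
```

Summing the finer result over products reproduces the per-day totals, which is what makes roll-up and drill-down consistent with each other.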


OLAP Operations on Dimensional Model...Contd

Cube Aggregation

Detail cube:

day 1      c1   c2   c3
     p1    12        50
     p2    11   8

day 2      c1   c2   c3
     p1    44   4
     p2

Summed over days:

           c1   c2   c3
     p1    56   4    50
     p2    11   8

Summed over days and products:

           c1   c2   c3
     sum   67   12   50

Summed over days and stores:

     p1    110
     p2    19

Grand total: 129

Example cells: sale(c1,*,*) = 67, sale(c2,p2,*) = 8, sale(*,*,*) = 129
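The `sale(store, product, day)` notation, where `*` means "summed over this dimension", can be sketched in a few lines of Python. The function name `cube_cell` and the tuple layout are assumptions for this example.

```python
# (prodId, storeId, date, amt) rows from the fact table above
SALES = [
    ("p1", "c1", 1, 12),
    ("p2", "c1", 1, 11),
    ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8),
    ("p1", "c1", 2, 44),
    ("p1", "c2", 2, 4),
]


def cube_cell(store="*", product="*", day="*"):
    """Return one cell of the cube; '*' aggregates over that dimension."""
    return sum(
        amt
        for prod, st, d, amt in SALES
        if store in ("*", st) and product in ("*", prod) and day in ("*", d)
    )


total_c1 = cube_cell("c1")            # sale(c1,*,*)
total_c2_p2 = cube_cell("c2", "p2")   # sale(c2,p2,*)
grand_total = cube_cell()             # sale(*,*,*)
```

A real OLAP server would precompute and store many of these cells; this sketch recomputes each one from the detail rows.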


OLAP Operations on Dimensional Model...Contd

Slice and dice queries: select and project on one or more dimensions


OLAP Operations on Dimensional Model...Contd

Pivoting: aggregate on selected dimensions, usually 2 dimensions
(cross-tabulation)
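A cross-tabulation of product against store, summed over days, could be sketched like this. The helper name `pivot` and the nested-dictionary result are assumptions chosen for illustration.

```python
from collections import defaultdict

# (prodId, storeId, date, amt) rows from the sample fact table
SALES = [
    ("p1", "c1", 1, 12),
    ("p2", "c1", 1, 11),
    ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8),
    ("p1", "c1", 2, 44),
    ("p1", "c2", 2, 4),
]


def pivot(rows):
    """Aggregate amt on (product, store), dropping the date dimension."""
    table = defaultdict(dict)
    for prod, store, _day, amt in rows:
        table[prod][store] = table[prod].get(store, 0) + amt
    return {prod: dict(cols) for prod, cols in table.items()}


crosstab = pivot(SALES)
```

Rows of the result are products and columns are stores, which matches the "summed over days" table in the cube aggregation slide.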


Attribute Hierarchies

Attributes within dimensions can be ordered in a well-defined
attribute hierarchy.

The attribute hierarchy provides a top-down data organization that
is used for two main purposes:

  Aggregation
  Drill-down/roll-up data analysis


A Concept Hierarchy: Dimension (location)



A Location Attribute Hierarchy



Star Schema Representation

Facts and dimensions are normally represented by physical
tables in the data warehouse database.

The fact table is related to each dimension table in a
many-to-one (M:1) relationship.

Fact and dimension tables are related by foreign keys and are
subject to the primary/foreign key constraints.


Star Schema Representation for Multidimensional Analysis


Star Schema Examples

[Figures: example star schemas labeled "Video Rental" and "Auction Company"]


Star Schema Examples Contd

[Figures: example star schemas labeled "Wireless Phone Service" and "Supermarket"]


Exercise

TRANSACTION MASTER
  Client Id
  Transaction Id
  Date of Transaction
  Currency Code
  Amount

CLIENT MASTER
  Client Id
  Client Name
  Client Address 1
  Client Address 2
  Client Address 3
  Country Code
  Credit Limit

COUNTRY MASTER
  Country Code
  Country Name

User Requirement
  Total amount of transactions per month per client


Potential Solution

Fact table
  Client Master Key
  Currency Key
  Time Key
  Amount
  Last Update Date

Currency dimension
  Currency Key
  Currency Code
  Last Extraction Date

Time dimension (GRANULARITY)
  Time Key
  Month
  Year
  Last Extraction Date

Client dimension (DENORMALIZATION)
  Client Master Key
  Client Id
  Client Name
  Client Address 1
  Client Address 2
  Client Address 3
  Country Code
  Country Name
  Credit Limit
  Last Extraction Date
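The user requirement (total amount of transactions per month per client) fixes the grain of the fact table at client and month. A minimal sketch of that aggregation follows, with invented sample transactions and an assumed `(client_id, date, amount)` tuple layout.

```python
from collections import defaultdict
from datetime import date


def monthly_totals(transactions):
    """Aggregate transaction amounts to the (client, year, month) grain."""
    totals = defaultdict(int)
    for client_id, txn_date, amount in transactions:
        totals[(client_id, txn_date.year, txn_date.month)] += amount
    return dict(totals)


# Hypothetical transactions, invented for this sketch.
transactions = [
    ("C1", date(2013, 1, 5), 100),
    ("C1", date(2013, 1, 20), 250),
    ("C1", date(2013, 2, 3), 40),
    ("C2", date(2013, 1, 9), 500),
]
totals = monthly_totals(transactions)
```

Because the requirement never asks for individual days, the time dimension only needs month and year, which keeps the fact table much smaller than the transaction master.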


Convert the following E/R into a dimensional model


Dimensions revisited

Until now we have assumed dimensions to be independent of time

Although dimension attributes are relatively static, they are
not fixed forever

Business users might want to track the impact of each and
every attribute change

We can preserve the independent dimensional structure with
only relatively minor adjustments


Dimensions revisited

Classification

Slowly Changing Dimensions (SCD):
  Attributes of a dimension may undergo changes over time. Whether
  the history of changes to a particular attribute should be
  preserved in the data warehouse depends on the business
  requirement. Such an attribute is called a Slowly Changing
  Attribute, and a dimension containing such an attribute is called
  a Slowly Changing Dimension.

Rapidly Changing Dimensions:
  A dimension attribute that changes frequently is a Rapidly
  Changing Attribute. If you don't need to track the changes, the
  Rapidly Changing Attribute is no problem, but if you do need to
  track the changes, using a standard Slowly Changing Dimension
  technique can result in a huge inflation of the size of the
  dimension. One solution is to move the attribute to its own
  dimension, with a separate foreign key in the fact table. This new
  dimension is called a Rapidly Changing Dimension.


Dimensions revisited

Rapidly Changing Dimensions



Dimensions revisited

Role-playing dimension
  Dimensions which are often used for multiple purposes within
  the same database are called role-playing dimensions

  Ex: a date dimension can be used for "date of sale", as well as
  "date of delivery", or "date of hire"

Late-arriving dimension
  Late-arriving dimensions are dimensions whose records arrive
  later than the fact (measurable quantities) table records that
  refer to them


Dimensions revisited
Late-arriving dimension Contd

Ex: Let's say I have a product dimension and a sales fact table in my data
warehouse. A new product A is created in the OLTP system and sales
transactions happen for that product. Assume that when I extracted from
the OLTP system, I somehow got only the sales transactions into the
staging environment and not the product. In this case the measurable
quantity arrives in staging earlier than the dimension. This is called
a late-arriving dimension.

Handling late-arriving dimensions:

Normally we first process the dimension records and insert them into
the dimension table. Next, the fact records are processed by joining with
the dimension table. With a late-arriving dimension, the join finds no
corresponding dimension record, so those fact records are not inserted
into the fact table. To handle this we create another table into which
we insert the fact records that failed to load into the original fact
table. When we process the data next time, we use this table along with
the fact stage table in the join with the dimension table to insert
into the fact table.
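The retry approach described above can be sketched as follows. The function name, the dictionary-based dimension lookup, and the row layout are assumptions made for illustration.

```python
def load_facts(staged_facts, pending_facts, dimension_keys):
    """Join facts to the dimension; park the ones whose dimension row is missing.

    dimension_keys maps a natural product id to its surrogate key.
    Returns (rows ready for the fact table, rows to retry on the next run).
    """
    loaded, still_pending = [], []
    # Facts parked by earlier runs are retried together with the new ones.
    for fact in pending_facts + staged_facts:
        key = dimension_keys.get(fact["product_id"])
        if key is None:
            still_pending.append(fact)
        else:
            loaded.append({"product_key": key, "amount": fact["amount"]})
    return loaded, still_pending


# Run 1: product A has sales, but its dimension row has not arrived yet.
run1_loaded, pending = load_facts(
    staged_facts=[{"product_id": "A", "amount": 30}, {"product_id": "B", "amount": 10}],
    pending_facts=[],
    dimension_keys={"B": 1},
)

# Run 2: the dimension row for A has now been loaded with surrogate key 2.
run2_loaded, pending_after = load_facts(
    staged_facts=[],
    pending_facts=pending,
    dimension_keys={"B": 1, "A": 2},
)
```

No fact is lost: it simply waits in the pending table until its dimension record shows up.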


3 Basic techniques for maintaining SCDs

SCD - Type 1
Dimension Table With No Tracking Behaviour

SCD - Type 2
Dimension Table With Attribute Change Tracking Behaviour

SCD - Type 3



SCD Type 1

The new information simply overwrites the original information

No history is maintained

Before Change:
Client Master Key  Client Name  Client Country
1000               Srinivas N   India

After Change:
Client Master Key  Client Name  Client Country
1000               Srinivas N   US


SCD Type 1

Advantages
  Easiest technique in terms of implementation
Disadvantages
  All history will be lost
Usage
  About 50% of the time
When to use
  When it is not necessary for the DW to maintain history


SCD Type 2

A new record is added to the dimension to represent the new
information
The new record gets its own primary key (SURROGATE KEY)

Before Change:
Client_Key  Client ID  Name        Country  Latest Record  Effective_start_date  Effective_end_date
1000        IB113      Srinivas N  India    Y              01-Jan-1997 00:00 AM  01-Dec-2020 00:00 AM

After Change:
Client_Key  Client ID  Name        Country  Latest Record  Effective_start_date  Effective_end_date
1000        IB113      Srinivas N  India    N              01-Jan-1997 00:00 AM  11-Apr-2014 01:45 PM
1001        IB113      Srinivas N  US       Y              11-Apr-2014 01:45 PM  01-Dec-2020 00:00 AM
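A Type 2 change can be sketched in Python as "close the current row, then append a new one". The function name, the dictionary layout, and the use of 01-Dec-2020 as the open-ended end date (taken from the example above) are assumptions for illustration.

```python
from datetime import datetime

# Open-ended end date, as used in the example rows above.
HIGH_DATE = datetime(2020, 12, 1)


def apply_scd2(dimension, client_id, new_country, changed_at):
    """Expire the latest row for client_id and append a row with the new value."""
    current = next(
        row for row in dimension
        if row["client_id"] == client_id and row["latest"] == "Y"
    )
    if current["country"] == new_country:
        return  # nothing changed, so no new version is needed
    current["latest"] = "N"
    current["end"] = changed_at
    dimension.append({
        **current,
        # The new version gets its own surrogate key.
        "client_key": max(row["client_key"] for row in dimension) + 1,
        "country": new_country,
        "latest": "Y",
        "start": changed_at,
        "end": HIGH_DATE,
    })


dimension = [{
    "client_key": 1000, "client_id": "IB113", "name": "Srinivas N",
    "country": "India", "latest": "Y",
    "start": datetime(1997, 1, 1), "end": HIGH_DATE,
}]
apply_scd2(dimension, "IB113", "US", datetime(2014, 4, 11, 13, 45))
```

Facts loaded before the change keep pointing at key 1000 and facts loaded afterwards use key 1001, which is how history is preserved.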


SCD Type 2

Advantages
  Allows us to accurately store history
Disadvantages
  This will cause the table size to grow fast
  Storage and performance might become a concern
Usage
  About 50% of the time
When to use
  When it is necessary for the DW to maintain history


SCD Type 3

There will be two columns for the particular attribute of
interest: one holding the original value and one holding
the current value

Before Change:
Client Master Key  Client Name  Original Country  Current Country  Effective Date
1000               Srinivas N   India                              12-Jan-2004

After Change:
Client Master Key  Client Name  Original Country  Current Country  Effective Date
1000               Srinivas N   India             US               13-Apr-2004


SCD Type 3

Advantages
  Does not increase the table size drastically
  Allows us to keep some part of history
Disadvantages
  Will not be able to keep all history when the value of the attribute
  changes more than once
Usage
  Very rarely used
When to use
  When the number of attribute changes is finite


Type of Dimensions

Conformed Dimension

A single dimension referred to by more than one fact table

Exact copy of the same dimension used in more than one data mart

E.g. the date/time dimension table connected to the sales facts
is identical to the date dimension connected to the inventory
facts.

[Figure: TRANSACTION FACT and DAILY SUMMARY FACT both linked to the CLIENT DIMENSION]


Type of Dimensions
Contd

Junk Dimension

The junk dimension is simply a structure that provides a
convenient place to store the junk attributes
It is a convenient grouping of typically low-cardinality flags and
indicators
Can be used to handle infrequently populated, open-ended
comment fields sometimes attached to a Fact row



Type of Dimensions
Contd

Name     ID       Marital Status  Privileged
Manohar  ICI0102  Y               N
Mohan    ICI0129  N               N
Amit Z   ICI0234  Y               Y

Name     ID       Customer_ATTR_Key
Manohar  ICI0102  3
Mohan    ICI0129  1
Amit Z   ICI0234  4

Junk dimension
Customer_ATTR_Key  Marital Status  Privileged
1                  N               N
2                  N               Y
3                  Y               N
4                  Y               Y

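
One way to build such a junk dimension is to enumerate every flag combination up front and then replace the flags on each row with the matching key. The sketch below is an assumption about how this could be done, not the deck's own method; the variable names are hypothetical. Keys are numbered from 1 in the same order as the junk dimension table, so the N/N combination maps to key 1.

```python
from itertools import product

FLAG_VALUES = ["N", "Y"]

# (marital_status, privileged) -> surrogate key, numbered from 1:
# (N, N) -> 1, (N, Y) -> 2, (Y, N) -> 3, (Y, Y) -> 4
junk_dim = {
    combo: key
    for key, combo in enumerate(product(FLAG_VALUES, FLAG_VALUES), start=1)
}

# Source rows: name, id, marital status, privileged.
customers = [
    ("Manohar", "ICI0102", "Y", "N"),
    ("Mohan", "ICI0129", "N", "N"),
    ("Amit Z", "ICI0234", "Y", "Y"),
]

# Replace the two flag columns with a single key into the junk dimension.
keyed = [
    (name, cust_id, junk_dim[(marital, privileged)])
    for name, cust_id, marital, privileged in customers
]
```

With more flags the junk dimension grows as the product of their cardinalities, so it suits only low-cardinality attributes.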
Type of Dimensions
Contd

Degenerate Dimension

A degenerate dimension is a dimension which is derived from
the fact table and doesn't have its own dimension table
It is stored in the fact table rather than in a dimension table

Eg: A transaction code in a fact table

TRANSACTION FACT
CLIENT MASTER KEY
TIME KEY
CURRENCY KEY
TRANSACTION CODE  <- DEGENERATE DIMENSION
AMOUNT
LAST EXTRACTION DATE



Type of Dimensions
Contd

Degenerate Dimension

Many data warehouse transaction fact tables have a control
number, such as an invoice number, purchase order number
or policy number

If you were to have a dimension table for invoice, you would
have nearly as many entries in the dimension table as you
have in the line-item fact table. The line-item fact table is
generally the largest table by far in the data warehouse. So
joining the multimillion or multibillion row fact table to a
multimillion or multibillion row dimension table will cause
your data warehouse to take up much more disk storage than
it should, as well as significantly degrading performance



ETL Implementation for
dimensions

[Flow diagram: Source -> Lookup into target (DIM) -> Data changed?]
New: Insert
Dimension change: Update
No change: Reject
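
The lookup-and-route step can be sketched as a small classification function. This is a minimal illustration under assumptions: `classify`, `dim_by_id` and the sample rows are hypothetical, the lookup is done on the natural ID, and an update is treated as a plain overwrite.

```python
def classify(source_row, dim_by_id):
    """Return 'insert', 'update' or 'reject' for one source row."""
    existing = dim_by_id.get(source_row["id"])  # lookup into target (DIM)
    if existing is None:
        return "insert"  # new record
    changed = any(
        existing[col] != value
        for col, value in source_row.items()
        if col != "id"
    )
    return "update" if changed else "reject"  # reject = nothing to load


dim_by_id = {
    "IB113": {"name": "Srinivas N", "country": "India"},
    "IB114": {"name": "Manohar", "country": "India"},
}
source = [
    {"id": "IB113", "name": "Srinivas N", "country": "US"},   # changed
    {"id": "IB114", "name": "Manohar", "country": "India"},   # unchanged
    {"id": "IB200", "name": "Mohan", "country": "India"},     # new
]
actions = [classify(row, dim_by_id) for row in source]
```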



Type of Facts

Factless Fact

A Fact table that has no facts (measures) but captures certain
many-to-many relationships between the dimension keys
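
As a hypothetical illustration (the deck gives no example here), a factless fact could record which client attended which event on which date. The rows hold only dimension keys, and analysis is done by counting rows.

```python
from collections import Counter

# Hypothetical factless fact: dimension keys only, no measure columns.
attendance_fact = [
    {"client_key": 1000, "event_key": 7, "date_key": 20140411},
    {"client_key": 1001, "event_key": 7, "date_key": 20140411},
    {"client_key": 1000, "event_key": 8, "date_key": 20140412},
]

# "How many clients attended each event?" is answered by counting rows.
clients_per_event = Counter(row["event_key"] for row in attendance_fact)
```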



Type of Facts Contd

Additive Measures:

These are the fact measures that can be aggregated across all
dimensions and their hierarchies.

Semi-Additive Measures:

Semi-additive facts are facts that can be summed up for some
of the dimensions in the fact table, but not the others.

Non-Additive Measures:

Non-additive facts are facts that cannot be summed up for any
of the dimensions present in the fact table.
These are generally percentages and ratio metrics



Type of Facts Contd

Additive:
The "Sales in $" measure in the example above
can be summed across all three
dimensions attached to the fact table. If
we add "Sales in $" across the time
dimension we get the total sales for a
period of time; similarly, adding it across
stores gives the total sales for all stores, and
adding it across products gives the sales for
all products

Semi-Additive:
The Inventory Balance metric in the example
indicates the remaining number of units of the
product in the store at the time of the
transaction. Adding it over the time
dimension does not give a meaningful
result, but adding it for all the products in
the store gives the total inventory
count

Non-Additive:
Sales Margin % as shown in the example
above
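
The three behaviours can be checked with a few rows of made-up numbers. Everything in this sketch is hypothetical (the rows, the cost column used to derive a margin, the variable names); it only illustrates why some sums are meaningful and others are not.

```python
# Hypothetical fact rows: (day, product, sales, cost, inventory_balance)
rows = [
    (1, "P1", 100.0, 80.0, 50),
    (1, "P2", 200.0, 100.0, 30),
    (2, "P1", 100.0, 60.0, 40),
    (2, "P2", 100.0, 90.0, 25),
]

# Additive: sales can be summed across every dimension.
total_sales = sum(r[2] for r in rows)
total_cost = sum(r[3] for r in rows)

# Semi-additive: inventory can be summed across products for one day...
inventory_day2 = sum(r[4] for r in rows if r[0] == 2)
# ...but summing one product over days is not a real stock level.
p1_over_time = sum(r[4] for r in rows if r[1] == "P1")

# Non-additive: row-level margin percentages cannot be summed;
# the overall margin has to be recomputed from additive components.
row_margins = [100 * (r[2] - r[3]) / r[2] for r in rows]
sum_of_margins = sum(row_margins)
overall_margin = 100 * (total_sales - total_cost) / total_sales
```

Here `inventory_day2` is a valid total stock figure, `p1_over_time` is larger than any balance P1 ever had, and `sum_of_margins` differs from `overall_margin`.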
Types of Fact Tables

Transaction Fact Tables - These are fact tables that
contain the value of a business transaction that
occurred at a point in time. A row is inserted
for each transaction that occurs.

Periodic Snapshot Fact Tables - These are fact tables
that contain the complete snapshot of the transactions
at the end of the business period (day/week/month etc).
Take for example 10 sales transactions
for a particular product/SKU during the day. In the
Transaction Fact table we would have 10 entries, one for
each transaction, and the value of the inventory
balance would reduce with the row for each transaction.
In the Periodic Snapshot table we would store only the
end-of-day inventory balance. Periodic
Snapshot fact tables are loaded at the end
of every business period (day/week/month etc). This way
we build the fact table to provide predictable trends for
business measures
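
The 10-transaction example can be written out as a short Python sketch. The opening balance of 100, the one-unit-per-sale rule, the SKU name and the date are all assumptions made for illustration.

```python
OPENING_BALANCE = 100  # hypothetical stock at start of day

# Transaction fact: one row per sale, each carrying the running balance.
transaction_fact = []
balance = OPENING_BALANCE
for txn_no in range(1, 11):
    balance -= 1  # assumption: each sale removes one unit
    transaction_fact.append({
        "date": "2014-04-11",
        "txn_no": txn_no,
        "sku": "P1",
        "inventory_balance": balance,
    })

# Periodic snapshot fact: one row per SKU per day, end-of-day balance only.
snapshot_fact = [{
    "date": "2014-04-11",
    "sku": "P1",
    "inventory_balance": transaction_fact[-1]["inventory_balance"],
}]
```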



Types of Fact Tables

Accumulating Snapshot Fact Tables - These are a special type
of fact table applied to business processes like order
management. When an order is created, we create entries for
all the phases of the order (start to end of the order process).
Once a phase is completed, we update
the row corresponding to that event with the factual entries and the
date of the event.
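
A minimal sketch of this pattern, assuming three hypothetical phases (ordered, shipped, delivered) and one date column per phase. The function names and the order number are illustrative only.

```python
PHASES = ("ordered", "shipped", "delivered")  # hypothetical order phases


def create_order_row(order_no, order_date):
    """Create the row once, with a placeholder date for every phase."""
    row = {"order_no": order_no}
    row.update({phase + "_date": None for phase in PHASES})
    row["ordered_date"] = order_date
    return row


def complete_phase(row, phase, event_date):
    """Update the existing row in place when a phase finishes."""
    if phase not in PHASES:
        raise ValueError("unknown phase: " + phase)
    row[phase + "_date"] = event_date
    return row


order = create_order_row("PO-1", "2014-04-11")
complete_phase(order, "shipped", "2014-04-13")
```

Unlike the other fact table types, rows here are updated after insertion, so the load has to support in-place updates.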



ETL process order

STAGE TABLES -> DIMENSION TABLES -> FACT TABLES



ETL Fact Load Implementation

[Flow diagram: Stage tables -> Lookup into dimension (look for key) -> Data changed?]
New: Insert
Fact change: Update
No change: Reject



Handling Failed lookup

When building a dimensional model it is critical that facts
have accurate foreign keys pointing back to related
dimensions
Ex: We load a sales amount (fact) for a product (dimension)
that does not exist in the product dimension. In this
situation either an unknown value, such as -1, will be
placed in the fact table due to a failed lookup to the product
dimension, or the looked-up key will be pointing to the
wrong version of the dimension record

Zero or Undefined row

Product_key  Name       Promotion_code
0/-1         Undefined  UNDEF
1            P1         023AB
2            P2         0944S
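
A sketch of the undefined-row approach during a fact load, assuming -1 is chosen as the undefined key and the lookup is done on the product name. The names `UNDEFINED_KEY`, `lookup_product_key` and the staged rows are hypothetical.

```python
UNDEFINED_KEY = -1  # key of the "Undefined" row in the product dimension

# Product dimension lookup: product name -> surrogate key.
product_dim = {"P1": 1, "P2": 2}


def lookup_product_key(name):
    """Return the surrogate key, or the undefined key if the lookup fails."""
    return product_dim.get(name, UNDEFINED_KEY)


# Staged sales rows: (product name, sales amount). "P9" is not in the dimension.
stage_rows = [("P1", 120.0), ("P9", 75.0)]

fact_rows = [
    {"product_key": lookup_product_key(name), "sales_amount": amount}
    for name, amount in stage_rows
]
```

Because the fact row still points at a real dimension row, joins do not drop it, and rows carrying the undefined key can be found and corrected later.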



Aggregate fact tables

Contain pre-calculated summaries derived from the most granular (detailed) fact

Created as a specific summarization across any number of dimensions

Reduces runtime processing
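
As an illustration, a monthly sales-by-product aggregate can be derived from a daily detail fact by dropping the store dimension and rolling dates up to months. The detail rows and the function name below are hypothetical.

```python
from collections import defaultdict

# Hypothetical detail fact: (date, store, product, sales)
detail_fact = [
    ("2014-04-10", "S1", "P1", 100.0),
    ("2014-04-11", "S1", "P1", 50.0),
    ("2014-04-11", "S2", "P2", 25.0),
    ("2014-05-02", "S1", "P1", 10.0),
]


def build_monthly_product_aggregate(rows):
    """Summarize sales by (month, product), dropping the store dimension."""
    totals = defaultdict(float)
    for sale_date, _store, product, sales in rows:
        month = sale_date[:7]  # "YYYY-MM"
        totals[(month, product)] += sales
    return dict(totals)


monthly_product_sales = build_monthly_product_aggregate(detail_fact)
```

Queries at month and product level can read this smaller table directly, but it has to be rebuilt whenever the detail rows change.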



Why do we need aggregate fact tables?

Large size of the fact table


To speed up query extraction

Limitations

Must be re-aggregated each time there is a change in the source data

Do not support exploratory analysis

Limited interactive use
