Data Warehousing

Data, Data Everywhere, Yet...
I can't find the data that I need
Data is scattered all across the network
Data is stored in disparate formats

I can't understand the data that I see
How do I interpret it?
I need someone to translate

I don't get the data when it matters
Data comes in very late
Data collection is very time consuming
Brain Works Technologies 2013. All rights reserved

What the users want
Data should be integrated across the enterprise
Data reporting should be uniform irrespective of how the data is stored
Data should be available when we want it
Historical data holds the key to understanding data over time
Can we clean, merge and enrich the data?
Enter the Data Warehouse...

[Figure: pyramid with Knowledge at the top, Information in the middle, and Data at the base]

Data Warehouse
A single, complete and consistent store of data obtained from a variety of different sources, made available to end users in a format that they can understand and use in a business context
Definition: An integrated, subject-oriented, time-variant, nonvolatile database that provides support for decision making

Data Warehousing as a process
A technique for assembling and managing data from various sources for the purpose of answering business questions, thus enabling decisions that were previously not possible
Creating a decision support database maintained separately from the organization's operational database

Goals of a Data Warehouse
It must make an organization's information more accessible
It must make the organization's information consistent
It must serve as a foundation for improved decision making

Integrated
The data warehouse is a centralized, consolidated database that integrates data derived from the entire organization
Multiple sources
Diverse formats

Subject-Oriented
Data is arranged and optimized to provide answers to questions from diverse functional areas
Data is organized and summarized by topic: Sales, Marketing, Finance, etc.

Time-Variant
The Data Warehouse represents the flow of data through time
Can contain projected data from statistical models
Data is periodically uploaded, then time-dependent data is recomputed

Non-volatile
Once data is entered it is NEVER removed
Represents the company's entire history
Always growing
Must support terabyte databases and multiprocessors
Read-Only database for data analysis and query processing

Characteristics of Data Warehouse
Subject oriented: Data is organized based on how the users refer to it
Integrated: All inconsistencies in naming conventions and value representations are removed
Non-volatile: Data is stored in read-only format and does not change over time
Time variant: Data is not just the current snapshot; it is normally a time series
Summarized: Operational data is mapped into a decision-usable format
Large volume: Time series data sets are normally quite large
Not normalized: DW data can be, and often is, redundant
Metadata: Data about data is stored
Data sources: Data comes from internal and external sources

Advantages of Warehousing
High query performance
Queries not visible outside warehouse
Local processing at sources unaffected
Can operate when sources unavailable
Can query data not stored in a DBMS
Extra information at warehouse

Modify, summarize (store aggregates)
Add historical information

Database Normalization
Database normalization is the process of removing redundant data from your tables in order to improve storage efficiency, data integrity, and scalability.
In the relational model, methods exist for quantifying how efficient a database is. These classifications are called normal forms (or NF), and there are algorithms for converting a given database between them.
Normalization generally involves splitting existing tables into multiple ones, which must be re-joined or linked each time a query is issued.

Database Normalization Contd
Normal Form
Edgar F. Codd originally established three normal forms:
1NF, 2NF and 3NF. There are now others that are
generally accepted, but 3NF is widely considered to be
sufficient for most applications. Most tables when
reaching 3NF are also in BCNF (Boyce-Codd Normal
Form).

Table A
Title                      Author1               Author2         ISBN        Subject           Pages  Publisher
Database System Concepts   Abraham Silberschatz  Henry F. Korth  0072958863  MySQL, Computers  1168   McGraw-Hill
Operating System Concepts  Abraham Silberschatz  Henry F. Korth  0471694665  Computers         944    McGraw-Hill
Table problems:
This table is not very efficient with storage
This design does not protect data integrity
This table does not scale well

1NF Rules:
Each table cell should contain a single value.
Each record needs to be unique.
In our Table A, we have two violations of First Normal Form:
First, we have more than one author field.
Second, our subject field contains more than one piece of information. With more than one value in a single field, it would be very difficult to search for all books on a given subject.

2NF Rules:
Rule 1- Be in 1NF
Rule 2- No partial dependency: every non-key attribute depends on the whole primary key, not on part of it (a table with a single-column primary key satisfies this automatically)
Subject Table
Subject_ID  Subject
1           MySQL
2           Computers
Author Table
Author_ID  Last Name     First Name
1          Silberschatz  Abraham
2          Korth         Henry
Book Table
ISBN        Title                      Pages  Publisher
0072958863  Database System Concepts   1168   McGraw-Hill
0471694665  Operating System Concepts  944    McGraw-Hill

Publisher Table
Publisher_ID  Publisher Name
1             McGraw-Hill
Here we have a one-to-many relationship between the book table and the publisher. A book has only one publisher, and a publisher will publish many books. When we have a one-to-many relationship, we place a foreign key in the Book Table, pointing to the primary key of the Publisher Table.
Subject Table
Subject_ID  Subject
1           MySQL
2           Computers
Author Table
Author_ID  Last Name     First Name
1          Silberschatz  Abraham
2          Korth         Henry
Book Table
ISBN        Title                      Pages  Publisher_ID
0072958863  Database System Concepts   1168   1
0471694665  Operating System Concepts  944    1

2NF covers the case of multi-column primary keys
Before (OrderDate depends only on OrderId, which is part of the key):
OrderId (PK)  ItemId (PK)  OrderDate
1             100          2009-01-01
1             101          2009-01-01
After:
Orders
OrderId (PK)  OrderDate
1             2009-01-01
Order_Items
OrderId (PK)  ItemId (PK)
1             100
1             101


First Normal Form deals with redundancy of data across a horizontal row
Second Normal Form (or 2NF) deals with redundancy of data in vertical columns
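
The Orders / Order_Items split above can be tried out with a small script. This is only a sketch using Python's built-in sqlite3 module; the name order_lines for the unnormalized table is an assumption made for illustration, while the other table and column names follow the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Unnormalized table (hypothetical name): OrderDate repeats for every
    -- item of the same order because it depends on OrderId alone.
    CREATE TABLE order_lines (
        OrderId INTEGER, ItemId INTEGER, OrderDate TEXT,
        PRIMARY KEY (OrderId, ItemId)
    );
    INSERT INTO order_lines VALUES (1, 100, '2009-01-01'), (1, 101, '2009-01-01');

    -- 2NF: OrderDate moves to a table keyed by OrderId only.
    CREATE TABLE Orders (OrderId INTEGER PRIMARY KEY, OrderDate TEXT);
    CREATE TABLE Order_Items (
        OrderId INTEGER REFERENCES Orders(OrderId), ItemId INTEGER,
        PRIMARY KEY (OrderId, ItemId)
    );
    INSERT INTO Orders SELECT DISTINCT OrderId, OrderDate FROM order_lines;
    INSERT INTO Order_Items SELECT OrderId, ItemId FROM order_lines;
""")

original = conn.execute(
    "SELECT OrderId, ItemId, OrderDate FROM order_lines ORDER BY OrderId, ItemId"
).fetchall()

# Joining the two normalized tables gives back the original rows.
rejoined = conn.execute("""
    SELECT i.OrderId, i.ItemId, o.OrderDate
    FROM Order_Items AS i JOIN Orders AS o ON o.OrderId = i.OrderId
    ORDER BY i.OrderId, i.ItemId
""").fetchall()

order_count = conn.execute("SELECT COUNT(*) FROM Orders").fetchone()[0]
```

The order date is now stored once per order, and the join shows that no information was lost by the split.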

3NF Rules:
Rule 1- Be in 2NF
Rule 2- Have no transitive functional dependencies (no non-key attribute depends on another non-key attribute)
To move our 2NF table into 3NF, we again need to divide our table.
Example: tax depends on price, not on the item

Practice

Relationships
Relationships are created between tables using the primary key field and a
foreign key field
One to One Relationship
One record in a table relates to one record in another table
One to Many Relationship
One record in a table can relate to many records in another table
Many to Many Relationship
Many records in one table can relate to many records in another table

Relationships Contd

Data warehouse Architecture
[Figure: data warehouse architecture. Transactional systems and operational databases (Financial, Marketing, HR/ERP, Sales/CRM, Legacy DB) feed a staging layer and an ODS through ETL. A second ETL step loads the centralized data warehouse, which feeds the data marts and the OLAP server.]
Source Systems
Source:
OLTP Systems
Range from Flat files to RDBMS
External/Legacy systems

Extraction Transformation Loading
Extraction
Capture of data from Source Systems
Important to decide the frequency of Extraction
Merging
Bringing data together from different operational sources
Choosing information from each functional system to populate the single occurrence of the data item in the warehouse

Conditioning
The conversion of data types from the source to the target data store (the warehouse), which is typically a relational database
E.g. an OLTP date is stored as text (DDMMYY), while the DW format is the Oracle DATE type
Scrubbing
Ensuring all data meets the input validation rules which should have been in place when the data was captured by the operational system
E.g. the country of the customer should have been entered in the Country field but was entered in one of the address fields

Enrichment
Bringing in data from external sources to augment/enrich operational data
E.g. currency conversion rates being brought in from external sources
Validating
The process of ensuring that the data captured is accurate and the transformation process is correct
E.g. the date of birth of a customer should not be later than today's date
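
Three of the transformation steps above (conditioning, enrichment and validating) can be sketched as small functions. This is a minimal illustration, not a description of any particular ETL tool; the function names, the sample row and the conversion rates are all made up for the example.

```python
from datetime import date, datetime

def condition_date(text):
    """Conditioning: convert a DDMMYY text value into a date object."""
    return datetime.strptime(text, "%d%m%y").date()

def validate_birth_date(dob, today):
    """Validating: a date of birth must not be later than today's date."""
    return dob <= today

def enrich_amount(amount, currency, rates):
    """Enrichment: convert an amount using externally supplied rates."""
    return round(amount * rates[currency], 2)

rates_to_usd = {"USD": 1.0, "EUR": 1.1}   # made-up rates for illustration
row = {"dob": "150385", "amount": 200.0, "currency": "EUR"}

dob = condition_date(row["dob"])
is_valid = validate_birth_date(dob, today=date(2013, 1, 1))
amount_usd = enrich_amount(row["amount"], row["currency"], rates_to_usd)
```

Passing today in as a parameter, instead of reading the system clock inside the function, keeps the validation rule easy to test.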

Loading
Loading the extracted and transformed data into the Staging Area or the Data Warehouse
First-time bulk load to get the historical data into the Data Warehouse
Periodic incremental loads to bring in modified data
The loading window should be as small as possible
Should be combined with a strong error management process to capture the failures or rejections in the loading process
[Figure: a vendor/tool ETL engine sits between the source DB and the target DB]

ETL Process Issues & Challenges
Can consume 70-80% of project time
Heterogeneous source systems
Little or no control over source systems
Scattered source systems working in different time zones and with different currencies
Data not captured by OLTP systems
Data Quality

Incremental Load vs. Complete Refresh
A complete refresh is required when the data is being loaded into the DW for the first time
Subsequently, the DW should be refreshed with incremental loads
Some master data might require only a one-time load into the DW
Data Extraction Window
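
The difference between a complete refresh and an incremental load can be sketched as follows. The example assumes that every source row carries a "modified" timestamp that can serve as a watermark; that column, the function names and the sample rows are assumptions for illustration, since real sources may need other change-detection methods.

```python
def full_refresh(source_rows):
    """Complete refresh: rebuild the warehouse copy from all source rows."""
    return {row["id"]: row for row in source_rows}

def incremental_load(warehouse, source_rows, last_extracted):
    """Apply only rows modified after the previous extraction time."""
    changed = [r for r in source_rows if r["modified"] > last_extracted]
    for row in changed:
        warehouse[row["id"]] = row          # insert new or overwrite changed
    new_watermark = max([r["modified"] for r in changed], default=last_extracted)
    return new_watermark, len(changed)

source = [
    {"id": 1, "name": "ABC LTD.", "modified": "2013-01-01"},
    {"id": 2, "name": "DEF LTD.", "modified": "2013-01-01"},
]
warehouse = full_refresh(source)            # first-time load
watermark = "2013-01-01"

# Later, one source row changes and one new row appears.
source[1] = {"id": 2, "name": "DEF PLC", "modified": "2013-01-05"}
source.append({"id": 3, "name": "GHI LTD.", "modified": "2013-01-06"})
watermark, applied = incremental_load(warehouse, source, watermark)
```

Only two of the three source rows are touched by the second run, which is what keeps the loading window small.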

Staging Area
An intermediate area between the operational source systems and the data presentation area
Accessible only to skilled personnel; no user access
The structure is closer to the operational systems than to the DW
Data arriving at different points in time is merged and then loaded into the DW
Usually does not maintain history; only a temporary area

Why do we need a Staging Area during ETL Load?
Data must be extracted based on conditions that require joining two or more different systems
Various source systems have different allotted timings for data extraction
The ETL process involves complex data transformations that require extra space to temporarily stage the data
Drawback: data in the staging area occupies extra space

When to Refresh?
Periodically (e.g., every night, every week) or after significant events
Refresh policy set by the administrator based on user needs and traffic
Different strategies might be required for different sources

Data Marts
Small data stores
More manageable data sets
Targeted to meet the needs of small groups within the organization

Small, single-subject data warehouse subset that provides decision support to a small group of people
Covers part of the organization, e.g., marketing (customers, products, sales)

Data Marts Contd
Dependent Data Mart
A Data Mart whose source is the Data Warehouse
All dependent Data Marts are loaded from the same source: the Data Warehouse
Independent Data Mart
A Data Mart whose source is the legacy application environment
Each independent Data Mart is fed uniquely and separately by the individual source systems

Critical Features of an ETL framework
In a very broad sense, here are a few of the features that we feel are critical in any ETL framework:
Support for Change Data Capture (also called delta loading or incremental loading)
Metadata logging
Handling of multiple source formats
Restartability support

Report Samples

Bill Inmon & Ralph Kimball
Inmon is known as the "Father of the Data Warehouse"
Co-creator of the Corporate Information Factory
He has 35 years of experience in database technology management and data warehouse design
Bill has written about a variety of topics on the building, usage, and maintenance of the data warehouse and the Corporate Information Factory
Kimball is known as the "Father of Business Intelligence"

OLTP vs. OLAP
OLTP System: Online Transaction Processing (Operational System)
OLAP System: Online Analytical Processing (Data Warehouse)

Source of data
OLTP: Operational data; OLTPs are the original source of the data
OLAP: Consolidation data; OLAP data comes from the various OLTP databases

Purpose of data
OLTP: To control and run fundamental business tasks
OLAP: To help with planning, problem solving, and decision support

Inserts and Updates
OLTP: Short and fast inserts and updates initiated by end users
OLAP: Periodic long-running batch jobs refresh the data

Queries
OLTP: Relatively standardized and simple queries returning relatively few records
OLAP: Often complex queries involving aggregations

Processing Speed
OLTP: Typically very fast
OLAP: Depends on the amount of data involved; batch data refreshes and complex queries may take many hours; query speed can be improved by creating indexes

Space Requirements
OLTP: Can be relatively small if historical data is archived
OLAP: Larger due to the existence of aggregation structures and history data; requires more indexes than OLTP

Database Design
OLTP: Highly normalized with many tables
OLAP: Typically de-normalized with fewer tables; use of star and/or snowflake schemas

Backup and Recovery
OLTP: Backup religiously; operational data is critical to run the business, data loss is likely to entail significant monetary loss and legal liability
OLAP: Instead of regular backups, some environments may consider simply reloading the OLTP data as a recovery method

Data Warehouse Design
Design of the DW must directly reflect the way the managers look at the business
Should capture the important measurements along with the parameters by which these measurements are viewed
It must facilitate data analysis
The methodology on which the DW is designed is called Dimensional Modelling (different from ER Modelling)

Dimensional Modeling Examples

Dimensional Modeling
Represents data in a standard framework
The framework is easily understandable by the end users
Contains the same information as the ER model
Facilitates data retrieval and analysis
Entities are called Facts and Dimensions
A generic representation of a dimensional model in which a fact table is joined to a number of dimensions is called a Star Schema

Data Warehouse Models
Data Models
Relations
Stars & Snowflakes
Cubes
Operators
Slice & Dice
Roll-up, Drill down
Pivoting
Other

Data Warehouse Models Contd
Star schema: A fact table in the middle connected to a set of dimension tables
Snowflake schema: A refinement of the star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake
Fact constellations: Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called a galaxy schema or fact constellation

Fact Data
Fact data records information on factual events that occurred in the business: POS sales, phone calls, banking transactions
Typically about 70% of warehouse data is fact data
It is important to identify and define the structure right in the first place, as restructuring is an expensive process
The detailed content of a fact is derived from the business requirement
Recorded facts do not change, as they are events of the past

Dimension Data
Information that is used for analysing the elemental data, for example, product hierarchy, time periods, customers, stores
It is the reference data used for analysis of facts
Organizing the information in separate reference tables offers better query performance
It differs from fact data in that it changes over time, due to changes in the business or reorganization
It should be structured to permit rapid changes

Compare Fact and Dimension
Fact                          Dimension
Millions to billions of rows  Tens to millions of rows
Multiple foreign keys         One primary key
Numeric                       Textual description
Does not change               Frequently modified

Dimension Table
Contains textual descriptors of the business
Fewer rows but more columns than the fact table
Linked to the fact table through its surrogate key, which the fact table holds as a foreign key
Dimension attributes serve as the primary source of query constraints, groupings and report labels
Contains hierarchical information
Data is stored in a denormalized form

Dimension Table Contd
Client Dimension (CLIENT KEY is the surrogate key, CLIENT ID is the natural key)
CLIENT KEY  CLIENT ID  CLIENT NAME  CLIENT GROUP CODE  CLIENT GROUP NAME  CLIENT AREA
1           100        ABC LTD.     1234               XYZ LTD.           A1
2           200        DEF LTD.     6789               RST LTD.           A1
3           300        GHI LTD.     1234               XYZ LTD.           A2
Client Fact
CLIENT KEY  DEBTOR KEY  TIME KEY  CURRENCY KEY  AMOUNT INVESTED  AMOUNT EARNED
1           5           1         100           10,000           3,000
2           6           1         100           20,000           7,000
3           5           1         100           15,000           6,000

Dimension Table Contd
Surrogate Key
Integers that are assigned sequentially as needed to populate a dimension
Serve to join the dimension to the fact table
Better to use a surrogate key instead of a natural key
They buffer the DW environment from operational changes
Operational codes or natural keys might get reassigned in the operational system
Granularity of the dimension might be different from the natural key
Natural keys might not be unique across the business
Better for performance; natural keys might be bulky alphanumeric character strings
There might not be a natural key available in the source system
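
Sequential surrogate key assignment can be sketched in a few lines. The class below is a hypothetical helper written for illustration; in practice the keys usually come from a database sequence or identity column.

```python
class SurrogateKeyGenerator:
    """Hands out sequential integer keys, one per distinct natural key."""

    def __init__(self):
        self._keys = {}      # natural key -> surrogate key
        self._next = 1

    def key_for(self, natural_key):
        # A natural key seen before keeps the surrogate key it was given.
        if natural_key not in self._keys:
            self._keys[natural_key] = self._next
            self._next += 1
        return self._keys[natural_key]

keys = SurrogateKeyGenerator()
# Client IDs 100, 200 and 300 are the natural keys from the client dimension.
client_keys = [keys.key_for(client_id) for client_id in (100, 200, 300, 100)]
```

The fact table then stores only the small integer key, so a later change to the operational client code does not ripple into the facts.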

Star Schema
The star schema is a data-modeling technique used to map multidimensional decision support into a relational database.
Star schemas yield an easily implemented model for multidimensional data analysis while still preserving the relational structure of the operational database.
Four Components:
Facts
Dimensions
Attributes
Attribute hierarchies

Simple Star Schema

Identifying Facts and Dimensions
Elemental Transaction
Determine Key Dimensions
Check whether a fact is really a dimension
Check whether a dimension is really a fact

Simple Star Schema
product
prodId  name  price
p1      bolt  10
p2      nut   5
store
storeId  city
c1       nyc
c2       sfo
c3       la
sale (qty and amt are the measures)
orderId  date    custId  prodId  storeId  qty  amt
o100     1/7/97  53      p1      c1       1    12
o102     2/7/97  53      p2      c1       2    11
o105     3/8/97  111     p1      c3       5    50
customer
custId  name   address    city
53      joe    10 main    sfo
81      fred   12 main    sfo
111     sally  80 willow  la

Attributes
Each dimension table contains attributes. Attributes are
often used to search, filter, or classify facts.
Dimensions provide descriptive characteristics about the
facts through their attributes.
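
To show how dimension attributes are used to group facts, the simple star schema can be loaded into an in-memory SQLite database and queried. This is a sketch only: the column types and key constraints are assumptions, and the third order id is written as o105 on the assumption that it follows the pattern of the other two.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product  (prodId TEXT PRIMARY KEY, name TEXT, price INTEGER);
    CREATE TABLE store    (storeId TEXT PRIMARY KEY, city TEXT);
    CREATE TABLE customer (custId INTEGER PRIMARY KEY, name TEXT,
                           address TEXT, city TEXT);
    -- Fact table: one foreign key per dimension plus the measures.
    CREATE TABLE sale (
        orderId TEXT PRIMARY KEY, date TEXT,
        custId INTEGER REFERENCES customer(custId),
        prodId TEXT REFERENCES product(prodId),
        storeId TEXT REFERENCES store(storeId),
        qty INTEGER, amt INTEGER
    );
    INSERT INTO product VALUES ('p1','bolt',10), ('p2','nut',5);
    INSERT INTO store VALUES ('c1','nyc'), ('c2','sfo'), ('c3','la');
    INSERT INTO customer VALUES (53,'joe','10 main','sfo'),
        (81,'fred','12 main','sfo'), (111,'sally','80 willow','la');
    INSERT INTO sale VALUES ('o100','1/7/97',53,'p1','c1',1,12),
        ('o102','2/7/97',53,'p2','c1',2,11),
        ('o105','3/8/97',111,'p1','c3',5,50);
""")

# Group by dimension attributes, sum the fact measure.
amt_by_product_and_city = conn.execute("""
    SELECT p.name, s.city, SUM(f.amt)
    FROM sale AS f
    JOIN product AS p ON p.prodId = f.prodId
    JOIN store   AS s ON s.storeId = f.storeId
    GROUP BY p.name, s.city
    ORDER BY p.name, s.city
""").fetchall()
```

The query touches the fact table once and reaches each dimension through a single join, which is the access pattern the star layout is meant to support.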

Three Dimensional View Of Sales

2D Cube Example
Fact table view:

Multi-dimensional cube:
Dimensions = 2

3D Cube Example
Fact table view:
sale
prodId  storeId  date  amt
p1      c1       1     12
p2      c1       1     11
p1      c3       1     50
p2      c2       1     8
p1      c1       2     44
p1      c2       2     4
Multi-dimensional cube:
day 1   c1  c2  c3
p1      12      50
p2      11  8
day 2   c1  c2  c3
p1      44  4
p2
Dimensions = 3

OLAP Operations on Dimensional Model
Aggregation
Add up amounts by day
In SQL: SELECT date, sum(amt) FROM SALE GROUP BY date
sale
prodId  storeId  date  amt
p1      c1       1     12
p2      c1       1     11
p1      c3       1     50
p2      c2       1     8
p1      c1       2     44
p1      c2       2     4
ans
date  sum
1     81
2     48

OLAP Operations on Dimensional Model...Contd
Aggregation
Add up amounts by day, product
In SQL: SELECT prodId, date, sum(amt) FROM SALE GROUP BY date, prodId
sale
prodId  storeId  date  amt
p1      c1       1     12
p2      c1       1     11
p1      c3       1     50
p2      c2       1     8
p1      c1       2     44
p1      c2       2     4
ans
prodId  date  amt
p1      1     62
p2      1     19
p1      2     48
Rollup: from (day, product) totals to day totals
Drill-down: from day totals back to (day, product) totals
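
The two aggregation levels can be reproduced with Python's built-in sqlite3 module. This is a minimal sketch; the integer column types and the ORDER BY clauses are additions made so that the results come back in a predictable order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sale (prodId TEXT, storeId TEXT, date INTEGER, amt INTEGER)"
)
conn.executemany("INSERT INTO sale VALUES (?, ?, ?, ?)", [
    ("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8),  ("p1", "c1", 2, 44), ("p1", "c2", 2, 4),
])

# Finer level: totals per day and product (the drill-down view).
by_day_product = conn.execute(
    "SELECT prodId, date, SUM(amt) FROM sale "
    "GROUP BY date, prodId ORDER BY date, prodId"
).fetchall()

# Coarser level: totals per day (the roll-up view).
by_day = conn.execute(
    "SELECT date, SUM(amt) FROM sale GROUP BY date ORDER BY date"
).fetchall()
```

Rolling up drops prodId from the GROUP BY clause; drilling down adds it back.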

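Cube Aggregation
Summing the day 1 and day 2 planes of the cube over the date dimension:
      c1  c2  c3
p1    56  4   50
p2    11  8
Summing further over products, per store:
      c1  c2  c3
sum   67  12  50
Summing over stores, per product:
p1    110
p2    19
Grand total: 129
In cube notation: sale(c1,*,*) = 67, sale(c2,p2,*) = 8, sale(*,*,*) = 129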

Slice and Dice Queries: select and project on one or more dimensions
Pivoting: aggregate on selected dimensions, usually 2 dims (cross-tabulation)
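
Slice, dice and pivot can be illustrated on the six sale rows used in the cube examples. The function names below are chosen for this sketch only; OLAP servers expose these operations through their own query languages.

```python
from collections import defaultdict

sale = [  # (prodId, storeId, date, amt)
    ("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8),  ("p1", "c1", 2, 44), ("p1", "c2", 2, 4),
]

def slice_by_date(rows, date):
    """Slice: fix one dimension (date) to a single value and drop it."""
    return [(p, s, amt) for (p, s, d, amt) in rows if d == date]

def dice(rows, products, stores):
    """Dice: keep a sub-cube by selecting values on several dimensions."""
    return [r for r in rows if r[0] in products and r[1] in stores]

def pivot(rows):
    """Pivot: cross-tabulate product by store, summing amt over dates."""
    table = defaultdict(int)
    for prod, store, _date, amt in rows:
        table[(prod, store)] += amt
    return dict(table)

day1 = slice_by_date(sale, 1)
subcube = dice(sale, {"p1"}, {"c1", "c2"})
crosstab = pivot(sale)
```

The cross-tab gives the same product-by-store totals as summing the cube over the date dimension.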

Attribute Hierarchies
Attributes within dimensions can be ordered in a well-defined attribute hierarchy.
The attribute hierarchy provides a top-down data organization that is used for two main purposes:
Aggregation
Drill-down/roll-up data analysis

A Concept Hierarchy: Dimension (location)

A Location Attribute Hierarchy

Star Schema Representation
Facts and dimensions are normally represented by physical tables in the data warehouse database.
The fact table is related to each dimension table in a many-to-one (M:1) relationship.
Fact and dimension tables are related by foreign keys and are subject to the primary/foreign key constraints.

Star Schema Representation Contd
[Figure: star schema representation for multidimensional analysis]

Star Schema Examples
[Figures: example star schemas for Video Rental, Auction Company, Wireless Phone Service, and Supermarket]

Exercise
Source tables:
TRANSACTION MASTER: Client Id, Transaction Id, Date of Transaction, Currency Code, Amount
CLIENT MASTER: Client Id, Client Name, Client Address 1, Client Address 2, Client Address 3, Country Code, Credit Limit
COUNTRY MASTER: Country Code, Country Name
User Requirement
Total Amount of Transactions per month per Client

Potential Solution
Fact table: Client Master Key, Currency Key, Time Key, Amount, Last Update Date
Currency dimension: Currency Key, Currency Code, Last Extraction Date
Time dimension: Time Key, Month, Year, Last Extraction Date
Client dimension: Client Master Key, Client Id, Client Name, Client Address 1, Client Address 2, Client Address 3, Country Code, Country Name, Credit Limit, Last Extraction Date
Design points highlighted: GRANULARITY (time is kept at month level) and DENORMALIZATION (country name is stored in the client dimension)

Convert the following E/R into a
dimensional model

Dimensions revisited
Until now we have assumed dimensions to be independent of time
Although dimension attributes are relatively static, they are not fixed forever
Business users might want to track the impact of each and every attribute change
We can preserve the independent dimensional structure with only relatively minor adjustments

Classification
Slowly Changing Dimensions (SCD):
Attributes of a dimension that undergo changes over time. Whether the history of changes to a particular attribute should be preserved in the data warehouse depends on the business requirement. Such an attribute is called a Slowly Changing Attribute, and a dimension containing such an attribute is called a Slowly Changing Dimension.
Rapidly Changing Dimensions:
A dimension attribute that changes frequently is a Rapidly Changing Attribute. If you don't need to track the changes, the Rapidly Changing Attribute is no problem, but if you do need to track the changes, using a standard Slowly Changing Dimension technique can result in a huge inflation of the size of the dimension. One solution is to move the attribute to its own dimension, with a separate foreign key in the fact table. This new dimension is called a Rapidly Changing Dimension.

Rapidly Changing Dimensions

Role playing dimension
Dimensions which are often used for multiple purposes within the same database are called role-playing dimensions
Ex: a date dimension can be used for "date of sale", as well as "date of delivery", or "date of hire"
Late Arriving dimension
Late arriving dimensions are dimensions whose records arrive later than the fact (measurable quantities) table records that refer to them

Late Arriving Dimension Contd
Ex: Suppose the data warehouse has a product dimension and a sales fact table. A new product A is created in the OLTP system and sales transactions happen for that product. Assume that the extraction from the OLTP system brings only the sales transactions into the staging environment and not the product. In this case the measurable quantity arrives in staging earlier than the dimension record. This is called a late arriving dimension.
Handling late arriving dimensions:
Normally the dimension records are processed first and inserted into the dimension table. Next, the fact records are processed by joining with the dimension table. With a late arriving dimension, the join finds no corresponding dimension record, so those fact records are not inserted into the fact table. To handle this, create another table that holds the fact records that failed to insert into the fact table. On the next run, use this table along with the fact stage table in the join with the dimension table, and insert the matching rows into the fact table.
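
The holding-table approach described above can be sketched as follows. The function name, the dictionary used as the product dimension and the sample rows are assumptions for illustration; a real load would do the same thing with tables and joins.

```python
def load_facts(stage_facts, retry_facts, product_dim):
    """Insert facts whose product exists; hold back the rest for the next run."""
    loaded, still_waiting = [], []
    for fact in retry_facts + stage_facts:
        key = product_dim.get(fact["product_code"])
        if key is None:
            still_waiting.append(fact)      # dimension row has not arrived yet
        else:
            loaded.append({"product_key": key, "amount": fact["amount"]})
    return loaded, still_waiting

# Run 1: product A has sales in staging but no row in the dimension yet.
product_dim = {"P1": 1}                     # product code -> surrogate key
run1_loaded, retry1 = load_facts(
    [{"product_code": "P1", "amount": 10}, {"product_code": "A", "amount": 25}],
    [],
    product_dim,
)

# Run 2: the product row has arrived, so the held-back fact can be loaded.
product_dim["A"] = 2
run2_loaded, retry2 = load_facts([], retry1, product_dim)
```

Nothing is lost between runs: a fact either reaches the fact table or stays in the retry list until its dimension row exists.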

3 Basic techniques for maintaining SCDs
SCD - Type 1
Dimension Table With No Tracking Behaviour
SCD - Type 2
Dimension Table With Attribute Change Tracking Behaviour
SCD - Type 3
Dimension Table With Original and Current Value Columns
SCD Type 1
The new information simply overwrites the original information
No history is maintained
Before Change:
Client Master Key  Client Name  Client Country
1000               Srinivas N   India
After Change:
Client Master Key  Client Name  Client Country
1000               Srinivas N   US

SCD Type 1
Advantages
Easiest technique in terms of implementation
Disadvantages
All history will be lost
Usage
About 50% of the time
When to use
When it is not necessary for the DW to maintain history

SCD Type 2
A new record is added to the dimension to represent the new information
The new record gets its own primary key (surrogate key)
Before Change:
Client_Key  Client ID  Name        Country  Latest Record  Effective_start_date  Effective_end_date
1000        IB113      Srinivas N  India    Y              01-Jan-1997 00:00 AM  01-Dec-2020 00:00 AM
After Change:
Client_Key  Client ID  Name        Country  Latest Record  Effective_start_date  Effective_end_date
1000        IB113      Srinivas N  India    N              01-Jan-1997 00:00 AM  11-Apr-2014 01:45 PM
1001        IB113      Srinivas N  US       Y              11-Apr-2014 01:45 PM  01-Dec-2020 00:00 AM

SCD Type 2
Advantages
Allows us to accurately store history
Disadvantages
This will cause the table size to grow fast
Storage and Performance might become a concern
Usage
About 50% of the time
When to use
When it is necessary for the DW to maintain history
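
A Type 2 change can be sketched as "expire the current row, then append a new one". The function and field names below are assumptions for illustration, and the far-future end date simply mirrors the one used in the example rows.

```python
from datetime import datetime

HIGH_DATE = datetime(2020, 12, 1)   # open-ended end date, as in the example rows

def apply_scd2(dimension, client_id, new_country, changed_at):
    """Expire the current row for client_id and append a new current row."""
    current = next(r for r in dimension
                   if r["client_id"] == client_id and r["latest"] == "Y")
    if current["country"] == new_country:
        return current["client_key"]            # nothing changed
    current["latest"] = "N"
    current["end"] = changed_at
    new_key = max(r["client_key"] for r in dimension) + 1
    dimension.append({
        "client_key": new_key, "client_id": client_id,
        "name": current["name"], "country": new_country,
        "latest": "Y", "start": changed_at, "end": HIGH_DATE,
    })
    return new_key

client_dim = [{
    "client_key": 1000, "client_id": "IB113", "name": "Srinivas N",
    "country": "India", "latest": "Y",
    "start": datetime(1997, 1, 1), "end": HIGH_DATE,
}]
new_key = apply_scd2(client_dim, "IB113", "US", datetime(2014, 4, 11, 13, 45))
```

Facts loaded before the change keep pointing at key 1000, while new facts pick up key 1001, which is how the history stays accurate.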

SCD Type 3
There will be 2 columns for the particular attribute of interest: one indicating the original value and one indicating the current value
Before Change:
Client Master Key  Client Name  Original Country  Current Country  Effective Date
1000               Srinivas N   India                              12-Jan-2004
After Change:
Client Master Key  Client Name  Original Country  Current Country  Effective Date
1000               Srinivas N   India             US               13-Apr-2004

SCD Type 3
Advantages
Does not increase the table size drastically
Allows us to keep some part of history
Disadvantages
Will not be able to keep all history when the value of the attribute changes more than once
Usage
Very rarely used
When to use
When the number of attribute changes is finite
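
The limitation of Type 3 shows up as soon as a second change arrives. In this sketch the function name, the field names, the assumption that the current value starts out equal to the original value, and the second move (to "UK" on a made-up date) are all hypothetical.

```python
def apply_scd3(row, new_country, effective_date):
    """Overwrite the current value; the original value column is never touched."""
    updated = dict(row)
    updated["current_country"] = new_country
    updated["effective_date"] = effective_date
    return updated

before = {"client_key": 1000, "name": "Srinivas N",
          "original_country": "India", "current_country": "India",
          "effective_date": "12-Jan-2004"}
after = apply_scd3(before, "US", "13-Apr-2004")

# A second change overwrites "US", so the intermediate value is lost.
after_second_move = apply_scd3(after, "UK", "01-Jun-2005")
```

After the second change the row still holds India and the latest country, but no trace of the value in between.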

Type of Dimensions
Conformed Dimension
A single dimension referred to by more than one fact table
An exact copy of the same dimension used in more than one Data Mart
Eg: The date/time dimension table connected to the sales facts is identical to the date dimension connected to the inventory facts.
[Figure: a CLIENT DIMENSION shared by a TRANSACTION FACT and a DAILY SUMMARY FACT]

Type of Dimensions Contd
Junk Dimension
The junk dimension is simply a structure that provides a convenient place to store the junk attributes
It is a convenient grouping of typically low-cardinality flags and indicators
Can be used to handle infrequently populated, open-ended comment fields sometimes attached to a fact row

Type of Dimensions Contd
Customer table with the flags stored inline:
Name     ID       Marital Status  Privileged
Manohar  ICI0102  Y               N
Mohan    ICI0129  N               N
Amit Z   ICI0234  Y               Y
Customer table after moving the flags to a junk dimension:
Name     ID       Customer_ATTR_Key
Manohar  ICI0102  3
Mohan    ICI0129  1
Amit Z   ICI0234  4
Junk dimension
Customer_ATTR_Key  Marital Status  Privileged
1                  N               N
2                  N               Y
3                  Y               N
4                  Y               Y
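
A junk dimension for two Y/N flags can be generated as the cross product of the flag values. This is a sketch with made-up variable names. Because the key table has no key 0, a customer with Marital Status N and Privileged N resolves to key 1.

```python
from itertools import product

flags = ["N", "Y"]

# One row per combination of (marital status, privileged); keys start at 1.
junk_dim = {
    combo: key
    for key, combo in enumerate(product(flags, repeat=2), start=1)
}

customers = [  # (name, id, marital status, privileged)
    ("Manohar", "ICI0102", "Y", "N"),
    ("Mohan",   "ICI0129", "N", "N"),
    ("Amit Z",  "ICI0234", "Y", "Y"),
]

# Replace the two flag columns by a single key into the junk dimension.
customer_rows = [
    (name, cid, junk_dim[(marital, privileged)])
    for name, cid, marital, privileged in customers
]
```

Adding a third flag would double the junk dimension to eight rows while the referencing table still carries a single key column.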
Type of Dimensions Contd
Degenerate Dimension
A degenerate dimension is a dimension which is derived from the fact table and doesn't have its own dimension table
It is stored in the fact table rather than in a dimension table
Eg: A transactional code in a fact table
TRANSACTION FACT
CLIENT MASTER KEY
TIME KEY
CURRENCY KEY
TRANSACTION CODE (degenerate dimension)
AMOUNT
LAST EXTRACTION DATE

Type of Dimensions Contd
Degenerate Dimension
Many data warehouse transaction fact tables have a control number, such as an invoice number, purchase order number or policy number
If you were to have a dimension table for invoice, you would have nearly as many entries in the dimension table as you have in the line-item fact table. The line-item fact table is generally the largest table by far in the data warehouse. So joining the multimillion or multibillion row fact table to a multimillion or multibillion row dimension table will cause your data warehouse to take up much more disk storage than it should, as well as significantly degrading performance

ETL Implementation for dimensions
[Figure: source data is looked up against the target dimension (DIM). New records are inserted, changed records are updated, and records with no change are rejected.]

Type of Facts
Factless Fact
A fact table that has no facts but captures certain many-to-many relationships between the dimension keys

Type of Facts Contd
Additive Measures:
These are the fact measures which can be aggregated across all dimensions and their hierarchies.
Semi-Additive Measures:
Semi-additive facts are facts that can be summed up for some of the dimensions in the fact table, but not the others.
Non-Additive Measures:
Non-additive facts are facts that cannot be summed up for any of the dimensions present in the fact table.
These are generally percentages and ratio metrics

Type of Facts Contd
Additive:
The "Sales in $" measure in the example can be summed across all three dimensions attached to the fact table. If we add "Sales in $" across the time dimension we get the total sales for a period of time; similarly we can get total sales across all stores, and total sales for all products
Semi-Additive:
The Inventory Balance metric in the example indicates the remaining number of the product in the store at the time of the transaction. Adding it over the time dimension will not give a meaningful result, but adding it for all the products in the store will give the total inventory count
Non-Additive:
Sales Margin % in the example
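
The three kinds of measure can be compared on a handful of made-up rows. All figures below are invented for illustration; only the behaviour of each measure under summation matters.

```python
rows = [  # (day, product, sales, cost, inventory_balance) - made-up figures
    (1, "p1", 100, 60, 40),
    (1, "p2",  50, 40, 10),
    (2, "p1",  80, 48, 30),
    (2, "p2",  20, 16,  5),
]

# Additive: sales can be summed over every dimension.
total_sales = sum(r[2] for r in rows)

# Semi-additive: inventory can be summed over products for one day,
# but summing it over days counts the same stock again.
closing_inventory = sum(r[4] for r in rows if r[0] == 2)
inventory_summed_over_days = sum(r[4] for r in rows)     # not a real stock level

# Non-additive: margin % must be recomputed from its additive components.
row_margin_pcts = [100 * (r[2] - r[3]) / r[2] for r in rows]
total_cost = sum(r[3] for r in rows)
overall_margin_pct = 100 * (total_sales - total_cost) / total_sales
```

Storing the additive components (sales and cost) in the fact table and deriving the percentage at query time avoids the problem of summing or averaging ratios.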
Types of Fact Tables
Transaction Fact Tables - These are fact tables that contain the value of the business transaction that has occurred at a point in time. A row is inserted for each transaction that has occurred.
Periodic Snapshot Fact Tables - These are fact tables that contain the complete snapshot of the transactions at the end of the business period (day/week/month etc). Take for example 10 sales transactions for a particular product/SKU during the day. In a Transaction Fact table we would have 10 entries, one for each transaction, and the value for inventory balance would reduce with the row for each transaction. In a Periodic Snapshot table we would store the end-of-day inventory balance only. Periodic Snapshot fact tables are loaded at the end of every business period (day/week/month etc). This way we build the fact table to provide predictable trends for business measures

Types of Fact Tables
Accumulating Snapshot Fact Tables - These are a special type of fact table applied to business processes like order management. Here we create entries for all the phases of the order (start to end of the order process) when an order is created. Once the event that completes a phase is over, we update the row corresponding to the event with factual entries and the date of the event.

ETL process order
STAGE TABLES -> DIMENSION TABLES -> FACT TABLES

ETL Fact Load Implementation
[Figure: stage table data is looked up against the dimensions to fetch the keys. New fact records are inserted, changed records are updated, and records with no change are rejected.]

Handling Failed Lookup
When building a dimensional model it is critical that facts have accurate foreign keys pointing back to related dimensions
Ex: We load a sales amount (fact) for a product (dimension) that does not exist in the product dimension. In this situation either an unknown value, such as -1, will be placed in the fact table due to a failed lookup to the product dimension, or the looked-up key will be pointing to the wrong version of the dimension record
Zero or Undefined row
Product_key  Name       Promotion_code
0/-1         Undefined  UNDEF
1            P1         023AB
2            P2         0944S
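
Routing failed lookups to the undefined row can be sketched like this. The choice of -1 as the unknown key, the helper name and the staged rows (including the unknown product "P9") are assumptions for the example.

```python
UNKNOWN_KEY = -1   # key of the "Undefined" row in the product dimension

product_keys = {"P1": 1, "P2": 2}      # product name -> Product_key

def lookup_product_key(name):
    """Return the dimension key, or the undefined row's key if the lookup fails."""
    return product_keys.get(name, UNKNOWN_KEY)

stage = [("P1", 120), ("P9", 75)]      # (product name, sales amount)
fact_rows = [(lookup_product_key(name), amount) for name, amount in stage]

# Rows that fell back to the undefined key can be reported for follow-up.
unresolved = [row for row in fact_rows if row[0] == UNKNOWN_KEY]
```

Every fact row keeps a valid foreign key, and the rows attached to the undefined product remain easy to find and correct later.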

Aggregate fact tables
Contain pre-calculated summaries derived from the most granular (detailed) fact table
Created as a specific summarization across any number of dimensions
Reduce runtime processing

Why do we need aggregate fact tables?
Large size of the fact table
To speed up query extraction
Limitations
Must be re-aggregated each time there is a change in the source data
Do not support exploratory analysis
Limited interactive use
