Information
Data
Multiple Sources
Diverse Formats
Always growing
Normal Form
Edgar F. Codd originally established three normal forms:
1NF, 2NF and 3NF. There are now others that are
generally accepted, but 3NF is widely considered to be
sufficient for most applications. Most tables when
reaching 3NF are also in BCNF (Boyce-Codd Normal
Form).
Table A
Title                    | Author1              | Author2        | ISBN       | Subject          | Pages | Publisher
Database System Concepts | Abraham Silberschatz | Henry F. Korth | 0072958863 | MySQL, Computers | 1168  | McGraw-Hill
Table problems:
This table is not very efficient with storage
This design does not protect data integrity
This table does not scale well
1NF Rules:
Our subject field contains more than one piece of information. With more
than one value in a single field, it would be very difficult to search for all books on a
given subject.
2NF Rules:
Rule 1- Be in 1NF
Rule 2- Single Column Primary Key (no partial dependency exists
between non-key attributes and key attributes)
Subject Table
Subject_ID | Subject
1          | MySQL
Author Table
Author_ID | Last Name    | First Name
1         | Silberschatz | Abraham
Book Table
ISBN | Title | Pages | Publisher
Publisher Table
Publisher_ID | Publisher Name
1            | McGraw-Hill
Here we have a one-to-many relationship between the book table and the publisher. A book has only one publisher, and a publisher will publish many books. When we have a one-to-many relationship, we place a foreign key in the Book Table, pointing to the primary key of the Publisher Table.
2NF covers the case of multi-column primary keys
Subject Table
Subject_ID | Subject
1          | MySQL
Author Table
Author_ID | Last Name    | First Name
1         | Silberschatz | Abraham
Book Table
ISBN | Title | Pages | Publisher_ID
Order_Items
1 101
First Normal Form deals with redundancy of data across a horizontal row
Second Normal Form (or 2NF) deals with redundancy of data in vertical
columns
3NF Rules:
Rule 1- Be in 2NF
Rule 2- Has no transitive functional dependencies (There are no non-key
attributes that depend on another non-key attribute)
To move our 2NF table into 3NF we again need to divide our table.
Practice
Relationships are created between tables using the primary key field and a
foreign key field
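As a minimal sketch of the primary key / foreign key relationship described above, the snippet below builds the Publisher and Book tables in an in-memory SQLite database. The lower-case table and column names are assumptions made for illustration; only the sample values come from the slides.

```python
import sqlite3

# In-memory database; nothing is written to disk.
conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript("""
CREATE TABLE publisher (
    publisher_id INTEGER PRIMARY KEY,
    name         TEXT NOT NULL
);
CREATE TABLE book (
    isbn         TEXT PRIMARY KEY,
    title        TEXT NOT NULL,
    pages        INTEGER,
    -- foreign key on the "many" side, pointing at the publisher's primary key
    publisher_id INTEGER NOT NULL REFERENCES publisher(publisher_id)
);
""")

conn.execute("INSERT INTO publisher VALUES (1, 'McGraw-Hill')")
conn.execute(
    "INSERT INTO book VALUES ('0072958863', 'Database System Concepts', 1168, 1)"
)

# Join through the foreign key to recover the publisher name for each book.
rows = conn.execute("""
    SELECT b.title, p.name
    FROM book AS b
    JOIN publisher AS p ON p.publisher_id = b.publisher_id
""").fetchall()

# A book that points at a non-existent publisher is rejected.
try:
    conn.execute("INSERT INTO book VALUES ('0000000000', 'Orphan', 1, 99)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
```

The publisher name is stored once and every book row refers to it by key, which is what protects integrity when a publisher is renamed.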
[Figure: data warehouse architecture. Labels: source data (Financial, Marketing, HR/ERP, Sales/CRM, Legacy DB), ETL, Staging Layer, ODS, Centralized Data warehouse, ETL, Data Marts, OLAP server]
Brain Works Technologies 2013. All rights reserved
Source Systems
Source:
OLTP Systems
Range from Flat files to RDBMS
External/Legacy systems
Extraction
Capture of data from Source Systems
Important to decide the frequency of Extraction
Merging
Bringing data together from different operational
sources.
Choosing information from each functional
system to populate the single occurrence of the
data item in the warehouse
Conditioning
The conversion of data types from the source to the
target data store (warehouse) -- always a relational
database
E.g. the OLTP date is stored as text (DDMMYY); the DW format
is the Oracle DATE type
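The conditioning step for the date example could look roughly like the sketch below. It uses Python's `date` type as a stand-in for the warehouse date column, and the function name `ddmmyy_to_date` is hypothetical.

```python
from datetime import date, datetime


def ddmmyy_to_date(text: str) -> date:
    """Parse a DDMMYY string such as '130404' into a real date value.

    Raises ValueError for text that is not a valid date, so bad source
    values surface during conditioning instead of reaching the warehouse.
    """
    return datetime.strptime(text, "%d%m%y").date()


converted = ddmmyy_to_date("130404")

# An impossible date (31 February) is refused.
try:
    ddmmyy_to_date("310299")
    bad_date_rejected = False
except ValueError:
    bad_date_rejected = True
```

Note that a two-digit year is ambiguous; Python's `%y` maps 69-99 to 1969-1999 and 00-68 to 2000-2068, which may or may not suit a given source system.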
Scrubbing
Ensuring all data meets the input validation rules
which should have been in place when the data
was captured by the operational system.
E.g. the customer's country should have been
entered in the Country field but was entered in one of the
address fields
Enrichment
Bring data from external sources to augment/enrich
operational data.
E.g. currency conversion rates brought in
from external sources.
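A small sketch of enrichment with conversion rates follows. The rates, the row layout, and the function name `enrich_with_usd` are all hypothetical; in practice the rates would come from whatever external feed is in use.

```python
from decimal import Decimal

# Hypothetical rates standing in for an external rate feed.
RATES_TO_USD = {
    "USD": Decimal("1"),
    "EUR": Decimal("1.10"),
    "INR": Decimal("0.012"),
}


def enrich_with_usd(row: dict, rates: dict) -> dict:
    """Return a copy of the operational row with an added amount_usd field."""
    rate = rates[row["currency"]]  # KeyError if no external rate is available
    return {**row, "amount_usd": row["amount"] * rate}


enriched = enrich_with_usd(
    {"client_id": "IB113", "currency": "EUR", "amount": Decimal("200")},
    RATES_TO_USD,
)
```

The operational row is left untouched and the external value is added alongside it, so the original amount and currency remain available for audit.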
Validating
Process of ensuring that the data captured is
accurate and that the transformation process is correct
E.g. the date of birth of a customer should not be later
than today's date
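The date-of-birth rule can be sketched as a simple filter that separates accepted rows from rejected ones. The record layout and the function name `split_by_dob` are assumptions; the load date is passed in explicitly so the check is repeatable.

```python
from datetime import date


def split_by_dob(records, today):
    """Split records into (valid, rejected) using the rule dob <= today."""
    valid, rejected = [], []
    for record in records:
        if record["date_of_birth"] <= today:
            valid.append(record)
        else:
            rejected.append(record)
    return valid, rejected


# Hypothetical customers; the second one has a date of birth in the future.
customers = [
    {"name": "Manohar", "date_of_birth": date(1980, 5, 17)},
    {"name": "Mohan", "date_of_birth": date(2999, 1, 1)},
]
valid, rejected = split_by_dob(customers, today=date(2013, 1, 1))
```

Rejected rows are kept rather than dropped, so they can be reported back to the owners of the source system.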
Loading
Source DB Target DB
Data Quality
Some master data might require only a 1 time load into the DW
Metadata logging
Restartability support
Data Models
Relations
Stars & Snowflakes
Cubes
Operators
Slice & Dice
Roll-up, Drill down
Pivoting
Other
Fact                         | Dimension
Millions to billions of rows | Tens to millions of rows
Multiple foreign keys        | One primary key
Numeric                      | Textual description
Does not change              | Frequently modified
Client Dimension
CLIENT KEY (surrogate key) | CLIENT ID (natural key) | CLIENT NAME | CLIENT GROUP CODE | CLIENT GROUP NAME | CLIENT AREA
Client Fact
CLIENT KEY | DEBTOR KEY | TIME KEY | CURRENCY KEY | AMOUNT INVESTED | AMOUNT EARNED
1          | 5          | 1        | 100          | 10,000          | 3,000
2          | 6          | 1        | 100          | 20,000          | 7,000
3          | 5          | 1        | 100          | 15,000          | 6,000
Surrogate Key
Operational Codes or Natural Keys might get reassigned in the Operational System
Granularity of the dimension might be different from the Natural Key
Natural Keys might not be unique across business
Better for performance; Natural Keys might be bulky alphanumeric character strings
There might not be a Natural Key available in the source system
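One way to illustrate these points is a dimension table whose surrogate key is generated by the warehouse, with the natural key kept as an ordinary attribute. In the sketch below the `source` column, the second client, and the function name `client_key_for` are hypothetical additions; the same natural key arriving from two business lines receives two different surrogate keys.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE client_dim (
        client_key  INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
        source      TEXT NOT NULL,   -- hypothetical source-system tag
        client_id   TEXT NOT NULL,   -- natural key from the source system
        client_name TEXT NOT NULL,
        UNIQUE (source, client_id)
    )
""")


def client_key_for(source, client_id, client_name):
    """Return the surrogate key for a natural key, creating the row if needed."""
    row = conn.execute(
        "SELECT client_key FROM client_dim WHERE source = ? AND client_id = ?",
        (source, client_id),
    ).fetchone()
    if row is not None:
        return row[0]
    cursor = conn.execute(
        "INSERT INTO client_dim (source, client_id, client_name) VALUES (?, ?, ?)",
        (source, client_id, client_name),
    )
    return cursor.lastrowid


k1 = client_key_for("retail", "IB113", "Srinivas N")
k2 = client_key_for("corporate", "IB113", "Another Client")  # same natural key
k3 = client_key_for("retail", "IB113", "Srinivas N")         # already known
```

Fact rows then store the small integer `client_key` rather than the alphanumeric natural key.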
Four Components:
Facts
Dimensions
Attributes
Attribute hierarchies
Elemental Transaction
Measures
Dimensions = 2
Dimensions = 3
Aggregation
Add up amounts by day
In SQL: SELECT date, sum(amt) FROM SALE GROUP BY date
Aggregation
Add up amounts by day, product
In SQL: SELECT date, prodId, sum(amt) FROM SALE GROUP BY date, prodId
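The two aggregations can be tried against an in-memory SQLite table, as in the sketch below. The rows are the day 1 and day 2 figures from the cube example in these slides; the column name `custId` is an assumption, since the slides only label the columns c1 to c3. The second query lists `prodId` in the SELECT clause so that each output row shows which product it belongs to.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sale (prodId TEXT, custId TEXT, date INTEGER, amt INTEGER)"
)
conn.executemany(
    "INSERT INTO sale VALUES (?, ?, ?, ?)",
    [
        ("p1", "c1", 1, 12),
        ("p1", "c3", 1, 50),
        ("p2", "c1", 1, 11),
        ("p2", "c2", 1, 8),
        ("p1", "c1", 2, 44),
        ("p1", "c2", 2, 4),
    ],
)

# Add up amounts by day.
by_day = conn.execute(
    "SELECT date, sum(amt) FROM sale GROUP BY date ORDER BY date"
).fetchall()

# Add up amounts by day and product (a finer grain than the query above).
by_day_product = conn.execute(
    "SELECT date, prodId, sum(amt) FROM sale "
    "GROUP BY date, prodId ORDER BY date, prodId"
).fetchall()
```

Going from the second result to the first is a roll-up; going the other way is a drill-down.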
rollup
drill-down
Cube Aggregation
day 1    c1   c2   c3
p1       12        50
p2       11    8
day 2    c1   c2   c3
p1       44    4
p2
Summed over days:
         c1   c2   c3   sum
p1       56    4   50   110
p2       11    8         19
sum      67   12   50   129
sale(c1,*,*) = 67, sale(c2,p2,*) = 8, sale(*,*,*) = 129
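The `sale(c1,*,*)` notation can be read as "sum over every dimension marked `*`". The sketch below reproduces the cube totals with a small function; the argument order (customer, product, day) follows the notation on the slide, and the tuple layout of `SALES` is an assumption.

```python
# (product, customer, day, amount) for the day 1 and day 2 cells of the cube.
SALES = [
    ("p1", "c1", 1, 12),
    ("p1", "c3", 1, 50),
    ("p2", "c1", 1, 11),
    ("p2", "c2", 1, 8),
    ("p1", "c1", 2, 44),
    ("p1", "c2", 2, 4),
]


def sale(cust="*", prod="*", day="*"):
    """Sum the amounts of all cells that match; '*' matches every value."""
    return sum(
        amt
        for p, c, d, amt in SALES
        if cust in ("*", c) and prod in ("*", p) and day in ("*", d)
    )


total_c1 = sale("c1", "*", "*")     # all products, all days, customer c1
cell_c2_p2 = sale("c2", "p2", "*")  # one customer, one product, all days
grand_total = sale("*", "*", "*")   # the whole cube
```

Replacing a fixed value with `*` is a roll-up along that dimension; replacing `*` with a value is a drill-down.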
Aggregation
Drill-down/roll-up data analysis
Fact and dimension tables are related by foreign keys and are
subject to the primary/foreign key constraints.
User Requirement
Total Amount of Transactions per month per Client
GRANULARITY DENORMALIZATION
Classification
Example: suppose the data warehouse has a product dimension and a sales fact
table. A new product A is created in the OLTP system and sales
transactions occur for that product. Assume that the extract from the OLTP
system brings only the sales transactions into the staging environment
and not the product. In this case the measurable quantity arrives in
staging before its dimension record. This is called a late arriving
dimension.
Normally the dimension records are processed first and inserted into
the dimension table. The fact records are then processed by joining with
the dimension table. With a late arriving dimension, the join finds no
corresponding dimension row, so those fact records are not inserted into
the fact table. To handle this, create another table that holds the
fact records that failed to insert into the fact table. On the next
run, use this table along with the fact stage table in the join with
the dimension table to insert into the fact table.
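The holding-table approach described above can be sketched with plain lists standing in for the stage, pending, and fact tables. All names here (`load_facts`, `product_id`, `product_key`) are hypothetical.

```python
def load_facts(staged_facts, pending_facts, dimension_keys):
    """Join facts to the dimension; park the ones with no dimension row yet.

    dimension_keys maps a natural product id to its surrogate key.
    Returns (loaded, still_pending).
    """
    loaded, still_pending = [], []
    # Facts parked on earlier runs are retried together with the new stage rows.
    for fact in pending_facts + staged_facts:
        key = dimension_keys.get(fact["product_id"])
        if key is None:
            still_pending.append(fact)  # dimension row has not arrived yet
        else:
            loaded.append({"product_key": key, "amount": fact["amount"]})
    return loaded, still_pending


# Run 1: product A was sold, but its dimension row has not been extracted.
run1_loaded, pending = load_facts(
    staged_facts=[
        {"product_id": "A", "amount": 100},
        {"product_id": "B", "amount": 50},
    ],
    pending_facts=[],
    dimension_keys={"B": 1},
)

# Run 2: product A is now in the dimension, so the parked fact goes through.
run2_loaded, pending_after = load_facts(
    staged_facts=[{"product_id": "A", "amount": 30}],
    pending_facts=pending,
    dimension_keys={"B": 1, "A": 2},
)
```

No fact is lost: a row either reaches the fact table or waits in the pending list for the next run.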
SCD - Type 1
Dimension Table With No Tracking Behaviour
SCD - Type 2
Dimension Table With Attribute Change Tracking Behaviour
SCD - Type 3
SCD - Type 1
Before Change:
Client Master Key | Client Name | Client Country
1000              | Srinivas N  | India
After Change:
Client Master Key | Client Name | Client Country
1000              | Srinivas N  | US
Advantages
Easiest technique in terms of implementation
Disadvantages
All history will be lost
Usage
About 50% of the time
When to use
When it is not necessary for the DW to maintain history
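A Type 1 change amounts to overwriting the attribute in place. The sketch below uses a dictionary keyed by Client Master Key as a stand-in for the dimension table; the function name `apply_type1` is hypothetical.

```python
# Dimension rows keyed by Client Master Key, as in the example above.
client_dim = {
    1000: {"client_name": "Srinivas N", "client_country": "India"},
}


def apply_type1(dim, key, **changes):
    """Overwrite the changed attributes; the previous values are not kept."""
    dim[key].update(changes)


apply_type1(client_dim, 1000, client_country="US")
```

After the call there is still exactly one row for the client and no trace of the earlier country, which is why all history is lost.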
SCD - Type 2
Before Change:
Client_Key | Client ID | Name       | Country | Latest Record | Effective_start_date | Effective_end_date
1000       | IB113     | Srinivas N | India   | Y             | 01-Jan-1997 00:00 AM | 01-Dec-2020 00:00 AM
After Change:
Advantages
Allows us to accurately store history
Disadvantages
This will cause the table size to grow fast
Storage and Performance might become a concern
Usage
About 50% of the time
When to use
When it is necessary for the DW to maintain history
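A Type 2 change closes the current row and inserts a new one. The sketch below is one possible shape for that logic, not a definitive implementation: the open-ended `HIGH_DATE`, the change date, and the rule for generating the next `client_key` are all assumptions.

```python
from datetime import date

HIGH_DATE = date(9999, 12, 31)  # assumed marker for "still current"

rows = [
    {
        "client_key": 1000,
        "client_id": "IB113",
        "name": "Srinivas N",
        "country": "India",
        "latest_record": "Y",
        "effective_start_date": date(1997, 1, 1),
        "effective_end_date": HIGH_DATE,
    }
]


def apply_type2(rows, client_id, changes, change_date):
    """Expire the current row for client_id and append a new current row."""
    current = next(
        r for r in rows
        if r["client_id"] == client_id and r["latest_record"] == "Y"
    )
    # Close the old version.
    current["latest_record"] = "N"
    current["effective_end_date"] = change_date
    # Copy it, apply the changes, and open the new version.
    new_row = {
        **current,
        **changes,
        "client_key": max(r["client_key"] for r in rows) + 1,
        "latest_record": "Y",
        "effective_start_date": change_date,
        "effective_end_date": HIGH_DATE,
    }
    rows.append(new_row)
    return new_row


new_row = apply_type2(rows, "IB113", {"country": "US"}, date(2004, 4, 13))
```

Each change adds a row, which is the source of the table growth listed under the disadvantages.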
SCD - Type 3
Before Change:
Client Master Key | Client Name | Original Client Country | Current Client Country | Effective Date
After Change:
Client Master Key | Client Name | Original Client Country | Current Client Country | Effective Date
1000              | Srinivas N  | India                   | US                     | 13-Apr-2004
Advantages
Does not increase the table size drastically
Allows us to keep some part of history
Disadvantages
Will not be able to keep all history when the value of the attribute
changes more than once
Usage
Very rarely used
When to use
When the number of attribute changes is finite
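A Type 3 change keeps the original value in one column and overwrites the current value in another. The sketch below shows why a second change loses the intermediate value; the second change (to UK in 2010) and the function name `apply_type3` are hypothetical.

```python
from datetime import date


def apply_type3(row, new_country, effective_date):
    """Overwrite the current country; the original country column is untouched."""
    return {
        **row,
        "current_country": new_country,
        "effective_date": effective_date,
    }


before = {
    "client_master_key": 1000,
    "client_name": "Srinivas N",
    "original_country": "India",
    "current_country": "India",
    "effective_date": None,
}

after_first = apply_type3(before, "US", date(2004, 4, 13))
# Hypothetical second change: the intermediate value "US" is no longer stored.
after_second = apply_type3(after_first, "UK", date(2010, 1, 1))
```

The row count never grows, but only the original and the latest values survive.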
Conformed Dimension
CLIENT
DIMENSION
Junk Dimension
Name    | ID      | Customer_ATTR_Key
Manohar | ICI0102 | 3
Mohan   | ICI0129 | 0
Amit Z  | ICI0234 | 4
Junk dimension
Customer_ATTR_Key | Marital Status | Privileged
1                 | N              | N
2                 | N              | Y
3                 | Y              | N
4                 | Y              | Y
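A junk dimension like the one above can be generated as every combination of its low-cardinality flags. The sketch below does that for the two Y/N flags and builds a reverse lookup for the load process; the dictionary layout and names are assumptions. The key 0 that appears for one customer in the slide is not covered by the four combinations and is left out here.

```python
from itertools import product

# One row per combination of the two Y/N flags, numbered from 1.
junk_dim = {
    key: {"marital_status": marital, "privileged": privileged}
    for key, (marital, privileged) in enumerate(product("NY", repeat=2), start=1)
}

# Reverse lookup used while loading: flag values -> junk dimension key.
junk_key_for = {
    (row["marital_status"], row["privileged"]): key
    for key, row in junk_dim.items()
}

manohar_key = junk_key_for[("Y", "N")]
amit_key = junk_key_for[("Y", "Y")]
```

The referencing table then carries a single small key instead of one column per flag.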
Types of Dimensions (contd.)
Degenerate Dimension
TRANSACTION FACT
CLIENT MASTER KEY
TIME KEY
CURRENCY KEY
TRANSACTION CODE (DEGENERATE DIMENSION)
AMOUNT
LAST EXTRACTION DATE
[Figure: dimension load flow. Labels: Data Source, Dimension; new, changed, No change; Insert, Update, Reject]
Factless Fact
Additive Measures:
Semi-Additive Measures:
Non-Additive Measures:
Additive:
The "Sales in $" in the example above can be summed across all three dimensions attached to the fact table. Adding "Sales in $" across the time dimension gives the total sales for a period of time; similarly, it gives total sales across all stores and across all products.
Semi-Additive:
The Inventory Balance metric in the example indicates the remaining number of the product in the store at the time of the transaction. Adding it over the time dimension does not give a meaningful result, but adding it for all the products in the store gives the total inventory count.
Non-Additive:
Sales Margin % as shown in the example above
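The difference between the three kinds of measure shows up as soon as the rows are aggregated. The figures below are hypothetical (the original example table is not reproduced in the text), and the margin is assumed to be (sales - cost) / sales.

```python
# Hypothetical fact rows for one store over two days.
FACTS = [
    {"day": 1, "product": "p1", "sales": 100, "cost": 60, "inventory": 40},
    {"day": 1, "product": "p2", "sales": 200, "cost": 150, "inventory": 10},
    {"day": 2, "product": "p1", "sales": 100, "cost": 80, "inventory": 30},
    {"day": 2, "product": "p2", "sales": 100, "cost": 50, "inventory": 5},
]

# Additive: sales can be summed over any dimension, including time.
total_sales = sum(f["sales"] for f in FACTS)

# Semi-additive: inventory can be summed across products for one day...
inventory_day2 = sum(f["inventory"] for f in FACTS if f["day"] == 2)
# ...but summing it across days counts the same stock more than once.
inventory_over_days = sum(f["inventory"] for f in FACTS)

# Non-additive: a percentage must be recomputed from its additive parts.
total_cost = sum(f["cost"] for f in FACTS)
overall_margin = (total_sales - total_cost) / total_sales
# Averaging the per-row margins gives a different, misleading figure.
mean_of_row_margins = sum(
    (f["sales"] - f["cost"]) / f["sales"] for f in FACTS
) / len(FACTS)
```

A common design consequence is to store the additive components (sales and cost) in the fact table and derive the percentage at query time.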
Types of Fact Tables
[Figure: fact load flow. Labels: Stage tables, Dimension tables (lookup for key), Fact tables; Stage Data, new, changed, No change; Insert, Update, Reject]
Contain pre-calculated summaries derived from the most granular (detailed) fact
Limitations