Lecture 2

Review
Why data warehouses?
Difference between traditional DB and DW
Naturally evolving architecture

Architecture environment
• 4 levels of data
Advantages of data warehousing approach
High query performance
◦ Not necessarily current information
Doesn’t interfere with local processing at sources
◦ Complex queries at warehouse
◦ OLTP at information sources
Information copied at the DW
◦ Availability of historical information
Data Warehouse: Definition 1
A data warehouse is simply a
single, complete, and consistent
store of data obtained from a
variety of sources and made
available to end users in a way they
can understand and use in a
business context
Barry Devlin, IBM consultant

Data Warehouse: Definition 2
A data warehouse is a subject-oriented, integrated, nonvolatile,
and time-variant collection of data in
support of management’s decisions.
W. H. Inmon, 1996
DW: Subject-Oriented
• An insurance company
Applications: Autos, Health, Life
Subjects: Customer, Claim, Policy, Premium
Example
To learn more about your
company's sales data, you can
build a warehouse that
concentrates on sales.
Using this warehouse, you can answer questions like
◦ "Who was our best customer for this
item last year?"
Example: Customer
• Series of related tables
• Time key
• Different granularity
[Figure: the customer subject area as a series of related tables, each keyed by Cust-ID and each covering its own period (2000-2002, 2000-2003, 2003-2005). Customer data tables: From date, To date, Name, Address, Phone, ….. Customer activity table (summarized data): Month, no. of trans, avg. amount, max amount, ….. Customer activity details tables (detailed data): Activity date, amount, location, item no, invoice no, clerk no, time, …..]
DW: Integrated
Integration is the most important
aspect
[Figure: Source 1 through Source 5 feed the DW; on the way in, data is converted, reformatted, resequenced, and summarized.]
DW: Non-volatile
Nonvolatile means that, once entered into the warehouse, data should not change
DW data is loaded and accessed but not updated
Data is loaded as snapshots; when changes occur, a new snapshot is created
Result: a historical record of data is kept in the data warehouse
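The snapshot idea can be sketched in a few lines of Python. This is only an illustration under assumptions: the `warehouse` list, the `load_snapshot` and `history` helpers, the field names, and the dates are all hypothetical and do not come from the lecture.

```python
from datetime import date

# Hypothetical append-only store: rows are added, never updated in place.
warehouse = []

def load_snapshot(cust_id, address, as_of):
    """Append a new snapshot row; existing rows are left untouched."""
    warehouse.append({"cust_id": cust_id, "address": address, "as_of": as_of})

def history(cust_id):
    """Return every snapshot for a customer, oldest first."""
    rows = [r for r in warehouse if r["cust_id"] == cust_id]
    return sorted(rows, key=lambda r: r["as_of"])

load_snapshot(1, "12 Sherif St.", date(2005, 1, 31))
load_snapshot(1, "34 Fesal St.", date(2005, 2, 28))  # a change creates a new snapshot

for row in history(1):
    print(row["as_of"], row["address"])
```

Because nothing is overwritten, the old address is still available after the customer moves, which is the historical record the slide refers to.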
DW: Time Variant
Every unit of data in the DW is accurate as of a moment in time
Each record has a time marking to show when it was accurate
Time Horizon
The time horizon is the length of time data is represented in an environment
The collective time horizon for the data in the DW is much longer than that in the operational systems
Time horizon of DW ~ 5-10 years
DW contains historical information

Roadmap
Review
Data warehouse characteristics
DW design issues
◦ Granularity
 Definition
Example
◦ Data partitioning
Structure of data in DW
DW design issues
Major design issues:
◦ Granularity
◦ Partitioning
Granularity
What is granularity?
• The level of detail or summarization of the units of data in the data warehouse
Granularity scale:
LOW: Detailed data
Lightly summarized data
HIGH: Highly summarized data
Granularity is critical!
• The level of granularity
1. Affects the amount of data that resides in the data warehouse
2. Affects the types of queries that can be answered
Granularity trade-off
High level of detail | Low level of detail
Answer any query | Limited queries
Large volumes of data | Easy to manipulate
More space and I/Os | Less space and I/Os
Example
High details | Low details
Details of every phone call for customers for a month | Summary of phone calls for customers for a month
Query: Did Ali call his client in Alex last Friday?
Can be answered with some search | Cannot be answered
But in the DW we rarely look at single records!
Query: On average, how many long-distance calls did we make from Cairo to Alex?
Search through 175,000,000 records doing 45,000,000 I/Os | Search through 1,750,000 records doing 450,000 I/Os
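The difference between the two columns can be made concrete with a small Python sketch. The call records, field names, and the choice of a per-customer count as the summary are assumptions made for illustration only.

```python
from collections import Counter

# Hypothetical detailed data: one row per phone call.
calls = [
    {"caller": "Ali", "origin": "Cairo", "dest": "Alex", "day": "Fri"},
    {"caller": "Ali", "origin": "Cairo", "dest": "Giza", "day": "Sat"},
    {"caller": "Mona", "origin": "Cairo", "dest": "Alex", "day": "Fri"},
]

# Hypothetical summarized data: number of calls per customer for the month.
monthly_summary = Counter(c["caller"] for c in calls)

# The detailed data can answer a question about a single call...
ali_called_alex_friday = any(
    c["caller"] == "Ali" and c["dest"] == "Alex" and c["day"] == "Fri"
    for c in calls
)

# ...while the summary only knows how many calls each customer made.
print(ali_called_alex_friday)
print(monthly_summary["Ali"])
```

The summary holds one value per customer instead of one row per call, so it is far cheaper to scan, but the destination and day of each call are no longer recoverable from it.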
Solution
Dual level of details
• Contains both lightly summarized and truly archived data
Lightly summarized data have more details
Data with the lowest granularity are stored as archival data
True archival data are stored with all the details from the operational environment
Archival data are stored on magnetic tapes
Most DSS is done against the lightly summarized data, which is compact and efficiently accessed
True archival data is used when more details are needed
Living Sample Database
A living sample database is a subset of either the true archival or the lightly summarized data taken from the DW
Is useful when the volume of data in the DW is huge
Is Ali a customer?
• He can be a customer but not be in the living sample
Good for statistical analysis, looking at trends, and a collective look at data
How to build a living sample database?
• An extract (selection) program runs through a large database
• Selects (randomly or based on a criterion) every 100th or 1000th record
• The record is moved to the living sample DB
• The size of the resulting living sample DB is then 1/100 or 1/1000 of the original DB
• Queries are run against the resulting sample DB
• There is a trade-off between accuracy and time
Data Partitioning
• Partitioning is the breakup of data into separate physical units that can be handled independently
• Partitioning helps in:
1. Locating data
2. Accessing data
3. Archiving data
4. Deleting data
5. Monitoring data
6. Storing data
How to partition?
Data is partitioned when data of a like structure is divided into more than one physical unit of data
Data can be divided by many criteria:
• Date
• Line of business
• Geography
• All of the above
In the DW, date should be one of the partitioning criteria
Example
Insurance company, partitioning by year and claims:
◦ 2000 health claims
◦ 1999 life claims
◦ 2000 auto claims
◦ 2001 auto claims
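A small Python sketch of partitioning by year and line of business is shown here. The claim records, field names, and amounts are hypothetical, and in-memory lists stand in for the separate physical units a real warehouse would use.

```python
from collections import defaultdict

# Hypothetical claim records; the field names are illustrative only.
claims = [
    {"id": 1, "year": 2000, "line": "health", "amount": 1200},
    {"id": 2, "year": 1999, "line": "life", "amount": 50000},
    {"id": 3, "year": 2000, "line": "auto", "amount": 800},
    {"id": 4, "year": 2001, "line": "auto", "amount": 950},
    {"id": 5, "year": 2000, "line": "auto", "amount": 400},
]

def partition(records):
    """Split like-structured records into units keyed by (year, line)."""
    parts = defaultdict(list)
    for rec in records:
        parts[(rec["year"], rec["line"])].append(rec)
    return dict(parts)

partitions = partition(claims)

# Each partition can be handled independently, e.g. archived on its own.
archived = partitions.pop((1999, "life"))

# A query about 2000 auto claims touches only one partition.
auto_2000_total = sum(r["amount"] for r in partitions[(2000, "auto")])

print(sorted(partitions))
print(auto_2000_total)
```

Archiving the 1999 life claims does not touch any other unit, and the 2000 auto query reads two records instead of scanning all five.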
Structure of data in the DW
Simple cumulative data
Rolling summary data
Simple direct file
Continuous file
The simple cumulative structure
Daily transactions (operational data) are summarized into one daily summary per day:
Jan1 Jan2 Jan3 ……..
Feb1 Feb2 Feb3 ……..
Mar1 Mar2 Mar3 ……..

Rolling summary data
Daily transactions (operational data) are summarized daily, then rolled up:
Daily summary: day1 day2 day3 ……..
Weekly summary: week1 week2 week3 ……..
Monthly summary: mon1 mon2 mon3 ……..
Yearly summary: year1 year2 year3 ……..

Comparison
Simple cumulative data | Rolling summary data
1. Much storage required | 1. Very compact
2. No loss of details | 2. Some loss of detail
3. Much processing needed to do anything with the data | 3. The older the data gets, the less detail is kept
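The two structures can be contrasted with a short Python sketch. It is a simplification under assumptions: only the day-to-week roll-up is shown, the daily totals are invented, and the function names are hypothetical.

```python
def simple_cumulative(daily_totals):
    """Keep one summary per day, forever."""
    return list(daily_totals)

def rolling_summary(daily_totals, days_per_week=7):
    """Keep days only for the current week; fold each full week into one total."""
    weeks, days = [], []
    for total in daily_totals:
        days.append(total)
        if len(days) == days_per_week:
            weeks.append(sum(days))  # the seven daily values are dropped here
            days = []
    return {"weeks": weeks, "days": days}

# Hypothetical daily transaction totals for 16 days.
daily = [10, 20, 30, 40, 50, 60, 70,
         5, 5, 5, 5, 5, 5, 5,
         1, 2]

cumulative = simple_cumulative(daily)
rolling = rolling_summary(daily)

print(len(cumulative))  # 16 stored values, no detail lost
print(rolling)          # {'weeks': [280, 35], 'days': [1, 2]}
```

The rolling structure stores 4 values instead of 16, but the individual days of the first two weeks can no longer be recovered. Rolling weeks into months and months into years would follow the same pattern.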
Simple direct file
A simple direct file represents a snapshot of operational data taken as of one instant in time
January customers (snapshot of operational data):
J Ali 123 Tahrir St.
R Ramy 456 Giza St.
H Mona 12 Sherif St.
L Laila 3 Fahmy St.
…………………
Continuous file
A continuous file is created from 2 or more simple direct files
January customers:
J Ali 123 Tahrir St.
R Ramy 456 Giza St.
H Sayed 12 Sherif St.
L Lotfy 3 Fahmy St.
…………………
February customers:
K Khalid 12 Helwan St.
R Kamal 46 Gohar St.
H Sayed 34 Fesal St.
…………………
Continuous file:
J Ali jan-present 123 Tahrir St.
R Ramy jan-present 456 Giza St.
H Sayed jan-jan 12 Sherif St.
H Sayed feb-present 34 Fesal St.
L Lotfy jan-present 3 Fahmy St.
K Khalid feb-present 12 Helwan St.
R Kamal feb-present 46 Gohar St.
…………………
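One way to merge snapshots into a continuous file is sketched below. It is a sketch under stated assumptions, not the lecture's procedure: `build_continuous` is a hypothetical helper, each monthly file is assumed to be a full snapshot of all customers, and customers who disappear from a later snapshot are not handled.

```python
def build_continuous(snapshots):
    """Merge ordered (period, {customer: address}) snapshots.

    Each output row is [customer, from_period, to_period, address];
    to_period is "present" for the row that is still current.
    """
    rows = []
    current = {}        # customer -> index of that customer's open row
    prev_period = None
    for period, snapshot in snapshots:
        for customer, address in snapshot.items():
            idx = current.get(customer)
            if idx is not None and rows[idx][3] == address:
                continue                    # unchanged: the row stays open
            if idx is not None:
                rows[idx][2] = prev_period  # close the old row
            rows.append([customer, period, "present", address])
            current[customer] = len(rows) - 1
        prev_period = period
    return rows

# Assumption: each monthly file lists every customer, changed or not.
january = {"J Ali": "123 Tahrir St.", "H Sayed": "12 Sherif St."}
february = {"J Ali": "123 Tahrir St.", "H Sayed": "34 Fesal St.",
            "K Khalid": "12 Helwan St."}

for row in build_continuous([("jan", january), ("feb", february)]):
    print(*row)
```

Ali's row stays open from January, Sayed's January address is closed as jan-jan and a new feb-present row is opened, and Khalid appears for the first time in February, matching the pattern in the slide.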
Data heterogeneity
Data in the DW is heterogeneous
Data is divided into major subdivisions called subject areas
Subject areas are the lines of major subjects of the corporation
Data in a subject area is divided into tables
Each table has its own data along the same subject area thread
Within the physical tables, there are different occurrences of data values
How to handle incorrect data?
A bank entry on July 2nd was mistakenly made for 5000 L.E. instead of 750 L.E.; the mistake was discovered on August 15th
1. Go back to the incorrect data and update
• Neat and clean solution
• Data integrity is lost
• Update in the DW environment
2. Enter offsetting entries (-5000, +750 on Aug. 16th)
• Many entries need to be fixed
• Formula of correction can be complex
3. Reset the account to proper value on Aug. 16th
• New entry with current balance 750
• Does not account for the earlier error
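The offsetting-entries option can be illustrated with a tiny append-only ledger in Python. The `ledger` structure and `balance` helper are hypothetical, and the sketch assumes the correction cancels the wrong 5000 and records the intended 750 as two new entries.

```python
# Hypothetical append-only ledger of (date, amount) entries for one account.
ledger = [("Jul 2", 5000)]        # wrong: should have been 750

def balance(entries):
    """Sum all entries to get the current balance."""
    return sum(amount for _, amount in entries)

# Offsetting entries, dated when the correction is made.
ledger.append(("Aug 16", -5000))  # cancel the wrong amount
ledger.append(("Aug 16", 750))    # record the intended amount

print(balance(ledger))            # 750
print(ledger[0])                  # ('Jul 2', 5000): the original entry is untouched
```

The balance is correct from August 16th onward and the July 2nd entry is still there, so reports run before the correction can still be reconciled. The cost is that a single error now takes extra entries to repair.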
Data warehouse architecture
Basic architecture
Architecture with staging area
Architecture with data marts
