
Review

Why data warehouses?

Difference between traditional DB and DW

Naturally evolving architecture


Architecture environment
 4 levels of data
Advantages of the data warehousing approach
High query performance
◦ Not necessarily current information

Doesn’t interfere with local processing at sources
◦ Complex queries at warehouse
◦ OLTP at information sources

Information copied at the DW
◦ Availability of historical information
Data Warehouse: Definition 1
A data warehouse is simply a
single, complete, and consistent
store of data obtained from a
variety of sources and made
available to end users in a way they
can understand and use in a
business context

Barry Devlin, IBM consultant


Data Warehouse: Definition 2

A data warehouse is a subject-oriented, integrated, nonvolatile,
and time-variant collection of data in
support of management’s decisions.

W. H. Inmon, 1996
DW: Subject-Oriented
• An insurance company

Applications: Autos, Health, Life
Subjects: Customer, Claim, Policy, Premium
Example
To learn more about your
company's sales data, you can
build a warehouse that
concentrates on sales.

Using this warehouse, you can answer questions like
◦ "Who was our best customer for this
item last year?"
Example: Customer
• Series of related tables
• Time key
• Different granularity

[Figure: the Customer subject area as a series of related tables, each keyed by Cust-ID]
Detailed data
◦ Customer data (2000-2002): Cust-ID, From date, To date, Name, Address, Phone, …..
◦ Customer data (2003-2005): Cust-ID, From date, To date, Name, Address, Phone, …..
◦ Customer activity details (2000-2003) and (2003-2005): Cust-ID, Activity date, amount, location, and fields such as clerk no, item no, time, invoice no, …..
Summarized data
◦ Customer activity (2000-2003): Cust-ID, Month, no. of trans, avg. amount, max, …..
DW: Integrated
Integration is the most important aspect

[Figure: data from Source 1 through Source 5 is converted, reformatted, resequenced, and summarized on its way into the DW]
DW: Non-volatile
Nonvolatile means that, once entered into the warehouse, data should not change

DW data is loaded and accessed but not updated

Data is loaded as snapshots; when changes occur, a new snapshot is created

Result: a historical record of data is kept in the data warehouse
DW: Time Variant
Every unit of data in the DW is accurate as of a moment in time

Each record has a time marking to show when it was accurate
Time Horizon
The time horizon is the length of time data is represented in an environment

The collective time horizon for the data in the DW is much longer than that in the operational systems

Time horizon of DW ~ 5-10 years

DW contains historical information


Roadmap
Review
Data warehouse characteristics
DW design issues
◦ Granularity
 Definition
Example
◦ Data partitioning
Structure of data in DW
DW design issues

Major design issues:

◦ Granularity

◦ Partitioning
Granularity
What is granularity?
 The level of detail or summarization of
the units of data in the data warehouse

Granularity scale, from LOW to HIGH:
Detailed data (LOW) → Lightly summarized data → Highly summarized data (HIGH)
Granularity is critical !!
 The level of granularity

1. Affects the amount of data that resides in the data warehouse
2. Affects the types of queries that can be answered
Granularity trade-off

High level of detail        Low level of detail
Answer any query            Limited queries
Large volumes of data       Easy to manipulate
More space and I/Os         Less space and I/Os
Example
High details                              Low details
Details of every phone call for           Summary of phone calls for
customers for a month                     customers for a month

Did Ali call his client in Alex last Friday?
Can be answered with some search          Cannot be answered

But in a DW we rarely look at single records!

On average, how many long-distance calls were made from Cairo to Alex?
Search through 175,000,000 records,       Search through 1,750,000 records,
doing 45,000,000 I/Os                     doing 450,000 I/Os
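The phone-call example can be sketched in a few lines of Python. This is only an illustration under assumptions: the record layout, the names, the dates, and the helper functions `summarize`, `average_calls_per_caller`, and `called_on` are all hypothetical and do not come from the slides. The sketch shows that the aggregate question can be answered from the monthly summary, while the single-record question needs the detailed data.

```python
from collections import defaultdict
from datetime import date

# Hypothetical detailed call records: (caller, origin, destination, call_date).
calls = [
    ("Ali", "Cairo", "Alex", date(2005, 3, 4)),
    ("Ali", "Cairo", "Alex", date(2005, 3, 11)),
    ("Mona", "Cairo", "Alex", date(2005, 3, 11)),
    ("Mona", "Cairo", "Giza", date(2005, 3, 12)),
]


def summarize(records):
    """Collapse detail into one count per (caller, origin, destination, year, month)."""
    summary = defaultdict(int)
    for caller, origin, dest, day in records:
        summary[(caller, origin, dest, day.year, day.month)] += 1
    return dict(summary)


def average_calls_per_caller(summary, origin, dest):
    """Aggregate question: answerable from the summary alone."""
    counts = [n for (_, o, d, _, _), n in summary.items() if o == origin and d == dest]
    return sum(counts) / len(counts) if counts else 0.0


def called_on(records, caller, dest, day):
    """Single-record question: needs the detailed data, the summary has no dates."""
    return any(c == caller and d == dest and when == day for c, _, d, when in records)


monthly = summarize(calls)
print(average_calls_per_caller(monthly, "Cairo", "Alex"))
print(called_on(calls, "Ali", "Alex", date(2005, 3, 11)))
```

The summary holds fewer rows than the detail, which is where the savings in space and I/O come from, but the call dates are gone once the data is summarized.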
Solution
Dual level of details
 Contains both lightly summarized and truly archived data
Lightly summarized data has more detail than highly summarized data
Data with the lowest granularity is stored as archival data
True archival data is stored with all the details from the operational environment
Archival data is stored on magnetic tapes
Most DSS processing is done against the lightly summarized data, which is compact and efficiently accessed
True archival data is used when more details are needed
Living Sample Database
A living sample database is a subset of either the true archival or the lightly summarized data taken from the DW
It is useful when the volume of data in the DW is huge
It cannot reliably answer questions about individual records, such as "Is Ali a customer?"
◦ Ali can be a customer but not be in the living sample
It is good for statistical analysis, looking at trends, and taking a collective look at the data
How to build a living sample database?
• An extract (selection) program runs through a large database
• It selects (randomly or based on some criterion) every 100th or 1000th record
• Each selected record is moved to the living sample DB
• The size of the resulting living sample DB is then 1/100 or 1/1000 of the original DB
• Queries are run against the resulting sample DB
• There is a trade-off between accuracy and time
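The extraction steps above can be sketched as follows. This is a minimal illustration, assuming an in-memory list stands in for the warehouse; the function `living_sample`, its parameters, and the synthetic `warehouse` records are hypothetical. It supports both selection styles mentioned in the slide: every N-th record, or a random pick with probability 1/N.

```python
import random


def living_sample(records, every=100, randomize=False, seed=None):
    """Return roughly 1/every of the records as a living sample."""
    if every < 1:
        raise ValueError("every must be at least 1")
    if randomize:
        # Random selection: each record is kept with probability 1/every.
        rng = random.Random(seed)
        return [r for r in records if rng.randrange(every) == 0]
    # Systematic selection: keep the 100th, 200th, 300th, ... record.
    return [r for i, r in enumerate(records, start=1) if i % every == 0]


# Synthetic stand-in for a large warehouse table (values are made up).
rng = random.Random(0)
warehouse = [{"cust_id": i, "amount": rng.uniform(0, 100)} for i in range(1, 100_001)]

sample = living_sample(warehouse, every=100)

full_avg = sum(r["amount"] for r in warehouse) / len(warehouse)
sample_avg = sum(r["amount"] for r in sample) / len(sample)
print(len(warehouse), len(sample))
print(round(full_avg, 2), round(sample_avg, 2))
```

The sample average is close to the full average at about 1/100 of the scanning cost, which is the accuracy versus time trade-off. Systematic selection can be biased if the data has a periodic pattern that lines up with the step, so random selection is often the safer choice.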
Data Partitioning
 Partitioning is the breakup of data
into separate physical units that can
be handled independently

 Partitioning helps in:


1. Locating data
2. Accessing data
3. Archiving data
4. Deleting data
5. Monitoring data
6. Storing data
How to partition?
Data is partitioned when data of a like structure is divided into more than one physical
unit of data

Data can be divided by many criteria:


 Date
Line of business
Geography
All the above

In a DW, date should be one of the partitioning criteria
Example
Insurance company
Partitioning by year and claim type:
◦ 2000 health claims
◦ 2001 health claims
◦ 2002 health claims
◦ 1999 life claims
◦ 2000 life claims
◦ 2001 life claims
◦ 2002 life claims
◦ 2000 auto claims
◦ 2001 auto claims
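A minimal sketch of this kind of partitioning, assuming claims are plain dictionaries held in memory. The field names (`year`, `line`, `amount`), the sample rows, and the function `partition_claims` are hypothetical. Each (year, line of business) pair becomes its own unit that can be located, archived, or deleted without touching the others.

```python
from collections import defaultdict


def partition_claims(claims):
    """Group claims into independent partitions keyed by (year, line of business)."""
    partitions = defaultdict(list)
    for claim in claims:
        partitions[(claim["year"], claim["line"])].append(claim)
    return dict(partitions)


# Made-up claim rows for illustration only.
claims = [
    {"claim_id": 1, "year": 2000, "line": "health", "amount": 120.0},
    {"claim_id": 2, "year": 2000, "line": "health", "amount": 80.0},
    {"claim_id": 3, "year": 2001, "line": "health", "amount": 45.0},
    {"claim_id": 4, "year": 1999, "line": "life", "amount": 900.0},
    {"claim_id": 5, "year": 2000, "line": "auto", "amount": 300.0},
]

partitions = partition_claims(claims)

# Locating data: only one partition is read for a year-specific query.
health_2000_total = sum(c["amount"] for c in partitions[(2000, "health")])

# Archiving or deleting data: the oldest partition is removed as a whole.
archived = partitions.pop((1999, "life"))

print(sorted(partitions))
print(health_2000_total, len(archived))
```

In a real warehouse the partitions would be separate physical units such as files or tables; the dictionary here only mirrors the grouping logic.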
Structure of data in the DW
Simple cumulative data

Rolling summary data

Simple direct file

Continuous file
The simple cumulative structure
Daily transactions (operational data) → Daily summary

Jan1 Jan2 Jan3 ……..
Feb1 Feb2 Feb3 ……..
Mar1 Mar2 Mar3 ……..


Rolling summary data
Daily transactions (operational data) → Daily summary

day1 day2 day3 ……..
week1 week2 week3 ……..
mon1 mon2 mon3 ……..
year1 year2 year3 ……..
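One way to picture the rolling summary structure is the sketch below. It is an assumption-laden illustration, not the slides' own algorithm: the class `RollingSummary` is hypothetical, and it assumes 7 days roll up into a week, 4 weeks into a month, and 12 months into a year. When a level fills up, its slots are summed into the next level and then cleared.

```python
class RollingSummary:
    """Keeps daily totals and rolls them up into weeks, months, and years."""

    def __init__(self, days_per_week=7, weeks_per_month=4, months_per_year=12):
        # Capacity of the day, week, and month levels (assumed calendar).
        self.sizes = [days_per_week, weeks_per_month, months_per_year]
        # levels[0] = days, levels[1] = weeks, levels[2] = months, levels[3] = years
        self.levels = [[], [], [], []]

    def add_daily_total(self, total):
        self.levels[0].append(total)
        for level, size in enumerate(self.sizes):
            if len(self.levels[level]) < size:
                break
            # Level is full: store its sum one level up and drop the detail.
            self.levels[level + 1].append(sum(self.levels[level]))
            self.levels[level].clear()

    def grand_total(self):
        return sum(sum(slots) for slots in self.levels)


summary = RollingSummary()
for _ in range(30):
    summary.add_daily_total(1.0)

days, weeks, months, years = summary.levels
print(days, weeks, months, years)
print(summary.grand_total())
```

After 30 daily totals, only 2 day slots and 1 month slot are occupied. The grand total is preserved, but the individual days of the first 28 days can no longer be recovered, which is the loss of detail the structure trades for compactness.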


Comparison
Simple cumulative data            Rolling summary data
1. Much storage required          1. Very compact
2. No loss of details             2. Some loss of detail
3. Much processing needed to      3. The older the data gets,
   do anything with the data         the less detail is kept
Simple direct file
A simple direct file represents a snapshot of
operational data taken as of one instant in time

January customers (snapshot of operational data)
J  Ali    123 Tahrir St.
R  Ramy   456 Giza St.
H  Mona   12 Sherif St.
L  Laila  3 Fahmy St.
…………………
Continuous file
A continuous file is created from 2 or more simple direct files

January customers
J  Ali    123 Tahrir St.
R  Ramy   456 Giza St.
H  Sayed  12 Sherif St.
L  Lotfy  3 Fahmy St.
…………………

February customers
K  Khalid  12 Helwan St.
R  Kamal   46 Gohar St.
H  Sayed   34 Fesal St.
…………………

Resulting continuous file
J  Ali     jan-present  123 Tahrir St.
R  Ramy    jan-present  456 Giza St.
H  Sayed   jan-jan      12 Sherif St.
H  Sayed   feb-present  34 Fesal St.
L  Lotfy   jan-present  3 Fahmy St.
K  Khalid  feb-present  12 Helwan St.
R  Kamal   feb-present  46 Gohar St.
…………………
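The merge of simple direct files into a continuous file can be sketched as below. The function `build_continuous_file` and its record layout are hypothetical. Two assumptions are made because the slide's lists are elided: a customer is identified by the pair (letter, name), and a customer missing from a later snapshot stays open as "present".

```python
def build_continuous_file(snapshots):
    """Merge chronological snapshots into records with from/to periods.

    snapshots: list of (period_label, rows); each row is (key, name, address).
    """
    history = []        # every record ever created, in creation order
    open_records = {}   # (key, name) -> record whose "to" is still "present"
    previous_period = None
    for period, rows in snapshots:
        for key, name, address in rows:
            current = open_records.get((key, name))
            if current is not None and current["address"] == address:
                continue  # nothing changed, the open record still holds
            if current is not None:
                # Address changed: close the old record at the previous period.
                current["to"] = previous_period
            record = {"key": key, "name": name, "from": period,
                      "to": "present", "address": address}
            history.append(record)
            open_records[(key, name)] = record
        previous_period = period
    return history


january = [("J", "Ali", "123 Tahrir St."), ("R", "Ramy", "456 Giza St."),
           ("H", "Sayed", "12 Sherif St."), ("L", "Lotfy", "3 Fahmy St.")]
february = [("K", "Khalid", "12 Helwan St."), ("R", "Kamal", "46 Gohar St."),
            ("H", "Sayed", "34 Fesal St.")]

continuous = build_continuous_file([("jan", january), ("feb", february)])
for r in continuous:
    print(r["key"], r["name"], f'{r["from"]}-{r["to"]}', r["address"])
```

No existing record is overwritten except to close its period, so the old address of Sayed stays in the file alongside the new one.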
Data heterogeneity
Data in the DW is heterogeneous
Data is divided into major subdivisions
called subject areas
Subject areas are the lines of major
subjects of the corporation
Data in a subject area is divided into tables
Each table has its own data along the same subject area thread
Within the physical tables there are different occurrences of data values
How to handle incorrect
data ?
A mistaken bank entry made on July 2nd for 5000 L.E. instead of 750 L.E. is discovered on August 15th
1. Go back to the incorrect data and update it
• Neat and clean solution
• Data integrity is lost
• Requires an update in the DW environment

2. Enter offsetting entries (+5000, -750 on Aug. 16th)
• Many entries may need to be fixed
• Formula of correction can be complex
3. Reset the account to the proper value on Aug. 16th
• New entry with current balance 750
• Does not account for the earlier error
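The second and third options can be sketched on an append-only ledger. Everything here is hypothetical: the helper names, the year, and the opening deposit are invented, and the sketch assumes the July 2 entry was a withdrawal recorded as -5000 instead of -750, so that the offsetting entries come out as +5000 and -750. Neither option updates an existing row.

```python
from datetime import date


def balance(entries, as_of):
    """Sum every entry dated on or before as_of."""
    return sum(e["amount"] for e in entries if e["date"] <= as_of)


def offsetting_entries(wrong_amount, right_amount, on):
    """Option 2: reverse the wrong amount and post the right one as two new rows."""
    return [
        {"date": on, "amount": -wrong_amount, "note": "reverse erroneous entry"},
        {"date": on, "amount": right_amount, "note": "post correct entry"},
    ]


def reset_entry(entries, proper_balance, on):
    """Option 3: a single new row that brings the balance to its proper value."""
    adjustment = proper_balance - balance(entries, on)
    return {"date": on, "amount": adjustment, "note": "reset to proper balance"}


# Hypothetical account: the year and the opening deposit are made up.
ledger = [
    {"date": date(2005, 7, 1), "amount": 10000.0, "note": "opening deposit"},
    {"date": date(2005, 7, 2), "amount": -5000.0, "note": "withdrawal, should have been -750"},
]
fix_day = date(2005, 8, 16)

with_offsets = ledger + offsetting_entries(-5000.0, -750.0, fix_day)
with_reset = ledger + [reset_entry(ledger, 9250.0, fix_day)]

print(balance(with_offsets, date(2005, 7, 31)), balance(with_offsets, fix_day))
print(balance(with_reset, fix_day))
```

With either option the balance as of July 31 is unchanged, so reports already produced still reconcile, while the balance from August 16 onward is correct. The reset row gives no indication of which earlier entry was wrong, which is the drawback noted for the third option.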
Data warehouse architecture
Basic architecture

Architecture with staging area

Architecture with data marts
