
R. Marti
3-1 Data Warehouse Historization
Data Warehousing
Spring Semester 2012
DWh 2012: 3-1 Data Warehouse - Historization R. Marti
The Data Warehouse in the DWh Reference Architecture
[Figure: DWh reference architecture. Components shown: source databases, master data, data warehouse, data marts, and dashboards / reports / interactive analysis.]
Focus
Architectural options and variations in data warehouse projects
Design of the single integrated data warehouse, in particular
- how to handle temporal aspects (historization)
- how to ensure common dimensions (Master Data Management)
Preliminaries: Notions of Time in Databases
Valid Time (sometimes also effective time, as of time, or business time)
is the time when a fact in the real world was, is, or will be true.
(More precise wording: the time a fact was or is believed to be true or is believed to become true.)
Note: Valid time must be entered by the user.

Transaction Time (sometimes also system time)
is the time when a fact in the real world was or is stored in the database
(correctly or incorrectly).
Note: Transaction time is automatically determined by the system
(once the user decides to update the corresponding data, of course ... ) .

Example of a fact stored in a DB on October 1 2010 (= transaction time):
David Cole will be Chief Risk Officer as of March 1 2011 (= valid time).

Note: We will mostly be looking at valid time!
(Valid) Time in Star Schema Designs (1)
(Valid) Time in Star Schema Designs (2)
Rows in fact tables are associated with a specific time, via the foreign key
value referencing the time dimension, indicating when they were valid.

However, rows in dimension tables are not associated with any time!
- new rows (rows with unknown source system IDs) are simply added
- usually, no rows are deleted from a dimension table, even if rows with known
source system IDs are missing from a batch load:
. existing (old) facts still refer to objects corresponding to these missing rows
. if sources do not send explicit information on deletions, it is unclear whether
the corresponding dimensional objects have effectively become invalid or not
(Note: Sending this information might mean re-designing the source system!)
- changes in values of dimension rows with known source system IDs are
(1) either simply overwritten,
(2) or a new row with a new surrogate (but the old source system ID)
is added (see topic slowly changing dimensions)
Motivating Example: Star Schema
Analysis of yearly salaries grouped by year and by employee rank.
Schema
COMPENSATION (<fk1> DATE_ID, <fk2> EMP_ID, SALARY)
EMPS (<pk> EMP_ID, EMP_NO, EMP_NAME, EMP_RANK, EMP_TITLE)
DATES (<pk> DATE_ID, DATE_YEAR)
DATE_ID, EMP_ID: warehouse-internal object identifiers (surrogates)
EMP_NO: external source system identifier, must be stable across subsequent loads

Motivating Example: Query
Analysis of yearly salaries grouped by year and by employee rank.

select
  DATE_YEAR, EMP_RANK, EMP_TITLE,
  sum(SALARY) as SALARY
from
  COMPENSATION c
  join DATES d on d.DATE_ID = c.DATE_ID
  join EMPS e on e.EMP_ID = c.EMP_ID
group by
  DATE_YEAR, EMP_RANK, EMP_TITLE
order by
  DATE_YEAR, EMP_RANK
;
Motivating Example: 2010 Data
Load
- Generate ID for new year
- Generate IDs for new employees
- Project contents of source into target tables EMPS, COMPENSATION
Motivating Example: 2010 Compensation Report
select
DATE_YEAR, EMP_RANK, EMP_TITLE,
sum(SALARY) as SALARY
from
COMPENSATION c
join DATES d on d.DATE_ID = c.DATE_ID
join EMPS e on e.EMP_ID = c.EMP_ID
group by
DATE_YEAR, EMP_RANK, EMP_TITLE
order by
DATE_YEAR, EMP_RANK
;

Result
Motivating Example: 2011 Data
Issue: 2010 + 2011 Compensation Report
Old 2010 Result








2010+2011 Result
By destructively updating the rank/title of the employee with ID 2
from C to B, the 2010 report has been unintentionally altered.
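
The effect can be reproduced with a minimal sketch using Python's built-in sqlite3 module. The employee names, source system numbers, and salaries are hypothetical; only the star schema and the rank change of employee 2 from C to B are taken from the slides (EMP_TITLE is omitted for brevity).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table DATES (DATE_ID integer primary key, DATE_YEAR integer);
    create table EMPS  (EMP_ID integer primary key, EMP_NO integer,
                        EMP_NAME text, EMP_RANK text);
    create table COMPENSATION (DATE_ID integer, EMP_ID integer, SALARY integer);
    -- hypothetical 2010 load
    insert into DATES values (1, 2010);
    insert into EMPS values (1, 101, 'Ann', 'B'), (2, 102, 'Bob', 'C');
    insert into COMPENSATION values (1, 1, 100), (1, 2, 80);
""")

REPORT = """
    select DATE_YEAR, EMP_RANK, sum(SALARY)
    from COMPENSATION c
    join DATES d on d.DATE_ID = c.DATE_ID
    join EMPS e on e.EMP_ID = c.EMP_ID
    group by DATE_YEAR, EMP_RANK
    order by DATE_YEAR, EMP_RANK
"""

report_2010_before = conn.execute(REPORT).fetchall()

# 2011 load, overwriting the rank of employee 2 in place (destructive update)
conn.executescript("""
    insert into DATES values (2, 2011);
    update EMPS set EMP_RANK = 'B' where EMP_ID = 2;
    insert into COMPENSATION values (2, 1, 100), (2, 2, 90);
""")

report_after = conn.execute(REPORT).fetchall()
report_2010_after = [row for row in report_after if row[0] == 2010]

print(report_2010_before)  # 2010 figures as originally reported
print(report_2010_after)   # 2010 figures after the 2011 load
```

In this sketch the 2010 salary of employee 2 moves from rank C to rank B after the 2011 load, although nothing about 2010 has changed in the real world.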
Kimball's Types of Slowly Changing Dimensions
Ralph Kimball proposed three solutions for the historization of
dimensions in the context of the star schema, called slowly
changing dimensions (SCD):
SCD Type 1: no history of the dimensional attribute is needed/kept
- simply overwrite the value in the existing row
- ok for e.g. the correction of mistakes in names, birthdays etc.
SCD Type 2: versions of some dimensional attributes are needed
- store new rows in the dimension table, with a new warehouse ID,
  the existing stable source system ID,
  and the new (changed) values
- e.g. a change in the rank of an employee
SCD Type 3: current and original (or previous) versions are needed
- keep both a current and an original attribute in the dimension table
- e.g. the current rank and the original rank of each employee
Assessment of SCD Type 1 (see previous solution)
Advantages
Simple to understand for business users and simple to implement
(especially when using MOLAP tools)
Requires the least space and has the best response time
Disadvantages
Simplicity is deceptive!
A change in a dimensional attribute effectively changes the context
for all facts captured prior to the change
Motivating Example with SCD Type 2: 2011 Data
2010 + 2011 Compensation Report with SCD Type 2
Old 2010 Result







2010+2011 Result
2010 salaries get linked to the old version of the employee,
2011 salaries get linked to the new version of the employee.
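
A minimal sketch of an SCD Type 2 load, again with Python's sqlite3 module. The data values and the helper `load_employee_scd2` are hypothetical, and the sketch assumes that the row with the highest EMP_ID for a given EMP_NO is the latest version; the slides do not prescribe how the latest version is found.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table DATES (DATE_ID integer primary key, DATE_YEAR integer);
    create table EMPS  (EMP_ID integer primary key, EMP_NO integer,
                        EMP_NAME text, EMP_RANK text);
    create table COMPENSATION (DATE_ID integer, EMP_ID integer, SALARY integer);
    -- hypothetical state after the 2010 load
    insert into DATES values (1, 2010), (2, 2011);
    insert into EMPS values (1, 101, 'Ann', 'B'), (2, 102, 'Bob', 'C');
    insert into COMPENSATION values (1, 1, 100), (1, 2, 80);
""")

def load_employee_scd2(conn, emp_no, name, rank):
    """Return the EMP_ID to use for new facts, adding a new version if needed."""
    row = conn.execute(
        "select EMP_ID, EMP_NAME, EMP_RANK from EMPS "
        "where EMP_NO = ? order by EMP_ID desc limit 1", (emp_no,)).fetchone()
    if row is not None and (row[1], row[2]) == (name, rank):
        return row[0]              # unchanged: reuse the latest version
    cur = conn.execute(
        "insert into EMPS (EMP_NO, EMP_NAME, EMP_RANK) values (?, ?, ?)",
        (emp_no, name, rank))      # new surrogate, same source system ID
    return cur.lastrowid

# hypothetical 2011 source rows: (EMP_NO, name, rank, salary)
for emp_no, name, rank, salary in [(101, 'Ann', 'B', 100), (102, 'Bob', 'B', 90)]:
    emp_id = load_employee_scd2(conn, emp_no, name, rank)
    conn.execute("insert into COMPENSATION values (2, ?, ?)", (emp_id, salary))

report = conn.execute("""
    select DATE_YEAR, EMP_RANK, sum(SALARY)
    from COMPENSATION c
    join DATES d on d.DATE_ID = c.DATE_ID
    join EMPS e on e.EMP_ID = c.EMP_ID
    group by DATE_YEAR, EMP_RANK
    order by DATE_YEAR, EMP_RANK
""").fetchall()
versions_of_102 = conn.execute(
    "select count(*) from EMPS where EMP_NO = 102").fetchone()[0]
```

The 2010 rows of the report stay as they were, because the 2010 facts still reference the old version of employee 102.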
Assessment of SCD Type 2
Advantages
Reasonably understandable and simple to implement
(regardless of MOLAP / ROLAP tool)
Captures parts of the history

Disadvantages
The time of a change in a dimension is not captured
Requires more space since a single dimensional object is potentially
represented in several rows (but this is usually not an issue)
Can be confusing since changed dimensional data objects appear
more than once, with identical source system IDs, but at least one
changed attribute value
Checking when it is ok to refer to which DWh IDs is not possible
Motivating Example with SCD Type 3: 2011 Data
2010 + 2011 Compensation Report with SCD Type 3
2010+2011 Result in Terms of Original Ranks

2010+2011 Result in Terms of Current Ranks

Both reports are incorrect (red attribute values)!

Note: The query for the results in terms of original ranks is left
as an exercise ...
Assessment of SCD Type 3
Advantages
Reasonably simple to implement
(regardless of MOLAP / ROLAP tool)
Captures parts of the history

Disadvantages
Can only have 2 versions of any attribute (usually original and current)
Each historized attribute A must be represented by 2 attributes
(namely, A and A_Original)
Requires more space since there are now 2 attributes instead of 1
(but this is usually not an issue)
Interpretation of results is confusing to most users
Unclear when original and current versions are/were valid
Temporal Database Systems and Languages in General
Recap: For some types of analysis, dimensions should be historized,
especially for comparisons of measures across different time periods.

Example:
How did buying habits of customers change over the last few years,
grouped by where they live.

=> History of addresses of customers should also be kept!
Since 1980, a lot of research has been conducted in general temporal data
models, temporal query languages, and temporal database systems.
Generic support for temporal data is beginning to emerge in products:
Teradata Database 13.10, IBM DB2 V10, Oracle Workspace Manager
(see later)
Associating Time with Data: A Theoretical Model
[Figure: temporal relation with axes time, tuples, and attributes; the snapshot at time t is highlighted]
Assumption: For each relation, a clock with
a given temporal granularity is specified,
e.g., a day, a second, or a millisecond.

Conceptually, the extension of a temporal
relation R can then be viewed as a
sequence of snapshot relations

$R_t = \tau_t(R)$

for every time point t of this clock.

$\tau_t$ is called the snapshot operator
(sometimes also the timeslice operator).
Benefits and Pitfalls of Sequence of Snapshots Model
Good for theoretical considerations, in particular
- determining equivalence of different temporal representations
- measuring the expressive power of temporal query languages
Impractical as an implementation model if it requires lots of space,
especially when
- granularity of time is fine-grained (minutes, seconds, milliseconds, ... )
- represented facts do not change often, i.e. stay the same over a longer
  period of time (usually because they describe states rather than events)
From Sequence of Snapshots Model to Time Intervals
Remedy:

Don't store data that did not change since the previous clock tick

=> Tuples (or even attributes) whose values are identical across different
snapshots are associated with time intervals (also called periods)
rather than time points


Alternatives:

(1) associate temporal intervals to each tuple

(2) associate temporal intervals to each attribute value
(but this approach requires complex attributes, violating 1NF)

Valid Time Relations capturing State
Conceptually, every tuple which captures a state is timestamped with a time
interval $[t_{from}, t_{to}]$ indicating the validity of the (non-temporal) data
represented in the tuple
Remarks:
Transformation into 1NF by replacing V_INTERVAL
by V_FROM (valid from) and V_TO (valid to)
The symbol ? means "unknown", "until now", or "until further notice".
In standard SQL, it is usually represented by null or by the date 9999-12-31,
neither of which is entirely satisfactory ...
Side Issue: Representation of Time Intervals (Periods)
Closed-closed time intervals $[t_{from}, t_{to}]$ tend to be preferred by end-users:
A fact was true from date $t_{from}$ up to and including date $t_{to}$.
This choice also allows querying using the SQL between predicate:
valid at time t in SQL: :t between V_FROM and V_TO
Mathematically, closed-open time intervals $[t_{from}, t_{to})$, sometimes also
depicted as $[t_{from}, t_{to}[$, are preferable (see e.g. Allen):
A fact was true from date $t_{from}$ up to but excluding date $t_{to}$.
valid at time t in SQL: :t >= V_FROM and :t < V_TO
Note:
Unless otherwise stated, I have used the closed-closed representation.
Typical Queries (1): Snapshot of Valid Time Relation
Snapshots of the previous valid time relation:





Remarks:
We assume that ID is the primary key at every point in time (in every snapshot).
Producing a snapshot from a valid time relation is a simple selection in rel. algebra:
select ID, NAME, FNAME, ADDR, SAL
from EMP
where :t in V_INTERVAL -- actually: :t between V_FROM and V_TO
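
As an illustration, the snapshot selection can be run with Python's sqlite3 module. The rows below are hypothetical sample data (the FNAME column is omitted), and dates are stored as ISO strings, which compare correctly as text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table EMP (ID integer, NAME text, ADDR text, SAL integer,
                      V_FROM text, V_TO text);
    -- hypothetical valid time history of one employee
    insert into EMP values
        (676, 'Meier', 'Baar', 7000, '2006-08-01', '2008-02-29'),
        (676, 'Meier', 'Bern', 7000, '2008-03-01', '2009-12-31'),
        (676, 'Meier', 'Bern', 7500, '2010-01-01', '9999-12-31');
""")

def snapshot(conn, t):
    """Rows valid at time t, using closed-closed intervals [V_FROM, V_TO]."""
    return conn.execute(
        "select ID, NAME, ADDR, SAL from EMP "
        "where :t between V_FROM and V_TO order by ID", {"t": t}).fetchall()

print(snapshot(conn, '2008-02-29'))  # last day of the first interval
print(snapshot(conn, '2008-03-01'))  # first day of the second interval
```

Because ID is the key in every snapshot, each call returns at most one row per employee.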
Valid Time Relations capturing Recurring States
A specific state of affairs can recur several times (=> several time periods)

Transformation to 1NF:




The first two tuples are called value equivalent since they have the same
values in all attributes except the temporal attributes V_FROM and V_TO.
Options in the Representation of Time
Canonical representation using maximal time intervals (as on previous slide):





One (of many) possible alternative representations using two (non-maximal)
contiguous intervals (assuming a temporal granularity of a day):
Issues with Non-canonical Representations
Non-canonical representations may lead to incorrect answers (for unsuspecting
users).





Example Query: Who left the company before 2008-01-01 and when?
select ID, NAME, FNAME, V_TO
from EMP
where V_TO < date '2008-01-01'

(Incorrect) Result:
Constraint to Avoid Non-canonical Representations
Ensure that intervals remain maximal when inserting or updating:
Let R be a valid time relation in canonical form (i.e., with maximal time intervals)
- n be a new valid time tuple to be inserted into the relation R
- $x_1, \ldots, x_k$ ($k \geq 0$) be all existing valid time tuples in relation R which are
value equivalent to n (cf. p. 12)
Then, for all i, $1 \leq i \leq k$, the following must hold (in pseudo-SQL notation):
not exists (
  select *
  from R x_i
  where x_i = n                                           -- value equivalence
    and (n.V_FROM - 1 between x_i.V_FROM and x_i.V_TO     -- intervals do not touch or overlap
      or n.V_TO + 1 between x_i.V_FROM and x_i.V_TO
      or x_i.V_FROM between n.V_FROM and n.V_TO)          -- x_i not contained in n
)
(This could be specified as a declarative check constraint if your DBMS implementation supports it!)
Typical Queries (2): Temporal Projection
Unfortunately, (intermediate) query results may turn out to be non-canonical,
even if applied to a canonical representation:






Example: Where did employees live and when (irrespective of salary)?

select ID, NAME, FNAME, ADDR, V_FROM, V_TO from EMP

Result:
Coalescing to Avoid Non-canonical Representations
Non-canonical representations can be transformed into the canonical
representation by an operator called temporal coalescing (TCOALESCE below)
which maximizes the length of all intervals by coalescing adjacent and
overlapping intervals of value-equivalent tuples.








Coalesced form:






Temporal Coalescing in (Pseudo-) SQL
with recursive R_clos as (
  -- initial ("anchor") query
  select R.values, R.V_FROM, R.V_TO from R
  union
  -- recursive query: executed until no new data is generated
  select R.values, R.V_FROM, R_clos.V_TO
  from R, R_clos
  where R_clos.values = R.values        -- values of R_clos and R are equivalent
    and R_clos.V_FROM >= R.V_FROM
    and R_clos.V_FROM - 1 <= R.V_TO     -- intervals of R_clos and R touch or overlap
)
select c.values, c.V_FROM, c.V_TO
from R_clos c
where not exists (                      -- no larger (containing) interval
  select * from R_clos c2
  where c2.values = c.values
    and c2.V_FROM <= c.V_FROM and c2.V_TO >= c.V_TO
    and ( c2.V_FROM < c.V_FROM or c2.V_TO > c.V_TO )
)
Note: A more efficient implementation uses window functions (see [Zhou et al 2006]).
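
Outside the database, coalescing can also be sketched as a sort followed by a single linear scan. The following Python function is such a sketch, not the window-function approach of the cited paper; the function name `tcoalesce`, the tuple layout, and the sample rows are assumptions made for illustration. It uses closed-closed intervals with a granularity of one day.

```python
from datetime import date, timedelta
from itertools import groupby

ONE_DAY = timedelta(days=1)

def tcoalesce(rows):
    """Coalesce (values, v_from, v_to) rows with closed-closed day intervals."""
    result = []
    # sort by the non-temporal values first, then by start date
    ordered = sorted(rows, key=lambda r: (r[0], r[1]))
    for values, group in groupby(ordered, key=lambda r: r[0]):
        cur_from = cur_to = None
        for _, v_from, v_to in group:
            if cur_from is not None and v_from - ONE_DAY <= cur_to:
                cur_to = max(cur_to, v_to)        # touches or overlaps: extend
            else:
                if cur_from is not None:
                    result.append((values, cur_from, cur_to))
                cur_from, cur_to = v_from, v_to   # gap: start a new interval
        result.append((values, cur_from, cur_to))
    return result

# hypothetical address history with a recurring state (Baar twice)
rows = [
    ((676, 'Bern'), date(2008, 3, 1), date(2009, 12, 31)),
    ((676, 'Baar'), date(2008, 1, 1), date(2008, 2, 29)),
    ((676, 'Baar'), date(2006, 8, 1), date(2007, 12, 31)),
    ((676, 'Baar'), date(2010, 6, 1), date(9999, 12, 31)),
]
coalesced = tcoalesce(rows)
```

The two adjacent Baar intervals are merged into one, while the later Baar interval stays separate because it does not touch the earlier one.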
Typical Queries (3): Temporal Join
Sometimes, the history of information stored in two relations is of interest:











Example: Who worked on which projects and when?

Result:




Temporal Join in SQL (without temporal coalescing!)
Construct time intervals of result by intersecting time intervals of operands
(and keeping rows with non-empty intervals):

select * from (
select w.PROJ_ID, w.EMP_ID, e.NAME, e.FNAME,
case when e.V_FROM > w.V_FROM
then e.V_FROM
else w.V_FROM
end as V_FROM,
case when e.V_TO < w.V_TO
then e.V_TO
else w.V_TO
end as V_TO
from WORKS_ON w, EMP e
where e.ID = w.EMP_ID
) where V_FROM <= V_TO

Note: This gets more tedious when (temporally) joining 3 or more relations!
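
A runnable variant of this temporal join, sketched with Python's sqlite3 module. The sample rows are hypothetical, the FNAME column is omitted, and SQLite's two-argument scalar functions max() and min() stand in for the case expressions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table EMP (ID integer, NAME text, V_FROM text, V_TO text);
    create table WORKS_ON (PROJ_ID integer, EMP_ID integer,
                           V_FROM text, V_TO text);
    -- hypothetical data: employee 676 changes her name at the start of 2009
    insert into EMP values
        (676, 'Meier', '2006-08-01', '2008-12-31'),
        (676, 'Huber', '2009-01-01', '9999-12-31');
    insert into WORKS_ON values
        (1, 676, '2007-01-01', '2009-06-30'),
        (2, 676, '2010-01-01', '2010-12-31');
""")

rows = conn.execute("""
    select * from (
        select w.PROJ_ID, w.EMP_ID, e.NAME,
               max(e.V_FROM, w.V_FROM) as V_FROM,   -- later of the two starts
               min(e.V_TO,   w.V_TO)   as V_TO      -- earlier of the two ends
        from WORKS_ON w join EMP e on e.ID = w.EMP_ID
    )
    where V_FROM <= V_TO                             -- keep non-empty intervals
    order by PROJ_ID, V_FROM
""").fetchall()

for row in rows:
    print(row)
```

The assignment to project 1 is split into two result rows because the employee data changed while the assignment was valid.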
Transaction Time Relations
Note that transaction time should be automatically determined by the
system at insert/update/delete time (or, more precisely, commit time),
not by the user; granularity is typically as fine as possible
Transaction time can be represented exactly like valid time,
by associating a time interval with tuples.
Example: Transaction time history of employee 676 (also see slide 10)

1. 2006-07-01: insert 676 lives in Baar and earns 7000.


2. 2008-04-01: update 676 lives in Bern.
3. 2009-11-01: update 676 earns 7500.
Using DBMS Logging to capture Transaction Time
Since transaction time can be automatically determined by the system,
the DBMS logging facilities can be used.

This is/was done e.g. in Postgres/PostgreSQL/Illustra (and in Oracle).
Example: Transaction time history of employee 676 (see slide 15)

1. 2006-07-01: insert 676 lives in Baar and earns 7000.


2. 2008-04-01: update 676 lives in Bern.
3. 2009-11-01: update 676 earns 7500.
Normal (snapshot) table containing current contents.
Undo log table containing changes to produce previous contents of
the associated snapshot table (before images).
Implementing Logging Using Triggers
create or replace trigger TR_AU_EMP
after update
on EMP
for each row

declare
l_log EMP_UNDO_LOG%rowtype;

begin
l_log.X_TIME := current_timestamp;
l_log.UNDO_OP_CODE := 'update';
l_log.ID := :old.ID;
l_log.NAME := :old.NAME;
l_log.FNAME := :old.FNAME;
l_log.ADDR := :old.ADDR;
l_log.SAL := :old.SAL;
insert into EMP_UNDO_LOG values l_log;
end TR_AU_EMP;
/
Notes:
- written in Oracle PL/SQL
- similar triggers are required for inserts and deletes
- should probably check that ID has not changed and raise an application
  error if this were the case
Bitemporal Relations
Valid time and transaction time can be combined to allow for a complete
history of what information was/is believed to be true and when this was
stored in the database.


Example: Complete (bitemporal) history of employee 676

1. 2006-07-01: insert 676 lives in Baar and earns 7000 as of 2006-08-01.

Bitemporal Relations (2)
Example (continued): Complete (bi-temporal) history of employee 676

2. 2008-04-01: update 676 lives in Bern as of 2008-03-01.

Bitemporal Relations (3)
Example (continued): Complete (bi-temporal) history of employee 676

3. 2009-11-01: update 676 earns 7500 as of 2010-01-01.

Bitemporal Relations (4)

Example (continued): Complete (bi-temporal) history of employee 676

4. 2009-11-11: update correction: 676 earns 7700 as of 2010-01-01.

Design of Temporal Databases
Basic idea
Do non-temporal database design
Annotate which tables / attributes need to be historized (especially valid time)
and how (state-based vs. event-based)
Generate temporal data structures ... but how?

Questions:
Entity integrity (implemented by primary keys)
=> temporal entity integrity
Referential integrity (implemented by foreign keys)
=> temporal referential integrity

Arbiter: sequence of snapshots model
Temporal Entity Integrity (1)
Temporal entity integrity = for every snapshot, entity integrity should hold.
Pro memoria:
- primary keys should consist of a minimal number of attributes
which uniquely identify each tuple
- these attributes should ideally not change over time
Alternatives for the primary key of a valid time relation (e.g. for table EMP)
(1) ID, V_FROM
(2) ID, V_TO
(3) ID, V_FROM, V_TO (non-minimal primary key!)
(4) ID, SEQ_NO (where SEQ_NO is a sequence number or counter)
Since all attributes except ID (and SEQ_NO) can change over the lifetime of
the identified tuple
- alternative (4) is probably the best,
- followed by alternative (1) as V_FROM only changes in case of an error
(and should not be referenced by foreign keys, as we'll see)
Temporal Entity Integrity (2)
In addition, it might be desirable to enforce other constraints, including
Time intervals must not be empty
Time intervals should be maximal (unless e.g. queries like "what was the
case before or after a specific point in time" are not of importance)
create table EMP (
ID integer not null,
SEQ_NO integer not null,
NAME varchar(20) not null,
...
V_FROM date not null,
V_TO date default date '9999-12-31',
primary key (ID, SEQ_NO),
check ( V_FROM <= V_TO ),
check ( not exists (
select * from EMP other
where other.ID = ID and other.NAME = NAME and ...
and ( other.V_FROM between V_FROM-1 and V_TO+1
or other.V_TO between V_FROM-1 and V_TO+1 )
) ) )
Referential Integrity between Snapshot Relations
The foreign key (FK) attribute value(s) in the referencing relation must exist as
primary key (PK) values in the referenced relation:
Example: $\mathit{Works\_On}[\mathit{Emp\_Id}] \subseteq \mathit{Emp}[\mathit{Id}]$

Note: In relational theory, this is sometimes also called an inclusion dependency.
Temporal Referential Integrity (1)
Temporal referential integrity = for every snapshot, referential integrity must hold.
Problem:
- primary keys now have a temporal part (on top of the non-temporal part)
- valid time periods in the foreign key (referencing) relation are not
necessarily the same as those of the primary key (referenced) relation
At every point in time when the FK value was valid,
the referenced PK value must be valid.
$\forall t \, ( \tau_t(\mathit{Works\_On}[\mathit{Emp\_Id}]) \subseteq \tau_t(\mathit{Emp}[\mathit{Id}]) )$
Temporal Referential Integrity (2)
$\forall t \, ( \tau_t(\mathit{Works\_On}[\mathit{Emp\_Id}]) \subseteq \tau_t(\mathit{Emp}[\mathit{Id}]) )$ holds for employee 676 because
projection followed by temporal coalescing would result in:






Of course, performing temporal coalescing for
- adding tuples to and/or extending time intervals of the referencing relation
- deleting tuples from and/or shrinking time intervals in the referenced relation
would be an expensive proposition!
Recommendation: Track complete lifetimes of objects in a separate relation
Temporal Referential Integrity (3)
Split valid time relation on referenced (PK) side into
(1) an object relation (suffix _OBJ) and (2) a property relation (suffix _PROP)
Add a referential integrity constraint from property relation to object relation.
Re-route non-temporal referential integrity constraints from other relations
to the object relation.






Temporal Referential Integrity (4)
In referencing relations, it might be desirable to enforce referential integrity
non-temporal part: as usual
temporal part: time interval contained in time interval of referenced object

create table WORKS_ON (
EMP_ID integer not null,
PROJ_ID integer not null,
SEQ_NO integer not null,
V_FROM date not null,
V_TO date default date '9999-12-31',
primary key (EMP_ID, PROJ_ID, SEQ_NO),
check ( V_FROM <= V_TO ),
foreign key (EMP_ID) references EMP_OBJ(ID),
check ( exists (
select * from EMP_OBJ ref
where ref.ID = EMP_ID
and ref.V_FROM <= V_FROM and ref.V_TO >= V_TO
) )
... -- e.g. temporal FK to a table PROJ_OBJ
)
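
Since few DBMS products accept subqueries in check constraints, the temporal part is often verified by a query instead. The sketch below, using Python's sqlite3 module, lists the WORKS_ON rows that violate the containment condition; the sample rows are hypothetical and EMP_OBJ is reduced to its ID and lifetime columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table EMP_OBJ (ID integer primary key, V_FROM text, V_TO text);
    create table WORKS_ON (EMP_ID integer, PROJ_ID integer, SEQ_NO integer,
                           V_FROM text, V_TO text);
    insert into EMP_OBJ values (676, '2006-08-01', '9999-12-31');
    insert into WORKS_ON values
        (676, 1, 1, '2007-01-01', '2009-06-30'),   -- inside the lifetime
        (676, 2, 1, '2006-01-01', '2006-12-31'),   -- starts before the lifetime
        (999, 1, 1, '2007-01-01', '2007-12-31');   -- unknown employee
""")

# rows whose interval is not contained in the lifetime of the referenced object
violations = conn.execute("""
    select w.EMP_ID, w.PROJ_ID, w.SEQ_NO
    from WORKS_ON w
    where not exists (
        select * from EMP_OBJ o
        where o.ID = w.EMP_ID
          and o.V_FROM <= w.V_FROM and o.V_TO >= w.V_TO
    )
    order by w.EMP_ID, w.PROJ_ID
""").fetchall()
```

Such a query could run after each batch load, under the assumption that violations are reported rather than prevented.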
Temporal Normalization (1): Time-invariant Attributes
Assume that attribute FName cannot change over the lifetime of an Emp
(except to correct mistakes).
In other words, the functional dependency (FD) $Id \rightarrow FName$ holds
- relation Emp_Prop below is not in 2NF (attribute depends on part of PK)
- relation Emp_Prop exhibits update anomalies
when having to fix a mistake in Sue's first name (e.g. change to Susan)
Temporal Normalization (2): Time-invariant Attributes
Recommendation:
Consider moving time-invariant attributes (e.g. FName) from the property
relation (e.g. Emp_Prop) to the object relation (e.g. Emp_Obj).
In Emp_Obj, the FD $Id \rightarrow FName$ still holds (and is enforced by the PK),
so the relation does not exhibit update anomalies.
In Emp_Prop, all attributes are now fully dependent on the PK but there is still an issue ...
Temporal Normalization (3): Asynchronous Changes
Example: After having inserted the salary raise for employee 676 as of the beginning
of 2010, we learn that she actually moved to Aarau as of Dec 1 2009.
Update anomaly: several tuples need to be changed (in addition to the insert)





Recommendation:
Attributes whose values change independently of other attributes should be put
into different relations
(somewhat like achieving 4NF in the face of multi-valued dependencies).
Temporal Normalization (4): Asynchronous Changes
Example: Since address and salary of an employee may change independently
(and asynchronously), these attributes should be put into different relations.
No update anomaly: only one tuple needs to change (in addition to the insert)




Employee salaries remain untouched:
Summary of Design Recommendations
For kernel entity types (with objects whose existence is independent of other
entities), consider the introduction of an object relation to capture the lifetime
of these objects. Main benefits:
- referential integrity checking over time
- home for time-invariant attributes
For relations representing object properties (or relationships between objects)
and their history, consider choosing a temporal primary key consisting of the
non-temporal primary key attributes plus a (meaningless) sequence number.
For relations representing object properties (or relationships between objects),
consider decomposing them into groups of attributes which
- are either time-invariant
  => this attribute group is moved to the object relation
- or change independently of one another (i.e., potentially at different times)
  => each such attribute group is moved into a separate relation keeping
  track of the history of the values
Remember: Following them is no free lunch!
Proposals for Temporal Support in SQL
There are proposals to hide all this temporal complexity in SQL,
e.g., the SQL/Temporal part of a future SQL3 standard.
Originally, a temporal join (including temporal coalescing) was supposed to be
specified as follows:

validtime
select w.PROJ_ID, w.EMP_ID, e.NAME, e.FNAME
from WORKS_ON w, EMP e
where e.ID = w.EMP_ID

see Richard T. Snodgrass: Developing Time-Oriented Database Applications.
Morgan Kaufmann, 1999.

Note: This publication is out of print, but available electronically as PDF at
http://www.cs.arizona.edu/people/rts/publications.html

Apparently, DB2 10 for z/OS (see following slides for some examples) and
Teradata Database V13.10 support most of the SQL/Temporal proposal.




Example: Temporal Support in IBM DB2 10 (1)
Non-temporal table POLICY capturing information about insurance policies for
cars (vehicles):
ID: unchanging IDentifier
VIN: Vehicle Identification Number
rental_car: is the car a rental car (legal values: Y and N)
annual_mileage: approximate distance in miles per year
coverage_amt: maximum amount paid by insurance company,
presumably in US Dollars (Are there any other currencies on this planet? :-)






Fig. 1: Sample POLICY table (without temporal support)
ID VIN annual_mileage rental_car coverage_amt
1111 A1111 10000 Y 500000

Example: Temporal Support in IBM DB2 10 (2)
Declaring tables to capture system time (= transaction time) + history of changes

-- Step 1: Create a table with a SYSTEM_TIME period.
CREATE TABLE policy (
id INT PRIMARY KEY NOT NULL,
...
sys_start TIMESTAMP(12) GENERATED ALWAYS AS ROW BEGIN NOT NULL,
sys_end TIMESTAMP(12) GENERATED ALWAYS AS ROW END NOT NULL,
trans_start TIMESTAMP(12) GENERATED ALWAYS AS
TRANSACTION START ID IMPLICITLY HIDDEN,
PERIOD SYSTEM_TIME (sys_start, sys_end)
);

-- Step 2: Create an associated history table.
CREATE TABLE policy_history LIKE policy;

-- Step 3: Enable versioning.
ALTER TABLE policy ADD VERSIONING USE HISTORY TABLE policy_history;
Example: Temporal Support in IBM DB2 10 (3)
Result of previous create table statements:

Fig. 2: Sample tables for our system time scenario
POLICY table (contains current data)
ID VIN annual_mileage rental_car coverage_amt sys_start sys_end trans_start





POLICY_HISTORY table (contains historical data)
ID VIN annual_mileage rental_car coverage_amt sys_start sys_end trans_start





Example: Temporal Support in IBM DB2 10 (4)
Insertions do not affect the history table:
INSERT INTO policy(id,vin,annual_mileage,rental_car,coverage_amt)
VALUES (1111, 'A1111', 10000, 'Y', 500000);
INSERT INTO policy(id,vin,annual_mileage,rental_car,coverage_amt)
VALUES (1414, 'B7777', 14000, 'N', 750000);
-- both statements executed on November 15, 2010


Fig. 3: Current and history table contents after INSERTs on Nov. 15, 2010
POLICY
ID    VIN    annual_mileage  rental_car  coverage_amt  sys_start   sys_end
1111  A1111  10000           Y           500000        2010-11-15  9999-12-31
1414  B7777  14000           N           750000        2010-11-15  9999-12-31

POLICY_HISTORY (empty)
ID    VIN    annual_mileage  rental_car  coverage_amt  sys_start   sys_end
Example: Temporal Support in IBM DB2 10 (5)
Updates do affect the history table (as do deletions ... see later)
UPDATE policy
SET coverage_amt = 750000
WHERE id = 1111;
-- statement executed on January 31, 2011

POLICY
ID    VIN    annual_mileage  rental_car  coverage_amt  sys_start   sys_end
1111  A1111  10000           Y           750000        2011-01-31  9999-12-31
1414  B7777  14000           N           750000        2010-11-15  9999-12-31

POLICY_HISTORY
ID    VIN    annual_mileage  rental_car  coverage_amt  sys_start   sys_end
1111  A1111  10000           Y           500000        2010-11-15  2011-01-31

As you might expect, any subsequent updates to policies are handled in a similar manner.
Example: Temporal Support in IBM DB2 10 (6)
Another update, 1 year later ...
UPDATE policy
SET annual_mileage = 5000, rental_car='N', coverage_amt = 250000
WHERE id = 1111;
-- statement executed on January 31, 2012

POLICY
ID    VIN    annual_mileage  rental_car  coverage_amt  sys_start   sys_end
1111  A1111  5000            N           250000        2012-01-31  9999-12-31
1414  B7777  14000           N           750000        2010-11-15  9999-12-31

POLICY_HISTORY
ID    VIN    annual_mileage  rental_car  coverage_amt  sys_start   sys_end
1111  A1111  10000           Y           500000        2010-11-15  2011-01-31
1111  A1111  10000           Y           750000        2011-01-31  2012-01-31
Example: Temporal Support in IBM DB2 10 (7)
And a deletion ...
DELETE FROM policy
WHERE id = 1414;
-- statement executed on March 31, 2012

POLICY
ID    VIN    annual_mileage  rental_car  coverage_amt  sys_start   sys_end
1111  A1111  5000            N           250000        2012-01-31  9999-12-31

POLICY_HISTORY
ID    VIN    annual_mileage  rental_car  coverage_amt  sys_start   sys_end
1111  A1111  10000           Y           500000        2010-11-15  2011-01-31
1111  A1111  10000           Y           750000        2011-01-31  2012-01-31
1414  B7777  14000           N           750000        2010-11-15  2012-03-31

Example: Temporal Support in IBM DB2 10 (8)
Retrieving current data (from the current table shown on the previous slide):
SELECT coverage_amt
FROM policy
WHERE id = 1111;
-- returns 250000

Retrieving historical data (from the current/historical tables shown on the previous slide):
SELECT coverage_amt
FROM policy FOR SYSTEM_TIME AS OF TIMESTAMP('2010-12-01')
WHERE id = 1111;
-- returns 500000
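
One way to picture the AS OF lookup: search the current and history rows together for the version whose system-time period contains the requested time, with the start inclusive and the end exclusive. The Python sketch below only models that rule on the rows shown in the slides. The function name and the tuple layout are illustrative assumptions, dates stand in for the TIMESTAMP(12) values, and nothing here claims to mirror how DB2 evaluates the query internally.

```python
from datetime import date

# Rows as (id, coverage_amt, sys_start, sys_end), copied from the slides.
POLICY = [
    (1111, 250000, date(2012, 1, 31), date(9999, 12, 31)),
]
POLICY_HISTORY = [
    (1111, 500000, date(2010, 11, 15), date(2011, 1, 31)),
    (1111, 750000, date(2011, 1, 31), date(2012, 1, 31)),
    (1414, 750000, date(2010, 11, 15), date(2012, 3, 31)),
]


def coverage_as_of(policy_id, as_of):
    """Coverage amounts of all versions with sys_start <= as_of < sys_end.

    Hypothetical helper: it looks at current and history rows together.
    """
    return [
        amt
        for (pid, amt, start, end) in POLICY + POLICY_HISTORY
        if pid == policy_id and start <= as_of < end
    ]


print(coverage_as_of(1111, date(2010, 12, 1)))
```

Because the period end is exclusive, a lookup at exactly 2011-01-31 finds the 750000 version rather than the 500000 one.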


Example: Temporal Support in IBM DB2 10 (9)
Declaring a table to capture business time (= valid time):

CREATE TABLE policy (
id INT NOT NULL,
...
bus_start DATE NOT NULL,
bus_end DATE NOT NULL,
PERIOD BUSINESS_TIME (bus_start, bus_end),
PRIMARY KEY (id, BUSINESS_TIME WITHOUT OVERLAPS)
);
Example: Temporal Support in IBM DB2 10 (10)
Insertions are straightforward and require appropriate values for business time start / end:
INSERT INTO policy(id,vin, ... ,coverage_amt, bus_start, bus_end)
VALUES (1111, 'A1111', 10000, 'Y', 500000, '2010-01-01', '2011-01-01');
INSERT INTO policy(id,vin, ... ,coverage_amt, bus_start, bus_end)
VALUES (1111, 'A1111', 10000, 'Y', 750000, '2011-01-01', '9999-12-31');
INSERT INTO policy(id,vin, ... ,coverage_amt, bus_start, bus_end)
VALUES (1414, 'B7777', 14000, 'N', 750000, '2008-05-01', '2010-03-01');
INSERT INTO policy(id,vin, ... ,coverage_amt, bus_start, bus_end)
VALUES (1414, 'B7777', 12000, 'N', 600000, '2010-03-01', '2011-01-01');


Fig. 7: POLICY table after INSERT statements
ID    VIN    annual_mileage  rental_car  coverage_amt  bus_start   bus_end
1111  A1111  10000           Y           500000        2010-01-01  2011-01-01
1111  A1111  10000           Y           750000        2011-01-01  9999-12-31
1414  B7777  14000           N           750000        2008-05-01  2010-03-01
1414  B7777  12000           N           600000        2010-03-01  2011-01-01

Example: Temporal Support in IBM DB2 10 (11)
An insertion with a business time period that overlaps with business time period(s) of
existing rows raises an error:
INSERT INTO policy(id,vin, ... ,coverage_amt, bus_start, bus_end)
VALUES (1111, 'A1111', 10000, 'Y', 900000, '2010-06-01', '2011-09-01');
-- overlap with 2 existing rows => rejected by system

Use an update statement instead:
UPDATE policy
FOR PORTION OF BUSINESS_TIME
FROM '2010-06-01'
TO '2011-09-01'
SET coverage_amt = 900000
WHERE id = 1111;
Fig. 8: Row splits caused by the UPDATE statement (policy 1111)
Before the update (Fig. 7), 2 rows:
  [2010-01-01, 2011-01-01)  [2011-01-01, 9999-12-31)
UPDATE ... FROM 2010-06-01 TO 2011-09-01
After the update (Fig. 9), 4 rows:
  [2010-01-01, 2010-06-01)  [2010-06-01, 2011-01-01)  [2011-01-01, 2011-09-01)  [2011-09-01, 9999-12-31)


Example: Temporal Support in IBM DB2 10 (12)
Table resulting after execution of update statement shown on previous slide:

Fig. 9: POLICY table after UPDATE of Policy 1111
ID    VIN    annual_mileage  rental_car  coverage_amt  bus_start   bus_end
1111  A1111  10000           Y           500000        2010-01-01  2010-06-01
1111  A1111  10000           Y           900000        2010-06-01  2011-01-01
1111  A1111  10000           Y           900000        2011-01-01  2011-09-01
1111  A1111  10000           Y           750000        2011-09-01  9999-12-31
1414  B7777  14000           N           750000        2008-05-01  2010-03-01
1414  B7777  12000           N           600000        2010-03-01  2011-01-01
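
The row splitting caused by UPDATE ... FOR PORTION OF BUSINESS_TIME can be modelled in a few lines. The Python sketch below is an illustration only, not DB2's implementation: the function name and the dict-based row layout are assumptions, and only coverage_amt is updated. Each row of the policy that overlaps the given period is cut into an unchanged part before the period, an updated part inside it, and an unchanged part after it.

```python
from datetime import date


def update_for_portion(rows, policy_id, start, end, new_amt):
    """Model of UPDATE ... FOR PORTION OF BUSINESS_TIME FROM start TO end.

    Periods are inclusive-exclusive: [bus_start, bus_end).
    """
    result = []
    for row in rows:
        overlaps = (
            row["id"] == policy_id
            and row["bus_start"] < end
            and start < row["bus_end"]
        )
        if not overlaps:
            result.append(row)
            continue
        # Part of the row before the updated period keeps its old values.
        if row["bus_start"] < start:
            result.append({**row, "bus_end": start})
        # Part inside the updated period receives the new value.
        result.append({
            **row,
            "coverage_amt": new_amt,
            "bus_start": max(row["bus_start"], start),
            "bus_end": min(row["bus_end"], end),
        })
        # Part of the row after the updated period keeps its old values.
        if end < row["bus_end"]:
            result.append({**row, "bus_start": end})
    return result


policy_1111 = [
    {"id": 1111, "coverage_amt": 500000,
     "bus_start": date(2010, 1, 1), "bus_end": date(2011, 1, 1)},
    {"id": 1111, "coverage_amt": 750000,
     "bus_start": date(2011, 1, 1), "bus_end": date(9999, 12, 31)},
]
updated = update_for_portion(
    policy_1111, 1111, date(2010, 6, 1), date(2011, 9, 1), 900000)
for r in updated:
    print(r["coverage_amt"], r["bus_start"], r["bus_end"])
```

Applied to the two rows of policy 1111, the sketch yields four rows with the same coverage amounts and period boundaries as the table after the update.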
Example: Temporal Support in IBM DB2 10 (13)
Deletion from table shown on previous slide:
DELETE FROM policy
FOR PORTION OF BUSINESS_TIME
FROM '2010-06-01' TO '2011-01-01'
WHERE id = 1414;


ID    VIN    annual_mileage  rental_car  coverage_amt  bus_start   bus_end
1111  A1111  10000           Y           500000        2010-01-01  2010-06-01
1111  A1111  10000           Y           900000        2010-06-01  2011-01-01
1111  A1111  10000           Y           900000        2011-01-01  2011-09-01
1111  A1111  10000           Y           750000        2011-09-01  9999-12-31
1414  B7777  14000           N           750000        2008-05-01  2010-03-01
1414  B7777  12000           N           600000        2010-03-01  2010-06-01
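
A deletion for a portion of business time can be modelled the same way as an update, except that the part inside the deleted period is dropped instead of rewritten. The Python sketch below is a hedged illustration: the function name and the dict-based row layout are assumptions, not DB2 internals.

```python
from datetime import date


def delete_for_portion(rows, policy_id, start, end):
    """Model of DELETE ... FOR PORTION OF BUSINESS_TIME FROM start TO end.

    Periods are inclusive-exclusive: [bus_start, bus_end).
    """
    result = []
    for row in rows:
        overlaps = (
            row["id"] == policy_id
            and row["bus_start"] < end
            and start < row["bus_end"]
        )
        if not overlaps:
            result.append(row)
            continue
        # Keep whatever lies before the deleted period ...
        if row["bus_start"] < start:
            result.append({**row, "bus_end": start})
        # ... and whatever lies after it; the middle part disappears.
        if end < row["bus_end"]:
            result.append({**row, "bus_start": end})
    return result


policy_1414 = [
    {"id": 1414, "coverage_amt": 750000,
     "bus_start": date(2008, 5, 1), "bus_end": date(2010, 3, 1)},
    {"id": 1414, "coverage_amt": 600000,
     "bus_start": date(2010, 3, 1), "bus_end": date(2011, 1, 1)},
]
remaining = delete_for_portion(
    policy_1414, 1414, date(2010, 6, 1), date(2011, 1, 1))
for r in remaining:
    print(r["coverage_amt"], r["bus_start"], r["bus_end"])
```

For policy 1414 the first row is untouched and the second row is shortened to end on 2010-06-01, matching the table above. A deletion that falls strictly inside a row's period would leave two rows behind.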
Example: Temporal Support in IBM DB2 10 (14)






Retrieving data across all business time periods (from the table in Fig. 7):
SELECT COUNT(*) FROM policy WHERE id = 1111;
-- returns 2
Retrieving data as of a specific business time (from the table in Fig. 7):
SELECT coverage_amt
FROM policy FOR BUSINESS_TIME AS OF TIMESTAMP('2010-12-01')
WHERE id = 1111;
-- returns 500000

Fig. 7: POLICY table after INSERT statements
ID    VIN    annual_mileage  rental_car  coverage_amt  bus_start   bus_end
1111  A1111  10000           Y           500000        2010-01-01  2011-01-01
1111  A1111  10000           Y           750000        2011-01-01  9999-12-31
1414  B7777  14000           N           750000        2008-05-01  2010-03-01
1414  B7777  12000           N           600000        2010-03-01  2011-01-01

Example: Temporal Support in IBM DB2 10 (15)
Retrieving data for a range of business time (from the table in Fig. 7):
SELECT *
FROM policy
FOR BUSINESS_TIME FROM TIMESTAMP('2009-01-01')
TO TIMESTAMP('2011-01-01')
WHERE id = 1414;



Fig. 12: Query result
ID    VIN    annual_mileage  rental_car  coverage_amt  bus_start   bus_end
1414  B7777  14000           N           750000        2008-05-01  2010-03-01
1414  B7777  12000           N           600000        2010-03-01  2011-01-01

Temporal queries against tables with business time are internally rewritten to a query ...
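
A FROM ... TO range query selects every row whose business-time period overlaps the requested range, which comes down to two plain comparisons on bus_start and bus_end. The Python sketch below models that predicate on the Fig. 7 rows. The function name and the tuple layout are illustrative assumptions, and the sketch makes no claim about the exact form of the query DB2 produces.

```python
from datetime import date

# Rows of Fig. 7 as (id, coverage_amt, bus_start, bus_end).
POLICY = [
    (1111, 500000, date(2010, 1, 1), date(2011, 1, 1)),
    (1111, 750000, date(2011, 1, 1), date(9999, 12, 31)),
    (1414, 750000, date(2008, 5, 1), date(2010, 3, 1)),
    (1414, 600000, date(2010, 3, 1), date(2011, 1, 1)),
]


def rows_from_to(rows, policy_id, range_start, range_end):
    """Rows whose period [bus_start, bus_end) overlaps [range_start, range_end).

    Equivalent to the predicates: bus_start < range_end AND bus_end > range_start.
    """
    return [
        r for r in rows
        if r[0] == policy_id and r[2] < range_end and r[3] > range_start
    ]


for row in rows_from_to(POLICY, 1414, date(2009, 1, 1), date(2011, 1, 1)):
    print(row)
```

Both rows of policy 1414 qualify, as in Fig. 12. For policy 1111 the same range returns only the first row, because the second row starts exactly at the exclusive end of the range.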
Example: Temporal Support in IBM DB2 10 (16)
Declaring a bitemporal table, capturing business time and system time:
(1) Declare a table with business time and system time.
CREATE TABLE policy (
id INT NOT NULL,
...
bus_start DATE NOT NULL,
bus_end DATE NOT NULL,
sys_start TIMESTAMP(12) GENERATED ALWAYS AS ROW BEGIN NOT NULL,
sys_end TIMESTAMP(12) GENERATED ALWAYS AS ROW END NOT NULL,
trans_start TIMESTAMP(12) GENERATED ALWAYS AS
TRANSACTION START ID IMPLICITLY HIDDEN,
PERIOD BUSINESS_TIME (bus_start, bus_end),
PERIOD SYSTEM_TIME (sys_start, sys_end),
PRIMARY KEY (id, BUSINESS_TIME WITHOUT OVERLAPS)
);
(2) Then declare a history table with the same structure as the table declared in step (1).
(3) Associate this history table with the table declared in step (1).
Literature
General Temporal Database Concepts
[Snodgrass 1999] Richard T. Snodgrass: Developing Time-Oriented Database Applications. Morgan Kaufmann,
1999. (see http://www.cs.arizona.edu/people/rts/publications.html)
[Zhou et al 2006] Xin Zhou, Fusheng Wang, Carlo Zaniolo: Efficient Temporal Coalescing Query Support in
Relational Database Systems. Proc. 17th International Conference on Database and Expert Systems
Applications - DEXA '06, 2006.
[Johnston & Weis 2010] Tom Johnston, Randall Weis: Managing Time in Relational Databases: How to Design,
Update and Query Temporal Data. Morgan Kaufmann, 2010.
[Saracco et al 2010] Cynthia M. Saracco, Matthias Nicola, Lenisha Gandhi: A Matter of Time: Temporal Data
Management in DB2 for z/OS. IBM Silicon Valley Laboratory, 2010 (?).

Data Warehouse Design
[Kimball & Ross 2002] Ralph Kimball, Margy Ross: The Data Warehouse Toolkit: The Complete Guide to
Dimensional Modeling, 2nd Edition. John Wiley, 2002.
[Imhoff et al 2003] Claudia Imhoff, Nicholas Galemmo, Jonathan G. Geiger: Mastering Data Warehouse Design:
Relational and Dimensional Techniques. John Wiley, 2003.
[Golfarelli & Rizzi 2009] Matteo Golfarelli, Stefano Rizzi: Data Warehouse Design: Modern Principles and
Methodologies. McGraw Hill, 2009.
[Adamson 2010] Christopher Adamson: Star Schema: The Complete Reference. McGraw Hill, 2010.
