Marti
3-1 Data Warehouse Historization
Data Warehousing
Spring Semester 2012
DWh 2012: 3-1 Data Warehouse - Historization R. Marti
2
The Data Warehouse in the DWh Reference Architecture
[Figure: DWh reference architecture with source databases, the data warehouse, data marts, and the front ends (dashboards, reports, interactive analysis)]
Focus
Architectural options and variations in data warehouse projects
Design of the single integrated data warehouse, in particular
- how to handle temporal aspects (historization)
- how to ensure common dimensions (Master Data Management)
Preliminaries: Notions of Time in Databases
Valid Time (sometimes also effective time, as of time, or business time)
is the time when a fact in the real world was, is, or will be true.
(More precise wording: the time a fact was or is believed to be true or is believed to become true.)
Note: Valid time must be entered by the user.
Transaction Time (sometimes also system time)
is the time when a fact in the real world was or is stored in the database
(correctly or incorrectly).
Note: Transaction time is determined automatically by the system
(once the user decides to update the corresponding data, of course).
Example of a fact stored in a DB on October 1, 2010 (= transaction time):
David Cole will be Chief Risk Officer as of March 1, 2011 (= valid time).
Note: We will mostly be looking at valid time!
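The two notions of time can be sketched in a few lines of Python. The class, field, and function names are illustrative assumptions, not part of the lecture; only the example fact comes from the slide.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Fact:
    statement: str
    valid_from: date   # valid time: must be supplied by the user
    recorded_on: date  # transaction time: set by the system when the fact is stored


def store(statement: str, valid_from: date, today: date) -> Fact:
    """Store a fact: the caller supplies valid time, the 'system' supplies transaction time.

    `today` is passed in explicitly (an assumption of this sketch) to keep it deterministic.
    """
    return Fact(statement, valid_from, recorded_on=today)


fact = store("David Cole is Chief Risk Officer",
             valid_from=date(2011, 3, 1),
             today=date(2010, 10, 1))

# The fact was recorded in the database before it became true in the real world.
known_in_advance = fact.recorded_on < fact.valid_from
```

Keeping the two timestamps in separate fields makes it possible to ask both "what was true on a given day?" and "what did the database say on a given day?".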
(Valid) Time in Star Schema Designs (1)
(Valid) Time in Star Schema Designs (2)
Rows in fact tables are associated with a specific time, via the foreign key
value referencing the time dimension, indicating when they were valid.
However, rows in dimension tables are not associated with any time!
- New rows (rows with unknown source system IDs) are simply added.
- Usually, no rows are deleted from a dimension table, even if rows with known
  source system IDs are missing from a batch load:
  - existing (old) facts still refer to the objects corresponding to these missing rows
  - if sources do not send explicit information on deletions, it is unclear whether
    the corresponding dimensional objects have actually become invalid or not
    (Note: Sending this information might mean re-designing the source system!)
- Changes in values of dimension rows with known source system IDs are
  (1) either simply overwritten,
  (2) or a new row with a new surrogate (but the old source system ID)
      is added (see topic slowly changing dimensions)
Motivating Example: Star Schema
Analysis of yearly salaries grouped by year and by employee rank.
Schema
COMPENSATION: <fk1> DATE_ID, <fk2> EMP_ID, SALARY
EMPS: <pk> EMP_ID, EMP_NO, EMP_NAME, EMP_RANK, EMP_TITLE
DATES: <pk> DATE_ID, DATE_YEAR
DATE_ID, EMP_ID: warehouse-internal object identifiers (surrogates)
EMP_NO: external source system identifier, must be stable across subsequent loads
Motivating Example: Query
Analysis of yearly salaries grouped by year and by employee rank.
select
    DATE_YEAR, EMP_RANK, EMP_TITLE,
    sum(SALARY) as SALARY
from
    COMPENSATION c
    join DATES d on d.DATE_ID = c.DATE_ID
    join EMPS e on e.EMP_ID = c.EMP_ID
group by
    DATE_YEAR, EMP_RANK, EMP_TITLE
order by
    DATE_YEAR, EMP_RANK
;
Motivating Example: 2010 Data
Load
- Generate ID for new year
- Generate IDs for new employees
- Project contents of source into target tables EMPS, COMPENSATION
Motivating Example: 2010 Compensation Report
select
    DATE_YEAR, EMP_RANK, EMP_TITLE,
    sum(SALARY) as SALARY
from
    COMPENSATION c
    join DATES d on d.DATE_ID = c.DATE_ID
    join EMPS e on e.EMP_ID = c.EMP_ID
group by
    DATE_YEAR, EMP_RANK, EMP_TITLE
order by
    DATE_YEAR, EMP_RANK
;
Result
Motivating Example: 2011 Data
Issue: 2010 + 2011 Compensation Report
Old 2010 Result
2010+2011 Result
By destructively updating the rank/title of the employee with ID 2 from C to B,
the 2010 report has been unintentionally altered.
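The effect can be reproduced with a small in-memory SQLite database. This is a sketch under assumptions: the salaries, EMP_NO values, and the rank of employee 1 are invented, and the schema is reduced to the columns the report needs (EMP_NAME and EMP_TITLE are omitted). Only the change of employee 2 from rank C to B is taken from the slide.

```python
import sqlite3

REPORT = """
    select DATE_YEAR, EMP_RANK, sum(SALARY)
    from COMPENSATION c
    join DATES d on d.DATE_ID = c.DATE_ID
    join EMPS e on e.EMP_ID = c.EMP_ID
    group by DATE_YEAR, EMP_RANK
    order by DATE_YEAR, EMP_RANK
"""

db = sqlite3.connect(":memory:")
db.executescript("""
    create table DATES (DATE_ID integer primary key, DATE_YEAR integer);
    create table EMPS (EMP_ID integer primary key, EMP_NO integer, EMP_RANK text);
    create table COMPENSATION (DATE_ID integer, EMP_ID integer, SALARY integer);
    -- hypothetical 2010 load
    insert into DATES values (1, 2010);
    insert into EMPS values (1, 101, 'B'), (2, 102, 'C');
    insert into COMPENSATION values (1, 1, 100), (1, 2, 80);
""")
report_2010_old = [row for row in db.execute(REPORT) if row[0] == 2010]

db.executescript("""
    -- hypothetical 2011 load: the rank of employee 2 is overwritten in place
    insert into DATES values (2, 2011);
    update EMPS set EMP_RANK = 'B' where EMP_ID = 2;
    insert into COMPENSATION values (2, 1, 110), (2, 2, 90);
""")
report_2010_new = [row for row in db.execute(REPORT) if row[0] == 2010]

print(report_2010_old)  # [(2010, 'B', 100), (2010, 'C', 80)]
print(report_2010_new)  # [(2010, 'B', 180)]
```

The 2010 facts themselves never changed; only the dimension row they point to did, which is enough to change the 2010 figures.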
Kimball's Types of Slowly Changing Dimensions
Ralph Kimball proposed 3 solutions for the historization of dimensions
in the context of the star schema, called slowly changing dimensions (SCD):
SCD Type 1: no history of the dimensional attribute is needed/kept
- simply overwrite the value in the existing row
- ok for e.g. the correction of mistakes in names, birthdays etc.
SCD Type 2: versions of some dimensional attributes are needed
- store new rows in the dimension table, with a new warehouse ID,
  the existing stable source system ID, and the new (changed) values
- e.g. a change in the rank of an employee
SCD Type 3: current and original (or previous) versions are needed
- keep both a current and an original attribute in the dimension table
- e.g. the current rank and the original rank of each employee
Assessment of SCD Type 1 (see previous solution)
Advantages
- Simple to understand for business users and simple to implement
  (especially when using MOLAP tools)
- Requires the least space and has the best response time
Disadvantages
- The simplicity is deceptive!
- A change in a dimensional attribute effectively changes the context
  for all facts captured prior to the change
Motivating Example with SCD Type 2: 2011 Data
2010 + 2011 Compensation Report with SCD Type 2
Old 2010 Result
2010+2011 Result
2010 salaries get linked to the old version of the employee,
2011 salaries get linked to the new version of the employee.
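A minimal SQLite sketch of the same load done as SCD Type 2. As before, the salaries, EMP_NO values, and the reduced schema are assumptions made for illustration; the idea of adding a new row with a new surrogate but the old source system ID is from the slides.

```python
import sqlite3

REPORT = """
    select DATE_YEAR, EMP_RANK, sum(SALARY)
    from COMPENSATION c
    join DATES d on d.DATE_ID = c.DATE_ID
    join EMPS e on e.EMP_ID = c.EMP_ID
    group by DATE_YEAR, EMP_RANK
    order by DATE_YEAR, EMP_RANK
"""

db = sqlite3.connect(":memory:")
db.executescript("""
    create table DATES (DATE_ID integer primary key, DATE_YEAR integer);
    create table EMPS (EMP_ID integer primary key, EMP_NO integer, EMP_RANK text);
    create table COMPENSATION (DATE_ID integer, EMP_ID integer, SALARY integer);
    -- hypothetical 2010 load
    insert into DATES values (1, 2010);
    insert into EMPS values (1, 101, 'B'), (2, 102, 'C');
    insert into COMPENSATION values (1, 1, 100), (1, 2, 80);
    -- hypothetical 2011 load, SCD Type 2:
    -- new surrogate EMP_ID 3 for the changed employee, same source ID 102
    insert into DATES values (2, 2011);
    insert into EMPS values (3, 102, 'B');
    insert into COMPENSATION values (2, 1, 110), (2, 3, 90);
""")

report = db.execute(REPORT).fetchall()
versions_of_102 = db.execute(
    "select EMP_ID, EMP_RANK from EMPS where EMP_NO = 102 order by EMP_ID"
).fetchall()

print(report)           # [(2010, 'B', 100), (2010, 'C', 80), (2011, 'B', 200)]
print(versions_of_102)  # [(2, 'C'), (3, 'B')]
```

The 2010 rows are unchanged. The second query shows the price: the same source employee now appears twice in the dimension, and nothing in the table says when each version applied.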
Assessment of SCD Type 2
Advantages
- Reasonably understandable and simple to implement
  (regardless of MOLAP / ROLAP tool)
- Captures parts of the history
Disadvantages
- The time of a change in a dimension is not captured
- Requires more space since a single dimensional object is potentially
  represented in several rows (but this is usually not an issue)
- Can be confusing since changed dimensional data objects appear
  more than once, with identical source system IDs but at least one
  changed attribute value
- It is not possible to check which DWh IDs may be referenced at which times
Motivating Example with SCD Type 3: 2011 Data
2010 + 2011 Compensation Report with SCD Type 3
2010+2011 Result in Terms of Original Ranks
2010+2011 Result in Terms of Current Ranks
Both reports are incorrect (red attribute values)!
Note: The query for the results in terms of original ranks is left as an exercise ...
Assessment of SCD Type 3
Advantages
- Reasonably simple to implement (regardless of MOLAP / ROLAP tool)
- Captures parts of the history
Disadvantages
- Can only have 2 versions of any attribute (usually original and current)
- Each historized attribute A must be represented by 2 attributes
  (namely, A and A_Original)
- Requires more space since there are now 2 attributes instead of 1
  (but this is usually not an issue)
- Interpretation of results is confusing to most users
- Unclear when original and current versions are/were valid
Temporal Database Systems and Languages in General
Recap: For some types of analysis, dimensions should be historized,
especially for comparisons of measures across different time periods.
Example:
How did the buying habits of customers change over the last few years,
grouped by where they live?
⇒ The history of customer addresses should also be kept!
Since 1980, a lot of research has been conducted in general temporal data
models, temporal query languages, and temporal database systems.
Generic support for temporal data is beginning to emerge in products:
Teradata Database 13.10, IBM DB2 V10, Oracle Workspace Manager
(see later)
Associating Time with Data: A Theoretical Model
[Figure: axes time, tuples, attributes; label: snapshot at time t]
Assumption: For each relation, a clock with a given temporal granularity
is specified, e.g., a day, a second, or a millisecond.
The operator that yields the snapshot at time t is called the snapshot operator
(sometimes also timeslice operator).
Benefits and Pitfalls of Sequence of Snapshots Model
- Good for theoretical considerations, in particular
  - determining equivalence of different temporal representations
  - measuring the expressive power of temporal query languages
- Impractical as an implementation model if it requires lots of space,
  especially when
  - the granularity of time is fine (minutes, seconds, milliseconds, ...)
  - represented facts do not change often, i.e. stay the same over a longer
    period of time (usually because they describe states rather than events)
From Sequence of Snapshots Model to Time Intervals
Remedy:
Don't store data that did not change since the previous clock tick
⇒ Tuples (or even attributes) whose values are identical across different
snapshots are associated with time intervals (also called periods)
rather than time points
Alternatives:
(1) associate temporal intervals to each tuple
(2) associate temporal intervals to each attribute value
(but this approach requires complex attributes, violating 1NF)
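Alternative (1), tuple timestamping, can be sketched as a conversion from a sequence of snapshots to interval-stamped rows. This is an illustrative Python sketch, not the lecture's method: integer clock ticks, the function name, and the sample data are assumptions.

```python
def snapshots_to_intervals(snapshots):
    """Convert {tick: set of tuples} into rows (tuple, v_from, v_to).

    Intervals are closed-closed and cover maximal runs of consecutive ticks.
    """
    rows = []
    open_since = {}   # tuple -> tick at which its current run started
    prev_tick = None
    for tick in sorted(snapshots):
        current = snapshots[tick]
        # Close the runs of tuples that vanished, or of all tuples if ticks are missing.
        for tup in list(open_since):
            if tup not in current or tick != prev_tick + 1:
                rows.append((tup, open_since.pop(tup), prev_tick))
        # Open a run for every tuple that is not already being tracked.
        for tup in current:
            open_since.setdefault(tup, tick)
        prev_tick = tick
    # Close the runs that are still open at the last tick.
    for tup, start in open_since.items():
        rows.append((tup, start, prev_tick))
    return sorted(rows, key=lambda row: (row[1], row[2], repr(row[0])))


# Hypothetical data: five daily snapshots of one employee's address.
zurich, bern = ("Smith", "Zurich"), ("Smith", "Bern")
snapshots = {1: {zurich}, 2: {zurich}, 3: {zurich}, 4: {bern}, 5: {bern}}

intervals = snapshots_to_intervals(snapshots)
print(intervals)  # [(('Smith', 'Zurich'), 1, 3), (('Smith', 'Bern'), 4, 5)]
```

Five stored snapshot tuples collapse into two interval-stamped rows; the saving grows with finer granularity and with facts that rarely change.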
Valid Time Relations capturing State
Conceptually, every tuple which captures a state is timestamped with a time
interval [t_from, t_to] indicating the validity of the (non-temporal) data
represented in the tuple.
Remarks:
- Transformation into 1NF by replacing V_INTERVAL
  by V_FROM (valid from) and V_TO (valid to)
- The symbol ? means "unknown", "until now", or "until further notice".
  In standard SQL, it is usually represented by null or by the date 9999-12-31,
  neither of which is entirely satisfactory ...
Side Issue: Representation of Time Intervals (Periods)
Closed-closed time intervals [t_from, t_to] tend to be preferred by end users:
- A fact was true from date t_from up to and including date t_to.
- This choice also allows querying using the SQL between predicate:
  valid at time t in SQL:   :t between V_FROM and V_TO
Mathematically, closed-open time intervals [t_from, t_to), sometimes also
depicted as [t_from, t_to[, are preferable (see e.g. Allen):
- A fact was true from date t_from up to but excluding date t_to.
  valid at time t in SQL:   :t >= V_FROM and :t < V_TO
Note:
Unless otherwise stated, I have used the closed-closed representation.
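The two representations and the conversion between them can be sketched in Python. The function names and the day granularity are assumptions of this sketch; the two validity tests mirror the SQL predicates above.

```python
from datetime import date, timedelta

ONE_DAY = timedelta(days=1)  # assumed temporal granularity: one day


def valid_closed_closed(t, v_from, v_to):
    """Mirrors  :t between V_FROM and V_TO."""
    return v_from <= t <= v_to


def valid_closed_open(t, v_from, v_to):
    """Mirrors  :t >= V_FROM and :t < V_TO."""
    return v_from <= t < v_to


def to_closed_open(v_from, v_to):
    """Convert [v_from, v_to] to [v_from, v_to + 1 day)."""
    return v_from, v_to + ONE_DAY


year_2010 = (date(2010, 1, 1), date(2010, 12, 31))  # closed-closed
year_2010_co = to_closed_open(*year_2010)           # ends at 2011-01-01, exclusive

last_day = date(2010, 12, 31)
assert valid_closed_closed(last_day, *year_2010)
assert valid_closed_open(last_day, *year_2010_co)
assert not valid_closed_open(date(2011, 1, 1), *year_2010_co)
```

One practical reason for the closed-open form: two intervals are adjacent exactly when the end of one equals the start of the other, with no need to add or subtract a granule.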
Typical Queries (1): Snapshot of Valid Time Relation
Snapshots of the previous valid time relation:
Remarks:
We assume that ID is the primary key at every point in time (in every snapshot).
Producing a snapshot from a valid time relation is a simple selection in relational algebra:
select ID, NAME, FNAME, ADDR, SAL
from EMP
where :t in V_INTERVAL -- actually: :t between V_FROM and V_TO
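The snapshot query can be tried out with SQLite. The rows below are invented for illustration, FNAME is omitted for brevity, dates are stored as ISO strings (which compare correctly as text), and "until further notice" is encoded as 9999-12-31, one of the two conventions mentioned earlier.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "create table EMP (ID integer, NAME text, ADDR text, SAL integer, V_FROM text, V_TO text)"
)
db.executemany(
    "insert into EMP values (?, ?, ?, ?, ?, ?)",
    [   # hypothetical valid time relation
        (1, "Smith", "Zurich", 5000, "2009-01-01", "2010-06-30"),
        (1, "Smith", "Bern", 5500, "2010-07-01", "9999-12-31"),
        (2, "Jones", "Basel", 6000, "2010-01-01", "9999-12-31"),
    ],
)


def snapshot(t):
    """Return the snapshot of EMP at time t (an ISO date string)."""
    return db.execute(
        "select ID, NAME, ADDR, SAL from EMP "
        "where :t between V_FROM and V_TO order by ID",
        {"t": t},
    ).fetchall()


print(snapshot("2009-06-01"))  # [(1, 'Smith', 'Zurich', 5000)]
print(snapshot("2010-08-15"))  # [(1, 'Smith', 'Bern', 5500), (2, 'Jones', 'Basel', 6000)]
```

Each snapshot contains at most one row per ID, matching the assumption that ID is the primary key at every point in time.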
Valid Time Relations capturing Recurring States
A specific state of affairs can recur several times (⇒ several time periods)
⇒ transformation to 1NF
The first two tuples are called value equivalent since they have the same
values in all attributes except the temporal attributes V_FROM and V_TO.
Options in the Representation of Time
Canonical representation using maximal time intervals (as on previous slide):
One (of many) possible alternative representations using two (non-maximal)
contiguous intervals (assuming a temporal granularity of a day):
Issues with Non-canonical Representations
Non-canonical representations may lead to incorrect answers (for unsuspecting
users).
Example Query: Who left the company before 2008-01-01 and when?
select ID, NAME, FNAME, V_TO
from EMP
where V_TO < date '2008-01-01'
(Incorrect) Result:
Constraint to Avoid Non-canonical Representations
Ensure that intervals remain maximal when inserting or updating:
Let
- R be a valid time relation in canonical form (i.e., with maximal time intervals)
- n be a new valid time tuple to be inserted into the relation R
- x_1, ..., x_k (k ≥ 0) be all existing valid time tuples in relation R which are
  value equivalent to n (cf. p. 12)
Then, for all i, 1 ≤ i ≤ k, the following must hold (in pseudo-SQL notation):
not exists (
    select *
    from R x_i
    where x_i = n                                         -- value equivalence
      and (n.V_FROM - 1 between x_i.V_FROM and x_i.V_TO   -- intervals do not touch or overlap
        or n.V_TO + 1 between x_i.V_FROM and x_i.V_TO
        or x_i.V_FROM between n.V_FROM and n.V_TO)
)
(This could be specified as a declarative check constraint if your DBMS supports it!)
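The intent of the constraint (a new tuple must neither touch nor overlap a value-equivalent tuple) can be sketched in Python. Integer ticks, the tuple layout `(values, v_from, v_to)`, and the function names are assumptions made for this sketch.

```python
def touches_or_overlaps(a_from, a_to, b_from, b_to):
    """True if the closed-closed intervals [a_from, a_to] and [b_from, b_to]
    overlap or are directly adjacent (granularity: one tick)."""
    return a_from - 1 <= b_to and b_from - 1 <= a_to


def may_insert(relation, new):
    """Check whether inserting `new` keeps `relation` in canonical form.

    relation: list of (values, v_from, v_to); new: a tuple of the same shape.
    """
    values, v_from, v_to = new
    return not any(
        x_values == values and touches_or_overlaps(v_from, v_to, x_from, x_to)
        for x_values, x_from, x_to in relation
    )


# Hypothetical relation with a single tuple valid during ticks 1..5.
zurich, bern = ("Smith", "Zurich"), ("Smith", "Bern")
R = [(zurich, 1, 5)]

print(may_insert(R, (zurich, 6, 9)))   # False: adjacent to [1, 5], should be merged instead
print(may_insert(R, (zurich, 8, 9)))   # True: a gap remains, so this is a recurring state
print(may_insert(R, (bern, 6, 9)))     # True: not value equivalent
print(may_insert(R, (zurich, 0, 9)))   # False: would contain the existing interval
```

When the check fails, the usual alternative to rejecting the insert is to merge the new interval with the existing ones.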
Typical Queries (2): Temporal Projection
Unfortunately, (intermediate) query results may turn out to be non-canonical,
even if applied to a canonical representation:
Example: Where did employees live and when (irrespective of salary)?
select ID, NAME, FNAME, ADDR, V_FROM, V_TO from EMP
Result:
Coalescing to Avoid Non-canonical Representations
Non-canonical representations can be transformed into the canonical
representation by an operator called temporal coalescing (TCOALESCE below)
which maximizes the length of all intervals by coalescing adjacent and
overlapping intervals of value-equivalent tuples.
Coalesced form:
Temporal Coalescing in (Pseudo-) SQL
with recursive R_clos as (
    -- initial ("anchor") query
    select R.values, R.V_FROM, R.V_TO from R
    union
    -- recursive query: executed until no new data generated
    select R.values, R.V_FROM, R_clos.V_TO
    from R, R_clos
    where R_clos.values = R.values        -- values of R_clos and R are equivalent
      and R_clos.V_FROM >= R.V_FROM
      and R_clos.V_FROM - 1 <= R.V_TO     -- intervals of R_clos and R overlap
)
select R_clos.values, R_clos.V_FROM, R_clos.V_TO
from R_clos
where not exists (                        -- no touching or overlapping tuple extends the interval
    select * from R
    where R.values = R_clos.values
      and ( R.V_FROM < R_clos.V_FROM
         or R.V_TO > R_clos.V_TO )
      and R.V_FROM <= R_clos.V_TO + 1     -- R touches or overlaps R_clos
      and R.V_TO >= R_clos.V_FROM - 1
)
Note: A more efficient implementation uses window functions (see [Zhou et al 2006]).
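For comparison, coalescing can also be done in a single pass over sorted rows. The sketch below is plain Python, not the window-function SQL of the cited paper; integer ticks, the tuple layout, and the function name `tcoalesce` (borrowed from the operator name used earlier) are assumptions.

```python
def tcoalesce(rows):
    """Coalesce value-equivalent rows whose closed-closed intervals touch or overlap.

    rows: iterable of (values, v_from, v_to); `values` must be sortable.
    """
    result = []
    # Sorting groups value-equivalent rows together, ordered by start of interval.
    for values, v_from, v_to in sorted(rows):
        if result:
            last_values, last_from, last_to = result[-1]
            if last_values == values and v_from <= last_to + 1:
                # Same values and no gap: extend the previous interval.
                result[-1] = (values, last_from, max(last_to, v_to))
                continue
        result.append((values, v_from, v_to))
    return result


# Hypothetical non-canonical input.
zurich, bern = ("Smith", "Zurich"), ("Smith", "Bern")
rows = [(zurich, 4, 6), (zurich, 1, 3), (zurich, 10, 12), (bern, 7, 9)]

print(tcoalesce(rows))
# [(('Smith', 'Bern'), 7, 9), (('Smith', 'Zurich'), 1, 6), (('Smith', 'Zurich'), 10, 12)]
```

The adjacent Zurich intervals [1, 3] and [4, 6] are merged, while [10, 12] stays separate because it represents a recurring state after a gap.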
Typical Queries (3): Temporal Join
Sometimes, the history of information stored in two relations is of interest:
Example: Who worked on which projects and when?
Result:
Temporal Join in SQL (without temporal coalescing!)
Construct time intervals of result by intersecting time intervals of operands
(and keeping rows with non-empty intervals):
select * from (
    select w.PROJ_ID, w.EMP_ID, e.NAME, e.FNAME,
           case when e.V_FROM > w.V_FROM
                then e.V_FROM
                else w.V_FROM
           end as V_FROM,
           case when e.V_TO < w.V_TO
                then e.V_TO
                else w.V_TO
           end as V_TO
    from WORKS_ON w, EMP e
    where e.ID = w.EMP_ID
) where V_FROM <= V_TO
Note: This gets more tedious when (temporally) joining 3 or more relations!
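The temporal join can be run against SQLite with a few invented rows. The data, the ISO-string dates, the omission of FNAME, and the subquery alias `j` are assumptions of this sketch; the interval intersection follows the query above.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    create table EMP (ID integer, NAME text, V_FROM text, V_TO text);
    create table WORKS_ON (PROJ_ID text, EMP_ID integer, V_FROM text, V_TO text);
    -- hypothetical data: one employment period, two project assignments
    insert into EMP values (1, 'Smith', '2009-01-01', '2010-12-31');
    insert into WORKS_ON values ('P1', 1, '2008-06-01', '2009-06-30'),
                                ('P2', 1, '2011-01-01', '2011-12-31');
""")

rows = db.execute("""
    select * from (
        select w.PROJ_ID, w.EMP_ID, e.NAME,
               -- later of the two start dates
               case when e.V_FROM > w.V_FROM then e.V_FROM else w.V_FROM end as V_FROM,
               -- earlier of the two end dates
               case when e.V_TO < w.V_TO then e.V_TO else w.V_TO end as V_TO
        from WORKS_ON w, EMP e
        where e.ID = w.EMP_ID
    ) j
    where V_FROM <= V_TO  -- keep only non-empty intersections
""").fetchall()

print(rows)  # [('P1', 1, 'Smith', '2009-01-01', '2009-06-30')]
```

The P1 assignment is clipped to the start of the employment period, and P2 is dropped because its interval does not intersect the employment period at all.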
Transaction Time Relations
Note that transaction time should be automatically determined by the
system at insert/update/delete time (or, more precisely, commit time),
not by the user; granularity is typically as fine as possible
Transaction time can be represented exactly like valid time,
by associating a time interval with tuples.
Example: Transaction time history of employee 676 (also see slide 10)