
Data Vault Modeling

• Introduction
• History and philosophy of Data Vault
• Implementation
• Loading practices
• Data Vault and dimensional modelling
• Data Vault scheme as neural network
• Origin
• References
 
Introduction

• Data Vault Modeling is a method of designing a database to provide historical storage of data coming in from multiple operational systems, with complete tracing of where all the data in the database came from.
• The method is designed to be resilient to change in the environment.
• This purpose is mainly achieved by taking the business organisation as the starting point for the data model, since it is assumed that this will change less often than the operational systems used to support the business.
History and philosophy of Data Vault
• In data warehouse modelling there are two well-known competing options for modelling the layer where the data is stored: either you model according to Kimball, with conformed dimensions and an enterprise data bus, or you model according to Inmon, with the database in third normal form. Both techniques have issues when dealing with changes in the systems feeding the data warehouse. For conformed dimensions you also have to cleanse data (to conform it), and this is undesirable in a number of cases.
• Data Vault is designed to avoid or minimize the impact of those issues.
• Dan Linstedt, the creator of the method, describes the resulting database as follows:
"The Data Vault is a detail oriented, historical tracking and uniquely linked set of normalized tables that support one or more functional areas of business. It is a hybrid approach encompassing the best of breed between 3rd normal form (3NF) and star schema. The design is flexible, scalable, consistent and adaptable to the needs of the enterprise."
• An alternative name for the method is "Common Foundational Integration Modelling Architecture."
• Data Vault's philosophy is that all data is relevant data, even if it is "wrong". Data being wrong is a business problem and usually not a technical problem, which means you have to be able to capture all the data. Another issue to which Data Vault is a response is the growing need for complete auditability and traceability of all the data in the data warehouse. Due to Sarbanes-Oxley in the USA and similar measures in Europe, this is a relevant topic for many business intelligence implementations.

Implementation
• Data Vault attempts to solve the problem of dealing with change in the environment by separating the business keys (which do not mutate as often, because they uniquely identify a business entity) from the attributes of those keys. Attributes change at different rates, so you can group attributes together in small tables called Satellites and link those to the business keys, which live in tables called Hubs. Associations or transactions between business keys (relating Hubs such as Customer and Product through the Purchase transaction) are modelled using Link tables, which also have Satellites describing the attributes of the relation. In other words, the Satellites provide the context for the business processes that are captured in Hubs and Links, as sketched below.
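To make the structure concrete, here is a minimal sketch of such a model in Python with sqlite3, using the Customer/Product/Purchase example above. All table and column names are illustrative assumptions, not part of the method's specification; the metadata columns (load date, record source) anticipate the metadata requirement described below.

    # Illustrative Data Vault schema (hypothetical names throughout).
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    -- Hubs hold only a business key plus metadata. No foreign keys are
    -- declared anywhere, since Data Vault does not enforce referential
    -- integrity between tables.
    CREATE TABLE hub_customer (
        customer_sid  INTEGER PRIMARY KEY,   -- surrogate key
        customer_bk   TEXT NOT NULL UNIQUE,  -- business key
        load_date     TEXT NOT NULL,
        record_source TEXT NOT NULL
    );
    CREATE TABLE hub_product (
        product_sid   INTEGER PRIMARY KEY,
        product_bk    TEXT NOT NULL UNIQUE,
        load_date     TEXT NOT NULL,
        record_source TEXT NOT NULL
    );
    -- A Satellite groups attributes that change at a similar rate and
    -- hangs off one Hub.
    CREATE TABLE sat_customer_details (
        customer_sid  INTEGER NOT NULL,
        load_date     TEXT NOT NULL,
        record_source TEXT NOT NULL,
        name          TEXT,
        address       TEXT,
        PRIMARY KEY (customer_sid, load_date)
    );
    -- A Link records the Purchase association between the two Hubs.
    CREATE TABLE link_purchase (
        purchase_sid  INTEGER PRIMARY KEY,
        customer_sid  INTEGER NOT NULL,
        product_sid   INTEGER NOT NULL,
        load_date     TEXT NOT NULL,
        record_source TEXT NOT NULL
    );
    -- A Satellite on the Link holds the context of the relation itself.
    CREATE TABLE sat_purchase_details (
        purchase_sid  INTEGER NOT NULL,
        load_date     TEXT NOT NULL,
        record_source TEXT NOT NULL,
        quantity      INTEGER,
        price         REAL,
        PRIMARY KEY (purchase_sid, load_date)
    );
    """)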
• Links can link to other Links, to deal with changes in granularity (for instance, adding a new key to a database table would change the grain of that table). Connecting a Link on two Hubs to another Link on a third Hub is similar to adding those three Hubs to a single Link. See the section on loading practices for why the latter method is preferred.
• Links sometimes link to only one Hub, a construct called a 'peg-legged Link' by Dan Linstedt. This occurs when one of the business keys associated by the Link is not a real business key. As an example, take an order form with "order number" as key, and order lines that are keyed with a semi-random number to make them unique; call it "unique number". The latter key is not a real business key, so it gets no Hub. However, we do need it to guarantee the correct granularity for the Link. In this case, we do not use a Hub with a surrogate key, but add the business key "unique number" itself to the Link. This is done only when there is no possibility of ever using the business key for another Link or as a key for attributes in a Satellite.
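A hedged sketch of such a peg-legged Link, continuing the illustrative sqlite3 schema above (all names are assumptions): the degenerate key "unique number" is stored directly on the Link rather than being given its own Hub.

    # Peg-legged Link sketch: the order line's "unique number" is not a real
    # business key, so it gets no Hub; the key is carried on the Link itself.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE hub_order (
        order_sid     INTEGER PRIMARY KEY,
        order_number  TEXT NOT NULL UNIQUE,  -- the real business key
        load_date     TEXT NOT NULL,
        record_source TEXT NOT NULL
    );
    CREATE TABLE link_order_line (
        order_line_sid INTEGER PRIMARY KEY,
        order_sid      INTEGER NOT NULL,  -- resolves to hub_order
        unique_number  TEXT NOT NULL,     -- degenerate key, kept for grain
        load_date      TEXT NOT NULL,
        record_source  TEXT NOT NULL
    );
    """)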
• All the tables contain metadata, minimally describing at least the source system and the date on which the entry became valid, giving a complete historical view of the data as it enters the data warehouse. Data is never deleted, unless you have a technical error while loading data.
• The Data Vault does not maintain referential integrity between tables; instead, all Link tables are many-to-many relationships. This means that missing data is not an issue, and also that Hubs, Links and Satellites can be loaded independently of each other, in parallel.
Loading practices
• The ETL for updating a Data Vault model is fairly straightforward (see [3]). First you have to update all the Hubs; having done that, you can resolve all business keys to surrogate IDs. The second step is to add all attributes on the keys to the Hub Satellites and, at the same time, create and update all Links between Hubs. This resolves all the Link surrogate keys, enabling you to then add all the Link Satellites in step three.
• It is easy to verify that the updates inside each step are unrelated and can be done in parallel. The ETL is quite straightforward and lends itself to easy automation or templating. Problems occur only with Links relating to other Links, because resolving the business keys in the Link only leads to another Link that has to be resolved as well. Due to the equivalence of this situation with a Link to multiple Hubs, this difficulty can be avoided by remodelling such cases. A sketch of the three-step load follows below.
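As a minimal sketch of this three-step load in Python, assuming hypothetical per-table loader jobs (the table names reuse the illustrative schema from earlier):

    # Three-step Data Vault load: Hubs, then Hub Satellites + Links, then
    # Link Satellites; tables inside each step load in parallel.
    from concurrent.futures import ThreadPoolExecutor

    def load_table(table):
        # Placeholder for a real ETL job (extract, resolve keys, insert).
        print(f"loaded {table}")

    hubs      = ["hub_customer", "hub_product", "hub_order"]
    hub_sats  = ["sat_customer_details"]
    links     = ["link_purchase"]
    link_sats = ["sat_purchase_details"]

    with ThreadPoolExecutor() as pool:
        # Step 1: all Hubs; business keys now resolve to surrogate IDs.
        list(pool.map(load_table, hubs))
        # Step 2: Hub Satellites and Links; Link surrogate keys resolved.
        list(pool.map(load_table, hub_sats + links))
        # Step 3: Link Satellites, which depend on the Link surrogates.
        list(pool.map(load_table, link_sats))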
Data Vault and dimensional modelling
• The Data Vault modelled layer is normally used to store all the data, all the time. A lot of end-user computing tools expect their data to be contained in a dimensional model, so a conversion is needed. However, the Hubs and related Satellites can be considered as Dimensions, and the Links and related Satellites as Fact tables, in a dimensional model. This enables you to quickly prototype a dimensional model out of a Data Vault model using views, as sketched below.
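For example, here is a sketch of such views over the illustrative schema used earlier, assuming those hub, link and satellite tables already exist in the database; all names are assumptions, and a production view would also pick the current Satellite row per key by load date.

    # Presenting the Data Vault as a dimensional model through views.
    # Assumes the hub/sat/link tables from the earlier sketch exist here.
    import sqlite3

    conn = sqlite3.connect("vault.db")  # hypothetical vault database
    conn.executescript("""
    -- Hub + Satellite exposed as a Dimension.
    CREATE VIEW dim_customer AS
    SELECT h.customer_sid AS customer_key, h.customer_bk, s.name, s.address
    FROM hub_customer h
    JOIN sat_customer_details s ON s.customer_sid = h.customer_sid;

    -- Link + Satellite exposed as a Fact table.
    CREATE VIEW fact_purchase AS
    SELECT l.customer_sid AS customer_key, l.product_sid AS product_key,
           s.quantity, s.price
    FROM link_purchase l
    JOIN sat_purchase_details s ON s.purchase_sid = l.purchase_sid;
    """)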
Data Vault scheme as neural network
• The data model is patterned on a simplified view of neurons, dendrites, and synapses: neurons are associated with Hubs and Hub Satellites, Links are dendrites (vectors of information), and other Links are synapses (vectors in the opposite direction). By utilizing a set of data mining algorithms, Links can be scored with confidence and strength ratings. They can be created and dropped on the fly in accordance with learning about relationships that currently don't exist. The model can be automatically morphed, adapted, and adjusted as it is used and fed new structures.
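The section does not prescribe a particular algorithm. Purely as a speculative sketch, a confidence score could be as simple as a co-occurrence frequency, with Links kept or dropped against a threshold; all data, names, and numbers below are invented for illustration.

    # Toy link scoring: rate candidate Customer-Product associations by
    # relative co-occurrence frequency and drop low-confidence ones.
    from collections import Counter

    # (customer_bk, product_bk) pairs observed in source data (made up).
    observations = [("c1", "p1"), ("c1", "p1"), ("c1", "p2"), ("c2", "p1")]

    counts = Counter(observations)
    total = sum(counts.values())

    # Confidence here is plain relative frequency; any data mining score
    # (support, lift, ...) could be substituted.
    scores = {pair: n / total for pair, n in counts.items()}

    THRESHOLD = 0.3  # arbitrary illustrative cut-off
    kept = {pair: s for pair, s in scores.items() if s >= THRESHOLD}
    dropped = sorted(set(scores) - set(kept))
    print("kept:", kept)
    print("dropped:", dropped)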
Origin
• Data Vault Modeling was originally conceived by Dan Linstedt in 1990 and was released in 2000. Data Vault Modeling is public domain and has been freely available since 2000.
• In a series of five articles on The Data Administration Newsletter, the basic rules of the Data Vault method are expanded and explained. These contain a general overview[4], an overview of the components[5], a discussion about end dates and joins[6], link tables[7] and an article on loading practices[3].
References

http://en.wikipedia.org/wiki/Data_Vault_Modeling
Thank you
