You are on page 1of 61

cohesion institute

Data Warehouse
Architecture Best Practices

December 5, 2005

Speaker: R. Michael Pickering


President, Cohesion Systems Consulting Inc.
cohesion
institute

Agenda
 Introductions
 Business Intelligence Background
 Architecture Best Practices
 Questions & Answers

March 21, 2009 DW Architecture Best Practices 2


cohesion institute

Data Warehouse
Architecture Best Practices

Introductions
cohesion
institute

Presenter Biography
 R. Michael Pickering
President and Chief Architect,
Cohesion Systems Consulting Inc.

previously, Managing Consultant, BI&W, Oracle
Consulting (Canada)
 before that, Red Brick Systems, Inc.
 over 8 years DW experience

Manulife Reinsurance, Bell Canada, USDA, Kraft
Foods, LCBO, Telecom Argentina, Nortel Networks,
Procter & Gamble, Bayer, Syncrude, OMoHLTC…
 Mr. Pickering has had DW articles published
in The Handbook of Data Management

March 21, 2009 DW Architecture Best Practices 4


cohesion
institute

Cohesion Systems Consulting


 Provides DW and BI services,
specializing in:
 Architecture & Implementation Consulting
 Project Management
 Databases, Appliances & Emerging
Technology
 Training & Mentoring
 Since inception in 2000, clients have
included Enbridge, CIBC, The Bank of
New York, Loyalty Management Group,
Canada Post Borderfree, Katz Group
March 21, 2009 DW Architecture Best Practices 5
cohesion
institute

Audience Survey
 By a show of hands, please indicate your
experience with:
 normalization
 dimensional modeling
 operational data store
 data consolidation
 Extract Transform Load (ETL)
 metadata architecture
 DW appliances

March 21, 2009 DW Architecture Best Practices 6


cohesion institute

Data Warehouse
Architecture Best Practices

Business Intelligence
Background
cohesion
institute

What is Business Intelligence?


 A Data Warehouse is usually one
component of an overall business
intelligence solution
 IT people may be tempted to think in
terms of products and technologies

BUT...

March 21, 2009 DW Architecture Best Practices 8


cohesion
institute

Overarching Goal
 The overarching goal of
business intelligence is to
provide the information
necessary to MANAGE a
business
 This means providing
information in support of
management decision making,
which is why BI is also called
“Decision Support”
March 21, 2009 DW Architecture Best Practices 9
cohesion
institute

BI is about “Data Abstraction”


wisdom
Stages knowledge
(4) information
data

 audience for a data warehouse typically


considers higher slices of data abstraction
pyramid
 lowest level of pyramid is too detailed &
unwieldy
March 21, 2009 DW Architecture Best Practices 10
cohesion
institute

It’s Not Technology


 Business Intelligence is about delivering
business value
 provide tangible benefit by answering important
questions that can help the business to achieve its
strategic focus
 Improving profitability
 Who are our five most profitable clients?
 What are our least profitable products?
 Reducing cost
 Who are our lowest cost suppliers?
 Which materials incur highest spoilage costs?
 Improving customer satisfaction
 What factors may lead to lost customers?

March 21, 2009 DW Architecture Best Practices 11


cohesion
institute

Business of BI
 In some cases, legislation such as Sarbanes-
Oxley or Basel II makes some kind of BI
fundamental to doing business
 Many leading companies use BI to achieve
competitive advantage
 E.g. Walmart, Dell, Amazon.com, Kraft, American
Express, etc…

March 21, 2009 DW Architecture Best Practices 12


cohesion
institute

Data Warehouse Architecture


 architecture is about delivering
an elegant solution that meets
the solution requirements
 this means really understanding the
problem
 DW architecture is part art, part
science

March 21, 2009 DW Architecture Best Practices 13


cohesion
institute

Good Architecture
 ‘It’s not easy to describe a good
design, but I’ll know it when I
see it’

March 21, 2009 DW Architecture Best Practices 14


cohesion
institute

BI Architecture Requirements
 must recognize change as a
constant
 take incremental development
approach
 existing applications must
continue to work
 need to allow more data and
new types of data to be added

March 21, 2009 DW Architecture Best Practices 15


cohesion
institute

End User Acceptance


 understandability
 understandability is in the eyes
of the beholder
 want to hide the complexity
 try to make it:

intuitive, obvious
 visible, memorable

March 21, 2009 DW Architecture Best Practices 16


cohesion
institute

End User Acceptance


 performance
 don’t want to interrupt
the thinking process
 provide one click,
instantaneous access
 warehouse must be
available, “production”
system

March 21, 2009 DW Architecture Best Practices 17


cohesion institute

Data Warehouse
Architecture Best Practices

Architecture
Best Practices
cohesion
institute

High Level Architecture


 remember the different “worlds”
 on-line transaction processing (OLTP)
 business intelligence systems (BIS)
 users are different
 data content is different
 data structures are different
 architecture & methodology must be
different
March 21, 2009 DW Architecture Best Practices 19
cohesion
institute

Two Different Worlds


 On-Line Transaction Processing
 Entity Relational Data Model

created in 1960’s to address performance
issues with relational database
implementations

normalized to most efficiently get data in

divides the data into many discrete entities

many relationships between these entities

this approach was documented by C.J. Date in
An Introduction to Database Systems

March 21, 2009 DW Architecture Best Practices 20


cohesion
institute

Two Different Worlds


 Business Intelligence Systems
 Dimensional Data Model
 also called star schema
 designed to easily get information out
 fewer relationships than ERD, the only
table with multiple joins connecting to
other tables is the central table
 developed in 1960’s by data service
providers, formalized by Ralph Kimball in
The Data Warehouse Toolkit

March 21, 2009 DW Architecture Best Practices 21


cohesion
institute

Entity Relation Disadvantages


 all tables look the same
 people can’t visualize/remember
diagrams
 software can’t navigate as schema
becomes too complex
 business processes mixed together
 many artificial keys created
March 21, 2009 DW Architecture Best Practices 22
cohesion
institute

Dimensional Model
Advantages
 simplicity
 humans can navigate and remember
 software can navigate
deterministically
 business process explicitly separated
(Data Mart)
 not so many keys (keys = # of
attendant tables)
March 21, 2009 DW Architecture Best Practices 23
cohesion
institute

Best Practice #1
 Use a data model that is optimized
for information retrieval
 dimensional model
 denormalized
 hybrid approach

March 21, 2009 DW Architecture Best Practices 24


cohesion
institute

Data Acquisition Processes


 Extract Transform Load (ETL)
 the process of unloading or copying data
from the source systems, transforming it into
the format and data model required in the BI
environment, and loading it to the DW
 also, a software development tool for
building ETL processes (an ETL tool)
 many production DWs use COBOL or other
general-purpose programming languages to
implement ETL

March 21, 2009 DW Architecture Best Practices 25


cohesion
institute

Data Quality Assurance


 data cleansing
 the process of validating and enriching
the data as it is published to the DW
 also, a software development tool for
building data cleansing processes (a data
cleansing tool)
 many production DWs have only very
rudimentary data quality assurance
processes

March 21, 2009 DW Architecture Best Practices 26


cohesion
institute

Data Acquisition & Cleansing


 getting data loaded efficiently and
correctly is critical to the success of
your DW
 implementation of data acquisition &
cleansing processes represents from 50
to 80% of effort on typical DW projects
 inaccurate data content can be ‘the kiss
of death’ for user acceptance

March 21, 2009 DW Architecture Best Practices 27


cohesion
institute

Best Practice #2
 Carefully design the data
acquisition and cleansing processes
for your DW
 Ensure the data is processed
efficiently and accurately
 Consider acquiring ETL and Data
Cleansing tools
 Use them well!

March 21, 2009 DW Architecture Best Practices 28


cohesion
institute

Data Model
 Already discussed the benefits of a
dimensional model
 No matter whether dimensional
modeling or any other design
approach is used, the data model
must be documented

March 21, 2009 DW Architecture Best Practices 29


cohesion
institute

Documenting the Data Model


 The best practice is to use some kind of data
modeling tool
 CA ERwin
 Sybase PowerDesigner
 Oracle Designer
 IBM Rational Rose
 Etc.
 Different tools support different modeling notations,
but they are more or less equivalent anyway
 Most tools allow sharing of their metadata with an
ETL tool

March 21, 2009 DW Architecture Best Practices 30


cohesion
institute

Data Model Standards


 data model standards appropriate for the
environment and tools chosen in your data
warehouse should be adopted
 considerations should be given to data access
tool(s) and integration with overall enterprise
standards
 standards must be documented and enforced
within the DW team
 someone must ‘own’ the data model
 to ensure a quality data model, all changes
should be reviewed thru some formal process

March 21, 2009 DW Architecture Best Practices 31


cohesion
institute

Data Model Metadata


 Business definitions should be recorded
for every field (unless they are technical
fields only)
 Domain of data should be recorded
 Sample values should be included
 As more metadata is populated into the
modeling tool it becomes increasingly
important to be able to share this data
across ETL and Data Access tools

March 21, 2009 DW Architecture Best Practices 32


cohesion
institute

Metadata Architecture
 The strategy for sharing data model
and other metadata should be
formalized and documented
 Metadata management tools should
be considered & the overall
metadata architecture should be
carefully planned

March 21, 2009 DW Architecture Best Practices 33


cohesion
institute

Best Practice #3
 Design a metadata architecture
that allows sharing of metadata
between components of your DW
 consider metadata standards such as
OMG’s Common Warehouse
Metamodel (CWM)

March 21, 2009 DW Architecture Best Practices 34


cohesion
institute

Alternative Architecture
Approaches
 Bill Inmon: “Corporate Information
Factory”
 Hub and Spoke philosophy
 “JBOC” – just a bunch of cubes
 Let it evolve naturally

March 21, 2009 DW Architecture Best Practices 35


cohesion
institute

What We Want
(Architectural Principal)
 In most cases, business and IT
agree that the data warehouse
should provide a ‘single version of
the truth’
 Any approach that can result in
disparate data marts or cubes is
undesireable
 This is known as data silos or…

March 21, 2009 DW Architecture Best Practices 36


cohesion
institute

Enterprise DW Architecture
 how to design an enterprise data
warehouse and ensure a ‘single
version of the truth’?
 according to Kimball:
 start with an overall data architecture
phase
 use “Data Warehouse Bus” design to
integrate multiple data marts
 use incremental approach by building one
data mart at a time
March 21, 2009 DW Architecture Best Practices 37
cohesion
institute

Data Warehouse Bus


Architecture
 named for the bus in a computer
 standard interface that allows you to plug
in cdrom, disk drive, etc.
 these peripherals work together smoothly
 provides framework for data marts to
fit together
 allows separate data marts to be
implemented by different groups, even
at different times
March 21, 2009 DW Architecture Best Practices 38
cohesion
institute

Data Mart Definition


 data mart is a complete subset of the
overall data warehouse
 a single business process OR
 a group of related business processes
 think of a data mart as a collection of
related fact tables sharing conformed
dimensions, aka a ‘fact constellation’

March 21, 2009 DW Architecture Best Practices 39


cohesion
institute

Designing The DW Bus


 determine which dimensions will be
shared across multiple data marts
 conform the shared dimensions
 produce a master suite of shared
dimensions
 determine which facts will be shared
across data marts
 conform the facts
 standardize the definitions of facts
March 21, 2009 DW Architecture Best Practices 40
cohesion
institute

Dimension Granularity
 conformed dimensions will usually
be granular
 makes it easy to integrate with various
base level fact tables
 easy to extend fact table by adding
new facts
 no need to drop or reload fact tables,
and no keys have to be changed

March 21, 2009 DW Architecture Best Practices 41


cohesion
institute

Conforming Dimensions
 by adhering to standards, the separate
data marts can be plugged together
 e.g. customer, product, time

 they can even share data usefully, for


example in a drill across report
 ensures reports or queries from different
data marts share the same context

March 21, 2009 DW Architecture Best Practices 42


cohesion
institute

Conforming Dimensions (cont’d)

 accomplish this by adding any dimension


attribute(s) needed in any data mart(s)
to the standard dimension definition
 attributes not needed everywhere can always be
ignored
 typically harder to determine how to load
conformed dimensions than to design
them initially
 need a single integrated ETL process
 what is the SOR for each attribute?
 how do we deal with attributes for which there is
more than one possible SOR?
March 21, 2009 DW Architecture Best Practices 43
cohesion
institute

Conforming Facts

 in an enterprise, some metrics may


not have the same generally
accepted definition across all
business units
 conforming facts is generally a
bigger design challenge than
conforming dimensions
 why?

March 21, 2009 DW Architecture Best Practices 44


cohesion
institute

Conforming Facts - Benefits


 ensures the constituent data marts can as
clearly as possible represent fact data
expressed on the same basis using consistent
definitions
 ensures reports or queries from different data
marts share consistent content
 success of an Enterprise DW hinges on
successfully conformed facts
 any perceived inconsistencies in fact definitions across
data marts will generally be considered to be a DW
bug or data problem by users
 if users don’t have full confidence in data quality they
may stop using the DW

March 21, 2009 DW Architecture Best Practices 45


cohesion
institute

Data Consolidation
 a current trend in BI/DW is ‘data
consolidation’
 from a software vendor
perspective, it is tempting to
simplify this:
 ‘we can keep all the tables for all your
disparate applications in one physical
database’
March 21, 2009 DW Architecture Best Practices 46
cohesion
institute

Data Integration
 To truly achieve ‘a single version of
the truth’, must do more than simply
consolidating application databases
 Must integrate data models and
establish common terms of reference

March 21, 2009 DW Architecture Best Practices 47


cohesion
institute

Best Practice #4
 Take an approach that consolidates
data into ‘a single version of the
truth’
 Data Warehouse Bus
 conformed dimensions & facts
 OR?

March 21, 2009 DW Architecture Best Practices 48


cohesion
institute

Operational Data Store (ODS)


 a single point of integration for disparate
operational systems
 contains integrated data at the most
detailed level (transactional)
 may be loaded in ‘near real time’ or
periodically
 can be used for centralized operational
reporting

March 21, 2009 DW Architecture Best Practices 49


cohesion
institute

Role of an ODS in DW
Architecture
 In the case where an ODS is a
necessary component of the overall
DW, it should be carefully integrated
into the overall architecture
 Can also be used for:
 Staging area
 Master/reference data management
 Etc…
March 21, 2009 DW Architecture Best Practices 50
cohesion
institute

ODS Data Model


 Not clear if any design approach for
an ODS data model has emerged
as a best practice
 normalized
 dimensional
 denormalized/hybrid
 any suggestions?

March 21, 2009 DW Architecture Best Practices 51


cohesion
institute

Best Practice #5
 Consider implementing an ODS only
when information retrieval requirements
are near the bottom of the data
abstraction pyramid and/or when there
are multiple operational sources that
need to be accessed
 Must ensure that the data model is
integrated, not just consolidated
 May consider 3NF data model
 Avoid at all costs a ‘data dumping ground’

March 21, 2009 DW Architecture Best Practices 52


cohesion
institute

Capacity Planning
 DW workloads are typically very
demanding, especially for I/O capacity
 Successful implementations tend to grow
very quickly, both in number of users
and data volume
 Rules of thumb do exist for sizing the
hardware platform to provide adequate
initial performance
 typically based on estimated ‘raw’ data size
of proposed database e.g. 100-150 Gb per
modern CPU

March 21, 2009 DW Architecture Best Practices 53


cohesion
institute

SMP Server Scale Up


 Scaling performance within a single SMP
server is referred to as ‘scale up’
 Database benchmarks suggest Windows
scalability is near that of Linux
 IBM claims near-linear scalability for Linux
(on commodity hardware) up to about 4
processors
 Probably not cost effective to scale up Linux
much beyond 4 processors
 IBM claims near-linear scalability for AIX
on POWER5 up to about 8 processors
March 21, 2009 DW Architecture Best Practices 54
cohesion
institute

Scale Out
 There is an increasing trend in IT to ‘scale out’
processing capacity by deploying many small,
commodity servers rather than a single large
SMP system
 This strategy tends to work well for relatively
simple applications such as network or web
servers
 For very complex workloads such as a data
warehouse, this strategy is much more difficult
to effectively implement
 Especially so for the database server itself

March 21, 2009 DW Architecture Best Practices 55


cohesion
institute

Scale Up vs. Scale Out


 To obtain the total number of processors
required for the estimated DW workload,
must plan either to scale up or scale out
 Both options are viable but, all other
things being equal, scaling up is less
disruptive to end users and requires less
work to implement
 scaling up can offer lower hardware
investment, if practical
 however, network bandwidth or latency
issues can limit effectiveness of parallelism

March 21, 2009 DW Architecture Best Practices 56


cohesion
institute

Best Practice #6
 Create a capacity plan for your BI
application & monitor it carefully
 Consider future additional performance
demands
 Establish standard performance benchmark
queries and regularly run them
 Implement capacity monitoring tools
 Build scalability into your architecture
 May need to allow for scaling both up and
out!

March 21, 2009 DW Architecture Best Practices 57


cohesion
institute

Open Source Affordability


 Another emerging trend in IT generally is to
utilize Open Source software running on
commodity hardware
 this is expected to offer lower total cost of ownership
 certainly, GNU/Linux and other Open Source
initiatives do provide very good functionality and
quality for minimal cost
 This trend also applies to BI & DW:
 most traditional rdbms’s are now supported on Linux
 however, open source rdbms’s lag behind on
providing good performance for DW queries

March 21, 2009 DW Architecture Best Practices 58


cohesion
institute

DW Appliances
 DW appliances, consisting of
packaged solutions providing all
required software and hardware, are
beginning to offer very promising
price/performance
 production experience is limited so
far, so this is not yet a ‘best practice’

March 21, 2009 DW Architecture Best Practices 59


cohesion institute

Data Warehouse
Architecture Best Practices

Q&A
cohesion institute

cohesion systems consulting inc

the modern art


of data abstraction

You might also like