
Data Warehouse Design

Enrico Franconi
CS 636
Implementing a Warehouse
• Monitoring: Sending data from sources
• Integrating: Loading, cleansing,...
• Processing: Query processing, indexing, ...
• Managing: Metadata, Design, ...

CS 336 2
Monitoring
• Source Types: relational, flat file, IMS,
VSAM, IDMS, WWW, news-wire, …
• How to get data out?
− Replication tool
− Dump file
− Create report
− ODBC or third-party “wrappers”

Monitoring Techniques
• Periodic snapshots
• Database triggers
• Log shipping
• Data shipping (replication service)
• Transaction shipping
• Polling (queries to source)
• Screen scraping
• Application level monitoring

Monitoring Issues
• Frequency
− periodic: daily, weekly, …
− triggered: on “big” change, lots of changes, ...
• Data transformation
− convert data to uniform format
− remove & add fields (e.g., add date to get history)
• Standards (e.g., ODBC)
• Gateways

Wrapper
Converts data and queries from one data model to another

[Figure: a wrapper between Data Model A and Data Model B, with arrows labeled Queries and Data]

Extends query capabilities for sources with limited capabilities

[Figure: Queries → Wrapper → Source]
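A minimal sketch of both wrapper roles, assuming a flat-file (CSV) source that can only be read front to back. The class name FlatFileWrapper, its select method, and the sample data are hypothetical and not taken from the slides.

```python
import csv
import io

# Hypothetical flat-file source: it can only be scanned sequentially.
FLAT_FILE = "item,clerk\nComputer,Mary\nPrinter,John\nMonitor,Mary\n"


class FlatFileWrapper:
    """Presents a CSV text source as a queryable set of tuples."""

    def __init__(self, text):
        self._text = text

    def scan(self):
        # Data conversion: each CSV line becomes a dict keyed by column name.
        return list(csv.DictReader(io.StringIO(self._text)))

    def select(self, **conditions):
        # Query extension: equality selection is evaluated inside the wrapper
        # because the flat file has no query capability of its own.
        return [
            row for row in self.scan()
            if all(row.get(col) == val for col, val in conditions.items())
        ]


wrapper = FlatFileWrapper(FLAT_FILE)
mary_sales = wrapper.select(clerk="Mary")
```

The caller sees a relational-style interface; the full scan that answers each query stays hidden inside the wrapper.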

Wrapper Generation

• Solution 1: Hard code for each source
• Solution 2: Automatic wrapper generation

[Figure: Definition → Wrapper Generator → Wrapper]

Integration
• Data Cleaning
• Data Loading
• Derived Data

[Figure: warehouse architecture, bottom to top: Source, Source, Source; Integration; Warehouse (with Metadata); Query & Analysis; Client, Client]

Data Integration
• Receive data (changes) from multiple wrappers/monitors and integrate into warehouse
• Rule-based
• Actions
− Resolve inconsistencies
− Eliminate duplicates
− Integrate into warehouse (which may not be empty)
− Summarize data
− Fetch more data from sources (for warehouse updates)
− etc.

Data Cleaning

• Find (& remove) duplicate tuples
− e.g., Jane Doe vs. Jane Q. Doe
• Detect inconsistent, wrong data
− Attribute values that don’t match
• Patch missing, unreadable data
− Insert default values
• Notify sources of errors found
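A rough sketch of these cleaning steps. The matching rule (same first and last name, middle initials ignored) and the default value are assumptions chosen for illustration; real duplicate detection needs more careful rules, since this one can also merge distinct people.

```python
def name_key(name):
    # Assumed matching rule: compare first and last name only, ignoring
    # case, periods, and middle names or initials.
    parts = name.replace(".", "").lower().split()
    return (parts[0], parts[-1])


def clean(records, default_city="UNKNOWN"):
    seen = {}
    duplicates = []
    for rec in records:
        rec = dict(rec)
        # Patch missing data with a default value.
        if not rec.get("city"):
            rec["city"] = default_city
        key = name_key(rec["name"])
        if key in seen:
            # Record the duplicate pair so the source can be notified.
            duplicates.append((seen[key]["name"], rec["name"]))
        else:
            seen[key] = rec
    return list(seen.values()), duplicates


records = [
    {"name": "Jane Doe", "city": "Palo Alto"},
    {"name": "Jane Q. Doe", "city": "Palo Alto"},
    {"name": "John Roe", "city": None},
]
cleaned, duplicates = clean(records)
```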

Data Cleaning
• Migration (e.g., yen to dollars)
• Scrubbing: use domain-specific knowledge (e.g., social security numbers)
• Fusion (e.g., mail list, customer merging)
− billing DB: customer1(Joe) + service DB: customer2(Joe) → merged_customer(Joe)
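Fusion can be sketched as follows, assuming customers from the two databases are matched by an identical name. The field names (balance_usd, plan) and the extra customer Ann are invented for the example.

```python
def fuse(billing, service):
    """Merge per-source customer records that share a customer name."""
    merged = {}
    for source_name, customers in (("billing", billing), ("service", service)):
        for cust in customers:
            entry = merged.setdefault(
                cust["name"], {"name": cust["name"], "sources": []}
            )
            entry["sources"].append(source_name)
            # Later sources only fill in attributes that are still missing.
            for field, value in cust.items():
                entry.setdefault(field, value)
    return merged


billing = [{"name": "Joe", "balance_usd": 40.0}]
service = [{"name": "Joe", "plan": "basic"}, {"name": "Ann", "plan": "pro"}]
merged_customer = fuse(billing, service)
```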

Loading Data in the Warehouse
• Incremental vs. refresh
• Off-line vs. on-line
• Frequency of loading
− At night, 1x a week/month, continuously
• Parallel/Partitioned load
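The incremental vs. refresh choice can be illustrated with a small sketch. The row layout and the two function names are assumptions; the point is only that applying the reported changes should leave the warehouse in the same state as a full reload.

```python
def refresh_load(source_rows):
    # Refresh: discard the old contents and rebuild from a full snapshot.
    return {row["id"]: row for row in source_rows}


def incremental_load(warehouse, inserted=(), deleted_ids=()):
    # Incremental: apply only the changes reported since the last load.
    for row_id in deleted_ids:
        warehouse.pop(row_id, None)
    for row in inserted:
        warehouse[row["id"]] = row
    return warehouse


snapshot_monday = [{"id": 1, "amt": 12}, {"id": 2, "amt": 11}]
snapshot_tuesday = [{"id": 2, "amt": 11}, {"id": 3, "amt": 50}]

warehouse = refresh_load(snapshot_monday)
warehouse = incremental_load(
    warehouse, inserted=[{"id": 3, "amt": 50}], deleted_ids=[1]
)
```

Incremental loading touches two rows here, while a refresh would re-read the whole source.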

Warehouse Maintenance

• Warehouse data ≈ materialized view
− Initial loading
− View maintenance
• Derived Warehouse Data
− indexes
− aggregates
− materialized views
• View maintenance

Materialized Views
• Define new warehouse relations using SQL expressions

sale     prodId  storeId  date  amt
         p1      c1       1     12
         p2      c1       1     11
         p1      c3       1     50
         p2      c2       1     8
         p1      c1       2     44
         p1      c2       2     4

product  id  name  price
         p1  bolt  10
         p2  nut   5

joinTb   prodId  name  price  storeId  date  amt
         p1      bolt  10     c1       1     12
         p2      nut   5      c1       1     11
         p1      bolt  10     c3       1     50
         p2      nut   5      c2       1     8
         p1      bolt  10     c1       2     44
         p1      bolt  10     c2       2     4

(joinTb does not exist at any source)
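A sketch of how joinTb could be defined and populated, using Python's sqlite3 module as a stand-in for the warehouse DBMS (an assumption; the slides do not name a system). SQLite has no materialized views, so the result is stored with CREATE TABLE ... AS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sale (prodId TEXT, storeId TEXT, date INTEGER, amt INTEGER);
    CREATE TABLE product (id TEXT, name TEXT, price INTEGER);
    INSERT INTO sale VALUES
        ('p1','c1',1,12), ('p2','c1',1,11), ('p1','c3',1,50),
        ('p2','c2',1,8),  ('p1','c1',2,44), ('p1','c2',2,4);
    INSERT INTO product VALUES ('p1','bolt',10), ('p2','nut',5);

    -- The join result is stored as an ordinary table, i.e. materialized.
    CREATE TABLE joinTb AS
        SELECT s.prodId, p.name, p.price, s.storeId, s.date, s.amt
        FROM sale AS s JOIN product AS p ON s.prodId = p.id;
""")

rows = conn.execute(
    "SELECT * FROM joinTb ORDER BY date, storeId, prodId"
).fetchall()
```

Because joinTb is a stored copy, later changes to sale or product are not reflected until the view is maintained.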

Differs from Conventional View Maintenance...

• Warehouses may be highly aggregated and summarized
• Warehouse views may be over history of base data
• Process large batch updates
• Schema may evolve

Differs from Conventional View Maintenance...
• Base data doesn’t participate in view maintenance
− Simply reports changes
− Loosely coupled
− Absence of locking, global transactions
− May not be queryable

Warehouse Maintenance Anomalies
• Materialized view maintenance in loosely coupled, non-transactional environment
• Simple example
− Data Warehouse: Sold(item,clerk,age), with Sold = Sale ⋈ Emp
− Integrator between the warehouse and the sources
− Sources: Sales with Sale(item,clerk), Comp. with Emp(clerk,age)

Warehouse Maintenance Anomalies
[Figure: Data Warehouse with Sold(item,clerk,age); Integrator; sources Sales with Sale(item,clerk) and Comp. with Emp(clerk,age)]
1. Insert into Emp(Mary,25), notify integrator
2. Insert into Sale(Computer,Mary), notify integrator
3. (1) → integrator adds Sale ⋈ (Mary,25)
4. (2) → integrator adds (Computer,Mary) ⋈ Emp
5. View incorrect (duplicate tuple)
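The five steps can be replayed in a few lines. This is a simulation under simplifying assumptions: the sources are in-memory lists, the view is kept as a bag, and the integrator queries the sources only after both updates have been applied.

```python
# Source relations (held at the sources, not at the warehouse).
sale = []   # Sale(item, clerk)
emp = []    # Emp(clerk, age)
sold = []   # Warehouse view Sold(item, clerk, age), kept as a bag

notifications = []

# Steps 1 and 2: both source updates happen before the integrator reacts.
emp.append(("Mary", 25))
notifications.append(("Emp", ("Mary", 25)))
sale.append(("Computer", "Mary"))
notifications.append(("Sale", ("Computer", "Mary")))

# Steps 3 and 4: each notification is handled by joining the changed tuple
# with the OTHER source relation, which already reflects both updates.
for relation, row in notifications:
    if relation == "Emp":
        clerk, age = row
        sold += [(item, c, age) for (item, c) in sale if c == clerk]
    else:
        item, clerk = row
        sold += [(item, clerk, age) for (c, age) in emp if c == clerk]

# Step 5: the same tuple was added twice.
duplicate_count = len(sold) - len(set(sold))
```

The anomaly arises because the query in step 3 sees the effect of update 2, which the integrator then applies a second time in step 4.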

Maintenance Anomaly - Solutions
• Incremental update algorithms (ECA, Strobe, etc.)
• Research issues: Self-maintainable views
− What views are self-maintainable
− Store auxiliary views so original + auxiliary views are self-maintainable

Self-Maintainability: Examples
Sold(item,clerk,age) = Sale(item,clerk) ⋈ Emp(clerk,age)
• Inserts into Emp
− If Emp.clerk is key and Sale.clerk is foreign key (with ref. int.) then no effect
• Inserts into Sale
− Maintain auxiliary view: Emp − Π_{clerk,age}(Sold)
• Deletes from Emp
− Delete from Sold based on clerk

Self-Maintainability: Examples
• Deletes from Sale
− Delete from Sold based on {item,clerk}
− Unless age at time of sale is relevant
• Auxiliary views for self-maintainability
− Must themselves be self-maintainable
− One solution: all source data
− But want minimal set

Partial Self-Maintainability

• Avoid (but don’t prohibit) going to sources
Sold = Sale(item,clerk) ⋈ Emp(clerk,age)
• Inserts into Sale
− Check if clerk already in Sold, go to source if not
− Or replicate all clerks over age 30
− Or ...
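A sketch of the first option: look for the clerk in Sold and query the source only on a miss. The dict emp_source stands in for a remote Emp source, and source_log merely records when that source had to be contacted; both are illustrative assumptions.

```python
def insert_sale(sold, item, clerk, emp_source, source_log):
    # Reuse the age from an existing Sold tuple for this clerk if there is one.
    for (_, known_clerk, age) in sold:
        if known_clerk == clerk:
            break
    else:
        # Clerk not yet in Sold: fall back to a (simulated) source query.
        source_log.append(clerk)
        age = emp_source[clerk]
    sold.append((item, clerk, age))


emp_source = {"Mary": 25, "John": 41}
source_log = []
sold = [("Computer", "Mary", 25)]

insert_sale(sold, "Printer", "Mary", emp_source, source_log)  # answered locally
insert_sale(sold, "Monitor", "John", emp_source, source_log)  # needs the source
```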

Warehouse Specification (ideally)
[Figure: boxes labeled View Definitions, Integration rules, Change Detection Requirements, Warehouse Configuration Module, Warehouse, Integrator, Metadata, and three Extractor/Monitor components]

Processing
• ROLAP servers vs. MOLAP servers
• Index Structures
• What to Materialize?
• Algorithms

[Figure: warehouse architecture, bottom to top: Source, Source, Source; Integration; Warehouse (with Metadata); Query & Analysis; Client, Client]

ROLAP Server
• Relational OLAP Server

sale  prodId  date  sum
      p1      1     62
      p2      1     19
      p1      2     48

[Figure: tools and utilities on top of a ROLAP server, which sits on a relational DBMS]
• Special indices, tuning
• Schema is “denormalized”

MOLAP Server
• Multi-Dimensional OLAP Server

[Figure: Sales data cube with dimensions Product (milk, soda, eggs, soap), Date (1, 2, 3, 4), and City (A, B); M.D. tools and utilities on top of a multi-dimensional server, which could also sit on a relational DBMS]

Index Structures (sketch)
• Traditional Access Methods
− B-trees, hash tables, R-trees, grids, …
• Popular in Warehouses
− inverted lists
− bit map indexes
− join indexes
− text indexes
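As an illustration of one of these structures, here is a toy bitmap index over the sale rows, with each bitmap held in a Python int. This representation is an assumption made for brevity; production systems typically use compressed bit vectors.

```python
def build_bitmap_index(rows, column):
    """Map each distinct value to an int whose bit i is set if row i has it."""
    index = {}
    for i, row in enumerate(rows):
        index[row[column]] = index.get(row[column], 0) | (1 << i)
    return index


def matching_rows(bitmap):
    # Decode a bitmap back into the list of row positions it marks.
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


sale = [
    {"prodId": "p1", "storeId": "c1"}, {"prodId": "p2", "storeId": "c1"},
    {"prodId": "p1", "storeId": "c3"}, {"prodId": "p2", "storeId": "c2"},
    {"prodId": "p1", "storeId": "c1"}, {"prodId": "p1", "storeId": "c2"},
]
by_prod = build_bitmap_index(sale, "prodId")
by_store = build_bitmap_index(sale, "storeId")

# prodId = 'p1' AND storeId = 'c1' becomes a single bitwise AND.
hits = matching_rows(by_prod["p1"] & by_store["c1"])
```

Bitmap indexes suit warehouse columns with few distinct values, where conjunctive predicates reduce to cheap bitwise operations.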

What to Materialize?
• Store in warehouse results useful for
common queries
• Example: total sales

day 1    c1   c2   c3
p1       12        50
p2       11    8

day 2    c1   c2   c3
p1       44    4
p2

Summed over days:
         c1   c2   c3
p1       56    4   50
p2       11    8

Summed over days and products:
         c1   c2   c3
         67   12   50

Summed over days and cities:
p1      110
p2       19

Grand total: 129

[Figure label: materialize]
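The totals in this example can be recomputed from the sale table with a short sketch. The aggregate helper and the dimension names are assumptions for illustration; the last step shows why a stored aggregate is useful, since a coarser one can be derived from it without rescanning the base data.

```python
from collections import defaultdict

# (prodId, storeId, date, amt) rows from the sale table.
SALE = [
    ("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8), ("p1", "c1", 2, 44), ("p1", "c2", 2, 4),
]
COLUMNS = {"product": 0, "city": 1, "date": 2}


def aggregate(rows, group_by):
    """Sum amt grouped by the named dimensions (empty tuple = grand total)."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[COLUMNS[dim]] for dim in group_by)
        totals[key] += row[3]
    return dict(totals)


by_product_city = aggregate(SALE, ("product", "city"))
by_city = aggregate(SALE, ("city",))
by_product = aggregate(SALE, ("product",))
grand_total = aggregate(SALE, ())[()]

# A coarser aggregate can be answered from a materialized finer one.
by_city_from_view = defaultdict(int)
for (prod, city), amt in by_product_city.items():
    by_city_from_view[(city,)] += amt
```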

Materialization Factors
• Type/frequency of queries
• Query response time
• Storage cost
• Update cost

Cube Aggregates Lattice
[Figure: lattice of cube aggregates, coarsest at the top]
all (grand total: 129)
city | product | date (e.g., by city: c1 67, c2 12, c3 50)
city, product | city, date | product, date (e.g., by city and product: p1 56 4 50; p2 11 8)
city, product, date (the day 1 and day 2 tables)

• Use greedy algorithm to decide what to materialize
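One possible greedy scheme, sketched under stated assumptions: the cost of answering a query is the row count of the smallest materialized view that contains its dimensions, and each step picks the view with the largest total cost reduction. The slides do not say which greedy algorithm is meant, so treat this as an illustration only.

```python
from itertools import combinations

# (prodId, storeId, date, amt) rows from the sale table.
SALE = [
    ("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8), ("p1", "c1", 2, 44), ("p1", "c2", 2, 4),
]
DIMS = {"product": 0, "city": 1, "date": 2}


def view_size(dims):
    # Number of rows in the group-by over these dimensions.
    return len({tuple(row[DIMS[d]] for d in sorted(dims)) for row in SALE})


# Every subset of the dimensions is one node of the lattice.
LATTICE = [frozenset(c) for n in range(len(DIMS) + 1)
           for c in combinations(sorted(DIMS), n)]
SIZE = {view: view_size(view) for view in LATTICE}


def query_cost(view, materialized):
    # A view is answered from the smallest materialized view containing it.
    return min(SIZE[m] for m in materialized if view <= m)


def benefit(candidate, materialized):
    # Total cost saved over all views that the candidate could answer.
    return sum(
        max(0, query_cost(w, materialized) - SIZE[candidate])
        for w in LATTICE if w <= candidate
    )


def greedy_select(k):
    chosen = [frozenset(DIMS)]  # the finest view is always materialized
    for _ in range(k):
        rest = [v for v in LATTICE if v not in chosen]
        chosen.append(max(rest, key=lambda v: benefit(v, chosen)))
    return chosen


selected = greedy_select(2)
```

On this tiny data set the first pick is (product, date), which has only 3 rows and can answer four of the eight views.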

Dimension Hierarchies

[Figure: hierarchy for the city dimension: city → state → all]

cities  city  state
        c1    CA
        c2    NY

Dimension Hierarchies

[Figure: aggregate lattice extended with the state level; not all arcs shown]
all
city | product | date | state
city, product | city, date | product, date | state, date | state, product
city, product, date | state, product, date

Interesting Hierarchy
time  day  week  month  quarter  year
      1    1     1      1        2000
      2    1     1      1        2000
      3    1     1      1        2000
      4    1     1      1        2000
      5    1     1      1        2000
      6    1     1      1        2000
      7    1     1      1        2000
      8    2     1      1        2000

(conceptual dimension table)

[Figure: hierarchy with levels days, weeks, months, quarters, years, all]
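The conceptual dimension table can be generated rather than stored by hand. This sketch assumes the convention visible in the table (week 1 is days 1 to 7 of the year) and uses the calendar year 2000. It also checks one reason the hierarchy is interesting: weeks do not nest inside months.

```python
from datetime import date, timedelta


def time_dimension_row(day_of_year, year=2000):
    d = date(year, 1, 1) + timedelta(days=day_of_year - 1)
    return {
        "day": day_of_year,
        # Assumed convention, matching the table: week 1 is days 1-7.
        "week": (day_of_year - 1) // 7 + 1,
        "month": d.month,
        "quarter": (d.month - 1) // 3 + 1,
        "year": d.year,
    }


table = [time_dimension_row(n) for n in range(1, 367)]  # 2000 is a leap year

# Collect the months each week touches; a week spanning two months cannot
# be rolled up into a single month.
months_per_week = {}
for row in table:
    months_per_week.setdefault(row["week"], set()).add(row["month"])
split_weeks = [w for w, months in months_per_week.items() if len(months) > 1]
```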

Managing
• Metadata
• Warehouse Design
• Tools

[Figure: warehouse architecture, bottom to top: Source, Source, Source; Integration; Warehouse (with Metadata); Query & Analysis; Client, Client]

Metadata
• Administrative
− definition of sources, tools, ...
− schemas, dimension hierarchies, …
− rules for extraction, cleaning, …
− refresh, purging policies
− user profiles, access control, ...

Metadata
• Business
− business terms & definitions
− data ownership, charging
• Operational
− data lineage
− data currency (e.g., active, archived, purged)
− use stats, error reports, audit trails

Design Summary
• What data is needed?
• Where does it come from?
• How to clean data?
• How to represent in warehouse (schema)?
• What to summarize?
• What to materialize?
• What to index?

Tools
• Development
− design & edit: schemas, views, scripts, rules, queries, reports
• Planning & Analysis
− what-if scenarios (schema changes, refresh rates), capacity planning
• Warehouse Management
− performance monitoring, usage patterns, exception reporting
• System & Network Management
− measure traffic (sources, warehouse, clients)
• Workflow Management
− “reliable scripts” for cleaning & analyzing data

Current State of Industry
• Extraction and integration done off-line
− Usually in large, time-consuming, batches
• Everything copied at warehouse
− Not selective about what is stored
− Query benefit vs storage & update cost
• Query optimization aimed at OLTP
− High throughput instead of fast response
− Process whole query before displaying anything

State of Commercial Practice ...
• Connectivity to sources
− Apertus
− Information Builders
− Informix Enterprise Gateway
− Oracle Open Connect
− CA-Ingres gateway
− MS ODBC
− Platinum InfoHub
• Data extract, clean, transform, refresh
− CA-Ingres Replicator
− ETI-Extract
− IBM Data Joiner, Data Propagator
− Prism Warehouse manager
− SAS Access
− Sybase Replication Server
− Trinzic InfoPump

… State of Commercial Practice ...
• Multidimensional Database Engines
− Arbor Essbase
− Oracle IRI Express
− Comshare Commander
− SAS System
• ROLAP Servers
− HP Intelligent Warehouse
− Informix Metacube
− MicroStrategy DSS Server
− Information Advantage Axsys
• Warehouse Data Servers
− CA-Ingres
− Oracle 8
− RedBrick
− Sybase IQ
− Informix Dynamic Server
− IBM DB2

… State of Commercial Practice
• Query/Reporting Environments
− IBM DataGuide
− SAS Access
− CA Visual Express
− Platinum Forest&Trees
− Informix ViewPoint
• Multidimensional Analysis
− Kenan Systems Acumate
− Microsoft Excel
− Arbor Essbase Analysis server
− Cognos PowerPlay
− IQ Software IQ/Vision
− Lotus 123
− SAS OLAP++
− Business Objects

• Lots and lots of consulting!!

Future Directions
• Better performance
• Larger warehouses
• Easier to use
• What are companies & research labs working on?

Research (1)
• Incremental Maintenance
• Data Consistency
• Data Expiration
• Recovery
• Data Quality
• Error Handling (Back Flush)

Research (2)
• Rapid Monitor Construction
• Temporal Warehouses
• Materialization & Index Selection
• Data Fusion
• Data Mining
• Integration of Text & Relational Data
• Conceptual Modelling

Conclusions
• Massive amounts of data and
complexity of queries will push limits
of current warehouses
• Need better systems:
− easier to use
− provide quality information
