DW Design

Data Warehouse Design
Enrico Franconi
CS 636
Implementing a Warehouse
• Monitoring: Sending data from sources
• Integrating: Loading, cleansing,...
• Processing: Query processing, indexing, ...
• Managing: Metadata, Design, ...
CS 336 2
Monitoring
• Source Types: relational, flat file, IMS,
VSAM, IDMS, WWW, news-wire, …
• How to get data out?
− Replication tool
− Dump file
− Create report
− ODBC or third-party “wrappers”
Monitoring Techniques
• Periodic snapshots
• Database triggers
• Log shipping
• Data shipping (replication service)
• Transaction shipping
• Polling (queries to source)
• Screen scraping
• Application level monitoring
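As an illustration of the first technique, periodic snapshots can be compared to derive the changes to ship to the warehouse. This is a minimal sketch, assuming each snapshot fits in memory as a dict keyed by primary key; `snapshot_delta` and the sample rows are hypothetical and not part of the slides.

```python
def snapshot_delta(old, new):
    """Compare two snapshots (dicts keyed by primary key) and return the changes."""
    inserts = {k: v for k, v in new.items() if k not in old}
    deletes = {k: v for k, v in old.items() if k not in new}
    updates = {k: (old[k], new[k])
               for k in old.keys() & new.keys() if old[k] != new[k]}
    return inserts, deletes, updates

# Two snapshots of a product table taken on consecutive days (made-up data).
monday = {"p1": ("bolt", 10), "p2": ("nut", 5)}
tuesday = {"p1": ("bolt", 12), "p3": ("screw", 7)}

inserts, deletes, updates = snapshot_delta(monday, tuesday)
```

Triggers, log shipping, and replication avoid the full comparison by reporting changes as they happen, at the cost of tighter coupling to the source.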
Monitoring Issues
• Frequency
− periodic: daily, weekly, …
− triggered: on “big” change, lots of changes, ...
• Data transformation
− convert data to uniform format
− remove & add fields (e.g., add date to get history)
• Standards (e.g., ODBC)
• Gateways
Wrapper
Converts data and queries from one data model to another
[Figure: Data Model A and Data Model B, exchanging queries and data]
Extends query capabilities for sources with limited capabilities
[Figure: Queries → Wrapper → Source]
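A wrapper of the second kind can be sketched as follows. The source here is a flat file that can only be scanned in full, and the wrapper adds selection and projection on top of it. The class names `FlatFileSource` and `Wrapper` and the file contents are assumptions made for illustration.

```python
import csv
import io

class FlatFileSource:
    """A source with limited capabilities: it can only return every row."""
    def __init__(self, text):
        self._text = text

    def scan(self):
        return list(csv.DictReader(io.StringIO(self._text)))

class Wrapper:
    """Presents the flat file as a relation supporting selection and projection."""
    def __init__(self, source):
        self._source = source

    def query(self, columns, where=lambda row: True):
        return [tuple(row[c] for c in columns)
                for row in self._source.scan() if where(row)]

source = FlatFileSource(
    "prodId,storeId,date,amt\np1,c1,1,12\np2,c1,1,11\np1,c3,1,50\n")
wrapper = Wrapper(source)

# The wrapper evaluates the predicate that the source itself cannot.
rows = wrapper.query(["storeId", "amt"], where=lambda r: r["prodId"] == "p1")
```

Values come back as strings because the flat file carries no types; converting them to the warehouse's data model is the other half of the wrapper's job.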
Wrapper Generation
• Solution 1: Hard code for each source

• Solution 2: Automatic wrapper generation
[Figure: Definition → Wrapper Generator → Wrapper]
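Solution 2 can be sketched as a function that turns a declarative source description into a wrapper. The definition format used here (a delimiter plus a list of typed fields) is an assumption for illustration; the slides do not specify one.

```python
def generate_wrapper(definition):
    """Build a parsing function from a declarative source description."""
    delimiter = definition["delimiter"]
    fields = definition["fields"]  # list of (name, converter) pairs

    def wrapper(line):
        parts = line.rstrip("\n").split(delimiter)
        return {name: convert(value)
                for (name, convert), value in zip(fields, parts)}

    return wrapper

# Hypothetical definition of a pipe-delimited sales file.
sale_definition = {
    "delimiter": "|",
    "fields": [("prodId", str), ("storeId", str), ("date", int), ("amt", int)],
}

parse_sale = generate_wrapper(sale_definition)
record = parse_sale("p1|c1|1|12\n")
```

Supporting a new source then means writing a new definition rather than new code, which is the point of the generator.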
Integration
• Data Cleaning
• Data Loading
• Derived Data
[Figure: warehouse architecture: Sources → Integration → Warehouse (with Metadata) → Query & Analysis → Clients]
Data Integration
• Receive data (changes) from multiple
wrappers/monitors and integrate into warehouse
• Rule-based
• Actions
− Resolve inconsistencies
− Eliminate duplicates
− Integrate into warehouse (may not be empty)
− Summarize data
− Fetch more data from sources (wh updates)
− etc.
Data Cleaning
• Find (& remove) duplicate tuples

− e.g., Jane Doe vs. Jane Q. Doe
• Detect inconsistent, wrong data
− Attribute values that don’t match
• Patch missing, unreadable data
− Insert default values
• Notify sources of errors found
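Two of these steps, duplicate removal and patching missing values, can be sketched as below. The matching rule (same first and last name, ignoring middle initials) is a deliberately crude assumption that would also merge distinct people with the same name; the records other than Jane Doe are made up.

```python
def name_key(name):
    """Normalize a name to (first, last), ignoring case, dots and middle parts."""
    parts = name.replace(".", "").lower().split()
    return (parts[0], parts[-1])

def clean(records, defaults):
    seen = set()
    cleaned = []
    for record in records:
        key = name_key(record["name"])
        if key in seen:              # likely duplicate: keep the first occurrence
            continue
        seen.add(key)
        # Patch missing values with the configured defaults.
        patched = {field: record.get(field) or default
                   for field, default in defaults.items()}
        cleaned.append({**record, **patched})
    return cleaned

records = [
    {"name": "Jane Doe", "city": "c1"},
    {"name": "Jane Q. Doe", "city": None},
    {"name": "John Roe", "city": None},
]
result = clean(records, defaults={"city": "unknown"})
```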
Data Cleaning
• Migration (e.g., yen to dollars)
• Scrubbing: use domain-specific knowledge (e.g., social
security numbers)
• Fusion (e.g., mail list, customer merging)
[Figure: billing DB customer1(Joe) and service DB customer2(Joe) are fused into merged_customer(Joe)]
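Migration and fusion can be sketched together. The exchange rate, the attribute names, and the "first source wins" conflict rule are all assumptions for illustration; only the customer Joe and the two databases come from the slide.

```python
YEN_PER_DOLLAR = 150.0  # assumed rate, for illustration only

def yen_to_dollars(amount_yen):
    """Migration: convert a value to the warehouse's uniform unit."""
    return round(amount_yen / YEN_PER_DOLLAR, 2)

def fuse(billing, service):
    """Fusion: merge customer records from two databases, keyed by name."""
    merged = {}
    for source_name, table in (("billing", billing), ("service", service)):
        for name, attrs in table.items():
            entry = merged.setdefault(name, {"sources": []})
            entry["sources"].append(source_name)
            for field, value in attrs.items():
                entry.setdefault(field, value)  # first source wins on conflict
    return merged

billing = {"Joe": {"balance_yen": 3000}}   # customer1(Joe)
service = {"Joe": {"plan": "basic"}}       # customer2(Joe)

merged_customer = fuse(billing, service)
joe = merged_customer["Joe"]
joe["balance_usd"] = yen_to_dollars(joe["balance_yen"])
```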
Loading Data in the Warehouse
• Incremental vs. refresh
• Off-line vs. on-line
• Frequency of loading
− At night, 1x a week/month, continuously
• Parallel/Partitioned load
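The difference between a refresh and an incremental load can be sketched as follows, assuming the warehouse table is a dict from (prodId, storeId, date) to amt. Both function names are hypothetical.

```python
def refresh_load(source_rows):
    """Refresh: rebuild the warehouse table from scratch."""
    return dict(source_rows)

def incremental_load(warehouse, inserts=(), deletes=()):
    """Incremental: apply only the reported changes to the existing table."""
    for key in deletes:
        warehouse.pop(key, None)
    for key, amt in inserts:
        warehouse[key] = amt
    return warehouse

day1 = [(("p1", "c1", 1), 12), (("p2", "c1", 1), 11)]
day2 = [(("p1", "c1", 2), 44)]

warehouse = refresh_load(day1)             # initial load
incremental_load(warehouse, inserts=day2)  # nightly delta
```

Both routes should end in the same state; the incremental one touches only the changed rows, which matters when the load window is short.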
Warehouse Maintenance
• Warehouse data ≈ materialized view

− Initial loading
− View maintenance
• Derived Warehouse Data
− indexes
− aggregates
− materialized views
• View maintenance
Materialized Views
• Define new warehouse relations using SQL
expressions

sale:
prodId  storeId  date  amt
p1      c1       1     12
p2      c1       1     11
p1      c3       1     50
p2      c2       1     8
p1      c1       2     44
p1      c2       2     4

product:
id  name  price
p1  bolt  10
p2  nut   5

joinTb (does not exist at any source):
prodId  name  price  storeId  date  amt
p1      bolt  10     c1       1     12
p2      nut   5      c1       1     11
p1      bolt  10     c3       1     50
p2      nut   5      c2       1     8
p1      bolt  10     c1       2     44
p1      bolt  10     c2       2     4
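The joinTb relation can be reproduced with a small script. This sketch uses SQLite from the Python standard library, which has no materialized views, so the join result is stored as an ordinary table; a warehouse DBMS would offer its own syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sale (prodId TEXT, storeId TEXT, date INTEGER, amt INTEGER);
    CREATE TABLE product (id TEXT, name TEXT, price INTEGER);
    INSERT INTO sale VALUES
        ('p1','c1',1,12), ('p2','c1',1,11), ('p1','c3',1,50),
        ('p2','c2',1,8),  ('p1','c1',2,44), ('p1','c2',2,4);
    INSERT INTO product VALUES ('p1','bolt',10), ('p2','nut',5);
    -- Store the join result, standing in for a materialized view.
    CREATE TABLE joinTb AS
        SELECT s.prodId, p.name, p.price, s.storeId, s.date, s.amt
        FROM sale AS s JOIN product AS p ON s.prodId = p.id;
""")

join_rows = conn.execute(
    "SELECT * FROM joinTb ORDER BY date, storeId, prodId").fetchall()
```

Because joinTb is a stored copy, later changes to sale or product do not reach it automatically; keeping it current is the view maintenance problem.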
Differs from Conventional View Maintenance...
• Warehouses may be highly aggregated and summarized
• Warehouse views may be over history of base data
• Process large batch updates
• Schema may evolve
Differs from Conventional View Maintenance...
• Base data doesn’t participate in view maintenance
− Simply reports changes
− Loosely coupled
− Absence of locking, global transactions
− May not be queryable
Warehouse Maintenance Anomalies
• Materialized view maintenance in loosely coupled, non-transactional environment
• Simple example
[Figure: sources Sales, with Sale(item,clerk), and Comp., with Emp(clerk,age), notify the Integrator, which maintains Sold(item,clerk,age) in the Data Warehouse]
Sold = Sale ⋈ Emp
Warehouse Maintenance Anomalies
[Figure: Sales and Comp. sources → Integrator → Data Warehouse with Sold(item,clerk,age)]
1. Insert into Emp(Mary,25), notify integrator
2. Insert into Sale (Computer,Mary), notify integrator
3. (1) → integrator adds Sale ⋈ (Mary,25)
4. (2) → integrator adds (Computer,Mary) ⋈ Emp
5. View incorrect (duplicate tuple): both joins are evaluated against sources that already contain both inserts, so each adds (Computer,Mary,25)
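The anomaly can be replayed in a few lines. In this sketch the two sources are plain lists and the integrator evaluates each incremental join against the current source contents, which is what the loosely coupled setting implies; the variable names are chosen for illustration.

```python
sale = []           # Sale(item, clerk) at the Sales source
emp = []            # Emp(clerk, age) at the Comp. source
sold = []           # Sold(item, clerk, age) at the warehouse
notifications = []  # change reports queued for the integrator

# Steps 1 and 2: both sources change before the integrator reacts.
emp.append(("Mary", 25))
notifications.append(("Emp", ("Mary", 25)))
sale.append(("Computer", "Mary"))
notifications.append(("Sale", ("Computer", "Mary")))

# Steps 3 and 4: each change is joined against the *current* source state.
for relation, row in notifications:
    if relation == "Emp":
        clerk, age = row
        sold += [(item, c, age) for (item, c) in sale if c == clerk]
    else:
        item, clerk = row
        sold += [(item, clerk, age) for (c, age) in emp if c == clerk]

# Step 5: the correct view has one tuple, but the warehouse now holds two.
correct = [(item, c, age) for (item, c) in sale for (c2, age) in emp if c == c2]
```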
Maintenance Anomaly - Solutions
• Incremental update algorithms (ECA, Strobe, etc.)
• Research issues: Self-maintainable views
− What views are self-maintainable
− Store auxiliary views so original + auxiliary views are self-maintainable
Self-Maintainability: Examples
Sold(item,clerk,age) = Sale(item,clerk) ⋈ Emp(clerk,age)
• Inserts into Emp
− If Emp.clerk is key and Sale.clerk is foreign key (with ref. int.) then no effect
• Inserts into Sale
− Maintain auxiliary view: Emp − Πclerk,age(Sold)
• Deletes from Emp
− Delete from Sold based on clerk
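The three cases on this slide can be sketched with the view and its auxiliary view held as sets at the warehouse. The function names and the clerk John are hypothetical, and the sketch covers only the three cases above (it does not handle deletes from Sale).

```python
sold = {("Computer", "Mary", 25)}  # materialized view Sold(item, clerk, age)
aux_emp = {("John", 30)}           # auxiliary view: Emp minus clerks already in Sold

def clerk_age(clerk):
    """Find a clerk's age without contacting the Emp source."""
    for _, c, age in sold:
        if c == clerk:
            return age
    for c, age in aux_emp:
        if c == clerk:
            return age
    raise KeyError(clerk)

def insert_sale(item, clerk):
    age = clerk_age(clerk)
    sold.add((item, clerk, age))
    aux_emp.discard((clerk, age))  # the clerk is now represented in Sold

def insert_emp(clerk, age):
    aux_emp.add((clerk, age))      # Sold itself is unchanged

def delete_emp(clerk):
    sold.difference_update({t for t in sold if t[1] == clerk})
    aux_emp.difference_update({t for t in aux_emp if t[0] == clerk})

insert_sale("Printer", "John")  # age found in the auxiliary view
delete_emp("Mary")              # removes Mary's tuples from Sold
```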
Self-Maintainability: Examples
• Deletes from Sale

Delete from Sold based on {item,clerk}
Unless age at time of sale is relevant
• Auxiliary views for self-maintainability

− Must themselves be self-maintainable
− One solution: all source data
− But want minimal set
Partial Self-Maintainability
• Avoid (but don’t prohibit) going to sources
Sold = Sale(item,clerk) ⋈ Emp(clerk,age)
• Inserts into Sale
− Check if clerk already in Sold, go to source if not
− Or replicate all clerks over age 30
− Or ...
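The first option can be sketched as a local lookup with a fallback to the source. `fetch_age_from_source` stands in for a query against the Emp source, and its data is made up for the example.

```python
source_lookups = []  # records every trip to the source

def fetch_age_from_source(clerk):
    """Stand-in for a query against the Emp source (hypothetical data)."""
    source_lookups.append(clerk)
    return {"Mary": 25, "John": 30}[clerk]

def insert_sale(sold, item, clerk):
    ages = {c: age for _, c, age in sold}
    # Use the warehouse's own data when possible; otherwise ask the source.
    age = ages[clerk] if clerk in ages else fetch_age_from_source(clerk)
    sold.add((item, clerk, age))

sold = {("Computer", "Mary", 25)}
insert_sale(sold, "Printer", "Mary")  # answered locally
insert_sale(sold, "Desk", "John")     # requires one trip to the source
```

Replicating a subset of Emp at the warehouse, as in the second option, trades storage for fewer source trips.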
Warehouse Specification (ideally)
[Figure: a Warehouse Configuration Module takes the View Definitions and produces Integration rules for the Integrator and Change Detection Requirements for the Extractor/Monitors; the Integrator maintains the Warehouse and its Metadata]
Processing
• ROLAP servers vs. MOLAP servers
• Index Structures
• What to Materialize?
• Algorithms
[Figure: warehouse architecture: Integration → Warehouse (with Metadata) → Query & Analysis → Clients]
ROLAP Server
• Relational OLAP Server
• Special indices, tuning; schema is “denormalized”
[Figure: tools and utilities → ROLAP server → relational DBMS]

sale:
prodId  date  sum
p1      1     62
p2      1     19
p1      2     48
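The summary table shown for the ROLAP server is what a relational GROUP BY over the sale data produces. A minimal sketch using SQLite from the Python standard library, with the six sale tuples from the materialized view example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sale (prodId TEXT, storeId TEXT, date INTEGER, amt INTEGER)")
conn.executemany("INSERT INTO sale VALUES (?, ?, ?, ?)", [
    ("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8),  ("p1", "c1", 2, 44), ("p1", "c2", 2, 4),
])

# Aggregate away the store dimension: total amount per product and date.
summary = conn.execute(
    "SELECT prodId, date, SUM(amt) FROM sale "
    "GROUP BY prodId, date ORDER BY date, prodId"
).fetchall()
```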
MOLAP Server
• Multi-Dimensional OLAP Server
[Figure: Sales cube with dimensions Product (milk, soda, eggs, soap), Date (1, 2, 3, 4), and City (A, B); M.D. tools and utilities → multi-dimensional server, which could also sit on a relational DBMS]
Index Structures (sketch)
• Traditional Access Methods
− B-trees, hash tables, R-trees, grids, …
• Popular in Warehouses
− inverted lists
− bit map indexes
− join indexes
− text indexes
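A bit map index can be sketched with Python integers as bit vectors: one bitmap per distinct column value, with bit i set when row i holds that value. The function names are chosen for illustration, and the rows are the sale tuples used elsewhere in these slides.

```python
from collections import defaultdict

sale = [
    ("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8),  ("p1", "c1", 2, 44), ("p1", "c2", 2, 4),
]

def build_bitmap_index(rows, column):
    """Map each distinct value to an int whose bit i is set if row i has it."""
    index = defaultdict(int)
    for position, row in enumerate(rows):
        index[row[column]] |= 1 << position
    return index

def rows_for(rows, bitmap):
    """Return the rows whose positions are set in the bitmap."""
    return [row for position, row in enumerate(rows)
            if (bitmap >> position) & 1]

by_product = build_bitmap_index(sale, column=0)
by_store = build_bitmap_index(sale, column=1)

# A conjunctive predicate becomes a bitwise AND of two bitmaps.
hits = rows_for(sale, by_product["p1"] & by_store["c1"])
```

This is why bitmaps suit warehouse queries: predicates over several low-cardinality columns combine with cheap bitwise operations before any row is fetched.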
What to Materialize?
• Store in warehouse results useful for
common queries
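• Example: total sales
[Figure: the sale data as a cube, with aggregates that could be materialized]

By product, city and date:
day 1   c1   c2   c3
p1      12        50
p2      11    8
day 2   c1   c2   c3
p1      44    4   ...

Summed over date:
        c1   c2   c3
p1      56    4   50
p2      11    8

Summed over date and product:
        c1   c2   c3
        67   12   50

Summed over date and city:
p1     110
p2      19

Grand total: 129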
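The point of materializing an aggregate is that coarser results can be computed from it rather than from the base data. A small sketch with the six sale tuples of the example; the helper `aggregate` is hypothetical.

```python
from collections import defaultdict

sale = [  # (prodId, storeId, date, amt)
    ("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8),  ("p1", "c1", 2, 44), ("p1", "c2", 2, 4),
]

def aggregate(rows, keep):
    """Sum amt grouped by the column positions listed in keep."""
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[i] for i in keep)] += row[3]
    return dict(totals)

# Suppose this aggregate (summed over date) is materialized.
by_product_city = aggregate(sale, keep=(0, 1))

# Coarser aggregates are derived from it without touching the base data.
by_product = defaultdict(int)
for (prod, city), amt in by_product_city.items():
    by_product[prod] += amt
total = sum(by_product.values())
```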
Materialization Factors
• Type/frequency of queries
• Query response time
• Storage cost
• Update cost
Cube Aggregates Lattice
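[Figure: lattice of cube aggregates, coarsest at the top]
all
city    product    date
city, product    city, date    product, date
city, product, date

Example aggregates shown in the figure:
− all: 129
− city: c1 67, c2 12, c3 50
− city, product: p1 (c1 56, c2 4, c3 50); p2 (c1 11, c2 8)
− city, product, date: day 1: p1 (c1 12, c3 50), p2 (c1 11, c2 8); day 2: p1 (c1 44, c2 4)
• Use greedy algorithm to decide what to materialize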
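The slides do not say which greedy algorithm is meant. The sketch below is one simple benefit-based heuristic, stated as an assumption: a query on a group-by costs as many rows as the smallest materialized view that contains its dimensions, and each round materializes the view that reduces the total cost most. The row counts are the group counts of the six sale tuples in the example.

```python
def fs(*dims):
    return frozenset(dims)

TOP = fs("city", "product", "date")

# Rows per group-by, counted from the six example sale tuples.
SIZE = {
    TOP: 6,
    fs("city", "product"): 5, fs("city", "date"): 5, fs("product", "date"): 3,
    fs("city"): 3, fs("product"): 2, fs("date"): 2,
    fs(): 1,  # "all"
}

def query_cost(view, materialized):
    """Rows read to answer `view` from its cheapest materialized ancestor."""
    return min(SIZE[m] for m in materialized if view <= m)

def total_cost(materialized):
    return sum(query_cost(v, materialized) for v in SIZE)

def greedy_select(k):
    materialized = {TOP}  # the base data is always available
    picks = []
    for _ in range(k):
        candidates = [v for v in SIZE if v not in materialized]
        # Benefit = reduction in total query cost if the candidate is added.
        best = max(candidates,
                   key=lambda v: total_cost(materialized)
                   - total_cost(materialized | {v}))
        materialized.add(best)
        picks.append(best)
    return picks

picks = greedy_select(2)
```

With these counts the first pick is (product, date) and the second is (city). A real selection would also weigh the query frequencies, storage and update costs listed under Materialization Factors.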
Dimension Hierarchies
[Figure: hierarchy city → state → all]

cities:
city  state
c1    CA
c2    NY
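A roll-up along this hierarchy maps each city total to its state through the cities table. In this sketch the per-city totals are taken from the earlier example; c3 has no row in the table shown, so it is reported as unmapped rather than guessed.

```python
from collections import defaultdict

cities = {"c1": "CA", "c2": "NY"}               # dimension table from the slide
sales_by_city = {"c1": 67, "c2": 12, "c3": 50}  # totals from the earlier example

sales_by_state = defaultdict(int)
unmapped = []
for city, amt in sales_by_city.items():
    if city in cities:
        sales_by_state[cities[city]] += amt     # roll up city → state
    else:
        unmapped.append(city)                   # no state known for this city
```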
Dimension Hierarchies
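[Figure: aggregate lattice extended with the state level of the city dimension; not all arcs shown]
Nodes, grouped by number of attributes:
all
city    product    date    state
city, product    city, date    product, date    state, date    state, product
city, product, date    state, product, date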
Interesting Hierarchy
[Figure: hierarchy over the levels days, weeks, months, quarters, years, all]

time (conceptual dimension table):
day  week  month  quarter  year
1    1     1      1        2000
2    1     1      1        2000
3    1     1      1        2000
4    1     1      1        2000
5    1     1      1        2000
6    1     1      1        2000
7    1     1      1        2000
8    2     1      1        2000
Managing
• Metadata
• Warehouse Design
• Tools
[Figure: warehouse architecture: Integration → Warehouse (with Metadata) → Query & Analysis → Clients]
Metadata
• Administrative
− definition of sources, tools, ...
− schemas, dimension hierarchies, …
− rules for extraction, cleaning, …
− refresh, purging policies
− user profiles, access control, ...
Metadata
• Business
− business terms & definition
− data ownership, charging
• Operational
− data lineage
− data currency (e.g., active, archived, purged)
− use stats, error reports, audit trails
Design Summary
• What data is needed?
• Where does it come from?
• How to clean data?
• How to represent in warehouse (schema)?
• What to summarize?
• What to materialize?
• What to index?
Tools
• Development
− design & edit: schemas, views, scripts, rules, queries, reports
• Planning & Analysis

− what-if scenarios (schema changes, refresh rates), capacity planning
• Warehouse Management
− performance monitoring, usage patterns, exception reporting
• System & Network Management

− measure traffic (sources, warehouse, clients)
• Workflow Management
− “reliable scripts” for cleaning & analyzing data
Current State of Industry
• Extraction and integration done off-line
− Usually in large, time-consuming, batches
• Everything copied at warehouse
− Not selective about what is stored
− Query benefit vs storage & update cost
• Query optimization aimed at OLTP
− High throughput instead of fast response
− Process whole query before displaying anything
State of Commercial Practice ...
• Connectivity to sources
− Apertus
− Information Builders
− Informix Enterprise Gateway
− Oracle Open Connect
− CA-Ingres gateway
− MS ODBC
− Platinum InfoHub
• Data extract, clean, transform, refresh
− CA-Ingres Replicator
− ETI-Extract
− IBM Data Joiner, Data Propagator
− Prism Warehouse manager
− SAS Access
− Sybase Replication Server
− Trinzic InfoPump
… State of Commercial Practice ...
• Multidimensional Database Engines
− Arbor Essbase
− Oracle IRI Express
− Comshare Commander
− SAS System
• ROLAP Servers
− HP Intelligent Warehouse
− Informix Metacube
− MicroStrategy DSS Server
− Information Advantage Asxys
• Warehouse Data Servers
− CA-Ingres
− Oracle 8
− RedBrick
− Sybase IQ
− Informix Dynamic Server
− IBM DB2
… State of Commercial Practice
• Query/Reporting Environments
− IBM DataGuide
− SAS Access
− CA Visual Express
− Platinum Forest&Trees
− Informix ViewPoint
• Multidimensional Analysis
− Kenan Systems Acumate
− Microsoft Excel
− Arbor Essbase Analysis server
− Cognos PowerPlay
− IQ Software IQ/Vision
− Lotus 123
− SAS OLAP++
− Business Objects
• Lots and lots of consulting!!
Future Directions
• Better performance
• Larger warehouses
• Easier to use
• What are companies & research labs
working on?
Research (1)
• Incremental Maintenance
• Data Consistency
• Data Expiration
• Recovery
• Data Quality
• Error Handling (Back Flush)
Research (2)
• Rapid Monitor Construction
• Temporal Warehouses
• Materialization & Index Selection
• Data Fusion
• Data Mining
• Integration of Text & Relational Data
• Conceptual Modelling
Conclusions
• Massive amounts of data and
complexity of queries will push limits
of current warehouses
• Need better systems:
− easier to use
− provide quality information