Enrico Franconi
CS 636
Implementing a Warehouse
• Monitoring: Sending data from sources
• Integrating: Loading, cleansing,...
• Processing: Query processing, indexing, ...
• Managing: Metadata, Design, ...
CS 336 2
Monitoring
• Source Types: relational, flat file, IMS,
VSAM, IDMS, WWW, news-wire, …
• How to get data out?
− Replication tool
− Dump file
− Create report
− ODBC or third-party “wrappers”
Monitoring Techniques
• Periodic snapshots
• Database triggers
• Log shipping
• Data shipping (replication service)
• Transaction shipping
• Polling (queries to source)
• Screen scraping
• Application level monitoring
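As an illustration of the first technique, periodic snapshots, a monitor can compare two successive dumps of a source and report only the differences. This is a minimal sketch, assuming each snapshot fits in memory as a dictionary keyed by primary key; the function name and sample rows are hypothetical.

```python
def diff_snapshots(old, new):
    """Compare two snapshots keyed by primary key.

    Returns (inserts, deletes, updates) to be sent to the warehouse.
    """
    inserts = {k: v for k, v in new.items() if k not in old}
    deletes = {k: v for k, v in old.items() if k not in new}
    updates = {k: (old[k], new[k])
               for k in old.keys() & new.keys() if old[k] != new[k]}
    return inserts, deletes, updates


# Hypothetical snapshots of a product table: id -> (name, price)
yesterday = {"p1": ("bolt", 10), "p2": ("nut", 5)}
today = {"p1": ("bolt", 12), "p3": ("screw", 2)}
inserts, deletes, updates = diff_snapshots(yesterday, today)
```

Snapshot differencing needs no cooperation from the source, but the cost grows with the size of the snapshot rather than with the number of changes.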
Monitoring Issues
• Frequency
− periodic: daily, weekly, …
− triggered: on “big” change, lots of changes, ...
• Data transformation
− convert data to uniform format
− remove & add fields (e.g., add date to get history)
• Standards (e.g., ODBC)
• Gateways
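The data transformation step above could look roughly like the following sketch. The source field names (PROD_ID, AMT, TERMINAL) and the target layout are assumptions made for illustration only.

```python
from datetime import date


def to_uniform(record, extracted_on):
    """Map one source record to a uniform warehouse format."""
    return {
        "prodId": record["PROD_ID"].strip().lower(),  # normalize identifiers
        "amt": float(record["AMT"]),                  # uniform numeric type
        "extracted_on": extracted_on.isoformat(),     # added field: gives history
        # the source-only field "TERMINAL" is deliberately dropped
    }


row = to_uniform({"PROD_ID": " P1 ", "AMT": "12", "TERMINAL": "t7"},
                 date(2000, 1, 1))
```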
Wrapper
Converts data and queries from one data model to another
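A minimal sketch of the idea, assuming the source is a comma-separated flat file and the target model is a relation answering simple equality selections. The class name, column names, and sample data are hypothetical.

```python
import csv
import io


class FlatFileWrapper:
    """Presents a flat file as a relation that answers selection queries."""

    def __init__(self, text, columns):
        # Data conversion: each line of the file becomes a tuple with named fields.
        self.rows = [dict(zip(columns, fields))
                     for fields in csv.reader(io.StringIO(text)) if fields]

    def select(self, **conditions):
        # Query conversion: a relational selection becomes a scan with a filter.
        return [r for r in self.rows
                if all(r[col] == val for col, val in conditions.items())]


dump = "p1,c1,1,12\np2,c1,1,11\np1,c3,1,50\n"
wrapper = FlatFileWrapper(dump, ["prodId", "storeId", "date", "amt"])
p1_sales = wrapper.select(prodId="p1")
```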
Wrapper Generation
[Figure: Wrapper Definition → Generator → Wrapper]
Integration
• Data Cleaning
• Data Loading
• Derived Data
[Figure: warehouse architecture: Clients, Query & Analysis, Metadata, Warehouse, Integration]
Data Integration
• Receive data (changes) from multiple
wrappers/monitors and integrate into warehouse
• Rule-based
• Actions
− Resolve inconsistencies
− Eliminate duplicates
− Integrate into warehouse (may not be empty)
− Summarize data
− Fetch more data from sources (warehouse updates)
− etc.
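Two of these actions, eliminating duplicates and resolving inconsistencies, can be sketched as a single rule. The rule used here ("the most recent change wins") is an assumption chosen for illustration; real integrators would apply whatever rules the warehouse designer specifies.

```python
def integrate(warehouse, changes):
    """Apply change records (key, value, timestamp) from several monitors.

    Assumed rule: keep the value with the latest timestamp per key.
    Exact duplicates and older conflicting values are ignored.
    """
    for key, value, ts in changes:
        current = warehouse.get(key)
        if current is None or ts > current[1]:
            warehouse[key] = (value, ts)
    return warehouse


# The warehouse is not empty when the changes arrive.
warehouse = {"joe": ("12 Main St", 1)}
changes = [
    ("joe", "34 Oak Ave", 3),   # newer value from one monitor
    ("joe", "34 Oak Ave", 3),   # duplicate report from another monitor
    ("joe", "9 Elm Rd", 2),     # older, inconsistent value
    ("ann", "5 Pine St", 1),    # new entity
]
integrate(warehouse, changes)
```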
Data Cleaning
• Migration (e.g., yen to dollars)
• Scrubbing: use domain-specific knowledge (e.g., social
security numbers)
• Fusion (e.g., mail list, customer merging)
billing DB: customer1(Joe) + service DB: customer2(Joe) → merged_customer(Joe)
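The three cleaning steps can be sketched as follows. The exchange rate, the nine-digit validation rule, and the matching of customers by normalized name are all simplifying assumptions, not a prescribed method.

```python
import re

YEN_PER_DOLLAR = 110.0  # hypothetical rate, for illustration only


def yen_to_dollars(yen):
    """Migration: convert an amount to the warehouse currency."""
    return round(yen / YEN_PER_DOLLAR, 2)


def scrub_ssn(raw):
    """Scrubbing: use domain knowledge (nine digits) to normalize or reject."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) != 9:
        return None
    return f"{digits[:3]}-{digits[3:5]}-{digits[5:]}"


def fuse(*sources):
    """Fusion: merge records that refer to the same customer.

    Records are matched on a normalized name; the first value seen
    for each field is kept.
    """
    merged = {}
    for source in sources:
        for rec in source:
            target = merged.setdefault(rec["name"].strip().lower(), {})
            for field, value in rec.items():
                target.setdefault(field, value)
    return merged


billing = [{"name": "Joe", "balance": 40}]
service = [{"name": "joe ", "plan": "basic"}]
customers = fuse(billing, service)
```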
Loading Data in the Warehouse
• Incremental vs. refresh
• Off-line vs. on-line
• Frequency of loading
− At night, 1x a week/month, continuously
• Parallel/Partitioned load
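The difference between a refresh and an incremental load can be sketched as below, assuming the warehouse table is a dictionary keyed by primary key and that the monitor supplies the inserted or changed rows and the deleted keys. All names are hypothetical.

```python
def refresh_load(source):
    """Refresh: rebuild the warehouse copy from the full source contents."""
    return dict(source)


def incremental_load(warehouse, inserted, deleted_keys):
    """Incremental: apply only the changes reported since the last load."""
    for key in deleted_keys:
        warehouse.pop(key, None)
    warehouse.update(inserted)
    return warehouse


source_v1 = {"p1": 10, "p2": 5}
warehouse = refresh_load(source_v1)

source_v2 = {"p1": 12, "p3": 2}
incremental_load(warehouse, inserted={"p1": 12, "p3": 2}, deleted_keys=["p2"])
```

Both paths should end in the same state; the incremental load touches only the changed rows, at the price of needing a change feed from the source.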
Warehouse Maintenance
Materialized Views
• Define new warehouse relations using SQL
expressions
sale:
prodId  storeId  date  amt
p1      c1       1     12
p2      c1       1     11
p1      c3       1     50
p2      c2       1     8
p1      c1       2     44
p1      c2       2     4

product:
id  name  price
p1  bolt  10
p2  nut   5
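A sketch of defining a new warehouse relation from the sale and product tables with an SQL expression. SQLite has no materialized views, so the result is stored with CREATE TABLE ... AS; the view name sale_product and the choice of a join as the defining expression are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sale(prodId TEXT, storeId TEXT, date INTEGER, amt INTEGER);
CREATE TABLE product(id TEXT, name TEXT, price INTEGER);
INSERT INTO sale VALUES
  ('p1','c1',1,12), ('p2','c1',1,11), ('p1','c3',1,50),
  ('p2','c2',1,8),  ('p1','c1',2,44), ('p1','c2',2,4);
INSERT INTO product VALUES ('p1','bolt',10), ('p2','nut',5);

-- Stored result of the view expression (a "materialized view").
CREATE TABLE sale_product AS
  SELECT s.prodId, p.name, p.price, s.storeId, s.date, s.amt
  FROM sale AS s JOIN product AS p ON s.prodId = p.id;
""")

row_count, total_amt = conn.execute(
    "SELECT COUNT(*), SUM(amt) FROM sale_product").fetchone()
bolt_rows = conn.execute(
    "SELECT COUNT(*) FROM sale_product WHERE name = 'bolt'").fetchone()[0]
```

Because the result is stored, it must be maintained whenever sale or product changes.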
Differs from Conventional View
Maintenance...
• Base data doesn’t participate in view
maintenance
− Simply reports changes
− Loosely coupled
− Absence of locking, global transactions
− May not be queriable
Warehouse Maintenance Anomalies
• Materialized view maintenance in loosely
coupled, non-transactional environment
• Simple example
[Figure: the Data Warehouse holds the view Sold(item,clerk,age); source "Sales" holds Sale(item,clerk) and source "Comp." holds Emp(clerk,age)]
Warehouse Maintenance Anomalies
[Figure: an Integrator sits between the sources and the Data Warehouse view Sold(item,clerk,age); source "Sales" holds Sale(item,clerk) and source "Comp." holds Emp(clerk,age)]
1. Insert into Emp(Mary,25), notify integrator
2. Insert into Sale (Computer,Mary), notify integrator
3. (1) → integrator adds Sale ⋈ (Mary,25)
4. (2) → integrator adds (Computer,Mary) ⋈ Emp
5. View incorrect (duplicate tuple)
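The five steps above can be replayed in a few lines. This sketch assumes the integrator evaluates each join by querying the sources at the time it processes the notification, which is after both inserts have already happened; the helper names are hypothetical.

```python
sale, emp, sold = [], [], []   # sources Sale, Emp and warehouse view Sold


def join_sale_with(emp_tuple):
    """Sale ⋈ (clerk, age), evaluated against the current Sale source."""
    clerk, age = emp_tuple
    return [(item, c, age) for item, c in sale if c == clerk]


def join_emp_with(sale_tuple):
    """(item, clerk) ⋈ Emp, evaluated against the current Emp source."""
    item, clerk = sale_tuple
    return [(item, clerk, age) for c, age in emp if c == clerk]


# Steps 1 and 2: both source updates happen before the integrator reacts.
emp.append(("Mary", 25))
sale.append(("Computer", "Mary"))

# Step 3: handling notification (1) already sees the new Sale tuple.
sold.extend(join_sale_with(("Mary", 25)))

# Step 4: handling notification (2) produces the same tuple again.
sold.extend(join_emp_with(("Computer", "Mary")))
```

After step 4 the view contains (Computer, Mary, 25) twice, although the correct join contains it once.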
Maintenance Anomaly - Solutions
Self-Maintainability: Examples
Sold(item,clerk,age) = Sale(item,clerk) ⋈ Emp(clerk,age)
• Inserts into Emp
If Emp.clerk is key and Sale.clerk is
foreign key (with ref. int.) then no effect
• Inserts into Sale
Maintain auxiliary view: Emp − Π_{clerk,age}(Sold)
• Deletes from Emp
Delete from Sold based on clerk
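The three cases can be sketched as a small class that never queries the sources: it keeps only Sold and the auxiliary view. The sketch assumes Emp.clerk is a key and Sale.clerk a foreign key with referential integrity, and it covers only the three updates listed; the class and method names are hypothetical.

```python
class SoldView:
    """Maintains Sold = Sale ⋈ Emp from reported changes only."""

    def __init__(self):
        self.sold = set()   # (item, clerk, age)
        self.aux = {}       # Emp − Π clerk,age (Sold): clerk -> age

    def insert_emp(self, clerk, age):
        # A new employee has no sales yet, so Sold is unaffected.
        self.aux[clerk] = age

    def insert_sale(self, item, clerk):
        # The age comes from the auxiliary view, or from Sold if the
        # clerk already appears there.
        age = self.aux.pop(clerk, None)
        if age is None:
            age = next(a for _, c, a in self.sold if c == clerk)
        self.sold.add((item, clerk, age))

    def delete_emp(self, clerk):
        # Delete from Sold based on clerk.
        self.aux.pop(clerk, None)
        self.sold = {t for t in self.sold if t[1] != clerk}


view = SoldView()
view.insert_emp("Mary", 25)
view.insert_sale("Computer", "Mary")
view.insert_sale("Phone", "Mary")
after_inserts = set(view.sold)
view.delete_emp("Mary")
```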
Partial Self-Maintainability
Warehouse Specification (ideally)
[Figure: a Warehouse Configuration Module with the labels View Definitions, Integration rules, Change Detection Requirements, Integrator, Metadata, Warehouse, ...]
Processing
• ROLAP servers vs. MOLAP servers
• Index Structures
• What to Materialize?
• Algorithms
[Figure: warehouse architecture: Clients, Query & Analysis, Metadata, Warehouse, Integration]
ROLAP Server
• Relational OLAP Server
[Figure: tools on top of a relational DBMS]

sale:
prodId  date  sum
p1      1     62
p2      1     19
p1      2     48
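In a ROLAP server an aggregate such as the sale(prodId, date, sum) table above is an ordinary relation computed with SQL. A sketch using SQLite, with base rows taken from the sale table of the materialized-view example (an assumption about where the sums come from):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sale(prodId TEXT, storeId TEXT, date INTEGER, amt INTEGER);
INSERT INTO sale VALUES
  ('p1','c1',1,12), ('p2','c1',1,11), ('p1','c3',1,50),
  ('p2','c2',1,8),  ('p1','c1',2,44), ('p1','c2',2,4);
""")

# Aggregate over stores: one row per (product, date).
by_product_date = conn.execute("""
    SELECT prodId, date, SUM(amt) AS sum
    FROM sale
    GROUP BY prodId, date
    ORDER BY date, prodId
""").fetchall()
```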
MOLAP Server
• Multi-Dimensional OLAP Server
[Figure: M.D. tools and utilities on top of a multi-dimensional server, which could also sit on a relational DBMS; Sales cube with axes Product (milk, soda, eggs, soap), Date (1, 2, 3, 4), City (A, B)]
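A multi-dimensional server stores the cube as an array addressed by dimension positions rather than as rows. A minimal sketch with nested lists; the figure gives no cell values, so the data below reuses the sale rows from the materialized-view example, and the helper name cell is hypothetical.

```python
products, cities, dates = ["p1", "p2"], ["c1", "c2", "c3"], [1, 2]
facts = [("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
         ("p2", "c2", 1, 8), ("p1", "c1", 2, 44), ("p1", "c2", 2, 4)]

# cube[product][city][date]; empty cells hold 0.
cube = [[[0] * len(dates) for _ in cities] for _ in products]
for p, c, d, amt in facts:
    cube[products.index(p)][cities.index(c)][dates.index(d)] = amt


def cell(p, c, d):
    """Direct array lookup by dimension values."""
    return cube[products.index(p)][cities.index(c)][dates.index(d)]


# Summing a slice of the array aggregates over city and date.
p1_total = sum(sum(per_city) for per_city in cube[products.index("p1")])
```

The array gives constant-time cell access, but empty cells still take space, which matters when the cube is sparse.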
Index Structures (sketch)
• Traditional Access Methods
− B-trees, hash tables, R-trees, grids, …
• Popular in Warehouses
− inverted lists
− bit map indexes
− join indexes
− text indexes
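As an illustration of one of these, a bit map index keeps one bit vector per distinct column value, and conjunctive conditions become bitwise ANDs. This sketch uses Python integers as bit vectors; the function names and sample rows are hypothetical.

```python
def build_bitmap_index(rows, column):
    """Return {value: bitmap}, where bit i is set if rows[i][column] == value."""
    index = {}
    for pos, row in enumerate(rows):
        index[row[column]] = index.get(row[column], 0) | (1 << pos)
    return index


def matching_positions(bitmap):
    """Row positions whose bit is set."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


rows = [
    {"prodId": "p1", "storeId": "c1"}, {"prodId": "p2", "storeId": "c1"},
    {"prodId": "p1", "storeId": "c3"}, {"prodId": "p2", "storeId": "c2"},
    {"prodId": "p1", "storeId": "c1"}, {"prodId": "p1", "storeId": "c2"},
]
by_product = build_bitmap_index(rows, "prodId")
by_store = build_bitmap_index(rows, "storeId")

# prodId = 'p1' AND storeId = 'c1' is a single bitwise AND.
hits = matching_positions(by_product["p1"] & by_store["c1"])
```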
What to Materialize?
• Store in warehouse results useful for
common queries
• Example:
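total sales by product, city, and date:
day 1   c1   c2   c3
p1      12   -    50
p2      11   8    -

day 2   c1   c2   c3
p1      44   4    ...

summed over date:
        c1   c2   c3
p1      56   4    50
p2      11   8    -

summed over product and date:
        c1   c2   c3
        67   12   50

summed over city and date:
p1      110
p2      19

grand total: 129
(the figure marks aggregates like these as candidates to materialize)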
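The aggregates in this example can be recomputed from the base sales with one generic group-by function. This is only a sketch; the function name and the dictionary layout are assumptions.

```python
from collections import defaultdict

facts = [
    {"product": "p1", "city": "c1", "date": 1, "amt": 12},
    {"product": "p2", "city": "c1", "date": 1, "amt": 11},
    {"product": "p1", "city": "c3", "date": 1, "amt": 50},
    {"product": "p2", "city": "c2", "date": 1, "amt": 8},
    {"product": "p1", "city": "c1", "date": 2, "amt": 44},
    {"product": "p1", "city": "c2", "date": 2, "amt": 4},
]


def aggregate(rows, dims):
    """Sum amt grouped by the given dimensions (empty dims = grand total)."""
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[d] for d in dims)] += row["amt"]
    return dict(totals)


by_product_city = aggregate(facts, ("product", "city"))
by_city = aggregate(facts, ("city",))
by_product = aggregate(facts, ("product",))
grand_total = aggregate(facts, ())
```

Storing any of these results in the warehouse trades storage and update cost for faster answers to the queries that need them.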
Materialization Factors
• Type/frequency of queries
• Query response time
• Storage cost
• Update cost
Cube Aggregates Lattice
[Figure: lattice of cube aggregates, from (city, product, date) at the bottom, through city, product, and date, up to all (grand total 129), illustrated with the sales tables]
• Use greedy algorithm to decide what to materialize
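The slide does not say which greedy algorithm is meant. One common formulation, sketched here under stated assumptions, repeatedly picks the aggregate that most reduces total query cost, where answering an aggregate costs the size of its smallest materialized ancestor in the lattice. The row counts in SIZES are hypothetical.

```python
DIMS = ("city", "product", "date")

# Hypothetical row counts for each aggregate in the lattice.
SIZES = {
    frozenset(DIMS): 100,
    frozenset({"city", "product"}): 50,
    frozenset({"city", "date"}): 80,
    frozenset({"product", "date"}): 90,
    frozenset({"city"}): 10,
    frozenset({"product"}): 20,
    frozenset({"date"}): 30,
    frozenset(): 1,          # "all"
}


def greedy_select(sizes, k):
    """Pick up to k aggregates to materialize besides the base cuboid."""
    top = max(sizes, key=len)              # base data is always available
    chosen = [top]
    cost = {v: sizes[top] for v in sizes}  # current cost to answer each aggregate

    def benefit(v):
        # v can answer every aggregate over a subset of its dimensions.
        return sum(max(0, cost[w] - sizes[v]) for w in sizes if w <= v)

    for _ in range(k):
        best = max((v for v in sizes if v not in chosen), key=benefit)
        if benefit(best) <= 0:
            break
        chosen.append(best)
        for w in sizes:
            if w <= best:
                cost[w] = min(cost[w], sizes[best])
    return chosen


selection = greedy_select(SIZES, 2)
```

With these sizes the first pick is (city, product) and the second is date: a small aggregate high in the lattice can be worth more than a large one near the bottom.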
Dimension Hierarchies
all
city
Dimension Hierarchies
[Figure: hierarchy levels all, state; lattice nodes (city, product, date), (state, date), (state, product)]
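Rolling up along a dimension hierarchy re-aggregates a finer level using the child-to-parent mapping. In this sketch the city-to-state assignment is invented for illustration, and the per-city totals are taken from the materialization example.

```python
from collections import defaultdict

CITY_TO_STATE = {"c1": "CA", "c2": "NY", "c3": "CA"}  # hypothetical mapping
sales_by_city = {"c1": 67, "c2": 12, "c3": 50}


def roll_up(by_child, parent_of):
    """Aggregate child-level totals to the parent level of the hierarchy."""
    totals = defaultdict(int)
    for child, amt in by_child.items():
        totals[parent_of[child]] += amt
    return dict(totals)


sales_by_state = roll_up(sales_by_city, CITY_TO_STATE)
```

Because every city belongs to exactly one state, the state totals can be derived from the city aggregate without going back to the base data.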
Interesting Hierarchy
time (conceptual dimension table):
day  week  month  quarter  year
1    1     1      1        2000
2    1     1      1        2000
3    1     1      1        2000
4    1     1      1        2000
5    1     1      1        2000
6    1     1      1        2000
7    1     1      1        2000
8    2     1      1        2000

[Figure: hierarchy over the levels days, weeks, months, quarters, years, all]
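The rows of such a dimension table can be generated rather than stored by hand. This sketch assumes days are numbered from 1 January and weeks are consecutive 7-day blocks starting on day 1, which matches the rows shown; the function name is hypothetical.

```python
from datetime import date, timedelta


def time_row(day, year=2000):
    """Return (day, week, month, quarter, year) for the n-th day of the year."""
    d = date(year, 1, 1) + timedelta(days=day - 1)
    week = (day - 1) // 7 + 1
    quarter = (d.month - 1) // 3 + 1
    return (day, week, d.month, quarter, year)


rows = [time_row(n) for n in range(1, 9)]

# Week 5 covers days 29 to 35, which straddle January and February.
months_in_week_5 = {time_row(n)[2] for n in range(29, 36)}
```

A week can span two months, so weeks do not roll up into months; this is one plausible reason the hierarchy is called interesting, since it is a lattice rather than a single chain.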
Managing
• Metadata
• Warehouse Design
• Tools
[Figure: warehouse architecture: Clients, Metadata, Warehouse, Integration]
Metadata
• Administrative
− definition of sources, tools, ...
− schemas, dimension hierarchies, …
− rules for extraction, cleaning, …
− refresh, purging policies
− user profiles, access control, ...
Metadata
• Business
− business terms & definition
− data ownership, charging
• Operational
− data lineage
− data currency (e.g., active, archived, purged)
− use stats, error reports, audit trails
Design Summary
• What data is needed?
• Where does it come from?
• How to clean data?
• How to represent in warehouse (schema)?
• What to summarize?
• What to materialize?
• What to index?
Tools
• Development
− design & edit: schemas, views, scripts, rules, queries, reports
• Warehouse Management
− performance monitoring, usage patterns, exception reporting
• Workflow Management
− “reliable scripts” for cleaning & analyzing data
Current State of Industry
• Extraction and integration done off-line
− Usually in large, time-consuming, batches
• Everything copied at warehouse
− Not selective about what is stored
− Query benefit vs storage & update cost
• Query optimization aimed at OLTP
− High throughput instead of fast response
− Process whole query before displaying anything
State of Commercial Practice ...
• Connectivity to sources
− Apertus
− Information Builders
− Informix Enterprise Gateway
− Oracle Open Connect
− CA-Ingres gateway
− MS ODBC
− Platinum InfoHub
• Data extract, clean, transform, refresh
− CA-Ingres Replicator
− ETI-Extract
− IBM Data Joiner, Data Propagator
− Prism Warehouse manager
− SAS Access
− Sybase Replication Server
− Trinzic InfoPump
… State of Commercial Practice ...
• Multidimensional Database Engines
− Arbor Essbase
− Oracle IRI Express
− Comshare Commander
− SAS System
• ROLAP Servers
− HP Intelligent Warehouse
− Informix Metacube
− MicroStrategy DSS Server
− Information Advantage Axsys
• Warehouse Data Servers
− CA-Ingres
− Oracle 8
− RedBrick
− Sybase IQ
− Informix Dynamic Server
− IBM DB2
… State of Commercial Practice
• Query/Reporting Environments
− IBM DataGuide
− SAS Access
− CA Visual Express
− Platinum Forest&Trees
− Informix ViewPoint
• Multidimensional Analysis
− Kenan Systems Acumate
− Microsoft Excel
− Arbor Essbase Analysis server
− Cognos PowerPlay
− IQ Software IQ/Vision
− Lotus 123
− SAS OLAP++
− Business Objects
Future Directions
• Better performance
• Larger warehouses
• Easier to use
• What are companies & research labs
working on?
Research (1)
• Incremental Maintenance
• Data Consistency
• Data Expiration
• Recovery
• Data Quality
• Error Handling (Back Flush)
Research (2)
• Rapid Monitor Construction
• Temporal Warehouses
• Materialization & Index Selection
• Data Fusion
• Data Mining
• Integration of Text & Relational Data
• Conceptual Modelling
Conclusions
• Massive amounts of data and
complexity of queries will push limits
of current warehouses
• Need better systems:
− easier to use
− provide quality information