Data Warehousing
and
OLAP
What is a Data Warehouse
• “A data warehouse is a subject-oriented,
integrated, time-variant, and nonvolatile
collection of data in support of management’s
decision-making process.” --- W. H. Inmon
• Collection of data that is used primarily in
organizational decision making
• A decision support database that is maintained
separately from the organization’s operational
database
2
Data Warehouse - Subject Oriented
• Subject oriented: oriented to the major
subject areas of the corporation that have
been defined in the data model.
• E.g. for an insurance company: customer,
product, transaction or activity, policy, claim,
account, etc.
• Operational DB and applications may be
organized differently
• E.g. based on type of insurance: auto, life,
medical, fire, ...
3
Data Warehouse – Integrated
4
Data Warehouse - Non-Volatile
• Operational data is regularly accessed and
manipulated a record at a time, and update
is done to data in the operational
environment.
• Warehouse Data is loaded and accessed.
Update of data does not occur in the data
warehouse environment.
5
Data Warehouse - Time Variance
• The time horizon for the data warehouse is
significantly longer than that of operational
systems.
• Operational database: current value data.
• Data warehouse data: nothing more than a
sophisticated series of snapshots, taken at some
moment in time.
• The key structure of operational data may or may
not contain some element of time. The key
structure of the data warehouse always contains
some element of time.
6
Why Separate Data Warehouse?
• Performance
• special data organization, access methods, and
implementation methods are needed to support
multidimensional views and operations typical
of OLAP
• Complex OLAP queries would degrade
performance for operational transactions
• Concurrency control and recovery modes of
OLTP are not compatible with OLAP analysis
7
Why Separate Data Warehouse?
• Function
• missing data: Decision support requires historical
data which operational DBs do not typically maintain
• data consolidation: DS requires consolidation
(aggregation, summarization) of data from
heterogeneous sources: operational DBs, external
sources
• data quality: different sources typically use
inconsistent data representations, codes and formats
which have to be reconciled.
8
Advantages of Warehousing
• High query performance
• Queries not visible outside warehouse
• Local processing at sources unaffected
• Can operate when sources unavailable
• Can query data not stored in a DBMS
• Extra information at warehouse
– Modify, summarize (store aggregates)
– Add historical information
9
Advantages of Mediator Systems
• No need to copy data
– less storage
– no need to purchase data
• More up-to-date data
• Query needs can be unknown
• Only query interface needed at sources
• May be less draining on sources
10
The Architecture of Data Warehousing

[Diagram] Operational databases and external data
sources are fed through Extract, Transform, Load
and Refresh steps into the data warehouse, which
is described by a metadata repository and serves
data marts and an OLAP server.
11
Data Sources
• Data sources are often the operational systems,
providing the lowest level of data.
• Data sources are designed for operational use, not
for decision support, and the data reflect this fact.
• Multiple data sources are often from different
systems, run on a wide range of hardware and
much of the software is built in-house or highly
customized.
• Multiple data sources introduce a large number of
issues, e.g. semantic conflicts.
12
Creating and Maintaining a
Warehouse
Data warehouse needs several tools that
automate or support tasks such as:
Data extraction from different external data
sources, operational databases, files of standard
applications (e.g. Excel, COBOL applications),
and other documents (Word, WWW).
Data cleaning (finding and resolving
inconsistency in the source data)
Integration and transformation of data (between
different data formats, languages, etc.)
13
Creating and Maintaining a
Warehouse
Data loading (loading the data into the data
warehouse)
Data replication (replicating source database into
the data warehouse)
Data refreshment
Data archiving
Checking for data quality
Analyzing metadata
14
Physical Structure of Data
Warehouse
There are three basic architectures for
constructing a data warehouse:
• Centralized
• Federated
• Tiered
The data warehouse may be distributed for load
balancing, scalability and higher availability
15
Physical Structure of Data
Warehouse
[Diagram] Clients query a single central data
warehouse, which is loaded from the sources.
Centralized architecture
16
Physical Structure of Data
Warehouse
[Diagram] End users access local data marts (e.g.
marketing, financial, distribution), which are
integrated through a logical data warehouse built
over the sources.
Federated architecture
17
Physical Structure of Data
Warehouse
[Diagram] Workstations access highly summarized
data in local data marts, which are in turn fed
from the physical data warehouse built over the
sources.
Tiered architecture
18
Physical Structure of Data
Warehouse
• Federated architecture
– The logical data warehouse is only virtual
• Tiered architecture
• The central data warehouse is physical
• There exist local data marts on different tiers
which store copies or summarizations of the
previous tier.
19
Conceptual Modeling of
Data Warehouses
Three basic conceptual schemas:
• Star schema
• Snowflake schema
• Fact constellations
20
Star schema
21
Star schema
[Diagram] The fact table sale(orderId, date, custId,
prodId, storeId, qty, amt) is linked to the dimension
tables product(prodId, name, price), customer(custId,
name, address, city) and store(storeId, city).
22
Star schema
product:  prodId  name  price      store:  storeId  city
          p1      bolt  10                 c1       nyc
          p2      nut   5                  c2       sfo
                                           c3       la
24
Terms
• The relation that relates the dimensions to
the measures of interest is called the fact
table (e.g. sale)
• Information about dimensions can be
represented as a collection of relations –
called the dimension tables (product,
customer, store)
• Each dimension can have a set of
associated attributes
25
Example of Star Schema
[Diagram] The Sales fact table references the
dimensions Date, Product, Store and Customer and
carries the measurements unit_sales, dollar_sales,
schilling_sales. Dimension tables:
  Date (Date, Month, Year)
  Product (ProductNo, ProdName, ProdDesc, Category, QOH)
  Store (StoreID, City, State, Country, Region)
  Customer (CustId, CustName, CustCity, CustCountry)
26
Dimension Hierarchies
• For each dimension, the set of associated
attributes can be structured as a hierarchy
[Diagram] store → city → region; store → sType
27
Dimension Hierarchies
28
Snowflake Schema
29
Example of Snowflake Schema

[Diagram] The Sales fact table (Date, Product, Store,
Customer; measurements unit_sales, dollar_sales,
schilling_sales) with normalized dimension hierarchies:
  Date → Month → Year
  Product (ProductNo, ProdName, ProdDesc, Category, QOH)
  Store (StoreID, City) → City → State → Country → Region
  Customer (CustId, CustName, CustCity, CustCountry)
30
Fact constellations
31
Database design methodology for data
warehouses (1)
• Nine-step methodology – proposed by Kimball
Step Activity
1 Choosing the process
2 Choosing the grain
3 Identifying and conforming the dimensions
4 Choosing the facts
5 Storing the precalculations in the fact table
6 Rounding out the dimension tables
7 Choosing the duration of the database
8 Tracking slowly changing dimensions
9 Deciding the query priorities and the query modes
32
Database design methodology for data
warehouses (2)
• There are many approaches that offer alternative routes to
the creation of a data warehouse
• Typical approach – decompose the design of the data
warehouse into manageable parts – data marts. At a later
stage, the integration of the smaller data marts leads to the
creation of the enterprise-wide data warehouse.
• The methodology specifies the steps required for the
design of a data mart, however, the methodology also ties
together separate data marts so that over time they merge
together into a coherent overall data warehouse.
33
Step 1: Choosing the process
34
Step 2: Choosing the grain
• Choosing the grain means deciding exactly what a fact
table record represents. For example, the entity ‘Sales’
may represent the facts about each property sale.
Therefore, the grain of the ‘Property_Sales’ fact table is
individual property sale.
• Only when the grain for the fact table is chosen
can we identify the dimensions of the fact table.
• The grain decision for the fact table also determines the
grain of each of the dimension tables. For example, if the
grain for the ‘Property_Sales’ is an individual property
sale, then the grain of the ‘Client’ dimension is the detail
of the client who bought a particular property.
35
Step 3: Identifying and conforming the
dimensions
• Dimensions set the context for formulating queries about
the facts in the fact table.
• We identify dimensions in sufficient detail to describe
things such as clients and properties at the correct grain.
• If any dimension occurs in two data marts, they must be
exactly the same dimension, or one must be a subset of
the other (this is the only way that two DMs can share one
or more dimensions in the same application).
• When a dimension is used in more than one DM, the
dimension is referred to as being conformed.
36
Step 4: Choosing the facts
37
Step 5: Storing pre-calculations in the
fact table
• Once the facts have been selected, each should be
re-examined to determine whether there are
opportunities to use pre-calculations.
• Common example: a profit or loss statement
• These types of facts are useful since they are additive
quantities, from which we can derive valuable information.
• This is particularly true for a value that is fundamental to
an enterprise, or if there is any chance of a user calculating
the value incorrectly.
38
Step 6: Rounding out the dimension
tables
• In this step we return to the dimension tables and add as
many text descriptions to the dimensions as possible.
• The text descriptions should be as intuitive and
understandable to the users as possible
39
Step 7: Choosing the duration of the
data warehouse
• The duration measures how far back in time the fact table
goes.
• For some companies (e.g. insurance companies) there may
be a legal requirement to retain data extending back five or
more years.
• Very large fact tables raise at least two very significant
data warehouse design issues:
– The older the data, the more likely there will be
problems in reading and interpreting the old files
– It is mandatory that the old versions of the important dimensions
be used, not the most current versions (we will discuss this issue
later on)
40
Step 8: Tracking slowly changing
dimensions
• The changing dimension problem means that the proper
description of the old client and the old branch must be
used with the old data warehouse schema
• Usually, the data warehouse must assign a generalized key
to these important dimensions in order to distinguish
multiple snapshots of clients and branches over a period of
time
• There are different types of changes in dimensions:
– A dimension attribute is overwritten
– A dimension attribute causes a new dimension record to be created
– etc.
41
Step 9: Deciding the query priorities
and the query modes
• In this step we consider physical design issues.
– The presence of pre-stored summaries and aggregates
– Indices
– Materialized views
– Security issue
– Backup issue
– Archive issue
42
Database design methodology for data
warehouses - summary
• At the end of this methodology, we have a design for a
data mart that supports the requirements of a particular
business process and allows easy integration with
other related data marts to ultimately form the
enterprise-wide data warehouse.
• A dimensional model, which contains more than one fact
table sharing one or more conformed dimension tables, is
referred to as a fact constellation.
43
Multidimensional Data Model
44
Multidimensional Data Model
45
Multidimensional Data Model
46
Multidimensional Data Model and
Aggregates
• Add up amounts for day 1
• In SQL: SELECT sum(Amt) FROM SALE
WHERE Date = 1
sale:  Product  Client  Date  Amt
       p1       c1      1     12
       p2       c1      1     11       result: 81
       p1       c3      1     50
       p2       c2      1     8
       p1       c1      2     44
       p1       c2      2     4
47
Multidimensional Data Model and
Aggregates
• Add up amounts by day
• In SQL: SELECT Date, sum(Amt)
FROM SALE GROUP BY Date
48
Multidimensional Data Model and
Aggregates
• Add up amounts by client, product
• In SQL: SELECT client, product, sum(amt)
FROM SALE
GROUP BY client, product
49
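The three aggregate queries above can be tried directly; a minimal sketch with sqlite3 standing in for the warehouse DBMS, using the sample SALE rows from the slides:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sale (product TEXT, client TEXT, date INTEGER, amt INTEGER)")
con.executemany("INSERT INTO sale VALUES (?,?,?,?)", [
    ('p1', 'c1', 1, 12), ('p2', 'c1', 1, 11), ('p1', 'c3', 1, 50),
    ('p2', 'c2', 1, 8),  ('p1', 'c1', 2, 44), ('p1', 'c2', 2, 4)])

# Total for day 1 (12 + 11 + 50 + 8 = 81)
day1 = con.execute("SELECT SUM(amt) FROM sale WHERE date = 1").fetchone()[0]

# Totals by day
by_day = con.execute("SELECT date, SUM(amt) FROM sale GROUP BY date").fetchall()

# Totals by client and product
by_cp = con.execute(
    "SELECT client, product, SUM(amt) FROM sale GROUP BY client, product").fetchall()
```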
Multidimensional Data Model and
Aggregates
50
Multidimensional Data Model and
Aggregates
• In the multidimensional data model we usually
store summarizing information (aggregates)
together with the measure values
c1 c2 c3 Sum
p1 56 4 50 110
p2 11 8 19
Sum 67 12 50 129
51
Aggregates
• Operators: sum, count, max, min,
median, avg
• “Having” clause
• Using dimension hierarchy
– average by region (within store)
– maximum by month (within date)
52
Cube Aggregation
[Diagram] Summing the per-day tables over days gives
      c1  c2  c3
p1    56   4  50
p2    11   8
then over products: sum = (67, 12, 50); over clients:
p1 = 110, p2 = 19; overall sum = 129.
53
Cube Operators
[Diagram] Examples of aggregate cells over the
day/product/client cube: sale(c1,*,*) sums all sales
of client c1 (over all products and dates);
sale(c2,p2,*) sums sales of product p2 to client c2
over all dates; sale(*,*,*) is the grand total, 129.
54
Cube
[Diagram] The full cube adds an 'all' value (*) to
every dimension. Summed over both days:
      c1  c2  c3    *
p1    56   4  50  110
p2    11   8        19
*     67  12  50  129
For day 1 alone the * row is (23, 8, 50, 81);
e.g. sale(*,p2,*) = 19.
55
Aggregation Using Hierarchies
[Diagram] Rolling up the client dimension of the
per-day tables along the hierarchy
customer → region → country, with customer c1 in
Region A and customers c2, c3 in Region B, the
day-1 table becomes
      region A  region B
p1    12        50
p2    11        8
56
Aggregation Using Hierarchies
[Diagram] A cube with dimensions client, product
(Video Camera, CD) and date of sale, where the
client hierarchy rolls clients up to cities
(New Orleans: c1, c2; Poznań: c3, c4). Aggregating
with respect to city sums the client rows:
NO    22   8  30
PN    23  18  22
57
A Sample Data Cube
[Diagram] A cube over Product (camera, video, CD),
Date (1Q-4Q) and Country (USA, Canada, Mexico),
with sum cells along each dimension.
58
Exercise (1)
• Suppose the AAA Automobile Co. builds a data
warehouse to analyze sales of its cars.
• The measure - price of a car
60
OLAP Servers
• Relational OLAP (ROLAP):
• Extended relational DBMS that maps
operations on multidimensional data to
standard relational operations
• Store all information, including fact tables, as
relations
• Multidimensional OLAP (MOLAP):
• Special purpose server that directly
implements multidimensional data and
operations
• store multidimensional datasets as arrays
61
OLAP Servers
62
OLAP Queries
• Roll up: summarize data along a
dimension hierarchy
• if we are given total sales volume per city, we
can aggregate along the Location dimension to
obtain sales per state
63
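A minimal sketch of such a roll-up, assuming hypothetical per-city sales totals and a hypothetical city-to-state mapping (the hierarchy table that a warehouse dimension would store):

```python
# Per-city totals and the Location hierarchy (both hypothetical examples)
city_sales    = {'nyc': 81, 'buffalo': 19, 'sfo': 48, 'la': 50}
city_to_state = {'nyc': 'NY', 'buffalo': 'NY', 'sfo': 'CA', 'la': 'CA'}

# Roll up: re-aggregate city totals at the coarser state level
state_sales = {}
for city, amt in city_sales.items():
    state = city_to_state[city]
    state_sales[state] = state_sales.get(state, 0) + amt
```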
OLAP Queries
[Diagram] The same client/product cube as before:
rolling clients up to cities (NO = c1 + c2,
PN = c3 + c4) gives
NO    22   8  30
PN    23  18  22
64
OLAP Queries
65
OLAP Queries
[Diagram] Roll-up moves from the per-day tables to
the totals over days
      c1  c2  c3
p1    56   4  50
p2    11   8
then to per-client sums (67, 12, 50), per-product
sums (p1: 110, p2: 19) and the grand total 129;
drill-down goes in the opposite direction.
66
OLAP Queries
• Slice and dice: select and project
• Sales of video in USA over the last 6 months
• Slicing and dicing reduce the number of
dimensions
• Pivot: reorient cube
• The result of pivoting is called a cross-
tabulation
• If we pivot the Sales cube on the Client and
Product dimensions, we obtain a table for
each client for each product value
67
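A sketch of pivot-with-aggregation in plain Python, using the flat SALE rows from the earlier example; pivoting on (product, client) and summing amt yields the cross-tabulation:

```python
# Flat fact rows: (product, client, date, amt)
rows = [('p1', 'c1', 1, 12), ('p2', 'c1', 1, 11), ('p1', 'c3', 1, 50),
        ('p2', 'c2', 1, 8),  ('p1', 'c1', 2, 44), ('p1', 'c2', 2, 4)]

crosstab = {}   # crosstab[product][client] = SUM(amt)
for product, client, _date, amt in rows:
    cell = crosstab.setdefault(product, {})
    cell[client] = cell.get(client, 0) + amt
```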
OLAP Queries
• Pivoting can be combined with aggregation
[Diagram] Pivoting the flat table sale(prodId,
clientId, date, amt) yields cross-tabulations:
by date:                    by product:
      c1  c2  c3  Sum             c1  c2  c3  Sum
1     23   8  50   81       p1    56   4  50  110
2     44   4       48       p2    11   8       19
Sum   67  12  50  129       Sum   67  12  50  129
68
OLAP Queries
• Ranking: selection of the first n elements (e.g.
select the 5 best-selling products in July)
• Others: stored procedures, selection, etc.
• Time functions
– e.g., time average
69
Implementing a Warehouse
• Designing and rolling out a data warehouse
is a complex process, consisting of the
following activities:
♦ Define the architecture, do capacity planning,
and select the storage servers, database and
OLAP servers (ROLAP vs MOLAP), and tools,
♦ Integrate the servers, storage, and client tools,
♦ Design the warehouse schema and views,
71
Implementing a Warehouse
72
Implementing a Warehouse
♦ Populate the repository with the schema and
view definitions, scripts, and other metadata,
♦ Design and implement end-user applications,
♦ Roll out the warehouse and applications.
73
Implementing a Warehouse
• Monitoring: Sending data from sources
• Integrating: Loading, cleansing,...
• Processing: Query processing, indexing, ...
• Managing: Metadata, Design, ...
74
Monitoring
• Data Extraction
– Data extraction from external sources is usually
implemented via gateways and standard
interfaces (such as Information Builders
EDA/SQL, ODBC, JDBC, Oracle Open
Connect, Sybase Enterprise Connect, Informix
Enterprise Gateway, etc.)
75
Monitoring Techniques
• Detect changes to an information source
that are of interest to the warehouse:
• define triggers in a full-functionality DBMS
• examine the updates in the log file
• write programs for legacy systems
• polling (queries to source)
• screen scraping
76
Integration
• Integrator
• Receive changes from the monitors
• make the data conform to the conceptual schema used
by the warehouse
• Integrate the changes into the warehouse
• merge the data with existing data already present
• resolve possible update anomalies
• Data Cleaning
• Data Loading
77
Data Cleaning
• Data cleaning is important for a warehouse -
there is a high probability of errors and
anomalies in the data:
– inconsistent field lengths, inconsistent
descriptions, inconsistent value assignments,
missing entries and violation of integrity
constraints.
– optional fields in data entry are significant
sources of inconsistent data.
78
Data Cleaning Techniques
• Data migration: allows simple data
transformation rules to be specified, e.g.
"replace the string gender by sex" (Warehouse
Manager from Prism is an example of this tool)
• Data scrubbing: uses domain-specific
knowledge to scrub data (e.g. postal addresses)
(Integrity and Trillium fall in this category)
• Data auditing: discovers rules and
relationships by scanning data (detect outliers).
Such tools may be considered as variants of
data mining tools
79
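A minimal sketch of a data-migration rule like the one above: rename the field "gender" to "sex" and normalize its values. The field names and value codes here are hypothetical.

```python
# Hypothetical normalization table for the renamed field
VALUE_MAP = {'m': 'M', 'male': 'M', 'f': 'F', 'female': 'F'}

def migrate(record):
    """Apply the rule 'replace the string gender by sex' to one record."""
    out = dict(record)
    if 'gender' in out:
        raw = str(out.pop('gender')).lower()
        out['sex'] = VALUE_MAP.get(raw, raw.upper())  # keep unknown codes, uppercased
    return out
```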
Data Loading
• After extracting, cleaning and transforming, data
must be loaded into the warehouse.
• Loading the warehouse includes some other
processing tasks: checking integrity constraints,
sorting, summarizing, etc.
• Typically, batch load utilities are used for
loading. A load utility must allow the
administrator to monitor status, to cancel, suspend,
and resume a load, and to restart after failure with
no loss of data integrity
80
Data Loading Issues
81
Data Refresh
• Refreshing a warehouse means propagating
updates on source data to the data stored in the
warehouse
• when to refresh:
• periodically (daily or weekly)
• immediately (deferred vs. immediate refresh)
– determined by usage, types of data source, etc.
82
Data Refresh
• how to refresh
– data shipping
– transaction shipping
83
Data Shipping
• data shipping (e.g. Oracle Replication Server): a
table in the warehouse is treated as a remote
snapshot of a table in the source database.
An after-row trigger is used to update the snapshot
log table and propagate the updated data to the
warehouse
84
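A sketch of the snapshot-log idea using a SQLite after-row trigger; the table and trigger names are hypothetical, and Oracle's actual replication mechanism differs in detail.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE src_sale (id INTEGER PRIMARY KEY, amt INTEGER);
CREATE TABLE snapshot_log (id INTEGER, new_amt INTEGER);

-- After-row trigger: every update on the source table is recorded in the
-- snapshot log, from which the warehouse copy can later be refreshed.
CREATE TRIGGER ship AFTER UPDATE ON src_sale
BEGIN
    INSERT INTO snapshot_log VALUES (NEW.id, NEW.amt);
END;

INSERT INTO src_sale VALUES (1, 10);
UPDATE src_sale SET amt = 25 WHERE id = 1;
""")
log = con.execute("SELECT * FROM snapshot_log").fetchall()
```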
Transaction Shipping
85
Derived Data
• Derived Warehouse Data
– indexes
– aggregates
– materialized views
• When to update derived data?
• The most difficult problem is how to refresh the
derived data. The problem of constructing
algorithms for incrementally updating derived data
has been the subject of much research!
86
Materialized Views
• Define new warehouse relations using SQL
expressions
sale:  prodId  clientId  date  amt     product:  id  name  price
       p1      c1        1     12                p1  bolt  10
       p2      c1        1     11                p2  nut   5
       p1      c3        1     50
       p2      c2        1     8
       p1      c1        2     44
       p1      c2        2     4
join of sale and product
87
Processing
• Index Structures
• What to Materialize?
• Algorithms
88
Index Structures
• Indexing principle:
• mapping key values to records for
associative direct access
• Most popular indexing techniques in
relational database: B+-trees
• For multi-dimensional data, a large
number of indexing techniques have been
developed: R-trees
89
Index Structures
• Index structures applied in warehouses
– inverted lists
– bit map indexes
– join indexes
– text indexes
90
Inverted Lists
[Diagram] An inverted-list index on age: each key
value (18, 19, ...) maps to the inverted list of
data records having that age.
91
Inverted Lists
• Query:
– Get people with age = 20 and name = “fred”
• List for age = 20: r4, r18, r34, r35
• List for name = “fred”: r18, r52
• Answer is intersection: r18
92
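The lookup above can be sketched as: each (attribute, value) pair maps to its inverted list of record ids, and a conjunctive query intersects the lists.

```python
# Inverted lists keyed by (attribute, value); record ids as on the slide
inverted = {
    ('age', 20):      {'r4', 'r18', 'r34', 'r35'},
    ('name', 'fred'): {'r18', 'r52'},
}

# Conjunctive query = set intersection of the two lists
answer = inverted[('age', 20)] & inverted[('name', 'fred')]
```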
Bitmap Indexes
• Bitmap index: An indexing technique that has
attracted attention in multi-dimensional
database implementation
table
Customer City Car
c1 Detroit Ford
c2 Chicago Honda
c3 Detroit Honda
c4 Poznan Ford
c5 Paris BMW
c6 Paris Nissan
93
Bitmap Indexes
• The index consists of bitmaps:
Index on City:
rec  Chicago  Detroit  Paris  Poznan
1    0        1        0      0
2    1        0        0      0
3    0        1        0      0
4    0        0        0      1
5    0        0        1      0
6    0        0        1      0
bitmaps
94
Bitmap Indexes
Index on Car:
rec  BMW  Ford  Honda  Nissan
1    0    1     0      0
2    0    0     1      0
3    0    0     1      0
4    0    1     0      0
5    1    0     0      0
6    0    0     0      1
bitmaps
95
Bitmap Indexes
• Index on a particular column
• Index consists of a number of bit vectors -
bitmaps
• Each value in the indexed column has a bit
vector (bitmaps)
• The length of the bit vector is the number of
records in the base table
• The i-th bit is set if the i-th row of the base
table has that value in the indexed column
96
Bitmap Index
[Diagram] A bitmap index on age for the data records
  id  name   age
  1   joe    20
  2   fred   20
  3   sally  21
  4   nancy  20
  5   tom    20
  6   pat    25
  7   dave   21
  8   jeff   26
  ...
with one bit vector per age value (18, 19, 20, ...);
e.g. the vector for age 20 has bits 1, 2, 4 and 5 set.
97
Using Bitmap indexes
• Query:
– Get people with age = 20 and name = “fred”
• List for age = 20: 1101100000
• List for name = “fred”: 0100000001
• Answer is intersection: 0100000000
• Good if domain cardinality small
• Bit vectors can be compressed
98
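A minimal sketch of a bitmap index, using Python integers as packed bit vectors; note the row numbering here is 0-based, unlike the r1, r2, ... labels on the slides.

```python
# Base table columns (first 8 rows of the example data)
names = ['joe', 'fred', 'sally', 'nancy', 'tom', 'pat', 'dave', 'jeff']
ages  = [ 20,    20,     21,      20,      20,    25,    21,     26]

def bitmap(column, value):
    """Build the bit vector for one value: bit i is set iff row i has it."""
    bits = 0
    for i, v in enumerate(column):
        if v == value:
            bits |= 1 << i
    return bits

# Conjunctive selection = bit-wise AND of the two bitmaps
match = bitmap(ages, 20) & bitmap(names, 'fred')
rows = [i for i in range(len(names)) if match >> i & 1]
```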
Using Bitmap indexes
• They allow the use of efficient bit operations to
answer some queries
• “how many customers from Detroit have car
‘Ford’”
– perform a bit-wise AND of two bitmaps:
answer – c1
• “how many customers have a car ‘Honda’”
– count 1’s in the bitmap: answer 2
• Compression: bit vectors are usually sparse
for large databases, hence the need for
compression and decompression
99
Bitmap Index – Summary
• With efficient hardware support for bitmap
operations (AND, OR, XOR, NOT), bitmap index
offers better access methods for certain queries
• e.g., selection on two attributes
• Some commercial products have implemented
bitmap index
• Works poorly for high cardinality domains since
the number of bitmaps increases
• Difficult to maintain - need reorganization when
relation sizes change (new bitmaps)
100
Join
• “Combine” SALE, PRODUCT relations
• In SQL: SELECT * FROM SALE, PRODUCT
sale:  prodId  storeId  date  amt     product:  id  name  price
       p1      c1       1     12                p1  bolt  10
       p2      c1       1     11                p2  nut   5
       p1      c3       1     50
       p2      c2       1     8
       p1      c1       2     44
       p1      c2       2     4
101
Join Indexes
join index
product id name price jIndex
p1 bolt 10 r1,r3,r5,r6
p2 nut 5 r2,r4
102
Join Indexes
• Traditional indexes map the value to a list of
record ids. Join indexes map the tuples in the join
result of two relations to the source tables.
• In data warehouse cases, join indexes relate the
values of the dimensions of a star schema to rows
in the fact table.
• For a warehouse with a Sales fact table and dimension
city, a join index on city maintains for each distinct city
a list of RIDs of the tuples recording the sales in the
city
• Join indexes can span multiple dimensions
103
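A sketch of building such a join index on the city dimension; the RIDs and city values are illustrative:

```python
# Fact-table tuples as (RID, city) pairs
fact_rows = [('r1', 'Detroit'), ('r2', 'Chicago'), ('r3', 'Detroit'),
             ('r4', 'Poznan'),  ('r5', 'Paris'),   ('r6', 'Paris')]

# Join index: for each distinct city, the list of matching fact-table RIDs
join_index = {}
for rid, city in fact_rows:
    join_index.setdefault(city, []).append(rid)
```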
What to Materialize?
• Store in warehouse results useful for
common queries
Example: total sale
[Diagram] Starting from the per-day sales tables,
materialize the table of totals over days
      c1  c2  c3
p1    56   4  50
p2    11   8
and further aggregates: totals per client over
products (67, 12, 50), totals per product over
clients (p1: 110, p2: 19), and the grand total 129.
104
Cube Operation
• SELECT date, product, customer, SUM (amount)
FROM SALES
CUBE BY date, product, customer
• We need to compute the following Group-Bys:
– (date, product, customer),
– (date, product), (date, customer), (product, customer),
– (date), (product), (customer), ()
105
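A sketch of what CUBE BY computes: one GROUP BY per subset of the dimension list, i.e. 2^n group-bys for n dimensions. The sample rows are hypothetical.

```python
from itertools import combinations

# (date, product, customer, amount) rows - hypothetical sample data
rows = [('d1', 'p1', 'c1', 12), ('d1', 'p2', 'c1', 11), ('d2', 'p1', 'c2', 4)]
dims = ('date', 'product', 'customer')

# cube[subset_of_dim_positions][group_key] = SUM(amount)
cube = {}
for k in range(len(dims) + 1):
    for subset in combinations(range(len(dims)), k):
        groups = cube.setdefault(subset, {})
        for row in rows:
            key = tuple(row[i] for i in subset)
            groups[key] = groups.get(key, 0) + row[3]
```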
Cuboid Lattice
• Data cube can be viewed as a lattice of
cuboids
• The bottom-most cuboid is the base cuboid
(A, B, C, D).
• The top-most cuboid (all) contains only one cell.
106
Cuboid Lattice
[Diagram] Part of the lattice for the sales data:
the base cuboid (city, product, date) holds the
per-day tables; rolling up gives e.g. the per-client
totals (67, 12, 50) and the apex cuboid all = 129.
Use a greedy algorithm to decide what to materialize.
Efficient Data Cube Computation
• Materialization of data cube
• Materialize every (cuboid), none, or some.
• Algorithms for selection of which cuboids to
materialize:
• size, sharing, and access frequency:
– Type/frequency of queries
– Query response time
– Storage cost
– Update cost
108
Dimension Hierarchies
• Client hierarchy: city → state → region
  city  state  region
  c1    CA     East
  c2    NY     East
  c3    SF     West
109
Dimension Hierarchies
Computation
[Diagram] Using the hierarchy, the base cuboid
(city, product, date) rolls up to (state, product)
and (state, date), then to (state), and finally
to all.
110
Cube Computation - Array Based
Algorithm
• An MOLAP approach:
• the base cuboid is stored as
multidimensional array.
• read in a number of cells to compute partial
cuboids
111
Cube computations
[Diagram] Group-bys over dimensions A, B, C form a
lattice: {ABC} at the bottom, then {AB}, {AC}, {BC},
then {A}, {B}, {C}, and { } (ALL) at the top.
112
View and Materialized Views
• View
• derived relation defined in terms of base
(stored) relations
• Materialized views
• a view can be materialized by storing the tuples
of the view in the database
• index structures can be built on the materialized
view
113
View and Materialized Views
• Maintenance is an issue for materialized
views
• recomputation
• incremental updating
114
Maintenance of materialized views
• Example: “deficit” departments, i.e. departments
whose total salaries exceed the budget
• To find all “deficit” departments:
– group Employee by DeptId, summing salaries
– join with Dept on DeptId
– select all dept. names with budget < sum(salary)
115
Maintenance of materialized views
• select DeptId, sum(salary) Real_Budget
from Employee
group by DeptId; Temp (relation)
• select Name
from Dept, Temp
where Dept.DeptId=Temp.DeptId
and Budget < Real_Budget;
116
Maintenance of materialized views
• assume the following update:
update Employee
set salary=salary+1000
where Lname=‘Jabbar’;
• recompute the whole view?
• use intermediate materialized results
(Temp), and update the view incrementally?
117
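A sketch of the incremental alternative: instead of recomputing the per-department salary sums, apply the delta of the single updated row to the materialized aggregate. The department data is hypothetical.

```python
# Materialized Temp relation: DeptId -> SUM(salary)
temp = {'D1': 50_000, 'D2': 70_000}

def apply_raise(dept_id, delta):
    """The update 'salary = salary + delta' for one employee contributes
    exactly +delta to that employee's department group."""
    temp[dept_id] += delta

apply_raise('D1', 1000)   # assume Jabbar works in department D1
```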
Managing
Metadata Repository
• Administrative metadata
• source database and their contents
• gateway descriptions
• warehouse schema, view and derived data
definitions
• dimensions and hierarchies
• pre-defined queries and reports
• data mart locations and contents
119
Metadata Repository
• Administrative metadata
• data partitions
• data extraction, cleansing, transformation rules,
defaults
• data refresh and purge rules
• user profiles, user groups
• security: user authorization, access control
120
Metadata Repository
• Business
– business terms & definition
– data ownership, charging
• Operational
– data layout
– data currency (e.g., active, archived, purged)
– use statistics, error reports, audit trails
121
Design
• What data is needed?
• Where does it come from?
• How to clean data?
• How to represent in warehouse (schema)?
• What to summarize?
• What to materialize?
• What to index?
122
Summary
• A data warehouse is not a software product
or application - it is an important
information processing system
architecture for decision making!
• A data warehouse combines a number of
products, each of which has operational uses
besides the data warehouse
123
Summary
• OLAP provides powerful and fast tools
for reporting on data:
• ROLAP vs. MOLAP
• Issues associated with data warehouses:
• new techniques: multidimensional database,
data cube computation, query optimization,
indexing, …
• data warehousing and application design:
vendors and application developers.
124
Current State of Industry
• Extraction and integration done off-line
– Usually in large, time-consuming, batches
• Everything copied at warehouse
– Not selective about what is stored
– Query benefit vs storage & update cost
• Query optimization aimed at OLTP
– High throughput instead of fast response
– Process whole query before displaying anything
125