Professional Documents
Culture Documents
Last lecture
Traditional relational databases are geared towards online transaction processing Decision support has different requirements
Integrate multiple sources into a data warehouse Multi-dimensional view on the data Aggregations
Last lecture
Data cubes as a conceptual model of multi-dimensional, aggregated data.
TV PC VCR sum 1Qtr 2Qtr 3Qtr 4Qtr sum Ireland France Germany sum
Last lecture
At the logical level: Query languages for supporting OLAP
SQL extensions MDX Slice-and-dice, pivot, roll-up, drill-down, etc.
Last lecture
At the physical level:
Straighforward implementation of the data cube is problematic
Last lecture
Last lecture
This lecture
Is it possible to get performance gains when only partially materializing the cube? Which parts should we materialize in order to get the best performance?
V. Harinarayan, A. Rajaraman, and J. D. Ullman: Implementing data cubes efficiently. In: ACM SIGMOD 1996.
Overview
Problem statement Formal model of computation
Partial order on Queries Cost model
The data cube contains all aggregations on part, supplier, customer; e.g.:
(p1,s1,<all>,165) (p1,s2,<all>,145)
Queries
We are interested in answering the following type of queries efficiently:
( part, supplier ) SELECT part, supplier, sum(sales) FROM Sales GROUP BY part, supplier ( part ) SELECT part, sum(sales) FROM Sales GROUP BY part
( part, customer ) SELECT customer, part, sum(sales) FROM Sales GROUP BY customer, part
Queries
The queries are quantified by their grouping attributes These queries select disjoint parts of the cube. The union of all these queries is the complete cube.
Queries
supplier
part
customer
Queries
supplier
(part,supplier,customer)
part
customer
Queries
supplier (supplier,customer)
part
customer
Queries
supplier
(customer) part
customer
Materialization
Materializing everything is impossible Cost of evaluating following query directly:
SELECT D1, , Dk, sum(M) FROM R GROUP BY D1, , Dk
Materialization
Example:
( part, customer ) SELECT customer, part, sum(sales) FROM Sales GROUP BY customer, part | Sales |
Materialization
Example:
( part, customer ) SELECT customer, part, sum(sales) FROM Sales GROUP BY customer, part materialized as PC ( part ) SELECT part, sum(sales) FROM Sales PC GROUP BY part | Sales |PC| | | Sales |PC| |
Example
Query (part,supplier,customer) (part,customer) (part,supplier) (supplier,customer) (part) (supplier) (customer) () Answer 6M 6M 0.8M 6M 0.2M 0.01M 0.1M 1
10
11
Research Questions
What is the optimal set of views to materialize? How many views must we materialize to get reasonable performance? Given bounded space S, what is the optimal choice of views to materialize?
12
Central Question
Given:
for every view its size a number k
Find:
which k materialized views give the most gain
Overview
Problem statement Formal model of computation
Partial order on Queries Cost model
13
(part, supplier)
(part, customer)
(supplier, customer)
(part)
(supplier)
(customer)
()
14
(a1, , an) (b1, , bn) if for all i = 1n, ai higher than or equal to bi in the hierarchy of the ith dimension
Cost Model
Given a set of materialized views S = {V1,,Vn}, the cost of query Q is costS(Q) := min( { |Vi| : i = 1n, Q Vi } ) The benefit of view W for computing Q w.r.t. S BW (Q | S) = costS{W}(Q) - costS(Q) Total benefit of W w.r.t. S BW(S) =
Q BW (Q | S)
15
ps (0.8M)
p (0.2M)
Overview
Problem statement Formal model of computation
Partial order on Queries Cost model
16
Greedy algorithm
Input:
Size for every view and constant k
In each step
Select the view with the most benefit Add it to the result Until we have k views
Greedy Algorithm
Algorithm
S := { top view }; for i:=1 to k { select view W such that BW(S) is maximal S := S {W} } return S;
17
Example
100
a
50
30
d
20 1
e g h
E.g. The cost of constructing view b given the view A is 75 100 If we choose b to materialize, f 40 the new cost of constructing view b is 50. 10
First round
100
a
50
30
75
d
20 1
e g h
f
10
Notice that not only b, but also d, e, g and h can be calculated from b So the total benefit is (100 50) x 5 = 250 40
18
Continue
100
a
50
30
75
Benefit
d
20 1
e g h
f
10
250 125
40 c
a
50
30
75
d
20 1
e g h
f
40 10
b c e
19
Lets choose b!
100
a
50
30
75
d
20 1
e g h
f
40 10
c d e f
Next round?
Seems we should choose e, as it has the second largest benefit. Lets see what will happen in the second round.
b c d e f 250 125 160 210 120
Benefit
20
Second round!
100
a
50
30
Now, only c and f get benefit if we materialize c (since e, g and h can be more efficiently 75 calculated by using b) Benefit = (100 75) x 2 = 50
d
20 1
e g h
f
40 10
c Benefit 50
a
50
30
If we choose f, we found that h can be effectively calculated by using f instead of b. 75 Benefit = (100 40) + (50 40)
20 1
d g
e h
f
40 10
Benefit c f 50 70
21
a
50
30
20 1
d g
e h
10
Observation
In the first round, the benefit of choosing f (only 120) is far from the best choice (250) But in second round, choosing f gives the maximum benefit!
1st round b c d e f Benefit 250 125 160 210 120 2nd round c d e f g Benefit 50 60 70 70 49
22
Overview
Problem statement Formal model of computation
Partial order on Queries Cost model
Simple? Optimal?
Trade off! This simple algorithm is not optimal in all cases! Consider the following case
23
Bad example
200
a
100
c
99
100
Bad example
200
Choose c Benefit = (200-99) x (1 + 20 + 20) = 4141 = maximum
a
100
20 nodes Total 1000
c
99
100
24
Bad example
200
Now choose either 1 of b and d (same benefit)
a
100
20 nodes Total 1000
c
99
100
Bad example
200
How about these? Very expensive!!!
a
100
20 nodes Total 1000
c
99
100
25
a
100
20 nodes Total 1000
c
99
100
26
Extensions (1)
Problem
The views in a lattice are unlikely to have the same probability of being requested in a query.
Solution:
We can weight each benefit by its probability.
Extensions (2)
Problem
Instead of asking for some fixed number (k) of views to materialize, we might instead allocate a fixed amount of space to views.
Solution
We can consider the benefit of each view per unit space.
27
Conclusions
Materialization of views is an essential query optimization strategy for decisionsupport applications. Reason to materialize some part of the data cube but not all of the cube. A lattice framework that models multidimensional analysis very well.
Conclusions (cont.)
Finding optimal solution is NP-hard. Introduction of greedy algorithm Greedy algorithm works on this lattice and picks the almost right views to materialize.
28
29