OLAP2

Data Warehousing and OLAP: Implementing Datacubes Efficiently
dr. Toon Calders t.calders@tue.nl
Last lecture
Traditional relational databases are geared towards online transaction processing Decision support has different requirements
Integrate multiple sources into a data warehouse Multi-dimensional view on the data Aggregations
Last lecture
Data cubes as a conceptual model of multi-dimensional, aggregated data.
TV PC VCR sum 1Qtr 2Qtr 3Qtr 4Qtr sum Ireland France Germany sum
Last lecture
At the logical level: Query languages for supporting OLAP
SQL extensions MDX Slice-and-dice, pivot, roll-up, drill-down, etc.
Last lecture
At the physical level:
Straighforward implementation of the data cube is problematic
Data explosion problem

Most databases are sparse Full cube is much larger than the original database
Last lecture
Source: the OLAP report: Database explosion
Last lecture
Source: the OLAP report: Database explosion
This lecture
Is it possible to get performance gains when only partially materializing the cube? Which parts should we materialize in order to get the best performance?
V. Harinarayan, A. Rajaraman, and J. D. Ullman: Implementing data cubes efficiently. In: ACM SIGMOD 1996.
Overview
Problem statement Formal model of computation
Partial order on Queries Cost model
Greedy solution Performance guarantee Conclusion
The problem: informally

Relation Sales
Dimensions: part, supplier, customer Measure: sales Aggregation: sum
The data cube contains all aggregations on part, supplier, customer; e.g.:
(p1,s1,<all>,165) (p1,s2,<all>,145)
Queries
We are interested in answering the following type of queries efficiently:
( part, supplier ) SELECT part, supplier, sum(sales) FROM Sales GROUP BY part, supplier ( part ) SELECT part, sum(sales) FROM Sales GROUP BY part
( supplier ) SELECT supplier, sum(sales) FROM Sales GROUP BY supplier
( part, customer ) SELECT customer, part, sum(sales) FROM Sales GROUP BY customer, part
Queries
The queries are quantified by their grouping attributes These queries select disjoint parts of the cube. The union of all these queries is the complete cube.
Queries
supplier
part
customer
Queries
supplier
(part,supplier,customer)
part
customer
Queries
supplier (supplier,customer)
part
customer
Queries
supplier
(customer) part
customer
Materialization
Materializing everything is impossible Cost of evaluating following query directly:
SELECT D1, , Dk, sum(M) FROM R GROUP BY D1, , Dk
in practice roughly linear in the size of R ( worst case: |R| log(|R|) )
Materializing some queries can help the other queries too
Materialization
Example:
( part, customer ) SELECT customer, part, sum(sales) FROM Sales GROUP BY customer, part | Sales |
( part ) SELECT part, sum(sales) FROM Sales GROUP BY part | Sales |
Materialization
Example:
( part, customer ) SELECT customer, part, sum(sales) FROM Sales GROUP BY customer, part materialized as PC ( part ) SELECT part, sum(sales) FROM Sales PC GROUP BY part | Sales |PC| | | Sales |PC| |
Example
Query (part,supplier,customer) (part,customer) (part,supplier) (supplier,customer) (part) (supplier) (customer) () Answer 6M 6M 0.8M 6M 0.2M 0.01M 0.1M 1
10
Example: nothing materialized

Query (part,supplier,customer) (part,customer) (part,supplier) (supplier,customer) (part) (supplier) (customer) () Answer 6M 6M 0.8M 6M 0.2M 0.01M 0.1M 1 Cost 6M 6M 6M 6M 6M 6M 6M 6M
Total cost: 48M
Example: some materialized

Query (part,supplier,customer) (part,customer) (part,supplier) (supplier,customer) (part) (supplier) (customer) () Answer 6M 6M 0.8M 6M 0.2M 0.01M 0.1M 1 Cost
11
Example: some materialized

Query (part,supplier,customer) (part,customer) (part,supplier) (supplier,customer) (part) (supplier) (customer) () Answer 6M 6M 0.8M 6M 0.2M 0.01M 0.1M 1 Cost 6M 6M 0.8M 6M 0.8M 0.8M 0.1M 0.1M
Total cost: 20.6M
Research Questions
What is the optimal set of views to materialize? How many views must we materialize to get reasonable performance? Given bounded space S, what is the optimal choice of views to materialize?
12
Central Question
Given:
for every view its size a number k
Find:
which k materialized views give the most gain
This problem is NP-complete

greedy algorithm
Overview
13
Partial order on queries

Q1 Q2 :
Q1 can be answered using the query results of Q2 A materialized view of Q2 can be used to answer Q1
is a partial order on the views

(part, supplier, customer)
(part, supplier)
(part, customer)
(supplier, customer)
(part)
(supplier)
(customer)
()
14

Can be generalized to hierarchies:
(product, country, -) (product, city,-) (-, country, -) (product, city, -) (-, -, -) (product, city, -) (-, country, year) (-, city, month)
(a1, , an) (b1, , bn) if for all i = 1n, ai higher than or equal to bi in the hierarchy of the ith dimension
Cost Model
Given a set of materialized views S = {V1,,Vn}, the cost of query Q is costS(Q) := min( { |Vi| : i = 1n, Q Vi } ) The benefit of view W for computing Q w.r.t. S BW (Q | S) = costS{W}(Q) - costS(Q) Total benefit of W w.r.t. S BW(S) =
Q BW (Q | S)
15
Cost Model: Example

psc (6M) Q psc pc (6M) sc (6M) ps pc p s (0.01M) c (0.1M) s c () (1) CS 6M 6M 6M 6M 6M 0.1 0.1M CSU{W} BW 6M 0 0.8M 5.2M 6M 0 0.8M 5.2M 0.8M 5.2M 0.1 0 0.1M 0
ps (0.8M)
p (0.2M)
Overview
16
Greedy algorithm
Input:
Size for every view and constant k
In each step
Select the view with the most benefit Add it to the result Until we have k views
Greedy Algorithm
Algorithm
S := { top view }; for i:=1 to k { select view W such that BW(S) is maximal S := S {W} } return S;
17
Example
100
a
50
30
d
20 1
e g h
E.g. The cost of constructing view b given the view A is 75 100 If we choose b to materialize, f 40 the new cost of constructing view b is 50. 10
First round
100
a
50
30
75
d
20 1
e g h
f
10
Notice that not only b, but also d, e, g and h can be calculated from b So the total benefit is (100 50) x 5 = 250 40
18
Continue
100
a
50
Similarly, the benefit of materializing c is (100 75) x 5 = 125
30
75
Benefit
d
20 1
e g h
f
10
250 125
40 c
Not yet finish

100
a
50
30
75
For e, Benefit = (100-30) x 3 = 210 Benefit
d
20 1
e g h
f
40 10
b c e
250 125 210
19
Lets choose b!
100
a
50
30
75
For d and f , Benefit = (100-20) x 2 = 160 and (100-40) x 2 = 120 respectively.

Benefit b 250 125 160 210 120
d
20 1
e g h
f
40 10
c d e f
Next round?
Seems we should choose e, as it has the second largest benefit. Lets see what will happen in the second round.
b c d e f 250 125 160 210 120
Benefit
20
Second round!
100
a
50
30
Now, only c and f get benefit if we materialize c (since e, g and h can be more efficiently 75 calculated by using b) Benefit = (100 75) x 2 = 50
d
20 1
e g h
f
40 10
c Benefit 50
How about choosing f?

100
a
50
30
If we choose f, we found that h can be effectively calculated by using f instead of b. 75 Benefit = (100 40) + (50 40)
20 1
d g
e h
f
40 10
Benefit c f 50 70
21
Easy to work out others

100
a
50
30
20 1
d g
e h
Benefit of d = (50 20) x 2 = 60 Benefit of e 75 = (50 30) x 3 = 60 Benefit of g = 50 1 = 49 40 Benefit of h = 50 10 = 40
10
Observation
In the first round, the benefit of choosing f (only 120) is far from the best choice (250) But in second round, choosing f gives the maximum benefit!
1st round b c d e f Benefit 250 125 160 210 120 2nd round c d e f g Benefit 50 60 70 70 49
22
Overview
Simple? Optimal?
Trade off! This simple algorithm is not optimal in all cases! Consider the following case
23
Bad example
200
a
100
c
99
100
20 nodes Total 1000
Bad example
200
Choose c Benefit = (200-99) x (1 + 20 + 20) = 4141 = maximum
a
100
20 nodes Total 1000
c
99
100
24
Bad example
200
Now choose either 1 of b and d (same benefit)
a
100
20 nodes Total 1000
c
99
100
Bad example
200
How about these? Very expensive!!!
a
100
20 nodes Total 1000
c
99
100
25
Optimal solution should be

200
Only c is a little bit expensive.
a
100
20 nodes Total 1000
c
99
100
Some theoretical result

It can be proved that we can get at least (e 1 ) / e % (which is about 63%) of the benefit of the optimal algorithm.
26
Extensions (1)
Problem
The views in a lattice are unlikely to have the same probability of being requested in a query.
Solution:
We can weight each benefit by its probability.
Extensions (2)
Problem
Instead of asking for some fixed number (k) of views to materialize, we might instead allocate a fixed amount of space to views.
Solution
We can consider the benefit of each view per unit space.
27
Conclusions
Materialization of views is an essential query optimization strategy for decisionsupport applications. Reason to materialize some part of the data cube but not all of the cube. A lattice framework that models multidimensional analysis very well.
Conclusions (cont.)
Finding optimal solution is NP-hard. Introduction of greedy algorithm Greedy algorithm works on this lattice and picks the almost right views to materialize.
28
Conclusions (the end)

There exists cases which greedy algorithm fails to produce optimal solution. But greedy algorithm has guaranteed performance Expansion of greedy algorithm.
29

OLAP2

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

OLAP2

Uploaded by

Copyright:

Available Formats

Data Warehousing and OLAP: Implementing Datacubes Efficiently

dr. Toon Calders t.calders@tue.nl

Data explosion problem

Source: the OLAP report: Database explosion

Source: the OLAP report: Database explosion

Greedy solution Performance guarantee Conclusion

The problem: informally

( supplier ) SELECT supplier, sum(sales) FROM Sales GROUP BY supplier

in practice roughly linear in the size of R ( worst case: |R| log(|R|) )

Materializing some queries can help the other queries too

( part ) SELECT part, sum(sales) FROM Sales GROUP BY part | Sales |

Example: nothing materialized

Total cost: 48M

Example: some materialized

Example: some materialized

Total cost: 20.6M

This problem is NP-complete

Greedy solution Performance guarantee Conclusion

Partial order on queries

is a partial order on the views

Partial order on queries

Partial order on queries

Cost Model: Example

Greedy solution Performance guarantee Conclusion

Similarly, the benefit of materializing c is (100 75) x 5 = 125

Not yet finish

For e, Benefit = (100-30) x 3 = 210 Benefit

250 125 210

For d and f , Benefit = (100-20) x 2 = 160 and (100-40) x 2 = 120 respectively.

How about choosing f?

Easy to work out others

Benefit of d = (50 20) x 2 = 60 Benefit of e 75 = (50 30) x 3 = 60 Benefit of g = 50 1 = 49 40 Benefit of h = 50 10 = 40

Greedy solution Performance guarantee Conclusion

20 nodes Total 1000

Optimal solution should be

Some theoretical result

Conclusions (the end)

You might also like