Concepts and
Techniques
— Chapter 4 —
Multidimensional Databases
Summary
General heuristics
Multi-way array aggregation
BUC
H-cubing
Star-Cubing
High-Dimensional OLAP
[Figure: part of the cuboid lattice for time, item, location, supplier]
3-D cuboids: (time, item, location), (time, item, supplier), (time, location, supplier), (item, location, supplier)
4-D (base) cuboid: (time, item, location, supplier)
Array-based “bottom-up” algorithm
Using multi-dimensional chunks
No direct tuple comparisons
Simultaneous aggregation on multiple dimensions
Intermediate aggregate values are re-used for computing ancestor cuboids
Cannot do Apriori pruning: No iceberg optimization
[Figure: cuboid lattice with levels all; A, B, C; AB, AC, BC; ABC]
08/10/09 Data Mining: Concepts and Techniques 9
Multi-way Array Aggregation for Cube
Computation (MOLAP)
Partition arrays into chunks (a small subcube which fits in
memory).
Compressed sparse array addressing: (chunk_id, offset)
Compute aggregates in “multiway” by visiting cube cells in
the order which minimizes the # of times to visit each cell,
and reduces memory access and storage cost.
[Figure: a 3-D array with dimensions A (a0 to a3), B (b0 to b3), and C (c0 to c3), partitioned into 64 chunks numbered 1 to 64; chunks 1 to 4 run along A at b0, c0]
What is the best traversing order to do multi-way aggregation?
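As a minimal sketch of simultaneous aggregation, assuming a small dense array held entirely in memory, the following Python scans a 3-D array chunk by chunk and updates the AB, AC, and BC aggregates in the same pass. The function name `multiway_aggregate`, the chunk size, and the toy data are hypothetical, and the sketch does not model the memory-minimizing choice of traversal order.

```python
from itertools import product


def multiway_aggregate(cube, chunk):
    """Scan a dense 3-D array chunk by chunk, updating AB, AC and BC together."""
    na, nb, nc = len(cube), len(cube[0]), len(cube[0][0])
    ab = [[0] * nb for _ in range(na)]  # aggregated over C
    ac = [[0] * nc for _ in range(na)]  # aggregated over B
    bc = [[0] * nc for _ in range(nb)]  # aggregated over A
    # Chunk order: A varies fastest, then B, then C (an assumed order).
    starts = product(range(0, nc, chunk), range(0, nb, chunk), range(0, na, chunk))
    for c0, b0, a0 in starts:
        for a in range(a0, min(a0 + chunk, na)):
            for b in range(b0, min(b0 + chunk, nb)):
                for c in range(c0, min(c0 + chunk, nc)):
                    v = cube[a][b][c]
                    # Each cell is read once and feeds all three 2-D cuboids.
                    ab[a][b] += v
                    ac[a][c] += v
                    bc[b][c] += v
    return ab, ac, bc


# Hypothetical 4 x 4 x 4 cube with value a + 10*b + 100*c in each cell.
cube = [[[a + 10 * b + 100 * c for c in range(4)] for b in range(4)] for a in range(4)]
ab, ac, bc = multiway_aggregate(cube, chunk=2)
```

Every cell is visited exactly once, which is the point of the "multiway" scan: no 2-D cuboid requires its own pass over the base array.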
Multi-way Array Aggregation for Cube
Computation
[Figure: the same 3-D array of 64 chunks over dimensions A, B, and C]
Iceberg pruning
If a partition does not satisfy min_sup, its descendants can be pruned
If min_sup = 1 ⇒ compute full CUBE!
No simultaneous aggregation
[Figure: BUC processing order over the cuboid lattice: 1 all; 2 A, 10 B, 14 C, 16 D; 4 ABC, 6 ABD, 8 ACD, 12 BCD; 5 ABCD]
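The pruning rule can be sketched in a few lines of Python. This is a simplified illustration, not the published BUC implementation: it partitions in memory with a dictionary, uses count as the only measure, and the function name `buc` and the toy rows are hypothetical.

```python
def buc(rows, n_dims, min_sup, cell=None, start=0, out=None):
    """Recursively partition rows; emit only cells whose count >= min_sup."""
    if cell is None:
        cell = ["*"] * n_dims
    if out is None:
        out = {}
    if len(rows) < min_sup:
        return out
    out[tuple(cell)] = len(rows)
    for d in range(start, n_dims):
        # Partition the current rows on dimension d.
        parts = {}
        for r in rows:
            parts.setdefault(r[d], []).append(r)
        for value in sorted(parts):
            part = parts[value]
            if len(part) >= min_sup:  # iceberg pruning: skip small partitions
                cell[d] = value
                buc(part, n_dims, min_sup, cell, d + 1, out)
                cell[d] = "*"
    return out


# Hypothetical three-dimensional rows.
rows = [
    ("a1", "b1", "c1"),
    ("a1", "b2", "c1"),
    ("a1", "b2", "c1"),
    ("a2", "b1", "c1"),
    ("a2", "b1", "c1"),
]
iceberg = buc(rows, n_dims=3, min_sup=2)
full = buc(rows, n_dims=3, min_sup=1)
```

With `min_sup=2` the partition (a1, b1) holds a single row, so neither it nor any cell below it is emitted; with `min_sup=1` nothing is pruned and the full cube is computed.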
BUC: Partitioning
Usually, entire data set
can’t fit in main memory
Sort distinct values, partition into blocks
that fit
Continue processing
Optimizations
Partitioning
External Sorting, Hashing, Counting Sort
Ordering dimensions to encourage pruning
Cardinality, Skew, Correlation
Efficient Computation of Data
Cubes
General heuristics
Multi-way array aggregation
BUC
H-cubing
Star-Cubing
High-Dimensional OLAP
Bottom-up computation
Exploring an H-tree structure
If the current computation of an H-tree cannot pass min_sup, do not proceed further (pruning)
No simultaneous aggregation
[Figure: cuboid lattice with levels A, B, C, D; AB, AC, AD, BC, BD, CD; ABC, ABD, ACD, BCD; ABCD]
H-tree: A Prefix Hyper-tree
Base table:
Month  City  Cust_grp  Prod     Cost  Price
Jan    Tor   Edu       Printer  500   485
Jan    Tor   Hhd       TV       800   1200
Jan    Tor   Edu       Camera   1160  1280
Feb    Mon   Bus       Laptop   1500  2500
Mar    Van   Edu       HD       540   520
…      …     …         …        …     …

Header table:
Attr. Val.  Quant-Info  Side-link
Edu         Sum:2285    …
Hhd         …
Bus         …
…           …
Jan         …
Feb         …
…           …
Tor         …
Van         …
Mon         …
…           …

H-tree (by level):
root
edu, hhd, bus
Jan, Mar, Jan, Feb
Tor, Van, Tor, Mon (each leaf carries Q.I.)
Quant-Info of one leaf: Sum: 1765, Cnt: 2, bins
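A minimal sketch of building such a prefix tree from the five base-table rows shown, assuming the attribute order cust_grp, month, city and the price as the measure. The class and field names (`Node`, `total`, `links`) are hypothetical, the header table is keyed by attribute value alone, and the "bins" part of the quant-info is omitted.

```python
class Node:
    def __init__(self, value):
        self.value = value
        self.children = {}
        self.total = 0  # quant-info: sum of price (kept at leaves here)
        self.count = 0  # quant-info: number of tuples


def build_htree(rows):
    """Insert each tuple as a path root -> cust_grp -> month -> city."""
    root = Node("root")
    header = {}  # attribute value -> aggregated sum/count and side-links
    for cust_grp, month, city, price in rows:
        node = root
        for value in (cust_grp, month, city):
            entry = header.setdefault(value, {"sum": 0, "count": 0, "links": []})
            entry["sum"] += price
            entry["count"] += 1
            if value not in node.children:
                node.children[value] = Node(value)
                # Side-link: the header entry points at every node with this value.
                entry["links"].append(node.children[value])
            node = node.children[value]
        node.total += price
        node.count += 1
    return root, header


rows = [
    ("Edu", "Jan", "Tor", 485),
    ("Hhd", "Jan", "Tor", 1200),
    ("Edu", "Jan", "Tor", 1280),
    ("Bus", "Feb", "Mon", 2500),
    ("Edu", "Mar", "Van", 520),
]
root, header = build_htree(rows)
leaf = root.children["Edu"].children["Jan"].children["Tor"]
```

On these rows the header entry for Edu sums to 2285 and the Edu, Jan, Tor leaf holds sum 1765 with count 2, matching the values visible in the figure.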
Computing Cells Involving “City”
From (*, *, Tor) to (*, Jan, Tor)
[Figure: the H-tree and header table of the previous slide (Edu Sum:2285, …; leaf Quant-Info Sum: 1765, Cnt: 2, bins), together with a second header table H_Tor listing Jan, Feb, …]
Computing Cells Involving Month But No
City
2. Compute cells involving month but no city
Top-k OK mark: if Q.I. in a child passes top-k avg threshold, so do its parents. No binning is needed!
[Figure: the H-tree (root; Edu., Hhd., Bus.; Jan., Mar., Jan., Feb.; Tor., Van., Tor., Mont.) with Q.I. shown at the month level, beside the header table (Edu. Sum:2285, …)]
Computing Cells Involving Only
Cust_grp
Intuition: If a single-dimensional aggregate on an attribute value p does not satisfy the iceberg condition, it is useless to distinguish such values during the iceberg computation
Base table (rows shown):
A   B   C   D   Count
a1  b1  c1  d1  1
a1  b1  c4  d3  1
a1  b2  c2  d2  1
…
Compressed table (rows shown):
…
a1  *   *   *   1
a2  *   c3  d4  2
[Figure: Star-Cubing lattice with shared dimensions (all; A/A, B/B, C/C, D/D; AB/AB, AC/AC, AD/A, BC/BC, BD/B, CD; ABC/ABC, ABD/AB, ACD/A, BCD) and the base star tree root: 5 with children a1: 3 and a2: 2, then b*: 1, b1: 2, b*: 2]
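The compression step can be sketched as follows, assuming count >= 2 as the iceberg condition. The first three rows are those visible in the table; the last two rows are an assumption, chosen so that the result agrees with the compressed rows shown. The function name `star_reduce` is hypothetical.

```python
from collections import Counter


def star_reduce(rows, min_sup):
    """Replace values whose one-dimensional count is below min_sup with '*',
    then merge rows that have become identical."""
    n_dims = len(rows[0])
    # One-dimensional aggregates: how often each value occurs per dimension.
    freq = [Counter(r[d] for r in rows) for d in range(n_dims)]
    reduced = Counter(
        tuple(v if freq[d][v] >= min_sup else "*" for d, v in enumerate(r))
        for r in rows
    )
    return dict(reduced)


rows = [
    ("a1", "b1", "c1", "d1"),
    ("a1", "b1", "c4", "d3"),
    ("a1", "b2", "c2", "d2"),
    ("a2", "b3", "c3", "d4"),  # assumed row
    ("a2", "b4", "c3", "d4"),  # assumed row
]
compressed = star_reduce(rows, min_sup=2)
```

Five base rows collapse into three compressed rows, because values such as c1, c4, and b2 each occur only once and can never contribute to a cell that satisfies the condition.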
tid  A   B   C   D   E
1    a1  b1  c1  d1  e1
2    a1  b2  c1  d2  e1
3    a1  b2  c1  d1  e2
4    a2  b1  c1  d1  e2
5    a2  b1  c1  d1  e3
ABC
a2 b2: {4, 5} ∩ {2, 3} = ∅, size 0
fragments (P1,…,Pk).
bottom-up fashion.
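The tuple-ID intersection for the cell a2 b2 can be reproduced with a small inverted index over the five tuples above. This is a sketch only; the names `build_index` and `cell_tids` are hypothetical, and the index is keyed by value alone because every value name in this table is unique to its dimension.

```python
table = {
    1: ("a1", "b1", "c1", "d1", "e1"),
    2: ("a1", "b2", "c1", "d2", "e1"),
    3: ("a1", "b2", "c1", "d1", "e2"),
    4: ("a2", "b1", "c1", "d1", "e2"),
    5: ("a2", "b1", "c1", "d1", "e3"),
}


def build_index(table):
    """Map every attribute value to the set of tuple IDs that contain it."""
    index = {}
    for tid, row in table.items():
        for value in row:
            index.setdefault(value, set()).add(tid)
    return index


def cell_tids(index, *values):
    """Tuple-ID list of a cell: intersect the lists of its instantiated values."""
    lists = [index.get(v, set()) for v in values]
    return set.intersection(*lists)


index = build_index(table)
empty_cell = cell_tids(index, "a2", "b2")
other_cell = cell_tids(index, "a1", "b2")
```

The size of the resulting set is the count of the cell, so a2 b2 has count 0 while a1 b2 has count 2.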
Frag-Shells (2)
Dimensions: A B C D E F …
ABC Cube, DEF Cube
Cuboids of the DEF Cube shown: D Cuboid, EF Cuboid, DE Cuboid
DE Cuboid:
Cell   Tuple-ID List
d1 e1  {1, 3, 8, 9}
d1 e2  {2, 4, 6, 7}
d2 e1  {5, 10}
…      …
A query has the general form (a1, a2, …, an : M)
Each ai has 3 possible values
1. Instantiated value
2. Aggregate * function
3. Inquire ? function
[Figure: dimensions A B C D E F G H I J K L M N …, an Instantiated Base Table, and an Online Cube]
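The three kinds of values can be illustrated with a small evaluator. This sketch answers a query by scanning a toy base table directly rather than by using precomputed fragments, takes count as the measure M, and uses a hypothetical function name `answer`.

```python
rows = [
    ("a1", "b1", "c1", "d1", "e1"),
    ("a1", "b2", "c1", "d2", "e1"),
    ("a1", "b2", "c1", "d1", "e2"),
    ("a2", "b1", "c1", "d1", "e2"),
    ("a2", "b1", "c1", "d1", "e3"),
]


def answer(rows, query):
    """Evaluate (a1, ..., an : count).

    Each query position is an instantiated value (rows must match it),
    '*' (aggregate the dimension away), or '?' (group by the dimension).
    """
    result = {}
    for row in rows:
        if all(q in ("*", "?") or q == v for q, v in zip(query, row)):
            key = tuple(v for q, v in zip(query, row) if q == "?")
            result[key] = result.get(key, 0) + 1
    return result


by_b = answer(rows, ("a1", "?", "*", "*", "*"))
point = answer(rows, ("a2", "b1", "c1", "d1", "*"))
```

The first query instantiates A, inquires on B, and aggregates the rest, so it returns one count per B value; the second has no inquired dimension and returns a single cell.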
Hypothesis-driven
exploration by user, huge search space
Discovery-driven (Sarawagi, et al.’98)
Effective navigation of large OLAP data cubes
pre-compute measures indicating exceptions,
guide user in the data analysis, at all levels of
aggregation
Exception: significantly different from the value
anticipated, based on a statistical model
Visual cues such as background color are used
to reflect the degree of exception of each cell
Kinds of Exceptions and their
Computation
Parameters
SelfExp: surprise of cell relative to other cells at
same level of aggregation
InExp: surprise beneath the cell
PathExp: surprise beneath cell for each drill-
down path
Computation of exception indicator (model
fitting and computing SelfExp, InExp, and PathExp
values) can be overlapped with cube construction
Exceptions themselves can be stored, indexed and
retrieved like precomputed aggregates
Examples: Discovery-Driven Data
Cubes
collection is expensive)
Many statistical tools available, to determine
validity
Confidence intervals
Hypothesis tests
is algebraic
where both s and l (count) are algebraic
Thus one can calculate cells efficiently at more general
cuboids without having to start at the base cuboid each
time
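The idea of computing a more general cell from its child cells can be sketched with a mean and sample standard deviation derived from mergeable summaries. The choice of summary (count, sum, sum of squares), the function names, and the sample values are assumptions made for illustration; they are not taken from the slide's own formula.

```python
from math import sqrt


def summarize(values):
    """Distributive summary of one cell: count, sum, sum of squares."""
    return (len(values), sum(values), sum(v * v for v in values))


def merge(a, b):
    """Summary of a parent cell from two child summaries, without the raw data."""
    return tuple(x + y for x, y in zip(a, b))


def mean_and_std(stats):
    """Mean and sample standard deviation (requires count >= 2)."""
    n, total, squares = stats
    mean = total / n
    variance = (squares - n * mean * mean) / (n - 1)
    return mean, sqrt(max(variance, 0.0))


# Two hypothetical child cells and the parent cell that covers both.
child_1 = summarize([2, 4])
child_2 = summarize([6, 8])
parent = merge(child_1, child_2)
mean, std = mean_and_std(parent)
```

Because the three summary components simply add up, the parent cell never needs to revisit the base cuboid.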
population?
Two-sample t-test (confidence-based)
Example:
cuboid
Low CSD indicates high correlation with cube
BUC
H-cubing
Star-cubing
[Figure: chunk list (a0b1c0 to c3, a0b2c0 to c3, a0b3c0 to c3, …) beside the BC plane]
BC plane:
    c0  c1  c2  c3
b0  x   x   x   x
b1
b2
b3
a0b1 chunk
Done with a0b0
AB plane: (a0, b1) = yyyy
AC plane:
    c0  c1  c2  c3
a0  xy  xy  xy  xy
BC plane:
    c0  c1  c2  c3
b0  x   x   x   x
b1  y   y   y   y
b2
b3
[Figure: chunk list (a0b0c0 to c3, a0b1c0 to c3, a0b2c0 to c3, a0b3c0 to c3, …)]
a0b2 chunk
Done with a0b1
AB plane: (a0, b2) = zzzz
AC plane:
    c0   c1   c2   c3
a0  xyz  xyz  xyz  xyz
BC plane:
    c0  c1  c2  c3
b0  x   x   x   x
b1  y   y   y   y
b2  z   z   z   z
b3
[Figure: chunk list (a0b0c0 to c3, a0b1c0 to c3, a0b2c0 to c3, a0b3c0 to c3, …)]
Table Visualization
Done with a0b2
AB plane: (a0, b3) = uuuu
AC plane:
    c0    c1    c2    c3
a0  xyzu  xyzu  xyzu  xyzu
BC plane:
    c0  c1  c2  c3
b0  x   x   x   x
b1  y   y   y   y
b2  z   z   z   z
b3  u   u   u   u
[Figure: chunk list (a0b0c0 to c3, a0b1c0 to c3, a0b2c0 to c3, a0b3c0 to c3)]

BC plane at the following step (chunk list a1b1c0 to c3, a1b2c0 to c3, a1b3c0 to c3, …):
    c0  c1  c2  c3
b0  xx  xx  xx  xx
b1  y   y   y   y
b2  z   z   z   z
b3  u   u   u   u
a3b3 chunk (last one)
…
Done with a0b3
Done with a0c*
Done with b*c*
AB plane (column labeled b0 in the slide): a3 = uuuu
AC plane:
    c0    c1    c2    c3
a3  xyzu  xyzu  xyzu  xyzu
BC plane:
    c0    c1    c2    c3
b0  xxxx  xxxx  xxxx  xxxx
b1  yyyy  yyyy  yyyy  yyyy
b2  zzzz  zzzz  zzzz  zzzz
b3  uuuu  uuuu  uuuu  uuuu
[Figure: chunk list (a3b0c0 to c3, a3b1c0 to c3, a3b2c0 to c3, a3b3c0 to c3)]
Finish
H-table: a1 3, a2 1, b1 3, b2 1, c1 2, c2 1, c3 1
Tree (by level):
root: 4
a1: 3, a2: 1
b1: 2, b2: 1, b1: 1
c1: 1, c2: 1, c3: 1, c1: 1
Output: (*,*,*) : 4

H-Tree: 1.1
condition: ??c1
H-table: a1 1, a2 1, b1 2, b2 1
Tree (by level):
root: 4
a1: 1, a2: 1
b1: 1, b1: 1
Output: (*,*,*) : 4; (*,*,c1): 2

H-Tree: 1.1.1
condition: ?b1c1
H-table: a1 1, a2 1
Tree (by level):
root: 4
a1: 1, a2: 1
Output: (*,*,*) : 4; (*,*,c1): 2; (*,b1,c1): 2

H-Tree: 1.1.2
condition: ??c1
H-table: a1 1, a2 1
Tree (by level):
root: 4
a1: 1, a2: 1
Output: (*,*,*) : 4; (*,*,c1): 2; (*,b1,c1): 2

H-Tree: 1.2
condition: ???
H-table: a1 3, a2 1, b1 3, b2 1
Tree (by level):
root: 4
a1: 3, a2: 1
b1: 2, b2: 1, b1: 1
Output: (*,*,*) : 4; (*,*,c1): 2; (*,b1,c1): 2

H-Tree: 1.2.1
condition: ???
H-table: a1 2, a2 1
Tree (by level):
root: 4
a1: 2, a2: 1
Output: (*,*,*) : 4; (*,*,c1): 2; (*,b1,c1): 2; (*,b1,*): 3; (a1,b1,*): 2

H-Tree: 1.2.2
condition: ???
H-table: a1 3, a2 1
Tree (by level):
root: 4
a1: 3, a2: 1
Output: (*,*,*) : 4; (*,*,c1): 2; (*,b1,c1): 2; (*,b1,*): 3; (a1,b1,*): 2; (a1,*,*): 3
Finish
[Figure sequence: Star-Cubing walkthrough on the base star tree root: 5 with a1: 3 and a2: 2 at the first level and b*: 1, b1: 2, b*: 2 at the second. Child trees and output cells are listed in order of appearance.]
1. Start: child tree BCD:5 is created. Output: NULL
2. a1: output (a1,*,*) : 3; child tree a1CD/a1:3 is created
3. a1b*: child tree a1b*D/a1b*:1 is created
4. a1b*c*: child tree a1b*c*/a1b*c*:1 is created; d*: 1 is reached
5. Backtrack: a1b*c*/a1b*c*:1 is no longer shown. a1b*D/a1b*:1: mine this subtree, but nothing to do; remove
6. a1b1: output (a1,b1,*): 2; child tree a1b1D/a1b1:2 is created
7. a1b1c*: child tree a1b1c*/a1b1c*:2 is created; d* is reached
8. Backtrack: a1b1c*/a1b1c*:2 is no longer shown. a1b1D/a1b1:2: mine this subtree, but nothing to do (all interior nodes *); remove
9. a2: output (a2,*,*): 2; child tree a2CD/a2:2 is shown in place of a1CD/a1:3
10. a2b*: child tree a2b*D/a2b*:2 is created
11. a2b*c3: child tree a2b*c3/a2b*c3:2 is created; d4: 2 is reached
12. Backtrack: a2b*c3/a2b*c3:2: mine subtree, nothing to do; remove. a2b*D/a2b*:2: mine subtree, nothing to do; remove
13. a2CD/a2:2 (c3: 2, d4: 2): mine subtree (AC/AC, AD/A); child tree a2D/a2:2 is created; output (a2,*,c3): 2; child tree a2c3/a2c3:2 is created; output (a2,c3,d4): 2; remove
Same as before: as we backtrack, recursively mine child trees
14. BCD:5 (b*: 3, b1: 2; c*: 1, c3: 2, c*: 2; d*: 1, d4: 2, d*: 2): mine subtree (BC/BC, BD/B, CD); remove
BCD tree patterns
Output: (a1,*,*) : 3; (a1,b1,*): 2; (a2,*,*): 2; (a2,*,c3): 2; (a2,c3,d4): 2
Month  City  Cust_grp  Prod     Cost  Price
Jan    Tor   Edu       Printer  500   485
Jan    Tor   Hld       TV       800   1200
Jan    Tor   Edu       Camera   1160  1280
Feb    Mon   Bus       Laptop   1500  2500
Mar    Van   Edu       HD       540   520
…      …     …         …        …     …

CREATE CUBE Sales_Iceberg AS
SELECT month, city, cust_grp, AVG(price), COUNT(*)
FROM Sales_Infor
CUBEBY month, city, cust_grp
HAVING AVG(price) >= 800 AND COUNT(*) >= 50
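What the statement asks for can be sketched by brute force in Python over the five rows shown. The count threshold is lowered from 50 to 2 so that the toy data produces output; that threshold, the function name `iceberg_cube`, and the exhaustive enumeration of group-bys are assumptions for illustration, not an efficient method.

```python
from itertools import product

sales = [
    ("Jan", "Tor", "Edu", 485),
    ("Jan", "Tor", "Hld", 1200),
    ("Jan", "Tor", "Edu", 1280),
    ("Feb", "Mon", "Bus", 2500),
    ("Mar", "Van", "Edu", 520),
]


def iceberg_cube(rows, min_avg, min_count):
    """Group by every subset of (month, city, cust_grp), then apply HAVING."""
    groups = {}
    for *dims, price in rows:
        # Each mask keeps some dimensions and replaces the others with '*'.
        for mask in product((True, False), repeat=len(dims)):
            cell = tuple(v if keep else "*" for v, keep in zip(dims, mask))
            groups.setdefault(cell, []).append(price)
    return {
        cell: (sum(prices) / len(prices), len(prices))
        for cell, prices in groups.items()
        if len(prices) >= min_count and sum(prices) / len(prices) >= min_avg
    }


cube = iceberg_cube(sales, min_avg=800, min_count=2)
```

On this data the cell (*, *, Edu) fails the average condition while its descendant (Jan, *, Edu) passes it, which shows why AVG cannot be used directly for pruning.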
From Average to Top-k Average
Let (*, Van, *) cover 1,000 records
Avg(price) is the average price of those 1000
sales
Avg50(price) is the average price of the top-50
sales (top-50 according to the sales price)
Top-k average is anti-monotonic
If the top 50 sales in Van. have avg(price) <= 800, then the top 50 deals in Van. during Feb. must also have avg(price) <= 800
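The contrast between the plain average and the top-k average can be checked on a small example. The prices below are hypothetical, k is set to 2 instead of 50, and `top_k_avg` is an illustrative helper that returns None when a cell has fewer than k records.

```python
def top_k_avg(prices, k):
    """Average of the k highest prices, or None if there are fewer than k."""
    if len(prices) < k:
        return None
    top = sorted(prices, reverse=True)[:k]
    return sum(top) / k


def avg(prices):
    return sum(prices) / len(prices)


# Hypothetical sales in Van., and the subset of them made during Feb.
van = [1200, 900, 850, 400, 300, 200]
van_feb = [900, 850, 300]

van_top2 = top_k_avg(van, 2)
van_feb_top2 = top_k_avg(van_feb, 2)
```

The top-2 average of the subset cannot exceed that of the superset, since the subset's best two prices are drawn from the same pool. The plain average gives no such guarantee: here the Feb. subset averages higher than all of Van.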
Month  City  Cust_grp  Prod  Cost  Price
…      …     …         …     …     …
…      …     …         …     …     …

From weakest to strongest:
Approximate avg50(): anti-monotonic, can be computed efficiently
Real avg50(): anti-monotonic, but computationally costly
avg(): not anti-monotonic
Computing Iceberg Cubes with
Other Complex Measures
Cube-Gradient Analysis
Dimensions: Yr, City, Cst_grp, Prd_grp. Measures: Cnt, Avg_price.
cid  Yr  City  Cst_grp  Prd_grp  Cnt    Avg_price  Role
c1   00  Van   Busi     PC       300    2100       Base cell
c2   *   Van   Busi     PC       2800   1800       Aggregated cell
c3   *   Tor   Busi     PC       7900   2350       Sibling
c4   *   *     Busi     PC       58600  2250       Ancestor
Efficient Computing Cube-
gradients