
Data Warehousing and OLAP

What is a Data Warehouse?
• “A data warehouse is a subject-oriented,
integrated, time-variant, and nonvolatile
collection of data in support of management’s
decision-making process.” --- W. H. Inmon
• Collection of data that is used primarily in
organizational decision making
• A decision support database that is maintained
separately from the organization’s operational
database

2
Data Warehouse - Subject Oriented
• Subject oriented: oriented to the major
subject areas of the corporation that have
been defined in the data model.
• E.g. for an insurance company: customer,
product, transaction or activity, policy, claim,
account, etc.
• Operational DB and applications may be
organized differently
• E.g. based on type of insurance: auto, life,
medical, fire, ...
3
Data Warehouse – Integrated
• There is no consistency in encoding,
naming conventions, …, among different
data sources
• Heterogeneous data sources
• When data is moved to the warehouse, it is
converted.
4
Data Warehouse - Non-Volatile
• Operational data is regularly accessed and
manipulated a record at a time, and updates
are applied to data in the operational
environment.
• Warehouse data is loaded and accessed;
updates of data do not occur in the data
warehouse environment.

5
Data Warehouse - Time Variance
• The time horizon for the data warehouse is
significantly longer than that of operational
systems.
• Operational database: current value data.
• Data warehouse data: nothing more than a
sophisticated series of snapshots, taken at some
moment in time.
• The key structure of operational data may or may
not contain some element of time. The key
structure of the data warehouse always contains
some element of time.
6
Why Separate Data Warehouse?
• Performance
• special data organization, access methods, and
implementation methods are needed to support
multidimensional views and operations typical
of OLAP
• Complex OLAP queries would degrade
performance for operational transactions
• Concurrency control and recovery modes of
OLTP are not compatible with OLAP analysis

7
Why Separate Data Warehouse?
• Function
• missing data: Decision support requires historical
data which operational DBs do not typically maintain
• data consolidation: DS requires consolidation
(aggregation, summarization) of data from
heterogeneous sources: operational DBs, external
sources
• data quality: different sources typically use
inconsistent data representations, codes and formats
which have to be reconciled.

8
Advantages of Warehousing
• High query performance
• Queries not visible outside warehouse
• Local processing at sources unaffected
• Can operate when sources unavailable
• Can query data not stored in a DBMS
• Extra information at warehouse
– Modify, summarize (store aggregates)
– Add historical information

9
Advantages of Mediator Systems
• No need to copy data
– less storage
– no need to purchase data
• More up-to-date data
• Query needs can be unknown
• Only query interface needed at sources
• May be less draining on sources

10
The Architecture of Data Warehousing

[Diagram] Operational databases and external data
sources feed an Extract / Transform / Load / Refresh
stage into the data warehouse, alongside a metadata
repository. The warehouse serves data marts and an
OLAP server, which in turn support reports, data
mining, and OLAP front ends.
11
Data Sources
• Data sources are often the operational systems,
providing the lowest level of data.
• Data sources are designed for operational use, not
for decision support, and the data reflect this fact.
• Multiple data sources are often from different
systems, run on a wide range of hardware and
much of the software is built in-house or highly
customized.
• Multiple data sources introduce a large number of
issues, such as semantic conflicts.
12
Creating and Maintaining a
Warehouse
Data warehouse needs several tools that
automate or support tasks such as:
 Data extraction from different external data
sources, operational databases, files of standard
applications (e.g. Excel, COBOL applications),
and other documents (Word, WWW).
 Data cleaning (finding and resolving
inconsistency in the source data)
 Integration and transformation of data (between
different data formats, languages, etc.)
13
Creating and Maintaining a
Warehouse
 Data loading (loading the data into the data
warehouse)
 Data replication (replicating source database into
the data warehouse)
 Data refreshment
 Data archiving
 Checking for data quality
 Analyzing metadata

14
Physical Structure of Data
Warehouse
There are three basic architectures for
constructing a data warehouse:
• Centralized
• Federated
• Tiered
The data warehouse is distributed for: load
balancing, scalability and higher availability

15
Physical Structure of Data
Warehouse
[Diagram] Centralized architecture: clients query a
single central data warehouse, which is loaded from
the sources.
16
Physical Structure of Data
Warehouse
[Diagram] Federated architecture: end users access
local data marts (e.g. marketing, financial,
distribution), coordinated through a logical data
warehouse that is fed by the sources.

Federated architecture
17
Physical Structure of Data
Warehouse
[Diagram] Tiered architecture: workstations hold
highly summarized data, local data marts hold
summaries, and the physical data warehouse at the
bottom tier is loaded from the sources.

Tiered architecture
18
Physical Structure of Data
Warehouse
• Federated architecture
– The logical data warehouse is only virtual

• Tiered architecture
– The central data warehouse is physical
– There exist local data marts on different tiers
which store copies or summarizations of the
previous tier.
19
Conceptual Modeling of
Data Warehouses
Three basic conceptual schemas:

• Star schema
• Snowflake schema
• Fact constellations

20
Star schema

Star schema: A single object (fact table) in
the middle connected to a number of
dimension tables
21
Star schema

[Diagram] Fact table sale(orderId, date, custId,
prodId, storeId, qty, amt) in the middle, connected
to the dimension tables product(prodId, name,
price), customer(custId, name, address, city), and
store(storeId, city).
22
Star schema

product: prodId name price        store: storeId city
         p1     bolt 10                  c1      nyc
         p2     nut  5                   c2      sfo
                                         c3      la

sale: orderId date   custId prodId storeId qty amt
      o100    1/7/97 53     p1     c1      1   12
      o102    2/7/97 53     p2     c1      2   11
      o105    3/8/97 111    p1     c3      5   50

customer: custId name  address   city
          53     joe   10 main   sfo
          81     fred  12 main   sfo
          111    sally 80 willow la
23
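The example schema above can be sketched in SQLite; this is a minimal illustration, with table and column names taken from the slides and the slides' sample rows as data:

```python
import sqlite3

# Build the star schema from the slides in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE product (prodId TEXT PRIMARY KEY, name TEXT, price INTEGER);
CREATE TABLE store   (storeId TEXT PRIMARY KEY, city TEXT);
CREATE TABLE customer(custId INTEGER PRIMARY KEY, name TEXT, address TEXT, city TEXT);
CREATE TABLE sale    (orderId TEXT PRIMARY KEY, date TEXT,
                      custId INTEGER REFERENCES customer,
                      prodId TEXT REFERENCES product,
                      storeId TEXT REFERENCES store,
                      qty INTEGER, amt INTEGER);
""")
cur.executemany("INSERT INTO product VALUES (?,?,?)",
                [("p1", "bolt", 10), ("p2", "nut", 5)])
cur.executemany("INSERT INTO store VALUES (?,?)",
                [("c1", "nyc"), ("c2", "sfo"), ("c3", "la")])
cur.executemany("INSERT INTO customer VALUES (?,?,?,?)",
                [(53, "joe", "10 main", "sfo"), (81, "fred", "12 main", "sfo"),
                 (111, "sally", "80 willow", "la")])
cur.executemany("INSERT INTO sale VALUES (?,?,?,?,?,?,?)",
                [("o100", "1/7/97", 53, "p1", "c1", 1, 12),
                 ("o102", "2/7/97", 53, "p2", "c1", 2, 11),
                 ("o105", "3/8/97", 111, "p1", "c3", 5, 50)])

# A typical star-join: total sales amount per store city.
rows = cur.execute("""
    SELECT s.city, SUM(f.amt)
    FROM sale f JOIN store s ON f.storeId = s.storeId
    GROUP BY s.city ORDER BY s.city
""").fetchall()
print(rows)  # [('la', 50), ('nyc', 23)]
```

The query joins the fact table with one dimension table and groups on a dimension attribute, which is the shape of most star-schema queries.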
Terms
• Basic notion: a measure (e.g. sales, qty,
etc)
• Given a collection of numeric measures
• Each measure depends on a set of
dimensions (e.g. sales volume as a function of
product, time, and location)

24
Terms
• The relation that relates the dimensions to
the measure of interest is called the fact
table (e.g. sale)
• Information about dimensions can be
represented as a collection of relations –
called the dimension tables (product,
customer, store)
• Each dimension can have a set of
associated attributes
25
Example of Star Schema

[Diagram] Sales Fact Table with foreign keys Date,
Product, Store, Customer and the measurements
unit_sales, dollar_sales, schilling_sales;
dimension tables: Date(Date, Month, Year),
Product(ProductNo, ProdName, ProdDesc, Category,
QOH), Store(StoreID, City, State, Country, Region),
Customer(CustId, CustName, CustCity, CustCountry).
26
Dimension Hierarchies
• For each dimension, the set of associated
attributes can be structured as a hierarchy

store → sType
store → city → region
customer → city → state → country

27
Dimension Hierarchies

sType: tId size  location
       t1  small downtown
       t2  large suburbs

store: storeId cityId tId mgr
       s5      sfo    t1  joe
       s7      sfo    t2  fred
       s9      la     t1  nancy

city: cityId pop regId
      sfo    1M  north
      la     5M  south

region: regId name
        north cold region
        south warm region
28
Snowflake Schema

Snowflake schema: A refinement of star
schema where the dimensional hierarchy
is represented explicitly by normalizing
the dimension tables

29
Example of Snowflake Schema

[Diagram] Sales Fact Table with keys Date, Product,
Store, Customer and measurements unit_sales,
dollar_sales, schilling_sales. The dimensions are
normalized into hierarchies: Date → Month → Year;
Store(StoreID, City) → City(City, State) →
State(State, Country) → Country(Country, Region);
Product(ProductNo, ProdName, ProdDesc, Category,
QOH); Customer(CustId, CustName, CustCity,
CustCountry).
30
Fact constellations

Fact constellations: Multiple fact tables
share dimension tables

31
Database design methodology for data
warehouses (1)
• Nine-step methodology – proposed by Kimball
Step Activity
1 Choosing the process
2 Choosing the grain
3 Identifying and conforming the dimensions
4 Choosing the facts
5 Storing the precalculations in the fact table
6 Rounding out the dimension tables
7 Choosing the duration of the database
8 Tracking slowly changing dimensions
9 Deciding the query priorities and the query modes

32
Database design methodology for data
warehouses (2)
• There are many approaches that offer alternative routes to
the creation of a data warehouse
• Typical approach – decompose the design of the data
warehouse into manageable parts – data marts. At a later
stage, the integration of the smaller data marts leads to the
creation of the enterprise-wide data warehouse.
• The methodology specifies the steps required for the
design of a data mart; however, the methodology also ties
together separate data marts so that over time they merge
into a coherent overall data warehouse.

33
Step 1: Choosing the process

• The process (function) refers to the subject matter
of a particular data mart. The first data mart to be
built should be the one that is most likely to be
delivered on time, within budget, and to answer
the most commercially important business
questions.
• The best choice for the first data mart tends to be
the one that is related to ‘sales’

34
Step 2: Choosing the grain
• Choosing the grain means deciding exactly what a fact
table record represents. For example, the entity ‘Sales’
may represent the facts about each property sale.
Therefore, the grain of the ‘Property_Sales’ fact table is
individual property sale.
• Only when the grain for the fact table is chosen can we
identify the dimensions of the fact table.
• The grain decision for the fact table also determines the
grain of each of the dimension tables. For example, if the
grain for the ‘Property_Sales’ is an individual property
sale, then the grain of the ‘Client’ dimension is the detail
of the client who bought a particular property.

35
Step 3: Identifying and conforming the
dimensions
• Dimensions set the context for formulating queries about
the facts in the fact table.
• We identify dimensions in sufficient detail to describe
things such as clients and properties at the correct grain.
• If any dimension occurs in two data marts, they must be
exactly the same dimension, or one must be a subset of
the other (this is the only way that two DMs can share one
or more dimensions in the same application).
• When a dimension is used in more than one DM, the
dimension is referred to as being conformed.

36
Step 4: Choosing the facts

• The grain of the fact table determines which facts can be
used in the data mart – all facts must be expressed at the
level implied by the grain.
• In other words, if the grain of the fact table is an individual
property sale, then all the numerical facts must refer to this
particular sale (the facts should be numeric and additive).

37
Step 5: Storing pre-calculations in the
fact table
• Once the facts have been selected, each should be
re-examined to determine whether there are opportunities to
use pre-calculations.
• Common example: a profit or loss statement
• These types of facts are useful since they are additive
quantities, from which we can derive valuable information.
• This is particularly true for a value that is fundamental to
an enterprise, or if there is any chance of a user calculating
the value incorrectly.

38
Step 6: Rounding out the dimension
tables
• In this step we return to the dimension tables and add as
many text descriptions to the dimensions as possible.
• The text descriptions should be as intuitive and
understandable to the users as possible

39
Step 7: Choosing the duration of the
data warehouse
• The duration measures how far back in time the fact table
goes.
• For some companies (e.g. insurance companies) there may
be a legal requirement to retain data extending back five or
more years.
• Very large fact tables raise at least two very significant
data warehouse design issues:
– The older data, the more likely there will be problems in reading
and interpreting the old files
– It is mandatory that the old versions of the important dimensions
be used, not the most current versions (we will discuss this issue
later on)

40
Step 8: Tracking slowly changing
dimensions
• The changing dimension problem means that the proper
description of the old client and the old branch must be
used with the old data warehouse schema
• Usually, the data warehouse must assign a generalized key
to these important dimensions in order to distinguish
multiple snapshots of clients and branches over a period of
time
• There are different types of changes in dimensions:
– A dimension attribute is overwritten
– A dimension attribute causes a new dimension record to be created
– etc.

41
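The two change types above can be sketched as follows. This is only an illustration: the row layout, the `clients` table, and the surrogate-key counter are invented here, not a prescribed design:

```python
# Dimension rows keyed by a generated (surrogate) key, so several
# snapshots of the same client can coexist.
clients = [
    {"skey": 1, "clientId": "c1", "city": "Poznan", "current": True},
]
next_skey = 2

def change_type1(client_id, new_city):
    """Type 1: overwrite the attribute in place; history is lost."""
    for row in clients:
        if row["clientId"] == client_id and row["current"]:
            row["city"] = new_city

def change_type2(client_id, new_city):
    """Type 2: close the old row and add a new one under a new
    surrogate key, so old facts keep pointing at the old snapshot."""
    global next_skey
    for row in clients:
        if row["clientId"] == client_id and row["current"]:
            row["current"] = False
    clients.append({"skey": next_skey, "clientId": client_id,
                    "city": new_city, "current": True})
    next_skey += 1

change_type2("c1", "Warszawa")
print([(r["skey"], r["city"], r["current"]) for r in clients])
# [(1, 'Poznan', False), (2, 'Warszawa', True)]
```

After the type-2 change, facts loaded before the move still join to surrogate key 1 and therefore to the old city, which is exactly the "old description with old data" requirement above.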
Step 9: Deciding the query priorities
and the query modes
• In this step we consider physical design issues.
– The presence of pre-stored summaries and aggregates
– Indices
– Materialized views
– Security issue
– Backup issue
– Archive issue

42
Database design methodology for data
warehouses - summary
• At the end of this methodology, we have a design for a
data mart that supports the requirements of a particular
business process and allows easy integration with
other related data marts to ultimately form the
enterprise-wide data warehouse.
• A dimensional model, which contains more than one fact
table sharing one or more conformed dimension tables, is
referred to as a fact constellation.

43
Multidimensional Data Model

Sales of products may be represented in
one dimension (as a fact relation) or
in two dimensions, e.g.: clients and
products
44
Multidimensional Data Model

Fact relation:                    Two-dimensional cube:

sale: Product Client Amt               c1  c2  c3
      p1      c1     12            p1  12      50
      p2      c1     11            p2  11  8
      p1      c3     50
      p2      c2     8
45
Multidimensional Data Model

Fact relation:                    3-dimensional cube:

sale: Product Client Date Amt
      p1      c1     1    12      day 2:     c1  c2  c3
      p2      c1     1    11             p1  44  4
      p1      c3     1    50             p2
      p2      c2     1    8       day 1:     c1  c2  c3
      p1      c1     2    44             p1  12      50
      p1      c2     2    4              p2  11  8
46
Multidimensional Data Model and Aggregates
• Add up amounts for day 1
• In SQL: SELECT sum(Amt) FROM SALE
  WHERE Date = 1

sale: Product Client Date Amt
      p1      c1     1    12
      p2      c1     1    11      result: 81
      p1      c3     1    50
      p2      c2     1    8
      p1      c1     2    44
      p1      c2     2    4
47
Multidimensional Data Model and Aggregates
• Add up amounts by day
• In SQL: SELECT Date, sum(Amt)
  FROM SALE GROUP BY Date

sale: Product Client Date Amt
      p1      c1     1    12      result: Date sum
      p2      c1     1    11              1    81
      p1      c3     1    50              2    48
      p2      c2     1    8
      p1      c1     2    44
      p1      c2     2    4
48
Multidimensional Data Model and
Aggregates
• Add up amounts by client, product
• In SQL: SELECT client, product, sum(amt)
FROM SALE
GROUP BY client, product

49
Multidimensional Data Model and Aggregates

sale: Product Client Date Amt     sale: Product Client Sum
      p1      c1     1    12            p1      c1     56
      p2      c1     1    11            p1      c2     4
      p1      c3     1    50            p1      c3     50
      p2      c2     1    8             p2      c1     11
      p1      c1     2    44            p2      c2     8
      p1      c2     2    4
50
Multidimensional Data Model and
Aggregates
• In multidimensional data model together
with measure values usually we store
summarizing information (aggregates)

      c1  c2  c3  Sum
p1    56  4   50  110
p2    11  8       19
Sum   67  12  50  129

51
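The summary row and column of the table above can be recomputed with a few lines of Python; the nested dict mirrors the product × client matrix of the slide, with missing cells simply absent:

```python
# Amounts by product (rows p1, p2) and client (columns c1, c2, c3).
cube = {"p1": {"c1": 56, "c2": 4, "c3": 50},
        "p2": {"c1": 11, "c2": 8}}

# Row sums: aggregate each product over all of its clients.
row_sums = {p: sum(v.values()) for p, v in cube.items()}

# Column sums: aggregate each client over all products.
col_sums = {}
for v in cube.values():
    for c, amt in v.items():
        col_sums[c] = col_sums.get(c, 0) + amt

grand = sum(row_sums.values())

print(row_sums)  # {'p1': 110, 'p2': 19}
print(col_sums)  # {'c1': 67, 'c2': 12, 'c3': 50}
print(grand)     # 129
```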
Aggregates
• Operators: sum, count, max, min,
median, avg
• “Having” clause
• Using dimension hierarchy
– average by region (within store)
– maximum by month (within date)

52
Cube Aggregation

Example: computing sums

[Diagram] Starting from the daily slices
(day 1: p1 = 12, 50 and p2 = 11, 8;
day 2: p1 = 44, 4), summing out the Date
dimension gives the product × client table:

      c1  c2  c3
p1    56  4   50
p2    11  8

Summing out clients gives the row totals p1 = 110
and p2 = 19; summing out products gives the column
totals 67, 12, 50; the grand total is 129.
53
Cube Operators

[Diagram] The same aggregations written as cube
operators: sale(c1,*,*) sums over all products and
dates for client c1; sale(c2,p2,*) sums over all
dates for client c2 and product p2; sale(*,*,*) is
the grand total, 129.
54
Cube

[Diagram] The full cube extends each slice with *
(subtotal) rows and columns:

day 1:     c1  c2  c3  *
       p1  12      50  62
       p2  11  8       19
       *   23  8   50  81

day 2:     c1  c2  c3  *
       p1  44  4       48
       p2
       *   44  4       48

*:         c1  c2  c3  *
       p1  56  4   50  110
       p2  11  8       19
       *   67  12  50  129

e.g. sale(*,p2,*) = 19.
55
Aggregation Using Hierarchies

[Diagram] Customer hierarchy: customer → region →
country. Rolling the client dimension of the day-1
slice up to the region level (customer c1 in
region A; customers c2, c3 in region B):

      region A  region B
p1    12        50
p2    11        8
56
Aggregation Using Hierarchies

[Diagram] Hierarchy: client → city → region.
Clients c1, c2 are in New Orleans; clients c3, c4
are in Poznań. Sales by client and product
(Video, Camera, CD), per date of sale:

        Video  Camera  CD
NO  c1  10     3       21
    c2  12     5       9
PN  c3  11     7       7
    c4  12     11      15

Aggregation with respect to city:

        Video  Camera  CD
NO      22     8       30
PN      23     18      22
57
A Sample Data Cube

[Diagram] A 3-D cube with dimensions Date (quarters
1Q–4Q plus sum), Product (camera, video, CD, sum)
and Country (USA, Canada, Mexico, sum).
58
Exercise (1)
• Suppose the AAA Automobile Co. builds a data
warehouse to analyze sales of its cars.
• The measure - price of a car
• We would like to answer the following typical
queries:
– find total sales by day, week, month and year
– find total sales by week, month, ... for each dealer
– find total sales by week, month, ... for each car model
– find total sales by month for all dealers in a given city,
region and state.
59
Exercise (2)
• Dimensions:
– time (day, week, month, quarter, year)
– dealer (name, city, state, region, phone)
– cars (serialno, model, color, category , …)

• Design the conceptual data warehouse schema

60
OLAP Servers
• Relational OLAP (ROLAP):
• Extended relational DBMS that maps
operations on multidimensional data to
standard relational operations
• Store all information, including fact tables, as
relations
• Multidimensional OLAP (MOLAP):
• Special purpose server that directly
implements multidimensional data and
operations
• Store multidimensional datasets as arrays
61
OLAP Servers

• Hybrid OLAP (HOLAP):
• Give users/system administrators freedom to
select different partitions.

62
OLAP Queries
• Roll up: summarize data along a
dimension hierarchy
• if we are given total sales volume per city, we
can aggregate on the Location dimension to obtain
sales per state

63
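A roll-up along a Location hierarchy can be sketched in Python; the city-level figures and the city → state mapping below are invented for illustration:

```python
# Sales per city (the finer level) and the hierarchy mapping
# each city to its state (the coarser level).
sales_by_city = {"sfo": 40, "la": 25, "nyc": 30}
city_to_state = {"sfo": "CA", "la": "CA", "nyc": "NY"}

# Roll up: regroup the finer-level cells under their parent value.
sales_by_state = {}
for city, amt in sales_by_city.items():
    state = city_to_state[city]
    sales_by_state[state] = sales_by_state.get(state, 0) + amt

print(sales_by_state)  # {'CA': 65, 'NY': 30}
```

Drill-down is the inverse direction and needs the finer-grained data to still be available; the state totals alone cannot be split back into cities.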
OLAP Queries

[Diagram] Rolling up the client/city cube from the
earlier slide: sales by client (c1, c2 in New
Orleans; c3, c4 in Poznań) are aggregated to sales
by city:

        Video  Camera  CD
NO      22     8       30
PN      23     18      22
64
OLAP Queries

• Roll down, drill down: go from higher
level summary to lower level summary or
detailed data
• For a particular product category, find the
detailed sales data for each salesperson by
date
• Given total sales by state, we can ask for sales
per city, or just sales by city for a selected state

65
OLAP Queries

[Diagram] Roll-up moves from the daily slices to
the product × client sums, the column sums
(67, 12, 50) and the grand total 129; drill-down
moves in the opposite direction, back to the row
totals (p1 = 110, p2 = 19) and the daily detail.
66
OLAP Queries
• Slice and dice: select and project
• Sales of video in USA over the last 6 months
• Slicing and dicing reduce the number of
dimensions
• Pivot: reorient cube
• The result of pivoting is called a
cross-tabulation
• If we pivot the Sales cube on the Client and
Product dimensions, we obtain a table with a
value for each client and product combination
67
OLAP Queries
• Pivoting can be combined with aggregation

sale: prodId clientId date amt
      p1     c1       1    12
      p2     c1       1    11
      p1     c3       1    50
      p2     c2       1    8
      p1     c1       2    44
      p1     c2       2    4

Date × Client:               Product × Client:
      c1  c2  c3  Sum              c1  c2  c3  Sum
1     23  8   50  81         p1    56  4   50  110
2     44  4       48         p2    11  8       19
Sum   67  12  50  129        Sum   67  12  50  129

68
OLAP Queries
• Ranking: selection of first n elements (e.g. select 5
best purchased products in July)
• Others: stored procedures, selection, etc.
• Time functions
– e.g., time average

69
Implementing a Warehouse
• Designing and rolling out a data warehouse
is a complex process, consisting of the
following activities:
♦ Define the architecture, do capacity planning,
and select the storage servers, database and
OLAP servers (ROLAP vs MOLAP), and tools,
♦ Integrate the servers, storage, and client tools,
♦ Design the warehouse schema and views,

71
Implementing a Warehouse

♦ Define the physical warehouse organization, data
placement, partitioning, and access method,
♦ Connect the sources using gateways, ODBC
drivers, or other wrappers,
♦ Design and implement scripts for data extraction,
cleaning, transformation, load, and refresh,

72
Implementing a Warehouse
♦ Populate the repository with the schema and
view definitions, scripts, and other metadata,
♦ Design and implement end-user applications,
♦ Roll out the warehouse and applications.

73
Implementing a Warehouse
• Monitoring: Sending data from sources
• Integrating: Loading, cleansing,...
• Processing: Query processing, indexing, ...
• Managing: Metadata, Design, ...

74
Monitoring
• Data Extraction
– Data extraction from external sources is usually
implemented via gateways and standard
interfaces (such as Information Builders
EDA/SQL, ODBC, JDBC, Oracle Open
Connect, Sybase Enterprise Connect, Informix
Enterprise Gateway, etc.)

75
Monitoring Techniques
• Detect changes to an information source
that are of interest to the warehouse:
• define triggers in a full-functionality DBMS
• examine the updates in the log file
• write programs for legacy systems
• polling (queries to source)
• screen scraping

• Propagate the change in a generic form to
the integrator

76
Integration
• Integrator
• Receive changes from the monitors
• make the data conform to the conceptual schema used
by the warehouse
• Integrate the changes into the warehouse
• merge the data with existing data already present
• resolve possible update anomalies
• Data Cleaning
• Data Loading

77
Data Cleaning
• Data cleaning is important to warehouse –
there is high probability of errors and
anomalies in the data:
– inconsistent field lengths, inconsistent
descriptions, inconsistent value assignments,
missing entries and violation of integrity
constraints.
– optional fields in data entry are significant
sources of inconsistent data.

78
Data Cleaning Techniques
• Data migration: allows simple data
transformation rules to be specified, e.g.
“replace the string gender by sex” (Warehouse
Manager from Prism is an example of this tool)
• Data scrubbing: uses domain-specific
knowledge to scrub data (e.g. postal addresses)
(Integrity and Trillium fall in this category)
• Data auditing: discovers rules and
relationships by scanning data (detect outliers).
Such tools may be considered as variants of
data mining tools
79
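A toy version of the rule-based cleaning described above can look as follows; the renaming rule is the slide's gender/sex example, while the whitespace trimming, the domain check and the sample record are invented here:

```python
# Declarative cleaning rules, applied record by record.
rename_rules = {"gender": "sex"}   # "replace the string gender by sex"
valid_sex = {"M", "F"}             # assumed domain for the sex field

def clean(record):
    out = {}
    for field, value in record.items():
        field = rename_rules.get(field, field)  # apply renaming rules
        if isinstance(value, str):
            value = value.strip()               # trim stray whitespace
        out[field] = value
    # Domain check: values outside the allowed domain are nulled
    # out and left for manual repair.
    if out.get("sex") not in valid_sex:
        out["sex"] = None
    return out

dirty = {"name": "  joe ", "gender": "male"}
print(clean(dirty))  # {'name': 'joe', 'sex': None}
```

Real migration and scrubbing tools apply far richer rule sets (address parsing, reference data lookups), but the shape is the same: rules in, flagged or repaired records out.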
Data Loading
• After extracting, cleaning and transforming, data
must be loaded into the warehouse.
• Loading the warehouse includes some other
processing tasks: checking integrity constraints,
sorting, summarizing, etc.
• Typically, batch load utilities are used for
loading. A load utility must allow the
administrator to monitor status, to cancel, suspend,
and resume a load, and to restart after failure with
no loss of data integrity

80
Data Loading Issues

• The load utilities for data warehouses have to deal
with very large data volumes
• Sequential loads can take a very long time.
• Full load can be treated as a single long batch
transaction that builds up a new database. Using
checkpoints ensures that if a failure occurs during
the load, the process can restart from the last
checkpoint

81
Data Refresh
• Refreshing a warehouse means propagating
updates on source data to the data stored in the
warehouse
• when to refresh:
• periodically (daily or weekly)
• immediately (deferred refresh and immediate
refresh)
– determined by usage, types of data source, etc.

82
Data Refresh
• how to refresh
– data shipping
– transaction shipping
• Most commercial DBMSs provide replication servers
that support incremental techniques for propagating
updates from a primary database to one or more
replicas. Such replication servers can be used to
incrementally refresh a warehouse when sources
change

83
Data Shipping
• data shipping: (e.g. Oracle Replication Server), a
table in the warehouse is treated as a remote
snapshot of a table in the source database. An
after-row trigger is used to update the snapshot log
table and propagate the updated data to the
warehouse

84
Transaction Shipping

• transaction shipping: (e.g. Sybase Replication
Server, Microsoft SQL Server), the regular
transaction log is used. The transaction log is
checked to detect updates on replicated tables, and
those log records are transferred to a replication
server, which packages up the corresponding
transactions to update the replicas

85
Derived Data
• Derived Warehouse Data
– indexes
– aggregates
– materialized views
• When to update derived data?
• The most difficult problem is how to refresh the
derived data. The problem of constructing
algorithms that incrementally update derived data
has been the subject of much research!

86
Materialized Views
• Define new warehouse relations using SQL
expressions

sale: prodId clientId date amt    product: id name price
      p1     c1       1    12              p1 bolt 10
      p2     c1       1    11              p2 nut  5
      p1     c3       1    50
      p2     c2       1    8
      p1     c1       2    44
      p1     c2       2    4

join of sale and product:

joinTb: prodId name price clientId date amt
        p1     bolt 10    c1       1    12
        p2     nut  5     c1       1    11
        p1     bolt 10    c3       1    50
        p2     nut  5     c2       1    8
        p1     bolt 10    c1       2    44
        p1     bolt 10    c2       2    4

87
Processing
• Index Structures
• What to Materialize?
• Algorithms

88
Index Structures
• Indexing principle:
• mapping key values to records for
associative direct access
• Most popular indexing techniques in
relational database: B+-trees
• For multi-dimensional data, a large
number of indexing techniques have been
developed: R-trees

89
Index Structures
• Index structures applied in warehouses
– inverted lists
– bit map indexes
– join indexes
– text indexes

90
Inverted Lists

[Diagram] The age index keeps, for each age value,
an inverted list of the record ids having that age,
e.g. 20 → {r4, r18, r34, r35} and 21 → {r5, r19},
pointing into the data records:

rId  name  age
r4   joe   20
r18  fred  20
r19  sally 21
r34  nancy 20
r35  tom   20
r36  pat   25
r5   dave  21
r41  jeff  26
91
Inverted Lists
• Query:
– Get people with age = 20 and name = “fred”
• List for age = 20: r4, r18, r34, r35
• List for name = “fred”: r18, r52
• Answer is intersection: r18

92
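The intersection query above can be sketched with Python sets standing in for the inverted lists (record ids as in the slides):

```python
# One inverted list per (attribute, value) pair.
inverted = {
    ("age", 20): {"r4", "r18", "r34", "r35"},
    ("name", "fred"): {"r18", "r52"},
}

def lookup(*conditions):
    """Intersect the inverted lists of all (attribute, value) pairs."""
    lists = [inverted.get(cond, set()) for cond in conditions]
    return set.intersection(*lists) if lists else set()

print(lookup(("age", 20), ("name", "fred")))  # {'r18'}
```

Only the surviving record ids need to be fetched from the data file, which is the point of evaluating conjunctive selections on the lists first.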
Bitmap Indexes
• Bitmap index: An indexing technique that has
attracted attention in multi-dimensional
database implementation
table:
Customer  City     Car
c1        Detroit  Ford
c2        Chicago  Honda
c3        Detroit  Honda
c4        Poznan   Ford
c5        Paris    BMW
c6        Paris    Nissan

93
Bitmap Indexes
• The index consists of bitmaps:
Index on City:
rec  Chicago  Detroit  Paris  Poznan
1    0        1        0      0
2    1        0        0      0
3    0        1        0      0
4    0        0        0      1
5    0        0        1      0
6    0        0        1      0

bitmaps

94
Bitmap Indexes
Index on Car:
rec  BMW  Ford  Honda  Nissan
1    0    1     0      0
2    0    0     1      0
3    0    0     1      0
4    0    1     0      0
5    1    0     0      0
6    0    0     0      1

bitmaps

95
Bitmap Indexes
• Index on a particular column
• Index consists of a number of bit vectors -
bitmaps
• Each value in the indexed column has a bit
vector (bitmaps)
• The length of the bit vector is the number of
records in the base table
• The i-th bit is set if the i-th row of the base
table has that value in the indexed column

96
Bitmap Index

[Diagram] The age index stores one bitmap per age
value over the 8 data records:

age 20 → 11011000
age 21 → 00100010
age 25 → 00000100
age 26 → 00000001

id  name  age
1   joe   20
2   fred  20
3   sally 21
4   nancy 20
5   tom   20
6   pat   25
7   dave  21
8   jeff  26
97
Using Bitmap indexes
• Query:
– Get people with age = 20 and name = “fred”
• List for age = 20: 1101100000
• List for name = “fred”: 0100000001
• Answer is intersection: 0100000000
• Good if domain cardinality small
• Bit vectors can be compressed

98
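The same query can be sketched with each bitmap packed into a Python integer; the integer bit operations stand in for the hardware-supported AND, and the bit patterns are the ones used above:

```python
# One bitmap per indexed value; bit i corresponds to record i
# (leftmost bit = record 1).
n_records = 10
age_20    = 0b1101100000   # records with age = 20
name_fred = 0b0100000001   # records with name = "fred"

# Conjunctive selection = bitwise AND of the two bitmaps.
answer = age_20 & name_fred
print(f"{answer:0{n_records}b}")  # 0100000000 -> only record 2 qualifies

# "How many records have age = 20?" = popcount of the bitmap.
print(bin(age_20).count("1"))     # 4
```

Compressed bitmap formats (run-length schemes and successors) keep this cheap even when the bit vectors are sparse.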
Using Bitmap indexes
• They allow the use of efficient bit operations to
answer some queries
• “how many customers from Detroit have car
‘Ford’”
– perform a bit-wise AND of two bitmaps:
answer – c1
• “how many customers have a car ‘Honda’”
– count 1’s in the bitmap - answer - 2
• Compression - bit vectors are usually sparse
for large databases – the need for
decompression
99
Bitmap Index – Summary
• With efficient hardware support for bitmap
operations (AND, OR, XOR, NOT), bitmap index
offers better access methods for certain queries
• e.g., selection on two attributes
• Some commercial products have implemented
bitmap index
• Works poorly for high cardinality domains since
the number of bitmaps increases
• Difficult to maintain - need reorganization when
relation sizes change (new bitmaps)
100
Join
• “Combine” SALE, PRODUCT relations
• In SQL: SELECT * FROM SALE, PRODUCT
sale: prodId storeId date amt     product: id name price
      p1     c1      1    12               p1 bolt 10
      p2     c1      1    11               p2 nut  5
      p1     c3      1    50
      p2     c2      1    8
      p1     c1      2    44
      p1     c2      2    4

joinTb: prodId name price storeId date amt
        p1     bolt 10    c1      1    12
        p2     nut  5     c1      1    11
        p1     bolt 10    c3      1    50
        p2     nut  5     c2      1    8
        p1     bolt 10    c1      2    44
        p1     bolt 10    c2      2    4

101
Join Indexes

join index:
product: id name price jIndex
         p1 bolt 10    r1,r3,r5,r6
         p2 nut  5     r2,r4

sale: rId prodId storeId date amt
      r1  p1     c1      1    12
      r2  p2     c1      1    11
      r3  p1     c3      1    50
      r4  p2     c2      1    8
      r5  p1     c1      2    44
      r6  p1     c2      2    4
102
Join Indexes
• Traditional indexes map the value to a list of
record ids. Join indexes map the tuples in the join
result of two relations to the source tables.
• In data warehouse cases, join indexes relate the
values of the dimensions of a star schema to rows
in the fact table.
• For a warehouse with a Sales fact table and dimension
city, a join index on city maintains for each distinct city
a list of RIDs of the tuples recording the sales in the
city
• Join indexes can span multiple dimensions
103
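A join index can be sketched as a map from a dimension value to the row ids of the matching fact rows, using the slide's sale data:

```python
# Fact table rows keyed by row id: (prodId, storeId, date, amt).
sale = {"r1": ("p1", "c1", 1, 12), "r2": ("p2", "c1", 1, 11),
        "r3": ("p1", "c3", 1, 50), "r4": ("p2", "c2", 1, 8),
        "r5": ("p1", "c1", 2, 44), "r6": ("p1", "c2", 2, 4)}

# Build the join index on prodId: dimension value -> fact row ids.
join_index = {}
for rid, (prodId, *_rest) in sale.items():
    join_index.setdefault(prodId, []).append(rid)

print(join_index["p1"])  # ['r1', 'r3', 'r5', 'r6']

# Use it: total amount for p1 without scanning the whole fact table.
total = sum(sale[rid][3] for rid in join_index["p1"])
print(total)  # 110
```

The precomputed lists replace the join between product and sale at query time, which is why join indexes pay off on star schemas.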
What to Materialize?
• Store in warehouse results useful for
common queries
Example: total sale

[Diagram] From the daily slices we can materialize
the product × client sums

      c1  c2  c3
p1    56  4   50
p2    11  8

as well as further aggregates such as the column
sums (67, 12, 50), the row sums (p1 = 110,
p2 = 19) and the grand total 129.
104
Cube Operation
• SELECT date, product, customer, SUM(amount)
FROM SALES
CUBE BY date, product, customer
• We need to compute the following group-bys:
– (date, product, customer),
– (date, product), (date, customer), (product,
customer),
– (date), (product), (customer)
105
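Since many engines lack a CUBE operator, the set of group-bys above can be computed explicitly; this sketch enumerates every subset of the dimensions over the running sales example:

```python
from itertools import combinations

# (date, product, customer, amount) rows from the running example.
rows = [
    (1, "p1", "c1", 12), (1, "p2", "c1", 11),
    (1, "p1", "c3", 50), (1, "p2", "c2", 8),
    (2, "p1", "c1", 44), (2, "p1", "c2", 4),
]
dims = ("date", "product", "customer")

# One GROUP BY per subset of the dimensions: 2^3 = 8 group-bys,
# including () which is the grand total.
cube = {}
for k in range(len(dims) + 1):
    for subset in combinations(range(len(dims)), k):
        group = {}
        for r in rows:
            key = tuple(r[i] for i in subset)  # dropped dims act as '*'
            group[key] = group.get(key, 0) + r[3]
        cube[tuple(dims[i] for i in subset)] = group

print(cube[()])         # {(): 129} -- grand total
print(cube[("date",)])  # {(1,): 81, (2,): 48}
print(cube[("product", "customer")][("p1", "c1")])  # 56
```

Real systems avoid this naive 2^n-pass plan by computing smaller group-bys from larger ones, which is exactly the cuboid-lattice structure discussed next.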
Cuboid Lattice
• Data cube can be viewed as a lattice of
cuboids
• The bottom-most cuboid is the base cuboid.
• The top-most cuboid contains only one cell.

(A,B,C,D)

(A,B,C) (A,B,D) (A,C,D) (B,C,D)

(A,B) (A,C) (A,D) (B,C) (B,D) (C,D)

(A) (B) (C) (D)

( all )
106
Cuboid Lattice

[Diagram] The lattice for dimensions city, product,
date: the base cuboid (city, product, date) at the
bottom; the two-dimensional cuboids (city, product),
(city, date), (product, date) above it; then (city),
(product), (date); and finally "all" (the grand
total, 129) at the top. A greedy algorithm is used
to decide what to materialize.
107
Efficient Data Cube Computation
• Materialization of data cube
• Materialize every (cuboid), none, or some.
• Algorithms for selection of which cuboids to
materialize:
• size, sharing, and access frequency:
– Type/frequency of queries
– Query response time
– Storage cost
– Update cost

108
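The greedy selection mentioned above can be sketched as follows, in the spirit of the Harinarayan, Rajaraman and Ullman algorithm; the cuboid sizes are invented for the example, and `answerable_from` is a helper name introduced here:

```python
from itertools import combinations

# Estimated row counts per cuboid over dimensions c(ity), p(roduct),
# d(ate); () is the single-cell "all" cuboid.
sizes = {("c", "p", "d"): 6, ("c", "p"): 6, ("c", "d"): 6,
         ("p", "d"): 6, ("c",): 3, ("p",): 2, ("d",): 2, (): 1}

def answerable_from(cuboid):
    """Cuboids computable from `cuboid`: all subsets of its dims."""
    return {s for k in range(len(cuboid) + 1)
            for s in combinations(cuboid, k)}

def benefit(candidate, materialized):
    """Total rows saved, over all queries, if we add `candidate`."""
    saved = 0
    for q in answerable_from(candidate):
        cheapest = min(sizes[m] for m in materialized
                       if q in answerable_from(m))
        if sizes[candidate] < cheapest:
            saved += cheapest - sizes[candidate]
    return saved

materialized = {("c", "p", "d")}   # the base cuboid is always stored
for _ in range(2):                 # greedily pick 2 extra views
    pick = max((v for v in sizes if v not in materialized),
               key=lambda v: benefit(v, materialized))
    materialized.add(pick)

print(sorted(materialized))  # [('c', 'p', 'd'), ('d',), ('p',)]
```

With these sizes the two-dimensional cuboids are as large as the base cuboid and bring no benefit, so the greedy picks the small one-dimensional cuboids instead.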
Dimension Hierarchies
• Client hierarchy: city → state → region

city  state  region
c1    CA     East
c2    NY     East
c3    SF     West

109
Dimension Hierarchies
Computation

[Diagram] The cuboid lattice is extended along the
client hierarchy: from the base cuboid
(city, product, date) we can roll up to
(state, product, date), and from there to
(state, product) and (state, date), in addition to
the city-level cuboids.
110
Cube Computation - Array Based
Algorithm
• An MOLAP approach:
• the base cuboid is stored as
multidimensional array.
• read in a number of cells to compute partial
cuboids

111
Cube computations

[Diagram] A 3-D array with dimensions A, B, C is
scanned to compute the cuboids {ABC}, {AB}, {AC},
{BC}, {A}, {B}, {C} and { } (ALL).
112
View and Materialized Views
• View
• derived relation defined in terms of base
(stored) relations
• Materialized views
• a view can be materialized by storing the tuples
of the view in the database
• index structures can be built on the materialized
view

113
View and Materialized Views
• Maintenance is an issue for materialized
views
• recomputation
• incremental updating

114
Maintenance of materialized views
• “Deficit” departments
• To find all “deficit” departments:
– group by deptid
– join (deptid)
– select all dept names with budget < sum(salary)

Dept: DeptId Name  Budget   Employee: EmpId Lname  salary DeptId
      1      CS    7500               100   Kim    2500   1
      2      Math  5500               200   Jabbar 2000   1
      3      Comm. 4500               300   Smith  3000   1
                                      400   Brown  3500   2
                                      500   Lu     3000   2

115
Maintenance of materialized views
• select DeptId, sum(salary) Real_Budget
from Employee
group by DeptId;          -- Temp (relation)
• select Name
from Dept, Temp
where Dept.DeptId = Temp.DeptId
and Budget < Real_Budget;

116
Maintenance of materialized views
• assume the following update:
update Employee
set salary=salary+1000
where Lname=‘Jabbar’;
• recompute the whole view?
• use intermediate materialized results
(Temp), and update the view incrementally?

117
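The incremental alternative can be sketched in Python: keep the intermediate per-department sums (the Temp relation) materialized and apply only the delta of the update, instead of recomputing the view from the base tables; the figures come from the earlier slide:

```python
# Dept budgets and the materialized Temp relation: SUM(salary) by DeptId.
budget = {1: 7500, 2: 5500, 3: 4500}
temp = {1: 2500 + 2000 + 3000,   # CS:   Kim + Jabbar + Smith = 7500
        2: 3500 + 3000}          # Math: Brown + Lu           = 6500

def deficit_depts():
    """The view: departments whose budget is below the salary sum."""
    return sorted(d for d, s in temp.items() if budget[d] < s)

print(deficit_depts())  # [2] -- Math: 6500 > 5500

# update Employee set salary = salary + 1000 where Lname = 'Jabbar';
# Jabbar is in department 1, so the delta on Temp is +1000 on temp[1]
# -- no rescan of Employee is needed.
temp[1] += 1000
print(deficit_depts())  # [1, 2] -- CS now 8500 > 7500
```

The saving is that one updated row touches one aggregate cell; full recomputation would re-read every Employee row on every update.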
Managing
Metadata Repository
• Administrative metadata
• source database and their contents
• gateway descriptions
• warehouse schema, view and derived data
definitions
• dimensions and hierarchies
• pre-defined queries and reports
• data mart locations and contents

119
Metadata Repository
• Administrative metadata
• data partitions
• data extraction, cleansing, transformation rules,
defaults
• data refresh and purge rules
• user profiles, user groups
• security: user authorization, access control

120
Metadata Repository
• Business
– business terms & definition
– data ownership, charging
• Operational
– data layout
– data currency (e.g., active, archived, purged)
– use statistics, error reports, audit trails

121
Design
• What data is needed?
• Where does it come from?
• How to clean data?
• How to represent in warehouse (schema)?
• What to summarize?
• What to materialize?
• What to index?

122
Summary
• Data warehouse is not a software product
or application - it is an important
information processing system
architecture for decision making!
• A data warehouse combines a number of
products, each of which also has operational
uses besides the data warehouse

123
Summary
• OLAP provides powerful and fast tools
for reporting on data:
• ROLAP vs. MOLAP
• Issues associated with data warehouses:
• new techniques: multidimensional database,
data cube computation, query optimization,
indexing, …
• data warehousing and application design:
vendors and application developers.
124
Current State of Industry
• Extraction and integration done off-line
– Usually in large, time-consuming, batches
• Everything copied at warehouse
– Not selective about what is stored
– Query benefit vs storage & update cost
• Query optimization aimed at OLTP
– High throughput instead of fast response
– Process whole query before displaying anything

125