
DATA WAREHOUSING

Dr. Navneet Goyal


Professor
Department of Computer Science
BITS, Pilani – Pilani Campus
HANDOUT
Books
Ponniah P, "Data Warehousing Fundamentals", John Wiley, 2003.
Kimball R, "The Data Warehouse Toolkit", 2e, John Wiley, 2002.
Anahory S & Murray D, "Data Warehousing in the Real World", Addison-Wesley, 2000.
Kimball R, Reeves L, Ross M, & Thornthwaite W, "The Data Warehouse Lifecycle Toolkit", John Wiley, 1998.
Inmon WH, "Building the Data Warehouse", 3e, John Wiley, 2002.
Adamson C & Venerable M, "Data Warehouse Design Solutions", John Wiley, 1998.
World’s Largest
Data Warehouse
SAP in conjunction with NetApp and several other partners
@ SAP/Intel data center in Santa Clara, California
12 petabytes (PB) of addressable storage had been created
Guinness World Record
Based on the SAP® HANA in-memory data platform, SAP IQ
(formerly Sybase IQ), and BMMsoft Federated EDMT.
NetApp® SAN storage
Contains more than 221 trillion transactional records
more than 100 billion unstructured documents, including emails,
SMS, and images
It also contains data from 30 billion sources, including users,
smart sensors, and mobile devices.

Source:
An Insider’s View into the World’s Largest Data Warehouse
by Larry Freeman, NetApp
Background
1980’s to early 1990’s
 Focus on computerizing business processes
 To gain competitive advantage
By early 1990’s
 All companies had operational systems
 It no longer offered any advantage
How to get competitive advantage??
OLTP Systems:
Primary Purpose
Run the operations of the business
For example: Banks, Railway reservation etc.
Based on ER Data Modeling
Transaction based system
Data is always current valued
Little history is available
Data is highly volatile
Has “Intelligent keys”
OLTP Systems

Has relational normalized design


Redundant data is undesirable
Consists of many tables
High volume retrieval is inefficient
Optimized for repetitive “narrow” queries
Common data in many applications
Need for
Data Warehousing
Companies have, over the years, gathered huge
volumes of data
“Hidden Treasure”
Can this data be used in any way?
Can we analyze this data to get any
competitive advantage?
If yes, what kind of advantage?
Benefits of
Data Warehousing

Allows “efficient” analysis of data


Competitive Advantage
Analysis aids strategic decision making
Increased productivity of decision makers
Potential high ROI
Classic example: Diaper and Beer
More recently: Polo shirts & Barbie dolls
Decision Support Systems,
DW, & OLAP
Information technology to help the knowledge
worker (executive, manager, analyst) make
faster and better decisions.
Data Warehouse is a DSS
A data warehouse is an architectural construct
of an information system that provides users
with current and historical decision support
information that is hard to access or present in
traditional operational systems.
Data Warehouse is not an Intelligent system
On-Line Analytical Processing (OLAP) is an
element of DSS
Why Separate Data Warehouse?

Performance
 Op dbs designed & tuned for known txs &
workloads.
 Complex OLAP queries would degrade
performance for op txs.
 Special data organization, access & implementation
methods needed for multidimensional views &
queries.
 Current and historical decision support information
that is hard to access or present in traditional
operational systems.
Why Separate Data Warehouse?
Function
 Missing data: Decision support requires
historical data, which op dbs do not typically
maintain.
 Data consolidation: Decision support requires
consolidation (aggregation, summarization) of
data from many heterogeneous sources: op
dbs, external sources.
 Data quality: Different sources typically use
inconsistent data representations, codes, and
formats which have to be reconciled.
Data Warehouse:
Characteristics
Analysis driven
Ad-hoc queries
Complex queries
Used by top managers
Based on Dimensional Modeling
Denormalized structures
Data Warehouse:
Major Players
SAS Institute
IBM
Oracle
Sybase
Microsoft
HP
Cognos
Business Objects
Data Warehouse
A decision support database that is maintained
separately from the organization’s operational
databases.
A data warehouse is a
 subject-oriented,
 integrated,
 time-variant,
 non-volatile
collection of data that is used primarily in organizational decision making
Subject Oriented
Data Warehouse is designed around
“subjects” rather than processes
A company may have
 Retail Sales System
 Outlet Sales System
 Catalog Sales System
Problems Galore!!!
DW will have a Sales Subject Area
Subject Oriented

[Figure: the Retail Sales, Outlet Sales, and Catalog Sales OLTP systems each feed the single Sales subject area of the Data Warehouse, giving subject-oriented sales information]
Integrated

Heterogeneous Source Systems


Little or no control
Need to Integrate source data
For Example: Product codes could be
different in different systems
Arrive at common code in DW
“Surrogate keys”
Non-Volatile (Read-Mostly)

[Figure: users both read from and write to OLTP systems, but only read from the DW]
Time Variant
 Most business analysis has a time component
 Trend analysis requires historical data
[Figure: sales trend plotted across 2001–2004]
Data Warehousing
Architecture
[Figure: data warehousing architecture – external sources and operational DBs are extracted, transformed, loaded, and refreshed into the data warehouse (with a metadata repository and monitoring & administration); OLAP servers serve the data to analysis, query/reporting, and data mining tools; data marts sit downstream of the warehouse]
Q&A
Thank You
Data Warehousing:
Introduction Continued

Dr. Navneet Goyal


Professor
Computer Science Department
BITS, Pilani
OLTP Systems:
Characteristics
Run the operations of the business
 For example: Banks, Railway
reservation etc.
 Based on ER Data Modeling
 Transaction based system
 Data is always current valued
 Little history is available
 Data is highly volatile
 Has “Intelligent keys”
Data Warehouse:
Characteristics
 Analysis driven
 Ad-hoc queries
 Complex queries
 Used by top managers
 Based on Dimensional Modeling
 Denormalized structures
Populating & Refreshing
the Warehouse
 Data Extraction
 Data Cleaning
 Data Transformation
 Convert from legacy/host format to
warehouse format
 Load
 Sort, summarize, consolidate, compute
views, check integrity, build indexes,
partition
 Refresh
 Bring new data from source systems
ETL Process
Issues & Challenges
 Consumes 70-80% of project time
 Heterogeneous Source Systems
 Little or no control over source systems
 Source systems scattered
 Source systems operating in different time
zones
 Different currencies
 Different measurement units
 Data not captured by OLTP systems
 Ensuring data quality
Data Staging Area
 A storage area where extracted data is
 Cleaned
 Transformed
 Deduplicated
 Initial storage for data
 Need not be based on Relational model
 Spread over a number of machines
 Mainly sorting and Sequential processing
 COBOL or C code running against flat files
 Does not provide data access to users
 Analogy – kitchen of a restaurant
Presentation Servers
 A target physical machine on which DW data is
organized for
 Direct querying by end users using OLAP
 Report writers
 Data Visualization tools
 Data mining tools
 Data stored in Dimensional framework
 Analogy – Sitting area of a restaurant
Data Cleaning
 Why?
 Data warehouse contains data that is
analyzed for business decisions
 More data and multiple sources could mean
more errors in the data and harder to trace
such errors
 Results in incorrect analysis
 Detecting data anomalies and rectifying
them early has huge payoffs
 Long Term Solution
 Change business practices and data entry
tools
 Repository for meta-data
Soundex Algorithms

 Misspelled terms
 For example NAMES
 Phonetic algorithms – can find
similar sounding names
 Based on the six phonetic
classifications of human speech
sounds
Data Warehouse Design
 OLTP Systems are Data Capture
Systems
 “DATA IN” systems
 DW are “DATA OUT” systems

OLTP → DW
Analyzing the DATA
 Active Analysis – User Queries
 User-guided data analysis
 Show me how X varies with Y
 OLAP
 Automated Analysis – Data Mining
 What’s in there?
 Set the computer FREE on your data
 Supervised Learning (classification)
 Unsupervised Learning (clustering)
OLAP Queries
 How much of product P1 was
sold in 1999 state wise?
 Top 5 selling products in 2002
 Total Sales in Q1 of FY 2002-03?
 Color wise sales figure of cars
from 2000 to 2003
 Model wise sales of cars for the
month of Jan from 2000 to 2003
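For concreteness, here is a minimal SQL sketch of two of the queries above, assuming a hypothetical star schema with sales_fact, product_dimension, and time_dimension tables (all names and columns are illustrative, not from a specific system):

-- "Total Sales in Q1 of FY 2002-03"
SELECT SUM(f.sales_amount) AS total_sales
FROM   sales_fact f
JOIN   time_dimension t ON f.time_key = t.time_key
WHERE  t.fiscal_year = '2002-03'
  AND  t.fiscal_quarter = 'Q1';

-- "Top 5 selling products in 2002"
-- (row-limiting syntax varies by DBMS; FETCH FIRST is the SQL-standard form)
SELECT p.product_desc, SUM(f.units_sold) AS total_units
FROM   sales_fact f
JOIN   product_dimension p ON f.product_key = p.product_key
JOIN   time_dimension   t ON f.time_key    = t.time_key
WHERE  t.calendar_year = 2002
GROUP BY p.product_desc
ORDER BY total_units DESC
FETCH FIRST 5 ROWS ONLY;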
Data Mining Investigations
 Which types of customers are most likely to spend the most with us in the coming year?
 What additional products are most
likely to be sold to customers who
buy sportswear?
 In which area should we open a new
store in the next year?
 What are the characteristics of
customers most likely to default on
their loans before the year is out?
Continuum of Analysis

[Figure: continuum of analysis – OLTP (SQL; primitive & canned analysis) → OLAP (complex, ad-hoc analysis) → Data Mining (specialized algorithms; automated analysis)]
Q&A
Thank You
Data Warehousing:
Introduction to Data
Modeling

Dr. Navneet Goyal


Professor
Computer Science Department
BITS, Pilani
Data Warehouse Design
 OLTP Systems are Data Capture
Systems
 “DATA IN” systems
 DW are “DATA OUT” systems

OLTP → DW
Design Requirements
 Design of the DW must directly
reflect the way the managers look
at the business
 Should capture the measurements of importance, along with the parameters by which these measurements are viewed
 It must facilitate data analysis, i.e.,
answering business questions
ER Modeling
 A logical design technique that
seeks to eliminate data redundancy
 Illuminates the microscopic
relationships among data elements
 Perfect for OLTP systems
 Responsible for success of
transaction processing in
Relational Databases
Problems with ER Model
ER models are NOT suitable for DW?
 End user cannot understand or
remember an ER Model
 Many DWs have failed because of
overly complex ER designs
 Not optimized for complex, ad-hoc
queries
 Data retrieval becomes difficult due to
normalization
 Browsing becomes difficult
ER vs Dimensional Modeling
 ER models are constituted to
 Remove redundant data (normalization)
 Facilitate retrieval of individual records
having certain critical identifiers
 Thereby optimizing OLTP performance
 Dimensional model supports the
reporting and analytical needs of a
data warehouse system.
Dimensional Modeling:
Salient Features
 Represents data in a standard
framework
 Framework is easily
understandable by end users
 Contains same information as ER
model
 Packages data in symmetric format
 Resilient to change
 Facilitates data retrieval/analysis
Dimensional Modeling:
Vocabulary
 Measures or facts
 Facts are “numeric” & “additive”
 For example; Sale Amount, Sale
Units
 Factors or dimensions
 Star Schemas
 Snowflake & Starflake Schemas

Sales Amt = f (Product, Location, Time)
(the fact is a function of the dimensions)
Star Schema
[Figure: star schema – a central Sales fact table holding foreign keys (FK) to the Product, Location, Time, and Promotion dimension tables]
Dimensional Modeling

 Facts are stored in FACT Tables


 Dimensions are stored in
DIMENSION tables
 Dimension tables contain textual descriptors of the business
 Fact and dimension tables form a
Star Schema
 “BIG” fact table in center
surrounded by “SMALL” dimension
tables
The “Classic” Star Schema
Fact Table: STORE KEY, PRODUCT KEY, PERIOD KEY, Dollars_sold, Units, Dollars_cost
Store Dimension: STORE KEY, Store Description, City, State, District ID, District Desc., Region_ID, Region Desc., Regional Mgr.
Time Dimension: PERIOD KEY, Period Desc., Year, Quarter, Month, Day
Product Dimension: PRODUCT KEY, Product Desc., Brand, Color, Size, Manufacturer
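A minimal DDL sketch of this star schema is given below; table names, column names, and data types are illustrative only, not taken from any particular product:

CREATE TABLE store_dimension (
    store_key     INTEGER PRIMARY KEY,  -- surrogate key
    store_desc    VARCHAR(50),
    city          VARCHAR(30),
    state         VARCHAR(30),
    district_id   INTEGER,
    district_desc VARCHAR(30),
    region_id     INTEGER,
    region_desc   VARCHAR(30),
    regional_mgr  VARCHAR(50)
);

CREATE TABLE product_dimension (
    product_key   INTEGER PRIMARY KEY,  -- surrogate key
    product_desc  VARCHAR(50),
    brand         VARCHAR(30),
    color         VARCHAR(20),
    product_size  VARCHAR(20),
    manufacturer  VARCHAR(50)
);

CREATE TABLE time_dimension (
    period_key       INTEGER PRIMARY KEY,  -- surrogate key
    period_desc      VARCHAR(30),
    calendar_year    INTEGER,
    calendar_quarter INTEGER,
    calendar_month   INTEGER,
    calendar_day     INTEGER
);

-- The fact table's primary key is the composite of the dimension foreign keys
CREATE TABLE sales_fact (
    store_key     INTEGER REFERENCES store_dimension,
    product_key   INTEGER REFERENCES product_dimension,
    period_key    INTEGER REFERENCES time_dimension,
    dollars_sold  DECIMAL(12,2),
    units         INTEGER,
    dollars_cost  DECIMAL(12,2),
    PRIMARY KEY (store_key, product_key, period_key)
);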
Fact Tables
 Contains numerical measurements of
the business
 Each measurement is taken at the
intersection of all dimensions
 Intersection is the composite key
 Represents Many-to-many
relationships between dimensions
 Examples of facts
Sale_amt, Units_sold, Cost,
Customer_count
Dimension Tables
 Contains attributes for dimensions
 50 to 100 attributes common
 Best attributes are textual and
descriptive
 DW is only as good as the dimension
attributes
 Contains hierarchical information, albeit redundantly
 Entry points into the fact table
Types of Facts
 Fully-additive-all dimensions
 Units_sold, Sales_amt
 Semi-additive-some dimensions
 Account_balance, Customer_count
28/3, tissue paper, store1, 25, 250, 20
28/3, paper towel, store1, 35, 350, 30
Is the number of customers who bought either tissue paper or paper towel 50? NO – the same customer may have bought both, so customer counts cannot be added across the product dimension.
 Non-additive-none
 Gross margin=Gross profit/amount
 Note that GP and Amount are fully additive
 Ratio of the sums and not sum of the ratios
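As a small hedged illustration (table and column names assumed), the non-additive gross margin must be recomputed as a ratio of sums over whatever slice is being analyzed:

-- correct: ratio of the sums
SELECT SUM(gross_profit) / SUM(sales_amount) AS gross_margin
FROM   sales_fact
WHERE  date_key = 20030328;
-- summing the pre-computed ratio, SUM(gross_profit / sales_amount),
-- would give the (wrong) sum of the ratios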
Data Warehouse:
Design Steps

Step 1: Identify the Business Process

Step 2: Declare the Grain

Step 3: Identify the Dimensions

Step 4: Identify the Facts


Grocery Store:
The Universal Example
The Scenario:
 Chain of 100 Grocery Stores
 60000 individual products in each
store
 10000 of these products sold on
any given day(average)
 3 year data
Grocery Store DW
 Step 1: Sales Business Process
 Step 2: Daily Grain
 A word about GRANULARITY
 Temp sensor data: per ms, sec, min, hr?
 Size of the DW is governed by granularity
 Daily grain (club products sold on a day for
each store) Aggregated data
 Receipt line Grain (each line in the receipt is
recorded – finest grain data)
Grocery Store:
DW Size Estimate
 Daily Grain
 Size of Fact Table
= 100*10000*3*365
= 1095 million records
 3 facts & 4 dimensions (49 bytes)
 1095 m * 49 bytes = 53655 m bytes
 i.e. ~ 50 GB
Data Cube

[Figure: a 3-D data cube with axes Time (days M–S), Product (Juice, Milk, Coke, Cream, Soap, Bread), and Location (Pilani, Dubai, Goa); e.g., the cell (Bread, Pilani, Monday) holds 56 units sold]
Dimensions: Time, Product, Location
Attributes: Product (upc, price, …), Location (…), …
Hierarchies: Product → Brand → …; Day → Week → Quarter (roll up to week); City → Region → Country
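The "roll up to week" step of the hierarchy above can be sketched in SQL, assuming hypothetical day-grain sales_fact rows and dimension tables carrying the hierarchies (all names illustrative):

SELECT t.calendar_week, p.product_desc, l.city,
       SUM(f.units) AS units_sold
FROM   sales_fact f
JOIN   time_dimension     t ON f.time_key     = t.time_key
JOIN   product_dimension  p ON f.product_key  = p.product_key
JOIN   location_dimension l ON f.location_key = l.location_key
GROUP BY t.calendar_week, p.product_desc, l.city;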
Q&A
Thank You
Grocery Store
Data Warehouse

Dr. Navneet Goyal


Professor
Computer Science Department
BITS, Pilani
Business Processes
• Sales
• Inventory
• Procurement
• Order Management
• Promotion
Value Chain

[Figure: value chain – Retailer issues purchase order → Deliveries @ retailer warehouse → Retailer warehouse inventory → Deliveries @ retail store → Retail store inventory → Retail store sales]
The Scenario
•A chain of grocery stores in the US
• 100 stores
• 60,000 individual products on the shelves in each store
• 6,000 products (on an average) sell each day in a given
store
• Each product belongs to a subcategory
• Each subcategory belongs to a category
• Each category belongs to a department
Some Terms

• SKU (Stock Keeping Units)


• UPC (Universal Product Codes)
• EPOS ( Electronic Point of Sales)
What Management is
Interested In?

• Ordering logistics

• Stocking shelves
• Selling products
• Maximize profits
Data Warehouse:
Design Steps

Step 1: Identify the Business Process

Step 2: Declare the Grain

Step 3: Identify the Dimensions

Step 4: Identify the Facts


Star Schema

[Figure: star schema – a central Sales fact table holding foreign keys (FK) to the Product, Location, Time, and Promotion dimension tables]
The "Classic" Star Schema

Fact Table: STORE KEY, PRODUCT KEY, PERIOD KEY, Dollars_sold, Units, Dollars_cost
Store Dimension: STORE KEY, Store Description, City, State, District ID, District Desc., Region_ID, Region Desc., Regional Mgr.
Time Dimension: PERIOD KEY, Period Desc., Year, Quarter, Month, Day
Product Dimension: PRODUCT KEY, Product Desc., Brand, Color, Size, Manufacturer
Types of Facts
 Fully-additive-all dimensions
 Units_sold, Sales_amt

 Semi-additive-some dimensions
 Account_balance, Customer_count
28/3, tissue paper, store1, 25, 250, 20
28/3, paper towel, store1, 35, 350, 30
Is the number of customers who bought either tissue paper or paper towel 50? NO – the same customer may have bought both, so customer counts cannot be added across the product dimension.
 Non-additive-none
 Gross margin=Gross profit/amount
 Note that GP and Amount are fully additive
 Ratio of the sums and not sum of the ratios
Facts for Grocery Store
1. Quantity sold (additive)
2. Dollar revenue (additive)
3. Dollar cost (additive)
4. Customer count (semi-additive, not additive along
the product dimension)
Fact Table for Grocery Store

Field name             Example value   Description/Remarks
Date key (FK)          1               Surrogate key
Product key (FK)       1               Surrogate key
Store key (FK)         1               Surrogate key
EPOS transaction no.   100             Transaction number generated by the operational system to record the sale
Sales quantity         2               No. of units bought by a customer
Sales amount           72              Amount received by selling 2 units
Cost amount            65              Cost price of 2 units
Promotion Dimension
 Causal Dimension
 A dimension that causes, or helps explain, the measured events
 Promotion conditions include
 TPRs
 End-aisle displays
 Newspaper ads
 Coupons
 Combinations are common
Promotion Dimension
 Management is interested in knowing how
effective their promotion schemes are
 Promotions are judged on the basis of:
 Lift and Baseline sales
 Time shifting
 Cannibalization
 Growing the market
Modeling Promotion
Dimension
 Difficult to capture the effect of promotion
 Little or NO provision in operational system
to capture promotions
 Multiple promotion schemes at the same time
 Promotion schemes applicable to many
products
 Different grain than sales
 What about products that were on promotion
but not sold?
Modeling Promotion
Dimension
 Captures combination of promotion techniques in
effect at the time of sale
 Promotions are generally at a higher grain than sales
fact table
 Adding a promotion dimension is thus possible
 Promotion and product relationship is captured
implicitly in the fact table
 But we are missing out on one important piece of
information
 Products on promotion that did not sell
Modeling Promotion
Dimension
 Different causal conditions are highly
correlated
 Create one row for each combination of
promotion conditions
 All stores run 3 promotion mechanisms
simultaneously, but a few stores are not able
to deploy end-aisle displays
 One record for combination of 3
 One record for combination of 2
Modeling Promotion
Dimension
 In one year, there may be 1000 ads, 5000 TPRs, and
1000 end-aisle displays
 Only 10000 combinations of these three conditions
affecting a particular product
 A sample promotion dimension contains attributes such as: Promotion key, Promotion name, TPR type, Promotion media type, Ad type, Display type, Coupon type, Ad media type, Display provider, Promotion cost, Start date, End date, …
 Include a "No promotion in effect" row in the promotion dimension
Modeling Promotion
Dimension
 Promotion Coverage Factless Fact Table
 Same Dimensions apply as that for Sales fact table
 So what is different?
 Is the grain different?
 One row in the fact table for each product in a store each
day ( or week ) regardless of whether the product was
sold or not
 NO FACTS INVOLVED!!
 How to find products that were on promotion on a day
but did not sell?
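One way to answer that question is sketched below, assuming a promotion_coverage factless fact table and a sales_fact table that share the same date/product/store surrogate keys (names are illustrative):

SELECT c.product_key, c.store_key
FROM   promotion_coverage c
WHERE  c.date_key = 20030328
  AND  c.promotion_key <> 0           -- assume 0 = "no promotion in effect"
  AND  NOT EXISTS (
         SELECT 1
         FROM   sales_fact s
         WHERE  s.date_key    = c.date_key
           AND  s.product_key = c.product_key
           AND  s.store_key   = c.store_key);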
Database Sizing

FACT TABLE SIZE


 3 year data
 100 stores
 Daily grain
 60,000 SKUs
 Sparsity = 10%
 4 dimensions (16 bytes)
 4 facts (16 bytes)
Total Size = 3 × 365 × 100 × 6000 × 32 bytes ≈ 20 GB
Sample Data Warehouse
Time Dimension
Product Dimension
Store Dimension
Promotion Dimension
Sales Fact Table
Promotion Coverage Fact Table
Q&A
Thank You
Basic Elements of a
Data Warehouse
Prof. Navneet Goyal
Department Of Computer Science
BITS, Pilani
Basic Elements of a DW
• Source Systems
• Data Staging Area
• Presentation Servers
• Data Mart/Super Marts
• Data Warehouse
• Operational Data Store
• OLAP
Kimball vs. Inmon



Data Warehousing
Architecture
[Figure: data warehousing architecture – external sources and operational DBs are extracted, transformed, loaded, and refreshed into the data warehouse (with a metadata repository and monitoring & administration); OLAP servers serve the data to analysis, query/reporting, and data mining tools; data marts sit downstream of the warehouse]
Data Marts
• What is a data mart?
• Advantages and disadvantages of data marts
• Issues with the development and management of
data marts



Data Marts
• A subset of a data warehouse that supports the
requirements of a particular department or
business process
• Characteristics include:
– Does not always contain detailed data unlike data
warehouses
– More easily understood and navigated
– Can be dependent or independent



Reasons for Creating Data Marts

• Proof of Concept for the DW


• Can be developed quickly and less resource
intensive than DW
• To give users access to data they need to analyze
most often
• To improve query response time due to reduction
in the volume of data to be accessed



Kimball vs Inmon
• Bill Inmon's paradigm: Data warehouse is one part
of the overall business intelligence system. An
enterprise has one data warehouse, and data
marts source their information from the data
warehouse. In the data warehouse, information is
stored in 3rd normal form.
• Ralph Kimball's paradigm: Data warehouse is the
conglomerate of all data marts within the
enterprise. Information is always stored in the
dimensional model.



Kimball vs Inmon

• Bill Inmon: Endorses a Top-Down design


Independent data marts cannot comprise an effective EDW.
Organizations must focus on building EDW
• Ralph Kimball: Endorses a Bottom-Up design
EDW effectively grows up around many of the several
independent data marts – such as for sales, inventory, or
marketing



Kimball vs Inmon: War of Words
"...The data warehouse is nothing more than the
union of all the data marts...,"
Ralph Kimball, December 29, 1997.

"You can catch all the minnows in the ocean and


stack them together and they still do not make a
whale,"
Bill Inmon, January 8, 1998.



Data Warehouse or Data Mart First?
• Top-Down vs. Bottom-Up Approach
• Advantages of Top-Down
– A truly corporate effort, an enterprise view of data
– Inherently architected-not a union of disparate DMs
– Central rules and control
– May be developed fast using iterative approach



Data Warehouse or Data Mart First?
• Disadvantages of Top-Down
– Takes longer to build even with iterative method
– High exposure/risk to failure
– Needs high level of cross functional skills
– High outlay without proof of concept
– Difficult to sell this approach to senior management and
sponsors



Data Warehouse or Data Mart First?
• Advantages of Bottom-Up Approach
– Faster and easier implementation of manageable pieces
– Favorable ROI and proof of concept
– Less risk of failure
– Inherently incremental; can schedule important DMs first
– Allows project team to learn and grow



Data Warehouse or Data Mart First?
• Disadvantages of Bottom-Up Approach
– Each DM has its own narrow view of data
– Permeates redundant data in every DM
– Difficult to integrate if the overall requirements are not
considered in the beginning
• Kimball’s approach is considered as a Bottom-Up
approach, but he disagrees



The Bottom-Up Misnomer

Kimball encourages you to broaden your perspective


both vertically and horizontally while gathering
business requirements while developing data marts



The Bottom-Up Misnomer
• Vertical
– Don’t just rely on the business data analyst to determine
requirements
– Inputs from senior managers about their vision, objectives,
and challenges are critical
– Ignoring this vertical span might cause failure in
understanding the organization’s direction and likely future
trends



The Bottom-Up Misnomer
• Horizontal
– Look horizontally across the departments before designing
the DW
– Critical in establishing the enterprise view
– Challenging to do if one particular department is funding the
project
– Ignoring horizontal span will create isolated, department-
centric databases that are inconsistent and can’t be
integrated
– Complete coverage in a large organization is difficult
– One rep. from each dept. interacting with the core
development team can be of immense help



Data Warehouse or Data Mart First?
New Practical approach by Kimball
1. Plan and define requirements at the overall corporate level
2. Create a surrounding architecture for a complete
warehouse
3. Conform and standardize the data content
4. Implement the Data Warehouse as a series of Supermarts,
one at a time



A Word about SUPERMARTS
• Totally monolithic approach vs. totally stovepipe approach
• A step-by-step approach for building an EDW from granular
data
• A Supermart is a data mart that has been carefully built with a
disciplined architectural framework
• A Supermart is naturally a complete subset of the DW
• A Supermart is based on the most granular data that can
possibly be collected and stored
• Conformed dimensions and standardized fact definitions



Pilot Projects: Risk vs. Reward
• Start with a pilot implementation as the first
rollout for DW
• Pilot projects have advantage of being small and
manageable
• Provide organization with a proof of concept



Pilot Projects: Risk vs. Reward
Functional scope of a pilot project should be
determined based on:
1. The Degree of risk enterprise is willing to take
2. The potential for leveraging the pilot project
 Avoid constructing a throwaway prototype
 Pilot warehouse must have actual value to the
enterprise



Pilot Projects: Risk vs. Reward

[Figure: 2×2 matrix plotting pilot projects on RISK (vertical) vs. REWARD (horizontal), with quadrants low risk/low reward, low risk/high reward, high risk/low reward, and high risk/high reward]
Kimball vs. Inmon
There is no right or wrong between these two
ideas, as they represent different data
warehousing philosophies. In reality, the data
warehouse in most enterprises are closer to Ralph
Kimball's idea. This is because most data
warehouses started out as a departmental effort,
and hence they originated as a data mart. Only
when more data marts are built later do they
evolve into a data warehouse.



Dependent Data Marts

[Figure: dependent data marts, fed from a central data warehouse – figure source unknown]
Independent Data Marts

[Figure: independent data marts, sourced directly from operational systems – figure source unknown]
References
1. Margy Ross and Ralph Kimball, "The Bottom-Up Misnomer", http://www.intelligententerprise.com/030917/615warehouse1_1.shtml, September 2003.
2. Paulraj Ponniah, "Data Warehousing Fundamentals", John Wiley, 2012.
3. Ralph Kimball, "The Data Warehouse Toolkit", 3e, John Wiley, 2013.
4. Mark Humphries et al., "Data Warehousing: Architecture and Implementation", Prentice Hall PTR, 1999.
5. W.H. Inmon, "Building the Data Warehouse", 4e, John Wiley, 2005.

ODS

• An operational data store (ODS) is a type of database


often used as an interim area for a data warehouse.
• An ODS is highly volatile
• An ODS is designed to quickly perform relatively simple
queries on small amounts of data (such as finding the
status of a customer order)
• An ODS is similar to your short term memory in that it
stores only very recent information; in comparison, the
data warehouse is more like long term memory in that it
stores relatively permanent information.
ODS

[Figure taken from "Designing the Operational Data Store", Bill Inmon, DM Review, July 1998]

ODS

[Figure taken from "The Operational Data Store", Bill Inmon, INFO DB, 1995]
ODS

• In Figure 1 the ODS is seen to be an


architectural structure that is fed by
integration and transformation (i/t) programs.
These i/t programs can be the same programs
as the ones that feed the data warehouse or
they can be separate programs.
• The ODS, in turn, feeds data to the data
warehouse.
ODS

• According to Inmon, an ODS is a "subject-


oriented, integrated, volatile, current valued
data store, designed to serve operational
users as they do high performance integrated
processing."
• In the early 1990s, the original ODS systems
were developed as a reporting tool for
administrative purposes
ODS

• Subject-oriented
• Customer, product, account, vendor etc.
• Integrated
• Data is cleansed, standardized and placed into a consistent
data model
• Volatile
• UPDATEs occur regularly, whereas data warehouses are
refreshed via INSERTs to firmly preserve history
• Current valued
• Changes are made almost with zero latency
Classification of ODS

[Table: classification of ODS (Classes I–IV) – table source unknown]


ODS
• ODS is also referred to as Generation 1 DW
• Separate system that sits between source transactional
system & DW
• Hot extract used for answering narrow range of urgent
operational questions like:
– Was the order shipped?
– Was the payment made?
• ODS is particularly useful when:
– ETL process of the main DW delayed the availability of data
– Only aggregated data is available
ODS

• ODS plays a dual role:


– Serve as a source of data for DW
– Querying
• Supports lower-latency reporting through creation of a
distinct architectural construct & application separate
from DW
• Half operational & half DSS
• A place where data was integrated & fed to a
downstream DW
• Extension of the DW ETL layer
ODS

• ODS has been absorbed by the DW


– Modern DWs now routinely extract data on a daily
basis
– Real-time techniques allow the DW to always be
completely current
– DWs have become far more operational than in the past
– Footprints of conventional DW & ODS now overlap so
completely that it is not fruitful to make a distinction
between the kinds of systems
ODS
• Classification of ODS based on:
– Urgency
• Class I - IV
– Position in overall architecture
• Internal or External
A Word About ODS
• Urgency
– Class I – Updates of data from operational systems to
ODS are synchronous
– Class II – Updates between the operational environment & the ODS occur within a 2–3 hour time frame
– Class III – synchronization of updates occurs overnight
A Word About ODS

• Urgency
– Class IV – Updates into the ODS from the DW are
unscheduled
• Data in the DW is analyzed, and periodically placed in the ODS
• For Example –Customer Profile Data
• Customer Name & ID
• Customer Volume – High/low
• Customer Profitability – High/low
• Customer Freq. of activity – very freq./very infreq.
• Customer likes & dislikes
ODS
ODS & Real-Time Data Warehousing
• Which class of ODS can be used for RTDWH?
• HOW?
• Let us first look at what we mean by RTDWH
• Wait till we talk about RTDWH
Q&A

Thank You

Basic Elements of a
Data Warehouse
Prof. Navneet Goyal
Department Of Computer Science
BITS, Pilani
Basic Elements of a DW
• Source Systems
• Data Staging Area
• Presentation Servers
• Data Mart/Super Marts
• Data Warehouse
• Operational Data Store
• OLAP
Kimball vs. Inmon



Data Warehousing
Architecture
[Figure: data warehousing architecture – external sources and operational DBs are extracted, transformed, loaded, and refreshed into the data warehouse (with a metadata repository and monitoring & administration); OLAP servers serve the data to analysis, query/reporting, and data mining tools; data marts sit downstream of the warehouse]
Data Staging Area (DSA)
 A storage area where extracted data is
 Cleaned
 Transformed
 Deduplicated
 Initial storage for data
 Need not be based on Relational model
 Spread over a number of machines
 Mainly sorting and Sequential processing
 COBOL or C code running against flat files
 Does not provide data access to users
 Analogy – kitchen of a restaurant
Data Staging Area
 The Data Warehouse Staging Area is a temporary location where data from source systems is copied
 Due to:
 varying business cycles
 data processing cycles
 hardware and network resource limitations and
 geographical factors
it is not feasible to extract all the data from all
Operational databases at exactly the same time
 For example: Data from Singapore branch will arrive
much earlier than from the NY branch
Data Staging Area
 Simplifies the overall management of a Data
Warehousing system
 ETL tools work here!
 DSA is everything between the source systems and
the presentation server
 Raw food (read data) is transformed into a fine meal
(read data fit for user queries and consumption)
 DSA is accessible only to professional chefs (read
skilled professionals)
 Customers (read end users) are not invited to eat in
the kitchen (query in the DSA)
Data Staging Area
 Key architectural requirement for the DSA is that it is
Off-limits to business users and does not provide
query and presentation services
Data Staging Area
 Steps involved:
 Extraction
 Transformation
 Cleansing the data
 Combining data from multiple sources
 Deduplicating data
 Assigning Surrogate keys
 Load or Transfer
Data Staging Area
 DSA is dominated by sorting and sequential
processing
 DSA is not typically based on the relational model,
but rather a collection of flat files
 Many times the data arrives in the DSA in 3rd normal
form, which is acceptable
 But, is not recommended because the data has to be
loaded into the presentation server in the
dimensional model
 Normalized database in the staging area is
acceptable for supporting the staging process but
must be off-limits to user queries as they defeat
understandability and performance
Presentation Server
 A target physical machine on which DW data is
organized for
 Direct querying by end users using OLAP
 Report writers
 Data Visualization tools
 Data mining tools
 Data stored in Dimensional framework
 Presentation area is the DW for the end users
 Analogy – Sitting area of a restaurant
Presentation Server
 In Kimball’s approach, presentation area is a series of
integrated data marts (super marts)
 A data mart is a wedge of the overall presentation
area pie
 Data is presented, stored, and accessed in
dimensional schema
 Dimensional modeling is very different from 3NF
modeling (normalized models)
 Normalized modeling is quite useful in OLTP systems
 Not suitable for DW queries!!
Presentation Server
 Data Marts must contain detailed and atomic data
 May also contain aggregated data
 All Data Marts must be built using common dimensions and facts
 Conformed dimensions and facts
 Concept of SUPERMARTS!!
Presentation Server
 Data Warehouse Bus Architecture
 Building a DW in one step is too daunting a task
 Architected, incremental approach to building a DW is
the DW Bus Architecture
 Define a standard bus for the DW environment
 Separate data marts, developed by different groups
at different times, can be plugged together and can
usefully coexist if they conform to the standard
Presentation Server
 According to Kimball –

“Data in the queryable presentation area of the DW


must be dimensional, must be atomic, and must
adhere to the data warehouse bus architecture”
Data Access Tools
 Data access tools query the data in the presentation
area of the DW
 OLAP (On-Line Analytical Processing)
OLAP

• What is OLAP
• Need for OLAP
• Features & functions of OLAP
• Different OLAP models
• OLAP implementations



OLAP
• Term coined in mid 1990s
• Main Goal: support ad-hoc but complex
querying by business analysts
• Extends worksheet like analysis to work with
huge amounts of data in a DW



Demand for OLAP
• Need for Multidimensional Analysis
• Fast Access & Powerful Calculations
• Limitations of other analysis methods like:
– SQL
– Spreadsheets
– Report Writers



OLAP is the Answer!
OLAP is a category of software technology
that enables analysts, managers, and
executives to gain insight into the data through
fast, consistent, interactive access to a wide
variety of possible views of information that
has been transformed from raw data to reflect
the real dimensionality of the enterprise as
understood by the user.



What is OLAP?

OLAP software provides the ability to


analyze large volumes of information to
improve decision making at all levels of an
organization.



What is OLAP?

A wide spectrum of multidimensional analysis


involving intricate calculations and requiring
fast response times.



What is OLAP?

OLAP has two immediate consequences:


online part requires the answers of queries to
be fast, the analytical part is a hint that the
queries itself are complex

i.e., Complex questions with Fast Answers!



Why a separate OLAP tool?

o Empowers end users to do own analysis


o Frees up IS backlog of report requests
o Ease of use
o No knowledge of tables or SQL required



OLAP Characteristics
o Multi-user environment

o Client-server architecture

o Rapid response to queries,


regardless of DB size and complexity



Data Warehouse & OLAP
o OLAP is a software system that works on top of a
DW

o A front-end tool for a DW

o Information delivery system for the DW

o Complements the information delivery capabilities of


a DW



Why is OLAP useful?
 Facilitates multidimensional data
analysis by pre-computing aggregates
across many sets of dimensions
 Provides for:
 Greater speed and responsiveness
 Improved user interactivity



Warehouse Models
& Operators
• Data Models
– relations
– stars & snowflakes
– cubes
• Operators
– slice & dice
– roll-up, drill down
– pivoting
– other
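A hedged SQL sketch of the roll-up operator (GROUP BY ROLLUP is standard SQL; the star-schema names are assumed): it returns quarter-level detail plus year subtotals and a grand total in a single query.

SELECT t.calendar_year, t.calendar_quarter,
       SUM(f.dollars_sold) AS dollars_sold
FROM   sales_fact f
JOIN   time_dimension t ON f.period_key = t.period_key
GROUP BY ROLLUP (t.calendar_year, t.calendar_quarter);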



The Complete Picture
• Source Systems
• Data Staging Area
• Presentation Servers
• Data Mart/Super Marts
• Data Warehouse
• Operational Data Store
• Data Access Tools

Need to understand the interconnect between the


above elements of a Data Warehouse System
Q&A

Thank You

EXTRACT, TRANSFORM, & LOAD

Prof. Navneet Goyal


BITS, Pilani
• Sources used for this lecture
– Ralph Kimball, Joe Caserta, The Data Warehouse ETL
Toolkit: Practical Techniques for Extracting, Cleaning,
Conforming and Delivering Data
Introduction
• Extract
– What data you want in the DW?
• Transform
– In what form you want the extracted data in the DW?
• Load
– Load the transformed extracted data onto the DW
Introduction

• Extract
– Extract relevant data
• Transform
– Transform data to DW format
– Build keys, etc.
– Cleansing of data
• Load
– Load data into DW
– Build aggregates, etc.
ETL System
• Back room or Green room of the DW
• Analogy - Kitchen of a restaurant
– A restaurant's kitchen is designed for efficiency, quality &
integrity
– Throughput is critical when the restaurant is packed
– Meals coming out should be consistent and hygienic
– Skilled chefs
– Patrons not allowed inside
• Dangerous place to be in – sharp knives and hot plates
• Trade secrets
ETL Design & Development
• Most challenging problem faced by the DW project
team
• 70% of the risk & effort in a DW project comes from
ETL
• Has 34 subsystems!!
• Not a one time effort!
– Initial load
– Subsequent loads (periodic refresh of the DW)
• Automation is critical!
Back Room Architecture
• ETL processing happens here
• Availability of right data from point A to point B with
appropriate transformations applied at the
appropriate time
• ETL tools are largely automated, but are still very
complex systems
General ETL Requirements
• Productivity support
– Basic development environment capabilities like code library
management, check in/check out, version control etc.
• Usability
– Must be as usable as possible
– GUI based
– System documentation: developers should easily capture
information about processes they are creating
– This metadata should be available to all
– Data compliance
• Metadata Driven
– Services that support ETL process must be metadata driven
General ETL Requirements
• Business needs – users' information requirements
• Compliance – must provide proof that the data reported is not
manipulated in any way
• Data Quality – garbage in garbage out!!
• Security – do not publish data widely to all decision makers
• Data Integration – Master Data Management System (MDM).
Conforming dimensions and facts
• Data Latency – huge effect on ETL architecture
– Use efficient data processing algorithms, parallelization, and powerful
hardware to speed up batch-oriented data flows
– If the requirement is for Real-time, then architecture must make a
switch from batch to microbatch or stream-oriented
• Archiving & Lineage – must for compliance & security reasons
– After every major activity of the ETL pipeline, writing the data to disk
(staging) is recommended
– All staged data should be archived
Choice of Architecture:
Tool-Based ETL

Simpler, Cheaper & Faster development


• People with business skills & not much technical
skills can use it.
• Automatically generate Metadata
• Automatically generates data Lineage & data
dependency analysis
• Offers in-line encryption & compression capabilities
• Manage complex load balancing across servers
Choice of Architecture:
Hand-Coded ETL

• Quality of tool by exhaustive unit testing


• Better metadata
• Requirement may be just file based processes not
database-stored procedures
• Use of existing legacy routines
• Use of in-house programmers
• Unlimited flexibility
• Cheaper
Middleware & Connectivity Tools

• Provide transparent access to source systems in


heterogeneous computing environments
• Expensive but often prove invaluable because they
provide transparent access to DBs of different types,
residing on different platforms
• Examples:
– IBM: Data Joiner
– Oracle: Transparent Gateway
– SAS: SAS/Connect
– Sybase: Enterprise Connect
Extract
• Extract
– What data you want in the DW?
• Remember that not all data generated by
operational systems is required in the DW
– For example: credit card no. is mandatory in operational
systems but is not required in a DW
• Typically implemented as a GUI
• Table names and their schema is displayed for
extraction
• Tables are selected and then their attributes are
selected
Extraction Tools

• Lot of tools available in the market


• Tool selection tedious
• Choice of tool depends on following factors:
– Source system platform and DB
– Built-in extraction or duplication functionality
– Batch windows of the operational systems
Extraction Methods

• Bulk Extractions
– Entire DW is refreshed periodically
– Heavily taxes the network connections between the source
& target DBs
– Easier to set up & maintain
• Change-based Extractions
– Only data that have been newly inserted or updated in the
source systems are extracted & loaded into the DW
– Places less stress on the network but requires more complex
programming to determine when a new DW record must be
inserted or when an existing DW record must be updated
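A minimal sketch of a change-based extraction, assuming the source table carries a last_updated timestamp and the ETL job remembers the high-water mark of its previous run (:last_extract_time is a bind parameter; all names are assumptions):

SELECT *
FROM   source_orders
WHERE  last_updated > :last_extract_time;
-- a bulk extraction, by contrast, would simply re-read the entire table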
Transformation Tools

• Transform extracted data into the appropriate format,


data structure, and values that are required by the DW
• Features provided:
– Field splitting & consolidation
– Standardization
• Abbreviations, date formats, data types, character formats, time
zones, currencies, metric systems, product keys, coding, etc.
– Deduplication
• A major, non-trivial task
Source System → Type of transformation → DW

• Address Field: "#123 ABC Street, XYZ City 1000, Republic of MN" → Field Splitting → No: 123; Street: ABC; City: XYZ; Country: Republic of MN; Postal Code: 1000
• System A: Customer title: President / System B: Customer title: CEO → Field Consolidation → Customer title: President & CEO
• Order Date: 05 August 1998 / Order Date: 08/08/98 → Standardization → Order Date: 05 August 1998 / Order Date: 08 August 1998
• System A: Customer Name: John W. Smith / System B: Customer Name: John William Smith → Deduplication → Customer Name: John William Smith
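Deduplication can be sketched with a standard window function, assuming a staging table with a row identifier and a natural key (all names illustrative): keep one row per natural key, preferring the most recently updated record.

DELETE FROM staging_customers
WHERE  row_id IN (
    SELECT row_id
    FROM  (SELECT row_id,
                  ROW_NUMBER() OVER (PARTITION BY customer_natural_key
                                     ORDER BY last_updated DESC) AS rn
           FROM   staging_customers) ranked
    WHERE  rn > 1);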
Mission of ETL team

To build the back room of the DW


– Deliver data most effectively to end user tools
– Add value to the data in the cleaning & conforming steps
– Protect & document the lineage of data (data provenance)
Mission of ETL team

The back room must support 4 key steps


– Extracting data from original sources
– Quality assuring & cleaning data
– Conforming the labels & measures in the data to achieve
consistency across the original sources
– Delivering the data in a physical format that can be used by
query tools and report writers
ETL Data Structures

Data Flow: Extract → Clean → Conform → Deliver
• Back room of a DW is often called the data staging
area
• Staging means writing to disk
• ETL team needs a number of different data structures
for all kinds of staging needs
To stage or not to stage

• Decision to store data in physical staging area versus


processing it in memory is ultimately the choice of the
ETL architect
To stage or not to stage
• A conflict between
– getting the data from the operational systems as fast as
possible
– having the ability to restart without repeating the
process from the beginning
• Reasons for staging
– Recoverability: stage the data as soon as it has been
extracted from the source systems and immediately
after major processing (cleaning, transformation, etc).
– Backup: can reload the data warehouse from the
staging tables without going to the sources
– Auditing: lineage between the source data and the
underlying transformations before the load to the data
warehouse
Designing the Back Room
• The back room is owned by the ETL team
– no indexes, no aggregations, no presentation access, no
querying, no service level agreements

• Users are not allowed in the back room for any


reason
– Back room is a construction site

• Reports cannot access data in the back room


– tables can be added, or dropped without notifying the
user community
– Controlled environment
Designing the Back Room (contd…)

• Only ETL processes can read/write in the back room


(ETL developers must capture table names, update
strategies, load frequency, ETL jobs, expected
growth and other details about the staging area)
• The back room consists of both RDBMS tables and
data files
Data Structures in the ETL System

• Flat files
– fast to write, append to, sort and filter (grep) but slow to
update, access or join
• XML Data Sets
– Used as a medium of data transfer between incompatible
data sources
• Relational Tables
Coming up next …

• 34 subsystems of ETL
ETL Subsystems (contd…)
Prof. Navneet Goyal
BITS, Pilani
• Sources used for this lecture
– Ralph Kimball, Joe Caserta, The Data Warehouse ETL
Toolkit: Practical Techniques for Extracting, Cleaning,
Conforming and Delivering Data
34 Subsystems of ETL
• Extracting (1-3)
• Cleaning & Conforming Data (4-8)
• Prepare for Presentation (9-21)
• Managing the ETL Environment (22-34)
Prepare for Presentation (Subsystems 9-21)
• Primary mission of the ETL system
• Delivery subsystems are the most critical subsystems in
the ETL architecture
• Despite variations in source data structures and in cleaning & conforming logic, the delivery processing techniques are far more defined & disciplined
• Many subsystems focus on dimension table processing
– Dimension tables are at the core of any DW
– Provide context for fact tables
• Fact tables are huge and contain critical measurements
of the business, but preparing them for presentation is
straightforward
Prepare for Presentation (Subsystems 9-21)
9. Slowly Changing Dimension (SCD) Manager
– Implements SCD logic
– Handling of update of a dimension attribute value
– Type I, Type II, & Type III responses to updates
– Type IV – Add a mini-dimension
– Type V – Add a mini-dimension & a Type I outrigger
– Type VI – Add Type I attributes to Type II dimensions
– Type VII – Dual Type I & Type II dimensions
(to be covered in Module 5 on Advanced Dimensional Modeling)

10. Surrogate Key Generator*


– Responsible for generating surrogate keys for dimension tables
– Primary keys for dimension tables
– 4 byte Integer keys that coexist with natural keys in dimension tables
(*Surrogate Keys to be covered in Module 5 on Advanced Dimensional
Modeling)
Prepare for Presentation (Subsystems 9-21)
11. Hierarchy Manager
– Hierarchical information is embedded in dimension tables
– Fixed or ragged hierarchies
– Slightly ragged hierarchies like postal address is often modeled
as a fixed hierarchy
– Profoundly ragged hierarchies (e.g., the one found in an organization structure) are modeled using bridge tables
– Snowflakes or normalized structures used by ETL for
populating & maintaining hierarchical attributes
(to be covered in Module 5 on Advanced Dimensional Modeling)

12. Special Dimensions Manager


– Date/time dimensions: No source!!
– Junk dimensions
– Mini Dimensions
(to be covered in Module 5 on Advanced Dimensional Modeling)
Prepare for Presentation (Subsystems 9-21)
13. Fact Table Builders
– Three types of fact tables:
• Transaction, Periodic, Accumulating snapshot
• Maintaining referential integrity with associated dimension
tables
• Surrogate key pipeline subsystem is designed to support
this need

14. Surrogate Key Pipeline*


– Natural keys in incoming FT records must be replaced with
surrogate keys
– Support needed in ETL for this need
(*Surrogate Key Pipeline to be covered in Module 5 on Advanced
Dimensional Modeling)
Prepare for Presentation (Subsystems 9-21)
15. Multivalued Dimension* Bridge Table Builder
– Sometimes a FT must support a dimension that takes on
multiple values at lowest granularity of the FT
• Multiple diagnosis of a patient
• Multiple sales persons for daily sales fact table
– Bridge tables act as a link between multivalued dimension
and FT
16. Late Arriving Data Handler
– Late arriving facts
– Late arriving dimensions

(*Multivalued dimension to be covered in Module 5 on


Advanced Dimensional Modeling)
Prepare for Presentation (Subsystems 9-21)
17. Dimension Manager System
– Centralized authority who prepares & publishes
conformed dimensions
– Easier to manage conformed dimensions in a single
tablespace DBMS on a single machine as there is only one
copy of the dimension table

18. Fact Provider System


– Creation, maintenance, and use of FTs
– If FTs are used in any drill-across applications, then by
definition the fact provider must be using conformed
dimensions provided by the dimension manager
Prepare for Presentation (Subsystems 9-21)
19. Aggregate Builder
– Aggregates* are the single most effective way to improve the performance of a large DW
– Aggregates (precomputed) are like indexes
– Aggregate builder is responsible for building, populating, and maintaining aggregates
• Incremental maintenance
• Dropping & rebuilding
* To be covered in module 7 on Query performance enhancing
techniques
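A sketch of one such precomputed aggregate, assuming the grocery-store style star schema used earlier (CREATE TABLE AS SELECT is widely, though not universally, supported; names are illustrative):

CREATE TABLE monthly_product_sales AS
SELECT t.calendar_year, t.calendar_month, f.product_key,
       SUM(f.sales_quantity) AS sales_quantity,
       SUM(f.sales_amount)   AS sales_amount
FROM   sales_fact f
JOIN   date_dimension t ON f.date_key = t.date_key
GROUP BY t.calendar_year, t.calendar_month, f.product_key;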
20. OLAP Cube Builder
– OLAP cubes** present dimensional data in an intuitive way,
enabling analytical users to slice and dice the data
– Relational dimensional schema is a foundation for OLAP cubes
** To be covered in module 6 on OLAP & Multidimensional
Databases
Prepare for Presentation (Subsystems 9-21)
21. Data Propagation Manager
– Responsible for the ETL processes required to transfer
conformed, integrated enterprise data from the DW
presentation server to other environments for special
purposes like Data Mining
Managing the ETL Environment (22-34)
• A DW system can have great dimensional model &
well deployed applications
• Will not be successful until it is relied upon as a
dependable source for decision making
• DW must build a reputation for providing timely,
consistent and reliable data to empower the
business
• To achieve this goal, ETL system must fulfill three
criteria:
– Reliability
– Availability
– Manageability
Managing the ETL Environment (22-34)
• For a first course on Data Warehousing, managing
the ETL environment is not required
• Will discuss the subsystems 22-34 as and when
need arises as we move forward
• Subsystems are being enumerated for the sake of
completeness
Managing the ETL Environment (22-34)
22. Job Scheduler
23. Backup Systems
24. Recovery & Restart System
25. Version Control System
26. Version Migration System
27. Workflow Monitor
28. Sorting System
29. Lineage & Dependency Analyzer
30. Problem Escalating System
31. Parallelizing/Pipelining System
32. Security System
33. Compliance Manager
34. Metadata Repository Manager
Conclusion
• Key building blocks of ETL system introduced
• Next is to use these building blocks to assemble the
ETL system
• Building an ETL system is unusually challenging
• 34 subsystems working in tandem
What’s left in ETL
• Designing & Developing the ETL Systems – Key
Steps!!
• Will be appropriate after we have covered modules
5, 6, 7 & 8
Surrogate Keys &
Changing Dimensions
Prof. Navneet Goyal
Computer Science Department
BITS, Pilani
Lecture Objectives
• Surrogate keys
– Advantages
– Generation
• Changing dimensions
• Why we need to handle them?
• Role of surrogate keys in handling changing
dimensions



OLTP – Natural Keys
• Production Keys
• Intelligent Keys
• Smart Keys

NKs tell us something about the


record they represent
For eg. Student IDNO 2003B4A7290



DW - Surrogate Keys

• Integer keys
• Artificial Keys
• Non-intelligent Keys
• Meaningless Keys

SKs do not tell us anything about


the record they represent



Surrogate Keys - Advantages
• Saves Space
• Faster Joins
• Buffering DW from operational changes
• Allows proper handling of changing
dimensions



Surrogate Keys - Advantages
Space Saving
• Surrogate Keys are integers
• 4 bytes of space
• Are 4 bytes enough?
• Nearly 4 billion values!!!
• For example
– Date data type occupies 8 bytes
– 10 million records in fact table
– Space saving=4x10million bytes =38.15 MB



Surrogate Keys - Advantages
Faster Joins
• Every join between dimension table and fact
table is based on SKs and not on NKs
• Which is faster?
– Comparing 2 strings
– Comparing 2 integers



Surrogate Keys - Advantages
Buffering DW from
operational changes
• Production keys are often reused
– For Eg. Inactive account numbers or obsolete
product codes are reassigned after a period of
dormancy
– Not a problem in operational system, but can cause
problems in a DW
• SKs allow the DW to differentiate between the
two instances of the same production key



Surrogate Keys - Advantages
Handling Changing
Dimensions
• What are changing dimensions?
• Why we need to handle them?
• How we can handle them?



Changing Dimensions
• Slowly Changing Dimensions
• Rapidly Changing Dimensions
• Small Dimensions
• Monster Dimensions



Slowly Changing Dimensions (SCDs)
• Type I Change (overwrite)
• Type II Change (new record)
• Type III Change (new attribute)
• Hybrid Approach
– Predictable changes with multiple version overlays
– Unpredictable changes with single version overlays



Slowly Changing Dimensions
Type I Change

 Overwrite Old Value


 Used in cases where old values have no
significance
 Error correction
 Spelling error
 Example: dimension row 1234567 | Navneet Goyal | Pilani | 7644807 | CUST111 – the old phone value (…244807) has simply been overwritten with the corrected value 7644807
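A Type I change is just an in-place overwrite; a hedged sketch against an assumed customer_dimension table:

UPDATE customer_dimension
SET    phone = '7644807'          -- corrected value
WHERE  customer_key = 1234567;    -- surrogate key of the row
-- no history is kept: every existing fact row now sees the new value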



Slowly Changing Dimensions
Type I Change
 Fast & Easy to implement
 Attribute value always reflect the latest
assignment
 No history of prior attribute values
 In DW environment, can we afford to do
that?
 NO!!!



Slowly Changing Dimensions
Type I Change: PROBLEMS
 Example
SK Description Department NK

12345 Intellikidz1 Education ABC922Z

12345 Intellikidz1 Strategy ABC922Z

 History of attribute changes is lost


 If there is any increase in sale of the
product, the management would not
know the reason
 Aggregates over department have to be
rebuilt



Slowly Changing Dimensions
Type II Change

 Add a dimension row


 Keeps track of history
 New record approach
 Cannot be implemented without the
help of SKs!!



Slowly Changing Dimensions
Type II Change: Examples

Fact table rows reference the surrogate keys 1234567 and 1234600; the dimension table carries two rows for the same natural key:

SK       Name           Marital Status   NK
1234567  Navneet Goyal  Single           CUST11111
1234600  Navneet Goyal  Married          CUST11111

SK     Product       Department  NK
12345  Intellikidz1  Education   ABC922Z
12467  Intellikidz1  Strategy    ABC922Z
Slowly Changing Dimensions
Type II Change: Examples

SK Product Department NK
12345 Intellikidz1 Education ABC922Z

12467 Intellikidz1 Strategy ABC922Z

Adding the following columns helps (see the sketch below):


• Row Effective date
• Row Expiration date
• Current Row indicator
4/20/2018 Dr. Navneet Goyal, BITS Pilani 17
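A minimal sketch of such a Type 2 dimension, assuming illustrative table and column names (not taken from the slides):

-- Hypothetical Type 2 customer dimension with the three tracking columns
CREATE TABLE customer_dim (
    customer_sk         INTEGER PRIMARY KEY,   -- surrogate key
    customer_nk         VARCHAR2(20),          -- natural / production key
    customer_name       VARCHAR2(100),
    marital_status      VARCHAR2(10),
    row_effective_date  DATE,
    row_expiration_date DATE,                  -- e.g. 31-Dec-9999 for the current row
    current_row_ind     CHAR(1)                -- 'Y' only for the in-force version
);

-- Which version of CUST11111 was in force on 15-Jun-2015?
SELECT customer_sk, marital_status
FROM   customer_dim
WHERE  customer_nk = 'CUST11111'
AND    DATE '2015-06-15' BETWEEN row_effective_date AND row_expiration_date;

-- Current version only
SELECT customer_sk, marital_status
FROM   customer_dim
WHERE  customer_nk = 'CUST11111'
AND    current_row_ind = 'Y';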
Slowly Changing Dimensions
Type II Change: Advantages
 Automatically partitions history in the fact
table
 Customer profile is easily differentiated
 Tracks as many dimension changes as
required
 No need to rebuild aggregates

4/20/2018 Dr. Navneet Goyal, BITS Pilani 18


Slowly Changing Dimensions
Type II Change: Disadvantages
 Dimension table can become big
 Does not allow association of the new attribute value with old fact history & vice-versa
 When we constrain on Dept=Strategy, we will not see Intellikidz1 facts from before the change date

4/20/2018 Dr. Navneet Goyal, BITS Pilani 19


Slowly Changing Dimensions
Type III Change:
 Add a dimension column
 Alternate Reality
 Both current & prior values can be regarded as true at the same time
 New and historical fact data can be seen either with the new or prior attribute values
 Not used very often


4/20/2018 Dr. Navneet Goyal, BITS Pilani 20
Slowly Changing Dimensions
Type III Change: Example

SK Product Department NK
12345 Intellikidz1 Education ABC922Z

SK Product Old_Dept New_Dept NK


12345 Intellikidz1 Education Strategy ABC922Z

4/20/2018 Dr. Navneet Goyal, BITS Pilani 21


Slowly Changing Dimensions
Type III Change: Problems
 Good for handling predictable changes
 Can lead to a lot of wasted space
 Cannot cope with a myriad of unpredictable changes
 Cannot track the impact of numerous intermediate attribute values

4/20/2018 Dr. Navneet Goyal, BITS Pilani 22


Generating Surrogate Keys
• Initial Historic Load
– Straightforward

• Subsequent Load
– Relatively complex

4/20/2018 Dr. Navneet Goyal, BITS Pilani 23


Generating Surrogate Keys
Initial Load
Assumption: Deduplication has been done in the DSA (data staging area)
Production Key Surrogate Key
Prod 1 1
Prod 2 2
Prod 3 3
and so on…

4/20/2018 Dr. Navneet Goyal, BITS Pilani 24


Generating Surrogate Keys
Subsequent Load
– Data Warehouse Refresh
– New data has to be brought into the DW
– Old data is to be archived
– Rolling window
– Every incoming NK is to be compared with the existing NKs in the
dimension table
– If it does not exist – simply assign a new SK
– If it exists – do a field-by-field comparison to see if any attributes
have changed
• If no change – simply ignore it
• If any change – assign a new SK (Type 2 change, see the SQL sketch below)

4/20/2018 Dr. Navneet Goyal, BITS Pilani 25
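A simplified sketch of the refresh logic above, assuming the same illustrative customer_dim table as earlier plus a hypothetical staging table and sequence; a real ETL tool (or a single MERGE statement) would normally perform these steps:

-- Step 1: brand-new natural keys get a fresh surrogate key
INSERT INTO customer_dim (customer_sk, customer_nk, customer_name, marital_status,
                          row_effective_date, row_expiration_date, current_row_ind)
SELECT customer_sk_seq.NEXTVAL, s.customer_nk, s.customer_name, s.marital_status,
       TRUNC(SYSDATE), DATE '9999-12-31', 'Y'
FROM   staging_customer s
WHERE  NOT EXISTS (SELECT 1 FROM customer_dim d WHERE d.customer_nk = s.customer_nk);

-- Step 2: existing natural keys whose attributes changed -> expire the current row ...
UPDATE customer_dim d
SET    d.row_expiration_date = TRUNC(SYSDATE), d.current_row_ind = 'N'
WHERE  d.current_row_ind = 'Y'
AND    EXISTS (SELECT 1
               FROM   staging_customer s
               WHERE  s.customer_nk = d.customer_nk
               AND   (s.customer_name  <> d.customer_name
                  OR  s.marital_status <> d.marital_status));
-- ... then insert a new version row with a fresh surrogate key for those same natural keys
-- (same INSERT pattern as Step 1, restricted to the changed keys). Unchanged rows are ignored.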


Generating Surrogate Keys

Figure taken from Kimball's article:
The original loading of a dimension. Surrogate keys are just assigned sequentially to every input record. The original production key becomes an ordinary attribute.
Generating Surrogate Keys
Subsequent Load: Lookup Tables
• Dimension tables may have hundreds of attributes
• Only a few records can be loaded into memory for
comparison
• Lookup tables contain a mapping between NK & SK
Production Key Surrogate Key
Prod1 1
Prod2 2
Prod3 3
Prod4 4
Prod5 5

4/20/2018 Dr. Navneet Goyal, BITS Pilani 27


Lookup Tables: Advantages
• Makes generation of SKs faster
– Can be indexed suitably to further speed up the
process
• Refreshing of Dimension Tables speeds up
• Populating Fact Tables becomes faster
• Always points to the latest dimension record
Production Key Surrogate Key
Prod1 6
Prod2 2
Prod3 3
Prod4 4
Prod5 5
4/20/2018 Dr. Navneet Goyal, BITS Pilani 28
Order of Load

• Dimension tables
• Fact Tables
• Fact Table Loading:
– In the FT record, simply replace the natural key with the
surrogate key (see the sketch below)

4/20/2018 Dr. Navneet Goyal, BITS Pilani 29
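A minimal sketch of the key-substitution step, with hypothetical staging, lookup, and fact table names:

-- The incoming natural (production) key is swapped for the current surrogate key
-- via the lookup table while the fact row is loaded.
INSERT INTO sales_fact (date_key, product_key, sales_amount)
SELECT s.date_key, l.surrogate_key, s.sales_amount
FROM   staging_sales  s
JOIN   product_lookup l ON l.production_key = s.production_key;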


Lookup Tables

Figure taken from


Kimball’s article

The lookup table for a typical dimension. There are as many rows as there
are unique production keys. The second column is the currently in-force
surrogate key used with each production key.

4/20/2018 Dr. Navneet Goyal, BITS Pilani 30


Figure taken from Kimball's article:
Dimension processing logic for all refreshes of a dimension table after the original load.
Figure taken from Kimball's article:
The pipelined, multithreaded fact table processing logic for replacing all production keys (designated here as IDs) with current surrogate keys.
Coming up Next…

• Type 4 – Add a Mini-dimension


• Type 5 – Mini-dimension and Type 1 outrigger
• Type 6 – Add Type 1 attributes to Type 2 dimensions
• Type 7 – Dual Type 1 & Type 2 dimensions

4/20/2018 Dr. Navneet Goyal, BITS Pilani 33


Q&A
Thank You
Mini-dimensions &
Outriggers

Prof. Navneet Goyal


Department of Computer Science & Information Systems
BITS, Pilani
Topics
 Mini-dimensions

 Outriggers

 Type 4 – Add a Mini-dimension

Hybrid Techniques:

 Type 5 – Mini-dimension and Type 1 outrigger

 Type 6 – Add Type 1 attributes to Type 2 dimensions

 Type 7 – Dual Type 1 & Type 2 dimensions


Monster Dimension
Customer dimensions can be very wide
- Dozens or hundreds of attributes
Customer dimensions can be very large
- Tens of millions of rows in some warehouses
- Sometimes includes prospects as well as
actual customers
Size can lead to performance challenges
- One case when performance concerns can
trump simplicity
- Can we reduce width of dimension table?
- Can we reduce number of rows that get added
by implementing Type 2?
Rapidly Changing Monster Dimensions:
The Worst Scenario

 Multi-million-row customer dimensions present unique challenges that warrant special treatment
 Type-2 SCD not recommended
 Business users often want to track the myriad of changes to customer attributes
 Insurance companies must update information about their customers and their specific insured automobiles & homes
 Poses both browsing-performance and change-tracking challenges
SOLUTION!!!
Mini-Dimensions
 Single technique to handle browsing-
performance & change tracking problems

 Separate out frequently analyzed or


frequently changing attributes into a
separate dimension, called mini-dimension
Mini-Dimensions
 Separate out a package of demographic
attributes into a demographic mini-
dimension
 Age, gender, marital status, no. of
children, income level, etc.
 One row in mini-dimension for each unique
combination of these attributes
Mini-Dimensions

Demographic Key  AGE    GENDER  INCOME LEVEL
1                20-24  M       < 20000
2                20-24  M       20K-24999
3                20-24  M       25K-29999
18               25-29  M       20K-24999
10               25-29  M       25K-29999
Mini-Dimensions
 The mini-dimension itself cannot be allowed
to grow very large
 5 demographic attributes
 Each attribute can take 10 distinct values
 How many rows in the mini-dimension? 10^5 = 100,000
Creating Mini-Dimensions

Include foreign keys to both customer dimension &


mini-dimension in fact table
Mini-Dimensions
Advantages
- History preserved without space blow-up
- FT captures historical record of attribute values
- Mini-dimension has a small no. of rows
- # of unique combinations of MD attributes is small
- Consequence of discretization
- 5 attributes with 10 possible values each give 100,000 rows
- Limit the no. of attributes in a single MD
- Improves performance for queries that use the MD
- At least those queries that don't use the main customer dim.

4/20/2018 Prof. Navneet Goyal, BITS, Pilani 10


Mini-Dimensions
Disadvantages
- Fact table width increases
- Due to increase in no. of dimension foreign keys
- Information lost due to discretization
- Less detail is available
- Impractical to change bucket/band boundaries
- Additional tables introduced
- Users must remember which attributes are in mini-
dimension vs. main customer dimension
Snowflaking & Outriggers
Snowflaking is removal of low cardinality columns
from dimension tables to separate normalized
tables.
Snowflaking not recommended
User presentation becomes difficult
Negative impact on browsing performance
Query response time suffers
Prohibits the use of Bitmap Indexes
Some situations permit the use of dimension
outriggers
Outriggers have special characteristics that make
them permissible snowflakes

Outrigger Tables
Limited normalization of large dimension
table to save space
Identify attribute sets with these
properties:
- Highly correlated
- Different grain than the dimension
(# of customers)
- Change in unison

Outriggers
Example:
A set of data from an external data provider consisting of
150 demographic & socioeconomic attributes regarding the
customer’s district of residence
Data for all customers residing in a particular district is
identical
Instead of repeating this large block of data for all
customers, we model it as an outrigger

Outriggers
How To:
Follow these steps for each attribute set:
1. Create a separate “outrigger dimension” for
each attribute set
2. Remove the attributes from the customer
dimension
3. Replace with a foreign key to the outrigger
table
4. No foreign key from fact row to outrigger
- Outrigger attributes indirectly associated with facts via
customer dimension

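Following the steps above, a minimal outrigger sketch with hypothetical table and column names:

-- District-level demographics are factored out of the customer dimension into an outrigger.
CREATE TABLE district_demographics_outrigger (
    district_demo_key  INTEGER PRIMARY KEY,
    district_name      VARCHAR2(60),
    median_income      NUMBER,
    population_count   NUMBER
    -- ... the remaining demographic & socioeconomic attributes
);

CREATE TABLE customer_dimension (
    customer_sk        INTEGER PRIMARY KEY,
    customer_name      VARCHAR2(100),
    district_demo_key  INTEGER REFERENCES district_demographics_outrigger
    -- ... all other customer attributes stay here
);
-- The fact table carries only customer_sk; outrigger attributes are reached
-- indirectly: fact -> customer dimension -> outrigger (no FK from the fact row).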
Outriggers
Reasons:
Demographic data is available at a significantly different
grain than the primary dimension data (district vs. individual
customer)

The data is administered & loaded at different times than the


rest of the data in the customer dimension

4/20/2018 Prof. Navneet Goyal, BITS, Pilani 16


Outriggers
Advantages:
Space savings
- Customer dimension table becomes narrower
- Outrigger table has relatively few rows
- One copy per district vs. one copy per customer
Disadvantages:
Additional tables introduced
- Accessing outrigger attributes requires an extra join
- Users must remember which attributes are in outrigger
vs. main customer dimension
- Creating a view can solve this problem

4/20/2018 Prof. Navneet Goyal, BITS, Pilani 17


Outriggers
Dimension outriggers are permissible, but
they should be exceptions rather than the
rule
Avoid having too many outriggers in your
schema
If query tool insists on a classic star
schema, we can hide the outrigger under a
view declaration

Outriggers vs. Mini-Dimensions
Figure: Outrigger design (Fact -> Customer Dimension -> Outrigger) vs. mini-dimension design (Fact -> Mini-dimension, Fact -> Customer Dimension)

Type 4: Add a Mini-dimension

Figure Taken from Kimball’s book – The DW toolkit, 3e

Type 5: Mini-dimension & Type 1 Outrigger
4+1=5!!
Type 4 mini-dimension with type 1 outrigger
Both mini-dimension & outrigger exist

Figure Taken from Kimball’s book – The DW toolkit, 3e

Type 5: Mini-dimension & Type 1 Outrigger
Why not implement it as Type 2 outrigger?
We would then be capturing volatile changes in
the monster customer dimension
A problem that we set out to solve using
Type 4!!

Figure Taken from Kimball’s book – The DW toolkit, 3e


Type 5: Mini-dimension & Type 1 Outrigger
Current profile counts can be obtained even in the
absence of fact table metrics

Figure Taken from Kimball’s book – The DW toolkit, 3e


Type 6: Add Type 1 attributes to Type 2 dimension
Combines Type 1, 2, & 3!!
1+2+3=6 and also 1*2*3=6!!

Figure Taken from Kimball’s book – The DW toolkit, 3e


Type 7: Dual Type 1 & Type 2 dimensions
7 is just a number here!!
Is it feasible to support both the current &
historic perspectives for 150 attributes in a
large dimension table?
Enter Type 7 hybrid
Include the Natural Key as a FK in the FT along
with the Surrogate Key for Type 2 tracking
If NK is unwieldy or ever reassigned, use a
separate supernatural key instead

Figure Taken from Kimball’s book – The DW toolkit, 3e


Type 7: Dual Type 1 & Type 2 dimensions

Figure Taken from Kimball’s book – The DW toolkit, 3e


Summary: Type 1 to Type 7

Figure Taken from Kimball’s book – The DW toolkit, 3e


Q&A
Thank You
Time Dimension

Prof. Navneet Goyal


Department of Computer Science & Information Systems
BITS, Pilani
Time Dimension
 Time is a unique & powerful
dimension in every DM & EDW
 Time dimension is very special &
should be treated differently
from other dimensions
 Example of a Star Schema
• Fact table records daily orders received
by a manufacturing company
• Time dimension designates calendar
days
FAQs About Time Dimension

 Why can’t I just leave out the


time dimension?
• Dimension tables serve as the source of
constraints and as the source of report
row headers
• A data mart is only as good as its
dimension tables
• SQL provides some minimal assistance in
navigating dates
• SQL certainly doesn't know anything
about your corporate calendar, your
fiscal periods, or your seasons
FAQs About Time Dimension
 If I have to have a time
dimension, where do I get it?
• Build it in a spreadsheet
• Some data marts additionally track time
of day to the nearest minute or even
the nearest second. For these cases
separate the time of day measure out as
a separate "numeric fact." It should not
be combined into one key with the
calendar day dimension. This would
make for an impossibly large time table.
Time Dimension

 Guard against incompatible rollups


like weeks & months
 Separate fact tables denominated
in weeks and months should be
avoided at all costs
 Uniform time rollups should be
used in all the separate data marts
 Daily data rolls up to every
possible calendar
Time Dimension

 Be careful about aggregating


non-additive facts wrt time
 Examples: inventory levels &
account balances
 We need “average over time”
 SQL AVG ?
 Moving Avg is now supported
in some RDBMSs
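A small sketch of the point, assuming a hypothetical month-end account balance fact; a plain SQL AVG over all fact rows averages per account-month row, whereas "average over time" divides by the number of periods:

-- Average total balance over a year: sum across accounts and months, then divide
-- by the number of months (not by the number of fact rows).
SELECT SUM(month_end_balance) / COUNT(DISTINCT month_key) AS avg_total_balance
FROM   account_balance_fact
WHERE  month_key BETWEEN 201801 AND 201812;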
Time Dimension

Figure Taken from Kimball’s article


Time Dimension

Source: http://www.yellowfinbi.com/
Q&A
Thank You
Conformed Dimensions
& Facts

Prof. Navneet Goyal


Department of Computer Science & Information Systems
BITS, Pilani
Topics
 Types of queries in a DW
 Drill Across
 Outer Join
 Conformed Dimensions
 Conformed Facts
Types of Queries
 Outside-in queries
 Inside-out queries
 Standard OLAP queries are fact-focused
 Query touches one fact table and its associated
dimensions
 Some types of analysis are dimension-focused
 Bring together data from different fact tables that have
a dimension in common
 Common dimension used to coordinate facts
 Common dimension must be “conformed”
 Sometimes referred to as “drilling across”
Drill-Across Example
 Example scenario:
 Sales fact with dimensions (Date, Customer, Product, Store)
 CustomerSupport fact with dimensions (Date, Customer, Product,
ServiceRep)
 Question: How does frequency of support calls by California
customers affect their purchases of Product X?
 Step 1: Query CustomerSupport fact
 Group by Customer SSN
 Filter on State = California
 Compute COUNT
 Query result has schema (Customer SSN, SupportCallCount)
 Step 2: Query Sales fact
 Group by Customer SSN
 Filter on State = California, Product Name = Product X
 Compute SUM(TotalSalesAmt)
 Query result has schema (Customer SSN, TotalSalesAmt)
 Step 3: Combine query results
 Join Result 1 and Result 2 based on Customer SSN
 Group by SupportCallCount
 Compute COUNT, AVG(TotalSalesAmt)
A Problem with the Example
 What if some customers don’t make any support
calls?
 No rows for these customers in CustomerSupport fact
 No rows for these customers in result of Step 1
 No data for these customers in result of Step 3
 Solution: use outer join in Step 3
 Customers who are in Step 2 but not Step 1 will be
included in result of Step 3
 Attributes from Step 1 result table will be NULL for
these customers
 Convert these NULLs to an appropriate value before
presenting results
• Using SQL NVL() function
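A sketch of the three steps (with the outer-join fix) as a single statement, assuming hypothetical table and column names:

WITH support AS (
  SELECT c.customer_ssn, COUNT(*) AS support_call_count
  FROM   customer_support_fact f
  JOIN   customer_dim c ON c.customer_key = f.customer_key
  WHERE  c.state = 'California'
  GROUP  BY c.customer_ssn
),
sales AS (
  SELECT c.customer_ssn, SUM(f.total_sales_amt) AS total_sales_amt
  FROM   sales_fact f
  JOIN   customer_dim c ON c.customer_key = f.customer_key
  JOIN   product_dim  p ON p.product_key  = f.product_key
  WHERE  c.state = 'California' AND p.product_name = 'Product X'
  GROUP  BY c.customer_ssn
)
SELECT NVL(su.support_call_count, 0) AS support_call_count,   -- 0 for customers with no calls
       COUNT(*)                      AS num_customers,
       AVG(sa.total_sales_amt)       AS avg_sales
FROM   sales sa
LEFT   JOIN support su ON su.customer_ssn = sa.customer_ssn   -- keep customers with no support calls
GROUP  BY NVL(su.support_call_count, 0);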
Outer Join
 A plain (inner) join selects only rows common to
the participating tables
 Sometimes we want to select rows from one table regardless of
whether they are present in the second table
 OUTER JOIN is the solution
 In Oracle's older syntax, we place a "(+)" in the WHERE
clause next to the table that may be missing rows, i.e., on the
opposite side of the table whose rows we want to keep in full.
Outer Join
Store_Information                        Geography
store_name    Sales   Date               region_name  store_name
Los Angeles   $1500   Jan-05-1999        East         Boston
San Diego     $250    Jan-07-1999        East         New York
Los Angeles   $300    Jan-08-1999        West         Los Angeles
Boston        $700    Jan-08-1999        West         San Diego

- We want to find out the sales amount for all of the stores
- If we do a regular join, we will not be able to get what we want because we will
have missed "New York," since it does not appear in the Store_Information table

SELECT A1.store_name, SUM(A2.Sales) SALES
FROM Geography A1, Store_Information A2
WHERE A1.store_name = A2.store_name (+)
GROUP BY A1.store_name

Result:
store_name    SALES
Boston        $700
New York
Los Angeles   $1800
San Diego     $250
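The same query can also be written in ANSI join syntax, with NVL showing 0 instead of NULL for stores that have no sales rows (a sketch equivalent to the Oracle "(+)" form above):

SELECT g.store_name,
       NVL(SUM(s.Sales), 0) AS sales
FROM   Geography g
LEFT   OUTER JOIN Store_Information s
       ON s.store_name = g.store_name
GROUP  BY g.store_name;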
NVL Function
 In Oracle/PLSQL, the NVL function lets you substitute a value when a null value is
encountered.
NVL (string1, replace_with)
string1 is the string to test for a null value; replace_with is the value returned if
string1 is null.
Example #1:
select NVL (supplier_city, 'n/a')
from suppliers;
The SQL statement above would return 'n/a' if the supplier_city field contained a
null value. Otherwise, it would return the supplier_city value.
Example #2:
select supplier_id,
NVL (supplier_desc, supplier_name)
from suppliers;
This SQL statement would return the supplier_name field if the supplier_desc
contained a null value. Otherwise, it would return the supplier_desc.
Example #3:
select NVL (commission, 0)
from sales;
This SQL statement would return 0 if the commission field contained a null
value. Otherwise, it would return the commission field.
Conformed Dimensions
 Dimension tables conform when attributes in
separate dimension tables have the same
column names and domain contents
 Information from separate fact tables can be
combined in a single report by using conformed
dimension attributes that are associated with
each fact table
 Conformed dimensions are reused across fact
tables
 Refer to ETL subsystem 8: Conforming System
Conformed Dimensions
 Bottom-up data warehousing approach builds one data
mart at a time
 Drill-across between data marts requires common
dimension tables
 Common dimensions and attributes should be
standardized across data marts
 Create master copy of each common dimension table
 Three types of “conformed” dimensions:
 Dimension table identical to master copy
 Dimension table has subset of rows from the master copy
• Can improve performance when many dimension rows are not
relevant to a particular process
 Dimension table has subset of attributes from master copy
• Allows for roll-up dimensions at different grains (used in
Aggregation)
Conformed Dimension Example
 Monthly sales forecasts
 Predicted sales for each brand in each district in each month
 POS Sales fact recorded at finer-grained detail
• Product SKU vs. Brand
• Date vs. Month
• Store vs. District
 Use roll-up dimensions
 Brand dimension is rolled-up version of master Product
dimension
• One row per brand
• Only include attributes relevant at brand level or higher
 Month dimension is rolled-up Date
 District dimension is rolled-up Store
 Brand, Month, & District are conformed dimensions
Conformed Facts
 If the same measurement appears in separate fact
tables, care must be taken to make sure that
technical definitions of the facts are identical if they
are to be compared or computed together
 If separate fact definitions are consistent, the
conformed facts should be identically named,
otherwise they should be differently named
 Examples: Revenue, profit, standard prices & costs,
measures of quality and customer satisfaction and
other KPIs are facts that must conform
 95% of the data architecture effort goes into designing
conformed dimensions and only 5% goes into
establishing conformed fact definitions
Drill-Across Example
 Question: How did actual sales diverge from forecasted sales in
Sept. '04?
 Drill-across between Forecast and Sales
 Step 1: Query Forecast fact
 Group by Brand Name, District Name
 Filter on MonthAndYear =‘Sept 04’
 Calculate SUM(ForecastAmt)
 Query result has schema (Brand Name, District Name, ForecastAmt)
 Step 2: Query Sales fact
 Group by Brand Name, District Name
 Filter on MonthAndYear =‘Sept 04’
 Calculate SUM(TotalSalesAmt)
 Query result has schema (Brand Name, District Name, TotalSalesAmt)
 Step 3: Combine query results
 Join Result 1 and Result 2 on Brand Name and District Name
 Result has schema (Brand Name, District Name, ForecastAmt,
TotalSalesAmt)
 Outer join unnecessary assuming:
• Forecast exists for every brand, district, and month
• Every brand has some sales in every district during every month
Multi-valued Dimensions

Prof. Navneet Goyal


Department of Computer Science & Information Systems
BITS, Pilani
Topics
 Multi-valued Dimensions

 Bridge & helper Tables

 Example modeling situations


Multi-valued Dimensions
 Declaring grain of the fact table is one
of the important design decisions
 Grain declares the exact meaning of a
single fact record
 If the grain of the FT is clear, choosing
Dimensions becomes easy
Multi-valued Dimensions
 John & Mary Smith a single household
 John has a current account
 Mary has a savings account
 John & Mary have a joint current
account, & credit card
 An account can have one, two or more
customers associated with it
Multi-valued Dimensions
 Customer as an account dimension attribute?
 Doing so violates the granularity of the
dimension table as more than one customer
could be associated with an account
 Customer as an additional dimension in the
FT?
 Doing so violates the granularity of the FT
(one row per account per month)
 Classic example of a multi-valued dimension
 How to model multi-valued dimensions?
Bridge Tables
 Account to Customer BRIDGE table
Fact Table (account_id, ...) -> Account Dimension (account_id, account-related attributes)
                             -> Bridge Table (account_id, customer_id, weight)
                             -> Customer Dimension (customer_id, customer-related attributes)
Bridge Tables

Figure taken from Kimball’s book – The Data Warehouse Toolkit, 3e
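A sketch of querying through such a bridge, using the key and weight columns from the diagram above and a hypothetical fact table and measure:

-- Allocate each account-level amount to its customers using the bridge's weighting factor.
SELECT c.customer_id,
       SUM(f.balance_amount * b.weight) AS allocated_balance
FROM   account_fact            f
JOIN   account_customer_bridge b ON b.account_id  = f.account_id
JOIN   customer_dimension      c ON c.customer_id = b.customer_id
GROUP  BY c.customer_id;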


Multi-valued Dimensions:
Example
 In the classical Grocery Store Sales Data
Mart, the following dimensions are obvious for
the daily grain:
• Calendar Date
• Product
• Store
• Promotion (assuming the promotion remains in
effect the entire day)
 What about the Customer & Check-out clerk
dimensions?
 Many values at the daily grain!
Multi-valued Dimensions
 We disqualified the customer and check-out
clerk dimension
 Many dimensions can get disqualified if we
are dealing with an aggregated FT
 The more the FT is summarized, the fewer
dimensions we can attach to the fact
records
 What about the converse?
 The more granular the data, the more
dimensions make sense!!
Multi-valued Dimensions
 Single-valued dimensions are welcome
 Any legitimate exceptions?
 If yes, how to handle them?
 Example: Healthcare Billing
 Grain is individual line item on a
doctor/hospital bill
 The individual line items guide us
through to the dimensions
Multi-valued Dimensions:
Healthcare Example
 Dimensions
• Calendar Date (of incurred charges)
• Patient
• Doctor (usually called ‘provider’)
• Location
• Service Performed
• Diagnosis
• Payer (insurance co., employer, self)
 In many healthcare situations, there may be
multiple values for diagnosis
 Really sick people having 10 different
diagnoses!!
 How to model the Diagnosis Dimension?
Multi-valued Dimensions:
Healthcare Example
1. Disqualify the Diagnosis Dimension because it is MV
• Easy way out but not recommended
2. Choose primary diagnosis & ignore others
• Primary or admitting diagnosis
• Modeling problem taken care of, but is the diagnosis
information useful in any way?
3. Extend the dimension list to have a fixed number of Diagnosis
dimensions (positional design)
• Create a fixed number of additional Diagnosis dimension slots
in the fact table key
• There will always be some complicated case of a very sick patient
who exceeds the number of Diagnosis slots you have allocated
• Multiple separate Diagnosis dimensions cannot be queried
easily
• If "headache" is a diagnosis, which Diagnosis dimension should
be constrained?
• Avoid the multiple-dimensions style of design, as logic across
dimensions is notoriously slow on relational databases
Multi-valued Dimensions:
Healthcare Example
 Modeling Diagnoses Dimension:
4 ways:
1. Disqualify the Diagnosis Dimension
because it is MV
2. Choose primary diagnosis & ignore
others
3. Extend the dimension list to have a
fixed number of Diagnosis dimensions
(positional design)
4. Put a helper table in between this fact
table and the Diagnosis dimension
table
Multi-valued Dimensions
Bridge Table Approach
• Helper table clearly violates the classic star
join design where all the dimension tables
have a simple one-to-many relationship to
the fact table
• But it is the only viable solution for
handling MV dimensions
• Positional design is not scalable
• We can preserve the star join illusion in
end-user interfaces by creating a view that
prejoins the fact table to the helper table
Coming up next…
 Role Playing Dimensions
 Dimension Hierarchies
 Factless Fact Tables
Dimension Hierarchies
Prof. Navneet Goyal
Department of Computer Science & Information Systems
BITS, Pilani
Most of the material for the presentation is based on
the book:
The Data Warehouse Toolkit, 3e by
Ralph Kimball
Margy Ross
Dimension Hierarchies
 Hierarchies are present in dimensions
 Hierarchies are of different types:
 Fixed depth positional hierarchies
 Slightly ragged/variable depth hierarchies
 Ragged/variable depth hierarchies with hierarchy bridge
tables
 Ragged/variable depth hierarchies with Pathstring
attributes

ETL Reference: Subsystem #11 – Hierarchy Manager


Fixed depth positional hierarchies
 A series of many-to-one relationships
 Examples:
 Product to brand to subcategory to category to dept.
 Day to week to year
 In fixed depth hierarchy, hierarchy levels have
predefined names
 Hierarchy levels appear as separate positional
attributes in a dimension table
 Easiest to understand and navigate
Slightly ragged/variable depth hierarchies
 When the hierarchy is not a series of many-to-one
relationships and the number of levels varies such
that they do not have agreed upon names –
ragged/variable depth hierarchies
 Example:
 Geographic hierarchies – levels vary from 3-6
 Force-fit slightly ragged hierarchies into a fixed-depth
positional design
Ragged/variable depth hierarchies:
Bridge Tables

 Ragged hierarchies of indeterminate depth are


difficult to model and query in a relational setup
 SQL extensions and OLAP tools provide support
for recursive parent-child relationships but are still
not sufficient
 Ragged hierarchy is modeled using Bridge tables
 Contains a row for every possible path in the ragged
hierarchy and enables all kinds of hierarchical traversals
using SQL
Ragged/variable depth hierarchies:
Pathstring Attributes

 Use of bridge tables can be avoided by using a


pathstring attribute in the dimension
 For each row, the pathstring attribute consists of a
special encoded text string which is a complete path
description from the topmost node of the
hierarchy down to the node described by the row
 No need to resort to SQL extensions
 Vulnerable to structural changes which can lead to
relabeling the entire hierarchy
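A small sketch of a pathstring query, assuming a hypothetical encoding and table names:

-- Each dimension row stores the full path from the root, so a subtree rollup
-- becomes a simple LIKE predicate. e.g. pathstring 'A.B.D' = root A -> child B -> node D.
SELECT SUM(f.revenue_amount) AS subtree_revenue
FROM   consulting_fact  f
JOIN   organization_dim o ON o.org_key = f.org_key
WHERE  o.pathstring = 'A.B'            -- the node itself ...
   OR  o.pathstring LIKE 'A.B.%';      -- ... and everything beneath it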
Fixed depth positional hierarchies
 Product Dimension
 Resist Normalization and keep the many-to-one
relationships flat!
 Store dimension – multiple hierarchies
 Geographic hierarchy
 Internal organization hierarchy
Slightly ragged variable depth hierarchies
 Geographic hierarchies
 For medium location, there is no concept of district
 For simple location, there is no concept of either district
or zone
 City name propagated down into both these attributes
              Simple Loc      Medium Loc      Complex Loc
              Loc Key (PK)    Loc Key (PK)    Loc Key (PK)
              Address + City  Address + City  Address + City
  District:   City            City            District
  Zone:       City            Zone            Zone
              State           State           State
              Country         Country         Country
              …               …               …
Slightly ragged variable depth hierarchies
 Geographic hierarchies
 If we want to include all 3 types of locations in a single
hierarchy, we have a slightly ragged hierarchy
 Narrow range from 4 to 6 levels
 Not recommended if the range is broader, say from 4-8 or higher
              Simple Loc      Medium Loc      Complex Loc
              Loc Key (PK)    Loc Key (PK)    Loc Key (PK)
              Address + City  Address + City  Address + City
  District:   City            City            District
  Zone:       City            Zone            Zone
              State           State           State
              Country         Country         Country
              …               …               …
Ragged variable depth hierarchies
 Organizational Structure is a perfect example
Consulting Invoices DM:
• Consulting services are sold at different
organizational levels
• Need for reports that show consulting
sold not only to individual departments, but
also to divisions, subsidiaries, and the overall
enterprise
• The report must still add up the
separate consulting revenues for each
organization structure

Figure source: www.ioutsource.com


Ragged Variable Depth Hierarchies

 Examples
• Parts composed of subparts, e.g. Part 1 at the top level,
  Subparts 2, 3, and 4 one level below it, and
  Subparts 5, 6, and 7 at the level below that
Ragged Variable Depth Hierarchies
• Enterprise consists of 13
organizations with rollup
structure shown in Fig. 7-8.
• Each organization has its own
budget, commitments and
payments
• For a single org. one can
request a specific budget for
an account using a simple join
to the fact table as shown in
Fig. 7-9.
• What if you want to rollup the
budget across portions of the
tree or even entire tree?
• Fig. 7-9 contains no
information about
organizational rollup!!

Figure taken from Kimball's book – The Data Warehouse Toolkit, 3e


Ragged Variable Depth Hierarchies
• Standard way to represent a parent/child tree structure is to
use a recursive pointer in the organization dimension from
each row to its parent
• Employee hierarchy
• Employee dimension has boss attribute which is FK to
Employee (self referencing)
• The CEO has NULL value for boss
• CONNECT BY function of Oracle traverses these pointers in
a downward fashion starting at a parent and enumerating all
its children
• Recursive common table expressions (CTEs) are also not found
suitable*
• Entangled with organization/employee dimension
• Depend upon the recursive pointer embedded in the data
• Recommended Solution – Bridge Table that is independent
of the primary dimension table

*Kimball’s Article: Building the Hierarchy Bridge Table
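A sketch of the recursive-pointer traversal described above, with hypothetical names:

-- Walk the employee hierarchy downward from the CEO using Oracle's CONNECT BY.
SELECT employee_key, employee_name, LEVEL AS depth_in_tree
FROM   employee_dim
START  WITH boss_key IS NULL              -- the CEO row has a NULL boss
CONNECT BY PRIOR employee_key = boss_key; -- each child's boss is the prior (parent) row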


Ragged Variable Depth Hierarchies
Organization Dimension
Organization_key (PK)
Organization_name
.
.
.
OrganizationParent Key (FK)

Parent/child recursive design

Figure taken from Kimball's book – The Data Warehouse Toolkit, 3e


Ragged Variable Depth Hierarchies:
Bridge Table Approach
• Bridge Table is independent of the primary
dimension table
• Contains all information about rollup
• Grain – each path in the tree from a parent to all the
children below the parent (see Fig. 7-11)

*Kimball’s Article: Building the Hierarchy Bridge Table


Ragged Variable Depth Hierarchies:
Bridge Table Approach
• First column is the PK of the
parent
• Second column is the PK of
the child
• A row is there for each
possible parent to each
possible child, including a
row that connects the parent
to itself
• Highest parent flag:
particular path comes from
the highest parent node in
the tree
• Lowest child flag: particular
path ends in a leaf node of
the tree

Figure taken from Kimball's book – The Data Warehouse Toolkit, 3e
Variable Depth Hierarchies
How many Records in the
Bridge Table?

One record for each separate


path from each node in the org.
tree to itself & to every node
below it
13 nodes (to themselves)
12 nodes below root
4+6 nodes below 1st level
2+4 nodes below 2nd level
2 nodes below 3rd level
TOTAL = 43

Figure source: www.ioutsource.com


Ragged Variable Depth Hierarchies:
Bridge Table Approach

Figure taken from Kimball's book – The Data Warehouse Toolkit, 3e
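A sketch of a subtree rollup through the hierarchy bridge, assuming hypothetical table and column names:

-- Join the bridge's child key to the fact, constrain at the parent end, and the
-- whole subtree (including the parent itself) rolls up in one pass.
SELECT SUM(f.budget_amount) AS subtree_budget
FROM   budget_fact            f
JOIN   organization_bridge    b ON b.child_org_key    = f.organization_key
JOIN   organization_dimension p ON p.organization_key = b.parent_org_key
WHERE  p.organization_name = 'Subsidiary A';   -- hypothetical parent node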


Thank You
Role Playing Dimensions
Prof. Navneet Goyal
Department of Computer Science & Information Systems
BITS, Pilani
Topics
• Role Playing Dimensions
• Problems and Solutions
• SYNONYM
• Example Modeling Situations
Most of the material for the presentation is based on
the book:
The Data Warehouse Toolkit, 3e by
Ralph Kimball
Margy Ross
Role-Playing Dimension
• A single physical dimension can be referenced
multiple times in the same fact table
• Each reference linking to a logically distinct role for
the dimension
• For example, a fact table can have several dates,
each of which is represented by a FK to the date
dimension
• It is essential that each FK refers to a separate view
of the date dimension so that references are
independent
• Separate dimension views (with unique attribute
column names) are called ROLES
Example

Could not have joined them to the same date dimension – SQL
would interpret it as a two-way simultaneous join

Figure taken from http://sqlmag.com


Role-Playing Dimension
• Consider a fact table to record the status and final
disposition of a customer order
• Dimensions of this table could be Order Date,
Packaging Date, Shipping Date, Delivery Date,
Payment Date, Return Date, Refer to Collection
Date, Order Status, Customer, Product, Warehouse,
and Promotion
• Date dimension appears multiple times in the fact
table
Role-Playing Dimension
• Note that the first 7 dimensions are all time
• 7 FKs from the FT to the time dimension!!
• We can not join these 7 FKs to the same table
• SQL would interpret such a seven-way
simultaneous join as requiring that all of the
dates be the same
• Is this what we want?
Role-Playing Dimension
• We cannot literally use a single time table
• We want to administer (build & maintain) a
single physical date dimension
• Create an illusion of 7 independent date
dimensions by using view or aliases
• The column labels in each of these tables should
also be different!
• WHY?
• We will not be able to tell the columns apart if
several of them have been dragged into a report
Role-Playing Dimension
create view order_date
(order_date_key, order_day_of_week,
order_month, …..)
as select date_key, day_of_week, month, …
from date_dimension

create view package_date


(package_date_key, package_day_of_week,
package_month, …..)
as select date_key, day_of_week, month, …
from date_dimension
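A usage sketch, assuming a hypothetical fact table; the two views behave as independent date dimensions in the same query:

SELECT od.order_month,
       pd.package_month,
       COUNT(*) AS order_lines
FROM   order_fact   f
JOIN   order_date   od ON od.order_date_key   = f.order_date_key
JOIN   package_date pd ON pd.package_date_key = f.package_date_key
GROUP  BY od.order_month, pd.package_month;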
Role-Playing Dimension
• Now that we have 7 differently described
Time dimensions, they can be used as if they
were independent
• Can have completely unrelated constraints,
and they can play different roles in a report
Role-Playing Dimension
Another Example
• Frequent Flyer flight segment FT need to
include Flight Date, Segment Origin Airport,
Segment Destination Airport, Trip Origin
Airport, Trip Destination Airport, Flight, Fare
Class, and Customer.
• The 4 Airport dimensions are 4 different roles
played by a single underlying Airport table
A word about OLAP Tools
• Some OLAP products do not support multiple
roles of the same dimension
– You need to create multiple separate dimensions for
multiple roles
• OLAP tools that enable multiple roles, do not enable
attribute renaming for each role
• OLAP environments may consequently be littered
with a plethora of separate dimensions
Coming up next…

• Dimension Hierarchies
• Factless Fact Tables
Thank You
Factless Fact Tables
Prof. Navneet Goyal
Department of Computer Science & Information Systems
BITS, Pilani
Most of the material for the presentation is based on
the book:
The Data Warehouse Toolkit, 3e by
Ralph Kimball
Margy Ross
Factless Fact Tables
• Facts are typically numeric measures
• Events which record merely the coming together of
dimensional entities at a particular moment
– Student attending a class
– A particular product on promotion
• Can also be used to analyze what did not happen
– Factless coverage fact table about all possibilities
– Activity table about events that did happen
– Subtract activity from coverage
– Example: products that were on promotion but did not
sell
Factless Fact Tables
• Case studies that employ factless fact tables
– Retail sales
– Order management
– Education
Retail sales
• Retail sales schema cannot answer an important
question – What products were on promotion but did
not sell?
• Sales FT records only those SKUs that actually got sold
• Not advisable to keep those SKUs in sales FT that did
not sell (it is already huge!!)
• Introduce promotion coverage fact table
– Same keys as the sales fact table
– Grain is different
– FT row represents a product that was on promotion regardless
of whether the product sold
– Factless fact table
Retail sales
• What products were on promotion but did not sell?
• Two step process:
– Query the promotion coverage FFT to determine all the
products that were on promotion on a given day
– Find out all products that sold on a given day
– Difference of these two lists!!
– Try writing SQL query for this!
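One possible answer, sketched with hypothetical table names and a hypothetical date key:

-- Products on promotion on a given day that recorded no sales that day.
SELECT p.product_key
FROM   promotion_coverage_fact p
WHERE  p.date_key = 20180420          -- hypothetical day
MINUS                                 -- Oracle set difference (EXCEPT in ANSI SQL)
SELECT s.product_key
FROM   sales_fact s
WHERE  s.date_key = 20180420;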
Order Management
• Customer/representative assignment
• Representatives are assigned to customers and it is
not necessary that every assignment would lead to a
sale
Sales Rep-Customer Assignment Fact:
  Assignment Effective Date Key (FK)  -> Date Dimension (views for 2 roles)
  Assignment Expiration Date Key (FK) -> Date Dimension (views for 2 roles)
  Sales Rep Key (FK)                  -> Sales Rep Dimension
  Customer Key (FK)                   -> Customer Dimension
  Customer Assignment Counter = 1

Figure taken from Kimball's book – The Data Warehouse Toolkit, 3e


Order Management
• Sales rep coverage factless fact table
• Allows us to answer queries like which assignments
never resulted in sales
Sales Rep-Customer Assignment Fact:
  Assignment Effective Date Key (FK)  -> Date Dimension (views for 2 roles)
  Assignment Expiration Date Key (FK) -> Date Dimension (views for 2 roles)
  Sales Rep Key (FK)                  -> Sales Rep Dimension
  Customer Key (FK)                   -> Customer Dimension
  Customer Assignment Counter = 1

Figure taken from Kimball's book – The Data Warehouse Toolkit, 3e


Education
• Student Registration

Figure taken from Kimball's book – The Data Warehouse Toolkit, 2e


Education
• Student Attendance
• What about events that did not happen?
– Attendance count = 0 or 1
– Ceases to be factless fact table
– Reasonable approach in this case
Student Attendance Fact:
  Day Hour Key (FK)  -> Day_Hour Dimension
  Student Key (FK)   -> Student Dimension
  Course Key (FK)    -> Course Dimension
  Faculty Key (FK)   -> Faculty Dimension
  Facility Key (FK)  -> Facility Dimension
  Attendance count = 1
Figure taken from Kimball's book – The Data Warehouse Toolkit, 3e
Factless Fact Tables: Summary
• Records events which do not have associated facts
• Dummy fact = 1 to increase readability of SQL queries
  SELECT faculty, SUM(registration_count) …
  GROUP BY faculty
• Used in retail sales, order management, education,
etc.
• In some situations, events that did not happen can
also be recorded, but then the fact table ceases to
be factless
Thank You
Case Study:
Academic Data Warehouse

Prof. Navneet Goyal


Department of Computer Science & Information Systems
BITS, Pilani
Academic Data Warehouse
BITS Education Model:
• Student Registration
• Attendance
• Performance

4/20/2018 Dr. Navneet Goyal, BITS Pilani 2


Requirements

• Analyze student registration


• Analyze student attendance
• Relate attendance to performance
• Analyze facility utilization

4/20/2018 Dr. Navneet Goyal, BITS Pilani 3


Scenario
• Many Disciplines
• Many Departments
• First degree & Higher degree
• Multiple campuses
• 4000 on campus students (Pilani)
• 350 courses offered each semester
• Each student doing 6 courses/sem
• 40 lectures per course
• 5 years data

4/20/2018 Dr. Navneet Goyal, BITS Pilani 4


Analysis Requirements
• Top/Bottom 5 electives
• Correlation between attendance and performance
• Variation in MGPA in courses
• Variation in CGPA of students discipline wise/campus wise
• Average CGPA/MGPA over the last few semesters at different
campuses
• Attendance/performance in CDCs of first & second disciplines
• Attendance in forenoon/afternoon sessions
• Attendance for UG/PG students
• Most popular discipline as choice for dual at different
campuses
• Performance of dualites vs. single degree students

4/20/2018 Dr. Navneet Goyal, BITS Pilani 5


Student Registration Event
• Grain of the FT would be one row for each
registered course by student & semester
• Semester is the lowest level available for the
registration events
• Semester dimension should conform to the
calendar date dimension
• Student dimension should have demographic data
+ on campus information like part-time, full-time,
involvement in athletics, major, UG/PG, etc.

4/20/2018 Dr. Navneet Goyal, BITS Pilani 6


Student Registration Event Star Schema
Dimensions
• Student
• Course
• Term/semester
• Instructor
• Campus
Factless fact table!!!

4/20/2018 Dr. Navneet Goyal, BITS Pilani 7


Student Registration Event Star Schema

Figure taken from Kimball's book – The Data Warehouse Toolkit, 3e


4/20/2018 Dr. Navneet Goyal, BITS Pilani 8
Student Performance Star Schema
Dimensions
• Student
• Course
• Term/semester
• Instructor
• Campus
Same dimensions and granularity as Registration

Grade as fact or dimension??

4/20/2018 Dr. Navneet Goyal, BITS Pilani 9


Student Performance Star Schema

4/20/2018 Dr. Navneet Goyal, BITS Pilani 10


Student Attendance Star Schema
Dimensions
• Student
• Course
• Day_hour
• Instructor
• Facility
• Campus

4/20/2018 Dr. Navneet Goyal, BITS Pilani 11


Student Attendance Star Schema

Figure taken from Kimball's book – The Data Warehouse Toolkit, 3e


4/20/2018 Dr. Navneet Goyal, BITS Pilani 12
Dimensional Modeling Concepts
• Factless fact tables
• Minidimensions
• Events that did not occur
• Conformed dimensions

4/20/2018 Dr. Navneet Goyal, BITS Pilani 13


Thank You
Case Study: Banks

Prof. Navneet Goyal


Computer Science Department
BITS, Pilani
Banks

Role of DW in Financial Service Industry


We will concentrate on Retail Banks
Most of us understand the basic functioning of a bank

4/20/2018 Prof. Navneet Goyal, BITS, Pilani 2


Banks: Services Offered
Checking accounts
Saving accounts
Mortgage loans
Investment loans
Personal loans
Credit card
Safe deposit boxes

A “heterogeneous” set of “products” “sold” by the bank.
4/20/2018 Prof. Navneet Goyal, BITS, Pilani 3
Banks: Households
Each account belongs to a household
Major goal of the bank is to market more effectively to
households that already have one or more accounts with the
bank
Household DW
Track all the accounts owned by the bank
See all the individual holders
See the residential and commercial household groupings
to which they belong

4/20/2018 Prof. Navneet Goyal, BITS, Pilani 4


Banks: Requirements
Users want to see five years of historical monthly
snapshots of every account.
Every type of account has a primary balance. There is a
significant need to group different kinds of accounts in
the same analyses and compare primary balances
Every type of account (known as a product within the bank)
has a set of custom dimension attributes and numeric
facts that tend to be different from product to product

4/20/2018 Prof. Navneet Goyal, BITS, Pilani 5


Banks: Requirements
Every account is deemed to belong to a household.
Upon studying the historical production data, we
conclude that accounts come and go from households as
often as several times per year for each household due
to changes in marital status and other life-stage factors
In addition to the household identification, we are very
interested in demographic information as it pertains to
both the individuals and the households.

4/20/2018 Prof. Navneet Goyal, BITS, Pilani 6


Crucial Observations
A large bank can have as many as 10 million accounts
and 3 million households. The accounts can be more
volatile than households
There are different types of products that this bank
provides, as discussed earlier. In addition, it can
also provide many customized products for a specific
customer
We also keep track of the status of the account, which
can be alive or dead, and would like to store information
on the reason behind the closing of any account.
Needless to say, an enormous number of new accounts
is created every day and must be stored
4/20/2018 Prof. Navneet Goyal, BITS, Pilani 7
Issues in Designing DW
Heterogeneous Products. How to model?
Grain. Finest grain data?
Highly volatile demographic profile of customers
Type 2 change?

4/20/2018 Prof. Navneet Goyal, BITS, Pilani 8


Dimensional Modeling Features
Core/supertype and Custom/subtype Fact
Tables
Rapidly Changing Monster Dimensions
Outriggers
Mini-dimensions
Multi-valued dimensions & Bridge Tables

4/20/2018 Prof. Navneet Goyal, BITS, Pilani 9


Monster Dimension
Customer dimensions can be very wide
- Dozens or hundreds of attributes
Customer dimensions can be very large
- Tens of millions of rows in some warehouses
- Sometimes includes prospects as well as actual
customers
Size can lead to performance challenges
- One case when performance concerns can trump
simplicity
- Can we reduce width of dimension table?
- Can we reduce number of rows caused by
preserving history for slowly changing
dimension?
4/20/2018 Prof. Navneet Goyal, BITS, Pilani 10
Snowflaking & Outriggers
Snowflaking is removal of low cardinality columns
from dimension tables to separate normalized
tables.
Snowflaking not recommended
User presentation becomes difficult
Negative impact on browsing performance
Query response time suffers
Prohibits the use of Bitmap Indexes
Some situations permit the use of dimension
outriggers
Outriggers have special characteristics that make
them permissible snowflakes
4/20/2018 Prof. Navneet Goyal, BITS, Pilani 11
Outrigger Tables
Limited normalization of large dimension table to
save space
Identify attribute sets with these properties:
- Highly correlated
- Different grain than the dimension (# of customers)
- Change in unison

4/20/2018 Prof. Navneet Goyal, BITS, Pilani 12


Dimension Triage
Most dimensional models have between 5 and
15 dimensions
Core fact table containing only the primary
balance of every account at the end of each
month
Only two dimensions – Month & Account

4/20/2018 Prof. Navneet Goyal, BITS, Pilani 13


Dimension Triage

4/20/2018 Prof. Navneet Goyal, BITS, Pilani 14


Too Few Dimensions
Account dimension is a HUGE entry table to the
fact table, thereby slowing queries
For a large bank, # of customers could touch 10
m and using type 2 SCD could render it
unworkable
Products and branches could be thought of as
two separate dimensions as there is a M:N
relationship between the two

4/20/2018 Prof. Navneet Goyal, BITS, Pilani 15


Dimensions
Account
Time (month in this case)
Branch
Household
Product
Status

4/20/2018 Prof. Navneet Goyal, BITS, Pilani 16


Status Dimension
Records the status of the account at the
end of each month
Status could be active or inactive
Status change, such as new account
opening or closure occurring during the
month, is also recorded
Reasons for status change are also
stored

4/20/2018 Prof. Navneet Goyal, BITS, Pilani 17


Household Dimension
Household as a separate dimension-
designer’s prerogative
Account & household dimension closely
related
Still it is a good idea to treat HH as a
separate dimension
- Size of the account dimension (~10m)
- Smaller entry point to the fact table (~3m HHs)
- Account can change HH many times

4/20/2018 Prof. Navneet Goyal, BITS, Pilani 18


Mini-Dimensions Revisited

4/20/2018 Prof. Navneet Goyal, BITS, Pilani 20


Thank You
Data Warehousing
M6: OLAP & Multidimensional Databases (MDB)
BITS Pilani T V Rao, BITS, Pilani (off-campus)
Pilani|Dubai|Goa|Hyderabad
BITS Pilani
Pilani|Dubai|Goa|Hyderabad

Cube Complexity & Optimization


Data Cube: A Lattice of Cuboids
all                                                          0-D (apex) cuboid

time | item | location | supplier                            1-D cuboids

time,item | time,location | time,supplier |
item,location | item,supplier | location,supplier            2-D cuboids

time,item,location | time,item,supplier |
time,location,supplier | item,location,supplier              3-D cuboids

time,item,location,supplier                                  4-D (base) cuboid


Multidimensional Data
• Sales volume as a function of product, month, and region
Dimensions: Product, Location, Time
Hierarchical summarization paths

  Product:  Industry -> Category -> Product
  Location: Region -> Country -> City -> Office
  Time:     Year -> Quarter -> Month -> Day (with an alternative Week -> Day path)
Cube Materialization:
Full Cube vs. Iceberg Cube
 Computing only the cuboid cells whose measure satisfies the
iceberg condition, i.e., is at least the minimum support value
 Only a small portion of the cells may be "above the water" in a
sparse cube

 Avoid explosive growth: A cube with 100 dimensions

 2 base cells: (a1, a2, …, a100), (b1, b2, …, b100)

 How many aggregate cells if "having count >= 1"?

 What about "having count >= 2"?

Cube Materialization:
Full Cube vs. Iceberg Cube

• Introducing iceberg cubes will lessen the burden of computing trivial aggregate cells in a
data cube. However, we could still end up with a large number of uninteresting cells to
compute.
• For example, suppose that there are 2 base cells for a database of 100 dimensions, denoted as
{(a1, a2, a3, …, a100) : 10, (a1, a2, b3, …, b100) : 10}, where each has a cell count of 10. If the minimum
support is set to 10, there will still be an impermissible number of cells to compute and store,
although most of them are not interesting.
• For example, there are about 2^101 distinct aggregate cells, like {(a1, a2, a3, a4, …, a99, ∗) : 10, …, (a1,
a2, ∗, a4, …, a99, a100) : 10, …, (a1, a2, a3, ∗, …, ∗, ∗) : 10}, but most of them do not contain much new
information.
• If we ignore all the aggregate cells that can be obtained by replacing some constants by
∗'s while keeping the same measure value, there are only three distinct cells left: {(a1, a2,
a3, …, a100) : 10, (a1, a2, b3, …, b100) : 10, (a1, a2, ∗, …, ∗) : 20}. That is, out of about 2^101
distinct base and aggregate cells, only three really offer valuable information
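For a single cuboid, the iceberg condition corresponds to a HAVING clause; a sketch with illustrative table and dimension names:

-- Only group-by combinations whose count clears the minimum support are computed
-- and stored; a full iceberg cube applies the same condition to every cuboid in the lattice.
SELECT month, city, customer_group, COUNT(*) AS cnt
FROM   sales_fact
GROUP  BY month, city, customer_group
HAVING COUNT(*) >= 10;               -- iceberg condition: min_sup = 10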

General Strategies for Data Cube Computation

• Sorting, hashing, and grouping. Sorting, hashing, and grouping


operations should be applied to the dimension attributes to reorder
and cluster related tuples.
• In cube computation, aggregation is performed on the tuples (or cells) that
share the same set of dimension values. Thus, it is important to explore
sorting, hashing, and grouping operations to access and group such data
together to facilitate computation of such aggregates.
• To compute total sales by branch, day, and item, for example, it can be more
efficient to sort tuples or cells by branch, and then by day, and then group
them according to the item name.

General Strategies for Data Cube Computation

• Simultaneous aggregation and caching of intermediate results. In


cube computation, it is efficient to compute higher-level aggregates
from previously computed lower-level aggregates, rather than from
the base fact table. Moreover, simultaneous aggregation from cached
intermediate computation results may lead to the reduction of
expensive disk input/output (I/O) operations.
• To compute sales by branch, for example, we can use the intermediate results
derived from the computation of a lower-level cuboid such as sales by branch
and day. This technique can be further extended to perform amortized scans
(i.e., computing as many cuboids as possible at the same time to amortize
disk reads).

General Strategies for Data Cube Computation
• Aggregation from the smallest child when there exist multiple child cuboids.
When there exist multiple child cuboids, it is usually more efficient to compute
the desired parent (i.e., more generalized) cuboid from the smallest, previously
computed child cuboid.
• To compute a sales cuboid, Cbranch, when there exist two previously computed cuboids,
C{branch, year} and C{branch, item}, for example, it is obviously more efficient to compute Cbranch from
the former than from the latter if there are many more distinct items than distinct years.
• The Apriori pruning method can be explored to compute iceberg cubes
efficiently. The Apriori property,[3] in the context of data cubes, states as follows:
If a given cell does not satisfy minimum support, then no descendant of the cell
(i.e., more specialized cell) will satisfy minimum support either. This property can
be used to substantially reduce the computation of iceberg cubes.
Prescribed Text Books

Author(s), Title, Edition, Publishing House

T1: Ponniah P, "Data Warehousing Fundamentals", Wiley Student Edition, 2012

T2: Kimball R, "The Data Warehouse Toolkit", 3e, John Wiley, 2013

References

Author(s), Title, Edition, Publishing House

Han J, Kamber M, & Pei J, "Data Mining: Concepts and Techniques", 3e, Morgan Kaufmann Publishers

Thank You

Data Warehousing
BITS Pilani M6: OLAP & Multidimensional Databases (MDDB)
Pilani|Dubai|Goa|Hyderabad
BITS Pilani
Pilani|Dubai|Goa|Hyderabad

6.1 Introduction to OLAP


Characteristics of Strategic Information
Integrated:     Must have a single, enterprise-wide view
Data Integrity: Information must be accurate and must conform to business rules
Accessible:     Easily accessible with intuitive access paths, and responsive for analysis
Credible:       Every business factor must have one and only one value
Timely:         Information must be available within the stipulated time frame

What is OLAP?
OLAP is a category of software technology that
enables analysts, managers, and executives to gain
insight into data through fast, consistent,
interactive access to a wide variety of possible
views of information that has been transformed
from raw data to reflect the real dimensionality of
the enterprise as understood by the user.

Codd’s Rules for OLAP
1. Multidimensional Conceptual View. Provide a multidimensional data model that is intuitively analytical and easy to use.
Business users' view of an enterprise is multidimensional in nature. Therefore, a multidimensional data model conforms to
how the users perceive business problems.
2. Transparency Make the technology, underlying data repository, computing architecture, and the diverse nature of source
data totally transparent to users. Such transparency, supporting a true open system approach, helps to enhance the
efficiency and productivity of the users through front-end tools that are familiar to them.
3. Accessibility Provide access only to the data that is actually needed to perform the specific analysis, presenting a single,
coherent, and consistent view to the users. The OLAP system must map its own logical schema to the heterogeneous physical
data stores and perform any necessary transformations.
4. Consistent Reporting Performance Ensure that the users do not experience any significant degradation in reporting
performance as the number of dimensions or the size of the database increases. Users must perceive consistent run time,
response time, or machine utilization every time a given query is run.
5. Client/Server Architecture Conform the system to the principles of client/server architecture for optimum performance,
flexibility, adaptability, and interoperability. Make the server component sufficiently intelligent to enable various clients to be
attached with a minimum of effort and integration programming.
6. Generic Dimensionality Ensure that every data dimension is equivalent in both structure and operational capabilities. Have
one logical structure for all dimensions. The basic data structure or the access techniques must not be biased toward any
single data dimension.

Codd’s Rules for OLAP
7. Dynamic Sparse Matrix Handling Adapt the physical schema to the specific analytical model being created and loaded that
optimizes sparse matrix handling. When encountering a sparse matrix, the system must be able to dynamically deduce the
distribution of the data and adjust the storage and access to achieve and maintain consistent level of performance.
8. Multiuser Support Provide support for end users to work concurrently with either the same analytical model or to create
different models from the same data. In short, provide concurrent data access, data integrity, and access security.
9. Unrestricted Cross-dimensional Operations Provide ability for the system to recognize dimensional hierarchies and automatically
perform roll-up and drill-down operations within a dimension or across dimensions. Have the interface language allow
calculations and data manipulations across any number of data dimensions, without restricting any relations between data cells,
regardless of the number of common data attributes each cell contains.
10. Intuitive Data Manipulation Enable consolidation path reorientation (pivoting), drill-down and roll-up, and other manipulations
to be accomplished intuitively and directly via point-and-click and drag-and-drop actions on the cells of the analytical model.
Avoid the use of a menu or multiple trips to a user interface.
11. Flexible Reporting Provide capabilities to the business user to arrange columns, rows, and cells in a manner that facilitates easy
manipulation, analysis, and synthesis of information. Every dimension, including any subsets, must be able to be displayed with
equal ease.
12. Unlimited Dimensions and Aggregation Levels Accommodate at least fifteen, preferably twenty, data dimensions within a
common analytical model. Each of these generic dimensions must allow a practically unlimited number of user-defined
aggregation levels within any given consolidation path.

Data Warehousing
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
An Analysis Session
A typical iterative analysis session (finding, then the next query it prompts):

Finding: Sales OK, but a sharp enterprise-wide profitability dip over the last 3 months.
  Query: Countrywide monthly sales for the last 3 months?
  Query: Monthly sales by worldwide regions?
Finding: Sharp reduction in the European region.
  Query: European sales by countries?
Finding: Increase in a few countries, flat in others, sharp decline in some.
  Query: European sales by countries and products?
Finding: Sharp decline in some products in EU countries over the last 2 months.
  Query: Direct and indirect costs for EU countries?
Finding: Direct costs OK, EU indirect costs up (cause: an additional tax).
Data Warehousing
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Typical Calculations
• Roll-ups to provide summaries and aggregations along the hierarchies of
the dimensions.
• Drill-downs from the top level to the lowest along the hierarchies of the
dimensions, in combinations among the dimensions.
• Simple calculations, such as computation of margins (sales minus costs).
• Share calculations to compute the percentage of parts to the whole.
• Algebraic equations involving key performance indicators.
• Moving averages and growth percentages.
• Trend analysis using statistical methods.

Data Warehousing
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
OLAP operations
Summary (before drill-down):

LINE         TOTAL SALES
Clothing     $12,836,450
Electronics  $16,068,300
Video        $21,262,190
Kitchen      $17,704,400
Appliances   $19,600,800
Total        $87,472,140

Drill down (by year):

LINE         1998         1999         2000         TOTAL
Clothing     $3,457,000   $3,590,050   $5,789,400   $12,836,450
Electronics  $5,894,800   $4,078,900   $6,094,600   $16,068,300
Video        $7,198,700   $6,057,890   $8,005,600   $21,262,190
Kitchen      $4,875,400   $5,894,500   $6,934,500   $17,704,400
Appliances   $5,947,300   $6,104,500   $7,549,000   $19,600,800
Total        $27,373,200  $25,725,840  $34,373,100  $87,472,140

Rotate / Pivot (years as rows, product lines as columns):

YEAR   Clothing     Electronics  Video        Kitchen      Appliances   TOTAL
1998   $3,457,000   $5,894,800   $7,198,700   $4,875,400   $5,947,300   $27,373,200
1999   $3,590,050   $4,078,900   $6,057,890   $5,894,500   $6,104,500   $25,725,840
2000   $5,789,400   $6,094,600   $8,005,600   $6,934,500   $7,549,000   $34,373,100
Total  $12,836,450  $16,068,300  $21,262,190  $17,704,400  $19,600,800  $87,472,140

Data Warehousing
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Limitations of Other Tools
• Users need the ability to analyse data along multiple dimensions and
their hierarchies rapidly
• Spreadsheets can be cumbersome to use, particularly for large
volumes of data
• Multidimensional data entered in a spreadsheet carries a lot of redundancy
• Creating a multidimensional view in a spreadsheet requires enormous effort

Data Warehousing
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Limitations of Other Tools
• SQL was originally meant to be an end-user query language
• Except for very simple operations, the syntax is not easy for end-users
to conceptualize
• The vocabulary is not suitable for analysis; comparisons are a challenge
• SQL is not good with complex calculations and time-series data

Data Warehousing
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Limitations of Other Tools
• A real-world analysis session requires many queries following one
after the other.
• Each query may translate into a number of statements invoking full
table scans, multiple joins, aggregations, groupings, and sorting.
• The overhead on the systems would be enormous and seriously
impact the response times

Data Warehousing
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Features of OLAP
Basic Features:
• Multidimensional analysis
• Consistent performance
• Fast response times for interactive queries
• Drill-down and roll-up
• Navigation in and out of details
• Slice-and-dice or rotation
• Multiple view modes
• Time intelligence (year-to-date, fiscal period)
• Easy scalability

Advanced Features:
• Powerful calculations
• Cross-dimensional calculations
• Pre-calculation or pre-consolidation
• Drill-through across dimensions or details
• Sophisticated presentation & displays
• Collaborative decision making
• Derived data values through formulas
• Application of alert technology
• Report generation with agent technology

Data Warehousing
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
CUBE Operator in SQL
• A cube aggregates the facts in each level of each dimension in a given
OLAP schema
• Data cubes are not "cubes" in the strictly mathematical sense because they do not have
equal sides.
• Most likely, there are more than 3 dimensions
• Major SQL vendors provide cube operator in their products
• Typical sequence for Cube computation:
 Identify physical sources of data
 Specify logical views built upon physical source
 Build cube for specified measures and dimensions

Data Warehousing
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
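The following minimal Python sketch (toy data; the fact rows, dimension names, and the cube helper are illustrative, not from any product) emulates what a GROUP BY CUBE over three dimensions computes: one aggregate per subset of the dimensions, with the rolled-up dimensions shown as "ALL".

from itertools import combinations
from collections import defaultdict

# Toy fact rows: (product_line, year, store, sales) -- hypothetical data
facts = [
    ("Clothing", 1998, "S1", 100.0),
    ("Clothing", 1999, "S2", 150.0),
    ("Video",    1998, "S1", 200.0),
]
dims = ("line", "year", "store")     # dimension names
MEASURE = 3                          # index of the sales measure in each row

def cube(rows):
    """Aggregate SUM(sales) for every subset of the dimensions,
    i.e. what GROUP BY CUBE(line, year, store) would return."""
    results = {}
    for k in range(len(dims) + 1):
        for subset in combinations(range(len(dims)), k):
            groups = defaultdict(float)
            for r in rows:
                # Dimensions outside the subset are rolled up to 'ALL'
                key = tuple(r[i] if i in subset else "ALL" for i in range(len(dims)))
                groups[key] += r[MEASURE]
            results[subset] = dict(groups)
    return results

for subset, groups in cube(facts).items():
    names = [dims[i] for i in subset] or ["grand total"]
    print(names, groups)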
Prescribed Text Books

Author(s), Title, Edition, Publishing House


T1 Ponniah P, "Data Warehousing Fundamentals", Wiley Student Edition, 2012

T2 Kimball R, "The Data Warehouse Toolkit", 3e, John Wiley, 2013

Data Warehousing
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
BITS Pilani M6: OLAP & Multidimensional Databases (MDDB)
Pilani|Dubai|Goa|Hyderabad
T V Rao, BITS, Pilani (off-campus)

Multidimensional Databases
Why Multidimensional Database
• In 1960s, when a research scholar at MIT was doing analytical work
on product sales, he realized that
‒ he spent most of his time wrestling with reformatting the data for his analysis,
‒ not on the statistical algorithms or the true data analysis
• Once he had modeled the data in a multidimensional form, he was
able to report the data in many different formats
• By abstracting the data model from the data itself, the user could
work with the data in an ad hoc fashion, asking questions that had
not been formulated when developing the specifications

Data Warehousing
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Multidimensional Database

• A multidimensional database, or MDB, stores dimensional


information in a format called a cube.
• The basic concept of a cube is to precompute the various
combinations of dimension values and fact values so they can be
studied interactively
• Data in an MDB is accessed through an interface, which is often
proprietary, although MDX (MultiDimensionalEXpressions) has gained
wide acceptance as a standard
• MDBs support many statistical functions

Data Warehousing
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
A Motivating Example
An automobile manufacturer wants to increase sale volumes by examining sales
data collected throughout the organization. The evaluation would require viewing
historical sales volume figures from multiple dimensions such as
• Sales volume by model
• Sales volume by color
• Sales volume by dealer
• Sales volume over time

Data Warehousing
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Sales Data in Relational Form

MODEL COLOR SALES VOLUME


Hatchback BLUE 6
Hatchback RED 5
Hatchback WHITE 4
SUV BLUE 3
SUV RED 5
SUV WHITE 5
SEDAN BLUE 4
SEDAN RED 3
SEDAN WHITE 2

Data Warehousing
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Multidimensional Structure

A two-dimensional array with MODEL and COLOR as dimensions and sales volume
as the measurement; the values of each dimension are its positions.

MODEL \ COLOR   Blue   Red   White
Hatchback        6      5     4
SUV              3      5     5
Sedan            4      3     2
Data Warehousing
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Comments on Multidimensional Structure

• The relational structure tells us nothing about the possible contents of
those fields.
• The structure of the array, on the other hand, tells us not only that there
are two dimensions, COLOR and MODEL, but it also presents all possible
values of each dimension as positions along the dimension.
• Because of this structured presentation, all possible combinations of
perspectives containing a specific attribute (the color BLUE, for example)
line up along the dimension position for that attribute.
• This makes data browsing and manipulation highly intuitive to the
end-user. As a result, this "intelligent" array structure lends itself very
well to data analysis.

Data Warehousing
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Let us add Dealer to the table
MODEL COLOR DEALERSHIP VOLUME
Hatchback BLUE Mitra 6
Hatchback BLUE Patel 6
Hatchback BLUE Singh 2
Hatchback RED Mitra 3
Hatchback RED Patel 5
Hatchback RED Singh 5
Hatchback WHITE Mitra 2
Hatchback WHITE Patel 4
Hatchback WHITE Singh 3
SUV BLUE Mitra 2
SUV BLUE Patel 3
SUV BLUE Singh 2
SUV RED Mitra 7
SUV RED Patel 5
SUV RED Singh 2
SUV WHITE Mitra 4
SUV WHITE Patel 5
SUV WHITE Singh 1
SEDAN BLUE Mitra 6
SEDAN BLUE Patel 4
SEDAN BLUE Singh 2
SEDAN RED Mitra 1
SEDAN RED Patel 3
SEDAN RED Singh 4
SEDAN WHITE Mitra 2
SEDAN WHITE Patel 2
SEDAN WHITE Singh 3

Data Warehousing
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Multidimensional Structure

[Figure: a three-dimensional array with MODEL, COLOR, and DEALER as its
dimensions; each cell holds the sales volume for one model/color/dealer
combination.]

Data Warehousing
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Multidimensional Structure

If each dimension has 10 positions, the corresponding relational table
requires 10 * 10 * 10, i.e., 1000 records.

Data Warehousing


BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Performance Advantages with MDB
Volume figure when car type = SEDAN, color = BLUE, & dealer = PATEL?

• RDBMS – all 1000 records might need to be searched to find the right record

• The MDB has more knowledge about where the data lies: at most 30 position
searches (10 per dimension)

• Average case: roughly 15 position searches for the MDB vs. 500 record
reads for the RDBMS (a small sketch of this follows)
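A minimal Python sketch (hypothetical 10 x 10 x 10 data) contrasting the two access patterns described above: an unindexed scan over relational-style rows versus positional lookup in a dimensioned array. All names here are illustrative.

from itertools import product

models  = [f"M{i}" for i in range(10)]     # 10 positions per dimension (hypothetical)
colors  = [f"C{i}" for i in range(10)]
dealers = [f"D{i}" for i in range(10)]

# Relational-style: a list of 1000 rows (model, color, dealer, volume)
rows = [(m, c, d, 1) for m, c, d in product(models, colors, dealers)]

def rdbms_lookup(m, c, d):
    """Unindexed table scan: inspect rows until the matching one is found."""
    for i, row in enumerate(rows, start=1):
        if row[:3] == (m, c, d):
            return row[3], i            # volume, number of rows touched
    return None, len(rows)

# MDB-style: a dense 3-D array addressed by dimension positions
cube = [[[1 for _ in dealers] for _ in colors] for _ in models]
pos = {name: i for dim in (models, colors, dealers) for i, name in enumerate(dim)}

def mdb_lookup(m, c, d):
    """Find each member's position along its dimension, then index directly."""
    probes = (pos[m] + 1) + (pos[c] + 1) + (pos[d] + 1)   # at most 10 per dimension
    return cube[pos[m]][pos[c]][pos[d]], probes

print(rdbms_lookup("M5", "C2", "D7"))   # hundreds of rows touched on average
print(mdb_lookup("M5", "C2", "D7"))     # a handful of position probes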

Data Warehousing
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Performance Advantages with MDB
• To generalize the performance advantage:
• In the case of an RDBMS, the size of the search space gets multiplied each time a new
dimension is added; access time is affected accordingly
• In the case of an MDB, the search space increases only by the size of the new dimension
each time a new dimension is added

• At what cost?
• The MDB is a separate, proprietary implementation outside SQL
• Since all business data is in the RDBMS, the MDB has to be precomputed. The larger the
data and the more the dimensions, the higher the precomputation effort
• The longer the precomputation interval, the higher the data latency in the MDB

Data Warehousing
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
OLAP Operations
• Roll up (drill-up): summarize data
  • by climbing up a hierarchy or by dimension reduction
• Drill down (roll down): reverse of roll-up
  • from a higher-level summary to a lower-level summary or detailed
    data, or introducing new dimensions
• Slice and dice: project and select
• Pivot (rotate):
  • reorient the cube; visualization; 3D to a series of 2D planes
• Other operations
  • drill across: involving (across) more than one fact table
  • drill through: through the bottom level of the cube to its back-end relational
    tables (using SQL)
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
OLAP Operations
Distinct OLAP models

Data Warehousing
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Data Storage
ROLAP:
• Data stored as relational tables in the warehouse.
• Detailed and light summary data available.
• Very large data volumes.
• All data access from the warehouse storage.

MOLAP:
• Data stored as relational tables in the warehouse.
• Various summary data kept in proprietary databases (MDBs).
• Moderate data volumes.
• Summary data access from the MDB, detailed data access from the warehouse.

Data Warehousing
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Underlying Technologies
ROLAP:
• Use of complex SQL to fetch data from the warehouse.
• ROLAP engine in the analytical server creates data cubes on the fly.
• Multidimensional views provided by the presentation layer.

MOLAP:
• Creation of pre-fabricated data cubes by the MOLAP engine.
• Proprietary technology to store multidimensional views in arrays, not tables.
• High-speed matrix data retrieval.
• Sparse matrix technology to manage data sparsity in summaries.
Data Warehousing
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Functions and Features
ROLAP:
• Known environment and availability of many tools.
• Limitations on complex analysis functions.
• Drill-through to the lowest level is easier; drill-across is not always easy.

MOLAP:
• Faster access.
• Large library of functions for complex calculations.
• Easy analysis irrespective of the number of dimensions.
• Extensive drill-down and slice-and-dice capabilities.
Data Warehousing
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
HOLAP (Hybrid OLAP)
The intermediate architecture type, HOLAP, aims at mixing the advantages of
both basic solutions. It takes advantage of the standardization level and the
ability to manage large amounts of data from ROLAP implementations, and
the query speed typical of MOLAP systems. HOLAP has the largest amount of
data stored in an RDBMS, and a multidimensional system stores only the
information users most frequently need to access. If that information is not
enough to solve queries, the system will transparently access the part of the
data managed by the relational system. Important market actors have
adopted HOLAP solutions to improve their platform performance.

Data Warehousing
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Prescribed Text Books

Author(s), Title, Edition, Publishing House


T1 Ponniah P, "Data Warehousing Fundamentals", Wiley Student Edition, 2012

T2 Kimball R, "The Data Warehouse Toolkit", 3e, John Wiley, 2013

References
Author(s), Title, Edition, Publishing House

Adamson C, Star Schema: The Complete Reference, McGraw-Hill/Osborne

Kenan Technologies, An Introduction to Multidimensional Database Technology

Han J, Kamber M, & Pei J, Data Mining: Concepts and Techniques, 3e, Morgan Kaufmann Publishers
Data Warehousing
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Thank You
Aggregation
Contd…

Prof. Navneet Goyal


Computer Science Department
BITS, Pilani
Aggregate Navigator
• How are queries directed to the appropriate
aggregates?
• Do end-user query tools have to be hardcoded to
take advantage of aggregates?
• If the DBA changes the aggregates, do all end-user
applications have to be recoded?

How do we overcome this problem?

4/20/2018 Dr. Navneet Goyal, BITS, Pilani 2


Aggregate Navigator
• Aggregate Navigator (AN) is the solution
• So what is an AN?
• A middleware sitting between user queries and DBMS
• With AN, user applications speak just base level SQL
• AN uses metadata to transform base level SQL into
Aggregate Aware SQL

4/20/2018 Dr. Navneet Goyal, BITS, Pilani 3


4/20/2018 Dr. Navneet Goyal, BITS, Pilani 4

Figure taken from Kimball’s articles on Aggregate Navigator


AN Algorithm
1. Rank-order all the aggregate fact tables from the
smallest to the largest
2. For the smallest aggregate FT, look in the associated DTs to
verify that all the dimensional attributes of the
current query can be found. If found, we are
through: replace the base-level FT with the
aggregate FT and the aggregate DTs
3. If step 2 fails, find the next smallest aggregate FT
and try step 2 again. If we run out of aggregate
FTs, then we must use the base tables
(a sketch of this selection step follows)
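A minimal sketch of the navigator's table-selection step, assuming hypothetical metadata (the aggregates list, table names, and attribute sets below are illustrative, not from any particular product).

# Hypothetical metadata: each aggregate fact table with its size and the
# dimensional attributes available in its (shrunken) dimension tables.
aggregates = [
    {"name": "sales_by_month_category", "rows": 2_000_000,
     "attrs": {"month", "year", "category"}},
    {"name": "sales_by_day_category",   "rows": 20_000_000,
     "attrs": {"day", "month", "year", "category"}},
]
base_table = {"name": "sales_base", "rows": 500_000_000,
              "attrs": {"day", "month", "year", "product", "category", "store"}}

def choose_table(query_attrs):
    """Aggregate-navigator core: the smallest aggregate whose dimensional
    attributes cover the query; otherwise fall back to the base table."""
    for agg in sorted(aggregates, key=lambda a: a["rows"]):   # step 1: rank by size
        if query_attrs <= agg["attrs"]:                       # step 2: attribute check
            return agg["name"]
    return base_table["name"]                                 # step 3: fall back

print(choose_table({"month", "category"}))   # -> sales_by_month_category
print(choose_table({"product", "month"}))    # -> sales_base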

4/20/2018 Dr. Navneet Goyal, BITS, Pilani 5


Design Requirements
#1 Aggregates must be stored in their own fact tables,
separate from the base-level data. In addition, each
distinct aggregation level must occupy its own unique
fact table
#2 The dimension tables attached to the aggregate fact
tables must, wherever possible, be shrunken versions
of the dimension tables associated with the base fact
table.

4/20/2018 Dr. Navneet Goyal, BITS, Pilani 6


Design Requirements
#3 The base Fact table and all of its related aggregate
Fact tables must be associated together as a "family
of schemas" so that the aggregate navigator knows
which tables are related to one another.
#4 Force all SQL created by any end user or application
to refer exclusively to the base fact table and its
associated full-size dimension tables.

4/20/2018 Dr. Navneet Goyal, BITS, Pilani 7


Storing Aggregates
• New fact and dimension table approach (Approach 1)
• New Level Field approach (Approach 2)
• Both require same space?
• Approach 1 is recommended
• Reasons?

4/20/2018 Dr. Navneet Goyal, BITS, Pilani 8


Q&A
Thank You
Aggregation

Prof. Navneet Goyal


Computer Science Department
BITS, Pilani
Aggregation

"The single most dramatic way to affect
performance in a large data warehouse
is to provide a proper set of aggregate
(summary) records ... in some cases speeding
queries by a factor of 100 or even 1,000.
No other means exist to harvest such
spectacular gains."

- Ralph Kimball
4/20/2018 Dr. Navneet Goyal, BITS, Pilani 2
Aggregation
• Still, aggregation is so underused. Why?
• We are still not comfortable with redundancy!
• Requires extra space
• Most of us are not sure of what aggregates to store
• A bizarre phenomenon called
SPARSITY FAILURE

4/20/2018 Dr. Navneet Goyal, BITS, Pilani 3


Aggregates & Indexes
• Analogous??
• Indexes duplicate the information content of indexed columns
• We don’t disparage this duplication as redundancy
because of the benefits
• Aggregates duplicate the information content of aggregated
columns
• Traditional indexes very quickly retrieve a small no.
of qualifying records in OLTP systems
• In DW, queries require millions of records to be summarized
• Bypassing indexes and performing table scans

4/20/2018 Dr. Navneet Goyal, BITS, Pilani 4


Aggregates
• Need new indexes that can quickly and logically
get us to millions of records
• Logically because we need only the summarized
result and not the individual records
• Aggregates as Summary Indexes!

4/20/2018 Dr. Navneet Goyal, BITS, Pilani 5


Aggregates
• Aggregates belong to the same DB as the low level
atomic data that is indexed (unlike data marts)
• Queries should always target the atomic data
• Aggregate Navigation automatically rewrites
queries to access the best presently available
aggregates
• Aggregate navigation is a form of query
optimization
• Should be offered by DB query optimizers
• Intelligent Middleware

4/20/2018 Dr. Navneet Goyal, BITS, Pilani 6


Aggregation: Thumb Rule

The size of the database should not


become more than double of its original
size

4/20/2018 Dr. Navneet Goyal, BITS, Pilani 7


Aggregates: Trade-Offs
• Query performance vs. Costs
• Costs
– Storing
– Building
– Maintaining
– Administrating
• Imbalance: a retail DW that collapsed under the
weight of more than 2,500 aggregates and took
more than 24 hours to refresh!!!

4/20/2018 Dr. Navneet Goyal, BITS, Pilani 8


Aggregates: Guidelines
• Set an aggregate storage limit (not more than double
the original size of the DB)
• Dynamic portfolio of aggregates that change with
changing demands
• Define small aggregates: 10 to 20 times smaller than
the FT or aggregate on which it is based
– Monthly product sales aggregate: How many times smaller
than the daily product sales table?
• If your answer is 30 ... you are forgiven, but you are
likely to be wrong
– Reason: Sparsity Failure

4/20/2018 Dr. Navneet Goyal, BITS, Pilani 9


Aggregates: Guidelines
Spread aggregates: Goal should be to
accelerate a broad spectrum of queries

4/20/2018 Dr. Navneet Goyal, BITS, Pilani 10


Aggregates: Guidelines
• Spread aggregates: Goal should be to
accelerate a broad spectrum of queries

Figure 1 Poor use of the space allocated for aggregates.


4/20/2018 Dr. Navneet Goyal, BITS, Pilani 11

Source of Figure not known


Aggregation

Figure taken from Neil Raden article (www.hiredbrains.com/artic9.html)


4/20/2018 Dr. Navneet Goyal, BITS, Pilani 12
Aggregates
Issues
Which aggregates to create?
How to guard against sparsity failure?
How to store them?
• New Fact Table approach
• New Level Field approach
How queries are directed to appropriate aggregates?

4/20/2018 Dr. Navneet Goyal, BITS, Pilani 13


Aggregates: Example
• Grocery Store
• 3 dimensions – Product, Location, & Time
• 10000 products
• 1000 stores
• 100 time periods
• 10% Sparsity

Total no. of records = 100 million


4/20/2018 Dr. Navneet Goyal, BITS, Pilani 14
Aggregates: Example

Hierarchies
• 10000 products in 2000 categories
• 1000 stores in 100 districts
• 100 base time periods rolling up into 30 aggregate time periods

4/20/2018 Dr. Navneet Goyal, BITS, Pilani 15


Aggregates: Example
How many aggregates are possible?
• 1-way: Category by Store by Day
• 1-way: Product by District by Day
• 1-way: Product by Store by Month
• 2-way: Category by District by Day
• 2-way: Category by Store by Month
• 2-way: Product by District by Month
• 3-way: Category by District by Month

4/20/2018 Dr. Navneet Goyal, BITS, Pilani 16


Aggregates: Example
What is Sparsity?
• Fact tables are sparse in their keys!
• 10% sparsity at base level means that only 10% of the
products are sold on any given day (average)
• As we move from the base level to 1-way aggregates, the sparsity
increases!
________
• What effect will sparsity have on the size of the
aggregate fact table?

4/20/2018 Dr. Navneet Goyal, BITS, Pilani 17


Aggregates: Example
• Let us assume that sparsity for 1-way
aggregates is 50%
• For 2-way 80%
• For 3-way 100%
• Do you agree with this?
• Is it logical?

4/20/2018 Dr. Navneet Goyal, BITS, Pilani 18


Aggregates: Example
Table Prod. Store Time Sparsity # Records
Base 10000 1000 100 10% 100 million
1-way 2000 1000 100 50% 100 million
1-way 10000 100 100 50% 50 million
1-way 10000 1000 30 50% 150 million
2-way 2000 100 100 80% 16 million
2-way 2000 1000 30 80% 48 million
2-way 10000 100 30 80% 24 million
3-way 2000 100 30 100% 6 million
Grand Total 494 million
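A quick Python sketch to reproduce the record counts above (record count = product of dimension sizes x sparsity). Re-running it with 500 categories and 5 aggregate time periods gives the roughly 210 million total worked out on a later slide.

def records(prod, store, time, sparsity):
    # Number of fact rows = product of dimension positions times sparsity
    return prod * store * time * sparsity

tables = [
    ("Base ", 10000, 1000, 100, 0.10),
    ("1-way", 2000,  1000, 100, 0.50),
    ("1-way", 10000, 100,  100, 0.50),
    ("1-way", 10000, 1000, 30,  0.50),
    ("2-way", 2000,  100,  100, 0.80),
    ("2-way", 2000,  1000, 30,  0.80),
    ("2-way", 10000, 100,  30,  0.80),
    ("3-way", 2000,  100,  30,  1.00),
]

total = 0
for name, p, s, t, sp in tables:
    n = records(p, s, t, sp)
    total += n
    print(f"{name}: {n / 1e6:7.2f} million records")
print(f"Grand total: {total / 1e6:.2f} million records")   # 494.00 million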
4/20/2018 Dr. Navneet Goyal, BITS, Pilani 19
Aggregates: Example
• An increase of almost 400%
• Why it happened?
• Look at the aggregates involving Location and
Time!
• How can we control this aggregate explosion?
• Do the calculations again with 500 categories
and 5 time aggregates

4/20/2018 Dr. Navneet Goyal, BITS, Pilani 20


Aggregates: Example
Table Prod. Store Time Sparsity # Records
Base 10000 1000 100 10% 100 million
1-way 500 1000 100 50% 25 million
1-way 10000 100 100 50% 50 million
1-way 10000 1000 5 50% 25 million
2-way 500 100 100 80% 4 million
2-way 500 1000 5 80% 2 million
2-way 10000 100 5 80% 4 million
3-way 500 100 5 100% 0.25 million
Grand Total 210.25 million

4/20/2018 Dr. Navneet Goyal, BITS, Pilani 21


Aggregate Design Principle

Each aggregate must summarize at


least 10 and preferably 20 or more
lower-level items

4/20/2018 Dr. Navneet Goyal, BITS, Pilani 22


What’s Next?
• Aggregate Navigator
• How to store Aggregates?
– New Fact Table approach

4/20/2018 Dr. Navneet Goyal, BITS, Pilani 23


Q&A
Thank You
Lost, Shrunken, & Collapsed
Dimension Aggregates
Prof. Navneet Goyal
BITS-Pilani
Introduction
• While discussing aggregates, we mainly focused on
fact tables
• Need to understand what happens to dimension
tables when we create aggregates
• Lost, shrunken, & collapsed dimensions*

* Most of the material for this presentation has been taken from Lawrence Corr’s article “Lost ,
Shrunken, and Collapsed”
Introduction
Three ways of creating aggregates:
• Lost Dimension Aggregate
• Shrunken Dimension Aggregate
• Collapsed Dimension Aggregate

When we create aggregates, dimensions either get


lost, shrunken, or collapsed!!
Lost Dimension Aggregates
• Lost dimension aggregates are created when one
or more dimensions are completely excluded
while summarizing a fact table
• In SQL, you can create lost dimension aggregates
by grouping directly on a subset of the
dimensional surrogate keys in the fact table
(see the sketch after this list)
• Simplest aggregate to create and maintain, as it
requires no join
• This type of aggregate can be very small
• Offers significant performance gains over the base
fact table if several high-cardinality dimensions are
lost and therefore aggregated
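A minimal sketch, using SQLite in memory and hypothetical star-schema column names, of building a lost-dimension aggregate: the store dimension is dropped entirely by grouping only on the surviving surrogate keys, so no dimension join is needed.

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE sales_fact (product_key INT, store_key INT, date_key INT, sales REAL);
    INSERT INTO sales_fact VALUES
        (1, 10, 100, 5.0), (1, 11, 100, 7.0),
        (2, 10, 100, 3.0), (1, 10, 101, 4.0);

    -- Lost-dimension aggregate: store_key is excluded; no dimension joins needed
    CREATE TABLE sales_by_product_date AS
    SELECT product_key, date_key, SUM(sales) AS sales
    FROM sales_fact
    GROUP BY product_key, date_key;
""")

print(con.execute(
    "SELECT * FROM sales_by_product_date ORDER BY product_key, date_key").fetchall())
# [(1, 100, 12.0), (1, 101, 4.0), (2, 100, 3.0)]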
Lost Dimension Aggregates
• Reduced dimensionality of such an aggregate
significantly affects its applicability for
accelerating a broad range of queries.
• For example, an aggregate that lost the time
dimension would be of little use.
Lost Dimensions
[Figure: a base star schema and an aggregate with shrunken dimensions]

Figure taken from Lawrence Corr's article "Lost, Shrunken, and Collapsed"
Shrunken Dimension Aggregates
• Shrunken dimension aggregates have one or more
dimensions replaced by shrunken or rolled versions of
themselves.
• This technique can be combined with lost dimensions as well
(see Figure on slide #6)
• One shrunken dimension and one lost dimension
• Could represent a monthly-product-sales-by-store summary
of the original fact table. In this example, a monthly-grain
time dimension replaces the daily-grain time dimension
• The aggregate would be significantly smaller than the fact
table (though probably not by the full factor of 30 you might
expect, because not every product is sold every day) but
would still allow dimensional analysis by time at the month,
quarter, and year levels
Aggregates:
Shrunken Dimensions

[Figure: an aggregate with one shrunken dimension and one lost dimension]

Figure taken from Lawrence Corr’s article “Lost , Shrunken, and Collapsed”
Shrunken Dimension Aggregates
• Before you can build a shrunken dimension aggregate, one or
more shrunken dimension keys must replace original
surrogate keys in the fact table.
• If the shrunken dimension keys are carried in the original
dimensions, the aggregate can be populated by a SQL query
that joins the fact table to these dimensions and groups on a
combination of shrunken dimension keys and surviving
atomic-level surrogate keys.
Sales Star Schema

10

Figure taken from Kimball’s articles on Aggregate Navigator


Aggregates:
Shrunken Dimensions

4/20/2018 11 Dr. Navneet Goyal, BITS, Pilani

Figure taken from Kimball’s articles on Aggregate Navigator


Aggregates:
Shrunken Dimensions

4/20/2018 12 Dr. Navneet Goyal, BITS, Pilani

Figure taken from Kimball’s articles on Aggregate Navigator


Shrunken Dimension: Snowflaked Schemas
• For a snowflake schema, you can often create a shrunken
aggregate by replacing the surrogate key of a snowflaked
dimension with the key to one of the dimensions' outriggers.
(see Figure on slide # 14)

• Notice how an outrigger of the fact table becomes a directly


attached shrunken dimension of the aggregate
Aggregates:
Shrunken Dimensions

Back

Figure taken from Lawrence Corr’s article “Lost , Shrunken, and Collapsed”
Collapsed Dimension
• Collapsed dimension aggregates are created when dimensional
keys have been replaced with high-level dimensional attributes,
resulting in a single, fully denormalized summary table.
• Figure on slide #17 shows a collapsed aggregate with a small
number of surviving dimensional attributes from two
dimensions.
• This example could be a quarterly product category sales
summary.
• Collapsed dimension aggregates have the performance and
usability advantages of shrunken dimension aggregates,
without requiring you to maintain shrunken physical
dimensions and keys.
• In addition, they can offer further query acceleration because
they cut out join processing for rewritten queries.
Collapsed Dimension
• Can be considered only for high-level summaries where few
dimensional attributes remain and those attributes are relatively
short.
• Otherwise, the increased record length may contribute to the
collapsed table being too large, especially if many attributes are
included.
• A collapsed dimension aggregate might well have 10 times fewer
records than the fact table but its record length could easily be
three to five times longer, leaving the overall performance gain at
only two or three times.
• Collapsed dimension aggregates would be tenable only for high-
level summaries built from already moderately sized aggregates.
Collapsed Dimensions

Figure taken from Lawrence Corr’s article “Lost , Shrunken, and Collapsed”
Q&A
Thank You
VIEW
MATERIALIZATION

Prof. Navneet Goyal


Department of Computer Science & Information Systems
BITS, Pilani
Topics
Query Modification
View Materialization
How to exploit Materialized
Views to answer queries?

Examples taken from Chapter 25 of the Ramakrishnan book on Database Management Systems, 3e.
Query Modification
(Evaluate On Demand)
View CREATE VIEW RegionalSales(category,sales,state)
AS SELECT P.category, S.sales, L.state
FROM Products P, Sales S, Locations L
WHERE P.pid=S.pid AND S.locid=L.locid
Query
SELECT R.category, R.state, SUM(R.sales)
FROM RegionalSales AS R GROUP BY R.category, R.state

Modified Query:
SELECT R.category, R.state, SUM(R.sales)
FROM (SELECT P.category, S.sales, L.state
      FROM Products P, Sales S, Locations L
      WHERE P.pid=S.pid AND S.locid=L.locid) AS R
GROUP BY R.category, R.state
Materialized Views
Conventional views are “virtual”
Only definition is stored in Metadata
View definitions are executed during runtime
If views are complex, the queries using them
will have large response times
Not a desired situation in a DW/OLAP
environment where the amount of data is
huge
Materialized Views
A view whose tuples are stored in the
database is said to be materialized
Materialized views provide fast access,
like a (very high-level) cache!
But would require to be maintained as
the underlying tables change
Play a critical role in reducing response
time of OLAP queries
Materialized Views
Pre-computing some important, expensive, and
frequently required results
Should be based on expected workload
Just like doing some prior preparation for
dishes in a restaurant
If everything is done after receiving the order, the
waiting time of customers will be more
Better to do some prior preparation to minimize
waiting time (read response time of queries)
Issues in View
Materialization
Which views should we materialize?
(RL 7.5.2)
Given a query and a set of materialized
views, can we use the materialized views
to answer the query?
How frequently should we refresh
materialized views to make them
consistent with the underlying tables?
(RL 7.6.1)
How can we do this incrementally?
(RL 7.6.2)
View Materialization:
Example
Both queries require us to join the Sales table with another table and
aggregate the result. How can we use materialization to speed up these queries?

SELECT P.Category, SUM(S.sales)
FROM Product P, Sales S
WHERE P.pid=S.pid
GROUP BY P.Category

SELECT L.State, SUM(S.sales)
FROM Location L, Sales S
WHERE L.locid=S.locid
GROUP BY L.State
View Materialization:
Example
Pre-compute the two joins involved
(Product ⋈ Sales and Location ⋈ Sales)
Pre-compute each query in its entirety
OR let us define the following view:

CREATE VIEW TOTALSALES (pid, lid, total)


AS Select S.pid, S.locid, SUM(S.sales)
FROM Sales S
GROUP BY S.pid, S.locid
View Materialization:
Example
The View TOTALSALES can be
materialized & used instead of Sales in
our two example queries
SELECT P.Category, SUM(T.Total)
FROM Product P, TOTALSALES T
WHERE P.pid=T.pid
GROUP BY P.Category

SELECT L.State, SUM(T.Total)


FROM Location L, TOTALSALES T
WHERE L.locid=T.locid
GROUP BY L.State
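A small end-to-end sketch (SQLite in memory, toy data) of materializing TOTALSALES as a table and answering the category query from it rather than from Sales. SQLite has no MATERIALIZED VIEW syntax, so CREATE TABLE AS stands in for the materialization step.

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Product  (pid INT, Category TEXT);
    CREATE TABLE Location (locid INT, State TEXT);
    CREATE TABLE Sales    (pid INT, locid INT, sales REAL);
    INSERT INTO Product  VALUES (1, 'Clothing'), (2, 'Video');
    INSERT INTO Location VALUES (10, 'NY'), (20, 'CA');
    INSERT INTO Sales    VALUES (1, 10, 5), (1, 10, 7), (2, 20, 3), (1, 20, 4);

    -- Materialize the view TOTALSALES as a real table
    CREATE TABLE TOTALSALES AS
    SELECT pid, locid, SUM(sales) AS total
    FROM Sales
    GROUP BY pid, locid;
""")

# Rewritten query: uses the (smaller) TOTALSALES summary instead of Sales
rows = con.execute("""
    SELECT P.Category, SUM(T.total)
    FROM Product P, TOTALSALES T
    WHERE P.pid = T.pid
    GROUP BY P.Category
""").fetchall()
print(rows)   # [('Clothing', 16.0), ('Video', 3.0)]  (row order may vary)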
Try problems 25.9 & 25.10 from the
Ramakrishnan book (ch. 25)
SELECTION OF VIEWS
TO MATERIALIZE

Prof. Navneet Goyal


Department of Computer Science & Information Systems
BITS, Pilani
The material in this presentation
has been taken from the following
survey paper on view selection
methods:
“A survey on View Selection
Methods” by

SIGMOD Record March, 2012 (vol. 41, no. 1)


View Selection Problem
 For a given set of queries and a
space constraint, find the
optimal set of views to
materialize such that the benefit
is maximized
View Selection Problem
 Objectives:
 Minimize query processing cost
 Minimize view maintenance cost and

 Minimize space requirement

NP-complete problem!!
Cost Model
• Weighted query processing cost:

      Σ_i fQi · Qc(Qi, M)

where fQi is the query frequency of the query Qi
and Qc(Qi, M) is the processing cost corresponding
to Qi given a set of materialized views M.
Cost Model
• View maintenance cost:

      Σ_i fu(Vi) · Mc(Vi, M)

where fu(Vi) is the update frequency of the view Vi
and Mc(Vi, M) is the maintenance cost of Vi given
a set of materialized views M.
Static vs. Dynamic View Selection
 Static view selection approach is based on a given
workload
 In a dynamic view selection approach, the view selection is
applied as a query arrives
 The workload is built incrementally and changes over time
 Because the view selection has to be in synchronization
with the workload, any change to the workload should be
reflected to the view selection as well.
 In the dynamic set up, the portfolio of materialized views
can be changed over time and replaced with more
beneficial views in case of changing the query workload
Static vs. Dynamic View Selection
 We can start with a static approach and when the average
query cost goes above a threshold, we change the set of
materialized views
 By doing view selection all over again
 Incrementally (preferred)
 A dynamic view selection is often referred to as “view
caching”
View Selection
 Two main steps:
1. Identifies the candidate views which are promising for
materialization. Techniques used:
• Multiquery DAG,
• Query rewriting
• Syntactical analysis of the workload

2. The second step selects the set of views to materialize


under the resource constraints and by applying heuristic
algorithms
Multiquery DAG
 Detecting common sub-expressions between the different
queries and capturing the dependencies among them
 The dependence relation on queries (or views) has been
represented by using a directed acyclic graph also called a
DAG
 Most commonly used DAG in literature is the AND/OR view
graph
Multiquery DAG
 The union of all possible execution plans of each query forms
an AND/OR view graph [1/40].
 The AND-OR view graph described by Roy [2/42] is a DAG
composed of two types of nodes: Operation nodes and
Equivalence nodes
 Each operation node represents an algebraic expression (SPJ)
with possible aggregate function.
 An equivalence node represents a set of logical expressions
that are equivalent (i.e., that yield the same result).
 The operation nodes have only equivalence nodes as children
and equivalence nodes have only operation nodes as children.
 The root nodes are the query results and the leaf nodes
represent the base relations.
Multiquery DAG
 Example AND/OR View Graph
 View V1 corresponding to a
single query Q1, can be
computed from V6 and V3 or
R1 and V4
 If there is only one way to
answer or update a given
query, the graph becomes an
AND view graph
Multiquery DAG
 Multi-View Processing Plan
(MVPP)
 MVPP defined by Yang et al
[3] is a directed acyclic graph
in which the root nodes are
the queries, the leaf nodes
are the base relations and all
other intermediate nodes are
SPJ or aggregation views that
contribute to the construction
of a given query
Multiquery DAG
 The MVPP is obtained after
merging into a single plan
either individual optimal query
plans (similar to the AND view
graph) or all possible plans for
each query (similar to AND-OR
view graph)
 The difference between the
MVPP representation and the
AND-OR view graph or the AND
view graph representation is
that all intermediate nodes in
the MVPP represent operation
nodes.
Query Rewriting
 Query rewriting based approaches not only compute the set
of materialized views but also find a complete rewriting of
the queries over it.
 Input to view selection is not a multiquery DAG but the
query definitions.
 The view selection problem is modeled as a state search
problem using a set of transformation rules. These rules
detect and exploit common subexpressions between the
queries of the workload and guarantee that all the queries
can be answered using exclusively the materialized views.
 The completeness of the transformation rules may make the
complexity of state space search strategies exponential.
Syntactical Analysis of Queries
 The query workload is analyzed and a subset of relations is
picked from which one or more views are materialized, if
only it has the potential to reduce the cost of the workload
significantly.
 However, the search space for computing the optimal set of
views to be materialized may be very large.
Resource Constraints
 Three main models in literature:
 Unbounded
 Space constrained
 Maintenance cost constrained
Unbounded
 There is no limit on available resources (storage,
computation etc.)
 View selection problem thus reduces to minimization of
query cost and view maintenance cost

 Two major problems with this approach:


 Selected views may be too large to fit into available space
 The view maintenance cost may offset the performance advantage
provided by materialized views
Space Constrained
 The space constrained model minimizes the query
processing cost plus the view maintenance cost under a
space constraint:
Maintenance Cost Constrained
 This model constrains the time that can be allotted to keep
up to date the materialized views in response to updates
on base relations.
 The maintenance cost constrained model minimizes the
query processing cost under a maintenance cost constraint
Algorithms for View Selection
 Four types of heuristic algorithms have been proposed
in literature for view selection:
 Deterministic Algorithms [4/41, 5/37 – A* Algorithm]
 Randomized Algorithms [6/14 – Genetic, 7/30 – Simulated
Annealing]
 Hybrid Algorithms [8/56]
 Constraint Programming [9/49, 10/35]
Classification of View Selection Methods
Some Important References
1. N. Roussopoulos. The logical access path schema of a database.
IEEE Trans. Software Eng., 8(6):563–573,1982.
2. P. Roy, S. Seshadri, S. Sudarshan, and S. Bhobe. Efficient and
extensible algorithms for multi query optimization. In SIGMOD
Conference, pages 249–260, 2000.
3. J. Yang, K. Karlapalem, and Q. Li. A framework for designing
materialized views in data warehousing environment. In ICDCS,
1997
4. N. Roussopoulos. View indexing in relational databases. ACM
Trans. Database Syst., 7(2):258–290, 1982.
5. N.J. Nilsson. Problem-Solving Methods in Artificial Intelligence.
McGraw-Hill Pub. Co., 1971.
6. D.E. Goldberg. Genetic Algorithms in Search Optimization and
Machine Learning. Addison-Wesley, 1989.
Some Important References
7. P.J.M. Laarhoven and E.H.L. Aarts, editors. Simulated
annealing: theory and applications. Kluwer Academic Publishers,
Norwell, MA, USA, 1987.
8. C. Zhang, X. Yao, and J. Yang. An evolutionary approach to
materialized views selection in a data warehouse environment.
IEEE Transactions on Systems, Man, and Cybernetics, Part C,
31(3):282–294, 2001.
9. M. Wallace. Practical applications of constraint programming.
Constraints, 1:139–168, 1996. 10.1007/BF00143881.
10. I. Mami, R. Coletta, and Z. Bellahsene. Modeling view selection
as a constraint satisfaction problem. In DEXA, pages 396–410,
2011.
VIEW MAINTENANCE

Prof. Navneet Goyal


Department of Computer Science & Information Systems
BITS, Pilani
Topics
 View Maintenance
 Immediate
 Deferred
• Lazy
• Periodic (snapshot)
• Event-based
View Maintenance
• A materialized view is said to be refreshed
when it is made consistent with changes to
its underlying tables
• Often referred to as VIEW MAINTENANCE
• Two issues:
• HOW do we refresh a view when an underlying
table is modified? Can we do it incrementally?
• WHEN should we refresh a view in response to a
change in the underlying table?
View Maintenance
 The task of keeping a materialized view up-to-date with
the underlying data is known as materialized view
maintenance
 Materialized views can be maintained by recomputation on
every update
 A better option is to use incremental view maintenance
 Changes to database relations are used to compute
changes to materialized view, which is then updated
 View maintenance can be done by
 Manually defining triggers on insert, delete, and update of
each relation in the view definition
 Manually written code to update the view whenever database
relations are updated
 Supported directly by the database
View Maintenance
 Two steps:
 Propagate: Compute changes to view when data
changes.
 Refresh: Apply changes to the materialized view
table.
 Maintenance policy: Controls when we do
refresh.
 Immediate: As part of the transaction that
modifies the underlying data tables. (+
Materialized view is always consistent; - updates
are slowed)
 Deferred: Some time later, in a separate
transaction. (- View becomes inconsistent; + can
scale to maintain many views without slowing
updates)
Deferred Maintenance
Three flavors:
 Lazy: Delay refresh until next query on view;
then refresh before answering the query.
 Periodic (Snapshot): Refresh periodically.
Queries possibly answered using outdated
version of view tuples. Widely used, especially
for asynchronous replication in distributed
databases, and for warehouse applications.
 Event-based: E.g., Refresh after a fixed number
of updates to underlying data tables.
View Maintenance:
Incremental Algorithms
 Recomputing the view when an
underlying table is modified –
straightforward approach
 Not feasible to do so for all changes
made
 Ideally, algorithms for refreshing a
view should be incremental
 Cost of refresh is proportional to the
extent of the change
To be covered in the next lecture!!
Incremental View
Maintenance Algorithms

Prof. Navneet Goyal


Department of Computer Science & Information Systems
BITS, Pilani
Topics
Incremental View Maintenance
Motivation
Joins
Selections & Projections
Aggregate Operators
Prerequisite – Relational Algebra

Material adapted from Chapter 13 of Silberschatz book on Database System Concepts, 6e.
View Maintenance
The changes (inserts and deletes) to a relation
or expressions are referred to as its differential
Set of tuples inserted to and deleted from r are denoted
ir and dr respectively
To simplify our description, we only consider
inserts and deletes
We replace updates to a tuple by deletion of the tuple
followed by insertion of the update tuple
How to compute the change to the result of each
relational operation, given changes to its inputs?
Join Operation
Consider the materialized view v = r ⋈ s and an
update to r
Let rold and rnew denote the old and new states
of relation r
Consider the case of an insert to r:
We can write rnew ⋈ s as (rold ∪ ir) ⋈ s
and rewrite the above as (rold ⋈ s) ∪ (ir ⋈ s)
But (rold ⋈ s) is simply the old value of the materialized
view, so the incremental change to the view is just ir ⋈ s
Thus, for inserts: vnew = vold ∪ (ir ⋈ s)
Similarly, for deletes: vnew = vold − (dr ⋈ s)
(a small sketch of this follows)
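A minimal sketch (toy relations held as Python sets) of the delta rule above: only ir ⋈ s (or dr ⋈ s) is computed and applied to the stored view.

# r(a, b) and s(b, c) joined on b; the view v = r JOIN s is kept as a set.
r = {(1, "x"), (2, "y")}
s = {("x", 10), ("y", 20), ("y", 30)}

def join(r_side, s_side):
    return {(a, b, c) for (a, b) in r_side for (b2, c) in s_side if b == b2}

v = join(r, s)                      # initial materialization

# Insert a delta into r: maintain v incrementally, touching only i_r
i_r = {(3, "x")}
r |= i_r
v |= join(i_r, s)                   # v_new = v_old UNION (i_r JOIN s)

# Delete from r: subtract d_r JOIN s (set semantics, as in the slides)
d_r = {(2, "y")}
r -= d_r
v -= join(d_r, s)                   # v_new = v_old MINUS (d_r JOIN s)

assert v == join(r, s)              # matches a full recomputation
print(sorted(v))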
Selection & Projection
Operations
Selection: Consider a view v = σθ(r).
vnew = vold ∪σθ(ir)
vnew = vold - σθ(dr)
Projection is a more difficult operation
R = (A,B), and r(R) = {(a,2), (a,3)}
∏A(r) has a single tuple (a).
If we delete the tuple (a,2) from r, we should not delete the
tuple (a) from ∏A(r), but if we then delete (a,3) as well, we
should delete the tuple
For each tuple in a projection ∏A(r) , we will keep a count
of how many times it was derived
On insert of a tuple to r, if the resultant tuple is already in
∏A(r) we increment its count, else we add a new tuple with
count = 1
On delete of a tuple from r, we decrement the count of the
corresponding tuple in ∏A(r)
• if the count becomes 0, we delete the tuple from ∏A(r)
Aggregate Operations
count:  v = A g count(B) (r)
(count of the attribute B, after grouping r by attribute A)

When a set of tuples ir is inserted


• For each tuple t in ir, if the group t.A is present
in v, we increment its count, else we add a new
tuple (t.A, 1) with count = 1
When a set of tuples dr is deleted
• for each tuple t in dr, we look for the group t.A in
v, and subtract 1 from the count for the group.
• If the count becomes 0, we delete from v the tuple
for the group t.A
Example

Relation account grouped by branch-name:


branch_name account_number balance
Perryridge A-102 400
Perryridge A-201 900
Brighton A-217 750
Brighton A-215 750
Redwood A-222 700

branch_name g sum(balance) (account)


branch_name sum(balance)
Perryridge 1300
Brighton 1500
Redwood 700
Aggregate Operations
SUM – more complicated than COUNT
sum:  v = A g sum(B) (r)

Maintain the sum in a manner similar to count,


except we add/subtract the B value instead of
adding/subtracting 1 for the count
Additionally, maintain the count in order to detect
groups with no tuples. Such groups are deleted
from v
• Cannot simply test for sum = 0 (why?)
• Actual sum may be zero, but there are tuples in the group

To handle the case of avg, maintain the sum
and count aggregate values separately, and
divide at the end (see the sketch below)
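A small sketch of maintaining a grouped SUM together with its COUNT (so that empty groups can be detected and AVG derived), following the rules above; the dictionary-based layout and toy tuples are illustrative.

from collections import defaultdict

# Materialized aggregate: group key -> [count, sum] over attribute B
agg = defaultdict(lambda: [0, 0.0])

def insert(group, b):
    agg[group][0] += 1
    agg[group][1] += b

def delete(group, b):
    agg[group][0] -= 1
    agg[group][1] -= b
    if agg[group][0] == 0:          # group has no tuples left: drop it
        del agg[group]              # (cannot just test sum == 0)

for grp, b in [("Perryridge", 400), ("Perryridge", 900), ("Brighton", 750)]:
    insert(grp, b)

delete("Brighton", 750)             # the Brighton group disappears
insert("Perryridge", 100)

for grp, (cnt, total) in agg.items():
    print(grp, "sum =", total, "avg =", total / cnt)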
Aggregate Operations
min, max:  v = A g min(B) (r)
• Handling insertions on r is straightforward.
• Maintaining the aggregate values min and
max on deletions may be more expensive:
we need to look at the other tuples of r that are
in the same group to find the new minimum
when the tuple corresponding to the min
value of the group is deleted.
Example

student_record:

TEST  IDNO   Marks
T1    A-102  15
T1    A-103  20
T2    A-102  25
T2    A-103  10
T3    A-104  10

Test g Min(marks) (student_record):

TEST  Min(marks)
T1    15
T2    10
T3    10
View Materialization:
Summary
View Materialization
Selection of Views to Materialize
View Maintenance
Incremental View Maintenance
Coming up next …
Bitmap Indexes
Bitmap Indexes

Prof. Navneet Goyal


Department of Computer Science & Information Systems
BITS, Pilani
Topics
 Bitmap Indexes
 Advantages over hash and tree
based indexes
 AND & OR operations
 Multidimensional Queries
Bitmap Indexes
• A promising indexing approach for
processing complex ad hoc queries in
a read-mostly environment.
Indexing Structures
 Indexes are data structures that are used to
retrieve data efficiently from databases
 Indexes reduce search space before going to the
primary source data
 Index entries take less space as compared to
indexed data
 Extensively used in OLTP systems
 Examples:
 Tree-based indexes
 Hash-based indexes
 Suitable for retrieving a small fraction of records
satisfying a given selection condition
Indexing Structures
• Not suitable for queries that require millions of
records to be read to produce an answer (queries with large
footprints)
• Better to bypass the index & do a complete table scan
• 1-dimensional indexes are not suitable for
multidimensional queries in a DW environment
Indexing Structures
 For data warehouse environments we need indexing
structures that satisfy the following criteria:
 Retrieves large number of records from a table efficiently
 Handles multidimensional queries with conjunctive selection
conditions
 Facilitates joins and aggregations
 Space efficient
Bitmap Indexes
• Bitmap Indexes (P. O'Neil, 1987)
• Suitable for low-cardinality attributes
(adaptations to high-cardinality attributes are also
possible & available)
• Highly amenable to compression & encoding,
making them much more compact than hash- or
tree-based indexes
 Bitmap manipulations using bit-wise operators
AND, OR, XOR, NOT are very efficiently
supported by hardware
Why Bitmap Indices?
 Most multi-dimensional indices suffer from ‘curse of
dimensionality’ problem
 E.g. R-tree, Quad-trees, KD-trees
 Don’t scale to large number of dimensions ( > 10)
 Are efficient only if all dimensions are queried
 Bitmap indices
 Are efficient for multi-dimensional queries
 Query response time scales linearly in the actual
number of dimensions in the query
What is a Bitmap Index?
• Compact: one bit per distinct value per object.
• Easy and fast to build: O(n) vs. O(n log n) for trees.
• Efficient to query: use bitwise logical operations.
• Example queries on the attribute, say, A:
  • One-sided range query: A < 2  ->  b0 OR b1
  • Two-sided range query: 2 < A < 5  ->  b3 OR b4
• Efficient for multidimensional queries.
• No "curse of dimensionality".

Data values   b0 b1 b2 b3 b4 b5
0             1  0  0  0  0  0
1             0  1  0  0  0  0
5             0  0  0  0  0  1
3             0  0  0  1  0  0
1             0  1  0  0  0  0
2             0  0  1  0  0  0
0             1  0  0  0  0  0
4             0  0  0  0  1  0
1             0  1  0  0  0  0
Example:
A relation R with cardinality 12
Value-list Index:
A(R) B0 B1 B2 B3 B4 B5 B6 B7 B8
3 0 0 0 1 0 0 0 0 0
2 0 0 1 0 0 0 0 0 0
1 0 1 0 0 0 0 0 0 0
2 0 0 1 0 0 0 0 0 0
8 0 0 0 0 0 0 0 0 1
2 0 0 1 0 0 0 0 0 0
2 0 0 1 0 0 0 0 0 0
0 1 0 0 0 0 0 0 0 0
7 0 0 0 0 0 0 0 1 0
5 0 0 0 0 0 1 0 0 0
6 0 0 0 0 0 0 1 0 0
4 0 0 0 0 1 0 0 0 0
Example:
A relation R with cardinality 12
0 000000010000
1 001000000000
2 010101100000
3 100000000000
4 000000001001
5 000000000100
6 000000000010
7 000000001000
8 000010000000
• 9 bitmaps, each corresponding to a distinct value of A
• In each bitmap, the position of “1” gives the record number
• consider the bitmap corresponding to value 2. It indicate that record
numbers 2, 4, 6, and 7 have value 2
Example 2
Who buys gold jewelry? Consider a database of customers that contains
information about their name, address, age and salary. Assuming that age
and salary are the only relevant factors, suppose our database has the
following 12 jewelry customers:

Age  Salary
25   60
45   60
50   75
50   100
50   120
70   110
85   140
30   260
25   400
45   350
50   275
60   260

Example taken from: Hector Garcia-Molina, J D Ullman, & J Widom, Database
System Implementation, Pearson Education, 2001
Example

Bitmap on AGE:
25  100000001000
30  000000010000
45  010000000100
50  001110000010
60  000000000001
70  000001000000
85  000000100000

Bitmap on SALARY:
60   110000000000
75   001000000000
100  000100000000
110  000001000000
120  000010000000
140  000000100000
260  000000010001
275  000000000010
350  000000000100
400  000000001000
Example 2
Suppose we want to find the jewelry buyers having age 50 and salary 120.
To find them, take the bitmaps corresponding to age 50 and salary 120 and
AND them:

Age 50:  0 0 1 1 1 0 0 0 0 0 1 0
Sal 120: 0 0 0 0 1 0 0 0 0 0 0 0
AND:     0 0 0 0 1 0 0 0 0 0 0 0

This tells us that only the 5th record in the table satisfies the condition.
This is an example of an equality query.
Example 2
Find the jewelry buyers with an age in the range 45-55 and a salary in the
range 100-200.

Age 45:  0 1 0 0 0 0 0 0 0 1 0 0
Age 50:  0 0 1 1 1 0 0 0 0 0 1 0
OR:      0 1 1 1 1 0 0 0 0 1 1 0
  -> bit-vector representing age between 45-55

Sal 100: 0 0 0 1 0 0 0 0 0 0 0 0
Sal 110: 0 0 0 0 0 1 0 0 0 0 0 0
Sal 120: 0 0 0 0 1 0 0 0 0 0 0 0
Sal 140: 0 0 0 0 0 0 1 0 0 0 0 0
OR:      0 0 0 1 1 1 1 0 0 0 0 0
  -> bit-vector representing salary between 100-200
Example 2
Find the jewelry buyers with an age in the range 45-55 and a salary in the
range 100-200 (continued):

      0 1 1 1 1 0 0 0 0 1 1 0
AND   0 0 0 1 1 1 1 0 0 0 0 0
  =   0 0 0 1 1 0 0 0 0 0 0 0

This tells us that only the 4th and the 5th records, which are (50, 100) and
(50, 120), are in the desired range.

This is an example of a range query!
(A small sketch of these bitwise operations follows.)
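A minimal sketch (Python integers as bit-vectors, data from the example above) of building one bitmap per distinct value and answering the equality and range queries with bitwise AND/OR. Bit i of each integer corresponds to record i+1.

from collections import defaultdict

ages     = [25, 45, 50, 50, 50, 70, 85, 30, 25, 45, 50, 60]
salaries = [60, 60, 75, 100, 120, 110, 140, 260, 400, 350, 275, 260]
N = len(ages)

def build_index(values):
    """One bitmap (stored as a Python int) per distinct value; bit i = record i."""
    idx = defaultdict(int)
    for i, v in enumerate(values):
        idx[v] |= 1 << i
    return idx

age_idx, sal_idx = build_index(ages), build_index(salaries)

def bits_for_range(idx, lo, hi):
    """OR together the bitmaps of all values in [lo, hi]."""
    result = 0
    for v, bm in idx.items():
        if lo <= v <= hi:
            result |= bm
    return result

# Equality query: age = 50 AND salary = 120  -> record 5
hit = age_idx[50] & sal_idx[120]
print([i + 1 for i in range(N) if hit >> i & 1])       # [5]

# Range query: 45 <= age <= 55 AND 100 <= salary <= 200 -> records 4 and 5
hit = bits_for_range(age_idx, 45, 55) & bits_for_range(sal_idx, 100, 200)
print([i + 1 for i in range(N) if hit >> i & 1])       # [4, 5]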
Point to Note
 Bitmap indexes only tell us the position of records
that qualify to be included in the result set
 They do not retrieve the records
 To retrieve the records, we need to use a suitable
indexing structure (like B+- tree)
Space Requirement

Unique Column Values   Cardinality (%)   B-tree Space   Bitmap Space
500,000                50.00             15.29          12.35
100,000                10.00             15.21          5.25
10,000                 1.00              14.34          2.99
100                    0.01              13.40          1.38
5                      <0.01             13.40          0.78

Reference: Corey M et al., Oracle 8i Data Warehousing, TMH 2001


Bitmap Indexes for Large Datasets
1. Encoding: reduce the number of bitmaps or reduce
the number of operations (RL 7.7.2)
• Basic: equality encoding generates one bitmap for each bin
 Other: range encoding, interval encoding, …
2. Compression: reduce the size of each bitmap, may
also speedup the logical operations (RL 7.7.2)
 Find an efficient compression scheme to reduce query
processing time
 BBC & WAH
3. Binning: reduce the number of bitmaps (Next)
 Say 0 <= NLb < 4000, we can use 20 equal size bins
[0,200)[200,400)[400,600)
 Equi-width & equi-depth
Bitmaps for High Cardinality Attributes
 Bitmap indexes are well suited for low-cardinality
columns
 In a DW environment, it may be required to use
bitmap indexes for high cardinality attributes
 Space requirements go up as we build a bitmap
index for a high cardinality attribute
 For example, in scientific databases, the
cardinality of certain attributes can be very high
 In such situations, we use a technique called
“Binning”
 Binning reduces the number of bitmaps
 Say 0 <= NLb < 4000, we can use 20 equal size bins
[0,200)[200,400)[400,600) and so on…
 Equi-width & equi-depth
Binning
 The basic idea of binning is to build a bitmap for a
bin rather than each distinct attribute value
 One bitmap for each bin rather than for each
distinct value
 This strategy disassociates the number of bitmaps
from the attribute cardinality and allows one to
build a bitmap index of a prescribed size,
irrespective of the attribute cardinality
 Reduces the space requirement, but introduces
other problems:
 How to Bin?
 Candidate checking
Internal and Edge Bins
Record ID   Original Value   0-10   11-20   21-30   31-40   41-50
1           5                1      0       0       0       0
2           34               0      0       0       1       0
3           23               0      0       1       0       0
4           9                1      0       0       0       0
5           12               0      1       0       0       0
6           6                1      0       0       0       0
7           34               0      0       0       1       0
8           42               0      0       0       0       1
9           11               0      1       0       0       0
10          22               0      0       1       0       0
11          44               0      0       0       0       1
12          23               0      0       1       0       0
13          18               0      1       0       0       0
14          41               0      0       0       0       1
15          39               0      0       0       1       0

For the two-sided range query 8 < A < 37, bins [11-20] and [21-30] are
internal bins (every record in them qualifies), while bins [0-10] and
[31-40] are edge bins (their records are only candidates and must be checked).

Figure 13: Two-sided range query 8 < A < 37 on a bitmap index with binning
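A small sketch (the 15 values and equi-width bins from the figure above) of bin-level bitmaps and the candidate check an equality query must then perform.

values = [5, 34, 23, 9, 12, 6, 34, 42, 11, 22, 44, 23, 18, 41, 39]
bin_edges = [(1, 10), (11, 20), (21, 30), (31, 40), (41, 50)]   # equi-width bins

# One bitmap (Python int) per bin instead of per distinct value
bitmaps = [0] * len(bin_edges)
for i, v in enumerate(values):
    for b, (lo, hi) in enumerate(bin_edges):
        if lo <= v <= hi:
            bitmaps[b] |= 1 << i

def equality_query(x):
    """A = x: the bin bitmap only yields candidates; each candidate row must
    then be fetched and compared against x (the 'candidate check')."""
    b = next(i for i, (lo, hi) in enumerate(bin_edges) if lo <= x <= hi)
    candidates = [i for i in range(len(values)) if bitmaps[b] >> i & 1]
    hits = [i for i in candidates if values[i] == x]      # requires reading rows
    return candidates, hits

cands, hits = equality_query(23)
print("candidates:", [c + 1 for c in cands])   # all records in bin [21, 30]
print("hits:      ", [h + 1 for h in hits])    # records whose value is exactly 23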
Binning Bottleneck
• The main bottleneck of the binning strategy is the
candidate check problem: the higher the
number of candidates that need to be checked
against the query constraints, the higher the
query processing cost
 Candidate check problem:
 Suppose you have an equality query A=8 and
there is a bin [0-10]. Each tuple that has a 1 for
the bin is a candidate for the result set.
 Need to retrieve each tuple in the bin to check if
its value is 8 or not
 Expensive exercise as it involves disk I/Os
Binning Bottleneck
 A good binning strategy will minimize the
number of candidate checks*
 Equivalently, a good binning strategy will
minimize the number of edge bins by suitably
choosing the bin boundaries*

* Navneet Goyal & Yashvardhan Sharma


New Binning Strategy for Bitmap Indices on High Cardinality Attributes
ACM Compute 2009, Bangalore, 09th-11th Jan. 2009
Related Topic
 Bitmapped Join Indexes
Coming up next …
 Compression Techniques for
Bitmap Indexes
Compression of Bitmap
Indexes

Prof. Navneet Goyal


Department of Computer Science & Information Systems
BITS, Pilani
Topics
 Data Compression
 Run Length Encoding (RLE)
 Byte-Aligned Bitmap Codes
(BBC)
 Word-Aligned Hybrid (WAH)
Data compression implies sending or storing a smaller
number of bits. Although many methods are used for this
purpose, in general these methods can be divided into two
broad categories: lossless and lossy methods.

Figure 15.1 Data compression methods

Source: Chapter 15: Foundations of Computer Science Cengage Learning


Run-length encoding
Run-length encoding is probably the simplest method of
compression. It can be used to compress data made of any
combination of symbols. It does not need to know the
frequency of occurrence of symbols and can be very efficient
if data is represented as 0s and 1s (therefore, suitable for
Bitmap Indexes)
The general idea behind this method is to replace
consecutive repeating occurrences of a symbol by one
occurrence of the symbol followed by the number of
occurrences.
The method can be even more efficient if one symbol
is more frequent than the other in its bit pattern
Source: Chapter 15: Foundations of Computer Science Cengage Learning
Run Length Encoding

Figure 15.2 Run-length encoding example

Source: Chapter 15: Foundations of Computer Science Cengage Learning


Run Length Encoding

Figure 15.3 Run-length encoding for two symbols

Source: Chapter 15: Foundations of Computer Science Cengage Learning


Run Length Encoding
 Attribute F (cardinality m) of a table (cardinality n)
 Number of bits required to represent the bitmap on
F will be mn.
 If block size = 4096 bytes, then we can fit
8*4096=32,768 bits in one block
 Number of blocks required = mn/32768.
 This number increases with an increase in m. But as
m increases the number of 1’s in any bit-vector
would decrease.
 Probability (bit=1)=1/m
Run Length Encoding
 If 1’s are rare then the set of all bit-vectors becomes a
sparse matrix, and we can encode the bit-vectors so that
they take much fewer than n bits on average.
 A common encoding technique is run-length encoding
(RLE)
Run Length Encoding
 Represent a run, that is, a sequence of i 0’s
followed by a 1, by some suitable binary
encoding of the integer i
 Concatenate the codes for each run together,
and that sequence of bits represents the entire
bit-vector. This is done for all bit-vectors in a
bitmap
RLE Example
 Consider the bit-vector 000101
 It contains 2 runs, 0001 and 01, of lengths 3 & 1
respectively
 3=(11)2 and 1=(1)2
 The run-length encoding of 000101 would become 111
 The bit-vector 010001 would also be represented by 111
 111 cannot be therefore decoded uniquely into one bit-
vector
 As a result, we cannot use binary representation of lengths
for run-length encoding
RLE Example
 Let j = number of bits in the binary representation of i
(so j ≈ log2 i)
 j is represented in unary by j-1 1’s and a single 0.
Then we follow this with i in binary.
 Consider the bit-vector 00000000000001
 i=13, then j=4. Thus the encoding of i begins with 1110.
We follow this by i in binary, or 1101. Thus the encoding
for 13 is 11101101
RLE Example:
Encoding & Decoding
 Encode 000101
 Two runs of 3 and 1. For i=3, j=2 therefore,
0001 is encoded as 1011
 For i=1, j=1 therefore, 01 is encoded as 01.
 So 000101 will be encoded as 1011 01
RLE Example:
Encoding & Decoding
 Now decode 101101
 First zero at 2nd bit therefore, j=2. Next 2 bits are
11. So the first four bits are decoded as 0001.
 Consider the second part. First bit is 0 therefore j=1.
Next bit is 1. So the last 2 bits are decoded as 01
 So 101101 is decoded to 000101
 Note that every decoded bit-vector will have a 1 in
the last bit; trailing zeroes are not encoded
 Trailing zeroes can, however, be recovered using the
cardinality of the relation
 Do we need to recover trailing zeroes?
Example
A relation with 12 tuples (Age, Salary):
(25,60) (45,60) (50,75) (50,100) (50,120) (70,110) (85,140) (30,260) (25,400) (45,350) (50,275) (60,260)

Bitmap index on AGE:
25: 100000001000
30: 000000010000
45: 010000000100
50: 001110000010
60: 000000000001
70: 000001000000
85: 000000100000

Bitmap index on SALARY:
60:  110000000000
75:  001000000000
100: 000100000000
110: 000001000000
120: 000010000000
140: 000000100000
260: 000000010001
275: 000000000010
350: 000000000100
400: 000000001000

Example taken from: Hector Garcia-Molina, JD Ullman, & J Widom, Database System Implementation, Pearson Education, 2001
AND/OR Operations
 Decode & operate on the original bit-vectors
 However, we do not need to decode entire bit-
vector at once
 Decode one run at a time
 OR – produce a 1 whenever we encounter a 1 in
either bit-vector
 AND – produce a 1 iff both operands have their
next 1 at the same position
AND/OR Operations
 Age 25 1 0 0 0 0 0 0 0 1 0 0 0
 Age 30 0 0 0 0 0 0 0 1 0 0 0 0
 Corresponding encoded bit-vectors are 00110111
and 110111 respectively (check yourself!!)
AND/OR Operations
 Age 25 1 0 0 0 0 0 0 0 1 0 0 0 (00110111)
OR
 Age 30 0 0 0 0 0 0 0 1 0 0 0 0 (110111)
 First runs of both bit-vectors are easily decoded as 0 & 7
respectively implying that first bit-vector has first 1 at position
1 & second bit-vector has first 1 at position 8 (generate 1 in
position 1)
 Decoding second run of Age 25, we see that the run is 7. So
next 1 is at position 9.
 Therefore, the bit-vector generated by OR operation is 1
followed by six 0’s, followed by 1 at 8th position coming from
Age 30, and followed by 1 coming from Age 25 at position 9
100000011
 Result: 1st, 8th, & 9th records are retrieved!! (Try the AND operation yourself)
Operation-efficient Compression
Methods
 Byte-aligned bitmap code: BBC
(Antoshenkov, G.,1994, 1996)
• Uses run-length encoding
• Byte alignment, optimized for space efficiency
• Encode/decode bitmaps 8 bits (one byte) at a time
• Compresses nearly as well as LZ77 (gzip)
• Bitwise logical operations can be performed on compressed
bitmaps directly
• Adopted since Oracle 7.3
 Word-aligned scheme: WAH
(Wu et al., 2004, 2006)
• Uses run-length encoding
• Word alignment
• Designed for minimal decoding to gain speed
• Used in Lawrence Berkeley Lab for high-energy physics
Trade-off: Compression Schemes
[Figure: comparison of different compression techniques on a speed vs. space trade-off plot (Wu et al., 2001). Uncompressed bitmaps and WAH sit at the high-speed end, BBC in between, and general-purpose schemes such as gzip, PacBits, and ExpGol at the space-efficient end.]
Summary of Module 7
 Aggregation
 Sparsity failure
 Aggregate Navigator
 Partitioning
 Partitioning wrt time dimension
 View Materialization
 precomputing
 Bitmap Indexes
Role of Metadata

Prof. Navneet Goyal


Department of Computer Science & Information Systems
BITS, Pilani
Topics
 Metadata
 Metadata Components
 Role in Data Warehousing
Metadata
 The first image most people have of the data
warehouse is a large collection of historical,
integrated data.
 While that image is correct in many regards,
there is another very important element of the
data warehouse that is vital – metadata
 Metadata is data about data
 Not a new concept – has been around as long as
there have been programs and the data on
which these programs operate
Metadata
Metadata
 While metadata is not new, its role in DW is
certainly new
 Metadata has been continuously ignored in
conventional IT systems
 Ignoring it in a DW environment is to defeat the
very purpose of a DW
 Metadata assumes an important role in a DW
 The importance of metadata in a DW could be
attributed to its difference from operational
systems
Metadata
 The community served and the importance to
that community are the primary reasons as to
why metadata is so important in the DW
environment
 IT professional is the primary community
involved in the operational development and
maintenance
 Computer literate and is able to find its way around
systems
 DW serves the DSS analysts community
 Not necessarily computer literate and needs help in
navigating the system
Metadata
 Second reason – DSS analyst needs to know in order to do
his/her job is what data is available and where it is in the
data warehouse.
 When the DSS analyst receives an assignment, the first
thing the DSS analyst needs to know is what data there is
that might be useful in fulfilling the assignment.
 For this, the metadata for the warehouse is vital to the
preparatory work done by the DSS analyst.
 The information technology professional has been doing
his/her job for many years while treating metadata
passively.
Mapping
 A basic part of the data warehouse environment
is that of mapping from the operational
environment into the data warehouse
 mapping from one attribute to another,
 conversions,
 changes in naming conventions,
 changes in physical characteristics of data,
 filtering of data, etc.
Mapping
 Consider the vice president of marketing asked for a new report.
 Upon inspection, the vice president proclaims the report to be fiction
 The credibility of the DSS analyst who prepared the report goes down
until he/she can prove the data in the report to be valid.
 The DSS analyst first looks to the validity of the data in the warehouse. If
the DW data has not been reported properly, then the reports are
adjusted.
 However, if the reports have been made properly from the DW, the DSS
analyst is in the position of having to go back to the operational source to
salvage credibility.
 At this point, if the mapping data has been carefully stored, then the DSS
analyst can quickly and gracefully go to the operational source
 However, if the mapping has not been stored or has not been stored
properly, then the DSS analyst has a difficult time defending his/her
conclusions to management.
 The metadata is a natural place for the storing of mapping information.
Mapping
Managing Data over Time
 Another important function of the metadata is that of
management of data over time
 Time span of data in a DW is much longer than operational
systems
 5-10 year span is normal vs. few weeks to 3 months in
operational systems
 The storage of multiple data structures for the data
warehouse is contrasted with the storage of a single data
structure as found in the operational environment.
 One of the fundamental concepts of data management in
the operational environment is that there is one and only
one correct definition of data
Managing Data over Time
 This assumption is 180 degrees the opposite of the data
found in the data warehouse.
 Managing data over a long spectrum
 of time then is another reason why metadata in the data
warehouse is so important
Managing Data over Time
Versioning of Data
 Because data must be managed over a long spectrum of
time in the DW environment (and correspondingly, the
associated metadata must be likewise managed),
metadata must be "versioned".
 Versioned data is data that allows changes to be
continuously tracked over a long period of time.
Versioning of Data
Versioning of Data
 In Figure 5 it is seen that both the current status of a
structure can be found AND the history of changes to that
structure can likewise be found.
 This tracking is a necessary feature for ALL types of
metadata in the DW store of metadata.
 One of the characteristics of versioning is that the data
trail be continuous and non-overlapping.
 In other words it is important that for any moment in the
past there be one and only one value or status of
metadata.
 Effective from & Effective to dates on metadata component
Versioning of Data
 When the DSS analyst wants to interrogate a calculation or
report made in the past that the versioning allows the DSS
analyst to understand what data was and where it came
from as it entered the warehouse.
 Without versioning, the only meaningful data a DW
environment has is data written for and managed under
the most current definition and structure of data.
Metadata Components
Basic Components
 The basic components of the DW metadata store include
the tables that are contained in the warehouse, the keys of
those tables, and the attributes
Metadata Components
Mapping
 One of the most important components of DW metadata
 identification of source field(s),
 simple attribute to attribute mapping,
 attributes conversions,
 physical characteristic conversions,
 encoding/reference table conversions,
 naming changes,
 key changes,
 defaults,
 logic to choose from multiple sources,
 algorithmic changes, and so forth.
 Like all other data warehouse metadata, these
components should be versioned
Metadata Components
Mapping
Metadata Components
Extract History
 The actual history of extracts and transformations of data
coming from the operational environment and heading for
the DW is another component that belongs in the DW
metadata store
 It tells the DSS analyst when data entered the data
warehouse
 The DSS analyst has many uses for this type of
information
Metadata Components
Extract History
 when was the last time data in the warehouse was
refreshed.
 if processing and the assertions of analysis have changed.
 whether the results obtained for one analysis are different
from results obtained by an earlier analysis because of a
change in the assertions or a change in the data
Metadata Components
Extract History
Metadata Components
Miscellaneous
 Alias
 Status
 Volumetric
 Aging/Purging criteria
Metadata Components
Miscellaneous
 Alias – recall role playing dimensions
 Status - In some cases a table is undergoing design. In
other cases the table is inactive or may contain misleading
data
 Volumetric –
 the number of rows currently in the table,
 the growth rate of the table,
 the statistical profile of the table,
 the usage characteristics of the table,
 the indexing for the table and its structure,
 Aging/Purging criteria - definition of the life cycle of data
warehouse data
Metadata Components
Miscellaneous
Coming up next…
 Types of Metadata
Types of Metadata

Prof. Navneet Goyal


Department of Computer Science & Information Systems
BITS, Pilani
The presentation is based on Chapter 9 –
The Significant role of Metadata
Data Warehousing Fundamentals for IT
Professionals
Wiley, 2e, 2010
Types of Metadata
 Data Acquisition
 Data Storage
 Information Delivery

Another Classification:
 Business Metadata
 Technical Metadata
Metadata
 Data about the data
 Table of contents of data
 Catalog for the data
 DW Atlas
 DW Roadmap
 DW Directory
 Glue that holds the DW together
 Tongs to handle the data
 The nerve center
Metadata
 It is all the information that defines and
describes the structures, operations, and
contents of a DW system

- Ralph Kimball, The DW Lifecycle Toolkit, 2e,


Wiley, 2008
Metadata
 Information about the table/entity “Customer”
Entity Name: Customer
Alias Names: Account, Client
Defn.: A person or an organization that purchases goods or services
from the company
Remarks: Regular, current, & past customers
Source Systems: Finished goods orders, Maintenance contracts
Create Date: Jan. 23, 2010
Last update Date: Jan. 21, 2012
Update Cycle: Weekly
Last full refresh cycle: Dec. 29, 2011
Full refresh cycle: every 6 months
Data Quality reviewed: Jan. 25, 2012
Planned archival: every 6 months
Responsible User: ABC
Metadata Classification
 Classification by functional areas in the DW
 Data Acquisition
 Data Storage

 Information Delivery

Every DW process occurs in one of these three areas


Data Acquisition
In this area, the data warehouse processes relate
to the following functions:
 Data Extraction
 Data Transformation
 Data Cleansing
 Data Integration
 Data Staging
Data Acquisition
Data Acquisition Metadata
 Development tools are used to record metadata
 Other tools used in this area or in some other
area may use the metadata recorded by other
tools in this area
 For example: when a query tool is used to create
standard queries, you will be using metadata
recorded by processes in the data acquisition
area (query tool is meant for a process in the
Information Delivery area)
Data Acquisition Metadata
 Used for administering & monitoring the ongoing
functions of the DW
 Monitoring ongoing Extraction & Transformation
 Ongoing load images are created properly using the Data Acquisition
metadata

 Used by users to find the data sources for the


data elements in his/her queries
Data Storage
In this area, the data warehouse processes relate
to the following functions:
 Data loading
 Data archiving
 Data management
Data Storage
Data Storage Metadata
 The tools record the metadata elements during
the development phases as well as while the
data warehouse is in operation after deployment
 Metadata used for development, administration,
and by the users
 Used for full data refreshes and incremental data loads
 DBA will use for processes of recovery, backup, and tuning
 For purging and archiving of data
 User wanting to do a drill-down from total quarterly sales to
sales districts, would want to know about when was the last
time the data on district delineation was loaded
Information Delivery
In this area, the data warehouse processes relate
to the following functions:
 Report generation
 Query processing
 Complex analysis
Information Delivery
Information Delivery Metadata
 Most of the processes in this area are meant for end users
 While using the processes, end-users generally use
metadata recorded in processes of the other two areas of
data acquisition and data storage.
 When a user creates a query with the aid of a query
processing tool, he or she can refer back to metadata
recorded in the data acquisition and data storage areas
 can look up the source data configurations, data
structures, and data transformations from the data
acquisition area metadata
 can look up date of last full refresh and the incremental
loads for various tables from the data storage area
metadata
Information Delivery Metadata
 Mostly, metadata corresponding to
 Predefined queries
 Predefined reports
 Input parameter definitions for queries and reports
 Information for OLAP
Metadata: Another Classification
 Business Metadata
 Technical Metadata
Business Metadata
 Describes the content of the DW in more user accessible
terms
 What data you have, where from it came, what it means
and its relationship with other data in the DW
 Display name & content description fields
 Often serves as documentation for DW system
 May include additional layers of categorization that
simplifies user’s view
 Subsetting tables into business process oriented groups
 Grouping related columns in a dimension
 Metadata models used by major BI tools provide these kinds of
groupings
 When users browse the metadata to see what is there in
the DW, they are primarily viewing business metadata
Business Metadata
 Less structured than technical metadata
 Originates from textual documents, spreadsheets, and
even from business rules and policies that are not written
down completely
 Most business users do not have technical skills to create
their own queries or format their own reports. Interested
in canned queries and reports
 Cryptic names not preferred
Examples of Business Metadata
 Connectivity procedures
 Security and access privileges
 The overall structure of data in business terms
 Source systems
 Source-to-target mappings
 Data transformation business rules
 Summarization and derivations
 Table names and business definitions
 Attribute names and business definitions
Examples of Business Metadata
 Data ownership
 Query and reporting tools
 Predefined queries
 Predefined reports
 Report distribution information
 Common information access routes
 Rules for analysis using OLAP
 Currency of OLAP data
 Data warehouse refresh schedule
Business Metadata
 A representative list of questions a business user can ask
from the business metadata:
 How can I sign onto and connect with the data warehouse?
 Which parts of the data warehouse can I access?
 Can I see all the attributes from a specific table?
 What are the definitions of the attributes I need in my query?
 Are there any queries and reports already predefined to give the
results I need?
 Which source system did the data I want come from?
 What default values were used for the data items retrieved by my
query?
Business Metadata
 A representative list of questions a business user can ask
from the business metadata:
 What types of aggregations are available for the metrics needed?
 How is the value in the data item I need derived from other data
items?
 When was the last update for the data items in my query?
 On which data items can I perform drill down analysis?
 How old is the OLAP data? Should I wait for the next update?
Business Metadata
 Who benefits?
 Managers
 Business analysts
 Power users
 Regular users
 Casual users
 Senior managers/junior executives
Technical Metadata
 Defines the objects and processes that make up the DW
system from a technical perspective
 System metadata:
 Defines data structures
 Table, fields, indexes, and partitions in the relational engine
 Databases, dimensions, and measures
 In the ETL process:
 Source & target for a particular task
 Transformations (including business rules and data quality screens) and
their frequency
 Front room:
 Defines the data model
 How data is displayed to the users
 Some technical metadata elements are useful for the
business users like table and column names
 Definition of a table partition functions is of no interest to business user
Examples of Technical Metadata
 Data models of source systems
 Record layouts of outside sources
 Source-to-staging area mappings
 Staging area-to-data warehouse mappings
 Data extraction rules and schedules
 Data transformation rules and versioning
 Data aggregation rules
 Data cleansing rules
 Summarization and derivations
 Data loading and refresh schedules and controls
 Job dependencies
 Program names and descriptions
 Data warehouse data model
 Database names
Examples of Technical Metadata
 Table/view names
 Column names and descriptions
 Key attributes
 Business rules for entities and relationships
 Mapping between logical and physical models
 Network/server information
 Connectivity data
 Data movement audit controls
 Data purge and archival rules
 Authority/access privileges
 Data usage/timings
 Query and report access patterns
 Query and reporting tools
Technical Metadata
 What databases and tables exist?
 What are the columns for each table?
 What are the keys and indexes?
 What are the physical files?
 Do the business descriptions correspond to the technical
ones?
 When was the last successful update?
 What are the source systems and their data structures?
 What are the data extraction rules for each data source?
 What is source-to-target mapping for each data item in the
data warehouse?
 What are the data transformation rules?
Technical Metadata
 What default values were used for the data items while
cleaning up missing data?
 What types of aggregations are available?
 What are the derived fields and their rules for derivation?
 When was the last update for the data items in my query?
 What are the load and refresh schedules?
 How often data is purged or archived? Which data items?
 What is schedule for creating data for OLAP?
 What query and report tools are available?
Technical Metadata
 Who benefits?
 Project manager
 Data warehouse administrator
 Database administrator
 Metadata manager
 Data warehouse architect
 Data acquisition developer
 Data quality analyst
 Business analyst
 System administrator
 Infrastructure specialist
 Data modeler
 Security architect
Coming up next…
 Design & Implementation of Metadata
Metadata - Design & Implementation

Prof. Navneet Goyal


Department of Computer Science & Information Systems
BITS, Pilani
The presentation is based on Chapter 4 –
Introducing the technical architecture
The Data Warehouse Lifecycle Toolkit by Ralph
Kimball, Wiley, 2e, 2008
Metadata
 Metadata Integration and standardization are
the main challenges to metadata management
Metadata Integration
 Most tools and processes in a DW create and
manage their own metadata and keep it in a
local physical storage location called repository
 Several metadata repositories scattered around
the DW system using different storage types
from relational tables to XML files to documents
to spreadsheets
 Overlapping content
Metadata Integration
 Single integrated repository for DW system
metadata is highly desirable and useful
 Impact Analysis – helps in identify the impact of
making a change to the DW system
 Audit & Documentation – lineage analysis
 Metadata Quality & Management – multiple copies of
metadata going out of sync
Metadata Integration
Options
 Most DW systems are based on products from multiple
vendors
 Standard called the Common Warehouse Metamodel
(CWM), along with XMI, which is an XML based metadata
interchange standard, and the MetaObject Facility (MOF)-
the underlying storage facility for CWM
 These standards are managed by Object Management
Group
 Vendor support for working with a standard metadata
repository is slow in coming
 Unlikely that a single standard repository will emerge in a
multi-vendor environment
 Doing it yourself is not worth the effort!!
Metadata Integration
Options
Single source DW/BI System Vendors
 Major DB, ERP, & BI tool vendors are building complete,
credible DW/BI system technology stacks to offer end-to-
end BI
 As they build or rebuild their products, they are including
metadata management as part of their toolsets and
designing their tools to share a central metadata
repository
 In this scenario, a vendor will have no excuse for avoiding
a single repository
 Best hope!!
Metadata Integration
Options
Core Vendor Product
 Large organization – multiple vendors in your DW/BI
system
 Some ETL and BI tool vendors are working hard to become
the metadata repository supplier by offering metadata
management and repository tools that interface with their
tools and also with other vendor tools
 If your core vendors are not offering any such solutions,
then there are metadata management systems that
comply with CWM and offer a central metadata repository
and modules that can read/write from/to most of the
major DW/BI vendor metadata repositories
Metadata Implementation
Options
Do it Yourself
 Not unreasonable to pick certain metadata elements to
manage and let the rest manage itself
 Focus on business metadata because technical metadata
has to be correct or the system breaks down
 As an architect of a DW/BI system, you need to ensure
that technical metadata is right
 No one is responsible for business metadata
Metadata Summary*
 Metadata is an amorphous subject if we focus on each
little parcel of metadata because it is spread wide
 Few unifying principles:
 Inventory all your metadata to know what you have
 Subject your metadata to standard software library practices
including version control, version migration, and reliable backup and
recovery
 Document your metadata (metadata about metadata)
 Appreciate the value of your metadata assets

* Chapter 9 –
The Significant role of Metadata
Data Warehousing Fundamentals for IT Professionals
Wiley, 2e, 2010
Support for DW in RDBMS

Prof. Navneet Goyal


Computer Science Department
BITS, Pilani
• Most of the material for this presentation has been
taken or adapted from Oracle 9i & 11g Data
Warehousing guides available on the Oracle site.
• The contents have a bias towards DW features
supported by Oracle. Many other RDBMSs also
support similar features.



Support in RDBMS
• Aggregation & Aggregate Navigator
• Partitioning
• Materialized Views
• Bitmap Index & Bitmap Join Index
• Dimensions
• Online Aggregation
• SQL Extensions (RL 9.1.2)

Aggregation & Aggregate Navigator
• Most RDBMS now have support for aggregates in
the form of Aggregate Navigator (AN)
• AN helps them to improve performance of
queries requiring aggregates and to match the
performance of Multidimensional Databases
• Aggregate Navigation Algorithm and metadata
helps to target most suitable aggregate for a
given query
• Users/queries need not be aware of existing
aggregates
• Implemented using Materialized Views
Partitioning
• Data warehouses often contain very large tables
and require techniques both for managing these
large tables and for providing good query
performance across them
• Partitioning is supported in most RDBMSs
• Even if it is not supported, one can create them
using the concept of views
– Manually move data from the table to be partitioned to
its partitions (tables)
– Create a view using union of partitions, giving an
illusion that the original table still exists
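A minimal sketch of this view-based approach (table names are hypothetical; the original SALES table is assumed to have been renamed to SALES_OLD after its rows were moved into the per-year tables):

-- Data manually moved into per-year "partition" tables
CREATE TABLE sales_2016 AS SELECT * FROM sales_old WHERE sales_date <  DATE '2017-01-01';
CREATE TABLE sales_2017 AS SELECT * FROM sales_old WHERE sales_date >= DATE '2017-01-01';

-- The view gives the illusion that the original table still exists
CREATE OR REPLACE VIEW sales AS
  SELECT * FROM sales_2016
  UNION ALL
  SELECT * FROM sales_2017;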
Types of Partitioning
Oracle offers four partitioning methods:
• Range Partitioning (already discussed in Module 7)
• Hash Partitioning
• List Partitioning
• Composite Partitioning

Hash Partitioning
• Distributes data evenly among the partitions/devices
• Easy to use partitioning
• Oracle uses a linear hashing algorithm to avoid skew
• Number of partitions should be a power of 2
• Users cannot specify alternate hashing functions or
algorithms

Hash Partitioning: Example
• CREATE TABLE sales_hash
(salesman_id NUMBER(5),
salesman_name VARCHAR2(30),
sales_amount NUMBER(10),
week_no NUMBER(2))
PARTITION BY HASH(salesman_id)
PARTITIONS 4
STORE IN (data1, data2, data3, data4);

List Partitioning
• List partitioning enables you to explicitly control how
rows map to partitions.
• Specify a list of discrete values for the partitioning
column in the description for each partition.
• Different from range partitioning, where a range of
values is associated with a partition and with hash
partitioning, where you have no control of the row-to-
partition mapping
• Advantage of list partitioning is that you can group
and organize unordered and unrelated sets of data in a
natural way
List Partitioning: Example
CREATE TABLE sales_list
(salesman_id NUMBER(5),
salesman_name VARCHAR2(30),
sales_state VARCHAR2(20),
sales_amount NUMBER(10),
sales_date DATE)
PARTITION BY LIST(sales_state)
(
PARTITION sales_west VALUES IN ('California', 'Hawaii'),
PARTITION sales_east VALUES IN ('New York', 'Virginia', 'Florida'),
PARTITION sales_central VALUES IN ('Texas', 'Illinois')
);
Composite Partitioning
• Composite partitioning combines range and hash
partitioning.
• First distribute data into partitions according to
boundaries established by the partition ranges.
• Then use a hashing algorithm to further divide the data
into sub-partitions within each range partition
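A hedged Oracle-style sketch of range-hash composite partitioning (table, column, and tablespace names are illustrative): rows are first range-partitioned by date and then hashed into four sub-partitions within each range.

CREATE TABLE sales_composite
 (salesman_id   NUMBER(5),
  salesman_name VARCHAR2(30),
  sales_amount  NUMBER(10),
  sales_date    DATE)
PARTITION BY RANGE (sales_date)
SUBPARTITION BY HASH (salesman_id) SUBPARTITIONS 4
 STORE IN (data1, data2, data3, data4)
 (PARTITION sales_q1 VALUES LESS THAN (TO_DATE('01-APR-2018', 'DD-MON-YYYY')),
  PARTITION sales_q2 VALUES LESS THAN (TO_DATE('01-JUL-2018', 'DD-MON-YYYY')));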
Materialized Views
• Materialized views are supported in most
RDBMSs
• Types of Materialized views supported*
– With aggregates
– With joins
– With both joins & aggregates
– Nested materialized views

Materialized Views - Aggregates
• CREATE MATERIALIZED VIEW LOG ON sales WITH SEQUENCE, ROWID
(prod_id, cust_id, time_id, channel_id, promo_id, quantity_sold, amount_sold)
INCLUDING NEW VALUES;

CREATE MATERIALIZED VIEW sum_sales


PARALLEL
BUILD IMMEDIATE
REFRESH FAST ON COMMIT AS
SELECT s.prod_id, s.time_id, COUNT(*) AS count_grp,
SUM(s.amount_sold) AS sum_dollar_sales,
COUNT(s.amount_sold) AS count_dollar_sales,
SUM(s.quantity_sold) AS sum_quantity_sales,
COUNT(s.quantity_sold) AS count_quantity_sales
FROM sales s
GROUP BY s.prod_id, s.time_id;
Materialized Views - Aggregates
• This example creates a materialized view that contains
aggregates on a single table
• Because the materialized view log has been created with
all referenced columns in the materialized view's defining
query, the materialized view is fast refreshable
• If DML is applied against the sales table, then the changes
are reflected in the materialized view when the commit is
issued.

Nested Materialized Views
• A nested materialized view is a materialized view whose
definition is based on another materialized view. A nested
materialized view can reference other relations in the
database in addition to referencing materialized views
• In a data warehouse, you typically create many aggregate
views on a single join (for example, rollups along different
dimensions).
• Incrementally maintaining these distinct materialized
aggregate views can take a long time, because the
underlying join has to be performed many times

Nested Materialized Views
• Using nested materialized views, you can create multiple
single-table materialized views based on a joins-only
materialized view and the join is performed just once.
• Nesting is possible only with materialized views
containing only joins or aggregates
• Materialized views with joins and aggregates are
implemented using nested materialized views

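A simplified sketch of the idea (materialized view logs and the ROWID columns required for fast refresh are omitted; the SALES columns follow the earlier example, while TIMES.CALENDAR_YEAR is an assumed dimension column):

-- Joins-only materialized view: the expensive join is performed once
CREATE MATERIALIZED VIEW join_sales_time AS
SELECT s.prod_id, t.calendar_year, s.amount_sold
FROM   sales s, times t
WHERE  s.time_id = t.time_id;

-- Nested materialized views: aggregate rollups defined on top of the join view
CREATE MATERIALIZED VIEW sum_sales_by_product AS
SELECT prod_id, SUM(amount_sold) AS sum_sales
FROM   join_sales_time
GROUP BY prod_id;

CREATE MATERIALIZED VIEW sum_sales_by_year AS
SELECT calendar_year, SUM(amount_sold) AS sum_sales
FROM   join_sales_time
GROUP BY calendar_year;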
Bitmap Indexes
• Bitmap indexes are widely used in data warehousing
environments which typically have large amounts of data
and ad hoc queries
• For such applications, bitmap indexing provides:
– Reduced response time for large classes of ad hoc queries
– Reduced storage requirements compared to other indexing techniques
– Dramatic performance gains even on hardware with a relatively small
number of CPUs or a small amount of memory
• Fully indexing a large table with a traditional B-tree index
can be prohibitively expensive in terms of space because
the indexes can be several times larger than the data in
the table
• Bitmap indexes are typically only a fraction of the size of the indexed data in the table.
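Creating a bitmap index only requires the BITMAP keyword; for example, on a typical low-cardinality fact table column (the column name follows the earlier SALES example):

CREATE BITMAP INDEX sales_promo_bix ON sales (promo_id);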
Bitmap Join Indexes
• In addition to a bitmap index on a single table, you can create
a bitmap join index (BJI), which is a bitmap index for the join of
two or more tables.
• A BJI is a space efficient way of reducing the volume of data
that must be joined by performing restrictions in advance.
• For each value in a column of a table, a bitmap join index
stores the rowids of corresponding rows in one or more other
tables.
Bitmap Join Indexes
• In a data warehousing environment, the join condition is an
equi-inner join between the primary key column or columns of
the dimension tables and the foreign key column or columns in
the fact table.
• Bitmap join indexes are much more efficient in storage than
materialized join views, an alternative for materializing joins in
advance.
• This is because the materialized join views do not compress
the rowids of the fact tables.
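A hedged sketch of a bitmap join index, assuming a SALES fact table and a CUSTOMERS dimension joined on cust_id (names are illustrative, and CUSTOMERS.CUST_ID is assumed to carry a primary key or unique constraint):

-- One bitmap per distinct customer state; each bitmap marks the SALES rows
-- that join to a customer from that state, so the join is pre-computed
CREATE BITMAP INDEX sales_cust_state_bjix
  ON   sales (customers.cust_state)
  FROM sales, customers
  WHERE sales.cust_id = customers.cust_id;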
Dimensions
• A dimension is a structure that categorizes data in order to
enable users to answer business questions.
• Commonly used dimensions are customers, products, and
time.
• In Oracle Database, the dimensional information itself is stored
in a dimension table.
• In addition, the database object dimension helps to organize
and group dimensional information into hierarchies.
• This represents natural 1:n relationships between columns or
column groups (the levels of a hierarchy) that cannot be
represented with constraint conditions
Dimensions
• Dimensions do not have to be defined. However, if your
application uses dimensional modeling, it is worth spending
time creating them as it can yield significant benefits, because
they help query rewrite perform more complex types of
rewrites
• In spite of the benefits of dimensions, you must not create
dimensions in any schema that does not fully satisfy the
dimensional relationships
Dimensions
• Before you can create a dimension object, the dimension
tables must exist in the database possibly containing the
dimension data
• You create a dimension using the CREATE DIMENSION
statement
• For example, you can declare a dimension products_dim, which
contains levels product, subcategory, and category:
CREATE DIMENSION products_dim
LEVEL product IS (products.prod_id)
LEVEL subcategory IS (products.prod_subcategory)
LEVEL category IS (products.prod_category) ...
Dimensions
CREATE DIMENSION products_dim
LEVEL product IS (products.prod_id)
LEVEL subcategory IS (products.prod_subcategory)
LEVEL category IS (products.prod_category) ...

• Next step is to specify the hierarchy:


HIERARCHY prod_rollup
(product CHILD OF
subcategory CHILD OF
category)
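Putting the two fragments together, a complete statement might look as follows (a sketch based on the Oracle sample schema; the optional ATTRIBUTE clause and the PROD_NAME column are assumptions not shown in the fragments above):

CREATE DIMENSION products_dim
  LEVEL product     IS (products.prod_id)
  LEVEL subcategory IS (products.prod_subcategory)
  LEVEL category    IS (products.prod_category)
  HIERARCHY prod_rollup
    (product CHILD OF
     subcategory CHILD OF
     category)
  ATTRIBUTE product DETERMINES (products.prod_name);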
Online Aggregation
• Consider an aggregate query, e.g., finding the
average sales by state. Can we provide the user with
some information before the exact average is
computed for all states?
– Can show the current running average for each state as
the computation proceeds.
– Even better, if we use statistical techniques and sample
tuples to aggregate instead of simply scanning the
aggregated table, we can provide bounds such as “the
average for Wisconsin is 2000 ± 102 with 95% probability.”
Online Aggregation
– Use non-blocking algorithms for relational operators
– An algorithm is said to be blocking if it does not produce
output tuples, until it has consumed all its input tuples
– Sort-merge join algorithm blocks because sorting requires
all input tuples before determining the first output tuple
– Which join algorithm for online aggregation?

Nested-loop join & hash join vs. sort merge join?


Hash based agg. Vs. sort based agg.
Online Aggregation
SELECT L.state, AVG(S.sales)
FROM Sales S, Location L
WHERE S.Locid=L.Locid
GROUP BY L.state
Thank You
SQL – New Operators for DW

Prof. Navneet Goyal


Computer Science Department
BITS, Pilani
• Examples used in the presentation are taken from
Raghu Ramakrishna book on Database Systems



SQL
• Rollup
• Cube
• Window Queries
• Top N Queries



SQL
• Typically, a single OLAP operation can lead to
several closely related SQL queries with
aggregation and grouping
• Cross-tabulation is an example!



Cube Operator
Example tables (Sales fact with Locations, Products, and Times dimensions):

Sales (pid, timeid, locid, sales):
(11,1,1,25) (11,2,1,8) (11,3,1,15) (12,1,1,30) (12,2,1,20) (12,3,1,50)
(13,1,1,8) (13,2,1,10) (13,3,1,10) (11,1,2,35) (11,2,2,22) (11,3,2,10)
(12,1,2,26) (12,2,2,45) (12,3,2,20) (13,1,2,20) (13,2,2,40) (13,3,2,5)

Locations (locid, city, state, country):
(1, Madison, WI, USA) (2, Fresno, CA, USA) (5, Chennai, TN, India)

Products (pid, pname, category, price):
(11, Lee Jeans, Apparel, 25) (12, Zord, Toys, 18) (13, Biro Pen, Stationery, 2)

Times (timeid, date, month, year, holiday):
(1, 10/11/05, Nov, 1995, N) (2, 11/11/05, Nov, 1996, N) (3, 12/11/05, Nov, 1997, N)
Cube Operator
Cross-tabulation of sales by year and state:
        WI    CA   Total
1995    63    81    144
1996    38   107    145
1997    75    35    110
Total  176   223    399

The cross-tab requires several closely related GROUP BY queries:

Select T.year, L.state, SUM (sales)
from Sales S, Times T, Locations L
Where S.timeid=T.timeid AND S.locid=L.locid
Group By T.year, L.state

Select T.year, SUM (sales)
from Sales S, Times T
Where S.timeid=T.timeid
Group By T.year

Select L.state, SUM (sales)
from Sales S, Locations L
Where S.locid=L.locid
Group By L.state

Select SUM (sales)
from Sales S, Locations L
Where S.locid=L.locid
(OR: Select SUM (sales) from Sales S, Times T Where S.timeid=T.timeid)

How many such SQL queries are needed to build the cross-tab?
SQL
• Cross-tab can be thought of as a roll-up on the
entire dataset, on location, on time, and on both
location and time dimensions together
• Each roll-up corresponds to a single SQL query
with grouping
• Given a fact table with k associated dimensions, we
will have a total of 2^k such SQL queries



SQL
• GROUP BY construct is extended to provide
better support for roll-up and cross-tabulation
queries
• A GROUP BY with the CUBE keyword is equivalent to a
collection of ordinary GROUP BY statements, one for
each subset of the k dimensions



Cube Operator
Select T.year, L.state, SUM (sales)
from Sales S, Times T, Locations L
Where S.timeid=T.timeid AND S.locid=L.locid
Group By CUBE (T.year, L.state)

Result of the CUBE query (the cross-tab in relational form):
T.Year  L.State  SUM(sales)
1995    WI        63
1995    CA        81
1995    All      144
1996    WI        38
1996    CA       107
1996    All      145
1997    WI        75
1997    CA        35
1997    All      110
All     WI       176
All     CA       223
All     All      399


Rollup Operator
Select T.year, L.state, SUM (sales)
from Sales S, Times T, Locations L
Where S.timeid=T.timeid AND S.locid=L.locid
Group By ROLLUP (T.year, L.state)

Result of the ROLLUP query:
T.Year  L.State  SUM(sales)
1995    WI        63
1995    CA        81
1995    All      144
1996    WI        38
1996    CA       107
1996    All      145
1997    WI        75
1997    CA        35
1997    All      110
All     All      399

(For comparison, the full cross-tab also contains the column totals: WI 176, CA 223, Total 399.)

Find out what the following SQL will generate:
Select T.year, L.state, SUM (sales)
from Sales S, Times T, Locations L
Where S.timeid=T.timeid AND S.locid=L.locid
Group By ROLLUP (L.state, T.year)


Window Queries in SQL
• Time dimension is very important in decision support
• Queries involving trend analysis have been difficult to express in SQL
• A fundamental extension called a query window is introduced



Window Queries
• The WINDOW clause in SQL allows us to write such queries
over a table viewed as a sequence (implicitly, based on user-
specified sort keys)
• Also referred to as “querying sequences“
• WINDOW clause intuitively identifies an ordered window of
rows around each tuple in a table
• We can apply a rich collection of aggregate functions to the
window of a row and extend the row with the results
• For example, we can associate the avg. sales over the past 3
days with every sales tuple (daily granularity)
• This gives a 3-day moving avg. of sales
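A minimal sketch of the 3-day moving average mentioned above, assuming a daily-grain view DAILY_SALES with columns sales_date and sales (both hypothetical); ROWS BETWEEN counts physical rows, so each average covers the current day and the two preceding days:

SELECT sales_date,
       sales,
       AVG(sales) OVER (ORDER BY sales_date
                        ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS movavg_3day
FROM daily_sales;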
WINDOW & GROUP BY
• Like the WINDOW operator, GROUP BY allows us to
create partitions of rows and apply aggregate function
such as SUM to rows in a partition
• Unlike WINDOW, there is a single output row for each
partition, rather than one output row for each row, and
each partition is an unordered collection of rows
• COMPARE with CUBE!!!!



WINDOW: Example
SELECT L.state, T.month, AVG(S.sales) OVER W AS movavg
FROM Sales S, Times T, Locations L
WHERE S.timeid=T.timeid AND S.locid=L.locid
WINDOW W AS (PARTITION BY L.state
ORDER BY T.month
RANGE BETWEEN INTERVAL '1' MONTH PRECEDING
AND INTERVAL '1' MONTH FOLLOWING)
• FROM & WHERE clauses proceed as usual to generate an intermediate table,
TEMP.
• WINDOWS are created over TEMP
• 3 steps in defining a window
– Define partitions of the table (Partitions are similar to groups created by GROUP
BY)
– Specify the ordering of rows within a partition
– Frame WINDOW: establish the boundaries of the window associated with each
row in terms of ordering of rows within partitions



WINDOW: Example
SELECT L.state, T.month, AVG(S.sales) OVER W AS movavg
FROM Sales S, Times T, Locations L
WHERE S.timeid=T.timeid AND S.locid=L.locid
WINDOW W AS (PARTITION BY L.state
ORDER BY T.month
RANGE BETWEEN INTERVAL '1' MONTH PRECEDING
AND INTERVAL '1' MONTH FOLLOWING)
– Define partitions of the table (Partitions are similar to groups created by
GROUP BY)
– Specify the ordering of rows within a partition
– Frame WINDOW: establish the boundaries of the window associated with
each row in terms of ordering of rows within partitions
– Window for each row includes the row itself, plus all rows whose month values
are within a month before or after.
– A row whose month value is June 2006 has a window containing all rows with
month = May, June, or July 2006



WINDOW: Example
SELECT L.state, T.month, AVG(S.sales) OVER W AS movavg
FROM Sales S, Times T, Locations L
WHERE S.timeid=T.timeid AND S.locid=L.locid
WINDOW W AS (PARTITION BY L.state
ORDER BY T.month
RANGE BETWEEN INTERVAL '1' MONTH PRECEDING
AND INTERVAL '1' MONTH FOLLOWING)
– The answer row for each row is constructed by first identifying its WINDOW
– Then, for each answer column defined using a window agg. fn, we compute
the agg. using the rows in the WINDOW
– Each row of TEMP is a row of Sales, tagged with extra details about the
time & location dimensions
– There is one partition for each state, and every row of TEMP belongs to
exactly one partition.



Top N Queries
• If you want to find the 10 (or so) cheapest cars, it would
be nice if the DB could avoid computing the costs of all
cars before sorting to determine the 10 cheapest.
– Idea: Guess at a cost c such that the 10 cheapest all cost less
than c, and that not too many more cost less. Then add the
selection cost<c and evaluate the query.
• If the guess is right, great, we avoid computation
for cars that cost more than c.
• If the guess is wrong, need to reset the selection
and recompute the original query.
Top N Queries
SELECT P.pid, P.pname, S.sales
FROM Sales S, Products P
WHERE S.pid=P.pid AND S.locid=1 AND S.timeid=3
ORDER BY S.sales DESC
OPTIMIZE FOR 10 ROWS

• The OPTIMIZE FOR construct is not in SQL:92 & not even in SQL:1999!
• Supported by IBM’s DB2; Oracle 9i has similar constructs
• Compute sales only for those products that are likely to be in the TOP 10
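As an aside, later versions of the SQL standard (SQL:2008 onwards) and recent Oracle releases provide a declarative row-limiting clause that expresses the same intent; a sketch using the same query:

SELECT P.pid, P.pname, S.sales
FROM Sales S, Products P
WHERE S.pid=P.pid AND S.locid=1 AND S.timeid=3
ORDER BY S.sales DESC
FETCH FIRST 10 ROWS ONLY;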
Top N Queries
SELECT P.pid, P.pname, S.sales
FROM Sales S, Products P
WHERE S.pid=P.pid AND S.locid=1 AND S.timeid=3
AND S.sales > c
ORDER BY S.sales DESC

• Cut-off value c is chosen by optimizer using the histogram on


the sales column of the sales relation
• Much faster approach
• Issues:
– How to choose c?
– What if we get more than 10 products?
– What if we get less than 10 products?
Thank You
Real-Time
Data Warehousing

Prof. Navneet Goyal


Department of Computer Science & Information Systems
BITS, Pilani
Topics
 What is Real-Time Data Warehousing
(RTDWH)?
 Why we need RTDWH?
 Challenges
 Solutions
Introduction
 Traditionally DWs do not contain today's data. They are
usually loaded with data from operational systems at most
weekly or in some cases nightly, but are in any case a
window on the past.
 The fast pace of business today is quickly making these
historical systems less valuable to the issues facing
managers and government officials in the real world.
 Some Examples:
 Morning sales on the east coast will affect how stores are stocked on
the west coast.
 Airlines and government agencies need to be able to analyze the
most current information when trying to detect suspicious groups of
passengers or potentially illegal activity
 Fast-paced changes in the financial markets may make the
personalized suggestions on a stockbroker's website obsolete by the
time they are viewed.
Introduction
 As today's decisions in the business world become more
real-time, the systems that support those decisions need
to keep up.
 It is only natural that DW/BI systems quickly begin to
incorporate real-time data.
 DW/BI applications are designed to answer exactly the
types of questions that users would like to pose against
real-time data.
 They are able to analyze vast quantities of data over time,
to determine what is the best offer to make to a customer,
or to identify potentially fraudulent, illegal, or suspicious
activity.
Introduction
 Ad-hoc reporting is made easy using today's advanced
OLAP tools
 All that needs to be done is to make these existing
systems and applications work off real-time data.
 Let us examine the challenges of adding real-time data to
these systems, and look at solutions for real-time
warehousing
Introduction
 Today’s integration project teams face the daunting
challenge that, while data volumes are exponentially
growing, the need for timely and accurate business
intelligence is also constantly increasing
 Batches for DW loads used to be scheduled daily to
weekly; today’s users demand information that is as fresh
as possible
 The value of this real-time business data decreases as it
gets older
 Low latency of data integration is essential for the
business value of the DW
 At the same time the concept of “business hours” is
vanishing for a global enterprise, as data warehouses are
in use 24 hours a day, 365 days a year
Introduction
 This means that the traditional nightly batch windows are
becoming harder to accommodate, and interrupting or
slowing down sources is not acceptable at any time during
the day.

 Finally, integration projects have to be completed in


shorter release timeframes, while fully meeting functional,
performance, and quality specifications on time and within
budget

 Real-time data warehousing requires a solid approach to


data integration and most importantly, the ability to
transform and filter data on-the-fly to ensure it meets the
needs of its different users
Introduction
 Business Intelligence (BI) applications and their underlying
data warehouses have been used primarily as strategic
decision-making tools

 Kept Separate from Operational systems that manage


day-to-day business operations

 Significant industry momentum toward using DW/BI for


driving tactical day-to-day business decisions and
operations
Why Real Time Data
Warehousing?
 Active decision support
 Business activity monitoring (BAM)
 Alerting
 Positions information for use by downstream
application
Why Real Time Data
Warehousing?
 Users need to access two different
systems:
 DW for historical picture of what happened
in the past
 Many OLTP systems for what is happening
today
Traditional Vs. Real-Time
Data Warehouse
 Traditional Data Warehouse (EDW)
 Strategic
• Passive
• Historical trends
 Batch
• Offline analysis
 Isolated
• Not interactive
 Best effort
• Guarantees neither availability nor performance
Traditional Vs. Real-Time
Data Warehouse
 Real-Time Data Warehouse (RTDWH)
 Tactical
• Focuses on execution of strategy
 Real-Time
• Information on Demand
• Most up-to-date view of the business
 Integrated
• Integrates data warehousing with business
processes
 Guaranteed
• Guarantees both availability and performance
Real-Time Integration
 Goal of real-time data extraction,
transformation and loading
 Keep warehouse refreshed
 Minimal delay
 Issues
 How does the system identify what data has
been added or changed since the last extract
 Performance impact of extracts on the source
system
RTDWH Lineage
 Operational Data Source (ODS)
 Motivations of the original ODS were
similar to modern RTDWH
 Implementation of RTDWH reflects a new
generation of SW/HW & techniques
Real-Time FT Partitions
 Assuming that an application requires a
true-real time DW, the simplest approach
is to continuously feed the data
warehouse with new data from the source
system
 This can be done by either directly
inserting or updating data in the
warehouse fact tables, or by inserting
data into separate fact tables in a real-
time partition
Kimball’s Approach to
RTDWH
 Real-Time Partitions (Generation 2)
 Separate real-time fact table is created whose
grain & dimensionality matches that of the
corresponding FT in the static (nightly loaded)
DW
 Real-time FT contains only current day’s facts
(those not yet loaded into the static FT)
 Each night, the contents of RTFT are written
to the static FT and the RTFT is purged, ready
to receive the next day’s facts
Kimball’s Approach to
RTDWH
 Real-Time Partitions
 Gives RT reporting benefits of the ODS into
the DW itself, eliminating ODS architectural
overhead
 Facts are trickle fed into the RTFTs throughout
the day
 User queries against the RTFTs are neither
halted nor interrupted by this loading process
Kimball’s Approach to
RTDWH
 Real-Time Partitions
 Indexing is minimal
 Performance is achieved by restricting the
amount of data in RTFTs
 Caching entire RTFT in memory
 Create view to combine data from both static
& real-time FT, providing a virtual star
schema to simplify queries that demand views
of historical measures that extend to the
moment
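A minimal sketch of such a view, assuming a static fact table and a same-grain real-time fact table (both names are hypothetical):

-- Queries against SALES_FACT_ALL see history plus today's trickle-fed facts
CREATE OR REPLACE VIEW sales_fact_all AS
  SELECT * FROM sales_fact_static     -- nightly-loaded history
  UNION ALL
  SELECT * FROM sales_fact_realtime;  -- today's real-time partition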
Kimball’s Approach to
RTDWH
 Real-Time Partitions
 Fact records alone trickled into RTFT
 Any issues?
 What about the changes to DTs that occur
between the nightly bulk loads?
 New customers created during the day!
 Are we focusing only on fresh facts?
Kimball’s Approach to
RTDWH
 Real-Time Partitions
 Hybrid approach to SCD in a real-time
environment
• Treat intra-day changes to a DT as TYPE 1, where a
special copy of the DT is associated with the RT
partition
• Changes during the day trigger simple overwrites
• At the end of the day, any such changes can be
treated as TYPE 2 in the original DT
References
 Real-Time Data Warehousing: Challenges and Solutions, by Justin Langseth (http://dssresources.com/papers/features/langseth/langseth02082004.html)
 Best Practices for Real-Time Data Warehousing, An Oracle White Paper, March 2014
 Ralph Kimball, “Real-time Partitions”, Intelligent Enterprise, February 2002 (http://www.intelligententerprise.com/020201)
Coming up Next…
 Real-time ETL
 Role of ODS in RTDWH
Real-Time ETL

Prof. Navneet Goyal


Department of Computer Science & Information Systems
BITS, Pilani
Introduction
 One of the most difficult parts of building any
DW is the process of extracting, transforming,
cleansing, and loading the data from the
source system
 Performing ETL of data in real-time introduces
additional challenges
 Almost all ETL tools and systems, whether
based on off-the-shelf products or custom-
coded, operate in a batch mode
 They assume that the data becomes available
as some sort of extract file on a certain
schedule, usually nightly, weekly, or monthly.
Then the system transforms and cleanses the
data and loads it into the DW
Introduction
 When loading data continuously in real-time,
there can't be any system downtime
 The heaviest periods in terms of DW usage
may very well coincide with the peak periods
of incoming data
 The requirements for continuous updates with
no DW downtime are generally inconsistent
with traditional ETL tools and systems
 Fortunately, there are new tools on the
market that specialize in real-time ETL and
data loading
 There are also ways of modifying existing ETL
systems to perform real-time or near real-
time DW loading
Real-Time ETL
 The main challenge in RTDWH is to bring
integrated and transformed data with zero or
almost zero latency to the DW
 Role of ETL tool becomes critical
Near Real-Time ETL
 Simply increasing the frequency of the existing data
load may be sufficient.
 A data load that currently occurs weekly can perhaps be
performed instead daily, or twice a day
 A daily data load could be converted to an hourly data
load
 Trickle & Flip approach
 While not real-time, near-real time may be a good
inexpensive first step
Trickle & Flip
 The "Trickle & Flip" approach helps avert the scalability
issues associated with querying tables that are being
simultaneously updated
 Instead of loading the data in real-time into the actual
DW tables, the data is continuously fed into staging
tables that are in the exact same format as the target
tables
 Depending on the data modeling approach being used
(see Real-Time FT Partitions), the staging tables either
contain a copy of just the data for the current day, or
for smaller fact tables can contain a complete copy of
all the historical data
Trickle & Flip
 Then on a periodic basis the staging table is duplicated
and the copy is swapped with the fact table, bringing the
DW instantly up-to-date
 If the "integrated real-time partition through views"
approach is being used, this operation may simply
consist of changing the view definition to include the
updated table instead of the old table
 Depending on the characteristics of how this swap is
handled by the particular RDBMS, it might be advisable
to temporarily pause the OLAP server while this flip
takes place so that no new queries are initiated while
the swap occurs.
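One possible way to implement the periodic "flip" is a quick rename-based swap (a sketch with hypothetical table names; partition exchange or changing a view definition are common alternatives, and the exact mechanism depends on the RDBMS):

-- Swap the trickle-fed staging copy with the live fact table
ALTER TABLE sales_fact         RENAME TO sales_fact_old;
ALTER TABLE sales_fact_staging RENAME TO sales_fact;
ALTER TABLE sales_fact_old     RENAME TO sales_fact_staging;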
Trickle & Flip
 This approach can be used with cycle times ranging
from hourly to every minute
 Generally best performance is obtained with 5-10
minute cycles, but 1-2 minute cycles (or even faster)
are also possible for smaller data sets or with sufficient
database hardware
 It is important to test this approach under full load
before it is brought into production to find the cycle
time that works best for the application
External Real-Time Data
Cache (RTDC)
 All of the solutions discussed so far involve the DW’s
underlying database taking on a lot of additional load to
deal with the incoming real-time data, and making it
available to warehouse users
 The best option in many cases is to store the real-time
data in an external real-time data cache (RTDC) outside
of the traditional DW, completely avoiding any potential
performance problems and leaving the existing
warehouse largely as-is.
External Real-Time Data
Cache (RTDC)
 The RTDC can simply be another dedicated database
server (or a separate instance of a large database
system) dedicated to loading, storing, and processing
the real-time data
 Applications that either deal with large volumes of real-
time data (hundreds or thousands of changes per
second), or those that require extremely fast query
performance, might benefit from using an in-memory
database (IMDB) for the RTDC
 Such IMDBs are provided by companies such as
Angara, Cacheflow, Kx, TimesTen, and InfoCruiser
External Real-Time Data
Cache (RTDC)
 Regardless of the database that is used to hold the
RTDC, its function remains the same
 All the real-time data is loaded into the cache as it
arrives from the source system
 Depending on the approach taken and the analytical
tools being used, either all queries that involve the real-
time data are directed to the RTDC, or the real-time
data required to answer any particular query is
seamlessly imaged to the regular data warehouse on a
temporary basis to process the query
External Real-Time Data
Cache (RTDC)
 Using an RTDC, there's no risk of introducing scalability
or performance problems on the existing DW, which is
particularly important when adding real-time data to an
existing production DW
 Further, queries that access the real-time data will be
extremely fast, as they execute in their own
environment separate from the existing DW
 A good solution for users who need up-to-the-second
data and typically don't want to wait long for their
queries to return
 Also, by using just-in-time data merging (JIM) from the RTDC
into the DW, or reverse just-in-time merging (RJIM) from
the warehouse into the RTDC, queries can access both
real-time and historical data seamlessly
RTDC & JIM
 There is a class of applications which tend to
share two or more of the following
characteristics:
 Require true real-time data (not near-real-time)
 Involve rapidly-changing data (10-1000 transactions per
second)
 Need to be accessed by 10s, 100s, or 1000s of concurrent
users
 Analyze real-time data in conjunction with historical
information
 Involve complex, multi-pass, analytical OLAP queries
RTDC & JIM
 These applications require the best aspects of
a traditional DW such as access to large
amounts of data, analytical depth, and
massive scalability
 They also require access to real-time data and the
processing speed provided by a real-time data
cache.
 Necessary to use a hybrid approach.
 The real-time information sits in RTDC and the
historical information sits in a DW, and the two are
efficiently linked together as needed. This can be
accomplished by an approach known as just-in-time
information merging (JIM)
RTDC & RJIM
 A variant of JIM is Reverse Just-in-time Data Merging
(RJIM)
 RJIM is useful for queries that are mainly based on real-
time data, but that contain limited historical information
as well
 In RJIM, a similar process takes place, except that the
needed historical information is loaded from the DW
into the RTDC on a temporary basis, and the query is
then run in the data cache (sketched below)
 This only works when the data cache is located in an
RDBMS with full SQL support, and will not work with
some IMDB systems that do not support many SQL
functions.
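A rough sketch of RJIM, assuming DB-API connections to the warehouse (dw_conn) and the cache (rtdc_conn); table names, columns, and the %s parameter style are illustrative:

# Hedged sketch: pull just the slice of history the query needs from the DW into a
# temporary table in the RTDC, then run the whole query inside the cache.
def run_rjim_query(dw_conn, rtdc_conn, customer_id):
    dw_cur = dw_conn.cursor()
    dw_cur.execute(
        "SELECT customer_id, order_date, amount "
        "FROM sales_history WHERE customer_id = %s", (customer_id,))
    history_rows = dw_cur.fetchall()

    rt_cur = rtdc_conn.cursor()
    rt_cur.execute("CREATE TEMPORARY TABLE tmp_history "
                   "(customer_id INT, order_date DATE, amount DECIMAL(12,2))")
    rt_cur.executemany("INSERT INTO tmp_history VALUES (%s, %s, %s)", history_rows)

    # The query joins today's real-time orders with the borrowed history
    rt_cur.execute(
        "SELECT SUM(amount) FROM ("
        "  SELECT amount FROM rt_orders WHERE customer_id = %s"
        "  UNION ALL SELECT amount FROM tmp_history WHERE customer_id = %s) t",
        (customer_id, customer_id))
    return rt_cur.fetchone()[0]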
Real-Time ETL
 Tool that moves data asynchronously into a
DW with some urgency – within minutes of
execution of the business Tx
 RTDWH demands a different approach to ETL
methods used in batch-oriented DW
 Running ETL batches more frequently is not practical
for either the OLTP systems or the DW
 Including the DW in the commit logic doesn’t
work either
 Locking & 2-phase commit also doesn’t work
across systems with different structures &
granularity
Real-Time ETL
 ETL system has a well defined boundary
where dimensionally prepared data is handed
over to the front room
 A real-time system cannot have this boundary
 Architecture of front-end tools is also affected
at the same time
 3 data delivery paradigms that require an
end-to-end perspective (from original source
to user’s screen)
 Alerts
 Continuous polling
 Non-event notification
Real-Time ETL
 Alerts
 A data condition at the source forces an update to
occur at the user’s screen in real time
 Continuous polling
 The end user's application continuously probes the
source data in order to update the user's screen in
real time (a small polling sketch follows this list)
 Non-event notification
 The end user is notified if a specific event does not
occur within a time interval or as the result of a
specific condition
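A minimal sketch of the continuous-polling paradigm, assuming a DB-API connection to the source, an illustrative query, and a hypothetical refresh_screen() callback supplied by the front-end tool:

# Hedged sketch: the end-user application re-queries the source on a fixed
# interval and refreshes the screen only when the result actually changes.
import time

def poll(conn, refresh_screen, interval_seconds=30):
    last_result = None
    while True:
        cur = conn.cursor()
        cur.execute("SELECT status, amount FROM orders WHERE order_id = 42")  # illustrative query
        result = cur.fetchall()
        if result != last_result:
            refresh_screen(result)   # push the change to the user's screen
            last_result = result
        time.sleep(interval_seconds)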
References
 Langseth J, "Real-Time Data Warehousing: Challenges and Solutions"
(http://dssresources.com/papers/features/langseth/langseth02082004.html)
 "Best Practices for Real-Time Data Warehousing", An Oracle White Paper,
March 2014
 Kimball R, "Real-Time Partitions", Intelligent Enterprise, February 2002
(http://www.intelligententerprise.com/020201)
Role of ODS in Real-Time
Data Warehousing

Prof. Navneet Goyal


Department of Computer Science & Information Systems
BITS, Pilani
RTDWH Lineage
 Operational Data Source (ODS)
 Motivations of the original ODS were
similar to modern RTDWH
 Implementation of RTDWH reflects a new
generation of SW/HW & techniques
A Word About ODS
 ODS is also referred to as Generation 1 DW
 Separate system that sat between source
transactional system & DW
 Hot extract used for answering narrow range of
urgent operational questions like:
 Was the order shipped?
 Was the payment made?
 ODS is particularly useful when:
 ETL process of the main DW delayed the availability of
data
 Only aggregated data is available
A Word About ODS
 ODS plays a dual role:
 Serve as a source of data for DW
 Querying
 Supports lower-latency reporting through
creation of a distinct architectural construct &
application separate from DW
 Half operational & half DSS
 A place where data was integrated & fed to a
downstream DW
 Extension of the DW ETL layer
A Word About ODS
 ODS has been absorbed by the DW
 Modern DWs now routinely extract data on a
daily basis
 Real-time techniques allow the DW to always
be completely current
 DWs have become far more operational than in
the past
 Footprints of conventional DW & ODS now
overlap so completely that it is not fruitful to
make a distinction between the kinds of
systems
A Word About ODS
 Classification of ODS based on:
 Urgency
• Class I - IV
 Position in overall architecture
• Internal or External
A Word About ODS
A Word About ODS
 Urgency
 Class I – Updates of data from operational
systems to ODS are synchronous
 Class II – Updates between the operational
environment & the ODS occur within a 2-3 hour
time frame
 Class III – synchronization of updates occurs
overnight
A Word About ODS
 Urgency
 Class IV – Updates into the ODS from the DW
are unscheduled
• Data in the DW is analyzed, and periodically placed
in the ODS
• For Example –Customer Profile Data
• Customer Name & ID
• Customer Volume – High/low
• Customer Profitability – High/low
• Customer Freq. of activity – very freq./very infreq.
• Customer likes & dislikes
RTDWH
 RTDWH advocates that, instead of pulling
operational data from OLTP systems into an ODS
in nightly batch jobs, data should be collected
from OLTP systems as and when events occur and
moved directly into the data warehouse
(sketched below).
 This enables the data warehouse to be updated
instantaneously and removes the necessity of an
ODS.
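A minimal sketch of this event-at-a-time feed, assuming a hypothetical stream of OLTP change events (e.g., read from a message queue or CDC log) and a DB-API connection to the warehouse's real-time table:

# Hedged sketch: each OLTP event is written to the warehouse (or its real-time
# partition) as soon as it occurs, instead of waiting for a nightly batch into an ODS.
def stream_events_to_dw(events, dw_conn):
    cur = dw_conn.cursor()
    for ev in events:  # ev is an illustrative dict describing one business transaction
        cur.execute(
            "INSERT INTO sales_fact_realtime (order_id, product_id, qty, amount, ts) "
            "VALUES (%s, %s, %s, %s, %s)",
            (ev["order_id"], ev["product_id"], ev["qty"], ev["amount"], ev["ts"]))
        dw_conn.commit()  # committing per event keeps latency low, at some throughput cost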
Concluding Remarks
 If an ODS exists (Type 1), then it
can contribute towards RTDWH
 If it does not exist, then there is no
point in building an ODS for either
conventional DW or for RTDWH
World’s Largest Data Warehouse
Navneet Goyal
Department of Computer Science
BITS, Pilani (Pilani Campus)
World’s Largest
Data Warehouse
• SAP in conjunction with NetApp and several other partners
• @ SAP/Intel data center in Santa Clara, California
• 12 petabytes (PB) of addressable storage had been created
• Guinness World Record
• Based on the SAP® HANA in-memory data platform, SAP
IQ (formerly Sybase IQ), and BMMsoft Federated EDMT.
• NetApp® SAN storage
• Contains more than 221 trillion transactional records
• more than 100 billion unstructured documents, including
emails, SMS, and images
• It also contains data from 30 billion sources, including
users, smart sensors, and mobile devices.

Source:
An Insider’s View into the World’s Largest Data Warehouse
by Larry Freeman, NetApp
World’s Largest
Data Warehouse
• To achieve these impressive results, a data warehouse
environment was created by ingesting 3 PB per day of
synthetic data for four consecutive days—a feat that
required exceptional storage system performance and
reliability. For that, SAP turned to NetApp® SAN storage
SAP HANA: In-Memory Database
• Companies are always trying to find the best way to store
data in a meaningful format so that they can make better
business decisions
• Since the birth of data warehousing almost 30 years ago,
numerous innovations in data management have been
made, such as Hadoop and NoSQL
• HANA (High-Performance Analytic Appliance)
• A platform for processing high volumes of operational and
transactional data in real time
All material related to HANA taken from:
SAP HANA: In-Memory Database
(Figure slides from the HANA material; only the slide subtitles survive:
Today's technology requires a tradeoff; Delivering across 5 dimensions of
decision processing; In-memory computing; Storing data; Using data)
NetApp SAN Storage
• deployment of the storage hardware (weighing nearly two tons!)
occurred over several
• NetApp E-Series storage was deployed for the majority of the
data, to the tune of a total of 5.4 petabytes of physical storage
capacity spread across 20 E5460 storage arrays and 1,800 three
terabyte NL-SAS disk drives
• The record of 12.1 petabytes of total addressable capacity was
achieved thanks in large part to an SAP in-system data
compression rate of 85% for the 50/50 mix of structured and
unstructured data.
• NetApp E-Series storage was selected for the project because of its
proven 99.999% availability and its ability to handle the project’s
data ingest requirement of 34.3 TB per hour
• The E-Series SAN used a Fibre Channel fabric to support SAP IQ’s
large and varied data needs
Source:
An Insider’s View into the World’s Largest Data Warehouse
by Larry Freeman, NetApp
SAP IQ
• A highly optimized RDBMS built for extreme-scale Big
Data analytics and warehousing
• Developed by Sybase (now an SAP company)
• SAP IQ holds the Guinness World Record for fastest
loading and indexing of Big Data
• IQ is a column-based, petabyte scale, relational
database software system used for business intelligence,
data warehousing, and data marts
• Its primary function is to analyze large amounts of data
in a low-cost, highly available environment
• SAP IQ is often credited with pioneering the
commercialization of column-store technology
A word about HIVE
• A data warehouse infrastructure built on top of Hadoop
• Enables easy data summarization, ad hoc querying, and
analysis of large datasets stored in Hadoop files
• Provides a mechanism to put structure on this data
• Offers a simple query language called HiveQL (based on SQL)
that enables users familiar with SQL to query this data (a
small example follows below)
• HiveQL also allows traditional map/reduce programmers to
plug in their custom mappers and reducers to do more
sophisticated analysis that may not be supported by the
built-in capabilities of the language
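As a small, hedged illustration, a HiveQL aggregate can be issued from Python with the PyHive client (assuming a running HiveServer2 endpoint and a hypothetical page_views table):

# Hedged sketch: run a HiveQL GROUP BY from Python; host/port and table are illustrative.
from pyhive import hive

conn = hive.connect(host="localhost", port=10000)
cur = conn.cursor()
cur.execute(
    "SELECT page, COUNT(*) AS hits "
    "FROM page_views "
    "WHERE view_date = '2014-01-01' "
    "GROUP BY page "
    "ORDER BY hits DESC "
    "LIMIT 10")
for page, hits in cur.fetchall():
    print(page, hits)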
A word about HIVE
HIVE provides:
• Tools to enable easy data extract/transform/load
(ETL)
• A mechanism to impose structure on a variety of
data formats
• Access to files stored either directly in Apache
HDFS or in other data storage systems such as
Apache HBase
• Query execution via MapReduce
Thank You
BIG Data Analytics:
An Overview
Navneet Goyal
Department of Computer Science
BITS, Pilani (Pilani Campus)
Topics
• Big Data Analytics
• Extended RDBMS Architecture
• MapReduce/Hadoop
BIG Data Analytics
It is like a mountain of data that you have to climb.
The height of the mountain keeps increasing every day,
and you have to climb it in less time than yesterday!!

• This is the kind of challenge big data throws at us


• Has forced us to go back to the drawing board and
design every aspect of a compute system afresh
• Has spawned research in every sub area of Computer
Science
What Is Big Data?
• There is no consensus as to how to define big data
Big data exceeds the reach of commonly used hardware
environments and software tools to capture, manage, and process it
within a tolerable elapsed time for its user population. - Teradata
Magazine article, 2011

Big data refers to data sets whose size is beyond the ability of
typical database software tools to capture, store, manage and
analyze. - The McKinsey Global Institute, 2011
BIG DATA
o BIG DATA poses a big challenge to our capabilities
o Data scaling outdoing scaling of compute resources
o CPU speed not increasing either
o At the same time, BIG DATA offers a BIGGER opportunity
o Opportunity to
o Understand nature
o Understand evolution
o Understand human behavior/psychology/physiology
o Understand stock markets
o Understand road and network traffic
o …
o Opportunity for us to be more and more innovative!!
Big Data - Sources
• Telecom data ( 4.75 bn mobile subscribers)
• There are 3 Billion Telephone Calls in US each day,
30 Billion emails daily, 1 Billion SMS, IMs.
• IP Network Traffic: up to 1 Billion packets per hour
per router. Each ISP has many (hundreds) routers!
• WWW
• Weblog data (160 mn websites)
• Email data
• Satellite imaging data
• Social networking sites data
• Genome data
• CERN's LHC (5 petabytes/year)
BIG DATA

• Just a Hype?
• Or a real Challenge?
• Or a great Opportunity?
• Challenge in terms of how to manage & use this data
• Opportunity in terms of what we can do with this data to
enrich the lives of everybody around us and to make our
mother Earth a better place to live
• We are living in a world of DATA!!
• We are (partly) and will be (fully) driven by DATA
BIG DATA
o We are generating more Data than we can handle!!!
o Using Data to our benefit is a far cry!!!
o In future, everything will be Data driven
o High time we figured out how to tame this
"monster" and use it for the benefit of society
BIG DATA
Best Quote so far…
Dhiraj Rajaram, Founder & CEO of Mu Sigma, a leading Data Analytics company:
"Data is the new 'oil' and there is a growing need for the ability to refine it."
BIG DATA
Another Interesting Quote
"We don't have better Algorithms, we just have more data"
- Peter Norvig, Director of Research, Google
Analyzing BIG DATA
Data analysis, organization, retrieval, and modeling are
other foundational challenges. Data analysis is a clear
bottleneck in many applications, both due to lack of
scalability of the underlying algorithms and due to the
complexity of the data that needs to be analyzed*

*Challenges and Opportunities with Big Data: A community white paper
developed by leading researchers across the United States
BIG DATA
o BIG DATA is spawning research in:
o Databases
o Data Analytics (Data Warehousing, Data Mining & Machine Learning)
o Parallel Programming & Programming Models
o Distributed and High Performance Computing
o Domain Specific Languages
o Storage Technologies
o Algorithms & Data Structures
o Data Visualization
o Architecture
o Networks
o Green Computing
o …
BIG Data Analytics
• Data Warehousing, Data Mining & Machine Learning
are at the core of BIG Data Analytics
Why Cluster Computing?
• Scalable
• Only way forward to deal with Big Data
• Embrace this technology till any new disruptive/
revolutionary technology surfaces
Extended RDBMS Architecture*
• Support for new data types required by BIG data
– Vectors & matrices
– Multimedia data
– Unstructured and semi-structured data
– Collection of name-value pairs, called as data bags
• Provide support for processing new data types
within the DBMS inner loop by means of user-
defined functions

* The DW Toolkit, Kimball & Ross (Chapter 21), Wiley, 3e


MapReduce*
• What's there in a name?
• Everything!!
• Map + Reduce
• Both are functions used in functional programming
• Has primitives in LISP & other functional PLs
• So what is functional programming?
* Google paper, 2004: "MapReduce: Simplified Data Processing on Large
Clusters", Jeffrey Dean and Sanjay Ghemawat, OSDI, 2004
Functional Programming
• Functional programming is a programming paradigm
that treats computation as the evaluation of
mathematical functions
• Interactive PL
• Expression ⇒ Value
• (function arg1 arg2 … argn)*
• (+ 12 23 34 45) ⇒ 114
*Scheme – a dialect of LISP, is the 2nd oldest PL that is still in
use
Functional Programming
• In the early days, computer use was very expensive, so it
was obvious to have the programming language resemble
the architecture of the computer as closely as possible.
• A computer consists of a central processing unit and a
memory.
• Therefore a program consisted of instructions to
modify the memory, executed by the processing unit
• With that the imperative programming style arose
• Imperative programming language, like Pascal and C,
are characterized by the existence of assignments,
executed sequentially.
Reference: Functional Programming by Jeroen Fokker
Functional Programming
• Functions express the connection between the parameters
(the input) and the result (the output) of certain processes.
• In each computation the result depends in a certain way on the
parameters. Therefore a function is a good way of specifying a
computation.
• This is the basis of the functional programming style.
• A program consists of the definition of one or more functions
• With the execution of a program the function is provided with
parameters, and the result must be calculated.
• With this calculation there is still a certain degree of freedom
• For instance, why would the programmer need to prescribe in
what order independent subcalculations must be executed?

Reference: Functional Programming by Jeroen Fokker


Functional Programming
• The theoretical basis of imperative programming was
already founded in the 30s by Alan Turing (in England)
and John von Neumann (in the USA)
• The theory of functions as a model for calculation also
comes from the 20s and 30s. Some of the founders are
M. Schönfinkel (in Germany and Russia), Haskell Curry
(in England) and Alonzo Church (in the USA)
• The language Lisp of John McCarthy was the first
functional programming language, and for years it
remained the only one
Reference: Functional Programming by Jeroen Fokker
Functional Programming
• ML, Scheme (an adjustment to Lisp), Miranda and
Clean are other examples of functional programming
languages
• Haskell (first unified PL) & Gofer (a simplified version of
Haskell)
• ML & Scheme have overtones of imperative
programming languages and therefore are not purely
functional
• Miranda is a purely functional PL!

Reference: Functional Programming by Jeroen Fokker


Gofer
? 5+2*3
11
(5 reductions, 9 cells)
?
• The interpreter calculates the value of the expression
entered, where * denotes multiplication.
• After reporting the result (11), the interpreter reports that the
calculation took 5 reductions (a measure of the amount of
time needed) and 9 cells (a measure of the amount of
memory used)
• The question mark shows the interpreter is ready for the
next expression
Reference: Functional Programming by Jeroen Fokker
Gofer
? sum [1..10]
55
(91 reductions, 130 cells)
• In this example [1..10] is the Gofer notation for the list of
numbers from 1 to 10.
• The standard function sum can be applied to such a list to
calculate the sum (55) of those numbers.
• A list is one of the ways to compose data, making it possible to
apply functions to large amounts of data.
• Lists can also be the result of a function:
? sums [1..10]
[0, 1, 3, 6, 10, 15, 21, 28, 36, 45, 55]
(111 reductions, 253 cells)
• The standard function sums returns not only the sum of the
numbers in the list but also all the intermediate results.
Reference: Functional Programming by Jeroen Fokker
Gofer
? reverse (sort [1,6,2,9,2,7])
[9, 7, 6, 2, 2, 1]
(52 reductions, 135 cells)

• g (f x) means that f should be applied to x and g should


be applied to the result of that
• Gofer is a largely parenthesis free language

Reference: Functional Programming by Jeroen Fokker


Gofer: Defining New Functions
• The editor is called by typing :edit , followed by the
name of a file, for example:
? :edit new
• Definition of the factorial function can be put in the file
new .
• In Gofer the definition of the function fac could look
like:
fac n = product [1..n]

Reference: Functional Programming by Jeroen Fokker


Gofer: Defining New Functions
? :load new
Reading script file "new":
Parsing..........................................................
Dependency analysis..............................................
Type checking....................................................
Compiling........................................................
Gofer session for:
/usr/staff/lib/gofer/prelude
new
?
• Now fac can be used
? fac 6
720
(59 reductions, 87 cells)
Reference: Functional Programming by Jeroen Fokker
Gofer: Adding fn to a file
• It is possible to add definitions to a file when it is already
loaded. Then it is sufficient to just type :edit; the name of
the file need not be specified.
• For example a function which can be added to a file is the
function n choose k : the number of ways in which k objects
can be chosen from a collection of n objects
• This definition can, just as with fac, be almost literally been
written down in Gofer:
choose n k = fac n / (fac k * fac (n-k))
Example:
? choose 10 3
120
(189 reductions, 272 cells)

Reference: Functional Programming by Jeroen Fokker


Gofer: Defining a New Operator
• An operator is a function with two parameters which is
written between the parameters instead of in front of
them
• In Gofer it is possible to define your own operators
• The function choose from above could have been defined
as an operator, for example as !^! :
n !^! k = fac n / (fac k * fac (n-k))

Reference: Functional Programming by Jeroen Fokker


Gofer: Nesting Functions
• Parameter of a function can be a function itself too!
• An example of that is the function map, which takes two
parameters: a function and a list.
• The function map applies the parameter function to all the
elements of the list.
• For example:
? map fac [1,2,3,4,5]
[1, 2, 6, 24, 120]
? map sqrt [1.0,2.0,3.0,4.0]
[1.0, 1.41421, 1.73205, 2.0]
? map even [1..8]
[False, True, False, True, False, True, False, True]
• Functions with functions as a parameter are frequently used in
Gofer (why did you think it was called a functional language?).

Reference: Functional Programming by Jeroen Fokker


MapReduce/Hadoop
o MapReduce* - A programming model & its associated
implementation
o provides a high level of abstraction
o but has limitations
o Only data parallel tasks stand to benefit!
o MapReduce hides parallel/distributed computing
concepts from users/programmers
o Even novice users/programmers can leverage cluster
computing for data-intensive problems
o Cluster, Grid, & MapReduce are intended platforms for
general purpose computing
o Hadoop/PIG combo is very effective!

*MapReduce: Simplified Data Processing on Large Clusters,
Jeffrey Dean and Sanjay Ghemawat, OSDI, 2004
MapReduce
• MapReduce works by breaking processing into the
following 2 phases:
– Map (inherently parallel – each list element is processed independently)
– Reduce (inherently sequential)
Map Function
• Applies to a list
• map(function, list) calls function(item) for each of the
list's items and returns a list of the return values. For
example, to compute some cubes:
>>> def cube(x): return x*x*x
...
>>> map(cube, range(1, 11))
[1, 8, 27, 64, 125, 216, 343, 512, 729, 1000]
Reduce Function
• Applies to a list
• reduce(function, list) returns a single value constructed
by calling the binary function on the first two items of
the list, then on the result and the next item, and so
on…
• For example, to compute the sum of the numbers 1
through 10:
• >>> def add(x,y): return x+y
...
• >>> reduce(add, range(1, 11))
55
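Composing the two functions gives the flavour of MapReduce in one line; a small illustrative snippet in the same interactive style as above (in Python 3, reduce must first be imported from functools):

>>> def cube(x): return x*x*x
...
>>> def add(x, y): return x+y
...
>>> reduce(add, map(cube, range(1, 11)))   # sum of the cubes of 1..10
3025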
My view of MapReduce
• Distribute: Data → Distributed file system
• Compute: MAP → Parallelism
• Compile: REDUCE → Aggregate
MapReduce
• A programming model & its associated implementation
• MapReduce is a patented software framework
introduced by Google to support distributed computing
on large data sets on clusters of computers (wiki defn.)
• Inspired by the map & reduce functions of functional
programming (in a tweaked form)
MapReduce
• MapReduce is a framework for processing huge
datasets on certain kinds of distributable problems
using a large number of computers
– cluster (if all nodes use the same hardware) or as
– grid (if the nodes use different hardware)
(wiki defn.)
• Computational processing can occur on data stored
either in a filesystem (unstructured) or within a
database (structured).
MapReduce
• Programmer need not worry about:
– Communication between nodes
– Division & scheduling of work
– Fault tolerance
– Monitoring & reporting
• Map Reduce handles and hides all these dirty
details
• Provides a clean abstraction for programmer
Composable Systems
• Processing can be split into smaller computations and
the partial results merged after some post-processing to
give the final result
• MapReduce can be applied to this class of scientific
applications that exhibit composable property.
• We only need to worry about mapping a particular
algorithm to Map & Reduce
• If you can do that, with a little bit of high level
programming, you are through!
• SPMD algorithms
• Data Parallel Problems
SPMD Algorithms
• It's a technique to achieve parallelism.
• Tasks are split up and run on multiple processors
simultaneously with different input data
• The robustness provided by MapReduce
implementations is an important feature for
selecting this technology for such SPMD algorithms.
Hadoop
• MapReduce isn't available outside Google!
• Hadoop/HDFS is an open source implementation of
MapReduce/GFS
• Hadoop is a top-level Apache project being built and
used by a global community of contributors, using Java
• Yahoo! has been the largest contributor to the project,
and uses Hadoop extensively across its businesses
Hadoop & Facebook
– FB uses Hadoop to store copies of internal log and dimension
data sources and use it as a source for reporting/analytics and
machine learning.
– Currently FB has 2 major clusters:
• A 1100-machine cluster with 8800 cores and about 12 PB
raw storage.
• A 300-machine cluster with 2400 cores and about 3 PB raw
storage.
• Each (commodity) node has 8 cores and 12 TB of storage.
• FB has built a higher level data warehousing framework
using these features called Hive
http://hadoop.apache.org/hive/
• Have also developed a FUSE implementation over hdfs.
– First company to abandon RDBMS and adopt Hadoop for
implementation of a DW
A word about HIVE
• A data warehouse infrastructure built on top of Hadoop
• Enables easy data summarization, ad hoc querying, and
analysis of large datasets stored in Hadoop files
• Provides a mechanism to put structure on this data
• Offers a simple query language called HiveQL (based on SQL)
that enables users familiar with SQL to query this data
• HiveQL also allows traditional map/reduce programmers to
plug in their custom mappers and reducers to do more
sophisticated analysis that may not be supported by the
built-in capabilities of the language
Hadoop & Yahoo
• YAHOO
– More than 100,000 CPUs in >36,000 computers running
Hadoop
– Biggest cluster: 4000 nodes (2*4cpu boxes w 4*1TB disk &
16GB RAM)
• Used to support research for Ad Systems and Web Search
• Also used to do scaling tests to support development of
Hadoop on larger clusters
HDFS
• Data is organized into files & directories
• Files are divided into uniform sized blocks (64 MB
default) & distributed across cluster nodes
• Blocks are replicated (3 by default) to handle HW failure
(see the back-of-the-envelope sketch after this list)
• Replication for performance & fault tolerance
• Checksum for data corruption detection & recovery
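A back-of-the-envelope helper (illustrative only) for how a file is split and replicated under the default settings above:

# Hedged sketch: blocks and raw capacity consumed by a file in HDFS with a
# 64 MB block size and a replication factor of 3 (the defaults quoted above).
import math

def hdfs_footprint(file_size_mb, block_mb=64, replication=3):
    blocks = math.ceil(file_size_mb / block_mb)
    return blocks, blocks * replication, file_size_mb * replication

print(hdfs_footprint(1024))   # a 1 GB file -> (16, 48, 3072): 16 blocks, 48 block replicas, ~3 GB raw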
Word Count using Map Reduce
• Mapper
– Input: <key: offset, value:line of a document>
– Output: for each word w in input line output<key: w,
value:1>
Input: (The quick brown fox jumps over the lazy dog.)
Output: (the, 1), (quick, 1), (brown, 1), … (fox, 1), … (the, 1), …
• Reducer
– Input: <key: word, value: list<integer>>
– Output: sum all values from the input list for the given key
and output <key: word, value: count> (a runnable sketch
follows this slide)
Input: (the, [1, 1, 1, 1, 1]), (fox, [1, 1, 1]), …
Output: (the, 5)
(fox, 3)
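The same logic written as a Hadoop Streaming-style mapper/reducer pair; a minimal sketch only, with the framework assumed to perform the shuffle/sort between the two steps:

import re
import sys

def mapper(lines):
    # Emit one "word<TAB>1" pair per word in every input line
    for line in lines:
        for word in re.findall(r"[a-z]+", line.lower()):
            print(word + "\t1")

def reducer(lines):
    # Assumes the input is already sorted by key, as the shuffle/sort phase guarantees
    current, total = None, 0
    for line in lines:
        word, count = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print(current + "\t" + str(total))
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(current + "\t" + str(total))

# In a streaming job: mapper(sys.stdin) runs in the map tasks, reducer(sys.stdin) in the reduce tasks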
Word count

• Input
• Map
• Shuffle
• Reduce
• Output

http://www.slideshare.net/hadoop/practical-problem-solving-with-apache-hadoop-pig
By Milind Bhandarkar
Map Reduce Architecture
• Map Phase
– Map tasks run in parallel – output intermediate key value
pairs
• Shuffle and sort phase
– Map task output is partitioned by hashing the output key
(the rule is sketched in one line after this list)
– Number of partitions is equal to the number of reducers
– Partitioning ensures all key/value pairs sharing same key
belong to same partition
– The map output partition is sorted by key to group all
values for the same key
• Reduce Phase
– Each partition is assigned to one reducer.
– Reducers also run in parallel.
– No two reducers process the same intermediate key
– Reducer gets all values for a given key at the same time
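The partitioning rule mentioned in the shuffle phase above is, in essence, a one-liner (a sketch; Hadoop's actual partitioner hashes the serialized key):

# Every (key, value) pair with the same key hashes to the same reducer,
# so one reducer receives all values for that key.
def partition(key, num_reducers):
    return hash(key) % num_reducers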
Applications in Data Warehousing
• Aggregate queries
– Product-wise sales for the year 2010 (sketched below)
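A minimal sketch of how such an aggregate maps onto the model, with hypothetical sales records of the form (product, year, amount) and a tiny local stand-in for the framework's shuffle:

from collections import defaultdict

def map_fn(record):                # record: (product, year, amount)
    product, year, amount = record
    if year == 2010:
        yield (product, amount)    # emit intermediate (key, value) pairs

def reduce_fn(product, amounts):   # called once per key with all of its values
    return (product, sum(amounts))

def run(records):
    groups = defaultdict(list)     # stand-in for the shuffle/sort phase
    for rec in records:
        for k, v in map_fn(rec):
            groups[k].append(v)
    return [reduce_fn(k, vs) for k, vs in groups.items()]

print(run([("soap", 2010, 5.0), ("soap", 2010, 7.5), ("tea", 2009, 3.0), ("tea", 2010, 4.0)]))
# -> [('soap', 12.5), ('tea', 4.0)]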
Strengths of MapReduce
• Provides highest level of abstraction!
(as on date)
• Learning curve – manageable
• Highly scalable
• Highly fault tolerant
• Economical!!
Q&A