Background
1980’s to early 1990’s
Focus on computerizing business processes
To gain competitive advantage
By early 1990’s
All companies had operational systems
It no longer offered any advantage
How to get competitive advantage??
OLTP Systems:
Primary Purpose
Run the operations of the business
For example: Banks, Railway reservation etc.
Based on ER Data Modeling
Transaction based system
Data is always current valued
Little history is available
Data is highly volatile
Has “Intelligent keys”
OLTP Systems
Performance
Operational databases are designed and tuned for known transactions and workloads.
Complex OLAP queries would degrade performance for operational transactions.
Special data organization, access, and implementation methods are needed for multidimensional views and queries.
Current and historical decision-support information is hard to access or present in traditional operational systems.
Why Separate Data Warehouse?
Function
Missing data: Decision support requires historical data, which operational databases do not typically maintain.
Data consolidation: Decision support requires consolidation (aggregation, summarization) of data from many heterogeneous sources: operational databases, external sources.
Data quality: Different sources typically use inconsistent data representations, codes, and formats, which have to be reconciled.
Data Warehouse:
Characteristics
Analysis driven
Ad-hoc queries
Complex queries
Used by top managers
Based on Dimensional Modeling
Denormalized structures
Data Warehouse:
Major Players
SAS institute
IBM
Oracle
Sybase
Microsoft
HP
Cognos
Business Objects
Data Warehouse
A decision support database that is maintained
separately from the organization’s operational
databases.
A data warehouse is a
subject-oriented,
integrated,
time-varying,
non-volatile
collection of data used primarily in organizational decision making.
OLTP Systems
(Figure: users both write to and read from OLTP systems, whereas users only read from the DW.)
Time Variant
Most business analysis has a time component
Trend analysis requires historical data
(Figure: sales plotted over the years 2001-2004)
Data Warehousing
Architecture
(Architecture figure: external sources and operational databases feed Extract, Transform, Load, and Refresh processes; data flows into the warehouse and data marts and is served by OLAP servers to analysis, query/reporting, and data mining tools, with a metadata repository and monitoring & administration alongside.)
Data Warehousing:
Introduction Continued
Misspelled terms
For example NAMES
Phonetic algorithms – can find
similar sounding names
Based on the six phonetic
classifications of human speech
sounds
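A minimal sketch of such a phonetic match in Oracle SQL using the built-in SOUNDEX function; the customer table and column names are illustrative, not from the slides.

-- Finds names that sound like 'Meyer' even when spelled differently
-- (e.g. 'Meier', 'Mayer').
SELECT cust_id, cust_name
FROM   customer
WHERE  SOUNDEX(cust_name) = SOUNDEX('Meyer');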
Data Warehouse Design
OLTP Systems are Data Capture
Systems
“DATA IN” systems
DW are “DATA OUT” systems
OLTP DW
Analyzing the DATA
Active Analysis – User Queries
User-guided data analysis
Show me how X varies with Y
OLAP
Automated Analysis – Data Mining
What’s in there?
Set the computer FREE on your data
Supervised Learning (classification)
Unsupervised Learning (clustering)
OLAP Queries
How much of product P1 was
sold in 1999 state wise?
Top 5 selling products in 2002
Total Sales in Q1 of FY 2002-03?
Color wise sales figure of cars
from 2000 to 2003
Model wise sales of cars for the
month of Jan from 2000 to 2003
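As a sketch, the first of these questions could be answered with one SQL query against a sales star schema; the table and column names below are assumed, not taken from the slides.

SELECT l.state,
       SUM(f.sales_amt) AS total_sales
FROM   sales_fact f
JOIN   product_dim  p ON p.product_key  = f.product_key
JOIN   time_dim     t ON t.time_key     = f.time_key
JOIN   location_dim l ON l.location_key = f.location_key
WHERE  p.product_name = 'P1'
AND    t.year = 1999
GROUP BY l.state;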
Data Mining Investigations
Which type of customers are more
likely to spend most with us in the
coming year?
What additional products are most
likely to be sold to customers who
buy sportswear?
In which area should we open a new
store in the next year?
What are the characteristics of
customers most likely to default on
their loans before the year is out?
Continuum of Analysis
(Figure: a continuum from OLTP to the DW, with plain SQL at one end and specialized analysis algorithms at the other.)
Design Requirements
Design of the DW must directly
reflect the way the managers look
at the business
Should capture the measurements of importance along with the parameters by which these measurements are viewed
It must facilitate data analysis, i.e.,
answering business questions
ER Modeling
A logical design technique that
seeks to eliminate data redundancy
Illuminates the microscopic
relationships among data elements
Perfect for OLTP systems
Responsible for success of
transaction processing in
Relational Databases
Problems with ER Model
ER models are NOT suitable for DW?
End user cannot understand or
remember an ER Model
Many DWs have failed because of
overly complex ER designs
Not optimized for complex, ad-hoc
queries
Data retrieval becomes difficult due to
normalization
Browsing becomes difficult
ER vs Dimensional Modeling
ER models are constructed to
Remove redundant data (normalization)
Facilitate retrieval of individual records
having certain critical identifiers
Thereby optimizing OLTP performance
Dimensional model supports the
reporting and analytical needs of a
data warehouse system.
Dimensional Modeling:
Salient Features
Represents data in a standard
framework
Framework is easily
understandable by end users
Contains same information as ER
model
Packages data in symmetric format
Resilient to change
Facilitates data retrieval/analysis
Dimensional Modeling:
Vocabulary
Measures or facts
Facts are “numeric” & “additive”
For example; Sale Amount, Sale
Units
Factors or dimensions
Star Schemas
Snowflake & Starflake Schemas
(Figure: a Sales fact table with FK references to Time and Promotion dimension tables.)
Dimensional Modeling
(Cube figure: dimensions Time, Product, Location with example members Pilani, Goa, Dubai and Juice; an example cell value of 10.)
• Ordering logistics
• Stocking shelves
• Selling products
• Maximize profits
Data Warehouse:
Design Steps
(Figure: a Sales fact table with FK references to Product, Location, Time, and Promotion dimension tables.)
The "Classic" Star Schema
Fact Table: STORE KEY, PRODUCT KEY, PERIOD KEY, Dollars_sold, Units, Dollars_cost
Store Dimension: STORE KEY, Store Description, City, State, District ID, District Desc., Region_ID, Region Desc., Regional Mgr.
Time Dimension: PERIOD KEY, Period Desc, Year, Quarter, Month, Day
Product Dimension: PRODUCT KEY, Product Desc., Brand, Color, Size, Manufacturer
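A sketch of this star schema as Oracle-style DDL; the data types and constraint details are assumed, not part of the slides.

CREATE TABLE store_dimension (
  store_key     NUMBER PRIMARY KEY,
  store_desc    VARCHAR2(50),
  city          VARCHAR2(30),
  state         VARCHAR2(30),
  district_id   NUMBER,
  district_desc VARCHAR2(30),
  region_id     NUMBER,
  region_desc   VARCHAR2(30),
  regional_mgr  VARCHAR2(30)
);

CREATE TABLE time_dimension (
  period_key  NUMBER PRIMARY KEY,
  period_desc VARCHAR2(30),
  year        NUMBER(4),
  quarter     NUMBER(1),
  month       NUMBER(2),
  day         NUMBER(2)
);

CREATE TABLE product_dimension (
  product_key  NUMBER PRIMARY KEY,
  product_desc VARCHAR2(50),
  brand        VARCHAR2(30),
  color        VARCHAR2(20),
  prod_size    VARCHAR2(20),   -- named prod_size because SIZE is a reserved word
  manufacturer VARCHAR2(30)
);

-- Grain of the fact table: one row per store, product, and period.
CREATE TABLE sales_fact (
  store_key    NUMBER REFERENCES store_dimension,
  product_key  NUMBER REFERENCES product_dimension,
  period_key   NUMBER REFERENCES time_dimension,
  dollars_sold NUMBER(12,2),
  units        NUMBER(10),
  dollars_cost NUMBER(12,2)
);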
Types of Facts
Fully additive – across all dimensions
Units_sold, Sales_amt
Semi-additive – across only some dimensions
Account_balance, Customer_count
28/3, tissue paper, store1, 25, 250, 20
28/3, paper towel, store1, 35, 350, 30
Is the number of customers who bought either tissue paper or paper towel 50? No – some customers bought both, so customer count is not additive across products.
Non-additive – across no dimension
Gross margin = Gross profit / Amount
Note that GP and Amount are fully additive
Gross margin is the ratio of the sums, not the sum of the ratios
Facts for Grocery Store
1. Quantity sold (additive)
2. Dollar revenue (additive)
3. Dollar cost (additive)
4. Customer count (semi-additive, not additive along
the product dimension)
Fact Table for Grocery
Store
Field name    | Example value | Description/Remarks
Date key (FK) | 1             | Surrogate key
(Architecture figure: external sources and operational databases feed Extract/Transform/Load/Refresh processes; data flows to the warehouse and data marts and is served to analysis, query/reporting, and data mining tools, with a metadata repository alongside.)
Data Marts
• What is a data mart?
• Advantages and disadvantages of data marts
• Issues with the development and management of
data marts
Kimball vs. Inmon
There is no right or wrong between these two ideas, as they represent different data warehousing philosophies. In reality, the data warehouses in most enterprises are closer to Ralph Kimball's idea, because most data warehouses started out as a departmental effort and hence originated as data marts. Only when more data marts are built later do they evolve into a data warehouse.
Operational Data Store (ODS)
• Subject-oriented
• Customer, product, account, vendor etc.
• Integrated
• Data is cleansed, standardized and placed into a consistent
data model
• Volatile
• UPDATEs occur regularly, whereas data warehouses are
refreshed via INSERTs to firmly preserve history
• Current valued
• Changes are made almost with zero latency
Classification of ODS
• Urgency
– Class IV – Updates into the ODS from the DW are
unscheduled
• Data in the DW is analyzed, and periodically placed in the ODS
• For Example –Customer Profile Data
• Customer Name & ID
• Customer Volume – High/low
• Customer Profitability – High/low
• Customer Freq. of activity – very freq./very infreq.
• Customer likes & dislikes
ODS
ODS & Real-Time Data Warehousing
• Which class of ODS can be used for RTDWH?
• HOW?
• Let us first look at what we mean by RTDWH
• Wait till we talk about RTDWH
Basic Elements of a
Data Warehouse
Prof. Navneet Goyal
Department Of Computer Science
BITS, Pilani
Basic Elements of a DW
• Source Systems
• Data Staging Area
• Presentation Servers
• Data Mart/Super Marts
• Data Warehouse
• Operational Data Store
• OLAP
Kimball vs. Inmon
(Architecture figure: external sources and operational databases feed Extract/Transform/Load/Refresh processes; data flows to the warehouse and data marts and is served to analysis, query/reporting, and data mining tools, with a metadata repository alongside.)
Data Staging Area (DSA)
A storage area where extracted data is
Cleaned
Transformed
Deduplicated
Initial storage for data
Need not be based on Relational model
Spread over a number of machines
Mainly sorting and Sequential processing
COBOL or C code running against flat files
Does not provide data access to users
Analogy – kitchen of a restaurant
Data Staging Area
The Data Warehouse Staging Area is temporary
location where data from source systems is copied
Due to:
varying business cycles
data processing cycles
hardware and network resource limitations and
geographical factors
it is not feasible to extract all the data from all
Operational databases at exactly the same time
For example: Data from Singapore branch will arrive
much earlier than from the NY branch
Data Staging Area
Simplifies the overall management of a Data
Warehousing system
ETL tools work here!
DSA is everything between the source systems and
the presentation server
Raw food (read data) is transformed into a fine meal
(read data fit for user queries and consumption)
DSA is accessible only to professional chefs (read
skilled professionals)
Customers (read end users) are not invited to eat in
the kitchen (query in the DSA)
Data Staging Area
Key architectural requirement for the DSA is that it is
Off-limits to business users and does not provide
query and presentation services
Data Staging Area
Steps involved:
Extraction
Transformation
Cleansing the data
Combining data from multiple sources
Deduplicating data
Assigning Surrogate keys
Load or Transfer
Data Staging Area
DSA is dominated by sorting and sequential
processing
DSA is not typically based on the relational model,
but rather a collection of flat files
Many times the data arrives in the DSA in 3rd normal
form, which is acceptable
But, is not recommended because the data has to be
loaded into the presentation server in the
dimensional model
A normalized database in the staging area is acceptable for supporting the staging process, but it must be off-limits to user queries, as normalized structures defeat the understandability and performance goals of the presentation area
Presentation Server
A target physical machine on which DW data is
organized for
Direct querying by end users using OLAP
Report writers
Data Visualization tools
Data mining tools
Data stored in Dimensional framework
Presentation area is the DW for the end users
Analogy – Sitting area of a restaurant
Presentation Server
In Kimball’s approach, presentation area is a series of
integrated data marts (super marts)
A data mart is a wedge of the overall presentation
area pie
Data is presented, stored, and accessed in
dimensional schema
Dimensional modeling is very different from 3NF
modeling (normalized models)
Normalized modeling is quite useful in OLTP systems
Not suitable for DW queries!!
Presentation Server
Data Marts must contain detailed and atomic data
May also contain aggregated data
All data marts must be built using common dimensions and facts
Conformed dimensions and facts
Concept of SUPERMARTS!!
Presentation Server
Data Warehouse Bus Architecture
Building a DW in a single step is too daunting a task
Architected, incremental approach to building a DW is
the DW Bus Architecture
Define a standard bus for the DW environment
Separate data marts, developed by different groups
at different times, can be plugged together and can
usefully coexist if they conform to the standard
Presentation Server
According to Kimball –
• What is OLAP
• Need for OLAP
• Features & functions of OLAP
• Different OLAP models
• OLAP implementations
o Client-server architecture
EXTRACT, TRANSFORM , & LOAD
• Extract
– Extract relevant data
• Transform
– Transform data to DW format
– Build keys, etc.
– Cleansing of data
• Load
– Load data into DW
– Build aggregates, etc.
ETL System
• Back room or Green room of the DW
• Analogy - Kitchen of a restaurant
– A restaurant's kitchen is designed for efficiency, quality &
integrity
– Throughput is critical when the restaurant is packed
– Meals coming out should be consistent and hygienic
– Skilled chefs
– Patrons not allowed inside
• Dangerous place to be in – sharp knives and hot plates
• Trade secrets
ETL Design & Development
• Most challenging problem faced by the DW project
team
• 70% of the risk & effort in a DW project comes from
ETL
• Has 34 subsystems!!
• Not a one time effort!
– Initial load
– Subsequent loads (periodic refresh of the DW)
• Automation is critical!
Back Room Architecture
• ETL processing happens here
• Availability of right data from point A to point B with
appropriate transformations applied at the
appropriate time
• ETL tools are largely automated, but are still very
complex systems
General ETL Requirements
• Productivity support
– Basic development environment capabilities like code library
management, check in/check out, version control etc.
• Usability
– Must be as usable as possible
– GUI based
– System documentation: developers should easily capture
information about processes they are creating
– This metadata should be available to all
– Data compliance
• Metadata Driven
– Services that support ETL process must be metadata driven
General ETL Requirements
• Business needs – users' information requirements
• Compliance – must provide proof that the data reported is not
manipulated in any way
• Data Quality – garbage in garbage out!!
• Security – do not publish data widely to all decision makers
• Data Integration – Master Data Management System (MDM).
Conforming dimensions and facts
• Data Latency – huge effect on ETL architecture
– Use efficient data processing algorithms, parallelization, and powerful
hardware to speed up batch-oriented data flows
– If the requirement is for Real-time, then architecture must make a
switch from batch to microbatch or stream-oriented
• Archiving & Lineage – must for compliance & security reasons
– After every major activity of the ETL pipeline, writing the data to disk
(staging) is recommended
– All staged data should be archived
Choice of Architecture
Tool Based ETL
• Bulk Extractions
– Entire DW is refreshed periodically
– Heavily taxes the network connections between the source
& target DBs
– Easier to set up & maintain
• Change-based Extractions
– Only data that have been newly inserted or updated in the
source systems are extracted & loaded into the DW
– Places less stress on the network but requires more complex
programming to determine when a new DW record must be
inserted or when an existing DW record must be updated
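A sketch of a change-based extraction, assuming each source table carries a last_updated timestamp and an etl_control table records the previous run time; all names are illustrative.

-- Pull only the rows inserted or updated since the last extraction run.
SELECT *
FROM   src_orders o
WHERE  o.last_updated > (SELECT c.last_run_time
                         FROM   etl_control c
                         WHERE  c.job_name = 'ORDERS_EXTRACT');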
Transformation Tools
Data Flow
Extract Clean Conform Deliver
• Back room of a DW is often called the data staging
area
• Staging means writing to disk
• ETL team needs a number of different data structures
for all kinds of staging needs
To stage or not to stage
• Flat files
– fast to write, append to, sort and filter (grep) but slow to
update, access or join
• XML Data Sets
– Used as a medium of data transfer between incompatible
data sources
• Relational Tables
Coming up next …
• 34 subsystems of ETL
ETL Subsystems (contd…)
Prof. Navneet Goyal
BITS, Pilani
• Sources used for this lecture
– Ralph Kimball, Joe Caserta, The Data Warehouse ETL
Toolkit: Practical Techniques for Extracting, Cleaning,
Conforming and Delivering Data
34 Subsystems of ETL
• Extracting (1-3)
• Cleaning & Conforming Data (4-8)
• Prepare for Presentation (9-21)
• Managing the ETL Environment (22-34)
Prepare for Presentation (Subsystems 9-21)
• Primary mission of the ETL system
• Delivery subsystems are the most critical subsystems in
the ETL architecture
• Despite variations in source data structures and in cleaning & conforming logic, the delivery processing techniques are more defined & disciplined
• Many subsystems focus on dimension table processing
– Dimension tables are at the core of any DW
– Provide context for fact tables
• Fact tables are huge and contain critical measurements of the business, but preparing them for presentation is straightforward
Prepare for Presentation (Subsystems 9-21)
9. Slowly Changing Dimension (SCD) Manager
– Implements SCD logic
– Handling of update of a dimension attribute value
– Type I, Type II, & Type III responses to updates
– Type IV – Add a mini-dimension
– Type V – Add a mini-dimension & a Type I outrigger
– Type VI – Add Type I attributes to Type II dimensions
– Type VII – Dual Type I & Type II dimensions
(to be covered in Module 5 on Advanced Dimensional Modeling)
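A minimal sketch of the two most common responses in SQL, assuming a customer dimension with a surrogate key, a natural key, and row effective/expiry dates; names and values are illustrative.

-- Type 1: overwrite the attribute in place (no history kept).
UPDATE customer_dim
SET    city = 'Pune'
WHERE  customer_nk = 'C1001';

-- Type 2: expire the current row, then insert a new row with a new surrogate key.
UPDATE customer_dim
SET    row_expiry_date = DATE '2018-04-19'
WHERE  customer_nk = 'C1001'
AND    row_expiry_date IS NULL;

INSERT INTO customer_dim
  (customer_sk, customer_nk, city, row_effective_date, row_expiry_date)
VALUES
  (customer_sk_seq.NEXTVAL, 'C1001', 'Pune', DATE '2018-04-20', NULL);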
• Integer keys
• Artificial Keys
• Non-intelligent Keys
• Meaningless Keys
1234600 …..
Example dimension row (repeated in the slides to illustrate the different SCD responses, including the Type 3 "Alternate Reality" case):
SK: 12345  Product: Intellikidz1  Department: Education  NK: ABC922Z
• Subsequent Load
– Relatively complex
• Dimension tables
• Fact Tables
• Fact Table Loading:
– In the FT record, simply replace the natural key with the
surrogate key
The lookup table for a typical dimension. There are as many rows as there
are unique production keys. The second column is the currently in-force
surrogate key used with each production key.
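A sketch of this lookup step during fact loading: staged fact rows carry natural (production) keys, and key-map lookup tables supply the currently in-force surrogate keys; all table names are assumed.

INSERT INTO sales_fact (product_key, store_key, period_key, dollars_sold, units)
SELECT pm.product_key,   -- currently in-force surrogate key for the product
       sm.store_key,
       tm.period_key,
       st.dollars_sold,
       st.units
FROM   stg_sales st
JOIN   product_key_map pm ON pm.product_nk = st.product_nk
JOIN   store_key_map   sm ON sm.store_nk   = st.store_nk
JOIN   time_key_map    tm ON tm.period_nk  = st.period_nk;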
Outriggers
Hybrid Techniques:
2 20-24 M 20K-24999
3 20-24 M 25K-29999
18 25-29 M 20K-24999
10 25-29 M 25K-29999
Mini-Dimensions
A mini-dimension cannot itself be allowed to grow very large
5 demographic attributes
Each attribute can take 10 distinct values
How many rows in the mini-dimension? 10^5 = 100,000
Creating Mini-Dimensions
Mini-dimension
Source: http://www.yellowfinbi.com/
Conformed Dimensions
& Facts
- We want to find out the sales amount for all of the stores
- If we do a regular join, we will not be able to get what we want because we will
have missed "New York," since it does not appear in the Store_Information table
SELECT A1.store_name, SUM(A2.Sales) SALES
FROM Geography A1, Store_Information A2
WHERE A1.store_name = A2.store_name (+)
GROUP BY A1.store_name

Result:
store_name    SALES
Boston        $700
New York
Los Angeles   $1800
San Diego     $250
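The (+) notation above is Oracle-specific; a sketch of the same query using ANSI outer join syntax (with NVL, described next, to show missing sales as 0):

SELECT   A1.store_name, NVL(SUM(A2.Sales), 0) AS SALES
FROM     Geography A1
LEFT OUTER JOIN Store_Information A2
         ON A1.store_name = A2.store_name
GROUP BY A1.store_name;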
NVL Function
In Oracle/PLSQL, the NVL function lets you substitute a value when a null value is encountered.
NVL (string1, replace_with )
string1 is the string to test for a null value. Replace_with is the value returned if
string1 is null.
Example #1:
select NVL (supplier_city, 'n/a')
from suppliers;
The SQL statement above would return 'n/a' if the supplier_city field contained a
null value. Otherwise, it would return the supplier_city value.
Example #2:
select supplier_id,
NVL (supplier_desc, supplier_name)
from suppliers;
This SQL statement would return the supplier_name field if the supplier_desc
contained a null value. Otherwise, it would return the supplier_desc.
Example #3:
select NVL (commission, 0)
from sales;
This SQL statement would return 0 if the commission field contained a null
value. Otherwise, it would return the commission field.
Conformed Dimensions
Dimension tables conform when attributes in
separate dimension tables have the same
column names and domain contents
Information from separate fact tables can be combined in a single report by using conformed dimension attributes that are associated with each fact table
Conformed dimensions are reused across fact
tables
Refer to ETL subsystem 8: Conforming System
Conformed Dimensions
Bottom-up data warehousing approach builds one data
mart at a time
Drill-across between data marts requires common
dimension tables
Common dimensions and attributes should be
standardized across data marts
Create master copy of each common dimension table
Three types of “conformed” dimensions:
Dimension table identical to master copy
Dimension table has subset of rows from the master copy
• Can improve performance when many dimension rows are not
relevant to a particular process
Dimension table has subset of attributes from master copy
• Allows for roll-up dimensions at different grains (used in
Aggregation)
Conformed Dimension Example
Monthly sales forecasts
Predicted sales for each brand in each district in each month
POS Sales fact recorded at finer-grained detail
• Product SKU vs. Brand
• Date vs. Month
• Store vs. District
Use roll-up dimensions
Brand dimension is rolled-up version of master Product
dimension
• One row per brand
• Only include attributes relevant at brand level or higher
Month dimension is rolled-up Date
District dimension is rolled-up Store
Brand, Month, & District are conformed dimensions
Conformed Facts
If the same measurement appears in separate fact
tables, care must be taken to make sure that
technical definitions of the facts are identical if they
are to be compared or computed together
If separate fact definitions are consistent, the
conformed facts should be identically named,
otherwise they should be differently named
Examples: Revenue, profit, standard prices & costs,
measures of quality and customer satisfaction and
other KPIs are facts that must conform
95% of data architecture effort goes in designing
conformed dimensions and only 5% effort goes into
establishing conformed facts definitions
Drill-Across Example
Question: How did actual sales diverge from forecasted sales in Sept. '04?
Drill-across between Forecast and Sales
Step 1: Query Forecast fact
Group by Brand Name, District Name
Filter on MonthAndYear =‘Sept 04’
Calculate SUM(ForecastAmt)
Query result has schema (Brand Name, District Name, ForecastAmt)
Step 2: Query Sales fact
Group by Brand Name, District Name
Filter on MonthAndYear =‘Sept 04’
Calculate SUM(TotalSalesAmt)
Query result has schema (Brand Name, District Name, TotalSalesAmt)
Step 3: Combine query results
Join Result 1 and Result 2 on Brand Name and District Name
Result has schema (Brand Name, District Name, ForecastAmt,
TotalSalesAmt)
Outer join unnecessary assuming:
• Forecast exists for every brand, district, and month
• Every brand has some sales in every district during every month
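A sketch of the three steps as a single SQL statement; for simplicity it assumes both fact tables join directly to conformed Brand, District, and Month dimension tables, and all names are illustrative.

-- Step 1 and Step 2 are the two inline views; Step 3 is the join on the
-- conformed Brand Name and District Name attributes.
SELECT f.brand_name, f.district_name, f.forecast_amt, s.total_sales_amt
FROM  (SELECT b.brand_name, d.district_name, SUM(ff.forecast_amt) AS forecast_amt
       FROM   forecast_fact ff
       JOIN   brand_dim    b ON b.brand_key    = ff.brand_key
       JOIN   district_dim d ON d.district_key = ff.district_key
       JOIN   month_dim    m ON m.month_key    = ff.month_key
       WHERE  m.month_and_year = 'Sept 04'
       GROUP BY b.brand_name, d.district_name) f
JOIN  (SELECT b.brand_name, d.district_name, SUM(sf.sales_amt) AS total_sales_amt
       FROM   sales_fact sf
       JOIN   brand_dim    b ON b.brand_key    = sf.brand_key
       JOIN   district_dim d ON d.district_key = sf.district_key
       JOIN   month_dim    m ON m.month_key    = sf.month_key
       WHERE  m.month_and_year = 'Sept 04'
       GROUP BY b.brand_name, d.district_name) s
  ON  f.brand_name    = s.brand_name
  AND f.district_name = s.district_name;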
Multi-valued Dimensions
Examples
• Parts composed of subparts
(Figure: a part composed of subparts, taken from Kimball's book The Data Warehouse Toolkit, 3e)
Variable Depth Hierarchies
How many Records in the
Bridge Table?
Could not have joined them to the same date dimension – SQL would interpret it as a two-way simultaneous join
• Dimension Hierarchies
• Factless Fact Tables
Factless Fact Tables
Prof. Navneet Goyal
Department of Computer Science & Information Systems
BITS, Pilani
Most of the material for the presentation is based on
the book:
The Data Warehouse Toolkit, 3e by
Ralph Kimball
Margy Ross
Factless Fact Tables
• Facts are typically numeric measures
• Events which record merely the coming together of
dimensional entities at a particular moment
– Student attending a class
– A particular product on promotion
• Can also be used to analyze what did not happen
– Factless coverage fact table about all possibilities
– Activity table about events that did happen
– Subtract activity from coverage
– Example: products that were on promotion but did not
sell
Factless Fact Tables
• Case studies that employ factless fact tables
– Retail sales
– Order management
– Education
Retail sales
• Retail sales schema can not answer an important
question – What products were on promotion but did
not sell?
• Sale FT records only those SKUs that actually got sold
• Not advisable to keep those SKUs in sales FT that did
not sell (it is already huge!!)
• Introduce promotion coverage fact table
– Same keys as the sales fact table
– Grain is different
– FT row represents a product that was on promotion regardless
of whether the product sold
– Factless fact table
Retail sales
• What products were on promotion but did not sell?
• Two step process:
– Query the promotion coverage FFT to determine all the
products that were on promotion on a given day
– Find out all products that sold on a given day
– Difference of these two lists!!
– Try writing SQL query for this!
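A sketch of one way to write it in Oracle SQL using MINUS (set difference); the key and table names are assumed.

-- Products on promotion on the given day ...
SELECT product_key
FROM   promotion_coverage_fact
WHERE  date_key = 20180420
MINUS
-- ... minus the products that actually sold on that day.
SELECT product_key
FROM   sales_fact
WHERE  date_key = 20180420;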
Order Management
• Customer/representative assignment
• Representatives are assigned to customers and it is
not necessary that every assignment would lead to a
sale
Sales Rep-Customer Assignment Fact (figure): Assignment Effective Date Key (FK), Assignment Expiration Date Key (FK), Sales Rep Key (FK), Customer Key (FK), Customer Assignment Counter = 1; the Date dimension is viewed in two roles, alongside the Sales Rep and Customer dimensions.
(Figure: lattice of cuboids over the dimensions time, item, location, supplier; 3-D cuboids such as (time, location, supplier), (time, item, location), (time, item, supplier), (item, location, supplier) sit above the 4-D base cuboid; concept hierarchy levels such as Office, Day, Month are also shown.)
Cube Materialization:
Full Cube vs. Iceberg Cube
Computing only the cuboid cells whose measure satisfies the iceberg condition, i.e., is greater than the minimum support value
Only a small portion of cells may be above the water in a sparse cube
Cube Materialization:
Full Cube vs. Iceberg Cube
• Introducing iceberg cubes will lessen the burden of computing trivial aggregate cells in a
data cube. However, we could still end up with a large number of uninteresting cells to
compute.
• For example, suppose that there are 2 base cells for a database of 100 dimensions, denoted as {(a1, a2, a3, …, a100) : 10, (a1, a2, b3, …, b100) : 10}, where each has a cell count of 10. If the minimum support is set to 10, there will still be an impermissible number of cells to compute and store, although most of them are not interesting.
• For example, there are about 2^101 distinct aggregate cells, like {(a1, a2, a3, a4, …, a99, ∗) : 10, …, (a1, a2, ∗, a4, …, a99, a100) : 10, …, (a1, a2, a3, ∗, …, ∗, ∗) : 10}, but most of them do not contain much new information.
• If we ignore all the aggregate cells that can be obtained by replacing some constants by ∗'s while keeping the same measure value, there are only three distinct cells left: {(a1, a2, a3, …, a100) : 10, (a1, a2, b3, …, b100) : 10, (a1, a2, ∗, …, ∗) : 20}. That is, out of about 2^101 distinct base and aggregate cells, only three really offer valuable information.
General Strategies for Data Cube Computation
• Aggregation from the smallest child when there exist multiple child cuboids.
When there exist multiple child cuboids, it is usually more efficient to compute
the desired parent (i.e., more generalized) cuboid from the smallest, previously
computed child cuboid.
• To compute a sales cuboid, Cbranch, when there exist two previously computed cuboids,
C{branch, year} and C{branch, item}, for example, it is obviously more efficient to compute Cbranch from
the former than from the latter if there are many more distinct items than distinct years.
• The Apriori pruning method can be explored to compute iceberg cubes
efficiently. The Apriori property,[3] in the context of data cubes, states as follows:
If a given cell does not satisfy minimum support, then no descendant of the cell
(i.e., more specialized cell) will satisfy minimum support either. This property can
be used to substantially reduce the computation of iceberg cubes.
M6: OLAP & Multidimensional Databases (MDDB)
What is OLAP?
OLAP is a category of software technology that
enables analysts, managers, and executives to gain
insight into data through fast, consistent,
interactive access to a wide variety of possible
views of information that has been transformed
from raw data to reflect the real dimensionality of
the enterprise as understood by the user.
Codd’s Rules for OLAP
1. Multidimensional Conceptual View. Provide a multidimensional data model that is intuitively analytical and easy to use.
Business users' view of an enterprise is multidimensional in nature. Therefore, a multidimensional data model conforms to
how the users perceive business problems.
2. Transparency Make the technology, underlying data repository, computing architecture, and the diverse nature of source
data totally transparent to users. Such transparency, supporting a true open system approach, helps to enhance the
efficiency and productivity of the users through front-end tools that are familiar to them.
3. Accessibility Provide access only to the data that is actually needed to perform the specific analysis, presenting a single,
coherent, and consistent view to the users. The OLAP system must map its own logical schema to the heterogeneous physical
data stores and perform any necessary transformations.
4. Consistent Reporting Performance Ensure that the users do not experience any significant degradation in reporting
performance as the number of dimensions or the size of the database increases. Users must perceive consistent run time,
response time, or machine utilization every time a given query is run.
5. Client/Server Architecture Conform the system to the principles of client/server architecture for optimum performance,
flexibility, adaptability, and interoperability. Make the server component sufficiently intelligent to enable various clients to be
attached with a minimum of effort and integration programming.
6. Generic Dimensionality Ensure that every data dimension is equivalent in both structure and operational capabilities. Have
one logical structure for all dimensions. The basic data structure or the access techniques must not be biased toward any
single data dimension.
Codd’s Rules for OLAP
7. Dynamic Sparse Matrix Handling Adapt the physical schema to the specific analytical model being created and loaded that
optimizes sparse matrix handling. When encountering a sparse matrix, the system must be able to dynamically deduce the
distribution of the data and adjust the storage and access to achieve and maintain consistent level of performance.
8. Multiuser Support Provide support for end users to work concurrently with either the same analytical model or to create
different models from the same data. In short, provide concurrent data access, data integrity, and access security.
9. Unrestricted Cross-dimensional Operations Provide ability for the system to recognize dimensional hierarchies and automatically
perform roll-up and drill-down operations within a dimension or across dimensions. Have the interface language allow
calculations and data manipulations across any number of data dimensions, without restricting any relations between data cells,
regardless of the number of common data attributes each cell contains.
10. Intuitive Data Manipulation Enable consolidation path reorientation (pivoting), drill-down and roll-up, and other manipulations
to be accomplished intuitively and directly via point-and-click and drag-and-drop actions on the cells of the analytical model.
Avoid the use of a menu or multiple trips to a user interface.
11. Flexible Reporting Provide capabilities to the business user to arrange columns, rows, and cells in a manner that facilitates easy
manipulation, analysis, and synthesis of information. Every dimension, including any subsets, must be able to be displayed with
equal ease.
12. Unlimited Dimensions and Aggregation Levels Accommodate at least fifteen, preferably twenty, data dimensions within a
common analytical model. Each of these generic dimensions must allow a practically unlimited number of user-defined
aggregation levels within any given consolidation path.
An Analysis Session
(Figure: a drill-down analysis session. Observation: sales are OK, but enterprise-wide profitability dipped sharply in the last 3 months. Queries: countrywide monthly sales for the last 3 months? Monthly sales by worldwide regions? A sharp reduction is found in the European region. European sales by countries?)
OLAP operations
Summary report (LINE, TOTAL SALES):
Clothing     $12,836,450
Electronics  $16,068,300
Video        $21,262,190
Kitchen      $17,704,400
Appliances   $19,600,800
Total        $87,472,140

Drill down (LINE by YEAR):
LINE         1998         1999         2000         TOTAL
Clothing     $3,457,000   $3,590,050   $5,789,400   $12,836,450
Electronics  $5,894,800   $4,078,900   $6,094,600   $16,068,300
Video        $7,198,700   $6,057,890   $8,005,600   $21,262,190
Kitchen      $4,875,400   $5,894,500   $6,934,500   $17,704,400
Appliances   $5,947,300   $6,104,500   $7,549,000   $19,600,800
Total        $27,373,200  $25,725,840  $34,373,100  $87,472,140

Rotate / Pivot (YEAR by LINE):
YEAR   Clothing     Electronics  Video        Kitchen      Appliances   TOTAL
1998   $3,457,000   $5,894,800   $7,198,700   $4,875,400   $5,947,300   $27,373,200
1999   $3,590,050   $4,078,900   $6,057,890   $5,894,500   $6,104,500   $25,725,840
2000   $5,789,400   $6,094,600   $8,005,600   $6,934,500   $7,549,000   $34,373,100
Total  $12,836,450  $16,068,300  $21,262,190  $17,704,400  $19,600,800  $87,472,140
Limitations of Other Tools
• Users need ability to analyse the data along multiple dimensions and
their hierarchies rapidly
• Spreadsheets can be cumbersome to use, particularly for large
volumes of data.
• Multidimensional data entered in spreadsheet has lot of redundancy
• It would require enormous effort to create a multidimensional view
Limitations of Other Tools
• SQL was originally meant to be end-user query language
• Except for very simple operations, the syntax is not easy to
conceptualize for end-users
• The vocabulary is not suitable for analysis, comparisons are a
challenge
• SQL is not good with complex calculations and time-series data.
Limitations of Other Tools
• A real-world analysis session requires many queries following one
after the other.
• Each query may translate into a number of statements invoking full
table scans, multiple joins, aggregations, groupings, and sorting.
• The overhead on the systems would be enormous and seriously
impact the response times
Features of OLAP
Multidimensional analysis
Consistent performance
Fast response times for interactive queries
Drill-down and roll-up
Navigation in and out of details
Slice-and-dice or rotation
Multiple view modes
Easy scalability
Time intelligence (year-to-date, fiscal period)
(The slide groups these into basic and advanced features.)
CUBE Operator in SQL
• A cube aggregates the facts in each level of each dimension in a given
OLAP schema
• Data cubes are not "cubes" in the strictly mathematical sense because they do not have
equal sides.
• Most likely, there are more than 3 dimensions
• Major SQL vendors provide cube operator in their products
• Typical sequence for Cube computation:
Identify physical sources of data
Specify logical views built upon physical source
Build cube for specified measures and dimensions
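A sketch of the CUBE operator, assuming a simple sales table with line, year, and sales columns like the data in the earlier OLAP operations example:

-- Produces aggregates for (line, year), (line), (year), and the grand total.
SELECT   line, year, SUM(sales) AS total_sales
FROM     sales
GROUP BY CUBE (line, year);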
Multidimensional Databases
Why Multidimensional Database
• In 1960s, when a research scholar at MIT was doing analytical work
on product sales, he realized that
‒ he spent most of his time wrestling with reformatting the data for his analysis,
‒ not on the statistical algorithms or the true data analysis
• Once he had modeled the data in a multidimensional form, he was
able to report the data in many different formats
• By abstracting the data model from the data itself, the user could
work with the data in an ad hoc fashion, asking questions that had
not been formulated when developing the specifications
A Motivating Example
An automobile manufacturer wants to increase sale volumes by examining sales
data collected throughout the organization. The evaluation would require viewing
historical sales volume figures from multiple dimensions such as
• Sales volume by model
• Sales volume by color
• Sales volume by dealer
• Sales volume over time
Sales Data in Relational Form
(Figure: the sales data shown as a relational table with MODEL, COLOR, and VOLUME columns.)
Multidimensional Structure
Measurement: sales volume; Dimensions: MODEL and COLOR
(Figure: a 2-D array with MODEL on one axis and COLOR on the other; rows Hatchback: 6, 5, 4; SUV: 3, 5, 5; Sedan: 4, 3, 2.)
Let us add Dealer to the table
MODEL COLOR DEALERSHIP VOLUME
Hatchback BLUE Mitra 6
Hatchback BLUE Patel 6
Hatchback BLUE Singh 2
Hatchback RED Mitra 3
Hatchback RED Patel 5
Hatchback RED Singh 5
Hatchback WHITE Mitra 2
Hatchback WHITE Patel 4
Hatchback WHITE Singh 3
SUV BLUE Mitra 2
SUV BLUE Patel 3
SUV BLUE Singh 2
SUV RED Mitra 7
SUV RED Patel 5
SUV RED Singh 2
SUV WHITE Mitra 4
SUV WHITE Patel 5
SUV WHITE Singh 1
SEDAN BLUE Mitra 6
SEDAN BLUE Patel 4
SEDAN BLUE Singh 2
SEDAN RED Mitra 1
SEDAN RED Patel 3
SEDAN RED Singh 4
SEDAN WHITE Mitra 2
SEDAN WHITE Patel 2
SEDAN WHITE Singh 3
Multidimensional Structure
(Figure: a 3-D array (cube) with dimensions MODEL (Hatchback, SUV, Sedan), COLOR (Blue, Red, White), and DEALER (Mitra, Patel, Singh); each cell holds a sales volume, e.g. one Hatchback slice reads 2, 5, 3.)
Multidimensional Structure
If each dimension has 10 positions, the relational table requires 10 * 10 * 10, i.e., 1,000 records
• RDBMS – all 1,000 records might need to be searched to find the right record; about 500 in the average case
• MDB – about 15 cell accesses on average (roughly 5 positions searched along each of the 3 dimension indexes) vs. about 500 for the RDBMS
Performance Advantages with MDB
• To generalize the performance advantage
• In case of RDBMS, size of search space gets multiplied, each time a new dimension is
added; accordingly access time is affected
• In case of MDB, the search space increases by the size of new dimension, each time a
new dimension is added.
• At what cost?
• MDB is a separate proprietary implementation from SQL
• Since all business data is in the RDBMS, the MDB has to be precomputed. The larger the data and the more the dimensions, the higher the precomputation effort.
• The longer the precomputation interval, the higher the latency of the MDB
OLAP Operations
• Roll up (drill-up): summarize data
• by climbing up hierarchy or by dimension reduction
• Drill down (roll down): reverse of roll-up
• from higher level summary to lower level summary or detailed
data, or introducing new dimensions
• Slice and dice: project and select
• Pivot (rotate):
• reorient the cube, visualization, 3D to series of 2D planes
• Other operations
• drill across: involving (across) more than one fact table
• drill through: through the bottom level of the cube to its back-end relational tables (using SQL)
Distinct OLAP Models
Data Storage
ROLAP:
Data stored as relational tables in the warehouse.
Detailed and light summary data available.
Very large data volumes.
All data access from the warehouse storage.
MOLAP:
Data stored as relational tables in the warehouse.
Various summary data kept in proprietary databases (MDBs).
Moderate data volumes.
Summary data access from the MDB, detailed data access from the warehouse.
Underlying Technologies
ROLAP:
Use of complex SQL to fetch data from the warehouse.
ROLAP engine in the analytical server creates data cubes on the fly.
Multidimensional views by the presentation layer.
MOLAP:
Creation of pre-fabricated data cubes by the MOLAP engine.
Proprietary technology to store multidimensional views in arrays, not tables.
High-speed matrix data retrieval.
Sparse matrix technology to manage data sparsity in summaries.
Functions and Features
ROLAP:
Known environment and availability of many tools.
Limitations on complex analysis functions.
Drill-through to the lowest level is easier.
Drill-across is not always easy.
MOLAP:
Faster access.
Large library of functions for complex calculations.
Easy analysis irrespective of the number of dimensions.
Extensive drill-down and slice-and-dice capabilities.
HOLAP (Hybrid OLAP)
The intermediate architecture type, HOLAP, aims at mixing the advantages of
both basic solutions. It takes advantage of the standardization level and the
ability to manage large amounts of data from ROLAP implementations, and
the query speed typical of MOLAP systems. HOLAP has the largest amount of
data stored in an RDBMS, and a multidimensional system stores only the
information users most frequently need to access. If that information is not
enough to solve queries, the system will transparently access the part of the
data managed by the relational system. Important market actors have
adopted HOLAP solutions to improve their platform performance.
Prescribed Text Books
References
Jiawei Han, Micheline Kamber, and Jian Pei, Data Mining: Concepts and Techniques, Third Edition, Morgan Kaufmann Publishers.
Aggregation
Contd…
- Ralph Kimball
Aggregation
• Still, aggregation is underused. Why?
• We are still not comfortable with redundancy!
• Requires extra space
• Most of us are not sure of what aggregates to store
• A bizarre phenomenon called
SPARSITY FAILURE
Hierarchies
• 10000 products in 2000 categories
• 1000 stores in 100 districts
• 30 aggregates in 100 time periods
* Most of the material for this presentation has been taken from Lawrence Corr’s article “Lost ,
Shrunken, and Collapsed”
Introduction
Three ways of creating aggregates:
• Lost Dimension Aggregate
• Shrunken Dimension Aggregate
• Collapsed Dimension Aggregate
Shrunken
Dimension
Figure taken from Lawrence Corr’s article “Lost , Shrunken, and Collapsed”
Shrunken Dimension Aggregates
• Shrunken dimension aggregates have one or more
dimensions replaced by shrunken or rolled versions of
themselves.
• This technique can be combined with lost dimensions as well
(see Figure on slide #6)
• One shrunken dimension and one lost dimension
• Could represent a monthly-product-sales-by-store summary
of the original fact table. In this example, a monthly-grain
time dimension replaces the daily-grain time dimension
• The aggregate would be significantly smaller than the fact
table (though probably not by the full factor of 30 you might
expect, because not every product is sold every day) but
would still allow dimensional analysis by time at the month,
quarter, and year levels
Aggregates:
Shrunken Dimensions
Shrunken
Dimension
Lost
Dimension
Figure taken from Lawrence Corr’s article “Lost , Shrunken, and Collapsed”
Shrunken Dimension Aggregates
• Before you can build a shrunken dimension aggregate, one or
more shrunken dimension keys must replace original
surrogate keys in the fact table.
• If the shrunken dimension keys are carried in the original
dimensions, the aggregate can be populated by a SQL query
that joins the fact table to these dimensions and groups on a
combination of shrunken dimension keys and surviving
atomic-level surrogate keys.
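A sketch of such a population query, assuming the daily-grain date dimension carries the shrunken month_key and the aggregate keeps the atomic product and store surrogate keys; all names are illustrative.

INSERT INTO monthly_product_store_sales (month_key, product_key, store_key, sales_amt, units)
SELECT d.month_key,      -- shrunken (monthly-grain) time key carried in the date dimension
       f.product_key,    -- surviving atomic-level surrogate keys
       f.store_key,
       SUM(f.sales_amt),
       SUM(f.units)
FROM   sales_fact f
JOIN   date_dim d ON d.date_key = f.date_key
GROUP BY d.month_key, f.product_key, f.store_key;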
Sales Star Schema
Figure taken from Lawrence Corr’s article “Lost , Shrunken, and Collapsed”
Collapsed Dimension
• Collapsed dimension aggregates are created when dimensional
keys have been replaced with high-level dimensional attributes,
resulting in a single, fully denormalized summary table.
• Figure on slide #17 shows a collapsed aggregate with a small
number of surviving dimensional attributes from two
dimensions.
• This example could be a quarterly product category sales
summary.
• Collapsed dimension aggregates have the performance and
usability advantages of shrunken dimension aggregates,
without requiring you to maintain shrunken physical
dimensions and keys.
• In addition, they can offer further query acceleration because
they cut out join processing for rewritten queries.
Collapsed Dimension
• Can be considered only for high-level summaries where few dimensional attributes remain and those attributes are relatively short.
• Otherwise, the increased record length may contribute to the
collapsed table being too large, especially if many attributes are
included.
• A collapsed dimension aggregate might well have 10 times fewer
records than the fact table but its record length could easily be
three to five times longer, leaving the overall performance gain at
only two or three times.
• Collapsed dimension aggregates would be tenable only for high-
level summaries built from already moderately sized aggregates.
Collapsed Dimensions
Figure taken from Lawrence Corr’s article “Lost , Shrunken, and Collapsed”
VIEW
MATERIALIZATION
Examples taken from Chapter 25 of Ramakrishnan's book Database Management Systems, 3e.
Query Modification
(Evaluate On Demand)
View CREATE VIEW RegionalSales(category,sales,state)
AS SELECT P.category, S.sales, L.state
FROM Products P, Sales S, Locations L
WHERE P.pid=S.pid AND S.locid=L.locid
Query
SELECT R.category, R.state, SUM(R.sales)
FROM RegionalSales AS R GROUP BY R.category, R.state
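Under query modification, the reference to RegionalSales is replaced by the view definition when the query is evaluated; a sketch of the modified query:

SELECT   R.category, R.state, SUM(R.sales)
FROM     (SELECT P.category, S.sales, L.state
          FROM   Products P, Sales S, Locations L
          WHERE  P.pid = S.pid AND S.locid = L.locid) R
GROUP BY R.category, R.state;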
Selecting which views to materialize is an NP-complete problem!!
Cost Model
Weighted query processing cost:
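The slide's formula is not reproduced; a common form of such a cost model (an assumption, not necessarily the slide's exact expression) weights each workload query's processing cost by its frequency: total cost = Σ_i f_i * cost(Q_i, M), where f_i is the frequency of query Q_i and M is the set of views chosen for materialization.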
Material adapted from Chapter 13 of Silberschatz book on Database System Concepts, 6e.
View Maintenance
The changes (inserts and deletes) to a relation or expression are referred to as its differential
The sets of tuples inserted into and deleted from r are denoted i_r and d_r respectively
To simplify the description, we consider only inserts and deletes
We replace an update to a tuple by deletion of the tuple followed by insertion of the updated tuple
How to compute the change to the result of each relational operation, given changes to its inputs?
Join Operation
Consider the materialized view v = r ⋈ s and an update to r
Let r_old and r_new denote the old and new states of relation r
Consider the case of an insert to r:
We can write r_new ⋈ s as (r_old ∪ i_r) ⋈ s
And rewrite the above to (r_old ⋈ s) ∪ (i_r ⋈ s)
But (r_old ⋈ s) is simply the old value of the materialized view, so the incremental change to the view is just i_r ⋈ s
Thus, for inserts: v_new = v_old ∪ (i_r ⋈ s)
Similarly, for deletes: v_new = v_old - (d_r ⋈ s)
Selection & Projection
Operations
Selection: Consider a view v = σθ(r).
v_new = v_old ∪ σθ(i_r)
v_new = v_old - σθ(d_r)
Projection is a more difficult operation
R = (A,B), and r(R) = {(a,2), (a,3)}
∏A(r) has a single tuple (a).
If we delete the tuple (a,2) from r, we should not delete the
tuple (a) from ∏A(r), but if we then delete (a,3) as well, we
should delete the tuple
For each tuple in a projection ∏A(r) , we will keep a count
of how many times it was derived
On insert of a tuple to r, if the resultant tuple is already in
∏A(r) we increment its count, else we add a new tuple with
count = 1
On delete of a tuple from r, we decrement the count of the
corresponding tuple in ∏A(r)
• if the count becomes 0, we delete the tuple from ∏A(r)
Aggregate Operations
count: v = A G count(B) (r)
(the count of attribute B, after grouping r by attribute A)
Example result rows: (T1, 15), (T2, 10)
View Materialization:
Summary
View Materialization
Selection of Views to Materialize
View Maintenance
Incremental View Maintenance
Coming up next …
Bitmap Indexes
Bitmap Indexes
(Figure: comparison of different compression techniques for bitmap indexes - WAH, BBC, gzip, PacBits, ExpGol, and uncompressed - plotted against space; Wu et al., 2001.)
Summary of Module 7
Aggregation
Sparsity failure
Aggregate Navigator
Partitioning
Partitioning wrt time dimension
View Materialization
precomputing
Bitmap Indexes
Role of Metadata
Another Classification:
Business Metadata
Technical Metadata
Metadata
Data about the data
Table of contents of data
Catalog for the data
DW Atlas
DW Roadmap
DW Directory
Glue that holds the DW together
Tongs to handle the data
The nerve center
Metadata
It is all the information that defines and
describes the structures, operations, and
contents of a DW system
Information Delivery
* Chapter 9 –
The Significant role of Metadata
Data Warehousing Fundamentals for IT Professionals
Wiley, 2e, 2010
Support for DW in RDBMS
Aggregation & Aggregate Navigator
• Most RDBMS now have support for aggregates in
the form of Aggregate Navigator (AN)
• AN helps them to improve performance of
queries requiring aggregates and to match the
performance of Multidimensional Databases
• Aggregate Navigation Algorithm and metadata
helps to target most suitable aggregate for a
given query
• Users/queries need not be aware of existing
aggregates
• Implemented using Materialized Views
Partitioning
• Data warehouses often contain very large tables
and require techniques both for managing these
large tables and for providing good query
performance across them
• Partitioning is supported in most RDBMSs
• Even if it is not supported, one can create them
using the concept of views
– Manually move data from the table to be partitioned to
its partitions (tables)
– Create a view using union of partitions, giving an
illusion that the original table still exists
Types of Partitioning
Oracle offers four partitioning methods:
• Range Partitioning (already discussed in Module 7)
• Hash Partitioning
• List Partitioning
• Composite Partitioning
Hash Partitioning
• Distributes data evenly among the partitions/devices
• Easy to use partitioning
• Oracle uses a linear hashing algorithm to avoid skew
• The number of partitions should be a power of 2
• Users cannot specify alternate hashing functions or
algorithms
Hash Partitioning: Example
• CREATE TABLE sales_hash
(salesman_id NUMBER(5),
salesman_name VARCHAR2(30),
sales_amount NUMBER(10),
week_no NUMBER(2))
PARTITION BY HASH(salesman_id)
PARTITIONS 4
STORE IN (data1, data2, data3, data4);
List Partitioning
• List partitioning enables you to explicitly control how
rows map to partitions.
• Specify a list of discrete values for the partitioning
column in the description for each partition.
• Different from range partitioning, where a range of
values is associated with a partition and with hash
partitioning, where you have no control of the row-to-
partition mapping
• Advantage of list partitioning is that you can group
and organize unordered and unrelated sets of data in a
natural way
List Partitioning: Example
CREATE TABLE sales_list
(salesman_id NUMBER(5),
salesman_name VARCHAR2(30),
sales_state VARCHAR2(20),
sales_amount NUMBER(10),
sales_date DATE)
PARTITION BY LIST(sales_state)
(
PARTITION sales_west VALUES IN('California', 'Hawaii'),
PARTITION sales_east VALUES IN ('New York', 'Virginia', 'Florida'),
PARTITION sales_central VALUES IN('Texas', 'Illinois')
);
Composite Partitioning
• Composite partitioning combines range and hash
partitioning.
• First distribute data into partitions according to
boundaries established by the partition ranges.
• Then use a hashing algorithm to further divide the data
into sub-partitions within each range partition
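A sketch of composite (range-hash) partitioning in Oracle syntax, reusing columns from the earlier examples; the partition bounds are illustrative.

CREATE TABLE sales_composite
 (salesman_id   NUMBER(5),
  salesman_name VARCHAR2(30),
  sales_amount  NUMBER(10),
  sales_date    DATE)
PARTITION BY RANGE (sales_date)
SUBPARTITION BY HASH (salesman_id) SUBPARTITIONS 4
 (PARTITION sales_q1 VALUES LESS THAN (TO_DATE('01-APR-2018', 'DD-MON-YYYY')),
  PARTITION sales_q2 VALUES LESS THAN (TO_DATE('01-JUL-2018', 'DD-MON-YYYY')));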
Materialized Views
• Materialized views are supported in most
RDBMSs
• Types of Materialized views supported*
– With aggregates
– With joins
– With both joins & aggregates
– Nested materialized views
Materialized Views - Aggregates
• CREATE MATERIALIZED VIEW LOG ON sales WITH SEQUENCE, ROWID
(prod_id, cust_id, time_id, channel_id, promo_id, quantity_sold, amount_sold)
INCLUDING NEW VALUES;
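A sketch of an aggregate materialized view that such a log would support for fast refresh; the view name and grouping column are assumed (the COUNT columns are needed for fast refresh of the SUM).

CREATE MATERIALIZED VIEW product_sales_mv
  BUILD IMMEDIATE
  REFRESH FAST ON COMMIT
  ENABLE QUERY REWRITE
AS
SELECT prod_id,
       SUM(amount_sold)   AS sum_amount,
       COUNT(amount_sold) AS cnt_amount,
       COUNT(*)           AS cnt_rows
FROM   sales
GROUP BY prod_id;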
Nested Materialized Views
• A nested materialized view is a materialized view whose
definition is based on another materialized view. A nested
materialized view can reference other relations in the
database in addition to referencing materialized views
• In a data warehouse, you typically create many aggregate
views on a single join (for example, rollups along different
dimensions).
• Incrementally maintaining these distinct materialized
aggregate views can take a long time, because the
underlying join has to be performed many times
Nested Materialized Views
• Using nested materialized views, you can create multiple
single-table materialized views based on a joins-only
materialized view and the join is performed just once.
• Nesting is possible only with materialized views
containing only joins or aggregates
• Materialized views with joins and aggregates are
implemented using nested materialized views
Bitmap Indexes
• Bitmap indexes are widely used in data warehousing
environments which typically have large amounts of data
and ad hoc queries
• For such applications, bitmap indexing provides:
– Reduced response time for large classes of ad hoc queries
– Reduced storage requirements compared to other indexing techniques
– Dramatic performance gains even on hardware with a relatively small
number of CPUs or a small amount of memory
• Fully indexing a large table with a traditional B-tree index
can be prohibitively expensive in terms of space because
the indexes can be several times larger than the data in
the table
• Bitmap indexes are typically only a fraction of the size of the indexed data in the table.
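A sketch of a single-table bitmap index in Oracle syntax on a typical low-cardinality column; the table and column names are assumed.

CREATE BITMAP INDEX cust_gender_bix
  ON customers (cust_gender);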
Bitmap Join Indexes
• In addition to a bitmap index on a single table, you can create
a bitmap join index (BJI), which is a bitmap index for the join of
two or more tables.
• A BJI is a space efficient way of reducing the volume of data
that must be joined by performing restrictions in advance.
• For each value in a column of a table, a bitmap join index
stores the rowids of corresponding rows in one or more other
tables.
Bitmap Join Indexes
• In a data warehousing environment, the join condition is an
equi-inner join between the primary key column or columns of
the dimension tables and the foreign key column or columns in
the fact table.
• Bitmap join indexes are much more efficient in storage than
materialized join views, an alternative for materializing joins in
advance.
• This is because the materialized join views do not compress
the rowids of the fact tables.
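A sketch of a bitmap join index in Oracle syntax: the fact table is indexed on a dimension attribute so that restrictions on that attribute avoid the join at query time (names are assumed, and customers.cust_id is taken to be the dimension table's primary key).

CREATE BITMAP INDEX sales_cust_state_bjix
  ON   sales (customers.cust_state)
  FROM sales, customers
  WHERE sales.cust_id = customers.cust_id;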
Dimensions
• A dimension is a structure that categorizes data in order to
enable users to answer business questions.
• Commonly used dimensions are customers, products, and
time.
• In Oracle Database, the dimensional information itself is stored
in a dimension table.
• In addition, the database object dimension helps to organize
and group dimensional information into hierarchies.
• This represents natural 1:n relationships between columns or
column groups (the levels of a hierarchy) that cannot be
represented with constraint conditions
Dimensions
• Dimensions do not have to be defined. However, if your
application uses dimensional modeling, it is worth spending
time creating them as it can yield significant benefits, because
they help query rewrite perform more complex types of
rewrites
• In spite of the benefits of dimensions, you must not create
dimensions in any schema that does not fully satisfy the
dimensional relationships
Dimensions
• Before you can create a dimension object, the dimension tables must exist in the database, possibly already containing the dimension data
• You create a dimension using the CREATE DIMENSION
statement
• For example, you can declare a dimension products_dim, which
contains levels product, subcategory, and category:
CREATE DIMENSION products_dim
LEVEL product IS (products.prod_id)
LEVEL subcategory IS (products.prod_subcategory)
LEVEL category IS (products.prod_category) ...
Dimensions
CREATE DIMENSION products_dim
LEVEL product IS (products.prod_id)
LEVEL subcategory IS (products.prod_subcategory)
LEVEL category IS (products.prod_category) ...
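The statement can be completed with a hierarchy that chains the levels together; the hierarchy and attribute names below are illustrative additions, not part of the original example:
CREATE DIMENSION products_dim
  LEVEL product     IS (products.prod_id)
  LEVEL subcategory IS (products.prod_subcategory)
  LEVEL category    IS (products.prod_category)
  -- 1:n rollup relationships: each product belongs to one subcategory,
  -- each subcategory belongs to one category
  HIERARCHY prod_rollup (
    product CHILD OF subcategory CHILD OF category
  )
  ATTRIBUTE product DETERMINES (products.prod_name);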
Sales cross-tab (state by year):

          WI     CA     Total
1995      63     81     144
1996      38     107    145
1997      75     35     110
Total     176    223    399

Rolled-up result (year, state, SUM(sales)):

1996   WI    38
1996   CA    107
1996   All   145
1997   WI    75
1997   CA    35
1997   All   110
All    All   399
Find out what the following SQL will generate:
SELECT T.year, L.state, SUM(sales)
FROM Sales S, Times T, Locations L
WHERE S.timeid = T.timeid AND S.locid = L.locid
GROUP BY ROLLUP (L.state, T.year)
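As a hint, ROLLUP (L.state, T.year) produces the regular (state, year) rows plus a subtotal per state and a grand total. In standard SQL the same result can be spelled out with grouping sets, as in this sketch:
SELECT T.year, L.state, SUM(sales)
FROM Sales S, Times T, Locations L
WHERE S.timeid = T.timeid AND S.locid = L.locid
GROUP BY GROUPING SETS ((L.state, T.year),  -- detailed rows
                        (L.state),          -- subtotal per state
                        ());                -- grand total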
Source:
An Insider’s View into the World’s Largest Data Warehouse
by Larry Freeman, NetApp
World’s Largest
Data Warehouse
• To achieve these impressive results, a data warehouse
environment was created by ingesting 3 PB per day of
synthetic data for four consecutive days—a feat that
required exceptional storage system performance and
reliability. For that, SAP turned to NetApp® SAN storage.
SAP HANA: In-Memory Database
• Companies are always trying to find the best way to store
data in a meaningful format so that they can make better
business decisions
• Since the birth of data warehousing almost 30 years ago,
numerous innovations in data management have been
made, such as Hadoop and NoSQL
• HANA (High-Performance Analytic Appliance)
• A platform for processing high volumes of operational and
transactional data in real time
All material related to HANA taken from:
SAP HANA: In-Memory Database
SAP HANA: In-Memory Database
[Diagram slides covering in-memory computing, storing data, and using data]
NetApp SAN Storage
• Deployment of the storage hardware (weighing nearly two tons!) occurred over several …
• NetApp E-Series storage was deployed for the majority of the
data, to the tune of a total of 5.4 petabytes of physical storage
capacity spread across 20 E5460 storage arrays and 1,800 three
terabyte NL-SAS disk drives
• The record of 12.1 petabytes of total addressable capacity was
achieved thanks in large part to an SAP in-system data
compression rate of 85% for the 50/50 mix of structured and
unstructured data.
• NetApp E-Series storage was selected for the project because of its
proven 99.999% availability and its ability to handle the project’s
data ingest requirement of 34.3 TB per hour
• The E-Series SAN used a Fibre Channel fabric to support SAP IQ’s
large and varied data needs
Source:
An Insider’s View into the World’s Largest Data Warehouse
by Larry Freeman, NetApp
SAP IQ
• A highly optimized RDBMS built for extreme-scale Big
Data analytics and warehousing
• Developed by Sybase, now an SAP company
• SAP IQ holds the Guinness World Record for fastest
loading and indexing of Big Data
• IQ is a column-based, petabyte scale, relational
database software system used for business intelligence,
data warehousing, and data marts
• Its primary function is to analyze large amounts of data
in a low-cost, highly available environment
• SAP IQ is often credited with pioneering the
commercialization of column-store technology
A word about HIVE
• A data warehouse infrastructure built on top of Hadoop
• Enables easy data summarization, ad hoc querying, and analysis of large datasets stored in Hadoop files
• Provides a mechanism to impose structure on this data
• A simple query language called HiveQL (based on SQL) enables users familiar with SQL to query this data (see the sketch below)
• HiveQL also allows traditional MapReduce programmers to plug in their custom mappers and reducers for more sophisticated analysis that may not be supported by the built-in capabilities of the language
A word about HIVE
HIVE provides:
• Tools to enable easy data extract/transform/load
(ETL)
• A mechanism to impose structure on a variety of
data formats
• Access to files stored either directly in Apache
HDFS or in other data storage systems such as
Apache HBase
• Query execution via MapReduce
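A small HiveQL sketch of these ideas; the HDFS location, table, and column names are assumptions for illustration:
-- Impose structure on raw tab-separated files already sitting in HDFS
CREATE EXTERNAL TABLE page_views (
  view_time STRING,
  user_id   STRING,
  page_url  STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/page_views';

-- Ad hoc summarization; Hive compiles the query into MapReduce jobs
SELECT page_url, COUNT(*) AS views
FROM page_views
GROUP BY page_url
ORDER BY views DESC
LIMIT 10;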
Thank You
BIG Data Analytics:
An Overview
Navneet Goyal
Department of Computer Science
BITS, Pilani (Pilani Campus)
Topics
• Big Data Analytics
• Extended RDBMS Architecture
• MapReduce/Hadoop
BIG Data Analytics
It is like a mountain of data that you have to climb.
The height of the mountain keeps increasing every day,
and you have to climb it in less time than yesterday!
Big data refers to data sets whose size is beyond the ability of
typical database software tools to capture, store, manage and
analyze. - The McKinsey Global Institute, 2011
BIG DATA
o BIG DATA poses a big challenge to our capabilities
o Data is scaling faster than compute resources
o CPU speeds are not increasing either
o At the same time, BIG DATA offers a BIGGER opportunity
o Opportunity to
o Understand nature
o Understand evolution
o Understand human behavior/psychology/physiology
o Understand stock markets
o Understand road and network traffic
o …
o Opportunity for us to be more and more innovative!!
• Just a Hype?
• Or a real Challenge?
• Or a great Opportunity?
• Challenge in terms of how to manage & use this data
• Opportunity in terms of what we can do with this data to
enrich the lives of everybody around us and to make our
mother Earth a better place to live
• We are living in a world of DATA!!
• We are (partly) and will be (fully) driven by DATA
[Diagram: MAP provides compute parallelism; REDUCE compiles and aggregates the partial results]
MapReduce
• A programming model & its associated implementation
• MapReduce is a patented software framework
introduced by Google to support distributed computing
on large data sets on clusters of computers (wiki defn.)
• Inspired by the map & reduce functions of functional
programming (in a tweaked form)
MapReduce
• MapReduce is a framework for processing huge
datasets on certain kinds of distributable problems
using a large number of computers
– cluster (if all nodes use the same hardware) or as
– grid (if the nodes use different hardware)
(wiki defn.)
• Computational processing can occur on data stored
either in a filesystem (unstructured) or within a
database (structured).
MapReduce
• Programmer need not worry about:
– Communication between nodes
– Division & scheduling of work
– Fault tolerance
– Monitoring & reporting
• Map Reduce handles and hides all these dirty
details
• Provides a clean abstraction for programmer
Composable Systems
• Processing can be split into smaller computations and
the partial results merged after some post-processing to
give the final result
• MapReduce can be applied to this class of scientific
applications that exhibit composable property.
• We only need to worry about mapping a particular
algorithm to Map & Reduce
• If you can do that, with a little bit of high level
programming, you are through!
• SPMD algorithms
• Data Parallel Problems
SPMD Algorithms
• SPMD (Single Program, Multiple Data) is a technique to achieve parallelism.
• Tasks are split up and run on multiple processors
simultaneously with different input data
• The robustness provided by MapReduce
implementations is an important feature for
selecting this technology for such SPMD algorithms.
Hadoop
• MapReduce isn't available outside Google!
• Hadoop/HDFS is an open source implementation of
MapReduce/GFS
• Hadoop is a top-level Apache project being built and
used by a global community of contributors, using Java
• Yahoo! has been the largest contributor to the project,
and uses Hadoop extensively across its businesses
Hadoop & Facebook
– FB uses Hadoop to store copies of internal log and dimension data sources, and uses it as a source for reporting/analytics and machine learning.
– Currently FB has 2 major clusters:
• A 1100-machine cluster with 8800 cores and about 12 PB
raw storage.
• A 300-machine cluster with 2400 cores and about 3 PB raw
storage.
• Each (commodity) node has 8 cores and 12 TB of storage.
• FB has built a higher level data warehousing framework
using these features called Hive
http://hadoop.apache.org/hive/
• FB has also developed a FUSE implementation over HDFS.
– First company to abandon RDBMS and adopt Hadoop for
implementation of a DW
Hadoop & Yahoo
• YAHOO
– More than 100,000 CPUs in >36,000 computers running
Hadoop
– Biggest cluster: 4,000 nodes (2 × 4-CPU boxes with 4 × 1 TB disks and 16 GB RAM)
• Used to support research for Ad Systems and Web Search
• Also used to do scaling tests to support development of
Hadoop on larger clusters
HDFS
• Data is organized into files & directories
• Files are divided into uniform-sized blocks (64 MB by default) and distributed across cluster nodes
• Blocks are replicated (3 by default) to handle hardware failure
• Replication for performance & fault tolerance
• Checksum for data corruption detection & recovery
Word Count using Map Reduce
• Mapper
– Input: <key: offset, value: line of a document>
– Output: for each word w in the input line, output <key: w, value: 1>
Input: (The quick brown fox jumps over the lazy dog.)
Output: <the, 1>, <quick, 1>, <brown, 1>, … <fox, 1>, … <the, 1>, …
• Reducer
– Input: <key: word, value: list<integer>>
– Output: sum all the values in the input list for the given key and output <key: word, value: count>
Input: <the, [1, 1, 1, 1, 1]>, <fox, [1, 1, 1]> …
Output: (the, 5)
(fox, 3)
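The same computation can be sketched in HiveQL, which Hive compiles into a MapReduce job of exactly this shape; the docs table with a single line column is an assumption:
-- Hypothetical input table: one line of text per row
-- split() tokenizes each line, explode() emits one row per word (the map side),
-- and GROUP BY sums the occurrences per word (the reduce side)
SELECT word, COUNT(1) AS cnt
FROM docs
LATERAL VIEW explode(split(line, '\\s+')) words AS word
GROUP BY word;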
Word count
• Input
• Map
• Shuffle
• Reduce
• Output
http://www.slideshare.net/hadoop/practical-problem-solving-with-apache-hadoop-pig
By Milind Bhandarkar
Map Reduce Architecture
• Map Phase
– Map tasks run in parallel – output intermediate key value
pairs
• Shuffle and sort phase
– Map task output is partitioned by hashing the output key
– Number of partitions is equal to number of reducers
– Partitioning ensures all key/value pairs sharing same key
belong to same partition
– The map output partition is sorted by key to group all
values for the same key
• Reduce Phase
– Each partition is assigned to one reducer.
– Reducers also run in parallel.
– No two reducers process the same intermediate key
– Reducer gets all values for a given key at the same time
Applications in Data Warehousing
• Aggregate queries
– Product-wise sales for the year 2010
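For example, this query could be written against the sales fact table introduced earlier; the times dimension and its calendar_year column are assumed here:
-- Product-wise sales for the year 2010
SELECT s.prod_id, SUM(s.amount_sold) AS total_sales
FROM sales s, times t
WHERE s.time_id = t.time_id
  AND t.calendar_year = 2010
GROUP BY s.prod_id;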
Strengths of MapReduce
• Provides highest level of abstraction!
(to date)
• Learning curve – manageable
• Highly scalable
• Highly fault tolerant
• Economical!!
Q&A