
An Overview of Testing and Implementati

Course: 0604M- Testing & Implementation


Course: Datawarehouse
OLAP
On-Line Analytical Processing (OLAP)
The use of a set of graphical tools that provides users
with multidimensional views of their data and allows
them to analyze the data using simple windowing
techniques
Relational OLAP (ROLAP)
Traditional relational representation
Multidimensional OLAP (MOLAP)
Cube structure
OLAP Operations
Cube slicing: come up with a 2-D view of the data
Drill-down: going from summary to more detailed views
Roll-up: going from detailed to summary views
Pivot: present the data in other views
Datawarehouse_09/2013
The Multi-Dimensional Data Model
Sales by product line over the past six months
Sales by store between 1990 and 1995
Fact table for measures: Prod Code | Time Code | Store Code | Sales | Qty
Key columns joining fact table to dimension tables: Prod Code, Time Code, Store Code
Numerical Measures: Sales, Qty
Dimension tables: Product Info, Time Info, Store Info, . . .
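A minimal sketch of the fact/dimension layout above, using Python's built-in sqlite3 module. The table names (sales_fact, product_dim, store_dim), the product_line and city attributes, and every row are hypothetical; they only illustrate how the key columns join the fact table to a dimension table to answer a query such as "sales by product line".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables: descriptive attributes (hypothetical columns).
    CREATE TABLE product_dim (prod_code TEXT PRIMARY KEY, product_line TEXT);
    CREATE TABLE store_dim   (store_code TEXT PRIMARY KEY, city TEXT);
    -- Fact table: key columns plus numerical measures.
    CREATE TABLE sales_fact (
        prod_code  TEXT REFERENCES product_dim(prod_code),
        time_code  INTEGER,
        store_code TEXT REFERENCES store_dim(store_code),
        sales      REAL,
        qty        INTEGER
    );
    INSERT INTO product_dim VALUES ('P1', 'Kitchen'), ('P2', 'Kitchen'), ('P3', 'Audio');
    INSERT INTO store_dim   VALUES ('S1', 'New York'), ('S2', 'Boston');
    INSERT INTO sales_fact VALUES
        ('P1', 1, 'S1', 100.0, 4),
        ('P2', 1, 'S2',  50.0, 2),
        ('P3', 2, 'S1',  80.0, 1),
        ('P1', 2, 'S2',  25.0, 1);
""")

# "Sales by product line": join the fact table to the product dimension
# on its key column, then aggregate the measure.
sales_by_line = conn.execute("""
    SELECT p.product_line, SUM(f.sales)
    FROM sales_fact AS f
    JOIN product_dim AS p ON p.prod_code = f.prod_code
    GROUP BY p.product_line
    ORDER BY p.product_line
""").fetchall()
print(sales_by_line)  # [('Audio', 80.0), ('Kitchen', 175.0)]
```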
OLAP operations: Roll-up & Drill-down
Data warehouse cube with dimensions Time and Product
Product hierarchy:
Category, e.g. Electrical Appliance
Sub Category, e.g. Kitchen
Product, e.g. Toaster
Drill down: Category -> Sub Category -> Product
Roll up: Product -> Sub Category -> Category
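Rolling up and drilling down along the Category / Sub Category / Product hierarchy can be sketched as grouping on a shorter or longer prefix of the hierarchy key. The sales figures, and the Kettle, Laundry and Iron entries, are made up for illustration; only the Toaster path comes from the slide.

```python
from collections import defaultdict

# Hypothetical sales per (category, sub_category, product); figures are made up.
sales = {
    ("Electrical Appliance", "Kitchen", "Toaster"): 30,
    ("Electrical Appliance", "Kitchen", "Kettle"): 20,
    ("Electrical Appliance", "Laundry", "Iron"): 15,
}

def roll_up(cells, depth):
    """Sum the measure over the first `depth` levels of the hierarchy key."""
    totals = defaultdict(int)
    for key, amount in cells.items():
        totals[key[:depth]] += amount
    return dict(totals)

by_category = roll_up(sales, 1)      # most summarized view
by_sub_category = roll_up(sales, 2)  # drill down one level
by_product = roll_up(sales, 3)       # drill down to the product level

print(by_category)      # {('Electrical Appliance',): 65}
print(by_sub_category)  # Kitchen: 50, Laundry: 15
```

Drill-down is the same call with a larger depth, so both operations read from the same detailed cells.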
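OLAP operations: Slicing & Pivot
Data warehouse cube with dimensions Time, Product, Region
Slicing: fix Product=Toaster to obtain a 2-D view over Time and Region
Pivot: rotate the axes to present the same data in another view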
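As a rough illustration of slicing and pivoting, the sketch below stores cube cells in a dictionary keyed by (product, time, region). All values, the quarter labels and the region names are hypothetical.

```python
# Hypothetical cube cells keyed by (product, time, region); values are made up.
cube = {
    ("Toaster", "Q1", "East"): 10,
    ("Toaster", "Q1", "West"): 7,
    ("Toaster", "Q2", "East"): 12,
    ("Kettle", "Q1", "East"): 5,
}

def slice_cube(cells, product):
    """Fix the Product dimension; return a 2-D view keyed by (time, region)."""
    return {(t, r): v for (p, t, r), v in cells.items() if p == product}

def pivot(view):
    """Swap the two axes of a 2-D view: (time, region) -> (region, time)."""
    return {(b, a): v for (a, b), v in view.items()}

toaster = slice_cube(cube, "Toaster")
toaster_by_region = pivot(toaster)

print(toaster)            # keys are (time, region)
print(toaster_by_region)  # same values, keys are (region, time)
```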
Approaches to OLAP Servers
Three possibilities for OLAP servers
(1) Relational OLAP (ROLAP)
Relational and specialized relational DBMS to store and
manage warehouse data
OLAP middleware to support missing pieces
(2) Multidimensional OLAP (MOLAP)
Array-based storage structures
Direct access to array data structures
(3) Hybrid OLAP (HOLAP)
Storing detailed data in RDBMS
Storing aggregated data in MDBMS
User access via MOLAP tools
Points to be noticed about ROLAP
Defines complex, multi-dimensional data with a simple
model
Reduces the number of joins a query has to process
Allows the data warehouse to evolve with relatively low
maintenance
Can contain both detailed and summarized data.
ROLAP is based on familiar, proven, and already selected
technologies.
BUT!!!
SQL is poorly suited to multi-dimensional manipulation and calculations.
ROLAP: Dimensional Modeling Using
Relational DBMS
Special schema design: star, snowflake
Special indexes: bitmap, multi-table join
Proven technology (relational model, DBMS), tends to
outperform specialized MDDBs, especially on large data
sets
Products
IBM DB2, Oracle, Sybase IQ, RedBrick, Informix
ROLAP example
Fact table view (table sale):
prodId   storeId   date   amt
p1       s1        1      12
p2       s1        1      11
p1       s3        1      50
p2       s2        1      8
p1       s1        2      44
p1       s2        2      4
Select sum(amt)
From sale
Where date=1
Result: 81
Select date, sum(amt)
From sale
Group by date
Result:
date   amt
1      81
2      48
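The two queries of the ROLAP example can be checked against the slide's fact table with Python's built-in sqlite3 module. This is only a sketch: the column types and the ORDER BY clause (added to make the row order deterministic) are assumptions, not part of the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sale (prodId TEXT, storeId TEXT, date INTEGER, amt INTEGER)")
conn.executemany(
    "INSERT INTO sale VALUES (?, ?, ?, ?)",
    [
        ("p1", "s1", 1, 12),
        ("p2", "s1", 1, 11),
        ("p1", "s3", 1, 50),
        ("p2", "s2", 1, 8),
        ("p1", "s1", 2, 44),
        ("p1", "s2", 2, 4),
    ],
)

# First query: total amount sold on day 1.
total_day1 = conn.execute("SELECT sum(amt) FROM sale WHERE date = 1").fetchone()[0]

# Second query: total amount per day.
by_date = conn.execute(
    "SELECT date, sum(amt) FROM sale GROUP BY date ORDER BY date"
).fetchall()

print(total_day1)  # 81
print(by_date)     # [(1, 81), (2, 48)]
```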
MOLAP: Dimensional Modeling Using the
Multi Dimensional Model
MDDB: a special-purpose data model
Facts stored in multi-dimensional arrays
Dimensions used to index array
Sometimes on top of relational DB
Products
Pilot, Arbor Essbase, Gentia
The MOLAP Cube: 2 dimensions
Fact table view (table sale):
prodId   storeId   amt
p1       s1        12
p2       s1        11
p1       s3        50
p2       s2        8
Multi-dimensional cube:
      s1   s2   s3
p1    12        50
p2    11   8
Select prodId, storeId, sum(amt)
From sale
Group by prodId, storeId
Result:
prodId   storeId   amt
p1       s1        12
p1       s3        50
p2       s1        11
p2       s2        8
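The step from fact table to cube can be sketched with a dense two-dimensional array in which the dimension values act as indexes. The use of None for empty cells is an assumption made for this sketch; real MDDBs have their own ways of representing sparsity.

```python
# Fact rows for the 2-D example: (prodId, storeId, amt).
facts = [("p1", "s1", 12), ("p2", "s1", 11), ("p1", "s3", 50), ("p2", "s2", 8)]

products = ["p1", "p2"]
stores = ["s1", "s2", "s3"]

# Dimension values index directly into a dense array; None marks an empty cell.
cube = [[None] * len(stores) for _ in products]
for prod, store, amt in facts:
    i, j = products.index(prod), stores.index(store)
    cube[i][j] = (cube[i][j] or 0) + amt

print(cube)  # [[12, None, 50], [11, 8, None]]

# Direct access to a cell by its coordinates, with no join or scan.
p1_s3 = cube[products.index("p1")][stores.index("s3")]
print(p1_s3)  # 50
```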
3-D Cube: 3 dimensions
Fact table view (table sale):
prodId   storeId   date   amt
p1       s1        1      12
p2       s1        1      11
p1       s3        1      50
p2       s2        1      8
p1       s1        2      44
p1       s2        2      4
Multi-dimensional cube:
day 2   s1   s2   s3
p1      44   4
p2
day 1   s1   s2   s3
p1      12        50
p2      11   8
Select date, prodId, storeId, sum(amt)
From sale
Group by date, prodId, storeId
Result:
date   prodId   storeId   amt
1      p1       s1        12
1      p1       s3        50
1      p2       s1        11
1      p2       s2        8
2      p1       s1        44
2      p1       s2        4
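A sparse variant of the same idea for three dimensions: the sketch below keeps one cell per (date, prodId, storeId) coordinate, which mirrors the GROUP BY above, and extracts one day's plane from it. The helper name plane is hypothetical.

```python
from collections import defaultdict

# Fact rows: (prodId, storeId, date, amt).
facts = [
    ("p1", "s1", 1, 12),
    ("p2", "s1", 1, 11),
    ("p1", "s3", 1, 50),
    ("p2", "s2", 1, 8),
    ("p1", "s1", 2, 44),
    ("p1", "s2", 2, 4),
]

# Sparse 3-D cube: one cell per (date, prodId, storeId) coordinate,
# equivalent to GROUP BY date, prodId, storeId.
cube = defaultdict(int)
for prod, store, date, amt in facts:
    cube[(date, prod, store)] += amt

def plane(cells, date):
    """Return the 2-D (prodId, storeId) plane of the cube for one day."""
    return {(p, s): v for (d, p, s), v in cells.items() if d == date}

day1 = plane(cube, 1)
day2 = plane(cube, 2)
print(day2)  # {('p1', 's1'): 44, ('p1', 's2'): 4}
```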
Slicing a Data cube
Dimensional Modeling
Data cube
A two-dimensional, three-dimensional, or higher-dimensional
object in which each dimension of the data represents a
measure of interest
- Grain
- Drill-down
- Slicing
Example of Drill-Down
MDDB (Multi Dimensional DataBase)
Stores: New York
         Hats   Coats   Jackets   Total
Jan      200    550     350       1100
Feb      210    480     390       1080
Mar      190    480     380       1050
Total    600    1510    1120      3230
Product: Hats
            Jan   Feb   Mar   Total
New York    200   210   190   600
Boston      20    175   125   320
San Jose    110   210   125   445
Total       330   595   440   1365
Months: January
           New York   Boston   San Jose   Total
Hats       200        20       110        330
Coats      550        435      275        1260
Jackets    350        220      125        695
Total      1100       675      510        2285
Example
[Figure: cube with axes Product (Juice, Milk, Coke, Cream, Soap, Bread), Store (NY, SF, LA) and Time (M T W Th F S S)]
Example cell: 56 units of bread sold in LA on M
Dimensions:
Time, Product, Store
Attributes:
Product (upc, price, ...)
Store
Hierarchies:
Product -> Brand (roll-up to brand)
Day -> Week -> Quarter (roll-up to week)
Store -> Region -> Country (roll-up to region)
Cube Aggregation: Roll-up
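Example: computing sums
day 2   s1   s2   s3
p1      44   4
p2
day 1   s1   s2   s3
p1      12        50
p2      11   8
. . .
Summed over days:
        s1   s2   s3
p1      56   4    50
p2      11   8
Summed over days and products:
        s1   s2   s3
sum     67   12   50
Summed over days and stores:
        sum
p1      110
p2      19
Summed over all dimensions: 129
Rollup moves toward the summaries; drill-down moves back toward the detailed cubes.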
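The sums on this slide can be reproduced with a small roll-up helper that keeps some dimensions and sums over the rest. The function name roll_up and the positional encoding of dimensions are choices made for this sketch only.

```python
from collections import defaultdict

# Fact rows: (prodId, storeId, date, amt).
facts = [
    ("p1", "s1", 1, 12),
    ("p2", "s1", 1, 11),
    ("p1", "s3", 1, 50),
    ("p2", "s2", 1, 8),
    ("p1", "s1", 2, 44),
    ("p1", "s2", 2, 4),
]

def roll_up(rows, keep):
    """Sum amt over every dimension not listed in `keep`.

    `keep` holds positions in the row: 0 = prodId, 1 = storeId, 2 = date.
    """
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[i] for i in keep)] += row[3]
    return dict(totals)

by_prod_store = roll_up(facts, (0, 1))  # summed over days
by_store = roll_up(facts, (1,))         # summed over days and products
by_prod = roll_up(facts, (0,))          # summed over days and stores
grand_total = roll_up(facts, ())        # summed over everything

print(by_store)     # {('s1',): 67, ('s3',): 50, ('s2',): 12}
print(by_prod)      # {('p1',): 110, ('p2',): 19}
print(grand_total)  # {(): 129}
```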
Cube Operators for Roll-up
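day 2   s1   s2   s3
p1      44   4
p2
day 1   s1   s2   s3
p1      12        50
p2      11   8
. . .
Summed over days:
        s1   s2   s3
p1      56   4    50
p2      11   8
Summed over days and products:
        s1   s2   s3
sum     67   12   50
Summed over days and stores:
        sum
p1      110
p2      19
Cube operators:
sale(s1,*,*) = 67
sale(s2,p2,*) = 8
sale(*,*,*) = 129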
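One way to read the sale(store, product, date) operator is as a sum over all facts that match each argument, with "*" matching any value. The sketch below computes the answer on demand from the fact rows; a MOLAP server would normally look it up from pre-aggregated cells instead.

```python
# Fact rows: (prodId, storeId, date, amt).
facts = [
    ("p1", "s1", 1, 12),
    ("p2", "s1", 1, 11),
    ("p1", "s3", 1, 50),
    ("p2", "s2", 1, 8),
    ("p1", "s1", 2, 44),
    ("p1", "s2", 2, 4),
]

ALL = "*"

def sale(store=ALL, prod=ALL, date=ALL):
    """Sum amt over facts matching each argument; "*" matches any value."""
    total = 0
    for f_prod, f_store, f_date, amt in facts:
        if store not in (ALL, f_store):
            continue
        if prod not in (ALL, f_prod):
            continue
        if date not in (ALL, f_date):
            continue
        total += amt
    return total

print(sale("s1", "*", "*"))   # 67
print(sale("s2", "p2", "*"))  # 8
print(sale("*", "*", "*"))    # 129
```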
Extended Cube
all days (*)   s1   s2   s3   *
p1             56   4    50   110
p2             11   8         19
*              67   12   50   129
day 2          s1   s2   s3   *
p1             44   4         48
p2
*              44   4         48
day 1          s1   s2   s3   *
p1             12        50   62
p2             11   8         19
*              23   8    50   81
sale(*,p2,*) = 19
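The extended cube can be sketched by letting every fact contribute to each cell obtained by replacing any subset of its coordinates with "*". Keys follow the operator's argument order (store, product, date); this layout is an assumption for illustration, not how any particular MDDB stores its aggregates.

```python
from collections import defaultdict
from itertools import product

# Fact rows: (prodId, storeId, date, amt).
facts = [
    ("p1", "s1", 1, 12),
    ("p2", "s1", 1, 11),
    ("p1", "s3", 1, 50),
    ("p2", "s2", 1, 8),
    ("p1", "s1", 2, 44),
    ("p1", "s2", 2, 4),
]

ALL = "*"

# Each fact contributes to 2**3 = 8 cells: every combination of keeping
# a coordinate or replacing it with "*".
extended = defaultdict(int)
for prod, store, date, amt in facts:
    for key in product((store, ALL), (prod, ALL), (date, ALL)):
        extended[key] += amt

# Aggregates are now plain lookups.
print(extended[("*", "p2", "*")])  # 19
print(extended[("*", "*", 1)])     # 81
print(extended[("*", "p1", 2)])    # 48
print(extended[("*", "*", "*")])   # 129
```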
Aggregation Using Hierarchies
Hierarchy: store -> region -> country
day 2   s1   s2   s3
p1      44   4
p2
day 1   s1   s2   s3
p1      12        50
p2      11   8
Roll-up from store to region (summed over days):
        region A   region B
p1      56         54
p2      11         8
(store s1 in Region A;
stores s2, s3 in Region B)
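Aggregating along the store -> region hierarchy amounts to mapping each store to its region before summing. The mapping below is the one stated on the slide; the variable names are hypothetical.

```python
from collections import defaultdict

# Fact rows: (prodId, storeId, date, amt).
facts = [
    ("p1", "s1", 1, 12),
    ("p2", "s1", 1, 11),
    ("p1", "s3", 1, 50),
    ("p2", "s2", 1, 8),
    ("p1", "s1", 2, 44),
    ("p1", "s2", 2, 4),
]

# Store -> region mapping as stated on the slide.
region_of = {"s1": "region A", "s2": "region B", "s3": "region B"}

# Roll up from store to region, summing over days.
by_prod_region = defaultdict(int)
for prod, store, date, amt in facts:
    by_prod_region[(prod, region_of[store])] += amt

print(dict(by_prod_region))
# p1: region A 56, region B 54; p2: region A 11, region B 8
```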
Points to be noticed about MOLAP
Pre-calculating or pre-consolidating transactional data improves
speed.
BUT
When incoming data is fully pre-consolidated, MDDs require an enormous
amount of overhead in both processing time and storage. An
input file of 200MB can easily expand to 5GB.
MDDs are great candidates for the <50GB department data marts.
Rolling up and Drilling down through aggregate data.
With MDDs, application design is essentially the definition of
dimensions and calculation rules, while the RDBMS requires that
the database schema be a star or snowflake.
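One contributor to the storage overhead of full pre-consolidation can be sketched by counting potential cells. The figures below use the small sale example (3 stores, 2 products, 2 days, 6 fact rows); the function names are hypothetical, and real MDD products compress sparse cubes in ways this count ignores.

```python
from math import prod

def base_cells(cardinalities):
    """Cells in the base cube: one per combination of dimension values."""
    return prod(cardinalities)

def extended_cells(cardinalities):
    """Cells in the extended cube: each dimension gains one "*" (all) member."""
    return prod(n + 1 for n in cardinalities)

# 3 stores, 2 products, 2 days, as in the sale example.
dims = (3, 2, 2)
fact_rows = 6

print(base_cells(dims))      # 12 potential cells for 6 fact rows
print(extended_cells(dims))  # 36 cells once all aggregates are stored
```

Hierarchies (store -> region -> country, and so on) add further members per dimension, so the ratio between stored cells and input rows grows quickly with the number of dimensions and levels.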
Hybrid OLAP (HOLAP)
HOLAP = Hybrid OLAP:
Best of both worlds
Storing detailed data in RDBMS
Storing aggregated data in MDBMS
User access via MOLAP tools
Data Flow in HOLAP
[Figure: data flow in HOLAP]
Viewers: Multidimensional Viewer, Relational Viewer
Components: Client, MDBMS Server, RDBMS Server
Data: multi-dimensional data, user data, meta data, derived data
Access paths: multi-dimensional access, SQL-Read, SQL-Reach Through
When deciding which technology to go
for, consider:
1) Performance:
How fast will the system appear to the end-user?
MDD server vendors believe this is a key point in their favor.
2) Data volume and scalability:
While MDD servers can handle up to 50GB of storage, RDBMS
servers can handle hundreds of gigabytes and terabytes.
An experiment with Relational and the
Multidimensional models on a data set
The analysis of the author's example illustrates the following differences between
the best Relational alternative and the Multidimensional approach.
Measure | Relational | Multi-dimensional | Improvement
Disk space requirement (Gigabytes) | 17 | 10 | 1.7
Retrieve the corporate measures Actual Vs Budget, by month (I/Os) | 240 | 1 | 240
Calculation of Variance Budget/Actual for the whole database (I/O time in hours) | 237 | 2* | 110*
* This may include the calculation of many other derived data without any
additional I/O.
Reference: http://dimlab.usc.edu/csci599/Fall2002/paper/I2_P064.pdf
What-if analysis
IF A. You require write access
B. Your data is under 50 GB
C. Your timetable to implement is 60-90 days
D. Lowest level already aggregated
E. Data access on aggregated level
F. You're developing a general-purpose application for inventory movement or
assets management
THEN Consider an MDD/MOLAP solution for your data mart
IF A. Your data is over 100 GB
B. You have a "read-only" requirement
C. Historical data at the lowest level of granularity
D. Detailed access, long-running queries
E. Data assigned to lowest level elements
THEN Consider an RDBMS/ROLAP solution for your data mart.
IF A. OLAP on aggregated and detailed data
B. Different user groups
C. Ease of use and detailed data
THEN Consider a HOLAP solution for your data mart
Examples
ROLAP
Telecommunication startup: call data records (CDRs)
ECommerce Site
Credit Card Company
MOLAP
Analysis and budgeting in a financial department
Sales analysis
HOLAP
Sales department of a multi-national company
Banks and Financial Service Providers
Tools available
ROLAP:
ORACLE 8i
ORACLE Reports; ORACLE Discoverer
ORACLE Warehouse Builder
Arbor Software's Essbase
MOLAP:
ORACLE Express Server
ORACLE Express Clients (C/S and Web)
MicroStrategy's DSS server
Platinum Technology's Platinum InfoBeacon
HOLAP:
ORACLE 8i
ORACLE Express Server
ORACLE Relational Access Manager
ORACLE Express Clients (C/S and Web)
Conclusion
ROLAP: RDBMS -> star/snowflake schema
MOLAP: MDD -> Cube structures
ROLAP or MOLAP: The data models used play a major role in performance
differences
MOLAP: for summarized and relatively smaller volumes of data (10-50GB)
ROLAP: for detailed and larger volumes of data
Both storage methods have strengths and weaknesses
The choice is requirement specific, though currently data warehouses are
predominantly built using RDBMSs/ROLAP.
Decision Support
Used to manage and control business
Data is historical or point-in-time
Optimized for inquiry rather than update
Use of the system is loosely defined and can be ad hoc
Used by managers and end-users to understand the
business and make judgements
Data Mining works with Warehouse Data
Data Warehousing provides the Enterprise with a memory
Data Mining provides the Enterprise with intelligence
We want to know ...
Given a database of 100,000 names, which persons are the least likely to
default on their credit cards?
Which types of transactions are likely to be fraudulent given the
demographics and transactional history of a particular customer?
If I raise the price of my product by Rs. 2, what is the effect on my ROI?
If I offer only 2,500 airline miles as an incentive to purchase rather than
5,000, how many lost responses will result?
If I emphasize ease-of-use of the product as opposed to its technical
capabilities, what will be the net effect on my revenues?
Which of my customers are likely to be the most loyal?
Data Mining helps extract such information
Application Areas
Industry                 Application
Finance                  Credit Card Analysis
Insurance                Claims, Fraud Analysis
Telecommunication        Call record analysis
Transport                Logistics management
Consumer goods           Promotion analysis
Data Service providers   Value added data
Utilities                Power usage analysis
Data Mining in Use
The US Government uses Data Mining to track fraud
A Supermarket becomes an information broker
Basketball teams use it to track game strategy
Cross Selling
Warranty Claims Routing
Holding on to Good Customers
Weeding out Bad Customers
What makes data mining possible?
Advances in the following areas are making data mining
deployable:
data warehousing
better and more data (i.e., operational, behavioral, and demographic)
the emergence of easily deployed data mining tools and
the advent of new data mining techniques.
-- Gartner Group
Why Separate Data Warehouse?
Performance
Op dbs designed & tuned for known txs & workloads.
Complex OLAP queries would degrade perf. for op txs.
Special data organization, access & implementation methods needed for
multidimensional views & queries.
Function
Missing data: Decision support requires historical data, which
op dbs do not typically maintain.
Data consolidation: Decision support requires consolidation
(aggregation, summarization) of data from many
heterogeneous sources: op dbs, external sources.
Data quality: Different sources typically use inconsistent data
representations, codes, and formats which have to be
reconciled.
Wal*Mart Case Study
Founded by Sam Walton
One of the largest Super Market Chains in the US
Wal*Mart: 2000+ Retail Stores
SAM's Clubs: 100+ Wholesalers Stores
This case study is from Felipe Carino's (NCR Teradata)
presentation made at the Stanford Database Seminar
Old Retail Paradigm
Wal*Mart
Inventory Management
Merchandise Accounts Payable
Purchasing
Supplier Promotions: National, Region, Store Level
Suppliers
Accept Orders
Promote Products
Provide special Incentives
Monitor and Track The Incentives
Bill and Collect Receivables
Estimate Retailer Demands
New (Just-In-Time) Retail Paradigm
No more deals
Shelf-Pass Through (POS Application)
One Unit Price
Suppliers paid once a week on ACTUAL items sold
Wal*Mart Manager
Daily Inventory Restock
Suppliers (sometimes SameDay) ship to Wal*Mart
Warehouse-Pass Through
Stock some Large Items
Delivery may come from supplier
Distribution Center
Suppliers' merchandise unloaded directly onto Wal*Mart Trucks
Wal*Mart System
NCR 5100M 96 Nodes; 24 TB Raw Disk; 700 - 1000 Pentium CPUs
Number of Rows: > 5 Billion
Historical Data: 65 weeks (5 Quarters)
New Daily Volume: Current Apps: 75 Million; New Apps: 100 Million +
Number of Users: Thousands
Number of Queries: 60,000 per week