Presented By:
NIKHIL DEBBARMA
M. Tech (2nd Semester)
CSE, NITA
Outline
- What is Data Warehousing?
- Purpose of Data Warehousing
- Introduction, Definitions, and Terminology
- Comparison with Traditional Databases
- Characteristics of Data Warehouses
- Classification of Data Warehouses
- Multi-dimensional Schemas
- Building a Data Warehouse
- Functionality of a Data Warehouse
- Warehouse vs. Data Views
- Implementation difficulties and open issues
What is Data Warehousing?

A process of transforming data into information and making it available to users in a timely enough manner to make a difference.

[Forrester Research, April 1996]
Data Warehouse

- A technique for assembling and managing data from various sources for the purpose of answering business questions, thus enabling decisions that were not previously possible.
- A decision support database maintained separately from the organization's operational database.
Data Warehouse

- A data warehouse is a
  - subject-oriented
  - integrated
  - time-variant
  - non-volatile
  collection of data that is used primarily in organizational decision making.
  -- Bill Inmon, Building the Data Warehouse, 1996
Data Warehouse
- Subject-oriented: data that gives information about a particular subject instead of about a company's ongoing operations.
- Data is arranged and optimized to provide answers to questions from diverse functional areas.
  - Data is organized and summarized by topic
    - Sales / Marketing / Finance / Distribution / etc.
Data Warehouse
- Subject-oriented (contd.):
  - It focuses on modeling and analysis of data for decision makers.
  - It excludes data that is not useful in the decision support process.
Data Warehouse
- Integrated: data that is gathered into the data warehouse from a variety of sources and merged into a coherent whole.
- The data warehouse is a centralized, consolidated database that integrates data derived from the entire organization.
  - Multiple sources
  - Diverse sources
  - Diverse formats
Data Warehouse
- Time-variant: all data in the data warehouse is identified with a particular time period.
- The data warehouse represents the flow of data through time.
- Data is periodically uploaded, and then time-dependent data is recomputed.
Data Warehouse
- Non-volatile: once data is entered, it is never removed.
- Data is stable in a data warehouse. More data is added, but data is never removed. This enables management to gain a consistent picture of the business.
- Read-only database for data analysis and query processing.
Data Warehouse

- Data warehouses have the distinguishing characteristic that they are mainly intended for decision support applications.
  - Traditional databases are transactional.
- Applications that a data warehouse supports are:
  - OLAP (Online Analytical Processing) is a term used to describe the analysis of complex data from the data warehouse.
  - DSS (Decision Support Systems), also known as EIS (Executive Information Systems), support an organization's leading decision makers in making complex and important decisions.
  - Data mining is used for knowledge discovery, the process of searching data for unanticipated new knowledge.
Very Large Data Bases

- Terabytes -- 10^12 bytes: Walmart -- 24 terabytes
- Petabytes -- 10^15 bytes: Geographic Information Systems
- Exabytes -- 10^18 bytes: national medical records
- Zettabytes -- 10^21 bytes: weather images
- Yottabytes -- 10^24 bytes: intelligence agency videos


Evolution
- 60's: Batch reports
  - hard to find and analyze information
  - inflexible and expensive; reprogram for every new request
- 70's: Terminal-based DSS and EIS (executive information systems)
  - still inflexible, not integrated with desktop tools
- 80's: Desktop data access and analysis tools
  - query tools, spreadsheets, GUIs
  - easier to use, but only access operational databases
- 90's: Data warehousing with integrated OLAP engines and tools
Need for Data Warehousing

- Industry has a huge amount of operational data.
- Knowledge workers want to turn this data into useful information.
- They use this information to support strategic decision making.
Need for Data Warehousing (contd.)

- It is a platform for consolidated historical data for analysis.
- It stores data of good quality so that knowledge workers can make correct decisions.
Need for Data Warehousing (contd.)

- From a business perspective:
  - it is the latest marketing weapon
  - it helps to keep customers by learning more about their needs
  - it is a valuable tool in today's competitive, fast-evolving world
Application Areas

[Figure: application areas; content not recoverable from the source]
Operational v/s Information (DW) System

[Table: comparison of operational and informational (DW) systems; contents not recoverable from the source]
Conceptual Structure of Data Warehouse

- Data warehouse processing involves
  - Cleaning and reformatting of data
  - OLAP
  - Data mining

[Figure: conceptual structure of a data warehouse]
Warehouse Architecture

[Figure: warehouse architecture]
Comparison with Traditional Databases

- Data warehouses are mainly optimized for appropriate data access.
  - Traditional databases are transactional and are optimized for both access mechanisms and integrity assurance measures.
- Data warehouses place more emphasis on historical data, as their main purpose is to support time-series and trend analysis.
- Compared with transactional databases, data warehouses are non-volatile.
- In transactional databases, the transaction is the mechanism of change to the database. By contrast, information in a data warehouse is relatively coarse-grained, and the refresh policy is carefully chosen, usually incremental.
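
An incremental refresh can be sketched as follows. This is only a minimal illustration, assuming hypothetical source rows that carry a `modified` date and a warehouse held as a plain Python list; none of these names come from the slides.

```python
from datetime import date

def incremental_refresh(warehouse, source_rows, last_refresh):
    """Append only the source rows modified after the last refresh.

    Rows already in the warehouse are never updated or deleted
    (non-volatile); newer rows are simply added.
    """
    new_rows = [r for r in source_rows if r["modified"] > last_refresh]
    warehouse.extend(new_rows)
    # The newest date seen becomes the watermark for the next refresh.
    return max((r["modified"] for r in new_rows), default=last_refresh)

# Hypothetical operational rows.
source = [
    {"order_id": 1, "amount": 120.0, "modified": date(2024, 1, 5)},
    {"order_id": 2, "amount": 80.0, "modified": date(2024, 2, 1)},
    {"order_id": 3, "amount": 45.5, "modified": date(2024, 2, 9)},
]
warehouse = [source[0]]  # loaded by an earlier refresh
watermark = incremental_refresh(warehouse, source, date(2024, 1, 31))
```

Only the two rows modified after the previous refresh date are copied, so the cost of a refresh depends on the amount of change rather than on the size of the source.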
Characteristics of Data Warehouses
- Multidimensional conceptual view
- Generic dimensionality
- Unlimited dimensions and aggregation levels
- Unrestricted cross-dimensional operations
- Dynamic sparse matrix handling
- Client-server architecture
- Multi-user support
- Accessibility
- Transparency
- Intuitive data manipulation
- Consistent reporting performance
- Flexible reporting
Classification of Data Warehouses
- Generally, data warehouses are an order of magnitude larger than the source databases.
- The sheer volume of data is an issue, based on which data warehouses could be classified as follows.
  - Enterprise-wide data warehouses
    - They are huge projects requiring massive investment of time and resources.
  - Virtual data warehouses
    - They provide views of operational databases that are materialized for efficient access.
  - Data marts
    - These are generally targeted to a subset of the organization, such as a department, and are more tightly focused.
Data Modeling for Data Warehouses

- Traditional databases generally deal with two-dimensional data (similar to a spreadsheet).
  - However, querying performance in a multi-dimensional data storage model is much more efficient.
- Data warehouses can take advantage of this feature because generally
  - they are non-volatile
  - the degree of predictability of the analysis that will be performed on them is high.
Data Modeling for Data Warehouses

- Example of two-dimensional vs. multi-dimensional

[Figure: three-dimensional data cube. Axes: Product (P123, P124, P125, P126), Region (Reg 1, Reg 2, Reg 3), Fiscal Quarter (Qtr 1 to Qtr 4)]
Data Modeling for Data Warehouses

- Advantages of a multi-dimensional model
  - Multi-dimensional models lend themselves readily to hierarchical views in what is known as roll-up display and drill-down display.
  - The data can be directly queried in any combination of dimensions, bypassing complex database queries.
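
The idea of querying a cube in any combination of dimensions can be illustrated with a small sketch. It assumes a hypothetical sales cube stored as a Python dictionary keyed by (product, region, quarter); the figures and the `roll_up` helper are invented for illustration.

```python
from collections import defaultdict

# Hypothetical sales cube: (product, region, quarter) -> sales amount.
cube = {
    ("P123", "Reg1", "Qtr1"): 100,
    ("P123", "Reg2", "Qtr1"): 150,
    ("P124", "Reg1", "Qtr1"): 200,
    ("P124", "Reg1", "Qtr2"): 50,
}
DIMENSIONS = ("product", "region", "quarter")

def roll_up(cube, keep):
    """Aggregate the cube onto the dimensions named in `keep`."""
    idx = [DIMENSIONS.index(d) for d in keep]
    totals = defaultdict(int)
    for key, value in cube.items():
        totals[tuple(key[i] for i in idx)] += value
    return dict(totals)

# Roll-up: summarize over region and quarter, leaving only product.
by_product = roll_up(cube, ["product"])
# Drill-down: reveal one more level of detail by keeping the quarter too.
by_product_quarter = roll_up(cube, ["product", "quarter"])
```

Roll-up and drill-down are the same aggregation applied with fewer or more dimensions kept, which is why the multi-dimensional model supports both naturally.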
Multi-dimensional Schemas
- Multi-dimensional schemas are specified using:
  - Dimension table
    - It consists of tuples of attributes of the dimension.
  - Fact table
    - Each tuple is a recorded fact. This fact contains some measured or observed variable(s) and identifies it with pointers to dimension tables. The fact table contains the data, and the dimensions identify each tuple in the data.
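
The relationship between the two kinds of table can be sketched in a few lines. The table contents, key values, and the `describe` helper below are hypothetical and serve only to show how a fact points into its dimensions.

```python
# Dimension tables: one tuple of attributes per dimension member.
product_dim = {
    1: {"code": "P123", "category": "Widgets"},
    2: {"code": "P124", "category": "Gadgets"},
}
region_dim = {
    10: {"name": "Reg1"},
    11: {"name": "Reg2"},
}

# Fact table: each tuple holds a measured variable (amount) plus
# keys that point into the dimension tables.
sales_fact = [
    {"product_id": 1, "region_id": 10, "amount": 100.0},
    {"product_id": 1, "region_id": 11, "amount": 150.0},
    {"product_id": 2, "region_id": 10, "amount": 200.0},
]

def describe(fact):
    """Resolve a fact's dimension pointers into readable attributes."""
    product = product_dim[fact["product_id"]]
    region = region_dim[fact["region_id"]]
    return (product["code"], region["name"], fact["amount"])

described = [describe(f) for f in sales_fact]
```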
Multi-dimensional Schemas
- Two common multi-dimensional schemas are
  - Star schema:
    - Consists of a fact table with a single table for each dimension.
  - Snowflake schema:
    - It is a variation of the star schema in which the dimensional tables from a star schema are organized into a hierarchy by normalizing them.
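
A snowflake layout can be sketched with Python's built-in `sqlite3` module. The table and column names are assumptions chosen for illustration: the product dimension is normalized into a product table and a category table, which a star schema would keep as a single product table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Snowflake: the product dimension is normalized into product -> category.
CREATE TABLE category (category_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE product  (product_id INTEGER PRIMARY KEY, code TEXT,
                       category_id INTEGER REFERENCES category(category_id));
-- Fact table: one measure plus a key into the product dimension.
CREATE TABLE sales (product_id INTEGER REFERENCES product(product_id),
                    quarter TEXT, amount REAL);
INSERT INTO category VALUES (1, 'Widgets'), (2, 'Gadgets');
INSERT INTO product  VALUES (1, 'P123', 1), (2, 'P124', 1), (3, 'P125', 2);
INSERT INTO sales    VALUES (1, 'Qtr1', 100), (2, 'Qtr1', 50), (3, 'Qtr1', 70);
""")

# In a star schema the category name would sit directly in the product
# table; in the snowflake it costs one extra join.
rows = conn.execute("""
    SELECT c.name, SUM(s.amount)
    FROM sales s
    JOIN product p  ON p.product_id = s.product_id
    JOIN category c ON c.category_id = p.category_id
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()
```

The trade-off shown here is the usual one: normalizing a dimension removes repeated attribute values but adds a join to every query that needs them.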
Multi-dimensional Schemas
- Fact constellation:
  - A fact constellation is a set of fact tables that share some dimension tables. However, fact constellations limit the possible queries for the warehouse.
Multi-dimensional Schemas
- Indexing
  - A data warehouse also uses indexing to support high-performance access.
  - A technique called bitmap indexing constructs a bit vector for each value in the domain being indexed.
  - Bitmap indexing works very well for domains of low cardinality.
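
A bitmap index can be sketched with Python integers used as bit vectors. The column values and helper names below are hypothetical; the sketch only shows the principle of one bit vector per distinct value.

```python
def build_bitmap_index(values):
    """Return {value: bit vector}; bit i is set when row i holds that value."""
    index = {}
    for row, value in enumerate(values):
        index[value] = index.get(value, 0) | (1 << row)
    return index

def rows_of(bitmap):
    """Decode a bit vector back into a list of row numbers."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

# Hypothetical low-cardinality columns of a six-row table.
region = ["East", "West", "East", "North", "West", "East"]
grade = ["A", "B", "A", "A", "B", "B"]

region_idx = build_bitmap_index(region)
grade_idx = build_bitmap_index(grade)

# The condition region = 'East' AND grade = 'A' becomes one bitwise AND.
east_and_a = rows_of(region_idx["East"] & grade_idx["A"])
```

With few distinct values there are few bit vectors to store, and combining conditions reduces to cheap bitwise operations, which is why the technique suits low-cardinality domains.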
Building A Data Warehouse
- The builders of a data warehouse should take a broad view of the anticipated use of the warehouse.
  - The design should support ad-hoc querying.
  - An appropriate schema should be chosen that reflects the anticipated usage.
Building A Data Warehouse
- The design of a data warehouse involves the following steps.
  - Acquisition of data for the warehouse.
  - Ensuring that data storage meets the query requirements efficiently.
  - Giving full consideration to the environment in which the data warehouse resides.
Building A Data Warehouse
- Acquisition of data for the warehouse
  - The data must be extracted from multiple, heterogeneous sources.
  - Data must be formatted for consistency within the warehouse.
  - The data must be cleaned to ensure validity.
    - It is difficult to automate the cleaning process.
    - Back flushing: returning cleaned data to the source.
Building A Data Warehouse
- Acquisition of data for the warehouse (contd.)
  - The data must be fitted into the data model of the warehouse.
  - The data must be loaded into the warehouse.
    - Proper design for the refresh policy should be considered.
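
The acquisition steps (extract, format, clean, load) can be sketched as a small pipeline. The source names, field names, and the single validity rule are assumptions made for illustration; real cleaning rules are far more involved and hard to automate.

```python
def extract(sources):
    """Pull rows from several heterogeneous sources into one stream."""
    for source in sources:
        yield from source

def transform(row):
    """Format a row consistently: trimmed upper-case region, float amount."""
    return {
        "region": str(row["region"]).strip().upper(),
        "amount": float(row["amount"]),
    }

def is_valid(row):
    """A deliberately simple cleaning rule: named region, non-negative amount."""
    return row["region"] != "" and row["amount"] >= 0

def load(warehouse, rows):
    """Append the cleaned rows to the warehouse."""
    warehouse.extend(rows)

# Two hypothetical sources with inconsistent formats.
crm_rows = [{"region": " east ", "amount": "120.50"}]
erp_rows = [{"region": "WEST", "amount": 80}, {"region": "west", "amount": -5}]

warehouse = []
cleaned = [r for r in map(transform, extract([crm_rows, erp_rows])) if is_valid(r)]
load(warehouse, cleaned)
```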
Building A Data Warehouse
- Ensuring that data storage meets the query requirements involves:
  - Storing the data according to the data model of the warehouse
  - Creating and maintaining required data structures
  - Creating and maintaining appropriate access paths
  - Providing for time-variant data as new data are added
  - Supporting the updating of warehouse data
  - Refreshing the data
  - Purging data
Building A Data Warehouse
- Giving full consideration to the warehouse environment includes:
  - Usage projections
  - The fit of the data model
  - Characteristics of available resources
  - Design of the metadata component
  - Modular component design
  - Design for manageability and change
  - Considerations of distributed and parallel architecture
    - Distributed vs. federated warehouses
Functionality of a Data Warehouse
- Functionality that can be expected:
  - Roll-up: Data is summarized with increasing generalization.
  - Drill-down: Increasing levels of detail are revealed.
  - Pivot: Cross tabulation is performed.
  - Slice and dice: Projection operations are performed on the dimensions.
  - Sorting: Data is sorted by ordinal value.
  - Selection: Data is available by value or range.
  - Derived attributes: Attributes are computed by operations on stored and derived values.
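
Some of these operations can be sketched on a small list of fact rows. The data and function names are hypothetical, and the sketch uses the common OLAP reading of slice (fixing one dimension to a single value) and dice (restricting several dimensions to sets of values).

```python
from collections import defaultdict

# Hypothetical fact rows: (product, region, quarter, sales).
facts = [
    ("P123", "Reg1", "Qtr1", 100),
    ("P123", "Reg2", "Qtr1", 150),
    ("P124", "Reg1", "Qtr2", 200),
    ("P124", "Reg2", "Qtr2", 50),
]

def slice_(rows, quarter):
    """Slice: fix one dimension (here the quarter) to a single value."""
    return [r for r in rows if r[2] == quarter]

def dice(rows, products, regions):
    """Dice: restrict several dimensions to sets of values."""
    return [r for r in rows if r[0] in products and r[1] in regions]

def pivot(rows):
    """Pivot: cross-tabulate products (rows) against regions (columns)."""
    table = defaultdict(lambda: defaultdict(int))
    for product, region, _quarter, sales in rows:
        table[product][region] += sales
    return {p: dict(cols) for p, cols in table.items()}

q1 = slice_(facts, "Qtr1")
reg1_only = dice(facts, {"P123", "P124"}, {"Reg1"})
crosstab = pivot(facts)
```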
Online Analytical Processing (OLAP)
- It enables analysts, managers, and executives to gain insight into data through fast, consistent, interactive access to a wide variety of possible views of information that has been transformed from raw data to reflect the real dimensionality of the enterprise as understood by the user.
OLAP Cube

[Figure: OLAP cube; labels and values not recoverable from the source]
OLAP Operations

[Figures: OLAP operations, four slides; diagrams not recoverable from the source]
OLAP Server
- An OLAP server is a high-capacity, multi-user data manipulation engine specifically designed to support and operate on multi-dimensional data structures.
- The available OLAP servers are
  - MOLAP server
  - ROLAP server
  - HOLAP server
Presentation

[Figure: content not recoverable from the source]
Warehouse vs. Data Views
- Views and data warehouses are alike in that they both have read-only extracts from the databases.
- However, data warehouses are different from views in the following ways:
  - Data warehouses exist as persistent storage instead of being materialized on demand.
  - Data warehouses are not usually relational, but rather multi-dimensional.
  - Data warehouses can be indexed for optimization.
  - Data warehouses provide specific support of functionality.
  - Data warehouses deal with huge volumes of data that are generally contained in more than one database.
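
The first and third differences can be illustrated with Python's built-in `sqlite3` module. This is a sketch under simplifying assumptions: a single hypothetical `orders` table stands in for the operational database, and an ordinary table filled at load time stands in for the warehouse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (region TEXT, amount REAL);
INSERT INTO orders VALUES ('East', 100), ('West', 40);
-- A view is recomputed from the source table on every query.
CREATE VIEW sales_view AS
    SELECT region, SUM(amount) AS total FROM orders GROUP BY region;
-- The warehouse table is a persistent copy taken at load time,
-- and unlike the view it can be indexed.
CREATE TABLE sales_wh AS
    SELECT region, SUM(amount) AS total FROM orders GROUP BY region;
CREATE INDEX idx_sales_wh_region ON sales_wh(region);
""")

# The operational source changes after the warehouse load.
conn.execute("INSERT INTO orders VALUES ('East', 60)")

view_total = conn.execute(
    "SELECT total FROM sales_view WHERE region = 'East'").fetchone()[0]
wh_total = conn.execute(
    "SELECT total FROM sales_wh WHERE region = 'East'").fetchone()[0]
```

The view reflects the new order immediately, while the warehouse table keeps the value from its last load until the next refresh.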
Advantages of Warehousing
- High query performance
- Queries not visible outside the warehouse
- Local processing at sources unaffected
- Can operate when sources are unavailable
- Can query data not stored in a DBMS
- Extra information at the warehouse
  - Modify, summarize (store aggregates)
  - Add historical information
Difficulties of implementing Data Warehouses

- Lead time is huge in building a data warehouse.
  - Potentially, it takes years to build and efficiently maintain a data warehouse.
- Both quality and consistency of data are major concerns.
- The usage projections must be revised regularly to meet the current requirements.
  - The data warehouse should be designed to accommodate addition and attrition of data sources without major redesign.
- Administration of a data warehouse requires far broader skills than are needed for a traditional database.
Open Issues in Data Warehousing
- Data cleaning, indexing, partitioning, and views could be given new attention from the perspective of data warehousing.
- Automation of
  - data acquisition
  - data quality management
  - selection and construction of access paths and structures
  - self-maintainability
  - functionality and performance optimization
- Incorporating domain and business rules appropriately into the warehouse creation and maintenance process more intelligently.
Data Warehousing Tools
- Data warehouse
  - SQL Server 2000 DTS
  - Oracle 8i Warehouse Builder
- OLAP tools
  - SQL Server Analysis Services
  - Oracle Express Server
- Reporting tools
  - MS Excel Pivot Chart
  - VB applications
Tools
- Data extraction: SAS
- Data cleaning: Apertus, Trillium
- Data storage: ORACLE, SYBASE
- Optimizers
  - Advanced Parallel Optimizer
  - Bitmap Indices
  - Star Index
Tools
- Development tools to create applications: IBM Visualizer, ORACLE CDE
- Relational OLAP: Informix Metacube
Why we use Oracle for our warehouse

- Table partition pruning
- Star query optimizer hint
- Bitmap indexes
- PL/SQL stored procedures
- Transportable tablespaces
- Query rewrite
- Materialized views
- Job scheduler
Useful URLs
- Ralph Kimball's home page
  http://www.rkimball.com
- Larry Greenfield's Data Warehouse Information Center
  http://pwp.starnetinc.com/larryg/
- Data Warehousing Institute
  http://www.dw-institute.com/
- OLAP Council
  http://www.olapcouncil.com/
