You are on page 1of 36

Data Warehousing, MIS & ETL

Overview

SivaSatish.K
Data Warehousing
Data & Information

 Data, in real terms is raw and has no facts behind its


existence.
 On the other hand, information is the processed raw data
(as per the business requirement) and supported by facts.
 Example : For an example, Management requires
“Number of booking (booking creation trend) for Asia in
year 2007”. Whenever this data will be retrieved from data
sources as per business requirements (any input system)
and stored in data base, this would be information to the
management.
Data Warehousing

 It stores the data related to the subject area of the


organizations and controlled/owned by centrally
coordinated organizations.
– In other words, it’s a relational database which is
designed for performing analysis and execution of
queries. It contains historical data also.
– Technique to implement data warehouse is
known as “Data Warehousing”
Why do we need data
warehousing ?

History
Reasons:
 In OLTP approach, historical data was archived to
update the recent information in the system and
even it is bit difficult to update that information
everywhere in OLTP. Unlike OLTP; DW stores
historical data to compare trends over period of time
(now a days, it’s in Tera bytes)
 As it has one DB at the end, number of sources can
be used to transfer data and sources can be
relational DB, Flat File, Unix file. ETL tool handles
these source files to transfer data from source to
target.
Continued…..

 DW stores the necessary data for specific analysis;


not jus all data.
 Small companies focus on one area so they can
have data in data marts ; on the contrary large
companies can have data in data warehouse and
data marts to facilitate their users’ need.
 DW is commonly used by organizations to analyze
trends over time. Trends can be of
Revenue/Sales/Purchases/ BL generation etc.
Continued……

 Most importantly, Though Organizations may use


data warehousing to view day-to-day operations but
primarily DW is to facilitate strategic planning
resulting from long-term data overviews and these
overviews, are the pillars for next business models,
forecasts and other reports.
Approaches:

 Normalized Approach
 Dimensional approach or De - Normalized
approach
Normalized Vs Dimensional Approach:

 Firslty, any of these approaches can be used to


store the data to the data warehouse but objective
makes a difference to implement these.
 In normalized approach, data element is
completely ignored for data warehouse.
 With dimensional approach, multidimensional
model with dimensional approach came into picture.
Here, processed data or transaction data are
partitioned into "facts" and “dimensions”.
Dimensional Approach continued…...

 Facts in simple meaning are ‘measures’ (figures


such as # of booking created, Container quantity,
Revenue, Gross Weight etc)
 Where as Dimensions are the entity to which these
facts belong such as Customer name, continent,
Week, Month)
 Hence, MIS is an example of Dimensional
approach.
Continued…with an example

 One Dimensional:-
– For SVVD (PAX-ANX-054 W), there are 4572
containers. Here, SVVD is the dimension and
4572 (container qty) is the measure i.e. fact.
– For Shipper name “ABC 123”, revenue is 1.2
million $. Here, shipper is the dimension and
revenue is the measure i.e. fact corresponds to
dimensions.
Continued….

 Multidimensional example:
– What is the gross weight for 47GP Containers
traveling from New York to Singapore with
shipment direction “E” in year 2006?
– Here in this example, Gross weight is the only
measure (Fact) and container type (47GP), Load
Port (New York) and Discharge Port (Singapore),
shipment direction (E) and Year (2006) are
various dimensions.
Characteristics : Data Warehouse

 Subject Oriented
 Integrated
 Non Volatile
 Time Variant
Continued……

 Subject oriented in a way that it revolves around


the subject area of corporate such as “Booking”, if
management wants to see booking pattern of last
quarter like which country has maximum booking
and shipment started ports.
 Integrated in a way that database should avoid
naming conflicts and inconsistencies among
keywords used and somehow related to subject
oriented feature also.
Continued…..

 Non volatile in a way that once stored in the


database, data should not be changed as it depicts
what has occurred in the past.
 Time Variant in a way that, In order to compare the
trends in business over the past years,
managements/decision makers need large amounts
of data. A data warehouse's focus on change over
time is what is meant by the term time variant which
is unlike OLTP, where performance requirements
demand that historical data be moved to an archive.
Architecture of Data Warehouse

Data warehouse basic Flow:


End Users
Analysis
Data Sources: Meta Raw
Data Data
-Relation data source
-Flat File End Users
Analysis
Summarized
data
End Users
Analysis
Terminologies: Freq. used in DW

 Source : Data where it is coming from


 Target : Destination database or data mart
 Test director Test management and defect
management tool from Mercury
 Sunopsis/ Informatica : ETL dev tool used to
extract, load and transformation
 Process Packages, Scenarios, Interfaces,
procedures : Different objects and their hierarchy in
Sunopsis ETL tool
MIS : Management Information
system
MIS: Management Information system

 Important Module in terms of strategic decisions:


– As most of the strategic decisions based on the
information.
– As management decisions are based on the
information (information is nothing, just the processed
data), processed data has been stored in database in
more organized and structured manner.
 MIS deals only with Data (back- end)
 It is treated as DSS ( Decision support system)
Continued…

 As said, deals only with Data; MIS is a Data


Warehouse used to store the data for various
modules such as Shipment, Equipment, DnD, AR
and Cost etc.
MIS : Architecture

 Pictorial Representation:
CPF Sybase
OLAP
CDW CDW Oracle Cubes
(ETL) Ware house
Gemstone Sybase
Server

Cognos
ODS Sybase Repository
EQP Shpmt
Data mart Mart Mart

CSS (ETL) BI Reports


Sybase
AR Cost
Mart Mart BI Client
The components and the purpose :

 OSCAR – One of the source systems for the master


and transactional data
 Middleware – The Sunopsis ETL tool is used as the
Middleware tool to Extract, Transform and Load the
source data into the target systems
 CDW (centralized Data Warehouse) – This is one of
the target systems, which holds the data for all the
Marts (Shipment, Cost, Trade PL, AR and
Equipment) which is used for reporting
Types of Testing

 Data
 Functional
 If Require, Middleware
Data Testing includes:
 Initial data Testing, Data validation, Data quality,
Data integration and Data volume testing.
– Data Validation ensures Null/not null values,
duplicate records, behavior of surrogate
keys/primary keys, Data type and format etc.
– Data quality testing ensures the accurate
information in dimension and fact tables.
– Initial testing means verifying the data in CDW,
Master tables which will be populated through
OSCAR and its web application (Front End
System)
Continued….

– Data Integration testing ensures the integrity of


data in target system if data populated from more
than one source system.
– Moreover, if any change to the data happens
through UI, then possibly test the same changes
to the target data warehouse.
– Data Volume testing ensures that Volume of
data is available from the OSCAR Test
environment then perform data volume testing
Functional Testing

 Number of records from source system (OSCAR) for


each individual dimension and fact
table should be tested into target data warehouse,
sometimes it may differ due to join conditions
between tables in source system.
 Testing will be done in target data warehouse after
data loads from Source system to target system.
 If any changes occur in data source, check
corresponding changes in target data warehouse.
Continued…..

 Application specific features to be tested according


to business rules/ transformation rules in target data
warehouse. (E.g. booking id should be system
generated unique identifier).
Middleware (sunopsis) Testing

 Verify the connection between source & target database.


 If needed, Verify Source to stage and stage to target
mappings and based on that only test what is in source
and what should be in target
 If an errors occurs, Check the error notification through
Open view
 For some target tables, there are two different
packages/scenarios (for source to stage and stage to
target), run both packages manually
Continued….

 Performance testing / Fail-over testing: Test


performance and fail-over of Middleware with huge
volume of data (If possible and Volume of data will
be available from the OSCAR Test environment)
Test cycle

 Subject areas for master target tables: Customer,


Geography, Organization, Commodity, Product,
Vendor/Tariff, Service contract and Equipment.
 Subject areas for transactional target tables: B/L,
Booking, Container Movement, Empty poisoning
plan, Job Order and Job order Invoice
(Transportation/Non-Transportation)
ETL

Overview
Why ETL tool ?

 While storing the data to the target database, data


will be processed according to the various
requirements of business (like ignore a particular
country, region etc) and later on saved through ETL
tools and technique to the database (target
repository)
 Generate a Unique Id in Shipment Mart.
ETL: Extraction, Transform and Load.

 Extract the data (Sales in different


countries in Asia region) through UI/Any
other Relational Source or Flat File.
 Transform that data based on various
requirements from business.
 Load the data to the target (table) and store
it in database.
Continued…..

 The Simplest way to understand the ETL


concepts such as
– ETL process,
– ETL software,
– ETL developments
– and ETL testing starts with Data warehousing
concepts which was explained earlier.
Recommended References:

 http://en.wikipedia.org/wiki/Data_warehouse

You might also like