
OVERVIEW OF DATA WAREHOUSING

DATABASE VS DATA WAREHOUSE


To accelerate decision making, information must be:
1. The right information
2. Available at the right time
3. Easily accessible

Problems with databases:
1. Fragmented data
2. Operational / informational processing

PROBLEMS

The business requires from IT:
1. Integrated data
2. A company-wide view of high-quality information
3. A fixed network with changing users
The information systems department must separate informational processing systems from operational systems to improve performance.

No single system of data.
Viewing the databases as a whole is difficult.
The organization wants to analyze its activities in a balanced way.
Customer relationship management

SO WHAT IS A DATA WAREHOUSE?
Subject-oriented: customers, patients, students, products, time.

Integrated: gathered centrally from
1. several internal systems of record
2. sources external to the organization

WHAT IS A DATA WAREHOUSE?

Time-variant: used to study trends and changes.

Non-updatable: cannot be updated by end users.

OPERATIONAL SYSTEMS
Used to run a business in real time based on current data and process large volumes of relatively simple read/write transactions, while providing fast response.

Examples:
1. Sales order processing
2. Reservation systems
3. Patient registration

INFORMATION SYSTEMS
Designed to support decision making based on:
1. Historical data
2. Prediction data

Designed for complex queries or data-mining applications.


Examples:
1. Sales trend analysis
2. Customer segmentation
3. Human resources planning

DIFFERENCE
Characteristic | Operational Systems | Informational Systems
Purpose | Real-time data entry | Read and analyze historical data
Primary users | Clerks, salespersons, administrators | Managers, business analysts, customers
Scope of usage | Narrow, planned, and simple updates and queries | Broad, ad hoc, complex queries and analysis
Design goal | Performance: throughput, availability | Ease of flexible access and use
Volume | Many, constant updates and queries on one or a few table rows | Periodic batch updates and queries requiring many or all rows

DATA WAREHOUSE SCOPE


Broad: required for companies; very costly; may be divided by department.
Narrow: required for personal information.

TYPES OF DATA WAREHOUSE


Point-to-Point: end users are allowed to access operational databases directly, using any tools

TYPES OF DATA WAREHOUSE


Central Data Warehouses

TYPES OF DATA WAREHOUSE


1. EIS: Executive Information System
2. DSS: Decision Support System
3. Reporting

Distributed Data Warehouse:

Certain components of the DW are distributed across a number of different physical databases

BIG PICTURE

END USERS
Executives and managers
"Power" users (business and financial analysts, engineers, etc.)
Support users (clerical, administrative, etc.)

DATA MARTS

Create many DMs
Limited scope
Independent ETL process, or derived from the DW

Examples:
1. Financial DM
2. Marketing DM
3. Supply chain DM

D.M. PICTURE

DATA IN THE DATA WAREHOUSE

One version of the truth across the enterprise, with meaningful records.
For IT staff: clean, consistent, and documented data.
For engineers or analysts: convenient data in a common format, exportable to other common formats.

DATA IN THE DATA WAREHOUSE

Production Data: data from different operational systems on heterogeneous platforms.
Internal Data: private data of the organization, such as spreadsheets, documents, and customer profiles.

DATA IN THE DATA WAREHOUSE

External Data: data from external sources, such as statistics relating to the organization's industry produced by external agencies.
Example: the DW of a car rental company contains data on the current production schedules of the leading automobile manufacturers.

DATA IN THE DATA WAREHOUSE

Archived Data: old data that the operational systems, which run the current business, periodically store in archive files.

DATA IN THE DATA WAREHOUSE

Methods of archiving data:
1. Recent data is archived to a separate archival database that may be online
2. Old data is archived to flat files on disk storage
3. The oldest data is archived to tape cartridges or microfilm, or kept offline

DATABASE STRUCTURE

The DW is made up of three separate databases:
1. Interim data store
2. Metadata repository
3. Production DW

OLTP
Online transaction processing
Standard normalized structure

Designed for transactions: insert, update, delete

OLAP
Online analytical processing
Star schema [see table]
Read-only
Historical data
Aggregated data
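
To make the OLAP side of the contrast concrete, here is a minimal sketch of a star schema in an in-memory SQLite database, with one fact table, two dimension tables, and a read-only aggregate query. The table names, column names, and sample rows (dim_product, dim_date, fact_sales) are invented for illustration and do not come from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables describe the "who/what/when" of each fact.
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, year INTEGER);
    -- The fact table sits at the center of the star and holds the measures.
    CREATE TABLE fact_sales (
        product_id INTEGER REFERENCES dim_product(product_id),
        date_id    INTEGER REFERENCES dim_date(date_id),
        amount     REAL
    );
""")
conn.executemany("INSERT INTO dim_product VALUES (?, ?)",
                 [(1, "books"), (2, "toys")])
conn.executemany("INSERT INTO dim_date VALUES (?, ?)",
                 [(10, 2022), (11, 2023)])
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)",
                 [(1, 10, 5.0), (1, 11, 7.0), (2, 11, 3.0), (2, 11, 4.0)])


def sales_by_category_and_year(connection):
    """Aggregate historical facts along two dimensions (read-only)."""
    query = """
        SELECT p.category, d.year, SUM(f.amount)
        FROM fact_sales f
        JOIN dim_product p ON p.product_id = f.product_id
        JOIN dim_date d ON d.date_id = f.date_id
        GROUP BY p.category, d.year
        ORDER BY p.category, d.year
    """
    return connection.execute(query).fetchall()


summary = sales_by_category_and_year(conn)
```

An OLTP design would instead normalize these tables further and be tuned for many small inserts, updates, and deletes.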

ARCHITECTURE AND END-TO-END PROCESS

BACK END TOOLS AND UTILITIES


Tools are used to extract and load data.
Data is extracted from foreign sources through gateways and interfaces.
Examples: EDA/SQL, ODBC, Oracle Open Connect, Sybase Enterprise Connect, Informix Enterprise Gateway

CLEANING
Large volumes of data from multiple sources are involved.
There is a high probability of errors and anomalies in the data.
Tools that help to detect data anomalies and correct them can have a high payoff.

CLEANING
Examples where data cleaning becomes necessary:
1. Inconsistent field lengths
2. Inconsistent descriptions
3. Inconsistent value assignments
4. Missing entries and violation of integrity constraints

Different classes of data cleaning tools:
1. Data migration tools
2. Data scrubbing tools
3. Data auditing tools
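
The kinds of anomalies listed above can be detected with simple checks. The sketch below is only an illustration: the function name find_anomalies, the field names, and the sample records are all hypothetical, and a real cleaning tool would do far more.

```python
def find_anomalies(records, required_fields, allowed_values, max_lengths):
    """Return a list of (row_index, field, problem) tuples."""
    problems = []
    for i, rec in enumerate(records):
        # Missing entries: a required field is absent or empty.
        for field in required_fields:
            if rec.get(field) in (None, ""):
                problems.append((i, field, "missing entry"))
        # Inconsistent value assignments: value outside the agreed code set.
        for field, allowed in allowed_values.items():
            value = rec.get(field)
            if value not in (None, "") and value not in allowed:
                problems.append((i, field, "inconsistent value"))
        # Inconsistent field lengths: value longer than the target column.
        for field, limit in max_lengths.items():
            value = rec.get(field)
            if isinstance(value, str) and len(value) > limit:
                problems.append((i, field, "field too long"))
    return problems


rows = [
    {"name": "Ann", "gender": "F", "state": "NY"},
    {"name": "", "gender": "female", "state": "New York"},
]
anomalies = find_anomalies(
    rows,
    required_fields=["name"],
    allowed_values={"gender": {"M", "F"}},
    max_lengths={"state": 2},
)
```

Here the first row passes every check, while the second is reported once for each of the three rules.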

DATA MIGRATION
Data migration tools allow simple transformation rules to be specified.
Example: replace the string "gender" with "sex".

Warehouse Manager from Prism is an example of a popular tool of this kind.
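
A simple transformation rule of this kind might be sketched as follows. The sketch assumes the rule renames a field; the slide does not say whether the string is a field name or a value. The function apply_rules and the RULES table are hypothetical and are not the interface of any actual tool.

```python
def apply_rules(record, rename_rules):
    """Return a copy of record with field names rewritten per rename_rules."""
    return {rename_rules.get(name, name): value for name, value in record.items()}


# Hypothetical rule table: replace the string "gender" with "sex".
RULES = {"gender": "sex"}

migrated = apply_rules({"id": 7, "gender": "F"}, RULES)
```

Fields without a matching rule pass through unchanged.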

DATA SCRUBBING
Data scrubbing tools use domain-specific knowledge (for example, of postal addresses) to do the scrubbing of data.
They use parsing and fuzzy matching techniques to accomplish cleaning from multiple sources.
Tools: Integrity and Trillium
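
As a rough sketch of parsing plus fuzzy matching on postal addresses, the following uses Python's standard difflib module. The abbreviation table, the 0.9 threshold, and the function names are assumptions chosen for illustration; they say nothing about how the commercial tools named above work.

```python
import difflib
import re

# Assumed sample of domain knowledge about address abbreviations.
ABBREVIATIONS = {"st": "street", "ave": "avenue", "rd": "road"}


def normalize_address(raw):
    """Parse an address into lowercase tokens and expand abbreviations."""
    tokens = re.findall(r"[a-z0-9]+", raw.lower())
    return " ".join(ABBREVIATIONS.get(tok, tok) for tok in tokens)


def is_probable_duplicate(a, b, threshold=0.9):
    """Fuzzy-match two addresses after normalization."""
    matcher = difflib.SequenceMatcher(None, normalize_address(a), normalize_address(b))
    return matcher.ratio() >= threshold


same = is_probable_duplicate("12 Main St.", "12 main street")
different = is_probable_duplicate("12 Main St.", "98 Oak Ave")
```

Normalizing before comparing lets two differently formatted source records for the same address be recognized as one.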

DATA AUDITING
Data auditing tools make it possible to discover rules and relationships by scanning data.
Example: a tool may discover a suspicious pattern (based on statistical analysis), such as a certain car dealer that has never received any complaints.
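
One way such a statistical check could work is sketched below. It assumes complaints follow a Poisson distribution at the overall complaint rate, so the chance that a dealer with n sales has zero complaints is exp(-rate * n). That model, the alpha cutoff, and the sample figures are illustrative assumptions, not something the slides specify.

```python
import math


def suspicious_dealers(stats, alpha=0.01):
    """stats maps dealer -> (cars_sold, complaints).

    Flags dealers with zero complaints when that outcome would be very
    unlikely (probability below alpha) at the overall complaint rate.
    """
    total_sold = sum(sold for sold, _ in stats.values())
    total_complaints = sum(complaints for _, complaints in stats.values())
    rate = total_complaints / total_sold  # complaints per car sold
    flagged = []
    for dealer, (sold, complaints) in sorted(stats.items()):
        p_zero = math.exp(-rate * sold)  # Poisson probability of no complaints
        if complaints == 0 and p_zero < alpha:
            flagged.append(dealer)
    return flagged


data = {"A": (1000, 52), "B": (900, 48), "C": (800, 0), "D": (20, 0)}
flagged = suspicious_dealers(data)
```

Dealer D also has no complaints, but with only 20 sales that is unremarkable, so only the high-volume dealer C is flagged.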

LOADING
Additional preprocessing required:
1. Checking integrity constraints
2. Sorting, summarization, aggregation
3. Other computation to build the derived tables stored in the warehouse
Batch load utilities are used for this purpose.
In addition to populating the warehouse, a load utility must allow the system administrator to monitor status; to cancel, suspend, and resume a load; and to restart after failure with no loss of data integrity.
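
The preprocessing steps above can be sketched in a few lines. The function prepare_load, the field names, and the two integrity rules (a known product id and a non-negative amount) are hypothetical; monitoring, suspend/resume, and restart handling are left out.

```python
from collections import defaultdict


def prepare_load(rows, valid_product_ids):
    """Check integrity, sort, and aggregate rows before loading."""
    accepted, rejected = [], []
    for row in rows:
        # Integrity constraints: known product and non-negative amount.
        ok = row["product_id"] in valid_product_ids and row["amount"] >= 0
        (accepted if ok else rejected).append(row)
    # Sorting: order accepted rows by product, then by day.
    accepted.sort(key=lambda r: (r["product_id"], r["day"]))
    # Aggregation: build a derived table of totals per product.
    totals = defaultdict(float)
    for row in accepted:
        totals[row["product_id"]] += row["amount"]
    return accepted, rejected, dict(totals)


batch = [
    {"product_id": 2, "day": "2024-01-02", "amount": 4.0},
    {"product_id": 1, "day": "2024-01-03", "amount": 1.5},
    {"product_id": 9, "day": "2024-01-01", "amount": 2.0},
    {"product_id": 1, "day": "2024-01-01", "amount": 2.5},
]
accepted, rejected, totals = prepare_load(batch, valid_product_ids={1, 2})
```

Rejected rows are returned rather than dropped, so they can be logged or corrected and reloaded.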

REFRESH
Refreshing a warehouse consists of propagating updates on source data to correspondingly update the base data and derived data stored in the warehouse.
Two sets of issues: when to refresh, and how to refresh.

SUMMARIZATION
Summaries require a lot of space to store, and computing them requires computer time and other resources.
Some of the summaries may contain figures that explain the summary.
The advantage of stored summaries is that the data warehouse does not have to calculate them when they are requested.

METADATA
Administrative metadata
Business metadata: includes business terms and definitions
Operational metadata: includes information that is collected during the operation of the warehouse

The ETL Process

1. Capture (extract)
2. Scrub (data cleansing)
3. Transform
4. Load and index

ETL = Extract, transform, and load

Steps in data reconciliation

Capture = extract: obtaining a snapshot of a chosen subset of the source data for loading into the data warehouse

Static extract = capturing a snapshot of the source data at a point in time

Incremental extract = capturing changes that have occurred since the last static extract
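
The difference between the two kinds of extract can be sketched as below. The sketch assumes each source row carries a modified_at timestamp, which is one common way to detect changes but is not stated in the slides; the function and field names are hypothetical.

```python
from datetime import datetime


def static_extract(source_rows):
    """Snapshot of the chosen source rows at one point in time."""
    return [dict(row) for row in source_rows]


def incremental_extract(source_rows, last_extract_time):
    """Only the rows modified after the previous extract."""
    return [dict(row) for row in source_rows
            if row["modified_at"] > last_extract_time]


source = [
    {"id": 1, "modified_at": datetime(2024, 1, 1)},
    {"id": 2, "modified_at": datetime(2024, 2, 1)},
]
snapshot = static_extract(source)
changes = incremental_extract(source, datetime(2024, 1, 15))
```

Both functions copy the rows, so later changes in the source do not alter what was captured.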

Steps in data reconciliation (continued)

Scrub = cleanse: uses pattern recognition and AI techniques to upgrade data quality

Fixing errors: misspellings, erroneous dates, incorrect field usage, mismatched addresses, missing data, duplicate data, inconsistencies

Also: decoding, reformatting, time stamping, conversion, key generation, merging, error detection/logging, locating missing data

Steps in data reconciliation (continued)

Transform = convert data from format of operational system to format of data warehouse

Record-level:
Selection: data partitioning
Joining: data combining
Aggregation: data summarization

Field-level:
Single-field: from one field to one field
Multi-field: from many fields to one, or one field to many
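
The three record-level functions can be illustrated on a handful of rows. The order and customer data below are made up for the example.

```python
from collections import defaultdict

orders = [
    {"order_id": 1, "customer_id": "c1", "region": "east", "amount": 10.0},
    {"order_id": 2, "customer_id": "c2", "region": "west", "amount": 20.0},
    {"order_id": 3, "customer_id": "c1", "region": "east", "amount": 5.0},
]
customers = {"c1": "Acme", "c2": "Globex"}

# Selection: partition the data by keeping only rows that meet a criterion.
east_orders = [o for o in orders if o["region"] == "east"]

# Joining: combine each order with data from another source.
joined = [{**o, "customer_name": customers[o["customer_id"]]} for o in east_orders]

# Aggregation: summarize detail rows into one total per customer.
totals = defaultdict(float)
for row in joined:
    totals[row["customer_name"]] += row["amount"]
totals = dict(totals)
```

In a real warehouse these steps would usually run inside the database or an ETL tool rather than in application code.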

Steps in data reconciliation (continued)

Load/Index = place transformed data into the warehouse and create indexes

Refresh mode: bulk rewriting of target data at periodic intervals

Update mode: only changes in source data are written to the data warehouse
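
The two load modes and the indexing step can be contrasted with a small sketch in which the warehouse table is modeled as a dictionary keyed by row id. The function names and sample rows are hypothetical.

```python
def load_refresh(target, source_rows):
    """Refresh mode: bulk-rewrite the whole target from the source."""
    target.clear()
    target.update({row["id"]: row for row in source_rows})


def load_update(target, changed_rows):
    """Update mode: write only the rows that changed in the source."""
    for row in changed_rows:
        target[row["id"]] = row


def build_index(target, field):
    """Map each value of `field` to the sorted ids of the rows holding it."""
    index = {}
    for row_id, row in target.items():
        index.setdefault(row[field], []).append(row_id)
    return {value: sorted(ids) for value, ids in index.items()}


warehouse = {}
load_refresh(warehouse, [{"id": 1, "city": "Oslo"}, {"id": 2, "city": "Rome"}])
load_update(warehouse, [{"id": 2, "city": "Oslo"}, {"id": 3, "city": "Rome"}])
city_index = build_index(warehouse, "city")
```

Refresh mode is simpler but rewrites everything; update mode touches less data but depends on the changes being captured reliably.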

Single-field transformation
In general, some transformation function translates data from the old form to the new form.

Algorithmic transformation: uses a formula or logical expression

Table lookup: another approach, which maps source values to target values through a lookup table
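
Both approaches can be shown on one field each. The temperature conversion and the state-code table are illustrative choices, not examples taken from the slides.

```python
def celsius_to_fahrenheit(temp_c):
    """Algorithmic transformation: apply a formula to the source value."""
    return temp_c * 9 / 5 + 32


# Hypothetical lookup table mapping source codes to target values.
STATE_NAMES = {"NY": "New York", "CA": "California"}


def lookup_state(code, table=STATE_NAMES):
    """Table lookup: translate a code through a table; None if unknown."""
    return table.get(code.upper())


boiling_f = celsius_to_fahrenheit(100)
state = lookup_state("ny")
```

A formula suits values that can be computed; a lookup table suits codes whose meaning has to be enumerated.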

Multi-field transformation

M:1: from many source fields to one target field

1:M: from one source field to many target fields
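
A minimal sketch of each direction follows. The choice of date parts for the M:1 case and a full name for the 1:M case, along with the function names, are assumptions made for illustration.

```python
def combine_date(year, month, day):
    """M:1: three source fields become one target field."""
    return f"{year:04d}-{month:02d}-{day:02d}"


def split_full_name(full_name):
    """1:M: one source field becomes two target fields."""
    first, _, last = full_name.strip().partition(" ")
    return {"first_name": first, "last_name": last.strip()}


order_date = combine_date(2024, 3, 7)
name_fields = split_full_name("Ada Lovelace")
```

Splitting on the first space is a deliberate simplification; real names need more careful parsing.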
