PROBLEMS
The IT business requires:
1. An integrated, company-wide view of high-quality information
2. A fixed network with changing users
3. Separation of informational processing systems from operational systems, to improve performance
Problems !
No single system holds all the data.
Viewing the databases as a whole is difficult.
The organization wants to analyze its activities in a balanced way.
Customer relationship management.
SO WHAT IS A DATA WAREHOUSE?
Subject-oriented: organized around major subjects such as customers, patients, students, products, and time.
OPERATIONAL SYSTEMS
Used to run a business in real time based on current data and process large volumes of relatively simple read/write transactions, while providing fast response.
Examples
1. Sales order processing 2. Reservation systems 3. Patient registration
INFORMATION SYSTEMS
Designed to support decision-making based on 1. Historical data 2. Prediction data.
DIFFERENCE
| Characteristic   | Operational Systems                                          | Informational Systems                                      |
| Purpose          | Real-time data entry                                         | Report on and analyze historical data                      |
| Primary users    | Clerks, salespersons, administrators                         | Managers, business analysts, customers                     |
| Scope of usage   | Narrow, planned, simple updates and queries                  | Broad, ad hoc, complex queries and analysis                |
| Design goal      | Performance: throughput, availability                        | Ease of flexible access and use                            |
| Typical activity | Many constant updates and queries on one or a few table rows | Periodic batch updates and queries requiring many or all rows |
BIG PICTURE
END USERS
Executives and managers
"Power" users (business and financial analysts, engineers, etc.)
Support users (clerical, administrative, etc.)
DATA MARTS
Create many data marts (DMs)
Limited scope
Independent ETL process, or derived from the DW
Examples:
1. Financial DM 2. Marketing DM 3. Supply chain DM
D.M. PICTURE
Archived data: data from the current business, together with old data, stored in archive files.
OLTP
Online transaction processing
Standard normalized structure
OLAP
Online analytical processing
Star schema [see table]
Read-only
Historical data
Aggregated data
CLEANING
Large volumes of data from multiple sources are involved.
High probability of errors and anomalies in the data.
Tools that help to detect and correct data anomalies can have a high payoff.
CLEANING
Examples where data cleaning becomes necessary: 1. Inconsistent field lengths, 2. Inconsistent descriptions, 3. Inconsistent value assignments, 4. Missing entries and violations of integrity constraints. Different classes of data cleaning tools are used while extracting and loading data.
DATA MIGRATION
Data migration tools allow simple transformation rules to be specified. Example: replace the string "gender" with "sex".
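A migration rule like the one above can be sketched in a few lines; the record layout and field names here are illustrative assumptions, not part of any specific migration tool.

```python
# Hypothetical data-migration rule: rename the field "gender" to "sex"
# in every source record before loading it into the warehouse.
def rename_field(record, old_name, new_name):
    """Return a copy of the record with old_name renamed to new_name."""
    return {(new_name if key == old_name else key): value
            for key, value in record.items()}

source_rows = [{"id": 1, "gender": "F"}, {"id": 2, "gender": "M"}]
migrated = [rename_field(row, "gender", "sex") for row in source_rows]
# migrated == [{"id": 1, "sex": "F"}, {"id": 2, "sex": "M"}]
```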
DATA SCRUBBING
Data scrubbing tools use domain-specific knowledge (e.g., postal addresses) to scrub the data. They use parsing and fuzzy matching techniques to clean data coming from multiple sources. Tools: Integrity and Trillium.
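To illustrate the fuzzy matching idea (this is a minimal sketch, not how the Integrity or Trillium products work), here is an example using Python's standard difflib; the reference street list is invented for the example.

```python
import difflib

# Domain-specific reference data assumed to exist; in a real scrubbing tool
# this would be something like a postal street directory.
REFERENCE_STREETS = ["Main Street", "Oak Avenue", "Elm Street"]

def scrub_street(street, cutoff=0.6):
    """Return the closest known street name, or the input if none is close."""
    matches = difflib.get_close_matches(street, REFERENCE_STREETS,
                                        n=1, cutoff=cutoff)
    return matches[0] if matches else street

scrub_street("Main Stret")   # -> "Main Street" (misspelling corrected)
scrub_street("Okk Avenue")   # -> "Oak Avenue"
```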
DATA AUDITING
Data auditing tools make it possible to discover rules and relationships by scanning data. Example: a tool may discover a suspicious pattern (based on statistical analysis), such as a certain car dealer having never received any complaints.
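The "car dealer with no complaints" pattern can be mimicked with a simple statistical scan; the thresholds and field names below are illustrative assumptions.

```python
# Flag dealers whose complaint count is suspiciously low given their sales
# volume -- more likely a sign of missing data than of perfect service.
def audit_dealers(dealers, min_sales=100):
    return [d["name"] for d in dealers
            if d["sales"] >= min_sales and d["complaints"] == 0]

dealers = [
    {"name": "A Motors", "sales": 500, "complaints": 12},
    {"name": "B Cars",   "sales": 800, "complaints": 0},  # suspicious
    {"name": "C Autos",  "sales": 40,  "complaints": 0},  # too few sales to judge
]
audit_dealers(dealers)  # -> ['B Cars']
```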
LOADING
Additional preprocessing is required:
1. Checking integrity constraints
2. Sorting, summarization, aggregation
3. Other computation to build the derived tables stored in the warehouse
Batch load utilities are used for this purpose. In addition to populating the warehouse, a load utility must allow the system administrator to monitor status; to cancel, suspend, and resume a load; and to restart after failure with no loss of data integrity.
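The preprocessing steps above (integrity checks, sorting, building a derived summary table) can be sketched as follows; the record layout is an assumption made for illustration.

```python
from collections import defaultdict

def preprocess(rows):
    # 1. Integrity check: drop rows with a missing product key or negative amount.
    clean = [r for r in rows if r.get("product") and r.get("amount", -1) >= 0]
    # 2. Sort for efficient bulk loading.
    clean.sort(key=lambda r: r["product"])
    # 3. Summarize: a derived table of totals per product.
    totals = defaultdict(float)
    for r in clean:
        totals[r["product"]] += r["amount"]
    return clean, dict(totals)

rows = [
    {"product": "B", "amount": 5.0},
    {"product": "A", "amount": 3.0},
    {"product": "A", "amount": -1.0},  # violates the integrity constraint
    {"product": None, "amount": 2.0},  # missing key
]
clean, totals = preprocess(rows)  # totals == {"A": 3.0, "B": 5.0}
```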
REFRESH
Refreshing a warehouse consists of propagating updates on source data to correspondingly update the base data and derived data stored in the warehouse. There are two sets of issues: when to refresh, and how to refresh.
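The "how to refresh" side can be sketched as an incremental update keyed on a per-row modification timestamp (an assumed feature of the source schema, used here only for illustration):

```python
from datetime import datetime

def refresh(warehouse, source_rows, last_refresh):
    """Propagate only source rows changed since the last refresh."""
    for row in source_rows:
        if row["updated_at"] > last_refresh:
            warehouse[row["key"]] = row["value"]
    return warehouse

warehouse = {"c1": "old value"}
source_rows = [
    {"key": "c1", "value": "new value", "updated_at": datetime(2024, 2, 1)},
    {"key": "c2", "value": "unchanged", "updated_at": datetime(2023, 12, 1)},
]
refresh(warehouse, source_rows, last_refresh=datetime(2024, 1, 1))
# warehouse == {"c1": "new value"}; c2 predates the last refresh, so it is skipped
```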
SUMMARIZATION
Summaries require a lot of storage space, as well as computing time and resources. Some summaries may contain figures that explain the summary. The advantage is that the data warehouse does not have to calculate the summaries at query time.
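The precomputation trade-off can be illustrated with a tiny summary built once at load time; queries then read the stored summary instead of re-aggregating the detail rows (the field names are invented for the example).

```python
from collections import Counter

def build_summary(sales):
    """Precompute total sales per month once, at load time."""
    summary = Counter()
    for s in sales:
        summary[s["month"]] += s["amount"]
    return dict(summary)

sales = [
    {"month": "Jan", "amount": 100},
    {"month": "Jan", "amount": 50},
    {"month": "Feb", "amount": 70},
]
monthly = build_summary(sales)  # {"Jan": 150, "Feb": 70}
```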
METADATA
Administrative metadata.
Business metadata: includes business terms and definitions.
Operational metadata: includes information collected during the operation of the warehouse.
Steps: Capture, Transform, Load and Index
Capture = extract: obtaining a snapshot of a chosen subset of the source data for loading into the data warehouse
Incremental extract = capturing changes that have occurred since the last static extract
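An incremental extract of this kind can be sketched as a filter on a per-row modification timestamp (assuming the source records carry one):

```python
def incremental_extract(source_rows, last_extract_time):
    """Capture only rows changed since the last static extract."""
    return [r for r in source_rows if r["modified"] > last_extract_time]

source_rows = [
    {"id": 1, "modified": 10},  # unchanged since the last extract at time 20
    {"id": 2, "modified": 25},  # changed afterwards, so it is captured
]
incremental_extract(source_rows, last_extract_time=20)
```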
Fixing errors: misspellings, erroneous dates, incorrect field usage, mismatched addresses, missing data, duplicate data, inconsistencies
Also: decoding, reformatting, time stamping, conversion, key generation, merging, error detection/logging, locating missing data
Transform = convert data from format of operational system to format of data warehouse
Record-level:
Selection (data partitioning), Joining (data combining), Aggregation (data summarization)
Field-level:
Single-field: from one field to one field. Multi-field: from many fields to one, or from one field to many.
Load/Index = place transformed data into the warehouse and create indexes
Single-field transformation
In general, some transformation function translates each data value from its old form to its new form.
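As a sketch of the field-level transformations described above, a single-field transformation can be a simple lookup function, and a multi-field transformation can combine several source fields into one (the mapping and names below are invented for illustration):

```python
# Single-field: translate one field's value from its old operational form
# to the warehouse form via an assumed lookup table.
STATE_NAMES = {"CA": "California", "NY": "New York"}

def transform_field(value, mapping=STATE_NAMES):
    return mapping.get(value, value)  # pass unknown values through unchanged

# Multi-field: many fields to one.
def full_name(first, last):
    return f"{last}, {first}"

transform_field("CA")          # -> "California"
full_name("Ada", "Lovelace")   # -> "Lovelace, Ada"
```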