
Data Warehouse:

A subject-oriented, integrated, time-variant, nonupdatable collection of data used in support of management decision-making processes.

Subject-oriented: e.g., customers, patients, students, products

Integrated: consistent naming conventions, formats, encoding structures; from multiple data sources

Time-variant: can study trends and changes

Nonupdatable: read-only, periodically refreshed

Why do we need data warehousing?

Integrated, company-wide view of high-quality information

Separation of operational and informational systems and data (for improved performance)

Generic Two-Level Architecture

Independent Data Marts

Data marts:

Mini-warehouses, limited in scope

Separate ETL for each independent data mart

Data access complexity due to multiple data marts

Dependent data mart with operational data store

ODS provides an option for obtaining current data

Simpler data access

Single ETL for enterprise data warehouse (EDW)

Dependent data marts loaded from EDW

Logical data mart and active data warehouse

ODS and data warehouse are one and the same

Near real-time ETL for active data warehouse

Data marts are NOT separate databases, but logical views of the data warehouse

Easier to create new data marts
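The "logical view" idea can be illustrated with an ordinary database view; a minimal sketch using Python's built-in sqlite3 module, with a hypothetical sales table standing in for the warehouse:

```python
import sqlite3

# In-memory database standing in for the enterprise data warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                 [("East", "laptop", 1200.0),
                  ("West", "phone", 650.0),
                  ("East", "phone", 700.0)])

# A 'logical data mart' for the East region: a view, not a separate database.
conn.execute("""CREATE VIEW east_mart AS
                SELECT product, amount FROM sales WHERE region = 'East'""")

# Mart users query the view; the data itself stays in the warehouse table.
print(conn.execute("SELECT * FROM east_mart").fetchall())
```

Because east_mart is only a view, creating a new data mart is just one more CREATE VIEW statement, with no separate database or extra ETL, which is why new marts are easy to add in this architecture.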

Event = a database action (create/update/delete) that results from a transaction

Data Characteristics: Transient vs. Periodic Data

Transient operational data

Changes to existing records are written over previous records, thus destroying the previous data content

Periodic warehouse data

Data are never physically altered or deleted once they have been added to the store
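The contrast is easy to see in a few lines of Python; the record layout here is invented for illustration:

```python
from datetime import date

# Transient (operational) store: an update overwrites the old value,
# destroying the previous data content.
transient = {"cust42": {"balance": 100.0}}
transient["cust42"]["balance"] = 250.0        # the old balance 100.0 is lost

# Periodic (warehouse) store: rows are never altered or deleted once added;
# each change is appended with a time stamp, preserving history.
periodic = [("cust42", 100.0, date(2024, 1, 1))]
periodic.append(("cust42", 250.0, date(2024, 2, 1)))  # history retained

print(transient)   # only the current state survives
print(periodic)    # full change history, usable for trend analysis
```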

Typical operational data is:

Transient, not historical

Not normalized (perhaps due to denormalization for performance)

Restricted in scope, not comprehensive

Sometimes of poor quality: inconsistencies and errors

After ETL, data should be:


Detailed, not summarized yet

Historical: periodic

Normalized

Comprehensive: enterprise-wide perspective

Quality controlled: accurate, with full integrity

Capture

Scrub (or data cleansing)

Transform

Load and Index

ETL = Extract, transform, and load
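A minimal sketch of how the four steps chain together; the function names and record format are assumptions for illustration, not any real tool's API. Each step is detailed on the following slides.

```python
def capture(source):
    # Extract: take a snapshot of a chosen subset of the source data.
    return list(source)

def scrub(rows):
    # Cleanse: here, simply drop records with missing fields.
    return [r for r in rows if all(v is not None for v in r.values())]

def transform(rows):
    # Convert to the warehouse format, e.g. standardize name casing.
    return [{**r, "name": r["name"].strip().title()} for r in rows]

def load(rows, warehouse):
    # Place transformed data into the warehouse store.
    warehouse.extend(rows)

warehouse = []
source = [{"name": " alice "}, {"name": None}]
load(transform(scrub(capture(source))), warehouse)
print(warehouse)   # [{'name': 'Alice'}]
```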

Steps in data reconciliation

Capture = extract: obtaining a snapshot of a chosen subset of the source data for loading into the data warehouse

Static extract = capturing a snapshot of the source data at a point in time

Incremental extract = capturing changes that have occurred since the last static extract
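A sketch of the two capture modes, assuming each source row carries a last-modified timestamp (an assumption made for illustration):

```python
from datetime import datetime

source = [
    {"id": 1, "modified": datetime(2024, 1, 10)},
    {"id": 2, "modified": datetime(2024, 3, 5)},
]

def static_extract(rows):
    # Snapshot of the source data at a point in time: take everything.
    return list(rows)

def incremental_extract(rows, last_extract):
    # Only changes that have occurred since the last extract.
    return [r for r in rows if r["modified"] > last_extract]

print(len(static_extract(source)))                             # 2 rows
print(len(incremental_extract(source, datetime(2024, 2, 1))))  # 1 row
```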

Steps in data reconciliation (continued)

Scrub = cleanse: uses pattern recognition and AI techniques to upgrade data quality

Fixing errors: misspellings, erroneous dates, incorrect field usage, mismatched addresses, missing data, duplicate data, inconsistencies

Also: decoding, reformatting, time stamping, conversion, key generation, merging, error detection/logging, locating missing data
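Real scrubbing tools layer pattern recognition and AI on top, but the core idea can be sketched with simple rules; the field names and specific fixes below are invented examples:

```python
import re

def scrub(records):
    seen, clean = set(), []
    for r in records:
        # Fix a known misspelling and reformat DD/MM/YYYY dates to ISO.
        city = {"Bombay": "Mumbai"}.get(r["city"], r["city"])
        m = re.match(r"(\d{2})/(\d{2})/(\d{4})", r["date"])
        date = f"{m.group(3)}-{m.group(2)}-{m.group(1)}" if m else r["date"]
        # Drop duplicate records, keyed on the customer id.
        if r["id"] in seen:
            continue                     # duplicate: would be logged here
        seen.add(r["id"])
        clean.append({"id": r["id"], "city": city, "date": date})
    return clean

rows = [{"id": 1, "city": "Bombay", "date": "31/12/2023"},
        {"id": 1, "city": "Bombay", "date": "31/12/2023"}]  # duplicate
print(scrub(rows))   # [{'id': 1, 'city': 'Mumbai', 'date': '2023-12-31'}]
```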

Steps in data reconciliation (continued)

Transform = convert data from the format of the operational system to the format of the data warehouse

Record-level:

Selection: data partitioning

Joining: data combining

Aggregation: data summarization

Field-level:

Single-field: from one field to one field

Multi-field: from many fields to one, or one field to many
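A sketch of the record-level and field-level operations over a made-up orders table:

```python
from collections import defaultdict

orders = [{"cust": 1, "amt": 100}, {"cust": 1, "amt": 50}, {"cust": 2, "amt": 75}]
customers = {1: "Alice Smith", 2: "Bob Jones"}

# Record-level: selection (data partitioning) -- keep only large orders.
large = [o for o in orders if o["amt"] >= 75]

# Record-level: joining (data combining) -- attach the customer's name.
joined = [{**o, "name": customers[o["cust"]]} for o in orders]

# Record-level: aggregation (data summarization) -- total per customer.
totals = defaultdict(int)
for o in orders:
    totals[o["cust"]] += o["amt"]

# Field-level, single-field: one field maps to one field (code -> label).
status = {"A": "active", "I": "inactive"}["A"]

# Field-level, multi-field: one field split into many (or the reverse).
first, last = customers[1].split(" ", 1)

print(large, dict(totals), status, first, last)
```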

Steps in data reconciliation (continued)

Load/Index = place transformed data into the warehouse and create indexes

Refresh mode: bulk rewriting of target data at periodic intervals

Update mode: only changes in source data are written to the data warehouse
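The two load modes, sketched with a dict standing in for the target table (index creation omitted):

```python
def refresh_load(warehouse, snapshot):
    # Refresh mode: bulk rewriting of target data at periodic intervals.
    warehouse.clear()
    warehouse.update(snapshot)

def update_load(warehouse, changes):
    # Update mode: only changes in source data are written to the warehouse.
    warehouse.update(changes)

wh = {}
refresh_load(wh, {1: "a", 2: "b"})   # full rewrite of the target
update_load(wh, {2: "B", 3: "c"})    # apply deltas only
print(wh)                            # {1: 'a', 2: 'B', 3: 'c'}
```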

OLAP (Online Analytical Processing):

The use of a set of graphical tools that provides users with multidimensional views of their data and allows them to analyze the data using simple windowing techniques

Data Mining

Knowledge discovery using a blend of statistical, AI, and computer graphics techniques

Goals:

Explain observed events or conditions

Confirm hypotheses

Explore data for new or unexpected relationships

Data mining is the process of:

Exploration and analysis

By automatic or semi-automatic means

Of large quantities of data

To discover meaningful patterns and rules

Classification: building a model that can be applied to unclassified data, e.g.:

Classifying credit applicants as low, medium, or high risk

Assigning customers to predefined customer segments

Estimation: deals with continuously valued outcomes, e.g.:

Estimating a family's household income

Estimating the value of a piece of real estate
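A sketch of both tasks using scikit-learn (an assumed dependency); the training data is entirely hypothetical:

```python
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Classification: hypothetical applicants as [income_k, debt_k] -> risk class.
X = [[80, 5], [30, 20], [55, 10], [20, 25], [90, 2], [40, 15]]
y = ["low", "high", "medium", "high", "low", "medium"]
clf = DecisionTreeClassifier(random_state=0).fit(X, y)
print(clf.predict([[60, 8]]))   # risk class for a new, unclassified applicant

# Estimation: continuously valued outcome, e.g. house size (sq ft) -> value (k).
sizes, values = [[1200], [800], [2000], [1500]], [350, 210, 600, 430]
reg = DecisionTreeRegressor(random_state=0).fit(sizes, values)
print(reg.predict([[1600]]))    # estimated value for an unseen house
```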

Prediction: classification according to future behaviour, e.g.:

Predicting which customers will leave within the next six months (churn)

Predicting which telephone subscribers will order a value-added service

Association rules: grouping which things go together, e.g.:

Shopping-cart analysis

Arrangement of items on store shelves

Cross-selling opportunities
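Association rules can be sketched as pair-counting over hypothetical shopping carts; real miners (e.g. Apriori) add confidence measures and larger itemsets:

```python
from itertools import combinations
from collections import Counter

# Hypothetical shopping carts for market-basket analysis.
carts = [{"bread", "butter", "milk"},
         {"bread", "butter"},
         {"milk", "eggs"},
         {"bread", "butter", "eggs"}]

# Count how often each pair of items is bought together (its support).
pairs = Counter()
for cart in carts:
    pairs.update(combinations(sorted(cart), 2))

# Pairs appearing in at least half the carts suggest rules like
# "bread => butter": candidates for shelf placement and cross-selling.
for pair, n in pairs.items():
    if n / len(carts) >= 0.5:
        print(pair, f"support={n / len(carts):.2f}")
```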

Clustering: segmenting a diverse group into a number of similar groups or clusters, e.g. by buying behaviour, affordability, or income level
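A minimal clustering sketch with scikit-learn's KMeans (an assumed dependency) over made-up customer attributes:

```python
from sklearn.cluster import KMeans

# Hypothetical customers: [annual income (k), spending score].
X = [[15, 80], [16, 75], [60, 50], [62, 48], [110, 10], [105, 15]]

# Segment the diverse group into a chosen number of similar clusters.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.labels_)   # cluster id assigned to each customer
```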

Description and Visualization: looking for explanations, e.g. women support Arvind Kejriwal in greater numbers than men

Data mining applications:

Process large quantities of satellite imagery: military intelligence

Determine which chemicals produce useful drugs (prediction techniques): pharmaceuticals

Process improvement using statistical process control: manufacturing

Behavioural data from order-tracking systems, billing systems, POS: marketing

Understanding the customers (who they are, what they are): CRM

OLAP Operations

Cube slicing: come up with a 2-D view of the data

Drill-down: going from summary to more detailed views

(Figures: slicing a data cube; drill-down)
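Both operations can be sketched with pandas (an assumed dependency) over a made-up sales fact table:

```python
import pandas as pd

# Hypothetical fact table: one row per sale.
sales = pd.DataFrame({
    "year":    [2023, 2023, 2024, 2024, 2024],
    "region":  ["East", "West", "East", "West", "East"],
    "product": ["phone", "phone", "laptop", "phone", "laptop"],
    "amount":  [100, 150, 900, 120, 1100],
})

# Cube slicing: fix one dimension (year = 2024) to get a 2-D view.
slice_2024 = sales[sales["year"] == 2024].pivot_table(
    values="amount", index="region", columns="product", aggfunc="sum")
print(slice_2024)

# Drill-down: from a summary by region to region x product detail.
summary = sales.groupby("region")["amount"].sum()
detail = sales.groupby(["region", "product"])["amount"].sum()
print(summary, detail, sep="\n")
```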
