Professional Documents
Culture Documents
A subject-oriented, integrated, time-variant, nonupdatable collection of data used in support of management decision-making processes. Subject-oriented: e.g. customers, patients, students, products Integrated: Consistent naming conventions, formats, encoding structures; from multiple data sources Time-variant: Can study trends and changes Nonupdatable: Read-only, periodically refreshed
Integrated, company-wide view of high-quality information. Separation of operational and informational systems and data (for improved performance).
Independent Datamarts
Data marts:
T
E Simpler Data Access
Data marts are NOT separate databases, but logical views of the data warehouse Easier to create new data marts
Changes to existing records are written over previous records, thus destroying the previous data content
Data are never physically altered or deleted once they have been added to the store
Transient not historical Not normalized (perhaps due to denormalization for performance) Restricted in scope not comprehensive Sometimes poor quality inconsistencies and errors
Detailed not summarized yet Historical periodic Normalized Comprehensive enterprise-wide perspective Quality controlled accurate with full integrity
Capture
Scrub
Capture = extractobtaining a snapshot of a chosen subset of the source data for loading into the data warehouse Incremental extract = capturing Static extract = capturing a changes that have occurred since snapshot of the source data at a the last static extract point in time
18
Scrub = cleanseuses pattern recognition and AI techniques to upgrade data quality Fixing errors: misspellings,
erroneous dates, incorrect field usage, mismatched addresses, missing data, duplicate data, inconsistencies
time stamping, conversion, key generation, merging, error detection/logging, locating missing data
19
Transform = convert data from format of operational system to format of data warehouse Record-level:
Field-level:
Load/Index= place transformed data into the warehouse and create indexes Refresh mode: bulk rewriting of
21
The use of a set of graphical tools that provides users with multidimensional views of their data and allows them to analyze the data using simple windowing techniques
Data Mining
Knowledge discovery using a blend of statistical, AI, and computer graphics techniques Goals:
Explain observed events or conditions Confirm hypotheses Explore data for new or unexpected relationships
Classification: Building a model that can be applied to unclassified data. E.g Classifying credit applicants as low, medium, high risk Assigning customers to predefined customer segments Estimation: Deals with continuously valued outcomes. E.g Estimating a familys household income Estimating the value of a piece of real estate
Prediction: Classification according to future behaviour e.g Predicting which customers will leave within the next six months (churn) Which telephone subscribers will order a value-added service Association rules: Grouping which things go together e.g shopping cart. arrangement of items on store shelves cross-selling opportunities
Clustering: Segmenting a diverse group into number of similar groups or clusters. e.g buying behaviour affordability, income level
Description and Visualization: Looking for explanation e.g Women support Arvind Kejriwal in greater numbers than men
Process large quantities of satellite imagerymilitary intelligence Determine which chemicals produce useful drugs(prediction techniques)- pharmaceutical Process improvement using statistical process control manufacturing Behavioural data from order tracking systems, billing systems, POS marketing Understanding the customers, who they are, what they are - CRM
OLAP Operations
Cube slicing come up with 2-D view of data Drill-down going from summary to more detailed views
Slicing a datacube
Drill Down