
Database Concepts

Data Warehousing
A data warehouse is an organized collection of internally and externally generated data that allows the business to make accurate decisions.
It is not just a database. It is an organized collection of databases designed specifically to support management decision making.

Data Warehousing

slide 2 of 33

Concepts of a Data Warehouse


Subject-Oriented
Organized around the key subject areas of the business: employees, customers, products, vendors, etc.

Integrated
Data throughout the data warehouse is stored using consistent technologies and rules so that data from different sources can be exchanged and combined without conflicts.

Time-variant
Much of the data is stored with date/time information so that trends can be followed.

Nonupdatable
Data cannot be updated by end users; data comes from operational systems.

The Need for Data Warehousing


A business requires an integrated, company-wide view of high-quality information.
One of the advantages a small business has over a large business is the close proximity between information and management. This allows the small business to react quickly to trends. As a business grows, reaction time slows.

Informational systems need to be separated from operational systems to improve data management.
Operational systems collect and manage data from business operations. Informational systems collect and manage the information used to make decisions.

Data Warehousing vs. Data Mart

A Company-Wide View


A large organization will have data in many places, and to make accurate decisions it needs access to all of those data sources.
A university is likely dealing with several databases: human resources keeps track of student employees, health services keeps track of students, and the registrar's office keeps track of student services. Each database will have fields for address and phone number, but chances are high that the data is stored in different table structures with different field names, and possibly on different computer systems.
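
The field-name mismatch described above can be sketched in a few lines. This is a minimal illustration, not a real integration layer: the two source systems, their field names, the sample rows, and the shared names (`student_id`, `address`, `phone`) are all hypothetical.

```python
# Hypothetical records from two campus systems that name the same fields differently.
hr_rows = [{"emp_id": 17, "addr": "12 Elm St", "phone_no": "555-0101"}]
health_rows = [{"StudentID": 17, "HomeAddress": "12 Elm St", "Tel": "(555) 0101"}]

# Per-source mapping from local field names to the shared warehouse names (assumed).
FIELD_MAPS = {
    "hr": {"emp_id": "student_id", "addr": "address", "phone_no": "phone"},
    "health": {"StudentID": "student_id", "HomeAddress": "address", "Tel": "phone"},
}


def to_common(source, row):
    """Rename one source record's fields to the shared schema and tag its origin."""
    mapping = FIELD_MAPS[source]
    record = {mapping[name]: value for name, value in row.items()}
    record["source"] = source
    return record


integrated = [to_common("hr", r) for r in hr_rows]
integrated += [to_common("health", r) for r in health_rows]
```

After the rename, both records expose the same field names, although the phone values still differ in format. Reconciling the values themselves is a separate cleansing step.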

Trends Promoting Use of DW


Organizations use multiple databases.
These databases are likely set up as different types on different systems.
The various database systems in an organization are likely not synchronized. If they are, it isn't done instantly, meaning there are data conflicts at most times.
Organizations want to look at different types of data in a single format to track overall improvement (or decline).
Organizations want to track/manage data from customers and suppliers in different areas of the business.

Operational Systems
Operational systems process large quantities of relatively simple data on a day-to-day basis.
They are necessary to run the business on a daily basis.
They comprise the bulk of the data collection for organizations.
Primary users are the clerks, salespeople, and store/office managers.

Informational Systems
Informational systems are designed to support decision making based on historical, point-in-time, and prediction data.
Example: Based on the Labor Day weekend sales in the last five years, what should the next two Labor Day weekend sales be, assuming an increase of 20% in advertising?
Queries of informational systems are much more complex than queries of operational systems.

Data Warehouse Architectures
Generic Two-Level Architecture
Independent Data Mart
Dependent Data Mart and Operational Data Store
Logical Data Mart and @ctive Warehouse
Three-Layer Architecture

Generic Two-Level DW Architecture


1. Data is extracted from various sources.
2. Data is transformed and integrated before being loaded into the data warehouse.
3. The data warehouse contains both detailed and summary data.
4. Users access the data warehouse by means of query and analytical tools.

Data Mart Architecture


A data mart is often a subset of a larger data warehouse.
Some organizations may opt for a data mart architecture instead of a data warehouse.
Organizations may opt for the speedier access to information provided by the data mart structure.
Technology differences and/or limitations may make the smaller data mart a more practical architecture.

Limitations of Independent DM
Problems with independent data marts include:
Separate end-user tools for each data mart may require costly redundancy.
Data marts may not be consistent with one another, limiting the ease with which an organizational manager can get comprehensive information.
Queries can't be customized so that they compare data from other data marts.
A dependent data mart addresses the first two limitations by using a central data warehouse that feeds subject-related data into data marts.

Logical Data Mart
A logical data mart has different relational views of a single physical data warehouse, instead of separate data marts.
Like partitioning a single hard drive.
New data marts can be created quickly.
Data marts are kept up-to-date since the source data is in the same system.
http://www.teradata.com/t/page/116324/
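
As a rough sketch of the idea that a logical data mart is a relational view over one physical warehouse, the following uses Python's built-in `sqlite3` module. The table `warehouse_sales`, the view `east_sales_mart`, and the sample rows are hypothetical; a production warehouse would use a different engine and schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE warehouse_sales (region TEXT, product TEXT, amount REAL);
    INSERT INTO warehouse_sales VALUES
        ('East', 'Widget', 100.0),
        ('West', 'Widget', 250.0),
        ('East', 'Gadget', 75.0);
    -- The "data mart" is only a view; no rows are copied out of the warehouse.
    CREATE VIEW east_sales_mart AS
        SELECT product, amount FROM warehouse_sales WHERE region = 'East';
""")

# A new warehouse row is visible through the view immediately.
conn.execute("INSERT INTO warehouse_sales VALUES ('East', 'Gizmo', 40.0)")
east_rows = conn.execute(
    "SELECT product, amount FROM east_sales_mart ORDER BY product"
).fetchall()
```

Because the view stores no data of its own, creating another mart is a single `CREATE VIEW` statement, which is one way to read the claim that new data marts can be created quickly.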

Three-Level Data Architecture


Operational data is stored in the various operational systems.
Reconciled data is stored in a data warehouse.
Derived data is stored in each of the data marts.
The metadata (third level) describes the process of creating the derived data.
Operational metadata
Enterprise data warehouse metadata
Data mart metadata

The Reconciled Data Layer


Reconciled data is data that has been organized from the operational systems into the data warehouse.
The data is detailed rather than summarized.
The data is periodic and doesn't predict future data.
The data is fully normalized, giving the data more integrity.
The data is comprehensive across the entire organization/enterprise.
The data is up to date.
The data's reliability must be unquestioned, since business decisions will ultimately be based on this data.

Status vs. Event Data

Status

Event = a database action (create/update/delete) that results from a transaction

Status

Transient vs. Periodic Data


Transient data = changes to existing records are written over previous records, thus destroying the previous data content

Periodic data = data are never physically altered or deleted once they have been added to the store
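
The difference can be sketched with two small in-memory stores. The customer key, the addresses, the dates, and the helper `address_as_of` are hypothetical and only illustrate overwrite versus append.

```python
from datetime import date

# Transient store: one slot per key, so an update destroys the old value.
transient = {"cust-1": "12 Elm St"}
transient["cust-1"] = "9 Oak Ave"

# Periodic store: every change is appended together with its effective date.
periodic = [("cust-1", date(2023, 1, 5), "12 Elm St")]
periodic.append(("cust-1", date(2024, 3, 1), "9 Oak Ave"))


def address_as_of(history, key, when):
    """Return the latest address for key that took effect on or before `when`."""
    matches = [(eff, addr) for k, eff, addr in history if k == key and eff <= when]
    return max(matches)[1] if matches else None


old_address = address_as_of(periodic, "cust-1", date(2023, 6, 1))
```

Only the periodic store can still answer a question about mid-2023; the transient store holds nothing but the current address.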

Steps in Data Reconciliation


Extract
Relevant data is captured from operational systems.
Static extract
Incremental extract

Cleanse
Duplicate, missing, and misspelled data is corrected.

Transform
Data is converted from the format used in the operational systems to the format used by the enterprise data warehouse.

Load
Transformed data is loaded into the enterprise data warehouse.
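
The cleanse, transform, and load steps above can be sketched as three small functions. Everything here is an assumption made for illustration: the sample rows, the `STATE_FIXES` correction table, the choice to flag missing values as "UNKNOWN" rather than repair them, and a plain list standing in for the enterprise data warehouse.

```python
# Hypothetical rows captured from an operational system.
raw = [
    {"cust": " alice ", "state": "Calif."},  # stray spaces, nonstandard state
    {"cust": "Bob", "state": "CA"},
    {"cust": "Bob", "state": "CA"},          # duplicate
    {"cust": "Cara", "state": None},         # missing value
]
STATE_FIXES = {"Calif.": "CA"}  # assumed correction table


def cleanse(rows):
    """Standardize names and states, flag missing states, and drop duplicates."""
    seen, clean = set(), []
    for row in rows:
        name = row["cust"].strip().title()
        state = STATE_FIXES.get(row["state"], row["state"]) or "UNKNOWN"
        if (name, state) not in seen:
            seen.add((name, state))
            clean.append({"cust": name, "state": state})
    return clean


def transform(rows):
    """Convert cleansed records to the (customer, state) tuples the target expects."""
    return [(row["cust"], row["state"]) for row in rows]


def load(rows, target):
    """Append transformed rows to the target store."""
    target.extend(rows)


warehouse = []
load(transform(cleanse(raw)), warehouse)
```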

Steps in Data Reconciliation (cont)


Incremental extract = capturing changes that have occurred since the last static extract

Static extract = capturing a snapshot of the source data at a point in time
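
A minimal sketch of the two extract types follows. It assumes each source row carries a last-modified timestamp; the `modified` field, the sample rows, and the cutoff date are hypothetical, and real systems may rely on change logs or triggers instead.

```python
from datetime import datetime

# Hypothetical operational rows; "modified" is an assumed last-change timestamp.
source_rows = [
    {"id": 1, "name": "Ann", "modified": datetime(2024, 1, 10)},
    {"id": 2, "name": "Bob", "modified": datetime(2024, 2, 20)},
    {"id": 3, "name": "Cy", "modified": datetime(2024, 3, 5)},
]


def static_extract(rows):
    """Snapshot: copy every row as it exists at this point in time."""
    return [dict(row) for row in rows]


def incremental_extract(rows, last_extract_time):
    """Capture only the rows changed after the previous extract."""
    return [dict(row) for row in rows if row["modified"] > last_extract_time]


snapshot = static_extract(source_rows)
changes = incremental_extract(source_rows, datetime(2024, 2, 1))
changed_ids = [row["id"] for row in changes]
```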

Data Reconciliation
Typical operational data is:
Transient, not historical
Not normalized (perhaps due to denormalization for performance)
Restricted in scope, not comprehensive
Sometimes poor quality, with inconsistencies and errors

After ETL, data should be:
Detailed: not summarized yet
Historical: periodic
Normalized: 3rd normal form or higher
Comprehensive: enterprise-wide perspective
Timely: data should be current enough to assist decision-making
Quality controlled: accurate with full integrity

Single-Field Transformation
In general, some transformation function translates data from the old form to the new form.

Algorithmic transformation = uses a formula or logical expression

Table lookup = another approach, uses a separate table keyed by source record code
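
Both approaches can be sketched in a few lines. The temperature conversion and the state-code table are hypothetical examples chosen for illustration, not taken from the slides.

```python
def fahrenheit_to_celsius(temp_f):
    """Algorithmic transformation: a formula maps the old value to the new one."""
    return round((temp_f - 32) * 5 / 9, 1)


# Table lookup: a separate table keyed by the source record's code (assumed data).
STATE_NAMES = {"CA": "California", "NY": "New York"}


def state_name(code):
    """Translate a source code to its target value, or 'Unknown' if unlisted."""
    return STATE_NAMES.get(code, "Unknown")


celsius = fahrenheit_to_celsius(212)
full_name = state_name("NY")
```

A lookup table suits codes that have no formula behind them, while a formula avoids maintaining a table when the mapping can be computed.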

Multi-field Transformation

M:1 = from many source fields to one target field

1:M = from one source field to many target fields
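
The two directions can be sketched as a pair of functions. The name fields and the `BRAND-SIZE` product code layout are hypothetical.

```python
def combine_name(first, last):
    """M:1: two source fields become one target field."""
    return f"{last}, {first}"


def split_product_code(code):
    """1:M: one source field becomes two target fields (assumed BRAND-SIZE layout)."""
    brand, size = code.split("-", 1)
    return {"brand": brand, "size": size}


display_name = combine_name("Ada", "Lovelace")
parts = split_product_code("ACME-500ML")
```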

Derived Data
Objectives
Ease of use for decision support applications
Fast response to predefined user queries
Customized data for particular target audiences
Ad-hoc query support
Data mining capabilities

Characteristics
Detailed (mostly periodic) data
Aggregate (for summary)
Distributed (to departmental servers)

Components of a Star Schema


Fact tables contain factual or quantitative data

1:N relationship between dimension tables and fact tables

Dimension tables are denormalized to maximize performance

Dimension tables contain descriptions about the subjects of the business

Star Schema Example


Fact table provides statistics for sales broken down by product, period, and store dimensions

Excellent for ad-hoc queries, but bad for online transaction processing
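
A small star schema along these lines can be sketched with Python's built-in `sqlite3` module. The table names, column names, and sample rows are hypothetical; the slide's own figure is not reproduced here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables: descriptions of the business subjects.
    CREATE TABLE product (product_key INTEGER PRIMARY KEY, description TEXT);
    CREATE TABLE period  (period_key  INTEGER PRIMARY KEY, quarter TEXT);
    CREATE TABLE store   (store_key   INTEGER PRIMARY KEY, city TEXT);
    -- Fact table: quantitative data, one foreign key per dimension.
    CREATE TABLE sales (
        product_key  INTEGER REFERENCES product(product_key),
        period_key   INTEGER REFERENCES period(period_key),
        store_key    INTEGER REFERENCES store(store_key),
        units_sold   INTEGER,
        dollars_sold REAL
    );
    INSERT INTO product VALUES (1, 'Widget'), (2, 'Gadget');
    INSERT INTO period  VALUES (1, 'Q1'), (2, 'Q2');
    INSERT INTO store   VALUES (1, 'Austin'), (2, 'Boston');
    INSERT INTO sales VALUES
        (1, 1, 1, 10, 100.0),
        (1, 2, 1,  5,  50.0),
        (2, 1, 2,  3,  90.0),
        (1, 1, 2,  4,  40.0);
""")

# Ad-hoc question: first-quarter dollars sold, broken down by product.
dollars_by_product = conn.execute("""
    SELECT p.description, SUM(s.dollars_sold)
    FROM sales AS s
    JOIN product AS p ON p.product_key = s.product_key
    JOIN period  AS t ON t.period_key  = s.period_key
    WHERE t.quarter = 'Q1'
    GROUP BY p.description
    ORDER BY p.description
""").fetchall()
```

Each dimension joins to the fact table through a single key, so a new breakdown (by store, by quarter) only changes which dimension is joined and grouped.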

Star Schema with Sample Data

Issues Regarding Star Schema


Dimension table keys must be surrogate (non-intelligent and non-business related), because:
Keys may change over time
Length/format consistency
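
One simple way to assign surrogate keys is a counter that hands out a meaningless integer per business key. The class `SurrogateKeyMap` and the sample product codes are hypothetical, and many databases provide this through auto-incrementing columns instead.

```python
from itertools import count


class SurrogateKeyMap:
    """Hand out meaningless integer keys, one per business key."""

    def __init__(self):
        self._next_key = count(1)
        self._keys = {}

    def key_for(self, business_key):
        """Return the existing surrogate key, or assign the next integer."""
        if business_key not in self._keys:
            self._keys[business_key] = next(self._next_key)
        return self._keys[business_key]


products = SurrogateKeyMap()
first = products.key_for("SKU-A100")
second = products.key_for("widget_2024_rev_b")  # different length and format
repeat = products.key_for("SKU-A100")           # same business key, same surrogate
```

The surrogate keys all share one format regardless of how the source systems format their own identifiers.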

Granularity of fact table: what level of detail do you want?


Transactional grain: finest level
Aggregated grain: more summarized
Finer grain: better market basket analysis capability
Finer grain: more dimension tables, more rows in fact table
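
The trade-off can be sketched with a few hypothetical line items: rolling them up to a daily grain shrinks the data, but only the transactional grain can still say which products were bought together.

```python
from collections import defaultdict

# Transactional grain: one row per receipt line item (hypothetical data).
line_items = [
    {"receipt": 1, "day": "2024-01-03", "product": "Widget", "units": 2},
    {"receipt": 1, "day": "2024-01-03", "product": "Gadget", "units": 1},
    {"receipt": 2, "day": "2024-01-03", "product": "Widget", "units": 5},
]

# Aggregated grain: one row per (day, product); receipt detail is gone.
daily_units = defaultdict(int)
for item in line_items:
    daily_units[(item["day"], item["product"])] += item["units"]

# Market basket question, answerable only at the transactional grain.
baskets = defaultdict(set)
for item in line_items:
    baskets[item["receipt"]].add(item["product"])
```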

Duration of the database: how much history should be kept?


Natural duration: 13 months or 5 quarters
Financial institutions may need longer duration
Older data is more difficult to source and cleanse

User Interface for Data Warehouse


Even the best organized and managed data is useless if the end-user cannot access the data in an effective and efficient way.
Different classifications of tools used for accessing the data warehouse:
Traditional query and reporting tools
Online analytical processing
Data-mining tools
Data-visualization tools

Tools for Data Warehouses


Traditional Query Tools
Include spreadsheets, personal computer database programs, and SQL queries.

OLAP (MOLAP, ROLAP) Tools


Online analytical processing (OLAP) is the use of a set of graphical tools that provide users with multidimensional views of their data and allow them to analyze the data using simple windowing techniques.

Slicing a Data Cube
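
Slicing means fixing one dimension of the cube at a single value and viewing the remaining dimensions as a table. A minimal sketch, assuming a hypothetical cube stored as a dictionary keyed by (product, period, store):

```python
# Hypothetical cube: (product, period, store) -> units sold.
cube = {
    ("Widget", "Q1", "Austin"): 10,
    ("Widget", "Q2", "Austin"): 5,
    ("Gadget", "Q1", "Boston"): 3,
    ("Widget", "Q1", "Boston"): 4,
}


def slice_cube(cube, product):
    """Fix the product dimension and return the remaining period-by-store table."""
    return {
        (period, store): units
        for (prod, period, store), units in cube.items()
        if prod == product
    }


widget_slice = slice_cube(cube, "Widget")
```

OLAP tools perform the same operation interactively, typically over far larger cubes and with precomputed aggregates.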

Tools for Data Warehouses (cont)


Data-mining Tools
Knowledge discovery using a sophisticated blend of techniques from traditional statistics, AI, and computer graphics. Automated searches of the data can find potential problems or potential sales avenues.
Customer 593851042 has charged an average of 3 purchases per month for the last 12 months. Over the weekend, the card was used 7 times. Possible credit card theft?
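
The credit card example can be reduced to a simple threshold rule. This is a deliberately naive sketch: the function `is_unusual` and the factor of 2 are assumptions, and real data-mining tools use far richer statistical models than a single comparison.

```python
def is_unusual(recent_count, monthly_average, factor=2.0):
    """Flag activity that exceeds `factor` times the customer's usual monthly count."""
    return recent_count > factor * monthly_average


# 7 purchases over one weekend against an average of 3 per month.
flag = is_unusual(recent_count=7, monthly_average=3)
```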

Data Visualization
The representation of data in graphical and multimedia formats for human analysis.
Trends and patterns can often be easier to see and act on when presented in a visual format.
