You are on page 1of 112

Introduction to

Data Warehouse

(slides in this section


are used courtesy of
Carrig Emerging Technology
Ph: 410- 553- 6760
www.c a r r i g e t. c o m )

Introduction
Introduction to
to Data
Data Warehousing
Warehousing and
and Data
Data
Mining
Mining
1) Data Warehouse Introduction
2) Engineering Conflicts
3) OLTP and DSS
4) Stovepipe vs. Integration
5) Data Warehouse Solution
6) Enterprise Information System
7) Security in a Data Warehouse
8) Moving Data to a Data Warehouse
9) Data Marts
10) Data Mining
2

1
Introduction
Introduction

• Key topics for this course include:


– Data Warehouse
– Data Mart
– Data Mining

• Background and review of relational database


systems
• Main focus on data warehouse and data mining

Data
Data Warehouse
Warehouse Introduction
Introduction
• A data warehouse is a single source for key,
corporate information needed to enable business
decisions

• A database application is a piece of software that


provides a user interface for users to add, delete,
query and update data

• Typically, a database management system is


used to actually do the work of adding, deleting,
querying or updating data

Database Data
Application System

2
Engineering
Engineering Conflicts,
Conflicts, Query
Query and
and Update
Update

• It is often an engineering problem when data is


updated and long-running queries occur at the
same time
• In some cases, the users who are doing updates
must wait for queries to complete
• One way to avoid this is to make a read-only
copy of data

Database System
Application
Data Data
for update for query

OLTP
OLTP and
and DSS
DSS Defined
Defined
• An application that updates is called an on-line
transaction processing (OLTP) application

• An application that issues queries to the read-


only database is called a decision support system
(DSS)

OLTP
Application
Database System

OLTP DSS
DSS Data Data
Application

3
Applications
Applications in
in aa Typical
Typical Enterprise
Enterprise
• Most organizations have several disparate
OLTP/DSS applications in several databases

Finance Inventory Sales


OLTP OLTP OLTP
Application Application Application

Finance Inventory Sales


DSS DSS DSS
Application Application Application

DATABASE SYSTEM
Finance Inventory Sales
OLTP OLTP OLTP
Data Data Data

Finance Inventory Sales


DSS DSS DSS
Data Data Data

Stovepipe
Stovepipe vs
vs Integration
Integration
• When systems stand by themselves they are
often referred to as “stovepipes”

• Systems that easily share data are called “well


integrated systems”

Finance
OLTP Inventory
Application OLTP
Application

Finance
DSS Inventory
Application DSS
Application

4
Problems
Problems with
with Stovepipe
Stovepipe Architecture
Architecture
• Problems:
– Users who wish to access data must query several different DSS to
find it
– Data may have fundamental conflicts between DSS
– a department code table in one DSS may differ in another DSS
– a measurement may be stored in meters in one DSS and yards in
another

• Solution:
– Use a data warehouse, where data is integrated
from the several different stovepipe systems
– Data warehouse is really sharing-lite -- you
don’t have to co-ordinate as much when applications are built
and you still reap the benefits of data sharing

Data
Data Warehouse
Warehouse Solution
Solution

• A data warehouse is an attempt to integrate


separate DSS so that users can query one place
to find the answers to their questions

• A data warehouse has the key, corporate data in


the organization

• A data warehouse tracks historical


data

10

5
Data
Data Warehouse
Warehouse -- A
A Success
Success Story
Story

• Largest data warehouse is Wal-Mart (9 TB)

• Uses for Wal-Mart data warehouse


– Identifies where a new store should be built based on customer demand
– Identifies how stores are performing across the nation
– Contains every “scan” from every purchase

• Benefits Wal-Mart gained from their data warehouse


– Provided competitive advantage over K-Mart
– Reduced excess inventory in individual stores
– Avoided wasted funds in building stores which would fail

11

Selling
Selling the
the Data
Data Warehouse
Warehouse
• A data warehouse project will fail without
corporate sponsorship
– Preferably, the project should be sponsored by the CEO
– The CEO must be sold on the value to the business to
improve competitive advantage by deploying a data warehouse

• If an active, corporate sponsor does not exist,


data sources will be very difficult to identify

• Only add data to the warehouse


that will answer key,
corporate questions asked
by the corporate sponsor.
Otherwise, you will have a data dump
12

6
Building
Building aa Useful
Useful Data
Data Warehouse
Warehouse
• You really need:
– strong executive sponsorship
– good knowledge of the data
– sound software engineering
– stability from source systems
– users who want a success

• A 75 percent failure rate is often cited

• It is WORTH the effort!!!

13

Enterprise
Enterprise Information
Information System
System
• An EIS (Enterprise Information System) allows
users to query data in a data warehouse

• Users can access key, corporate data in the data


warehouse
Enterprise
Information
System

Data Warehouse

14

7
Users
Users of
of an
an Enterprise
Enterprise Information
Information System
System
• Frequently, multiple EIS are needed to satisfy
different types of users
– Some users only want a system that has pre-defined reports so they
only need to “click one button” to see data they need. These users
want the system to be no harder to use than a “coffee pot”
– Other users want to delve into the data and build their own queries

• Executives want a high-level, summary


data and a simple tool
– Must be VERY easy to use, users want to click a few
buttons and get data they want
– Results must be graphs
– Users should be able to drill-down into key areas.

15

Users
Users of
of an
an Enterprise
Enterprise Information
Information System
System
• Analysts want a flexible, more detailed tool
– Often very knowledgeable about the data
– Willing to do more work to learn about the data
– Sometimes even learn SQL to issue their own
ad-hoc queries

• General users want a tool that provides detailed


data, but is very easy to use
– Want access to the data warehouse to do
routine tasks such as
“Find me Hank’s phone number”, etc.
– Simple application, but not so focused
on large reports

16

8
Data
Data Warehouse
Warehouse // EIS
EIS
Inventory
OLTP
Finance Inventory Data
OLTP Sales
OLTP OLTP
Application Application Application

Finance Enterprise SSaalleess


OLTP Information OLTP
OLTP
Data Data
System Data

Data Warehouse

Finance Inventory Sales


Subject Subject Subject
Area Area Area

17

Need
Need for
for Data
Data Warehouses
Warehouses

• Data warehouses provide a single place to store key


corporate data
– The idea is that users can go one place to find this key data using an
enterprise information system (EIS)

• Data warehouse is also a place to store and access


historical data
– Users measure performance goals for their company over a period of
time
– Company statistics are available
– Data not stored in the same place is difficult to locate
and compare, easily lost
– Single query can be used to access key data

18

9
Security
Security in
in Data
Data Warehouse
Warehouse
• Building a data warehouse does increase security
risk because key, corporate information is all in
one place

• To mitigate that risk, database system


components can be used to protect the data
warehouse. These include
– Views
– Access control
– Security Administration
– Encryption
– Audit

19

Moving
Moving Data
Data into
into the
the Data
Data Warehouse
Warehouse
• Moving data from source OLTP systems to the
data warehouse is the hard part of data
warehousing

• Updates to the data warehouse are performed


periodically
– weekly
– nightly
– monthly

• Occasionally, real-time
data is needed in a data warehouse, but this
is not very common

20

10
Using
Using Middleware
Middleware to
to Move
Move Data
Data
• Data can be moved to the warehouse via data
migration software

• This is often called “middleware” because it sits


between the source OLTP and the data
warehouse

Data
Warehouse
Source Migration Data
OLTP Software Warehouse
System “Middleware”

21

Need
Need for
for aa Data
Data Mart
Mart
• A data mart is a subset of the data warehouse
that may make it simpler for users to access key
corporate data
– Sometimes, users only need a piece of data from the data
warehouse

• The data mart is typically fed from the data


warehouse
Data Warehouse

Finance Inventory Sales


Subject Subject Subject
Area Area Area

New York California


Data Mart Data Mart

22

11
Data
Data Mart
Mart in
in Action
Action
Inventory
OLTP
Finance Inventory Data
OLTP Sales
OLTP OLTP
Application Application Application

Finance Enterprise SSaalleess


OLTP Information OLTP
OLTP
Data Data
System Data

Data Warehouse

Finance Inventory Sales


Subject Subject Subject California
New York Area Area Area Data Mart
Data Mart

23

Data
Data Mining
Mining Introduction
Introduction
• Data Mining is done by running software that
examines a database and looks for patterns in the
data

• A data warehouse by itself will respond to queries


from users
– It will not tell users about patterns in data that users may not have
thought about
– To find patterns in data, data mining is
used to try and mine key information from
a data warehouse

24

12
Advantages
Advantages of
of Data
Data Mining
Mining
• Data mining allows companies to collect
information and make them more productive and
beat their competition

• Data mining helps identify


– why customers buy certain products
– ideas for very direct marketing
– ideas for shelf placement
– training of employees vs. employee retention
– employee benefits vs. employee retention

25

Implementing
Implementing Data
Data Mining
Mining
• Apply data mining tools to run data mining
algorithms against data
• There are two approaches:
– Copy data from the Data Warehouse and mine it
– Mine the data in the Data Warehouse

• Popular tools use a variety of different data


mining algorithms:
– association rules
– genetic algorithms
– decision trees
– neural networks

26

13
Data
Data Mining
Mining using
using Separate
Separate Data
Data
• You can move data from the data warehouse to
data mining tools
– Advantages
– Data mining tools may organize data so they can run faster
– Disadvantages
– Could be very expensive to move large
amounts of data
Data Warehouse

Data Mining Tool


Copy of data made
by the
Data Mining Tool

27

Data
Data Mining
Mining Against
Against the
the Data
Data Warehouse
Warehouse

• Data mining tools can access data directly in the


Data Warehouse
– Advantages
– No copy of data is needed for data mining
– Disadvantages
– Data may not be organized in a way that is
efficient for the tool
Data Warehouse

Data Mining Tool

28

14
Data
Data Mining:
Mining: Summary
Summary
• Data mining attempts to find patterns in data that
we did not know about
• Often data mining is just a new buzzword for
statistics
• Data mining differs from statistics in that large
volumes of data are used
• Many different data mining algorithms exist and
we will discuss them in the course
• Examples
– identify users who are most likely to commit credit card fraud
– identify what attributes about a person most results in them buying
product x.
29

SQL Review

(slides in this section


are used courtesy of
Carrig Emerging Technology
Ph: 410- 553- 6760
www.c a r r i g e t. c o m )

30

15
Introduction
Introduction to
to SQL
SQL

1) Introduction to SQL
2) Data Definition Language (DDL)
3) Data Manipulation Language (DML)
4) SELECT Construct
5) SELECT Operators
6) Wildcard Searches
7) Aggregate Operators
8) Calculated Attributes
9) Sorting Results

31

Introduction
Introduction to
to Structured
Structured Query
Query Language
Language

• Structured Query Language (SQL) is the language


used to communicate with a relational database
– Industry standard
– Based on set theory

• SQL composed of two types of constructs:


– Data Definition Language (DDL)
– Defines the structure of the database
– Data Manipulation Language (DML)
– Provides the constructs to input and retrieve data

32

16
SQL
SQL Overview
Overview -- DDL
DDL
• Data Definition Language (DDL) is used to
describe the structure of the database
– Create tables, indexes, etc.
– Typical Operations are:
– CREATE TABLE defines what columns are in the table and
establishes the table
– CREATE INDEX defines an index for the table. Indexes are used
to improve database performance

33

SQL
SQL Overview
Overview -- DML
DML
• Data Manipulation Language (DML) is used for
storing, updating, and retrieving data.

• Typical operations include:


– SELECT is used to retrieve data.
– Ex: SELECT * FROM PRODUCTS
– INSERT is used to add new rows to the database.
– INSERT INTO PRODUCTS VALUES ('food',
'hardware', 'housewares')
– UPDATE is used to change rows that already exist in the database.
– UPDATE PRODUCTS SET PRICE = PRICE + 4
– DELETE is used to eliminate rows of data from the database.
– DELETE FROM PRODUCTS

34

17
SELECTOverview
SELECT Overview
• SELECT is used to retrieve records from the
database.

• Single table SELECT constructs:


– WHERE
– IN
– BETWEEN
– LIKE
– Aggregate Operators
– DISTINCT
– ORDER BY

35

SELECTExamples
SELECT Examples
• Query Purpose: Retrieve names and prices of all
products

SELECT ProductName, Price


FROM TinyProducts

• Query Purpose: Retrieve all information for all


employees from the TinyProducts table

SELECT *
FROM TinyProducts

36

18
SELECTwith
SELECT with WHERE
WHERE
• The WHERE clause is used to filter which
information is returned from a SELECT

• Query Purpose: Retrieve all information only for


product type of “food”

SELECT *
FROM TinyProducts
WHERE ProductType = ‘Food’

37

Use
Use of
of Boolean
Boolean Operators
Operators
• Conditions can be separated by Boolean
operators:
– AND, OR, NOT

• Query Purpose: List all information about food


products that are either cereal or fruit

SELECT *
FROM TinyProducts
WHERE (ProductName = 'Cereal')
OR (ProductName = 'Fruit')

38

19
Boolean
Boolean Operator
Operator Example
Example
• Query Purpose: List the names of all products
that the type is fruit and the price is less than
$2.00

SELECT ProductType, ProductName


FROM TinyProducts
WHERE Price < 2
AND ProductName = 'Fruit'

39

IN Operator
IN Operator
• The IN operator allows a search for records that
match one value in a set of unordered values

• Example questions to use IN:


– 'Find all products whose type is Food, Hardware, or Housewares'
– 'Find all food whose type is Meat, Fish, Vegetables, or Fruit'

40

20
IN Example
IN Example
• Query Purpose: List the name of Housewares that
are Cookware, Linens, or Dishes

SELECT ProductName, ProductType


FROM TinyProducts
WHERE ProductName in
('Cookware', 'Linens', 'Dishes')

instead of:

SELECT ProductName, ProductType


FROM TinyProducts
WHERE (ProductName = ’Cookware')
OR (ProductName = 'Linens')
OR (ProductName = 'Dishes') 41

BETWEEN Operator
BETWEEN Operator
• The BETWEEN operator allows a search for a range
of values

• Example Queries:
– 'Find all fruit between Bananas and Grapes'
– 'Find all cereals whose price is between $1.50 and $4.00 a box

1.50 4.00

42

21
BETWEEN Example
BETWEEN Example
• Query Purpose: Find all products whose price is
between $2.00 and $8.00

SELECT ProductName, Price


FROM TinyProducts
WHERE Price BETWEEN 2.00 AND 8.00

instead of:

SELECT ProductName, Hardware


FROM TinyProducts
WHERE (Price >= 2.00) OR (Price <= 8.00)

43

Wildcard
Wildcard Searches
Searches of
of Strings
Strings
• The LIKE operator is used to search parts of a
string

• The following wildcard characters are used:


% to match any zero or more characters
_ to match exactly one character

44

22
Wildcard
Wildcard Search
Search Examples
Examples

• Query Purpose: List all products whose name


starts with an ’C'
SELECT *
FROM TinyProducts
WHERE ProductName LIKE 'C%'

• Query Purpose: List all products that have a SKU


number with the last 2 characters of ’23' when
you don't know the first character

SELECT *
FROM TinyProducts
WHERE SKUNumber LIKE '_23'
45

Aggregate
Aggregate Operators
Operators
• MIN, MAX, and AVERAGE are used when computing
statistics on a range of data

• Query Examples:
– 'What is the highest batting average on the team?'
– 'What is the average number of hits for all the little league teams in
the National League?'
– 'What are the names of the players that had the lowest average on
the little league team?'

46

23
Aggregate
Aggregate Operators
Operators Example
Example

• Query Purpose: Find the minimum, maximum,


and average batting average of all players in the
National League of Little League

SELECT MIN(Average), MAX(Average),


AVG(Average)
FROM PLAYERS
WHERE League = 'National'

47

SUM and
SUM COUNT Operators
and COUNT Operators
• Use the SUM operator to total the results of a
query

• COUNT will count the total number of occurrences


of an item in a search

1+2+3+4
48

24
SUM And
SUM COUNT Examples
And COUNT Examples
• Query Purpose: Find the total number of
homeruns hit by all players in the American
League?
SELECT SUM(HomeRuns)
FROM PLAYERS
WHERE League='American'

• Query Purpose: List the names of players that


have hit 3 home runs in the National League?

SELECT COUNT(*)
FROM PLAYERS
WHERE HomeRuns = '3'
AND League = 'National'
49

Calculated
Calculated Attributes
Attributes
• A new attribute can be obtained by using
arithmetic operators (+,-, *, /) on other
numeric attributes

• All operators follow standard precedence:


– Multiplication and division are computed first left to right
– Addition and subtraction are computed last left to right
– Use parenthesis to override the standard precedence

(+,-, *, /)
50

25
Calculated
Calculated Attributes
Attributes Example
Example
Query Purpose: List all players with their hits, at
bats, and their batting average

SELECT Name, Hits, AtBats,


(Hits / AtBats)
FROM PLAYERS

51

DISTINCT Operator
DISTINCT Operator
• DISTINCT is used to exclude duplicate
occurrences in the result of a query

• Query Purpose: List all distinct batting averages

SELECT DISTINCT(Average)
FROM PLAYERS

52

26
Sorting
Sorting Query
Query Results
Results
• The ORDER BY clause is used at the end of the
SELECT statement to sort the results of a query

• Use DESC on the end of the ORDER BY clause to


sort the data in descending order. Otherwise, the
result will be in ascending order

53

Sorting
Sorting Example
Example
• Query Purpose: List all players in ascending
order of their batting average

SELECT Name, Average


FROM PLAYERS
ORDER BY Average

• For descending order add the keyword DESC

SELECT Name, Average


FROM PLAYERS
ORDER BY Name DESC
54

27
Sorting
Sorting Calculated
Calculated Attributes
Attributes
• To refer to a computed attribute in the ORDER BY,
use its position in the list of columns following
SELECT

• Query Purpose: List all players in descending


order of their batting average (here we assume
batting average is computed at the time of the
query)

SELECT Name, Hits, AtBats,


Hits / AtBats
FROM PLAYERS
ORDER BY 3 DESC
55

More
More SQL
SQL

1) GROUP BY Construct
2) HAVING Filter
3) Multiple Tables
4) Joins
5) Equijoins
6) Cartesian Product
7) Nulls
8) OUTER JOIN

56

28
GROUP BY Clause
GROUP BY Clause
• GROUP BY will partition a table into multiple
groups of related rows.

• As an example, consider the EMPLOYEE table


where Department partitions the EMPLOYEE set
into subsets:

Engineering

Marketing Customer

Finance

57

GROUP BY Example
GROUP BY Example
• Query Purpose: For each department, list the
average salary using the EMPLOYEE table

SELECT Department, AVG(Salary)


FROM EMPLOYEE
GROUP BY Department

58

29
GROUP
GROUP
GROUP BY With
GROUP BY
BY
BY With WHERE
WithWHERE
WHERE
WHERE

• To filter data further, we can use the WHERE


clause with GROUP BY clause

Query Purpose: For each department, list the


highest salary of their administrative assistants.

SELECT Department, MAX(Salary)


FROM EMPLOYEE
WHERE Title='administrative assistant'
GROUP BY Department

59

HAVING Construct
HAVING Construct
• HAVING is used to restrict the output of aggregate
functions, such as SUM, MIN, MAX and AVG, to only
those groups of rows that meet some condition.

Query Purpose: List the average salary for all


departments that have more than three
employees.

SELECT Department, AVG(Salary)


FROM EMPLOYEE
GROUP BY Department
HAVING COUNT(*) > 3

60

30
Multi-Table
Multi-Table SQL
SQL
• It is often necessary to combine data into multiple
tables.

EMPLOYEE ATTENDS
EmpID Name Salary EmpID Name
1 Fred 200 1 Harvard
2 Ethel 300 2 GMU
3 Mike 400 2 Yale
4 David 100 3 MIT
3 Stanford
3 GMU

61

Joins
Joins
• Joins are the means by which multiple tables can
be combined.

• A join allows us to combine data from different


tables. A join operation is done through the
SELECT construct.

• Types of Joins: Equijoin, Outer Join, Inner Join

62

31
Equijoin
Equijoin
• Joins only those rows where a foreign key
matches the primary key

• Allows information from multiple tables to be


linked together in a single query

• Can be used to link as many tables as needed in a


single query

63

Equijoin
Equijoin Query
Query Example
Example

• Query Purpose: List the names of all colleges


attended by Ethel

SELECT b.Name
FROM EMPLOYEE a, ATTENDS b
WHERE a.EmpID = b.EmpID
AND a.Name = 'Ethel'

64

32
Equijoin
Equijoin Example
Example

EMPLOYEE
EmpID Name Salary

1 Fred 200
2 Ethel 300
3 Mike 400

ATTENDS
EmpID College GPA

1 Harvard 2.45
2 GMU 3.79
2 Nova 3.65
3 Yale 2.85
3 Nova 2.65
3 GMU 4.0

65

Warning
Warning about
about Joining
Joining Tables
Tables
• A join is really just a subset of a cartesian
product. When no fields are 'joined' in the WHERE
clause, a cartesian product is produced
– Restated in English: When the linking condition is omitted from
the WHERE clause, you get a lot of excess garbage that you
probably do not want.

Sample Query:

SELECT b.Name
FROM EMPLOYEE a, ATTENDS b
WHERE a.Name = 'Ethel'

66

33
Cartesian
Cartesian Product
Product
• Each row in one table with every other row in other
table
a.EmpID a.Name a.Salary b.EmpID b.GPA

2 Ethel 300 1 3.4


2 Ethel 300 2 2.8
2 Ethel 300 3 3.7
2 Ethel 300 4 3.5
....

67

Nulls
Nulls
• An attribute may be defined as null.

• This indicates that the value is unknown and


avoids the need for user-defined special
indicators.

• To prevent a column from having nulls, specify


NOT NULL on the column in the CREATE TABLE
statement when setting up the database.

68

34
Nulls
Nulls Examples
Examples
Statement Purpose: Add an employee whose salary
is unknown

INSERT INTO EMPLOYEE (3,'Hank', NULL)

Query Purpose: Find all employees whose salary is


unknown (or null)

SELECT *
FROM EMPLOYEE
WHERE Salary IS NULL

69

OUTER
OUTER JOIN
JOIN
• An OUTER JOIN is used when the query should
return a result row even for rows that do not have
corresponding data in one of the tables.

• A LEFT OUTER JOIN returns all rows from the


'left' table.

• Nulls are returned when a row in the 'left' table


has no corresponding rows in the right table.

70

35
LEFT
LEFT OUTER JOIN Example
OUTER JOIN Example
• Query Purpose: List the college GPAs for each
employee. Include employees who have not
attended any colleges

SELECT a.Name, b.GPA


FROM EMPLOYEE a
LEFT OUTER JOIN ATTENDS b
on a.EmpID = b.EmpID

71

LEFT
LEFT OUTER
OUTER JOIN
JOIN Example
Example
• Result of the outer join
– All employees are listed.
– For an equijoin, only those who attended a college would be listed
– Here, employee number 4 did not attend college, but is still
retrieved by the outer join.

Name GPA

---------- -----
Fred 2.45
Ethel 3.79
Ethel 3.65
Mike 2.85
Mike 2.65
Mike 4.00
David NULL 72

36
Advanced SQL

(slides in this section


are used courtesy of
Carrig Emerging Technology
Ph: 410- 553- 6760
www.c a r r i g e t. c o m )

73

Advanced
Advanced SQL
SQL

1) Finding the nth element in a list


2) Finding the median
3) Correlated subquery
4) Data Definition Language Constructs

74

37
Find
Find the
the Nth
Nth Element
Element
• It is very common to try to find the nth element in
a list.
– Examples:
– Who makes the second highest salary in marketing department?
– What is the fifth best product in sales?
– This can be done with a program that uses SQL to access the
database: SQL is sent to the database and the program keeps
retrieving the result set until the threshold is crossed.

• We show another way of


doing this using standard SQL.

75

Find
Find the
the Nth
Nth Element:
Element: Example
Example Table
Table
• Consider a table, called TEST, with just one
column, x, with the following values:

X
4
5
8

76

38
Find
Find the
the Nth
Nth Element:
Element: Step
Step 11
• First join TEST with itself, this yields each
element matched with every other element:

4 4
4 5
4 8
5 4
5 5
5 8
8 4
8 5
8 8
77

Find
Find the
the Nth
Nth Element:
Element: Step
Step 22
• Next keep only those rows where the first column
is greater than or equal the second column.
4 4 4 4
4 5 5 4
4 8 5 5
5 4 8 4
5 5 8 5
5 8 8 8
8 4
8 5 Notice the pattern that just developed, each number on the
list now has a certain number of values that match on the
8 8 right. This number matches the position of this value in
the list. For example, 4 has only one match as it is the first
number in the list, 5 has two matches, 8 has three matches.

78

39
Find
Find the
the Nth
Nth Element:
Element: Step
Step 33
• Now group by the column on the left and identify
the size of each group.

• The same ideas can be applied to any SELECT


statement output.

4 4 4 1
5 4 5 2
5 5 8 3
8 4
8 5
8 8

79

Finding
Finding the
the Nth
Nth Element:
Element: Example
Example
• Query Purpose: Find the information about the
product with the second highest price.

SELECT a.ProductName, a.ProductType,


a.Price, a.SKUNumber
FROM TinyProducts a, TinyProducts b
WHERE a.Price >= b.Price
GROUP BY a.ProductName,a.ProductType,
a.Price, a.SKUNumber
HAVING COUNT(*) =
(SELECT COUNT(*)-1 FROM TinyProducts)

80

40
Finding
Finding the
the Top
Top N
N Elements:
Elements: Example
Example
• To ask for the top n values instead of the nth
value, specify a range (>=) instead of just an
equality (=) in the HAVING.

• Query Purpose: Find information about the


products with the two highest prices.

SELECT a.ProductName, a.ProductType, a.Price, a.SKUNumber


FROM TinyProducts a, TinyProducts b
WHERE a.Price >= b.Price
GROUP BY a.ProductName,a.ProductType,
a.Price, a.SKUNumber
HAVING COUNT(*) >=
(SELECT COUNT(*)-1 FROM TinyProducts)
ORDER BY a.Price

81

Finding
Finding the
the Median
Median
• The median is defined as the element in the
middle of the list.

• Query Purpose: Find the median price in


TinyProducts.

SELECT a.ProductName, a.ProductType, a.Price, a.SKUNumber


FROM TinyProducts a, TinyProducts b
WHERE a.Price >= b.Price
GROUP BY a.ProductName,a.ProductType, a.Price, a.SKUNumber
HAVING COUNT(*) = (SELECT (COUNT(*)/2)+1 FROM TinyProducts)

82

41
Using
Using Subqueries
Subqueries
• A subquery may be used in the middle of a query.

• Query Purpose: Find the information about the


highest priced product, using a simple subquery.

SELECT a.ProductName, a.ProductType, a.Price, a.SKUNumber


FROM TinyProducts a
WHERE Price = (SELECT MAX(PRICE) FROM TinyProducts)

83

Correlated
Correlated Subquery
Subquery
• If the subquery references a data element from
outside of the subquery, it is called a correlated
subquery.
– For each row in the outer part of the query, the correlated subquery
is executed.

The following query will indicate who makes more money than ‘Ethel’
SELECT a.Name, a.Salary
FROM Employee a WHERE EXISTS
(SELECT b.Salary
FROM Employee b
WHERE a.Salary > b.Salary
AND b.Name = 'Ethel')

84

42
Other
Other Data
Data Manipulation
Manipulation

• INSERT
– Add rows to a single table

• UPDATE
– Modify rows in a single table

• DELETE
– Remove rows from a single table

85

INSERTExamples
INSERT Examples
• Statement Purpose: Add a record for employee
#1, ’Fred' with a salary of 200 to the EMPLOYEE
table

INSERT INTO Employee VALUES


(1, ’Fred', 200)

• Statement Purpose: Copy all rows in the


EMPLOYEE table and place them in NEW_EMPLOYEE

INSERT INTO New_Employee


SELECT * FROM Employee

86

43
UPDATEExample
UPDATE Example
• Statement Purpose: Modify Fred’s salary to 150

UPDATE Employee
SET Salary = 150.00
WHERE Name = 'Fred'

• Statement Purpose: Give all employees a ten


percent raise

UPDATE Employee
SET Salary = Salary * 1.10

87

DELETE Examples
DELETE Examples
• Statement Purpose: Remove all employees who
have a salary higher than 100.

DELETE FROM Employee


WHERE Salary > 100

• To remove all employees:

DELETE FROM Employee

88

44
CREATE TABLE Example
CREATE TABLE Example
• Statement Purpose: Create a table to store
employee information

CREATE TABLE EMPLOYEE


(EmpId SMALLINT,
Name CHAR(10),
Salary DECIMAL(5,2))

To drop the EMPLOYEE table

DROP TABLE EMPLOYEE

89

Data Warehouse
Security

(slides in this section


are used courtesy of
Carrig Emerging Technology
Ph: 410- 553- 6760
www.c a r r i g e t. c o m )

90

45
Data
Data Warehouse
Warehouse Security
Security

1) Key Security Services


2) Views
3) Access Control
4) Roles
5) Encryption
6) Audit Trails
7) Security Holes
8) Intrusion Detection
9) Misuse Detection

91

Introduction
Introduction
• A key feature provided by database systems is
good security services.
– In a database system with good security, applications do not have
to worry about problems that arise with security violations.

• A data warehouse also requires good security


services because it holds key, corporate data.

Database System

EIS
Security
Services

92

46
Key
Key Security
Security Services
Services
• Access Control
– Controls who accesses what data

• Administration of Access Control


– Used to give access to users as well as track who has various
accesses and what kind of accesses are given to a user or group of
users
– Audit tracks the usage of the data warehouse

93

Security
Security in
in aa Data
Data Warehouse
Warehouse
• A data warehouse consolidates organizations key
data in one place.
– A data warehouse increases the security risk that unauthorized
users will try to obtain this data

• Security aspects of EIS applications must be


designed and implemented very thoroughly.

• Access control and audits are


two of the critical components
of security.

94

47
Data
Data Warehouse
Warehouse Security
Security Components
Components
• Database system components that can be used
to protect a data warehouse include:
– Views
– Allow users to only see certain rows or columns of data
– Access control
– Indicate which users have access to what data
– Administration
– This component is used to actually give access to groups of users
and to define the accesses given to either an individual or a group.
– Encryption
– Protect data from access outside of the DBMS
– Audit
– Track what users are doing

95

Views
Views in
in Data
Data Warehouse
Warehouse
• A view is a logical view into one or more tables.
Users may be given access to the view without
access to the base table.

• Views provide some security assistance because


they can hide data from users.
EMPLOYEE
Name Address Salary
Hank 1 South Street $50,000
Esther 2 North Street $80,000
Tom 34 Main Street $90,000
Sue 45 Easy Street $28,500
Dave 56 5th Avenue $35,000
Pete 7 Broadway $60,000
Kathy 89 Western Avenue $85,000

96

48
View
View Example
Example
• A view called SAFE_EMPLOYEE may be created
as:
CREATE VIEW SAFE_EMPLOYEE AS
(SELECT name, address FROM EMPLOYEE)

Now users of the view SAFE_EMPLOYEE will not


even know that salary exists.

SAFE_EMPLOYEE
Name Address Salary
Hank 1 South Street
Esther 2 North Street
Tom 34 Main Street VIEW (SAFE_EMPLOYEE)
Sue 45 Easy Street
Dave “Salary” is effectively hidden
56 5th Avenue
Pete 7 Broadway
Kathy 89 Western Avenue

97

Updating
Updating Views
Views
• Restrictions exist on updating views. For the
EMPLOYEE table, it is possible to insert into the
SAFE_EMPLOYEE view.
– Example :
INSERT INTO SAFE_EMPLOYEE VALUES (‘Hank’, 300)

This will insert a NULL into the SALARY column of the base table
EMPLOYEE.

• Other restrictions to view updates exist:


– Cannot update a view that is defined with an aggregate
– Cannot update a view that is defined with a GROUP BY

98

49
Data
Data Warehouse
Warehouse Access
Access Control
Control
• Access control is implemented in a data
warehouse with the SQL Grant and Revoke
commands.
• Syntax
– GRANT <ALL|UPDATE|DELETE|INSERT|SELECT> ON
<object-name> TO <user name>
– Example: GRANT SELECT ON EMPLOYEE TO MARY

• Access control is done by DBAs and creators of


tables.

• To remove access the REVOKE command is used.


– Example: REVOKE SELECT ON EMPLOYEE FROM MARY

99

Database
Database Roles
Roles
• Roles provide security administration by allowing
users to be grouped into roles. Accesses may
then be given to a group of users.
– As an example, some roles for a company might be:
– Administrative assistant
– Loan officer
– Salesperson

• Accesses may be assigned based on roles.


– This dramatically simplifies administration.
– If new tables are created, it is not necessary to add thousands of
new accesses.
– Examples:
CREATE ROLE loan_officer AS (Hank, John, Mike)
GRANT SELECT ON LOAN TO LOAN_OFFICER

100

50
Example
Example of
of Application-based
Application-based Roles
Roles
• Consider:

Database
Users Applications Data
System

• If the database system controls accesses than it


does not matter what the application does,
accesses are controlled consistently (same for
SALES as MARKETING)

• However, more fine-grained access control can


be granted in the application.

101

Application
Application Roles
Roles
• The application can restrict:
– Data entry screens
– Reports

• Care must be taken to restrict users in a


consistent fashion so that a user cannot jump to
a different application and avoid security set up
by another application.

102

51
Role
Role Based
Based Security
Security in
in aa Data
Data Warehouse
Warehouse
• Both application and database level security are
useful in a data warehouse.

• Database level security is needed so that users


are only allowed to see data they
need to see.

• Application level security can


be used to control access to
certain menus so that users do
not even know what reports exist.

103

Encryption
Encryption
• Encryption is the process of coding data so that it
can only be read by users who have the key that
allows them to decrypt the data.
– Example: A message “sell 500 shares” would appear as “xyzzy”
without the key. Once the key is paired with the encrypted string
“xyzzy”, it can then be decrypted.
– The size of the key is a factor in how difficult it is to attack the
encryption scheme.

• Three places where encryption might be used in a


data warehouse:
– Network
– Data
– Tape backups

104

52
Network
Network Encryption
Encryption
• In a data warehouse application, data and queries
are transmitted through a network.
– Attackers might be able to steal network traffic just by breaking
into the network medium.

• One way to reduce the risk of this threat is to


encrypt traffic on the network.
User

Network

Data
Warehouse
Application Database
System
Tape Backup

105

Network
Network Encryption
Encryption
• Network encryption is critical because the
network connects all of the key components in a
data warehouse.

• Encrypting network traffic mitigates the risk that


an attacker could succeed with the “man in the
middle” attack.

• Without this, it may be possible for the


“man in the middle” to masquerade as
another user and circumvent existing
application and database security.

106

53
Data
Data Encryption
Encryption
• Data encryption refers to encrypting the actual
data in the data warehouse.

• If the attackers were to retrieve data from


the warehouse, they would have to
decrypt it in order to read it.

EIS Database Data


System Warehouse

107

Backup
Backup Encryption
Encryption
• Periodically, databases are copied to some kind
of long-term storage (usually tapes).

• If the database is encrypted, but the tapes are not


encrypted, the risk exists of someone walking off
with the tapes.

EIS Database Data


System Warehouse

Tape Backup

108

54
Audit
Audit Trails
Trails
• Audit trails are a means of tracking queries,
updates, deletes, and additions of new data to the
data warehouse.
– Audit trails are turned on when the DBMS is started and all
activity that uses the data warehouse is tracked in the audit trail.

• If a user is suspected of an evil deed, the audit


trail can be examined to identify what data has
been accessed by users.

109

Details
Details of
of DW
DW Audit
Audit Trails
Trails
• An audit trail of a database system typically
includes the following information:
– User ID, Date, Time, Object that has been accessed (table or view),
Action that accessed the object (INSERT, UPDATE, DELETE,
SELECT)
– For UPDATE, the old value and new value is tracked.

• For data warehouses, the SELECT is often used


to track the queries that have
been run against the warehouse.

110

55
Other
Other Uses
Uses for
for DW
DW Audit
Audit Trails
Trails
• Audit trails can be used to identify the most
popular data in the warehouse.
– This information can be used to optimize queries

• An additional use for audit trails is performance


tuning of the data warehouse.
– Administrators know where to focus their efforts
– Reduces administrative overhead

111

Dealing
Dealing with
with Known
Known Security
Security Holes
Holes
• Commercial database systems and operating
systems are often filled with holes that allow
users to obtain unauthorized access.
– To reduce the risk of these known holes, vendors often provide
“fixes” to their products as soon as these holes become public.

• It is important to constantly keep up with known


security holes and apply the latest fixes as soon
as they are released.

• One of the key risks surrounding a data


warehouse is that privileged users have the “keys
to the kingdom”.

112

56
The
The Risk
Risk of
of “Privileged
“Privileged Users”
Users”
• "Privileged users" include:
– Data warehouse administrators
– Operating system programmers
– Operators in the computer center

– These users can:


– Modify, delete and query any data in the warehouse
– Modify the audit trail to mask their actions
– Give other users unauthorized access

• Numbers of "privileged users" could


be anywhere from 20 to 30 in some
organizations.

113

Reducing
Reducing the
the Risk
Risk of
of Privileged
Privileged Users
Users
• One way to reduce the risk of privileged users is
to separate security administration from database
administration.
– This would separate the task of giving accesses and managing the
audit trail from the task of making sure the data in the warehouse
was correct and properly optimized.

Security Services
Access Control Security Database
Audit Services Services

Database Services Database Tuning


Access Control
Database Tuning Query Optimization
Audit
Query Optimization Backups
Backups

114

57
Information
Information Security
Security Attacks
Attacks
• Two types of Information security attacks on data
warehouses are:
– Intrusion
– An intrusion occurs when an unauthorized user gains access to the
data warehouse.
– Misuse
– Misuse, often referred to as the insider
problem occurs when a user who has access
to the warehouse uses that access for an
unauthorized purpose

• Audit Trails can be used to


identify either type of attack, but
identification of misuse is typically MUCH harder
to do than intrusion.
115

Intrusion
Intrusion Detection
Detection
• An intrusion is defined as an unauthorized
access to a system. The assumption is the user is
external to the environment (e.g.; a hacker).
• To reduce the risk of intrusion, intrusion
detection tools are used.
– These tools monitor access to the data warehouse and sound an
alarm if unauthorized accesses are detected.

INTRUSION DETECTION SYSTEM

DATA
USER WAREHOUSE

116

58
Misuse
Misuse Detection
Detection
• Unwanted access by a user that has the ability to
access data is referred to as misuse.
– This is also known as the insider problem.
– Some estimates have shown that 80 % of computer crime is a
result of misuse.

• For data warehouses the threat of misuse is high


especially by privileged users.

117

Summary
Summary
• DBMS Security is useful for data warehouses to
hide data from users with views and to restrict
access to data with GRANT and REVOKE.

• Application Level Security assists EIS that


access data warehouses by hiding certain reports
from users.
• Encryption can be used to further protect against
the risk of someone walking off with the data
warehouse.
• Audit Trails are useful for:
– Catching attackers
– Identifying usage trends of the data warehouse
118

59
Moving Data
to the
Data Warehouse
(slides in this section
are used courtesy of
Carrig Emerging Technology
Ph: 410- 553- 6760
www.c a r r i g e t. c o m )

119

Moving
Moving Data
Data to
to the
the Data
Data Warehouse
Warehouse

1) Moving Data into the Data Warehouse


2) Updating the Data Warehouse
3) Full Refresh
4) Copy Only the Changes
5) BCP
6) Simple Transformations
7) Complex Transformations
8) Commercial ETL Tools

120

60
Moving
Moving Data
Data into
into the
the Data
Data Warehouse
Warehouse
• Data must be moved to the data warehouse from
source systems.

• Some key issues:


– Determine the frequency of data updates -- how often should data
be moved from source systems to the data warehouse.
– Various means of updating data in the warehouse exist:
– SQL Commands
– Database system load programs (e.g.; SQL Server’s BCP)
– Commercial tools

121

Updating
Updating the
the Data
Data Warehouse
Warehouse
• OLTP (On-Line Transaction Processing) Systems
have to send their updates to the data warehouse.

Finance Inventory Sales


OLTP OLTP OLTP
Application Application Application

Data Warehouse

Finance Inventory Sales


Subject Subject Subject
Area Area Area

122

61
Frequency
Frequency of
of Updates
Updates to
to the
the Data
Data
Warehouse
Warehouse
• Updates may occur daily, weekly, monthly, or in
real-time.

Finance Inventory Sales


OLTP OLTP OLTP
Application Application Application

Mo
Wee
Da

nth
kly
ily

ly U
Up

Upd
da

pda
te

ate

te
Data Warehouse

Finance Inventory Sales


Subject Subject Subject
Area Area Area

123

Determining
Determining the
the Frequency
Frequency of
of Updates
Updates
• Requirements should drive update frequency

• Range of updates runs from real-time, to


quarterly.
– Real time update
– Expensive
– Requires update of warehouse while users are
querying
– Daily update
– Somewhat cheaper than real time, but significant
maintenance required if the warehouse has lots of tables.
– Monthly or weekly update
– Much more manageable

124

62
Updating
Updating the
the Warehouse
Warehouse
• Full Refresh vs. Only the Changes
Sales
Inventory OLTP
OLTP Application
Finance Application
OLTP

F u l l ges for
cha
Application

Ch

n
refr
an
ges

e s h ther ta
sin

of s
Fu

o
ce
ll R

o m e les
las
efr

tu
esh

tabl
b
pd
ate

es
Data Warehouse

Finance Inventory Sales


Subject Subject Subject
Area Area Area

125

Full
Full Refresh
Refresh
• Copy the entire source table in the OLTP system
to the destination table in the Data Warehouse.

Source OLTP Target Data Warehouse

Target
Source Table
Table

126

63
Copy
Copy Only
Only the
the Changes
Changes
• Copy only the changes to the source table in the
OLTP system to the destination table in the data
warehouse.

Source OLTP Target Data Warehouse

Source
Table

Target
Table

Modified data
since last update
to the warehouse
Data from two updates ago.
Historical data no longer in
source OLTP.
127

Full
Full Refresh
Refresh vs.
vs. Only
Only the
the Changes
Changes
• Full Refresh
– Pros
– Much easier to implement
– Less chance of messing up your database (good data integrity)
– Cons
– Can take a lot longer to actually do -- may “run out of night”
– Can lose out on warehouse ability to track historical data.

• Only the Changes (DELTA)


– Pros
– Tracks historical data
– Cons
– Can be very hard to implement
– Can require changes in source applications (more on this later)
128

64
Full
Full Refresh
Refresh Using
Using INSERT-SELECT
INSERT-SELECT
• One way to move data from one table to another
is via the INSERT-SELECT.
– Syntax: INSERT INTO <target_table>
<any sql SELECT statement>

• Example:
INSERT INTO DW_EMPLOYEE
SELECT *
FROM EMPLOYEE

TARGET

129

Updating
Updating Changes
Changes Using
Using INSERT-SELECT
INSERT-SELECT
• Changes may be moved by adding a WHERE
clause to the INSERT-SELECT.

• Example:
– INSERT INTO DW_EMPLOYEE
SELECT *
FROM EMPLOYEE
WHERE DATE-UPDATED =
DATEPART(m, CURRENT_TIMESTAMP)

130

65
Updating
Updating Using
Using BCP
BCP
• BCP is the bulk copy program that comes with MS
SQL Server.
– Bulk copy (BCP) moves data to or from a flat file to a SQL table.

• Syntax:
bcp <table> [in | out] <data file>

Target Data
Source OLTP Warehouse

Unload Load
Source Temporary Target
Flat
Table File Table

131

BCP
BCP Example
Example
• To bulk copy data from the publishers table in
the pubs database to the publishers.txt data file
in ASCII text format, execute from the command
prompt:
bcp pubs..publishers out publishers.txt -c
-Sservername -Usa -Ppassword

• To bulk copy data from the publishers.txt file


into the pub2 table in the pubs database, execute
from the command prompt:
bcp pubs..pub2 in publishers.txt -c
-Sservername -Usa -Ppassword

132

66
Simple
Simple Transformation
Transformation
• In addition to moving data from OLTP to the
warehouse, it is often necessary to transform
data.
– Example: System A stores TOTAL_CLOTH in meters and system
B stores TOTAL_CLOTH in yards. Before the data is moved from
system A, we need to transform the data.

Store 32
Store 31 (Pattern = 32,
(Pattern = 31,
TRANSFORMATION Total Cloth = 20 meters)
TOTAL_CLOTH = 50 yards )

Data Warehouse
P a t t e r n = 3 1 , T o t a l C l o t h = 5 0 yards
P a t t e r n = 3 2 , T o t a l C l o t h = 7 0 yards

133

Complex
Complex Transformation
Transformation
• More complex transformations occur when a
value in a source table must be moved to several
locations in a data warehouse.

BLUE3 4 8 4
(Color = Blue, 34 Inches, LS)

34 in 84
O
TT
ER
CONVERT TO
O NV E 84 nd
C OD s) a
C eve les
CENTIMETERS
BLUE le o tab
gs
(lon in tw
pu t

TABLE 2
TABLE 1 TABLE 3 TABLE 4
86.36 cm
Long Long
COLOR Data Warehouse Sleeves Sleeves

134

67
Commercial
Commercial ETL
ETL Tools
Tools
• Key tools in the marketplace
– Informatica
– Ardent
– DecisionBase (Platinum)
– Microsoft Data Transformation Services

• All provide libraries of common transformations.

• All provide the ability to


code complex transformations.

135

Data
Data Transformation
Transformation Services
Services

136

68
Choose
Choose aa Source
Source

137

Choose
Choose aa Destination
Destination

138

69
Choose
Choose to
to use
use aa Query
Query for
for Transfer
Transfer

139

Enter
Enter SQL
SQL Query
Query

140

70
Choose
Choose Destination
Destination TableName
TableName

141

Verify
Verify Transformation
Transformation

142

71
Decide
Decide When
When to
to Run
Run Transformation
Transformation

143

Final
Final Verification
Verification

144

72
Run
Run Transformation
Transformation

145

Check
Check Results
Results

select *
from orderfact

orderid orderdate productid productname quantity unitprice discount

10248 1996-07-04 00:00:00.000 11 Queso Cabrales 12 14.0000 0.0


10248 1996-07-04 00:00:00.000 42 Singaporean Hokkien Fried 10 9.8000 0.0
10248 1996-07-04 00:00:00.000 72 Mozzarella di Giovanni 5 34.8000 0.0
10249 1996-07-05 00:00:00.000 14 Tofu 9 18.6000 0.0
10249 1996-07-05 00:00:00.000 51 Manjimup Dried Apples 40 42.4000 0.0

146

73
Summary
Summary
• ETL is one of the hard parts of building a data
warehouse.

• Either full refreshes of data or just the changes


may be done.

• Doing full refresh is easy, but historical data is


lost and it may take a lot of time.

• Tracking changes is a tough business.

• ETL commercial tools are beginning to mature


and can lessen the pain of this task.

147

More Ways of
Moving Data to
the
Data Warehouse
(slides in this section
are used courtesy of
Carrig Emerging Technology
Ph: 410- 553- 6760
www.c a r r i g e t. c o m )

148

74
More
More Ways
Ways of
of Moving
Moving Data
Data
to the Data Warehouse
to the Data Warehouse

1) Determining What Data Has Changed


2) Recovery Logs
3) Triggers
4) Insert Triggers
5) Delete Triggers
6) Update Triggers
7) Manual Detection

149

More
More Ways
Ways of
of Moving
Moving Data
Data
to the Data Warehouse
to the Data Warehouse
• There is a need to move data into the data
warehouse from OLTP and DSS applications

• The problem is detecting what data needs to be


moved into the data warehouse

• Three methods:
– Recovery Logs
– Triggers
– Manual Techniques

150

75
Determining
Determining What
What Data
Data Has
Has Changed
Changed

• Problem: How to get updates made to the source


to the same information in the data warehouse?

SOURCE How to get updates from


Source Table A to DATA WAREHOUSE

?
Data Warehouse Table B
A
LE LE
B
TAB TAB
UP
DA
TE
S

P
OLT

151

Determining
Determining What
What Data
Data Has
Has Changed
Changed (cont.)
(cont.)

• Problem: How to get updates made to multiple


sources to the same information in the data
warehouse?

SOURCE DATA WAREHOUSE

A
LE
TAB LE
A LE
B
“ROW X”
Employee
NAME DEPT. SALARY
? TAB
“ROW X”
TAB
“ROW X”
S
ATE
X

Fred Mktg 35000


EmployeeCount
ROW

Hank Sales 60000

?
UPD

DEPT COUNT SalaryInfo


Sue IT 71000 Mktg 1
DEPT AVG SAL TOT SAL
Joe Sales 50000 Sales 1 2
IT 1 Mktg 35000 35000
UPDATES HR 0 IT 71000 71000
HR 0 0
P
OLT
Insert into
Employee Values Sales 60000 60000
(‘Joe’,’Sales’,’50000) 55000 110000

152

76
What
What is
is the
the Recovery
Recovery Log?
Log?
• Recovery log is used for transaction processing
– Used to handle errors
– Does contain before and after image.

• Recovery log can be used to


identify the data to be updated
in the data warehouse.
– Change Data Capture Utility
– This scans the database log and identifies all changes that the user
is interested in and either writes them to a file or stores them in
another table.

153

Change
Change Data
Data Capture
Capture Utility
Utility in
in Action
Action

SOURCE

OLTP DATA
All changes
to DBMS
DBMS
LOG

RECOVERY LOG

S
AD
RE
CHANGE
DATA
CAPTURE DATA WAREHOUSE
UTILITY WRITES

154

77
Example
Example of
of Using
Using Recovery
Recovery Log
Log
• Consider an update to the Employee table
– The information is recorded in the log
– The change data capture reconstructs update
LOG
– Can then be sent to the data warehouse
TABLE=EMPLOYEE
SSN=10
UPDATE EMPLOYEE
Where SSN=10 OldSalary=100,
SET Salary=Salary*2.0
NewSalary=200

CHANGE
DATA
CAPTURE DATA
RECONSTRUCTS WAREHOUSE
UPDATE

155

Using
Using the
the Recovery
Recovery Log
Log
• Recovery logs are usually in proprietary format.
Use commercial tools to read the log and identify
the changes.

• Commercial tools such as CA’s log analyzer can


place the results of their work in a table.

156

78
Summary
Summary of
of Change
Change Data
Data Capture
Capture
• Pro
– Log exists anyway, might as well use it to find what has changed

• Con
– Some difficult scenarios may occur where it is hard to see what the
new update should be in the Data Warehouse.

– Proprietary format, may not be supported in many DBMS and will


always lag behind DBMS development.

– Many tables will be in the source that have nothing to do with the
data warehouse, but change data capture will process their changes
as well.

157

Triggers
Triggers
• Triggers allow DBA’s to specify that when an
“event” such as an INSERT, UPDATE, or DELETE
occurs on a table, another event is triggered.
– Triggers are used to identify changes that are needed by the
warehouse.
– A trigger can be added to a source table and whenever the source
table is updated, an update can be placed either directly in the
warehouse or in a staging table that tracks all updates.

• Triggers can be used to detect the


changes and perform data
warehouse updates.
– A different trigger might be run on key updates so that the data
warehouse nightly process would know what data has changed.

158

79
Example
Example of
of aa Trigger
Trigger

STAGING
STEP 2 X, Y
A When values are
LE
TAB
inserted, sets off TRIGGER inserts
the TRIGGER values (X, Y) into
a “STAGING” area STEP 3
Values (X, Y) are inserted

Nightly
Process

Nightly Process inserts


values (X, Y) into
STEP 1 the Data Warehouse

STEP 4 DATA WAREHOUSE

A
INSERT into LE
TABLE A TAB
VALUES (X, Y) Values (X, Y)

159

Real-Life
Real-Life Trigger
Trigger Example
Example
• OLTP/DSS Data - Employee table:
–Employee (ssn, name, salary)

• DW Data - Summary table:


–EmployeeStatistics (total number employees,
total salary paid, average salary).

• When a row is inserted in the employee table, we


need to do an insert into the EmployeeStatistics
table.
– Shown on the next page

160

80
Insert
Insert Trigger
Trigger Example
Example
CREATE TRIGGER EmployeeInsertTrigger
ON Employee
FOR INSERT AS
BEGIN
UPDATE EmployeeStatistics
SET NoEmployee = NoEmployee +
(SELECT COUNT(*) FROM INSERTED)

UPDATE EmployeeStatistics
SET TotSalary = TotSalary +
(SELECT SUM(Salary) FROM INSERTED)
UPDATE EmployeeStatistics
SET AvgSalary = TotSalary / NoEmployee
END

161

Insert
Insert Trigger
Trigger in
in Action
Action
COMMANDS RESULTS

INSERT INTO EMPLOYEE


VALUES (1, 'John', 300) (1 ROW(S) AFFECTED)

INSERT INTO EMPLOYEE


VALUES (2,'Mike', 400) (1 ROW(S) AFFECTED)

Employee
SELECT * FROM EMPLOYEE EmpId Name Salary
------ --------------------------
1 John 300.00
2 Mike 400.00

EmployeeStatistics
SELECT * FROM NoEmployee TotSalary AvgSalary
EMPLOYEESTATISTICS ---------- ---------- ---------
2 700.00 350.00

162

81
Delete
Delete Trigger
Trigger Example
Example
CREATE TRIGGER EmployeeDeleteTrigger
ON Employee
FOR DELETE AS
BEGIN
DECLARE @numberEmployee int

UPDATE EmployeeStatistics
SET NoEmployee = NoEmployee - (SELECT COUNT(*) FROM DELETED)

UPDATE EmployeeStatistics
SET TotSalary = TotSalary - (SELECT SUM(Salary) FROM DELETED)

SELECT @numberEmployee = NoEmployee FROM EmployeeStatistics

IF @numberEmployee > 0
BEGIN
UPDATE EmployeeStatistics
SET AvgSalary = TotSalary / NoEmployee
End
ELSE
UPDATE EmployeeStatistics SET AvgSalary = 0.0
END

163

Update
Update Trigger
Trigger Example
Example
CREATE TRIGGER EmployeeUpdateTrigger
ON Employee

FOR UPDATE AS
BEGIN

IF UPDATE (Salary)
UPDATE EmployeeStatistics
SET TotSalary = TotSalary -
(SELECT SUM(Salary) FROM DELETED) +
(SELECT SUM(Salary) FROM INSERTED)

UPDATE EmployeeStatistics
SET AvgSalary = TotSalary / NoEmployee
END

164

82
Summary
Summary of
of Using
Using Triggers
Triggers
• Pro
– Only needed for tables whose data is going to go to the DW

• Con
– Additional work needed to create detailed triggers

– Non-trivial to generate a trigger to implement appropriate action

– May not be acceptable for commercial software on source system

165

Other
Other Ways
Ways to
to Determine
Determine What
What Has
Has Changed
Changed
• There are other manual ways of detecting the
change and doing DW updates
– Look at each row of OLTP and the data in the warehouse
– Compare the differences between the two files, if the data is not in
the warehouse, add it!

OLTP DATA WAREHOUSE

Hank Hank
E
AR
John John
Mike MP
CO
Mike
Sam
ADD THE DIFFERENCES

166

83
Manually
Manually Identifying
Identifying What
What Has
Has Changed
Changed

• Pro
– Flexible

• Con
– Very expensive

– Could take a long time

167

Summary
Summary

• Recovery Logs
• Triggers
• Manual Detection

168

84
Data Warehouse
Design

(slides in this section


are used courtesy of
Carrig Emerging Technology
Ph: 410- 553- 6760
www.c a r r i g e t. c o m )

169

Data
Data Warehouse
Warehouse Design
Design

1) Overview
2) Describing a Design - ER Diagrams
3) Design Normalization
4) Star Schema Design

170

85
Overview
Overview
• How to describe a design
– Entity Relationship (ER) Diagram

• Types of Designs

– Normalized
– Star Schema
– Snowflake

171

Describing
Describing aa Design
Design
• Different techniques exist, the most prevalent is
the ER (Entity-Relationship) Diagram

• Entities
– Things that occur in the real world, usually nouns e.g.; employee,
part, product, etc.

• Relationships
– How entities interact, example: one employee may attend many
colleges -- usually verbs
– Types of relationships
– 1-1
– 1-Many
– Many-1
– Many-Many

172

86
Examples
Examples of
of Relationships
Relationships
1-1 1-MANY

MANY-1 MANY- MANY

173

Normalized
Normalized Design
Design
• Methodology
– All 1-1 relationships are placed in a single table.
– Many-many relationships require two tables that store the single-
valued relationships and one linking table that indicates how the
entities are related. The relationship is represented in the linking
table by referencing keys in the two tables that represent each
entity in the relationship.

• Checking the design


– In a Normalized Design, there are many different normalized
forms. Each normal form (NF) builds on the previous one so that a
table in 2NF is, by definition, in 1NF.
– 1NF
– 2NF
– 3NF

174

87
Dealing
Dealing With
With Many-Many
Many-Many Relationships
Relationships
• For Many-Many
– Two 1-1 Tables (SUPPLIER, PARTS)
– One linking table (SP)
– Ex: Suppliers, Parts are the 1-1, SP is the linking table that says
who sells what parts.

SUPPLIER PARTS
S# SNAME P# PNAME
1 SEARS 1 HAMMERS
2 OFFICE DEPOT 2 NAILS

SP
S# P#
1 1
1 2
2 1
2 2

175

Normalized
Normalized Design:
Design: Example
Example
• A store sells a product which is supplied by a
given vendor. The product is purchased by a
customer at a certain time.
– Entities: Customer, Product, Store
– Relationships: Customer buys Product
– Product is located in Store
– Product is supplied By a Vendor

VENDOR

CUSTOMER PRODUCT STORE

BUYS IS-LOCATED-IN

176

88
Checking
Checking aa Normalized
Normalized Design
Design
• Normalization
– Used to reduce data insertion, delete, and update anomalies caused
by bad designs.
– Enables users to quickly check a design and make sure there are no
glaring holes in the design.
– 1NF
– All “cells” are atomic -- i.e. each entry in a column contains only
one value
– 2NF
– All non-key values are functionally dependent upon the entire
primary key -- i.e. if the primary key changes, all other columns
change.
– 3NF
– No transitive dependencies -- i.e. all keys are completely
dependent on the primary key. If the primary key changes, all
non-key columns are affected.
177

Overview
Overview of
of Normalized
Normalized Design
Design

• Pro
– Relatively easy to change

• Con
– Queries can involve numerous joins
– The massive number of tables and links between tables makes it
hard for customers to build their own queries

178

89
Star
Star Schema
Schema
• Methodology
– Single fact table in the middle describing a key event (e.g. sale)
surrounded by dimension tables (i.e. location, time, employee)
D = DIMENSIONS

D2
D1

FACT
D3
D5
D4

179

Star
Star Schema:
Schema: Methodology
Methodology
• Identify a key fact that occurs.
– Usually some event creates a real fact. Selling a product in a store
on Wednesday, patient visiting a hospital, etc.

• Identify all the dimensions of the data being used.


Think of a dimension as a way to slice the data.
– Ex: by time, by product, by customer, etc.

• Drill down operations are very well supported

180

90
Star
Star Schema:
Schema: Example
Example
• A store sells a product which is supplied by a
given vendor. The product is purchased by a
customer at a certain time.

• Fact
– CustomerPurchase

• Dimensions are
– Customer
– Product
– Time
– Vendor

181

Star
Star Schema:
Schema: Example
Example (cont.)
(cont.)

Time
Customer

Sale Price

Store
Product

SALE SALE ID CUST. ID STORE ID PROD. ID PRICE TIME

1 3 7 4 $3.00 4/24/99

CUSTOMER CUST. ID NAME PHONE Buys Apples Has Big Car

3 FRED 1234 Y Y

TIME DAY MONTH QTR YEAR

24 4 2Q 99

182

91
Star
Star Schema:
Schema: Overview
Overview

• Pro
– Easy for users to navigate and understand

• Con
– Performance
– Can end up with one monster fact table, millions of rows
– Flexibility
– Not as easy for customers to change the design

183

Snowflake
Snowflake Schema
Schema
• Several stars can be connected to form a snowflake

MARKETING

Distrib- Ad
ution

Direct Mail
Price

Sales PRODUCT
SALES Manu-
Location facturing
Marketing Parts
Revenue

Make
Sale Chips
Cost
Product
Price
Price
Labor
Vendor

184

92
Summary
Summary
• Two basic types of design
– Star Schema
– Normalized

• Many Data Warehouse vendors sell products built


specifically for the star schema

• Some data warehouses insist that normalization


is the way to build the data warehouse.

185

Building a
Data Warehouse

(slides in this section


are used courtesy of
Carrig Emerging Technology
Ph: 410- 553- 6760
www.c a r r i g e t. c o m )

186

93
Building
Building aa Data
Data Warehouse
Warehouse

1) Top Down Approaches


2) Enterprise Data Model Approach
3) "Let Data Users Decide"
4) "Let Data Warehouse Builders Decide"
5) "Let Senior Management Decide"
6) Bottom Up Approach

187

Building
Building the
the Data
Data Warehouse
Warehouse
• How to decide what data goes into the data
warehouse?

• Methods:
– Top Down
– Using Enterprise Data Models
– "Let data users decide" approach
– "Let data warehouse builders decide" approach
– "Let senior management decide" approach

– Bottom Up
– Combine data marts into a data warehouse

188

94
Using
Using Enterprise
Enterprise Data
Data Models
Models
• Use the Enterprise Data Model to decide what
data goes into the data warehouse.
– Model key processes. This approach says let the business decide.
– Identify key data used by these processes in an enterprise data
model -- might be a giant Entity-Relationship diagram.

• Put data in the warehouse based on the


enterprise data model.

189

An
An Enterprise
Enterprise Data
Data Model
Model Example
Example

PUT IN SELL CHIPS


MAKE BAGS
CHIPS

BUY MORE
COUNT POTATOES

$$

CHIP
SUPPLIERS

CHIP
RECIPES INGREDIANTS

190

95
"Enterprise
"Enterprise Data
Data Model"
Model" Approach
Approach

• Pro
– All inclusive -- no chance of leaving key data out.

• Con
– Very difficult to build an EDM.
– If the business model changes, you may have to rebuild the
Enterprise Data Model and the data warehouse.

• Ways of Avoiding the Con


– In some cases you can buy an EDM -- if the business is common
enough the packaged EDM might be very close and then you just
have to modify it to fit your business.

191

"Let
"Let Data
Data Users
Users Decide"
Decide"
• Let the users of the data warehouse choose what
data will go into the warehouse.
– The data users deciding the data warehouse data and design will
pay for it as well.
– Also, you can charge users who
query the data as well.
SOURCE

USERS

DATA WAREHOUSE

192

96
"Let
"Let Data
Data Users
Users Decide":
Decide": An
An Example
Example

DATA WAREHOUSE

DATA
DATA DATA
demographics
Ethnic
budget
trends group

Advertising
? Age spending
Revenue
?
education ?

MARKETING HUMAN
RESOURCES FINANCE

193

"Let
"Let Data
Data Users
Users Decide"
Decide" Approach
Approach
• Pro
– Reduces budget problems
– Users know best!

• Con
– Requires marketing
– Could end up with data in the warehouse that is meaningless to the
people who run the place.
– Users may not place important data in the warehouse because their
budget is small.
– Users who need the data may not use the DW because of budget
concerns.

• Ways of Mitigating the Con


– Do not just take money -- try to determine if data is really
corporate.
194

97
Pay
Pay As
As You
You Go
Go Warehouse
Warehouse Analogy
Analogy

I-495

195

"Let
"Let Data
Data Warehouse
Warehouse Builders
Builders Decide"
Decide"
• The technical staff who is building the warehouse
decides what data gets put in the warehouse.

LETS PUT
INFORMATION ON
HOW TO BUILD
VIRUSES IN THE
DATA WAREHOUSE

DATA WAREHOUSE

196

98
"Let
"Let Data
Data Warehouse
Warehouse Builders
Builders Decide"
Decide"
Approach
Approach
• Pro
– Very easy to design
– Does not take much time
– Do not have to deal with users

• Con
– Could easily result in data DUMP not data warehouse

• Ways to mitigate the con


– Talk to lots of users to help you guess what should go in the DW

197

“Let
“Let Senior
Senior Management
Management Decide”
Decide”
• The senior management decides what data goes
into the warehouse.
• Asking the senior management is the safest way
to build a data warehouse.
• Identify the key questions on senior
management’s mind and get the data to answer
these questions.

198

99
“Let
“Let Senior
Senior Management
Management Decide”
Decide” Approach
Approach
• Pro
– Ensures executive support for the project

• Con
– Senior management does not have much time for this -- you will
have to only get a few questions at a time
– This dramatically increases visibility - if you do not move quickly
senior management will become very angry with the DW.

• Ways to mitigate the con


– Do your homework before talking to the senior management -- talk
to the aides of senior management to find out what is on their
mind.
– Allocate resources so you can plan to move very quickly once you
hear from the senior management.

199

Bottom-Up
Bottom-Up Approach
Approach
• Move data from existing OLTP Applications to
data marts.

• Combine data marts into a data warehouse.

DATA
WAREHOUSE

DATA DATA DATA


MART MART MART
25 50 200
YARDS METERS CM

OLTP OLTP OLTP


APP APP APP

200

100
Bottom-Up
Bottom-Up Approach
Approach

• Pro
– Data marts are much easier to build than full-fledged DW.

• Con
– Could end up with a bunch of stove pipe data marts.

• Ways to mitigate the con


– Develop standards for data when building the data marts so that
you can glue data from different data marts together.

201

Recommendations
Recommendations for
for an
an Approach
Approach

"Let senior management decide"

202

101
User Interface to
the
Data Warehouse
(slides in this section
are used courtesy of
Carrig Emerging Technology
Ph: 410- 553- 6760
www.c a r r i g e t. c o m )

203

User
User Interface
Interface to
to the
the Data
Data Warehouse
Warehouse

1) Introduction
2) Types of Users
3) Functions Users Want to Do
4) Approaches to Building a User Interface
5) Hand Built
6) Class Libraries
7) OLAP Tools
8) Types of User Interfaces

204

102
Introduction
Introduction
• A User Interface (UI) is a front end application
designed for the user that presents information in
a simplified manner.
– Data in a data warehouse does nothing if users cannot access it
– Users do not want to learn SQL to drive DW applications
Finance Inventory Sales
OLTP OLTP OLTP
Application Application Application

DATA WAREHOUSE
Finance Inventory Sales
OLTP OLTP OLTP
Data Data Data

USER INTERFACE

205

Building
Building User
User Interfaces
Interfaces
• DW applications have different types of users
with different functionality requirements.
– It is critical to identify the key users.
– Once you do this, you need to identify their functional
requirements.

• There are three main approaches to building UI’s


– Build your own entirely
– Use commercial Class Libraries
– Using OLAP Tools

206

103
Types
Types of
of Users
Users

CEO

Executive Executive Executive

Marketing Sales Finance

Analysts Analysts Analysts

Everyone Everyone Everyone

207

Types
Types of
of Users
Users (cont.)
(cont.)

• Executives
– People who run the place
– Need answers quickly
– May not be very technical
– Expect UI to get them what they
want quickly and efficiently without
any need for special training

• Analysts
– Have time to really analyze data and think about it
– May have strong statistical and IT background
(i.e. Power user of Excel)
– Expect UI to have many complex features, and
provide the ability to generate new queries and perform statistical
analysis of the data.

208

104
Types
Types of
of Users
Users (cont.)
(cont.)

• Regular User
– All other users
– Just need some simple answers to simple questions like “What is
Hank’s phone number)
– Expect UI to be simplistic, easy to understand, and provide access
to basic information.

209

Subject
Subject Matter
Matter Experts
Experts Expect
Expect
• Query data in the data warehouse
• Trend analysis
– “show me how much money we have spent on computers in the
last four years”
Trend

Sales

1995 1999

• Benchmark to competitors
– “what are all our competitors charging for product X”

210

105
Subject
Subject Matter
Matter Experts
Experts Expect
Expect (cont.)
(cont.)

• Drill Down
– “on that chart you just showed me, I noticed that revenue was
down in Region #4. Please drill down and show me the breakdown
of each area in Region #4.”

DRILL DOWN
WAL-MART

20
REVENUE

15 Y Values
10
5 X Values
0
1 2 3 4
REGIONS

DRILL DOWN

Revenue
DC
MD VA

Region 4
211

Approaches
Approaches to
to Building
Building User
User Interfaces
Interfaces
• Hand-Built
– Write all of your own code

• Use Class Libraries


– Use an object oriented approach and buy the CLASS libraries that
do all the hard work

• OLAP
– Use an On-Line Analytical Processing package to build user
interfaces for you.

212

106
Architecture
Architecture of
of User
User Interfaces
Interfaces (cont.)
(cont.)

• Hand Built

DATA
USER INTERFACE
WAREHOUSE
i.e. JAVA

DBMS

Commercial
Off The Shelf
• Class Libraries (COTS)
GRAP
HIC
CLAS S
S
LIBR
ARY OLAP
USER
CLASS
INTERFACE
LIBRARY

USER E
FAC
INTER SS
Hand CLA
RY
Built LIBRA

213

Architecture
Architecture of
of User
User Interfaces
Interfaces (cont.)
(cont.)

• OLAP E
YEAR OR
ST

REGION

Result Cube

Commercial
Off The Shelf
(COTS)

REVENUE
USER INTERFACE

DBMS

214

107
Hand-Building
Hand-Building User
User Interfaces
Interfaces
• Write all the code yourself
– Requires many design documents, coding and testing for all of the
code components.

• Pros

– Very flexible

• Cons
– Could take a long time to develop
– Requires substantial resources
– May need lots of testing and debugging

215

Using
Using Class
Class Libraries
Libraries to
to Build
Build User
User Interfaces
Interfaces
• Write initial user dialog yourself and call class
libraries for the hard part (graphics and data
access functionality).

• Pro
– Many class libraries available -- avoid doing a lot of coding
yourself

• Con
– Not as flexible -- if the class library does not do what you want it
to do you have to
– Find a new class library
– Live without the functionality
– Can take a while to find the class library you need and learn how
to interface to it
216

108
Using
Using OLAP
OLAP Tools
Tools to
to Build
Build User
User Interfaces
Interfaces
• Many different OLAP tools
– Need to survey an OLAP tool
– Buy an OLAP tool
– Install it
– If it does not match all requirements some code may be needed to
communicate with the OLAP tool.

• Three types multi-dimensional OLAP

– Relational OLAP (ROLAP)


– Multi-dimensional (MOLAP)
– Hybrid (HOLOP)
– Distributed (DOLAP)

217

Summary
Summary of
of Tools
Tools for
for UI
UI Development
Development of
of DW
DW
• Tools that may be used include:
– Development of in-house software
– Do it all yourself
– Use Class Libraries
– OLAP
– ROLAP
– MOLAP
– HOLAP
– DOLAP
• Different tools or techniques may be useful
depending upon what kind of user interface is
being developed.
– Executive Information Systems
– Analytical Systems
– Enterprise Information Systems

218

109
Types
Types of
of User
User Interfaces
Interfaces
• Executive Information System
– Developed for the person who runs the place
• Analytical System
– Developed for business analysts
• Enterprise Information System
– Developed for users throughout the organization
CEO
EXECUTIVE
INFORMATION SYSTEM

Executive Executive Executive

Marketing Sales Finance

Analysts Analysts Analysts


ANALYTICAL SYSTEM

ENTERPRISE
INFORMATION SYSTEM Everyone Everyone Everyone

219

Executive
Executive Information
Information System
System
• The Executive IS is developed specifically for
people who run the organization.
• Development process:
– No clean life cycle
– Prototype constantly. Usually have to guess at
what executives will want to see
– Show executives let them come up with ideas
for revisions
– Drill down functionality required

• Tools
– Frequently hand-built, but purchasing a class library can help
lower the development cost.
– May just want to use tools that allow development of a
subscription service in which users may “Subscribe” to a few
canned reports.
220

110
Analytical
Analytical System
System
• Analytical systems are user interfaces developed
for business analysts in an organization.

• Development process:
– Allow users to drag-and-drop data around to further the analysis of
this data.
– More complex interface is acceptable
– Users may be required to know some SQL knowledge

• Tools:
– OLAP Tools are frequently used to build the interface

221

Enterprise
Enterprise Information
Information System
System
• Enterprise IS is written for the general user to
retrieve simple, key information.
• Development process:
– Frequently developed in-house
– So many users around that you really cannot pick a few and ask
what they need.
– Simpler than Executive IS as it does not require drill down
functionality.

• Tools
– Place some simple, key information
on a few screens and control
access and then deploy.

222

111
Summary
Summary of
of Types
Types of
of User
User Interfaces
Interfaces
• Executive Information System
– For the senior executives
– Use in-house development or in -house development augmented by
class libraries

• Analytical System
– OLAP may make sense here as the interface is more complicated,
but OLAP has drawbacks due to:
– Data sparseness
– No well accepted query language

• Enterprise Information System


– Much simpler than executive system
– Good candidate for in-house development

223

112

You might also like