PART - 1

DATA WAREHOUSING CONCEPTS

It's all about data!


DWH Data Warehousing Concepts Page 1 of 293

1. Data Warehousing Concepts


1.1. Introduction
Based on the way the data is used, databases can be classified into two kinds: those used for transactions, i.e.
Online Transaction Processing (OLTP), and those used for analysis, i.e. Online Analytical Processing (OLAP).
Businesses today hold huge amounts of data, and users connect to these databases from across the globe and
round the clock, so the need to maintain a separate database for the sake of analysis is clear.

1.2. OLTP Databases


OLTP databases are what we generally refer to as databases. These are the databases that contain day-to-day
transactions. Typically, an OLTP database has hundreds of users connected to it and performing transactions
round the clock. Most of the time, these transactions insert data into the database. Examples are ATMs,
online shopping, online application filing, and online railway reservations. The number of records inserted is
greater than the number of records updated or deleted. Hence these databases are optimized for insertions.
They are normalized to reduce the redundancy of the data and to improve performance while inserting data.
Third Normal Form (3NF) is the form most commonly found across all types of businesses.
Figure 1.1 shows the architecture of OLTP Databases.

Figure 1.1: OLTP Database Architecture (users connect over a local network or the Internet to a normalized OLTP database)

Normalization is a refinement process for Online Transaction Processing (OLTP) data models. OLTP systems
support the day-to-day operations of a business, for example a financial institution. This is where trades are
booked, executed and settled, and where new products and customers are entered into the computer systems. The
focus is on transaction management: entering, changing and deleting records online in a consistent manner. OLTP
systems are not designed for analysis, reporting and decision support. Dimensional modeling, a completely
different approach, should be used to design Decision Support Systems (DSS).

1.3. OLAP Systems / Decision Support Systems


An OLTP (relational) database and an OLAP (multi-dimensional) database both contain information about your
business. An OLTP database can be used for many different purposes. It is generally optimized so that you can
quickly insert and update records. An OLAP database is generally used to analyze data. It is optimized so that
you can quickly retrieve data. An OLAP database is generally created from the information you have put in an

www.wilshiresoft.com Wilshire Software Technologies Rev. Dt: 18-Oct-07


info@wilshiresoft.com Ph: 2761-2214 / 6677-2214 / 6452-6173 Version: 5

OLTP database. OLAP systems are often referred to as Decision Support Systems. Decision support (DSS,
sometimes also called Business Intelligence or BI) is about synthesizing useful knowledge from large data
sets. It's about integration, summarization and abstraction as well as ratios, trends and allocations. It's about
comparing data-based generalizations with model-based assumptions and reconciling them when they're
different. It's about good, data-facilitated creative thinking and the monitoring of those creative ideas that were
implemented. It's about using all types of data wisely and understanding how derived data was calculated. It's
about continuously learning, and modifying goals and working assumptions based on data-driven models and
experience. In short, business intelligence should function like a virtuous cycle of decision-making improvement.
OLAP systems store data in multidimensional databases. You then access these databases to perform financial
and statistical analyses on different combinations of the data. Vendors offer a variety of OLAP products that you
can group into three categories: relational OLAP (ROLAP), multidimensional OLAP (MOLAP), and hybrid OLAP
(HOLAP).

Relational OLAP (ROLAP)


ROLAP products (e.g., Informix's MetaCube ROLAP Option for the Informix Dynamic Server, MicroStrategy's
DSS Agent) adapt traditional relational databases to support OLAP. Summaries and aggregated data are stored in
the database itself. The ROLAP approach begins with the premise that data does not need to be stored
multi-dimensionally to be viewed multi-dimensionally. A scalable, parallel, relational database provides the
storage and high-speed access to this underlying data. A middle analysis tier provides a multidimensional
conceptual view of the data and extended analytical functionality, which are not available in the underlying
relational server. ROLAP depends on a specialized schema design, and its technology is limited by its
non-integrated, disparate tier architecture. The problem is that the data is physically separated from analytical
processing.
The two important features of ROLAP are:
Data warehouse and relational database are inseparable.
Any change in the dimensional structure requires a physical re-organization of the database, which is too
time consuming. Certain applications are too fluid for this, and the on-the-fly dimensional view of a
ROLAP tool is the only appropriate choice.
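
The ROLAP premise, that data stored relationally can still be viewed multi-dimensionally, can be sketched in a few lines. This is only an illustration using Python's built-in sqlite3 module; the table, column names and figures are assumptions made up for the example and do not come from any of the products named above.

```python
import sqlite3

# Hypothetical flat relational table; all names and values are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, quarter TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("East", "Shirt", "Q1", 100.0),
        ("East", "Shirt", "Q2", 150.0),
        ("East", "Jeans", "Q1", 200.0),
        ("West", "Shirt", "Q1", 120.0),
        ("West", "Jeans", "Q2", 80.0),
    ],
)

ALLOWED = {"region", "product", "quarter"}

def view_by(*dimensions):
    """Build a dimensional view on the fly by grouping the relational rows."""
    if not dimensions or not set(dimensions) <= ALLOWED:
        raise ValueError("choose one or more of: region, product, quarter")
    cols = ", ".join(dimensions)
    sql = f"SELECT {cols}, SUM(amount) FROM sales GROUP BY {cols} ORDER BY {cols}"
    return conn.execute(sql).fetchall()

by_region = view_by("region")
by_region_product = view_by("region", "product")
```

Nothing is stored per view here: each call recomputes the aggregate from the underlying rows, which is the trade-off the MOLAP approach addresses by pre-computing.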

Multidimensional OLAP (MOLAP)


The traditional ER model tends to be too complex and difficult to navigate, as the most important data warehouse
requirement is to have fewer queries accessing large numbers of records. MOLAP servers support
multidimensional views of data through array-based data warehouse servers. They map multidimensional views
of a data cube to array structures. The advantage of using a data cube is that it allows fast indexing to
pre-computed summarized data. With a multidimensional data store, storage utilization may be low when the data
is sparse, so MOLAP is recommended where the data is dense.
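
As a rough sketch of how a multidimensional view can be mapped to an array structure, the example below keeps a small cube as a nested Python list. The dimension members and amounts are invented for illustration, and a real MOLAP server would use far more elaborate storage.

```python
# Dimension members are mapped to array positions (all names are illustrative).
regions = ["East", "West"]
products = ["Shirt", "Jeans"]
quarters = ["Q1", "Q2"]
r_idx = {m: i for i, m in enumerate(regions)}
p_idx = {m: i for i, m in enumerate(products)}
q_idx = {m: i for i, m in enumerate(quarters)}

# The cube is a dense 3-D array: cube[region][product][quarter] = amount sold.
cube = [[[0.0 for _ in quarters] for _ in products] for _ in regions]

facts = [
    ("East", "Shirt", "Q1", 100.0),
    ("East", "Shirt", "Q2", 150.0),
    ("East", "Jeans", "Q1", 200.0),
    ("West", "Shirt", "Q1", 120.0),
]
for region, product, quarter, amount in facts:
    cube[r_idx[region]][p_idx[product]][q_idx[quarter]] += amount

def cell(region, product, quarter):
    """Look up one cell directly by its array coordinates."""
    return cube[r_idx[region]][p_idx[product]][q_idx[quarter]]

# A pre-computed summary: total amount per region.
region_totals = [sum(sum(row) for row in plane) for plane in cube]

# Fraction of cells that hold data; a sparse cube leaves array space unused.
filled = sum(1 for plane in cube for row in plane for v in row if v != 0.0)
density = filled / (len(regions) * len(products) * len(quarters))
```

The density figure shows why storage utilization drops as the data gets sparser: every empty cell still occupies a slot in the array.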

MOLAP vs ROLAP
The following arguments can be given in favour of MOLAP:
Relational tables are unnatural for multidimensional data
Multidimensional arrays provide efficiency in storage and operations
There is a mismatch between multidimensional operations and SQL
For ROLAP to achieve efficiency, it has to perform outside current relational systems, which is the same
as what MOLAP does.
The following arguments can be given in favour of ROLAP:
ROLAP integrates naturally with existing technology and standards
MOLAP does not support ad hoc queries effectively because it is optimized for multidimensional
operations
Since data has to be downloaded into the MOLAP systems, updating is difficult
The efficiency of ROLAP can be achieved by using such techniques as encoding and compression
ROLAP can readily take advantage of parallel relational technology
The claim that MOLAP performs better than ROLAP is intuitively believable.

HOLAP


HOLAP products (e.g., Microsoft SQL Server OLAP Services) combine MOLAP and ROLAP. With HOLAP
products, a relational database stores most of the data. A separate multi-dimensional database stores the densest
data, which is typically a small proportion of the data.

1.4. Data Warehouses


A data warehouse, in its simplest and most generic definition, is a database with huge amounts of data in it.
More precisely, a data warehouse is a multi-dimensional database that is designed for query and analysis rather
than for transaction processing. It usually contains historical data derived from transaction data, but it can
include data from other sources. It separates analysis workload from transaction workload and enables an
organization to consolidate data from several sources.
Data warehousing is the process of making your operational data available to your business managers and
decision support applications. Data warehousing doesn't just make data available; proper warehousing focuses
on efficient information access. Of course, this efficiency doesn't happen magically. First you have to understand
what the business users need from the data and the decision support applications, and then you must evaluate your
current operational data and determine how to transform that data into what the business users request. The tools
that you choose for your warehousing solution will take data from your operational systems (extract it), convert
your operational data into business information using your defined business rules (transform it), and create a data
warehouse (load it).
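
The extract, transform and load steps just described can be sketched as three small functions. This is a minimal illustration with Python's sqlite3 module, not the behaviour of any particular warehousing tool; the table names, columns and the business rule (trim and upper-case customer names, then total the order amounts) are assumptions chosen for the example.

```python
import sqlite3

# Hypothetical operational (OLTP) source and warehouse, both held in memory.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (order_id INTEGER, cust TEXT, qty INTEGER, unit_price REAL)")
source.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [(1, " scott ", 2, 10.0), (2, "ALLEN", 1, 25.0), (3, "scott", 3, 10.0)],
)

warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales_summary (customer TEXT PRIMARY KEY, total_amount REAL)")

def extract(conn):
    """Extract: read the raw rows from the operational system."""
    return conn.execute("SELECT cust, qty, unit_price FROM orders").fetchall()

def transform(rows):
    """Transform: apply the assumed business rules and summarize per customer."""
    totals = {}
    for cust, qty, unit_price in rows:
        name = cust.strip().upper()
        totals[name] = totals.get(name, 0.0) + qty * unit_price
    return sorted(totals.items())

def load(conn, rows):
    """Load: write the business-level rows into the warehouse table."""
    conn.executemany("INSERT INTO sales_summary VALUES (?, ?)", rows)
    conn.commit()

load(warehouse, transform(extract(source)))
result = warehouse.execute("SELECT * FROM sales_summary ORDER BY customer").fetchall()
```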
A common way of introducing data warehousing is to refer to the characteristics of a data warehouse as set forth
by William Inmon:
Subject Oriented
Integrated
Nonvolatile
Time Variant

Subject Oriented
Data warehouses are designed to help you analyze data. For example, to learn more about your company's sales
data, you can build a warehouse that concentrates on sales. Using this warehouse, you can answer questions like
"Who was our best customer for this item last year?" This ability to define a data warehouse by subject
matter, sales in this case, makes the data warehouse subject oriented.

Integrated
Integration is closely related to subject orientation. Data warehouses must put data from disparate sources into a
consistent format. They must resolve such problems as naming conflicts and inconsistencies among units of
measure. When they achieve this, they are said to be integrated.
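
A small sketch of what resolving naming conflicts and unit inconsistencies can look like in practice. The two sources, their column names and the target format are hypothetical; only the pound-to-kilogram factor is a fixed definition.

```python
# Two hypothetical sources describing the same facts with different
# column names and different units of measure.
source_a = [{"cust_name": "Scott", "weight_kg": 2.0}]
source_b = [{"customer": "Allen", "weight_lb": 4.0}]

LB_TO_KG = 0.45359237  # definition of the international pound in kilograms

def from_a(row):
    """Source A already uses kilograms; only the column name changes."""
    return {"customer_name": row["cust_name"], "weight_kg": row["weight_kg"]}

def from_b(row):
    """Source B uses pounds and a different column name; convert both."""
    return {"customer_name": row["customer"], "weight_kg": row["weight_lb"] * LB_TO_KG}

# After integration every row has the same columns and the same unit.
integrated = [from_a(r) for r in source_a] + [from_b(r) for r in source_b]
```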

Nonvolatile
Nonvolatile means that, once entered into the warehouse, data should not change. This is logical because the
purpose of a warehouse is to enable you to analyze what has occurred, and what has already happened does not
change.

Time Variant
In order to discover trends in business, analysts need large amounts of data. This is very much in contrast to
online transaction processing (OLTP) systems, where performance requirements demand that historical data be
moved to an archive. A data warehouse's focus on change over time is what is meant by the term time variant.
Data warehousing technology comprises a set of new concepts and tools which support the knowledge worker
(executive, manager, and analyst) with information material for decision making. The fundamental reason for
building a data warehouse is to improve the quality of information in the organization. The key issue is the
provision of access to a company-wide view of data wherever it resides. Data coming from internal and external
sources, existing in a variety of forms from traditional structured data to unstructured data like text files or
multimedia, is cleaned and integrated into a single repository. A data warehouse (DWH) is the consistent store of
this data, which is made available to end users in a way they can understand and use in a business context.
Figure 1.2 shows the Data Warehouse Architecture.
As Figure 1.2 depicts, data warehouses get their data from multiple OLTP sources. These sources need not
be maintained using the same database management system. They may have different structures as well. To
understand this better, let us consider an example of a garment manufacturing company that has retail outlets

across the globe. It may also have various web-based applications and/or portals through which customers can
place their orders online. In a scenario like this, the data need not be maintained in the same fashion at all these
places. The number of orders placed online may be higher than at any given outlet, since outlets are
geographically restricted. Hence, it is wise to go with software like Oracle to maintain the information about online
orders. Based on the average number of transactions that occur per outlet per day, some outlets may
maintain their data in SQL Server, some in Access, some in DB2 Universal Database, and so on. When
business analysts of such a company need a sales report for the whole company, they need to integrate all the
data from these various sources. Executing such an ad hoc query across all these transactional databases
does not yield a fast response. So there arises a need to maintain a separate database that is
used only for querying purposes. This database needs to be periodically updated from the transactional
databases. It is this query-purpose database that is referred to as a Data Warehouse. Data from the transactional
databases is not brought directly into the data warehouse. Instead, it first passes through processes such as
Data Cleansing.

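Figure 1.2: Data Warehouse Architecture (OLTP sources such as SQL Server, Oracle, DB2 UDB, Access and flat files; staging area; data warehouse with metadata; data marts; multi-dimensional cubes; business intelligence; end users and analysts)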

1.5. Data Warehouse Life Cycle

Data Cleansing
Data Cleansing is the process of cleansing or validating the data brought from multiple sources. The source data
may be invalid for more than one reason. The data might have become invalid because of improper manual
data entry at the OLTP level. Organizational policies change with time, and so does the business logic. Data
from the OLTP source becomes invalid if it no longer meets the new business logic and policies.

Extracting, Transforming and Loading (ETL)
Extracting, Transforming and Loading (ETL) is the process of reading (extracting) data from heterogeneous
sources, transforming it so that the discrete data from different sources gets integrated, and then loading it into
the target Data Warehouse. During this process, data from multiple tables may be merged into one, and/or data
from a single table may be routed into multiple tables, and/or it can be sorted, grouped, filtered and so on. These
operations are done in a special dedicated area, well known as the Staging Area, where all the data from all the
sources is first dumped. Software like Informatica PowerCenter and Oracle Warehouse Builder is used for these
operations.

Figure 1.3: An overview of ETL (data from the OLTP sources passes through data cleansing and transformations such as ranking, aggregation, splitting, filtering, calculations and merging on its way to the warehouse)

Meta Data
Metadata is to the data warehouse what the card catalogue is to the traditional library. It serves to identify the
contents and location of data in the warehouse. Metadata is a bridge between the data warehouse and the
decision support application. It answers questions such as: What does this field mean in business terms? Which
business process does this set of queries support? When did the job to update the customer data in
our data mart last run? A metadata repository should contain:
A description of the structure of the data warehouse. This includes the schema, views, dimensions,
hierarchies and derived data definitions, data mart locations and contents, etc.
Operational metadata such as data linkages, currency of data and monitoring information (warehouse
usage statistics, error reports and audit trails).
The summarization processes, which include dimension definition, data on granularity, partitions,
summary measures, aggregation, summarization, etc.
Details of data sources, which include source databases and their contents, gateway descriptions, data
extraction, cleaning and transformation rules and defaults.
Data related to system performance, which includes indices and profiles that improve data access and
retrieval performance.
Business metadata, which includes business terms and definitions, data ownership information and
changing policies.

Data Marts
Data marts contain the summarized data of the warehouses and are referred to as High Performance Query
Structures. They consist of materialized views and special indexes. In some businesses these data marts may be
maintained within the warehouses, whereas in other scenarios they may be maintained apart from the data
warehouses.

Multi-dimensional Cubes
A cube is a structure that stores your business data in a multi-dimensional format that makes it easy to analyze.
Designed to be departmental, and optimized for performance, a multi-dimensional OLAP cube consists of
aggregated, summarized, and pre-calculated data. Usually each cube contains data that focuses on a specific
aspect of the business, such as sales data, financial data, or data for tracking inventory. Each cube is usually
designed to address a specific business question. When you create a report, you connect to a cube, and use the
data from that cube in your report.

Figure 1.4: Data Marts (Sales, Accounts and Finance data marts alongside the data warehouse)

Business Intelligence
Business Intelligence comprises console-based, window-based and/or web-based applications that we
use for querying our data warehouses. These applications provide security for the data being accessed, and are
more user friendly for non-technical personnel to operate. They also allow the end users to perform
operations such as drilling, knowledge discovery and many others.

Figure 1.5: Multi-dimensional Cubes (a cube with the dimensions Products, Regions and Time)

1.6. Data Mining / Knowledge Discovery In Databases


Data mining (DM), also known as knowledge discovery in databases (KDD), is the non-trivial extraction of
implicit, previously unknown and potentially useful information from data.
Data mining is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable
patterns in data. With the widespread use of databases and the explosive growth in their sizes, organizations are
faced with the problem of information overload. Effectively utilizing these massive volumes of data
is becoming a major problem for all enterprises.
Traditionally, we have queried a reliable database repository via some well-circumscribed
application or canned report-generating utility. While this mode of interaction is satisfactory for a large class of
applications, there exist many other applications which demand exploratory data analyses. Data mining
techniques support automatic exploration of data. Data mining attempts to seek out patterns and trends in the
data and infers rules from these patterns. With these rules, users will be able to support, review and examine
decisions in some related business or scientific area. This opens up the possibility of a new way of interacting with
databases and data warehouses.
Consider, for example, a banking application where the manager wants to know whether there is a specific pattern
followed by defaulters. It is hard to formulate a SQL query for such information. It is generally accepted that if we
know the query precisely, we can turn to a query language to formulate the query. But if we have only some vague
idea and do not know the query precisely, then we can resort to data mining techniques.

Figure 1.6: Data Mining Architecture (databases and flat files; cleansing and integration; data warehouse; selection and transformation; data mining; pattern recognition; knowledge)

Drill up & Drill Down
Drilling is the term used for navigating through the warehouse data along a given dimension. For example, we can
drill the sales data by region. In that case, viewing the region-wise sales report, then the sub-region-wise sales
report for a given region, then moving on to country-wise sales and then to state-wise sales is what is referred
to as drilling down the data. Drilling up is navigating back.

Slicing & Dicing


Slicing & dicing the data means analyzing the same data in different fashions and groups. Let us consider a sales
report that measures sales by the number of items sold region-wise, time-wise and product-category-wise.
Changing the order and the way we view the data within these given dimensions is what is known as slicing & dicing.


Figure 1.7: Drill Operations (hierarchy Region, Sub Region, Country, State, City; drilling down the data moves from Region towards City, and drilling up moves back towards Region)
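
The drill operations can be sketched as re-aggregating the same rows at a deeper or shallower level of the location hierarchy. The rows and place names below are invented for illustration; the level names follow the hierarchy used in the text.

```python
from collections import defaultdict

# Levels of the location hierarchy, from most summarized to most detailed.
LEVELS = ["region", "sub_region", "country", "state", "city"]

# Hypothetical sales rows: one value per level, followed by the amount sold.
sales = [
    ("Asia", "South Asia", "India", "Telangana", "Hyderabad", 100),
    ("Asia", "South Asia", "India", "Telangana", "Warangal", 40),
    ("Asia", "South Asia", "India", "Karnataka", "Bengaluru", 60),
    ("Europe", "Western Europe", "France", "Ile-de-France", "Paris", 80),
]

def report(level):
    """Total the amounts at the given level of the hierarchy."""
    depth = LEVELS.index(level) + 1
    totals = defaultdict(int)
    for row in sales:
        totals[row[:depth]] += row[-1]
    return dict(totals)

def drill_down(level):
    """Move one level towards the detail (stops at the lowest level)."""
    return LEVELS[min(LEVELS.index(level) + 1, len(LEVELS) - 1)]

def drill_up(level):
    """Move one level back towards the summary (stops at the top level)."""
    return LEVELS[max(LEVELS.index(level) - 1, 0)]

region_report = report("region")
state_report = report("state")
```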

Figure 1.8: Slicing & Dicing (the same data arranged by Quarter, Region and Product in different orders)
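
A small sketch of slicing and dicing on in-memory rows. The data and the function names are assumptions for illustration; the point is that the same rows are filtered and rearranged along the region, time and product category dimensions without changing the underlying data.

```python
from collections import defaultdict

# Hypothetical fact rows: (region, quarter, product_category, items_sold).
rows = [
    ("East", "Q1", "Shirts", 10),
    ("East", "Q2", "Shirts", 15),
    ("East", "Q1", "Jeans", 5),
    ("West", "Q1", "Shirts", 8),
    ("West", "Q2", "Jeans", 12),
]
DIMS = {"region": 0, "quarter": 1, "category": 2}

def slice_(data, dim, value):
    """Slice: fix one dimension to a single member."""
    return [r for r in data if r[DIMS[dim]] == value]

def dice(data, **members):
    """Dice: keep a sub-cube by restricting several dimensions to sets of members."""
    return [r for r in data if all(r[DIMS[d]] in allowed for d, allowed in members.items())]

def pivot(data, row_dim, col_dim):
    """View the same data with a different choice of row and column dimensions."""
    table = defaultdict(int)
    for r in data:
        table[(r[DIMS[row_dim]], r[DIMS[col_dim]])] += r[3]
    return dict(table)

q1_only = slice_(rows, "quarter", "Q1")
east_shirts = dice(rows, region={"East"}, category={"Shirts"})
region_by_quarter = pivot(rows, "region", "quarter")
category_by_region = pivot(rows, "category", "region")
```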


2. Data Warehousing Modeling


2.1 Logical Modeling
2.1.1 Data Warehouse Tables
Tables in a data warehouse can be classified into two kinds: dimensions and facts. Dimensions are the tables that
contain the descriptive information needed to analyze the business. Facts contain the measurable quantities by
which the business is measured.

Facts or Fact Tables


A fact table is the primary table in a dimensional model where the numerical performance measurements of the
business are stored. We strive to store the measurement data resulting from a business process in a single data
mart. Since measurement data is overwhelmingly the largest part of any data mart, we avoid duplicating it in
multiple places around the enterprise. We use the term fact to represent a business measure. Each row in a fact
table corresponds to one measurement. All the measurements in a fact table must be at the same grain. The most
useful facts are numeric and additive. Fact tables generally have fewer columns and more rows than dimension
tables.

Dimension tables
Dimension tables contain textual descriptors of the business. Dimension tables are integral companions to a fact
table. Most often, if not always, dimension tables have many columns or attributes. It is not uncommon for a
dimension table to have 50 to 100 attributes. Dimension tables generally have more columns and fewer
rows than fact tables. Dimension table attributes play a vital role, as they are the key to making the data
warehouse usable and understandable. The power of dimension tables is directly proportional to the quality and
depth of the dimension attributes.

Figure 2.1: Fact and Dimension Tables

Dimension Table: Products
Product ID, Product Name, Product Description, Category, Category Description, Sub Category, Sub Category
Description, List Price, Minimum Price, Model Number, Unit of Measure, Supplier Name

Fact Table: Sales
Product ID, Customer ID, Sales Date, Quantity Sold, Amount Sold


2.1.2 Data Warehouse Columns


Dimensions
A cube contains dimensions and measures. A dimension is a component of a cube; it groups related business
data, such as product lines, sales regions, or time. Dimensions become the axis labels for the
columns and rows of your reports. Dimensions have levels. A level is a component of a dimension; it specifies the
amount of detail for the data. Each level above the lowest level contains the aggregated data from the level below.
The lowest level contains the most detailed data and is called a Detail; the highest level contains the most
summarized data. Dimensions also have members. For example, the dimension USA could contain the members
California and Los Angeles. A member is a subset of a dimension and the cube equivalent of a value in a relational
column. Members are organized within a dimension by levels, for example Country, State/Province, and City.
Members at the lowest level are aggregated to members at higher levels. For example, the value of California is an
aggregate of Los Angeles, San Francisco, and so on.

Details
Details are the dimensions beyond which an analyst is not interested in analyzing. A detail is always a dimension.
Let us consider an example of a sales schema. Suppose the analysts view reports based on the region dimension,
in which sales are analyzed region-wise, sub-region-wise, country-wise, state-wise and city-wise. The data in
the data warehouse may also contain dimensions like locality or area within the city. However, reports are
taken only down to the level of the city. In such a case, city is considered a detail.

Measures
There is also a special dimension called measures. Measures are the numbers on which you make your
comparisons. They include members such as cost, profit, or taxes. Measures are the measurable quantities upon
which the business is measured. They are always numeric quantities. They are the ones on which we perform
aggregations.

Figure 2.2: Dimensions, Details and Measures (Region dimension with the levels Region, Sub Region, Country, State and City; Sales table with Product ID, Customer ID, Date, Quantity Sold and Amount Sold; labels: Levels, Detail, Measures)

2.1.3 Data Warehouse Issues

Critical Columns
The huge collection of data within a warehouse is useful for analysis only if the data in it is consistent. However,
there exists some information which, when updated in a warehouse, makes the data in the warehouse inconsistent.
The columns that contain such vital information are identified as Critical Columns. Critical columns exist only
at the OLTP source. An example is the city in which a customer resides. Let us say there exists a customer
Scott in a city A. He has been residing in the city for the past 5 years. In these five years, let us
say he has made purchases worth Rs. 3 lakhs. Now, Scott has moved to a city B. When you update
Scott's city to B in your warehouse, all the purchases originally made in city A by Scott will be shown as if made
in city B. This makes the warehouse data inconsistent. Hence the customer city can be identified as a critical
column.

Surrogate Keys
Having a surrogate key is the solution to the critical column problem discussed above. A surrogate key is to a data
warehouse what a primary key is to an OLTP source. It is used to uniquely identify a record in dimension
tables. For the problem discussed above, the solution is to insert a new record into your warehouse
whenever a critical column gets updated in the OLTP sources. This would violate the primary key constraint that
might have been specified for the column Customer ID. Hence, we no longer maintain the Customer ID as a primary
key; instead we have one more column acting as the so-called Surrogate Key, which typically contains a value
generated from a sequence. So whenever a customer changes city, we insert a new record into our warehouse with the
same customer ID but with a different surrogate key. A customer's latest city can be identified by the largest
surrogate key value for that customer.
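
The Scott example can be sketched as follows. This is an illustrative in-memory model, not a prescribed implementation: the column names, the customer ID 101, the second purchase amount and the use of itertools.count in place of a database sequence are all assumptions made for the example.

```python
import itertools

surrogate_seq = itertools.count(1)  # stands in for a database sequence
customer_dim = []  # rows of the customer dimension table
sales_fact = []    # each sale points at a surrogate key, not at the customer ID

def add_customer_version(customer_id, name, city):
    """Insert a new dimension row; earlier rows for the same customer are kept."""
    key = next(surrogate_seq)
    customer_dim.append(
        {"customer_key": key, "customer_id": customer_id, "name": name, "city": city}
    )
    return key

def current_version(customer_id):
    """The latest version is the row with the largest surrogate key."""
    versions = [r for r in customer_dim if r["customer_id"] == customer_id]
    return max(versions, key=lambda r: r["customer_key"])

def sales_by_city():
    """Attribute each sale to the city that was current when the sale was made."""
    city_of = {r["customer_key"]: r["city"] for r in customer_dim}
    totals = {}
    for sale in sales_fact:
        city = city_of[sale["customer_key"]]
        totals[city] = totals.get(city, 0) + sale["amount"]
    return totals

k1 = add_customer_version(101, "Scott", "City A")
sales_fact.append({"customer_key": k1, "amount": 300000})  # purchases made in city A
k2 = add_customer_version(101, "Scott", "City B")          # Scott moves to city B
sales_fact.append({"customer_key": k2, "amount": 5000})    # a later purchase in city B
```

Because the old dimension row is never overwritten, the purchases made in city A stay attributed to city A even after the move.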

IMPORTANT
To identify the records whose critical columns have been updated at an OLTP source, several
methods are followed, of which Update Flag and Time Stamp are of major use. Update flag is
a Boolean column whose value is true for all the records that have been inserted or modified
after the last update of the data warehouse. Newly inserted or modified records can also be
identified with the help of a time stamp column maintained in the OLTP sources.
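
Both change-detection methods can be sketched on a handful of in-memory rows. The column names, dates and the step that clears the flag after extraction are assumptions for illustration; the source text only says that the flag is true for records changed since the last warehouse update.

```python
from datetime import datetime

# Hypothetical OLTP rows carrying both change-tracking columns.
customers = [
    {"customer_id": 1, "city": "A", "updated": False, "modified_at": datetime(2007, 10, 1, 9, 0)},
    {"customer_id": 2, "city": "B", "updated": True, "modified_at": datetime(2007, 10, 18, 14, 30)},
    {"customer_id": 3, "city": "C", "updated": True, "modified_at": datetime(2007, 10, 19, 8, 15)},
]
last_warehouse_load = datetime(2007, 10, 15, 0, 0)

def changed_by_timestamp(rows, since):
    """Pick rows modified after the previous warehouse load."""
    return [r for r in rows if r["modified_at"] > since]

def changed_by_flag(rows):
    """Pick rows whose update flag is set, then clear it for the next cycle (assumed step)."""
    changed = [r for r in rows if r["updated"]]
    for r in changed:
        r["updated"] = False
    return changed

by_time = [r["customer_id"] for r in changed_by_timestamp(customers, last_warehouse_load)]
by_flag = [r["customer_id"] for r in changed_by_flag(customers)]
second_pass = changed_by_flag(customers)  # nothing left once the flags are cleared
```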

2.1.4 Data Warehouse Schema


How do we go about actually designing the warehouse? In this section, we address the problem of designing data
warehouse schemas for this purpose.

Star Schema
A star schema is a modeling paradigm in which the data warehouse contains a large, single central Fact Table
and a set of smaller Dimension Tables, one for each dimension. The fact table contains the detailed summary
data. Its primary key has one key per dimension. Each dimension is a single, highly de-normalized table. Every
tuple in the fact table consists of the fact or subject of interest and the dimensions that provide that fact. Each
tuple of the fact table consists of a key pointing to each of the dimension tables that provide its multidimensional
coordinates. It also stores numerical values for those coordinates. The dimension tables consist of columns that
correspond to the attributes of the dimension.

IMPORTANT
Each tuple in the fact table corresponds to one and only one tuple in each Dimension Table,
whereas one tuple in a Dimension Table may correspond to more than one tuple in the Fact Table.
So we have a 1:N relationship between the Fact Table and the Dimension Table.

Figure 2.3: Star Schema
Fact Table: Sales (Product ID, Customer ID, Time ID, Cost ID, Quantity Sold, Amount Sold)
Dimension Tables:
Products (Product ID, Product Name, Sub Category, Category, List Price, Minimum Price, Unit of Measure, Model
Number, Supplier Name)
Time (Time ID, Day, Month, Quarter, Year, Day of Week, Day of Year, Day Name, Fiscal Month, Fiscal Quarter,
Fiscal Year)
Customers (Customer ID, Customer Name, Region, Sub Region, Country, State, City, Area, Phone Number, Email
Address, Marital Status)
Costs (Cost ID, Product ID, Time ID, Unit Cost, Unit Price)

Snowflake schema
We noticed that a star schema consists of a single fact table and a single de-normalized dimension table for each
dimension of the multidimensional data model. To support attribute hierarchies, the dimension tables can be
normalized to create a snowflake schema. A snowflake schema consists of a single fact table and multiple
dimension tables. Like the star schema, each tuple of the fact table consists of a key pointing to each of the
dimension tables that provide its multidimensional coordinates. It also stores numerical values for those
coordinates. Dimension tables in a star schema are de-normalized, while those in a snowflake schema are
normalized. The advantage is that the tables are easier to maintain. It also saves storage space.
However, it may reduce the effectiveness of navigating across the tables due to a large number of join operations.

IMPORTANT
The major difference between a Star schema and a Snowflake schema is that a Star schema is highly
de-normalized, whereas a Snowflake schema is partially normalized.
A Snowflake schema is preferred when a data warehouse is used most of the time as a source for another,
higher-end data warehouse rather than for direct analysis.

Figure 2.4: Snowflake Schema
Fact Table: Sales (Product ID, Customer ID, Time ID, Cost ID, Quantity Sold, Amount Sold)
Dimension Tables:
Products (Product ID, Product Name, Sub Cat ID, List Price, Minimum Price, Unit of Measure, Model Number,
Supplier Name)
Sub Categories (Sub Cat ID, Sub Cat Name, Category ID)
Categories (Category ID, Category Name)
Time (Time ID, Day, Month, Quarter, Year, Day of Week, Day of Year, Day Name, Fiscal ID)
Fiscal Time (Fiscal ID, Fiscal Day, Fiscal Month, Fiscal Quarter, Fiscal Year)
Customers (Customer ID, Customer Name, City ID, Area, Phone Number, Email Address, Marital Status)
Cities (City ID, City Name, Country ID)
Countries (Country ID, Country Name, Sub Region ID)
Sub Regions (Sub Region ID, Sub Region Name, Region Name)
Costs (Cost ID, Product ID, Time ID, Unit Cost, Unit Price)

Fact Constellation
Often there is a need for more than one Fact Table; such schemas are called Fact Constellations. A Fact
Constellation is a schema in which two or more Fact Tables share some Dimension Tables. It is also called a
Galaxy schema. If such a schema is highly de-normalized, it is also called a Multi-Star schema.

Figure 2.5: Fact Constellation

2.2. Physical Modeling


Logical design is what you draw with a pen and paper or design with Oracle Warehouse Builder or Designer
before building your warehouse. Physical design is the creation of the database with SQL statements. During the
physical design process, you convert the data gathered during the logical design phase into a description of the
physical database structure. Physical design decisions are mainly driven by query performance and database
maintenance aspects. For example, choosing a partitioning strategy that meets common query requirements
enables Oracle to take advantage of partition pruning, a way of narrowing a search before performing it.

NOTE
A few topics in the Physical Modeling section of this material deal with data warehousing concepts relative to
Oracle 9i. Readers should note that the concepts discussed here might vary with the database that is being used.
Readers who are not interested in database administration can skip this section and proceed further.

Figure 2.6

During the physical design process, you translate the expected schemas into actual database structures. At this
time, you have to map:
Entities to Tables
Relationships to Foreign Key Constraints
Attributes to Columns
Primary Unique Identifiers to Primary Key Constraints
Unique Identifiers to Unique Key Constraints
Some of these structures require disk space. Others exist only in the data dictionary. Additionally, the following
structures may be created for performance improvement:
Indexes and Partitioned Indexes
Materialized Views
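
The mapping from logical to physical design listed above can be sketched as follows. The example uses SQLite through Python's sqlite3 module purely for illustration; the customers and sales tables and their columns are hypothetical, and Oracle DDL would differ in its details.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
CREATE TABLE customers (                      -- entity -> table
    customer_id   INTEGER PRIMARY KEY,        -- primary unique identifier -> primary key
    email_address TEXT UNIQUE,                -- unique identifier -> unique key
    customer_name TEXT NOT NULL               -- attribute -> column
);
CREATE TABLE sales (
    sale_id     INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL
        REFERENCES customers(customer_id),    -- relationship -> foreign key
    amount_sold REAL
);
-- Optional structure created purely for performance.
CREATE INDEX sales_customer_idx ON sales(customer_id);
""")

conn.execute("INSERT INTO customers VALUES (1, 'a@example.com', 'Asha')")
conn.execute("INSERT INTO sales VALUES (1, 1, 250.0)")

# A sale that points at a non-existent customer violates the foreign key.
try:
    conn.execute("INSERT INTO sales VALUES (2, 99, 10.0)")
    fk_rejected = False
except sqlite3.IntegrityError:
    fk_rejected = True
```

The constraints live in the data dictionary, while the tables and the index take up storage, which mirrors the distinction made in the text.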

Tablespaces
A tablespace consists of one or more data files, which are physical structures within the operating system you are
using. A data file is associated with only one tablespace. From a design perspective, tablespaces are containers
for physical design structures. Tablespaces should be separated according to the characteristics of the objects they hold. For example, tables should be
separated from their indexes and small tables should be separated from large tables. Tablespaces should also
represent logical business units if possible. Because a tablespace is the coarsest granularity for backup and
recovery or the transportable tablespaces mechanism, the logical business design affects availability and
maintenance operations.

Tables and Partitioned Tables


Tables are the basic unit of data storage. They are the container for the expected amount of raw data in your data
warehouse. Using partitioned tables instead of non partitioned ones addresses the key problem of supporting very
large data volumes by allowing you to decompose them into smaller and more manageable pieces. The main
design criterion for partitioning is manageability, though you will also see performance benefits in most cases
because of partition pruning or intelligent parallel processing. For example, you might choose a partitioning
strategy based on a sales transaction date and a monthly granularity. If you have four years' worth of data, you
can delete a month's data as it becomes older than four years with a single, quick DDL statement and load new
data while only affecting 1/48th of the complete table. Business questions regarding the last quarter will only affect
three months, which is equivalent to three partitions, or 3/48ths of the total volume. Partitioning large tables
improves performance because each partitioned piece is more manageable. Typically, you partition based on
transaction dates in a data warehouse. For example, each month, one month's worth of data can be assigned its
own partition.

Figure 2.6: Logical designing vs. Physical designing
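
The arithmetic of the monthly partitioning example can be checked with a small sketch. The dictionary below merely stands in for a partitioned table; the years chosen and the placeholder row strings are assumptions made for illustration.

```python
from fractions import Fraction

# Hypothetical monthly partitions covering four years; keys are (year, month).
partitions = {
    (year, month): f"rows for {year}-{month:02d}"
    for year in range(2004, 2008)
    for month in range(1, 13)
}
total = len(partitions)

# A question about the last quarter only needs the three newest partitions.
last_quarter = sorted(partitions)[-3:]
scanned_fraction = Fraction(len(last_quarter), total)

# Rolling window: drop the oldest month and add the newest one.
oldest = min(partitions)
del partitions[oldest]
partitions[(2008, 1)] = "rows for 2008-01"
```

Dropping one key and adding another touches a single partition each time, which corresponds to the 1/48th of the table mentioned above, while the last-quarter query reads 3 of the 48 partitions.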

Views


A view is a tailored presentation of the data contained in one or more tables or other views. A view takes the
output of a query and treats it as a table. Views do not require any space in the database.

Integrity Constraints
Integrity constraints are used to enforce business rules associated with your database and to prevent having
invalid information in the tables. Integrity constraints in data warehousing differ from constraints in OLTP
environments. In OLTP environments, they primarily prevent the insertion of invalid data into a record, which is not
a big problem in data warehousing environments because accuracy has already been guaranteed. In data
warehousing environments, constraints are only used for query rewrite. NOT NULL constraints are particularly
common in data warehouses. Under some specific circumstances, constraints need space in the database. These
constraints are in the form of the underlying unique index.

Indexes and Partitioned Indexes


Indexes are optional structures associated with tables or clusters. In addition to the classical B-tree indexes,
bitmap indexes are very common in data warehousing environments. Bitmap indexes are optimized index
structures for set-oriented operations. Additionally, they are necessary for some optimized data access methods
such as star transformations. Indexes are just like tables in that you can partition them, although the partitioning
strategy is not dependent upon the table structure. Partitioning indexes makes it easier to manage the warehouse
during refresh and improves query performance.

Materialized Views
Materialized views are query results that have been stored in advance so long-running calculations are not
necessary when you actually execute your SQL statements. From a physical design point of view, materialized
views resemble tables or partitioned tables.

Dimensions
A dimension is a schema object that defines hierarchical relationships between columns or column sets. A
hierarchical relationship is a functional dependency from one level of a hierarchy to the next one. A dimension is a
container of logical relationships and does not require any space in the database. A typical dimension is city, state
(or province), region, and country.

2.2.1. Hardware and I/O Considerations


Data warehouses are normally very concerned with I/O performance. This is in contrast to OLTP systems, where
the potential bottleneck depends on user workload and application access patterns. When a system is
constrained by I/O capabilities, it is I/O bound, or has an I/O bottleneck. When a system is constrained by
having limited CPU resources, it is CPU bound, or has a CPU bottleneck. Database architects frequently use
RAID (Redundant Arrays of Inexpensive Disks) systems to overcome I/O bottlenecks and to provide higher
availability. RAID can be implemented in several levels, ranging from 0 to 7. Many hardware vendors have
enhanced these basic levels to lessen the impact of some of the original restrictions at a given RAID level.


RAID 0 Striping
To avoid I/O bottlenecks during parallel processing or concurrent query access, all tablespaces accessed by
parallel operations can be striped. Striping divides the data of a large table into small portions and stores them on
separate data files on separate disks. As shown in Figure 2.7, tablespaces should always stripe over at least as
many devices as CPUs. In this example, there are four CPUs, two controllers, and five devices containing
tablespaces.
RAID 0 is a non-redundant disk array, so there will be data loss with any disk failure. If something on the disk
becomes corrupted, you cannot restore or recalculate that data. RAID 0 provides the best write throughput
performance because it never updates redundant information. Read throughput is also quite good, but you can
improve it by combining RAID 0 with RAID 1. Oracle does not recommend using RAID 0 systems without RAID 1
because the loss of one disk in the array will affect the complete system and make it unavailable. RAID 0 systems
are used mainly in environments where performance and capacity are the primary concerns rather than
availability.
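
As a rough illustration of striping, the sketch below places logical blocks on disks in round-robin order. The function name stripe_location and the block counts are assumptions; real RAID controllers stripe in larger units and add their own metadata.

```python
def stripe_location(block, n_disks):
    """Return (disk index, stripe row) for a logical block under round-robin striping."""
    return block % n_disks, block // n_disks

N_DISKS = 5  # matches the five devices in the example above

layout = {disk: [] for disk in range(N_DISKS)}
for block in range(12):
    disk, _row = stripe_location(block, N_DISKS)
    layout[disk].append(block)

# With no redundancy, losing any one disk loses every block stored on it.
lost_if_disk_2_fails = layout[2]
```

Consecutive blocks land on different disks, so a large scan keeps all devices busy at once. The same layout shows why a single disk failure damages the whole striped table.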

RAID 1 Mirroring
RAID 1 provides full data redundancy by complete mirroring of all files. If a disk failure occurs, the mirrored copy
is used to transparently service the request. RAID 1 mirroring requires twice as much disk space as there is data.
In general, RAID 1 is most useful for systems where complete redundancy of data is required and disk space is
not an issue. For large data files or systems with less disk space, RAID 1 may not be feasible. Writes under RAID 1
are no faster and no slower than usual. Reading
data can be faster than on a single disk because the system can choose to read the data from the disk that can
respond faster.

RAID 0 + 1 (Striping and Mirroring)


RAID 0+1 offers the best performance of all RAID systems, but costs the most because you double the number of
drives. Basically, it combines the performance of RAID 0 and the fault tolerance of RAID 1. You should consider
RAID 0+1 for data files with high write rates, for example, table data files, and online and archived redo log files.

2.2.2. Parallelism
Data warehouses often contain large tables and require techniques both for managing these large tables and for
providing good query performance across these large tables. Parallel execution dramatically reduces response
time for data-intensive operations on large databases typically associated with decision support systems (DSS)
and data warehouses. You can also implement parallel execution on certain types of online transaction
processing (OLTP) and hybrid systems. Parallel execution is sometimes called parallelism. Simply expressed,
parallelism is the idea of breaking down a task so that, instead of one process doing all of the work in a query,
many processes do part of the work at the same time. An example of this is when four processes handle four
different quarters in a year instead of one process handling all four quarters by itself. The improvement in
performance can be quite high. In this case, each quarter will be a partition, a smaller and more manageable unit
of an index or table.
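
The four-quarters example can be sketched in Python. The figures in sales_by_quarter are invented, and a thread pool is used only to show the divide-and-combine pattern; a database performs this with its own server processes, and the sketch makes no claim about actual speed-up.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sales amounts, one list per quarter (each acting as a partition).
sales_by_quarter = {
    "Q1": [120.0, 80.0, 50.0],
    "Q2": [200.0, 25.0],
    "Q3": [75.0, 75.0, 100.0],
    "Q4": [300.0],
}

def quarter_total(quarter):
    """Work done by one worker: aggregate a single partition."""
    return quarter, sum(sales_by_quarter[quarter])

# Four workers, one per quarter; the coordinator combines the partial results.
with ThreadPoolExecutor(max_workers=4) as pool:
    partial = dict(pool.map(quarter_total, sales_by_quarter))

year_total = sum(partial.values())

# The same answer computed by a single worker, for comparison.
serial_total = sum(sum(rows) for rows in sales_by_quarter.values())
```

Each worker sees only its own partition, and the final step is a cheap combination of four partial sums.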

Figure 2.7: RAID 0 Striping, RAID 1 Mirroring, and RAID 0 + 1 Striping & Mirroring


The most common use of parallel execution is in DSS environments. Complex queries, such as those involving
joins of several tables or searches of very large tables, are often best executed in parallel. Parallel execution is
useful for many types of operations that access significant amounts of data. Parallel execution improves
processing for:
Large table scans and joins
Creation of large indexes
Partitioned index scans
Bulk inserts, updates, and deletes
Aggregations and copying
You can also use parallel execution to access object types within an Oracle database. For example, use parallel
execution to access LOBs (large objects). Parallel execution benefits systems that have all of the following
characteristics:
Symmetric multi-processors (SMP), clusters, or massively parallel systems
Sufficient I/O bandwidth
Underutilized or intermittently used CPUs (for example, systems where CPU usage is typically less than 30%)
Sufficient memory to support additional memory-intensive processes such as sorts, hashing, and I/O buffers


If your system lacks any of these characteristics, parallel execution might not significantly improve performance.
In fact, parallel execution can reduce system performance on overutilized systems or systems with small I/O
bandwidth.

2.2.3. Partitioning
In conjunction with parallel execution, partitioning can improve performance in data warehouses. Partitioned
tables and indexes facilitate administrative operations by enabling these operations to work on subsets of data.
For example, you can add a new partition, organize an existing partition, or drop a partition and cause less than a
second of interruption to a read-only application.

Partitioning Methods
Oracle offers four partitioning methods:
Range Partitioning
Hash Partitioning
List Partitioning
Composite Partitioning
Each partitioning method has different advantages and design considerations. Thus, each method is more
appropriate for a particular situation.

Range Partitioning
Range partitioning maps data to partitions based on ranges of partition key values that you establish for each
partition. It is the most common type of partitioning and is often used with dates. For example, you might want to
partition sales data into monthly partitions.
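
A minimal sketch of the range mapping, assuming three monthly partitions with exclusive upper bounds. The partition names and dates are hypothetical, and the lookup uses Python's bisect module rather than anything Oracle-specific.

```python
import bisect
from datetime import date

# Hypothetical exclusive upper bounds, one per partition, in ascending order.
bounds = [date(2007, 2, 1), date(2007, 3, 1), date(2007, 4, 1)]
names = ["sales_jan2007", "sales_feb2007", "sales_mar2007"]

def range_partition(sale_date):
    """Return the name of the partition whose range contains sale_date."""
    i = bisect.bisect_right(bounds, sale_date)
    if i == len(bounds):
        raise ValueError(f"{sale_date} is beyond the highest partition bound")
    return names[i]
```

In this sketch the first partition has no lower bound, so any date before 1 February 2007 falls into sales_jan2007, and a date on or after 1 April 2007 is rejected because no partition covers it.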

Hash Partitioning
Hash partitioning maps data to partitions based on a hashing algorithm that Oracle applies to a partitioning key
that you identify. The hashing algorithm evenly distributes rows among partitions, giving partitions approximately
the same size. Hash partitioning is the ideal method for distributing data evenly across devices. Hash partitioning
is a good and easy-to-use alternative to range partitioning when data is not historical and there is no obvious
column or column list where logical range partition pruning can be advantageous. Oracle uses a linear hashing
algorithm; to prevent data from clustering within specific partitions, you should make the number of partitions
a power of two (for example, 2, 4, 8).
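
The power-of-two advice can be illustrated with a textbook linear hashing scheme. This is a sketch under assumptions: it is not Oracle's internal hash function, and consecutive integers stand in for uniformly distributed hash values.

```python
from collections import Counter

def linear_hash_partition(hash_value, n_partitions):
    """Textbook linear hashing: use the next power of two, fold overflow back."""
    size = 1
    while size < n_partitions:
        size *= 2
    bucket = hash_value % size
    if bucket >= n_partitions:
        # Buckets that do not exist yet fall back to the smaller modulus.
        bucket = hash_value % (size // 2)
    return bucket

def distribution(n_partitions, n_rows=8000):
    """Count rows per partition, using row numbers as stand-in hash values."""
    counts = Counter(linear_hash_partition(h, n_partitions) for h in range(n_rows))
    return [counts[p] for p in range(n_partitions)]

even = distribution(8)    # power of two
skewed = distribution(6)  # not a power of two
```

With eight partitions every partition receives the same number of rows. With six, two of the partitions absorb the rows of the two missing buckets and end up twice as large as the rest.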

List Partitioning
List partitioning enables you to explicitly control how rows map to partitions. You do this by specifying a list of
discrete values for the partitioning column in the description for each partition. This is different from range
partitioning, where a range of values is associated with a partition, and from hash partitioning, where you have no
control of the row-to-partition mapping. The advantage of list partitioning is that you can group and organize
unordered and unrelated sets of data in a natural way.
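
A small sketch of the explicit row-to-partition mapping. The partition names and the state values are invented for illustration.

```python
# Hypothetical list partitions: each partition names the discrete values it holds.
list_partitions = {
    "sales_north": {"Delhi", "Punjab", "Haryana"},
    "sales_south": {"Andhra Pradesh", "Karnataka", "Tamil Nadu"},
}

# Invert the definition once so each lookup is a single dictionary access.
value_to_partition = {
    value: name
    for name, values in list_partitions.items()
    for value in values
}

def list_partition(state):
    """Return the partition that explicitly lists this value."""
    try:
        return value_to_partition[state]
    except KeyError:
        raise ValueError(f"no partition lists the value {state!r}") from None
```

The values inside one partition have no ordering or arithmetic relationship; they are grouped only because the designer listed them together.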

Composite Partitioning
Composite partitioning combines range and hash partitioning. Oracle first distributes data into partitions according
to boundaries established by the partition ranges. Then Oracle uses a hashing algorithm to further divide the data
into sub partitions within each range partition.

Index Partitioning
You can choose whether or not to inherit the partitioning strategy of the underlying tables. You can create both
local and global indexes on a table partitioned by range, hash, or composite methods. Local indexes inherit the
partitioning attributes of their related tables. For example, if you create a local index on a composite table, Oracle
automatically partitions the local index using the composite method.

2.2.4. Indexes
Bitmap Indexes


Bitmap indexes are widely used in data warehousing environments. The environments typically have large
amounts of data and ad hoc queries, but a low level of concurrent DML transactions. For such applications,
bitmap indexing provides:
Reduced response time for large classes of ad hoc queries
Reduced storage requirements compared to other indexing techniques
Dramatic performance gains even on hardware with a relatively small number of CPUs or a small amount of
memory
Efficient maintenance during parallel DML and loads
Fully indexing a large table with a traditional B-tree index can be prohibitively expensive in terms of space
because the indexes can be several times larger than the data in the table. Bitmap indexes are typically only a
fraction of the size of the indexed data in the table. An index provides pointers to the rows in a table that contain a
given key value. A regular index stores a list of row ids for each key corresponding to the rows with that key value.
In a bitmap index, a bitmap for each key value replaces a list of row ids. Each bit in the bitmap corresponds to a
possible row id, and if the bit is set, it means that the row with the corresponding row id contains the key value. A
mapping function converts the bit position to an actual row id, so that the bitmap index provides the same
functionality as a regular index. If the number of different key values is small, bitmap indexes save space.
Bitmap indexes are primarily intended for data warehousing applications where users query the data rather than
update it. They are not suitable for OLTP applications with large numbers of concurrent transactions modifying the
data. Parallel query and parallel DML work with bitmap indexes as they do with traditional indexes. Bitmap
indexing also supports parallel create indexes and concatenated indexes.
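
The idea of one bitmap per key value can be sketched with Python integers used as bit strings. The column contents are invented, and real bitmap indexes are compressed and stored quite differently; the sketch shows only the principle.

```python
# Two hypothetical low-cardinality columns; list position is the row id.
gender = ["M", "F", "F", "M", "F", "M"]
status = ["single", "married", "single", "single", "married", "married"]

def build_bitmap_index(values):
    """One integer bitmap per distinct key; bit i is set when row i holds that key."""
    index = {}
    for row_id, value in enumerate(values):
        index[value] = index.get(value, 0) | (1 << row_id)
    return index

def row_ids(bitmap):
    """Mapping function: convert set bit positions back to row ids."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

gender_idx = build_bitmap_index(gender)
status_idx = build_bitmap_index(status)

# WHERE gender = 'F' AND status = 'married' becomes a single bitwise AND.
matches = row_ids(gender_idx["F"] & status_idx["married"])
```

Combining conditions is a bitwise operation on two small bitmaps, which is why this kind of index suits ad hoc queries on columns with few distinct values.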

B-tree Indexes
A B-tree index is organized like an upside-down tree. The bottom level of the index holds the actual data values
and pointers to the corresponding rows, much as the index in a book has a page number associated with each
index entry.

Materialized Views
Typically, data flows from one or more online transaction processing (OLTP) databases into a data warehouse on
a monthly, weekly, or daily basis. The data is normally processed in a staging file before being added to the data
warehouse. Data warehouses commonly range in size from tens of gigabytes to a few terabytes. Usually, the vast
majority of the data is stored in a few very large fact tables. One technique employed in data warehouses to
improve performance is the creation of summaries. Summaries are special kinds of aggregate views that improve
query execution times by precalculating expensive joins and aggregation operations prior to execution and
storing the results in a table in the database. For example, you can create a table to contain the sums of sales by
region and by product. The summaries or aggregates that are referred to in this book and in literature on data
warehousing are created in Oracle using a schema object called a materialized view. Materialized views can
perform a number of roles, such as improving query performance or providing replicated data. Prior to Oracle8i,
organizations using summaries spent a significant amount of time creating summaries manually, identifying which
summaries to create, indexing the summaries, updating them, and advising their users on which ones to use. The
introduction of summary management in Oracle8i eases the workload of the database administrator and means
the end user no longer has to be aware of the summaries that have been defined. The database administrator
creates one or more materialized views, which are the equivalent of a summary. The end user queries the tables
and views in the database. The query rewrite mechanism in the Oracle server automatically rewrites the SQL
query to use the summary tables. This mechanism reduces response time for returning results from the query.
Materialized views within the data warehouse are transparent to the end user or to the database application.
Although materialized views are usually accessed through the query rewrite mechanism, an end user or database
application can construct queries that directly access the summaries. However, serious consideration should be
given to whether users should be allowed to do this because any change to the summaries will affect the queries
that reference them.
In data warehouses, you can use materialized views to precompute and store aggregated data such as the sum
of sales. Materialized views in these environments are often referred to as summaries, because they store
summarized data. They can also be used to precompute joins with or without aggregations. A materialized view
eliminates the overhead associated with expensive joins and aggregations for a large or important class of
queries.
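
The effect of a summary can be sketched with SQLite, which has no materialized views; an ordinary table built from a query stands in for one here. The table names, the data and the hand-written "rewritten" query are assumptions for illustration. In Oracle the optimizer performs the rewrite itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (region TEXT, product TEXT, amount_sold REAL);
INSERT INTO sales VALUES
    ('East', 'Phone', 100.0), ('East', 'Phone', 50.0),
    ('East', 'Laptop', 900.0), ('West', 'Phone', 70.0);

-- Stand-in for a materialized view: the aggregate is computed once and stored.
CREATE TABLE sales_summary AS
    SELECT region, product, SUM(amount_sold) AS total_sold
    FROM sales
    GROUP BY region, product;
""")

# The query as the end user writes it, against the detail table.
detail_sql = ("SELECT region, SUM(amount_sold) FROM sales "
              "GROUP BY region ORDER BY region")

# What a rewritten query could look like: same answer, read from the summary.
rewritten_sql = ("SELECT region, SUM(total_sold) FROM sales_summary "
                 "GROUP BY region ORDER BY region")

from_detail = conn.execute(detail_sql).fetchall()
from_summary = conn.execute(rewritten_sql).fetchall()
summary_rows = conn.execute("SELECT COUNT(*) FROM sales_summary").fetchone()[0]
```

The summary holds fewer rows than the detail table, so the rewritten query has less to read while returning the same totals.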


Frequently Asked Questions


1. When should I consider a data warehousing solution?
Answer: When users are requesting access to a large amount of historical information for reporting
purposes, you should strongly consider a warehouse or mart. The user will benefit when the information
is organized in an efficient manner for this type of access.
2. DBAs have always been told that having non-normalized data is bad. Why is it now okay?
Answer: Normalization in relational databases results in an efficient use of database storage. Data
warehousing is not concerned with accomplishing the same storage efficiencies. The main concern is to
provide information to the user as fast as possible. Because of this, storing information in a de-normalized
fashion, including aggregate columns and summarization, provides the best immediate results.
3. What is the difference between data warehousing and OLAP?
Answer: These two terms are often used interchangeably. Warehousing is primarily the organization and
storage of the data such that it can be analyzed easily. OLAP deals with the particulars of the process of
analyzing the data, managing aggregations, and partitioning information into cubes for in-depth
visualization.
4. How often should I load data into my warehouse from my enterprise transaction systems?
Answer: The answer to this question depends on the needs of the users and the volume of
information that is to be moved. It is common to schedule weekly or monthly dumps from the
operational data stores, during periods of low activity (for example, nights or weekends). The longer the
gap between loads, the longer the processing time for each load when it does run. You will have to weigh the
implications of each to come up with an ideal solution for your situation.
5. How do I get started with data warehousing?
Answer: Build one! The easiest way to get started with data warehousing is to analyze an existing OLTP
database and see what type of trends would be interesting to examine. From there you could model your
new schema and load it with some current data. Although it may seem trivial, it is not. Start small and
build from there. SQL 7.0 offers excellent tools and technologies for starting any warehousing effort.
6. What is the Data Warehousing concept?
Answer: Data Warehousing is the process of making your operational data available to your business
managers and decision support applications. Data warehousing doesn't just make data available; proper
warehousing focuses on efficient information access. Of course, this efficiency doesn't happen magically.
First you have to understand the business user needs from the data and the decision support
applications, then you must evaluate your current operational data and determine how to transform that
data into what the business user requests. The tools that you choose for your warehousing solution will
take data from your operational systems (extract it), convert your operational data into business
information using your defined business rules (transform it), and create a data warehouse (load it).
7. What is the architecture of a Warehouse?
Answer: The architecture of a warehouse refers to the tools and products that are required to create your
warehouse as well as the tools and products that are necessary to achieve the ultimate business goals
set for your warehouse. Most warehousing architectures involve multiple tools and products. This is true
for the SAS data warehousing implementation, which utilizes components that manage the process of
moving and converting data as well as optimizing data for decision support. SAS provides all of the
components that you need, including the SAS suite of decision support tools.
o Use the SAS/ACCESS and SAS query products for the extraction process, which makes your
data available to the SAS system. Once the data is available to SAS, it can be transformed and
loaded into data warehouses, data marts, specialized data stores, and reports.
o Use the DATA step and SAS procedures to transform, consolidate, reformat, and cleanse your
data.
o Use SAS/Warehouse Administrator software to manage component definitions (tables, MDDBs,
reports, graphs, and queries), handle the process of moving the data, and to generate the SAS
source code for your data warehouse.
o Use the metadata repository provided with SAS/Warehouse Administrator for collecting, storing,
and accessing metadata about the information environment and processes.
8. What is the process of warehousing data?
Answer: Data warehousing involves the entire information delivery process from access and
transformation of data that resides in different operational stores, through the organization process that
makes the data available for decision making, to surfacing the data for exploitation via a range of decision
support tools.
The data warehousing process begins with your operational data sources and ends with your decision
support applications and reports. In between, the following must happen to your data.
1. Data is extracted from each of your operational systems, which can include databases, ERP
systems, and external data (Web feeds, purchased data, etc.).
2. The data is transformed based on the business rules that apply to your data usage.
3. The data is stored in its transformed format in the data warehouse or data mart.
By following these simple steps, you provide a "single version of the truth" to all users, and free your IT
department for other strategic projects.
9. What is a data warehouse?
Answer: A data warehouse is a home for large quantities of data that normally reside in a number of
different locations. A Data Warehouse is the "corporate memory". Academics will say it is a subject
oriented, point-in-time, inquiry only collection of operational data. Typical relational databases are
designed for on-line transactional processing (OLTP) and do not meet the requirements for effective on-
line analytical processing (OLAP). As a result, data warehouses are designed differently than traditional
relational databases.
Generally it:
o Includes a variety of historical information
o Merges data sets that are otherwise difficult to combine or compare
10. What are the most frequent data errors that slow down the data input process?
Answer:
1. Missing or inconsistent IDs
2. Missing or incomplete demographic data
3. Duplicate records
4. Misspelled or inconsistent formats in names (Mc and MC as different entries)
5. Alpha data in numeric only fields
6. Inconsistent date formats
7. Numeric data out of range
11. What is ETL, and how does Oracle support the ETL process?
Answer: ETL is the Data Warehouse acquisition process of Extracting, Transforming (or Transporting)
and Loading data from source systems into the data warehouse. Oracle supports the ETL process
with its "Oracle Warehouse Builder" product. Many new features in the Oracle9i database will also
make ETL processing easier. For example:
o New MERGE command (also called UPSERT: insert and update information in one step);
o External Tables allow users to run SELECT statements on external data files (with pipelining
support).
12. What is the difference between a data warehouse and a data mart?
Answer: This is a heavily debated issue. There are inherent similarities between the basic constructs
used to design a data warehouse and a data mart. In general a Data Warehouse is used on an enterprise
level, while a Data Mart is used on a business division/department level. A data mart only contains the
required subject-specific data for local analysis.
13. What is the difference between a W/H and an OLTP application?
Answer: Typical relational databases are designed for on-line transactional processing (OLTP) and do
not meet the requirements for effective on-line analytical processing (OLAP). As a result, data
warehouses are designed differently than traditional relational databases. Warehouses are Time
Referenced, Subject-Oriented, Non-volatile (read only) and Integrated. OLTP databases are designed to
maintain atomicity, consistency, isolation and durability (the "ACID" tests). Since a data warehouse is not
updated by its users, these constraints are relaxed.
14. What is the difference between OLAP, ROLAP, MOLAP and HOLAP?
Answer: ROLAP, MOLAP and HOLAP are specialized OLAP (Online Analytical Processing) applications.
ROLAP stands for Relational OLAP. Users see their data organized in cubes with dimensions, but the
data is really stored in a Relational Database (RDBMS) like Oracle. Because the RDBMS stores data at a fine
grain level, response times are usually slow. MOLAP stands for Multidimensional OLAP. Users see their
data organized in cubes with dimensions, but the data is stored in a Multi-dimensional database (MDBMS)
like Oracle Express Server. In a MOLAP system many queries have a finite answer, performance is
usually critical, and responses are fast. HOLAP stands for Hybrid OLAP; it is a combination of both worlds.
Seagate Software's Holos is an example HOLAP environment. In a HOLAP system one will find queries on
aggregated data as well as on detailed data.
15. What is the difference between an ODS and a W/H?
Answer: An ODS (Operational Data Store) is an integrated database of operational data. Its sources
include legacy systems and it contains current or near term data. An ODS may contain 30 to 90 days of
information. A warehouse typically contains years of data (Time Referenced). Data warehouses group
data by subject rather than by activity (subject-oriented). Other properties are: Non-volatile (read only)
and Integrated.
16. What Oracle tools can be used to design and build a W/H?
Answer: Oracle Warehouse Builder, Oracle Designer, Oracle Express, Express Objects
17. When should one use an MD-database (multi-dimensional database) and not a relational one?
Answer: Data in a multi-dimensional database is stored as business people view it, allowing them to
slice and dice the data to answer business questions. When designed correctly, an OLAP database will
provide much faster response times for analytical queries. Normal relational databases store data in
two-dimensional tables and analytical queries against them are normally very slow.
18. What is a star schema? Why does one design this way?
Answer: A star schema is a single fact table containing a compound primary key, with one segment for
each dimension, and additional columns of additive, numeric facts. Why? It allows for:
o The highest level of flexibility of metadata
o Low maintenance as the data warehouse matures
o The best possible performance
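
The star schema described above can be sketched with Python's built-in sqlite3 module. This is only an illustration under stated assumptions: the table and column names (dim_product, dim_date, fact_sales) and the sample rows are hypothetical, and SQLite stands in for the warehouse database.

```python
import sqlite3

# Hypothetical star schema: one fact table surrounded by two dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (
    prod_key  INTEGER PRIMARY KEY,   -- surrogate key
    prod_name TEXT NOT NULL,
    category  TEXT NOT NULL
);
CREATE TABLE dim_date (
    date_key  INTEGER PRIMARY KEY,   -- e.g. 20070115
    year      INTEGER NOT NULL,
    quarter   INTEGER NOT NULL
);
CREATE TABLE fact_sales (
    prod_key      INTEGER NOT NULL REFERENCES dim_product(prod_key),
    date_key      INTEGER NOT NULL REFERENCES dim_date(date_key),
    quantity_sold INTEGER NOT NULL,  -- additive fact
    amount_sold   REAL    NOT NULL,  -- additive fact
    PRIMARY KEY (prod_key, date_key) -- one key segment per dimension
);
""")
conn.executemany("INSERT INTO dim_product VALUES (?, ?, ?)",
                 [(1, "Shirt", "Tops"), (2, "Jeans", "Bottoms")])
conn.executemany("INSERT INTO dim_date VALUES (?, ?, ?)",
                 [(20070115, 2007, 1), (20070420, 2007, 2)])
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
                 [(1, 20070115, 10, 200.0),
                  (2, 20070115, 5, 250.0),
                  (1, 20070420, 4, 80.0)])

def sales_by_category_and_quarter(connection):
    """Join the fact table to its dimensions and sum the additive facts."""
    return connection.execute("""
        SELECT p.category, d.quarter, SUM(f.amount_sold)
        FROM fact_sales f
        JOIN dim_product p ON p.prod_key = f.prod_key
        JOIN dim_date d    ON d.date_key = f.date_key
        GROUP BY p.category, d.quarter
        ORDER BY p.category, d.quarter
    """).fetchall()

star_rows = sales_by_category_and_quarter(conn)
```

Every analytical query follows the same shape: join the fact table to the dimensions of interest, group by dimension attributes, and aggregate the facts.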
19. When should you use a STAR and when a SNOW-FLAKE schema?
Answer: The star schema is the simplest data warehouse schema. The snowflake schema is similar to the
star schema, but it normalizes the dimension tables to save data storage space. It can be used to represent
hierarchies of information.
20. How can Oracle Materialized Views be used to speed up data warehouse queries?
Answer: With "Query Rewrite" (QUERY_REWRITE_ENABLED=TRUE in INIT.ORA) Oracle can direct
queries to use pre-aggregated tables instead of scanning large tables to answer complex queries.
Materialized views in a W/H environment are typically referred to as summaries, because they store
summarized data.
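
The idea of answering a query from a summary instead of the detail rows can be sketched as follows. This is a simplified illustration, not Oracle's mechanism: SQLite has no materialized views or query rewrite, so an ordinary pre-aggregated table (the hypothetical sales_summary) stands in for the materialized view, and the routing that Oracle's optimizer would do transparently is done by hand in the hypothetical monthly_total function.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales_detail (prod_id INTEGER, sale_month TEXT, amount_sold REAL);
INSERT INTO sales_detail VALUES
    (1, '2007-01', 100.0), (1, '2007-01', 50.0),
    (2, '2007-01', 75.0),  (1, '2007-02', 20.0);
-- Pre-aggregated summary table standing in for a materialized view.
CREATE TABLE sales_summary AS
    SELECT prod_id, sale_month, SUM(amount_sold) AS total_amount
    FROM sales_detail
    GROUP BY prod_id, sale_month;
""")

def monthly_total(connection, prod_id, sale_month, use_summary=True):
    """Answer the same question from the summary or from the detail rows."""
    if use_summary:
        sql = ("SELECT total_amount FROM sales_summary "
               "WHERE prod_id = ? AND sale_month = ?")
    else:
        sql = ("SELECT SUM(amount_sold) FROM sales_detail "
               "WHERE prod_id = ? AND sale_month = ?")
    row = connection.execute(sql, (prod_id, sale_month)).fetchone()
    return row[0] if row else None
```

Both paths return the same total; the summary path reads one pre-computed row instead of scanning and aggregating the detail table.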
21. What Oracle features can be used to optimize my Warehouse system?
Answer: The following Oracle features can be used to complement your Warehouse system/database:
o From Oracle8i, one can transport tablespaces between Oracle databases. Using this feature
one can easily "detach" a tablespace for archiving purposes. One can also use this feature to
quickly move data from an OLTP database to a Warehouse database.
o Data partitioning allows one to split big tables into smaller, more manageable sub-tables
(partitions). Data is automatically directed to the correct partition based on data ranges or hash
values.
o Oracle Materialized Views can be used to pre-aggregate data. The Query Optimizer can direct
queries to summary/roll-up tables instead of the detail data tables (query rewrite). This will
dramatically speed up warehouse queries and save valuable machine resources.
o Oracle Parallel Query can be used to speed up data retrieval by using multiple processes (and
CPUs) to process a single task.
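
How a row is directed to a partition by range or by hash can be sketched in a few lines. This is a conceptual illustration only: the partition names and boundaries are hypothetical, and the modulo in hash_partition is a simplification, since a real DBMS applies its own internal hash function.

```python
from bisect import bisect_right
from datetime import date

# Hypothetical range partitions: each boundary is an exclusive upper bound.
RANGE_BOUNDS = [date(2007, 1, 1), date(2007, 4, 1), date(2007, 7, 1)]
RANGE_NAMES = ["p_before_2007", "p_2007_q1", "p_2007_q2", "p_later"]

def range_partition(order_date):
    """Pick the partition whose date range contains order_date."""
    return RANGE_NAMES[bisect_right(RANGE_BOUNDS, order_date)]

def hash_partition(cust_id, partition_count=4):
    """Spread rows across partitions by taking the key modulo the partition count."""
    return cust_id % partition_count
```

For example, range_partition(date(2007, 2, 15)) selects p_2007_q1, while a query restricted to one quarter only needs to read that one partition.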


Exercises

Exercise #1.1
Scenario
A garment manufacturing company has outlets across the globe. It
also has several portals and web-based applications through which customers can make
purchases online. You are asked to design, develop, and maintain a data warehouse for them.
Database Structure
Customers
    Cust_ID            Number          P.K.
    Cust_Name          Varchar2(60)    Not Null
    Street_Address     Varchar2(60)    Not Null
    City_ID            Number          F.K.
    Postal_Code        Varchar2(10)

Products
    Prod_ID            Number          P.K.
    Prod_Name          Varchar2(60)    Not Null
    Sub_Cat_ID         Number          F.K.

Orders
    Prod_ID            Number          F.K.
    Cust_ID            Number          F.K.
    Order_Date         Date/Time       Not Null
    Quantity_Sold      Number          Not Null
    Amount_Sold        Number          Not Null

Regions
    Region_ID          Number          P.K.
    Region_Name        Varchar2(40)    Not Null

Sub_Regions
    Sub_Region_ID      Number          P.K.
    Sub_Region_Name    Varchar2(40)    Not Null
    Region_ID          Number          F.K.

Countries
    Country_ID         Number          P.K.
    Country_Name       Varchar2(40)    Not Null
    Sub_Region_ID      Number          F.K.

States
    State_ID           Number          P.K.
    State_Name         Varchar2(40)    Not Null
    Country_ID         Number          F.K.


Cities
    City_ID            Number          P.K.
    City_Name          Varchar2(40)    Not Null
    State_ID           Number          F.K.

Categories
    Category_ID        Number          P.K.
    Category_Name      Varchar2(40)    Not Null

Sub_Categories
    Sub_Cat_ID         Number          P.K.
    Sub_Cat_Name       Varchar2(40)    Not Null
    Category_ID        Number          F.K.

Exercise
Identify the dimension and fact tables
Identify the total number of measures, dimensions and details in each table
It has been noted by your superior that there exist some critical columns in the given
structure. Identify them and suggest appropriate surrogate keys.
Design a star and a snowflake schema for the given OLTP structure
Analyze both the schemas
Identify which schema (star or snowflake) is most suitable
Identify whether any fact constellations exist
What are all the summary tables that might be required
What are all the possible data marts that can be created
Business Importance
The warehouse is used by the end users for analyzing the sales of the company. This
information is being used by a group of business analysts and senior managerial personnel.
This information is not permitted to be passed to any other departments of the same/any
other company.
End User Requirements
Top 10 products being sold (region wise) at the end of every month
Top 50 customers based on product category, sub category
List of top 5 cities where the sales are decreasing rapidly
Quarterly sales report for the last three years

3. Computer Associates ERWin r 7

3.1. Introduction
This chapter combines Logical Data Modeling concepts and the ERWin 4.1 tool features that support logical data
modeling. Logical data modeling is one of the most widely used techniques to analyze and document business
data requirements. A model is a set of diagrams and supporting documents containing the essential data
elements, detailed definitions, and descriptions of the relationships between the business elements or objects.
Whether you are developing new application systems, re-engineering existing processes, customizing a
purchased software package, or defining a data warehouse, it is critical to understand the core business
information needs of the organization.
This chapter covers the diagramming techniques necessary to model essential business data in an Entity
Relationship Diagram (ERD) using ERWin. Students will learn the components of the ERD, naming guidelines,
diagramming conventions, and learn to formulate detailed questions about the business rules and requirements.
Students will also learn all the important tool features of ERWin.

What's New in AllFusion ERwin Data Modeler 7.1

You can now load a wide variety of models generated by other sources into an AllFusion ERwin DM model.
Due to an increasing number of modelers who expect their modeling tools to provide interfaces and model
migration capabilities, technology from Meta Integration Technology, Inc. is now embedded with AllFusion ERwin
DM to provide integration capabilities with about 100 industry-leading metadata
products in the form of a wizard.


3.2. ERWIN Workplace

[Figure: ERWin workplace, with callouts for the Toolbars, ERWin Toolbox, Diagram Window, Model Explorer, Advisories Pane, and Action Log]

ERWin workplace includes:


A drawing canvas/diagram window
Model explorer
Dockable Toolbars
ERWin Toolbox
Stored Display Tabs
Action Log
Advisories Pane
A drawing canvas/diagram window is where you create entities (tables) and attributes (columns). You also specify
the relationships between these entities in the same window. The Model Explorer works like
the folder browser in Windows Explorer. You can use it to navigate faster within your ERWin model. It displays
the list of all the objects in the model (entities, attributes, and so on). Dockable toolbars in ERWin are like toolbars in any
other software. They give you one-click access to several functions that are available in menus. The
ERWin toolbox is perhaps the most widely used toolbar. It contains several icons for creating entities and attributes and for
defining relationships. The Action Log pane is a dockable window on the user interface that provides transaction log
information containing real-time changes made to a model. The Advisories Pane is a dockable pane in the main
AllFusion ERwin DM workplace that displays messages associated with actions you perform in AllFusion ERwin
DM.


3.3. Getting Started


Creating a model
Click the New icon in the standard toolbar.
ERWin responds by opening the New Model dialog box.
Select Logical/Physical as the new model type
Select the Database as Oracle
Select the Version as 9.x
Click OK

[Figure: standard toolbar, with callouts for New, Open, Save, and Switching Between Models]

Model Types
ERWin lets you design three types of models:
Logical/Physical
Logical
Physical
Logical/Physical is a model type created in ERWin that automatically links the logical and physical models.
Logical is a data model that represents the business rules that govern how a business operates. Usually, a logical
model is used as a starting point for the physical model which is typically a database. Physical is a model that
represents the objects in the database as well as rules for managing the data. In ERWin, we use the physical
model to create and update the database. If you are working with ERWin data models that you saved in a version
prior to ERWin 4.0, chances are you have a logical/physical model, which you can still maintain in ERWin. ERWin
also lets you split logical/physical models, derive new models of either type, as well as design separate logical
and physical models from scratch.

Figure 3.3

3.4. Basic Data Model Objects


In ERWin, basic objects in the logical model are:
Entities
Attributes
Relationships
In the physical model, the basic objects are:
Tables
Columns
Constraints (same as relationships)
Views
Basically, the same drawing tools are used to create both logical and physical objects.

Entity
An entity is a logical object that represents a person, place, or thing about which an organization maintains
information. Following are some entities:
Employee
Department
Customer
Movie
Payment
Store

Table
An entity in the logical model usually corresponds to a table in a physical data model. In the physical model, a
graphic box represents a table in which data is stored in the database.

Two Types of Entities and Tables

Two types of entities and tables can be drawn in an ERWin data model:
An independent entity is represented as a box with square corners. The Customer entity is an independent
entity because none of its primary key attributes is contributed by another entity.
A dependent entity is represented as a box with rounded corners. The Customer entity contributes a primary
key to the Movie Rental Record, which as a result becomes a dependent entity.

[Figure: ERWin Notations. IE: Information Engineering. IDEF1X: Integrated Definition for information modeling.]

Adding Entities and Tables


ERWin includes a single tool in the ERWin toolbox for creating both independent and dependent entities and/or
tables.
To add an entity (or table) in the logical (or physical) model:
Click the entity (or table) tool in the ERWin toolbox and then click in the diagram window.
Repeat for each entity that you want to add.
ERWin automatically numbers the objects (e.g., E/1, E/2, E/3, and so on).

Attributes and Columns


Attributes collect information about an entity and columns collect information about a table. Logical attributes
usually correspond to physical columns in a table. For example, the Customer entity may contain the Customer
Number attribute, which may become the Cust_No column in a database.

Primary Key and Non-Key Areas


In ERWin, entities and tables are drawn as a box with a horizontal line near the top of the box. The area above
the line is called the primary key area and contains the primary key attributes or columns. Below the line is the
non-key area, which contains attributes or columns that are not the primary key.

Adding Attributes and Columns


After you create an entity or table, you can add attributes and/or columns. ERWin provides many easy methods
for creating and modifying the properties of these objects. The most basic method is to add the name directly in
the diagram window.
Click to select the entity or table.
Press the Tab key and when the edit box appears, type the attribute or column name.
When you finish, press Tab to add the next primary key or Enter to add a non-key.

Relationships
A very important object in a data model is the relationship, which is represented by the solid or dashed line that
connects two entities or two tables. A relationship line connects a parent and a child entity or table. Usually, a
symbol appears at the child-end of the relationship line. The symbol changes based on the diagram notation that
you select.

Types of Relationships
Relationships are important because the type of relationship determines how a primary key of the parent entity or
table migrates to the child entity or table as a foreign key. There are two types of relationships:
An Identifying Relationship is represented by a solid line and through it the primary key of the parent
migrates to the primary key area of the child entity or table.
A Non-Identifying Relationship is represented by a dashed line and through it the primary key of the
parent migrates to the non-key area of the child entity or table.
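
The difference between the two relationship types shows up in the generated tables. The sketch below is an assumption-laden illustration using SQLite through Python, not ERWin output: the table and column names (customer, store, movie_rental_record) are hypothetical. The customer key migrates into the child's primary key (identifying), while the store key migrates into a non-key column (non-identifying).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (cust_no INTEGER PRIMARY KEY);
CREATE TABLE store (store_no INTEGER PRIMARY KEY);
CREATE TABLE movie_rental_record (
    cust_no   INTEGER NOT NULL REFERENCES customer(cust_no), -- identifying: FK is part of the PK
    rental_no INTEGER NOT NULL,
    store_no  INTEGER REFERENCES store(store_no),            -- non-identifying: FK is a non-key column
    PRIMARY KEY (cust_no, rental_no)
);
""")

def primary_key_columns(connection, table):
    """Return the names of the columns that form the table's primary key."""
    info = connection.execute(f"PRAGMA table_info({table})").fetchall()
    # Each row is (cid, name, type, notnull, dflt_value, pk); pk > 0 marks key columns.
    return [row[1] for row in sorted(info, key=lambda r: r[5]) if row[5] > 0]

def non_key_columns(connection, table):
    """Return the names of the columns outside the primary key."""
    info = connection.execute(f"PRAGMA table_info({table})").fetchall()
    return [row[1] for row in info if row[5] == 0]
```

In diagram terms, primary_key_columns lists what appears above the line in the table box and non_key_columns lists what appears below it.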

Adding Relationships
Again, ERWin provides many easy methods for creating a relationship. The easiest way to create a relationship is
to use the Relationship tool in the ERWin Toolbox. Before you create a relationship, consider whether you want
the foreign keys to migrate to the primary key area or the non-key area of the entity or table. Then choose the
relationship tool from the ERWin Toolbox. To create a relationship:
1. Click to select the relationship tool.
2. Click the parent entity or table.
3. Click the child entity or table.

Views
In a physical model, you can create a view, which is really a SQL query that is permanently stored in the
database. Typically, a view is used to present specific database information for a target audience. As an example,
the accounting department of a video store chain presumably uses the Customer Invoice view to generate a
billing invoice. In ERWin and in the database, a view is really a virtual table. In ERWin, a view table (box) and
relationship line are both drawn with dashed-lines. In the physical model, you can use view tools in the ERWin
toolbox to draw the view table and connect the view relationship to a source table. When you do, the columns
from the source table migrate to the view. Behind the scenes, ERWin writes the SQL query for the view, which you
can view and edit in the Views Editor.
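
A view as a stored query can be illustrated with SQLite through Python. This is a minimal sketch under assumptions: the customer and invoice tables, their columns, and the sample rows are hypothetical, and the SQL is written by hand rather than generated by ERWin.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (cust_no INTEGER PRIMARY KEY, cust_name TEXT NOT NULL);
CREATE TABLE invoice (
    invoice_no INTEGER PRIMARY KEY,
    cust_no    INTEGER NOT NULL REFERENCES customer(cust_no),
    amount_due REAL NOT NULL
);
INSERT INTO customer VALUES (1, 'A. Rao'), (2, 'B. Smith');
INSERT INTO invoice VALUES (100, 1, 12.5), (101, 2, 7.0);
CREATE VIEW customer_invoice AS
    SELECT c.cust_name, i.invoice_no, i.amount_due
    FROM customer c JOIN invoice i ON i.cust_no = c.cust_no;
""")

# Querying the view looks exactly like querying a table.
invoice_rows = conn.execute(
    "SELECT cust_name, invoice_no, amount_due FROM customer_invoice ORDER BY invoice_no"
).fetchall()

# The database stores only the view's query text, not a copy of the rows.
view_sql = conn.execute(
    "SELECT sql FROM sqlite_master WHERE type = 'view' AND name = 'customer_invoice'"
).fetchone()[0]
```

Because the view is virtual, rows added to the source tables later appear in the view without any refresh step.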

Adding Views
ERWin includes a tool in the ERWin toolbox for creating views. This is available only in the physical model.
To add a view in the physical model:
1. Click the view tool in the ERWin toolbox and then click in the diagram window.
2. Repeat for each view that you want to add.
3. ERWin automatically numbers the objects (e.g., V_1, V_2, V_3, and so on).
4. Click any column in any table that you wish to have in a view.
5. Drag and drop it into the view.

3.5. Model Explorer


The Model Explorer organizes all of the objects in your data model in a hierarchical text-based view. The contents
of the Model Explorer change based on the type of the model. For example, you will see Tables and Views in the
Model Explorer for a Physical model, but not for a Logical model. An icon on the Model Explorer represents each
generic object type, such as Entities or Tables. Specific objects from your model appear under the generic object.
If you see a plus sign before the object name, click it to expand the list and see related objects. To collapse a list,
click on the minus sign next to an object.

Model Explorer Panes


To further organize your model, the Model Explorer has three panes: Model, Subject Areas, and Domains. Click
on the tab at the bottom of the Model Explorer to switch to a different pane.
The Model pane includes every object in your model including subject areas and domains.
The Subject Areas pane displays model objects sorted by subject area.
The Domains pane displays a list of all the domains that are in the Domain Dictionary. You can sort the
domains hierarchically or alphabetically.
When you make changes to an object in the Model Explorer, the graphical view of the model is immediately
updated with the same change. For example, if you rename a table in the Model Explorer, the new table name
replaces the existing table name in the Diagram Window and the related editors. Similarly, if you make a change
in the Diagram Window or in an editor, you will immediately see the change in the Model Explorer.

In addition to navigating through a model, the Model Explorer provides a whole range of useful features that will
help you easily create and modify your data model. Just to name a few tasks that you can perform in the Model
Explorer, you can:
Create new objects
Rename existing objects
Go to objects in the Diagram Window
Open editors to view or change object properties
Drag some objects from the Model Explorer onto the Diagram Window
Move, copy, and delete objects from one place to another

Creating & Maintaining Domains


In the Model Explorer, the Domains pane lists all of the domains for the current model, which include all of the default domains as well
as any that you created. To add a domain in the Model Explorer, right-click on the Domains folder and select Properties from the
context menu to open the Domain Dictionary, in which you create a new domain and assign properties to it. To switch the sort
order of the domains from hierarchical to alphabetical, right-click on the Domains folder in the Model or Domains pane, and
select the sort option that you prefer.

Subject Areas
The Subject Areas pane displays model objects sorted by subject area. You can expand each subject area to see
a list of the members of the subject area as well as any stored displays, which appear in folders below the Subject
Area to which they belong. Subject areas reference the objects in the Main Subject Area, which include all of the
objects in the model. So if you change an object in one subject area, the change applies to all subject areas to
which the object belongs. You will immediately see the changes applied throughout the Model Explorer and
Diagram Window.

3.6. Domains
In ERWin, a domain is a model object that you can use to quickly assign properties to an attribute or column. By
using domains you will promote consistency because a domain can be reused as many times as you like in a
single or multiple data models. Domains also reduce the time spent on development and maintenance. If you
change the domain, all attributes or columns associated with the domain are also changed.

Domain Dictionary
You can create and modify both physical and logical domains using the Domain Dictionary. Some of the domain
properties include:
Domain name and column name
Column data type
Default value
Valid value
Domain comment or note
Column comment or name
User Defined Property
The tabs and options in the editor will change based on whether the model is logical-only, physical-only, or logical
and physical.

Inherited and Non-Inherited Properties
In ERWin, domains have two types of properties:
Non-inheritable properties do not migrate to child domains or attributes and columns associated with the
domain because they are properties of the domain itself.
Inheritable properties do migrate to child domains and to the attributes and columns associated with the
domain.
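
The way inheritable properties flow from a parent domain to child domains and on to the attached columns can be sketched as a small lookup chain. This is a conceptual model only, not ERWin's internal implementation: the Domain and Column classes, the resolve method, and the sample domain names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Domain:
    name: str
    parent: Optional["Domain"] = None
    # Inheritable properties set locally; missing keys fall back to the parent.
    properties: dict = field(default_factory=dict)

    def resolve(self, key):
        """Look up a property here first, then walk up the parent chain."""
        domain = self
        while domain is not None:
            if key in domain.properties:
                return domain.properties[key]
            domain = domain.parent
        return None

@dataclass
class Column:
    name: str
    domain: Domain

    @property
    def data_type(self):
        # The column does not store the type; it asks its domain.
        return self.domain.resolve("data_type")

string_domain = Domain("String", properties={"data_type": "VARCHAR2(20)"})
name_domain = Domain("Name", parent=string_domain)
cust_name = Column("Cust_Name", name_domain)
prod_name = Column("Prod_Name", name_domain)
```

Setting name_domain.properties["data_type"] once changes the data type of every column attached to that domain, which is the consistency benefit described above, while the parent String domain keeps its own value.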


Creating Domains
ERWin supplies a set of default domains that you can use as they are or you can modify their properties. You can
also create new domains in just a few quick steps.
1. From the Model menu, choose Domain Dictionary.
2. In the dictionary, select the parent domain, which can be any existing domain.
3. Then click New.
4. When the New Domain dialog opens, the parent domain that you selected is highlighted.
5. Type a name for the new domain at the bottom of the dialog. Then click OK.

Domains and the Data types


A data type is a domain property. By default, a new domain is assigned the same data type as its parent domain.
You can change it at any time. The available data types for the current model are always displayed in the list in the
Domain Dictionary.
To assign a data type:
1. Select the domain.
2. Click the data type tab in the Logical model or the <Database> tab in the Physical model.
3. Select the new data type for the domain.

3.7. Entity Relationships


In an ERWin data model, a relationship shows an association between two entities or tables. Depending on the
diagram notation you choose, a relationship line may be solid or dashed and has symbols at one or both ends.

Relationship Tools
Depending on the model type (logical or physical) and the diagram notation, the relationship tools in the ERWin
Toolbox change. But basically, there are two types of relationships:
Identifying Relationship
Non-Identifying Relationship
An identifying relationship is a relationship between two entities in which an instance of a child entity is identified
through its association with a parent entity, which means the child entity is dependent on the parent entity for its
identity and cannot exist without it. In an identifying relationship, one instance of the parent entity is related to
multiple instances of the child. A non-identifying relationship is a relationship between two entities in which an
instance of the child entity is not identified through its association with a parent entity, which means the child entity
is not dependent on the parent entity for its identity and can exist without it. In a non-identifying relationship, one
instance of the parent entity is related to multiple instances of the child.

[Figure: ERWin Notations. Logical and Physical toolbox variants for Integrated Definition for information modeling, Information Engineering, and Dimension Modeling]


Relationship Editor
Once you create a relationship, you can double-click on the relationship line to open the Relationship Editor. You
can edit many of the relationship's properties, including:
Parent and child verb phrases
Relationship definition
Role name
Referential Integrity
Cardinality

[Figure: Relationship Editor, with callouts for the entities the relationship is between, the relationship cardinality, and the relationship type]

Unification
If the foreign key attribute has the same name as an owned attribute in the child entity, ERWin automatically
unifies the two instances into one attribute because it assumes that they are the same attribute. The process of
combining or unifying identical attributes in an entity is called unification.
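
Unification can be pictured as a merge by attribute name. The function below is a hypothetical sketch of that rule, not ERWin's actual algorithm: when the parent's key migrates into the child, an attribute whose name already exists in the child is not added a second time.

```python
def migrate_foreign_key(child_attributes, parent_key_attributes):
    """Return the child's attribute list after the parent's key migrates in."""
    unified = list(child_attributes)
    for attribute in parent_key_attributes:
        if attribute not in unified:   # same name: treated as the same attribute
            unified.append(attribute)
    return unified
```

For example, migrating cust_no into a child that already owns a cust_no attribute leaves a single cust_no, whereas migrating it into a child without one adds it as a new foreign key attribute.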

Stored Displays
If you want to quickly change the graphic presentation of your data model without resetting the display options
each time, you can create a stored display for each set of display options. To create a stored display:
1. From the Format menu, choose Stored Display Settings.
2. Click New and type a name for the stored display.
3. Click on the property tabs in the editor and then select the display option settings.
4. Click OK to save the new stored display.
For each stored display you create, ERWin adds a tab to the bottom of the Diagram Window. When you save a
data model, ERWin saves all stored displays that are associated with the data model. In order to see the Stored
Display tabs, on the View menu, be sure to check the Stored Display Tab option.


3.8. Display Levels


Figure 3.7

[Figure: Display Levels and Stored Displays in the logical and physical models]

Display levels in the logical model:
1. Entity
2. Attribute
3. Primary key
4. Definition
5. Icon

Display levels in the physical model:
1. Table
2. Column
3. Primary key
4. Comment
5. Physical Column Order

Figure 3.8


3.9. Subject Areas


A subject area is a subset of objects taken from the whole pool of objects in your diagram. By default, a new data
model includes one subject area called the Main Subject Area, which includes all of the objects in the data model.
But you can create others.

Figure 3.9

For every subject area in a Logical/Physical model, ERWin automatically creates a corresponding subject area for
the other model type. So, if you create a Customer Subject Area in the logical model, ERWin creates a Customer
Subject Area in the physical model.
It is important to understand that the subject areas are not copies of the data model, but are dynamic subsets of
the data model. In other words, if you add members to a subject area those objects are added to the current
subject area and the Main Subject Area. If you add an attribute or column to an existing entity or table, the new
object is added to every subject area in which the entity or table is a member.

Creating a Subject Area


To create a new subject area:
1. Select Subject Areas from the Model menu.
2. In the Subject Areas Editor, click New, and type a name for the new Subject Area.
3. Click the Members tab and use the arrows to include the objects in the new subject area (on the
right).
The Model Explorer provides a quick view of the subject areas in the data model. In the Model pane, you can see
the Subject Areas along with all of the other data model objects. But, if you prefer to view all of the subject areas
at a glance, just switch to the Subject Areas pane.

3.10. Indexes
You already know that an index in a book helps you to quickly find information by listing all of the pages where a
particular topic is discussed. Similarly, an index table helps to quickly locate a record in a database by pointing to
a specific column and row in a table. So, for example, to locate a customer in the database, an index on the
Customer table references the Customer Number (account number).
ERWin supports four types of indexes:
Primary Key Index
Foreign Key Index
Alternate Key Index
Inversion Entry Index
A Primary Key Index is an index on primary key of a particular table. You can have only one such index per table.
However, each index can contain multiple columns. A primary key index is unique. So, indexed columns cannot
have duplicated values. ERWin automatically creates primary key indexes for each table that contains a primary
key.
A Foreign Key Index is an index on one or more foreign keys migrated through a single relationship to the child
table. ERWin automatically creates foreign key indexes for each set of foreign key columns that are migrated
through a relationship to a child table.
An Alternate Key Index is a unique index that exists in addition to the primary key
index. For example, to locate a customer quickly, the Primary Key index may include only the Customer Account
Number. As an alternative, the Alternate Key index may include the Customer Phone Number column, which must
be a unique number associated with a customer record.
An Inversion Entry Index is a non-unique index that lets you quickly access records using values
that are not unique, such as Employee Last Name. Duplicate values in the inversion entry (IE) index are allowed.
Imagine a database for a video rental chain. It probably has hundreds of thousands of customer records. But
when it's time to check out a video, it's important for each customer record to be located quickly for better
customer service. If the video store clerk knows the customer's Account Number, the customer's record will
be found quickly because the Primary Key index references a unique Customer Number. Often the customer's
Account Number is not available. But a customer's record will still be found quickly if the video store database has
another unique index that points to the Customer Phone column.
Alternatively, the video store may also want to look up a customer record by Last Name, even though the search
may produce multiple records. In this case, a non-unique index may be created on the Customer table that points
to the Customer Last Name column alone.
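
The video store example can be sketched in SQL run through Python's sqlite3 module. This is an illustration under assumptions: the customer table, the index names (ak_customer_phone, ie_customer_last_name), and the sample rows are hypothetical, and SQLite stands in for the target database. The unique index plays the role of the alternate key and the plain index plays the role of the inversion entry.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (
    cust_no    INTEGER PRIMARY KEY,          -- primary key (unique)
    cust_phone TEXT NOT NULL,
    last_name  TEXT NOT NULL
);
CREATE UNIQUE INDEX ak_customer_phone ON customer (cust_phone);  -- alternate key (AK)
CREATE INDEX ie_customer_last_name ON customer (last_name);      -- inversion entry (IE)
INSERT INTO customer VALUES (1, '555-0101', 'Rao'), (2, '555-0102', 'Rao');
""")

def insert_customer(connection, cust_no, phone, last_name):
    """Return True if the row was accepted, False if a unique index rejected it."""
    try:
        connection.execute("INSERT INTO customer VALUES (?, ?, ?)",
                           (cust_no, phone, last_name))
        return True
    except sqlite3.IntegrityError:
        return False
```

A second customer with an existing phone number is rejected by the unique index, while any number of customers may share the same last name.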

Index Editor
In the physical model, you can use the Index Editor to create an index.
1. To view the indexes for a table, select the table from the list at the top of the editor.
2. To view the properties of an index, select the index from the list.
3. The Members tab lists all of the available columns in the table (on the left) and those columns
already assigned to the index (on the right).
4. Depending on the target database, other index properties may be available.
5. By default, ERWin automatically creates the Primary Key index, which is unique and includes all
primary key columns.

Creating Indexes
1. Click New to open the New Index dialog. ERWin assigns a default index name, which you can
accept or change.
2. To create an AK index check the unique option. To create an IE index uncheck the unique option.
Click OK to close the New Index dialog.
3. In the Index Editor, select the index columns from the Available Columns list and use the arrow
button to move them to the Index Members list.

3.11. Specifying & Connecting To Target Database


The target server is the DBMS to which we want to forward engineer our ERWin model. The following figure shows
the Target Server dialog box and the wide range of DBMSs that ERWin supports. We then connect to the target
database that we just selected. The following pictures show how to select the target database and
connect to it.


3.12. Forward & Reverse Engineering


Forward Engineering
Forward engineering is a process that generates the physical database schema from the data model. You can use
the Forward Engineering feature to design and create your database without writing a single SQL Create Table or
Create Index statement. When you generate a schema, you can choose to generate:
Tables
Triggers
Stored procedures
Indexes
Constraints
Other database features as supported by the target DBMS

To begin forward engineering (generating a schema), choose Forward Engineer/Schema Generation from the Tools
menu. If the options on this menu are not available, you will need to switch to a Physical Model. ERWin lets you
view and set schema generation options by category. The target server you select determines the options that
appear in the editor. In the Schema Generation Editor, the left panel lists all the categories and the right panel
lists all the options for the selected category. When you choose Forward Engineer/Schema Generation on the
Tools menu, ERWin opens the Schema Generation Options dialog. Click the Preview button to view the generated
schema. When you preview the generated schema, you can see how the options you selected appear in the schema
script. Once you are satisfied with the content of the generated schema, ERWin gives you two choices. You can:
1. Save a SQL DDL (Data Definition Language) script as an ASCII text file by clicking on the Report
button.
2. Connect ERWin directly to the target server and generate the schema in one step by clicking on
the Generate button.
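
To make the idea concrete, the following Python sketch turns a small model description into a DDL script and then
runs it against a database. It does not use or imitate ERWin's own interfaces: the MODEL and INDEXES structures and
the generate_schema function are hypothetical names invented for this example, and an in-memory SQLite database
stands in for the target server.

```python
import sqlite3

# Hypothetical physical model: table name -> list of (column, type, constraint).
MODEL = {
    "customer": [
        ("cust_no", "INTEGER", "PRIMARY KEY"),
        ("last_name", "TEXT", ""),
        ("phone_no", "TEXT", "NOT NULL"),
    ],
}
# Hypothetical index definitions: (index name, table, columns, unique?).
INDEXES = [("ak_customer_phone", "customer", ["phone_no"], True)]


def generate_schema(model, indexes):
    """Return a DDL script (one string) for the given tables and indexes."""
    statements = []
    for table, columns in model.items():
        cols = ",\n".join(
            f"    {name} {ctype} {constraint}".rstrip()
            for name, ctype, constraint in columns
        )
        statements.append(f"CREATE TABLE {table} (\n{cols}\n);")
    for name, table, columns, unique in indexes:
        kind = "UNIQUE INDEX" if unique else "INDEX"
        statements.append(f"CREATE {kind} {name} ON {table} ({', '.join(columns)});")
    return "\n".join(statements)


# Choice 1: keep the script as text (it could be previewed or saved to a file).
script = generate_schema(MODEL, INDEXES)

# Choice 2: run the script directly against the target database.
conn = sqlite3.connect(":memory:")
conn.executescript(script)
created = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE name NOT LIKE 'sqlite_%' ORDER BY name"
    )
]
```

Printing script shows the generated Create Table and Create Index statements, and created lists the objects that
now exist in the database.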

Figure: The ERWin Model is forward engineered into the Database; the Database is reverse engineered into the
ERWin Model.

1. Switch to Physical Model.
2. Click Forward Engineering.
3. In the Schema Generation Wizard, click Generate, and then click OK.


Reverse Engineering
ERWin lets you quickly create a data model by reverse engineering an existing physical database. During reverse
engineering, ERWin first captures the information in your database or script files, including:
Tables
Columns
Relationships
Triggers
Stored procedures
Validation rules
Physical storage properties
ERWin then automatically creates a physical model in your diagram based on this information. After you create a
data model by reverse engineering, you can use the ERWin tools and editors to add new database objects,
redesign the database structure as requirements change, and annotate or modify the model as needed.
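
The same idea can be sketched in Python: read the catalog of an existing database and build a simple description of
its tables, columns and relationships. This is an illustration only and not how ERWin itself works internally. The
reverse_engineer function and the sample tables are assumptions made for the example, and the PRAGMA statements
used are specific to SQLite.

```python
import sqlite3

# An existing database to capture (two sample tables with one relationship).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE movies (movie_no INTEGER PRIMARY KEY, title TEXT NOT NULL);
    CREATE TABLE movie_copies (
        movie_copy_no INTEGER PRIMARY KEY,
        movie_no INTEGER NOT NULL REFERENCES movies (movie_no)
    );
""")


def reverse_engineer(connection):
    """Build a dictionary describing tables, columns and relationships."""
    model = {}
    tables = [
        row[0]
        for row in connection.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    for table in tables:
        # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
        columns = [
            {"name": c[1], "type": c[2], "not_null": bool(c[3]), "pk": bool(c[5])}
            for c in connection.execute(f"PRAGMA table_info({table})")
        ]
        # PRAGMA foreign_key_list rows: (id, seq, table, from, to, ...)
        relationships = [
            {"column": fk[3], "references": f"{fk[2]}.{fk[4]}"}
            for fk in connection.execute(f"PRAGMA foreign_key_list({table})")
        ]
        model[table] = {"columns": columns, "relationships": relationships}
    return model


model = reverse_engineer(conn)
```

The resulting dictionary plays the role of the captured physical model: it can be inspected, extended with new
objects, or fed back into a schema generator.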


Select Model

Specify the user name, Password and Database Connecting String

Select Database


3.13. Complete Compare


Complete Compare is a powerful tool that enables you to view and resolve the differences between two models,
or a model and a database or script file. When you choose Complete Compare from the Tools menu, an easily
navigated wizard provides a wide range of compare criteria, enabling you to resolve the differences between the
models, database, or script file.
After you select the models you want to compare, you can narrow the range of the comparison by filtering out
specific properties and objects. In simpler scenarios, you can accept the preset compare defaults and begin the
compare process immediately. You can create custom option sets for filtering objects, and save the option sets for
future use. You can save a compare session for future reference.
When you begin to compare your models, changes are made in real time to the open models in the workspace.
You can undo, redo, and reverse compare actions. When a database or script file is included in the compare, you
can also generate alter scripts for the models.
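
What an alter script contains can be illustrated with a short Python sketch. It compares a model with a database
description and emits the statements that would bring the database in line with the model. The alter_script
function and the two dictionaries are hypothetical, only new tables and new columns are handled, and the output is
not meant to match the scripts that ERWin generates.

```python
def alter_script(model, database):
    """Return statements that add to `database` what exists only in `model`."""
    statements = []
    for table, columns in model.items():
        if table not in database:
            # Whole table is missing from the database: create it.
            cols = ", ".join(f"{name} {ctype}" for name, ctype in columns.items())
            statements.append(f"CREATE TABLE {table} ({cols});")
            continue
        for name, ctype in columns.items():
            if name not in database[table]:
                # Table exists but the column does not: add the column.
                statements.append(f"ALTER TABLE {table} ADD COLUMN {name} {ctype};")
    return statements


# Hypothetical model and database descriptions: table -> {column: type}.
MODEL = {
    "customer": {"cust_no": "INTEGER", "phone_no": "TEXT", "email": "TEXT"},
    "movies": {"movie_no": "INTEGER", "title": "TEXT"},
}
DATABASE = {
    "customer": {"cust_no": "INTEGER", "phone_no": "TEXT"},
}

statements = alter_script(MODEL, DATABASE)
```

Here the model has an extra column and an extra table, so the sketch produces one ALTER TABLE statement and one
CREATE TABLE statement.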
The full sequence of steps available in the Complete Compare wizard includes the following:
1. Make a selection for the model in the Right Model Selection dialog.
2. Make a selection for the model in the Left Model Selection dialog.
3. If necessary, resolve type conflicts using features in the Type Resolution dialog.
4. Set object filter options for the left model in the Left Object Selection dialog.
5. Set object filter options for the right model in the Right Object Selection dialog.
6. Select advanced options for the compare process in the Advanced Options dialog.
7. Click Compare to display the Resolve Differences dialog.

Complete Compare Right Model Selection Dialog:


Use the options in the Complete Compare Right Model Selection dialog to select a model, script, database, or
AllFusion MM file that will be displayed in the right pane of the Resolve Differences dialog.
Load From - You can load a model, script, or database from any of the following sources:
File - Select this option to open a model from a local file. Click Load to browse for the file.
Database/Script - Select this option to select a script file or database. You are prompted to select the script file or
database using the AllFusion ERwin DM Reverse Engineering wizard. You use the Reverse Engineering wizard to
create the list of model objects and properties in the right model pane. The selected database or script is reverse
engineered to a new model in the workplace.
Allow Demand Loading (only applicable when reverse engineering from a database) - Clear this check box to load
all the database objects and properties. When working with large databases, in order to improve performance,
select this check box to only load the names of top-level objects from the database catalog. When you perform an
action on a partially loaded object in the Resolve Differences dialog, it is fully loaded in order to complete the
compare action.
Set selected model as read-only - Select this check box to load the model as read-only. When you work with the
model in the Resolve Differences dialog, you cannot make changes to it. Use this feature to perform a "one-way"
compare.
Back - Display the Complete Compare: Left Model Selection Dialog.
Next - Display the Complete Compare: Type Selection Dialog.
Compare - Begin the compare process with the options selected.
Close - Close the Complete Compare Wizard.
Load Session - Load a Complete Compare session saved to a file.
Save Session - Save the current Complete Compare session to a file.


Complete Compare Left Model Selection Dialog


Use the options in the Complete Compare Left Model Selection dialog to select a model, script, database, model
source, or AllFusion MM file that will be displayed in the left pane of the Resolve Differences dialog.

Complete Compare Type Selection Dialog


Use the Type Selection dialog to set the compare level to a model type - logical level, physical level, or database
level. Your choice then sets a default option set for the compare process. You can also indicate a customized
option set for your compare, create a new option set, and edit, rename, or delete an existing option set.
Compare Level - Depending on the model, database, or script file you chose for the left and right pane of the
selection dialog, you can set the selection type for your compare. Note: When you select a compare level, each
pane of the Complete Compare wizard subsequently displays this information for easy reference.
Logical Level - All objects contained on the logical level of a model.
Physical Level - All objects contained on the physical level of a model.
Database Level - All objects on the database level.

Option Set - Two default option sets are provided:


Standard Default Option Set - this option set filters out many objects and properties from the selection tree. It
excludes "physical-only" object types, and includes a minimal set of property types. Use this default for standard
compares where it is not necessary to include all objects and properties in the compare process.
Advanced Default Option Set - this option set includes all objects in the selection tree, except those that are
assigned generated values during forward or reverse engineering. Use this default option set for advanced
compares, in which you want all objects to participate in the compare process.
You can filter out specific objects by clearing the check box next to the object name. You can save your selections
in a new option set, or you can make selections and use them for the current compare session only.

Complete Compare Left Object Selection Dialog


Use the options in the Complete Compare Left Object Selection dialog to further refine the criteria for object
comparison between the left and right models. You can select a narrower range of objects for the model in the left
pane (for example, by restricting the compare to a subject area, owner, and so on).

Choose Objects
The selected objects displayed in the Selected Object pane change based on the kind of compare you are
performing. For example, if your left model has been reverse engineered for Complete Compare, you can choose
from selection sets that allow you to perform a complete compare on new objects, system objects, or matching
objects.

Complete Compare Right Object Selection Dialog


Use the options in the Complete Compare Right Object Selection dialog to further refine the criteria for object
comparison between the left and right models. You can select a narrower range of objects for the model in the
right pane (for example, by restricting the compare to a subject area, owner, and so on).

The selected objects displayed in the Selected Object pane change based on the kind of compare you are
performing. For example, if your left model has been reverse engineered for Complete Compare, you can choose
from selection sets that allow you to perform a complete compare on new objects, system objects, or matching
objects.

Resolving Differences Using Complete Compare


You compare models using the features in the Resolve Differences dialog of the Complete Compare wizard. This
dialog appears after you have selected the models to compare and narrowed your compare criteria by applying
any of the available filters. When you click the Compare button, the Resolve Differences dialog displays, where
you can view the status of the conflicts, and resolve them by moving or matching selected items.

Display Filters
The filter buttons at the top of the dialog allow you to change the display of differences. All four filter buttons
are selected by default. To show all differences between your left and right model, select the four filters. You can
select any combination of filters. Note: To show only the differences between the models, deselect the "Equal"
button and select the three remaining buttons.
Equal - Select this button to display objects and properties that are the same in both models.
Not Equal - Select this button to display objects and properties that are not the same in both models.
Left Empty - Select this button to display objects that do not exist in the left model, but are present in the right
model. Note that this filter does not affect property rows.
Right Empty - Select this button to display objects that do not exist in the right model, but are present in the left
model. Note that this filter does not affect property rows.
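
The four statuses can be illustrated with a small Python sketch that compares two models held as dictionaries. The
classify and apply_filters functions and the LEFT and RIGHT sample models are hypothetical and exist only for this
example. The sketch compares whole objects by their column lists and does not attempt to reproduce how the Resolve
Differences dialog treats individual property rows.

```python
def classify(left, right):
    """Label every object name with one of the four compare statuses."""
    status = {}
    for name in sorted(set(left) | set(right)):
        if name not in left:
            status[name] = "Left Empty"      # present only in the right model
        elif name not in right:
            status[name] = "Right Empty"     # present only in the left model
        elif left[name] == right[name]:
            status[name] = "Equal"
        else:
            status[name] = "Not Equal"
    return status


def apply_filters(status, selected):
    """Keep only the objects whose status matches a selected filter button."""
    return {name: s for name, s in status.items() if s in selected}


# Hypothetical models: object name -> list of column names.
LEFT = {
    "customer": ["cust_no", "phone_no"],
    "movies": ["movie_no", "title"],
    "jobs": ["job_id"],
}
RIGHT = {
    "customer": ["cust_no", "phone_no"],
    "movies": ["movie_no", "title", "rent"],
    "locations": ["location_id"],
}

status = classify(LEFT, RIGHT)
# Deselect "Equal" and keep the other three filters to see only the differences.
differences_only = apply_filters(status, {"Not Equal", "Left Empty", "Right Empty"})
```

With the "Equal" filter deselected, the customer table (identical in both models) drops out of the result, while
the other three objects remain.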

Review Differences
The Object View and Property View panels display a comparison tree of the differences between the two models.
Icons are used to identify the status of the differences.
Items that are matched do not display any of the difference icons.

Compare two data models


You can compare two AllFusion ERwin Data Modeler data models using Complete Compare. This example
illustrates opening one model in the workplace before starting the Complete Compare Wizard. Note that you can
start a Complete Compare session even if no models are open, by selecting Complete Compare from the Tools
menu.
To compare two models, follow these steps:
1. Open one of the models you want to participate in the compare. This example assumes you are working
with an open model, and want to compare your changes to a version of the model previously saved.
2. From the Tools menu, choose Complete Compare. Your open model is selected as the "Right Model" by
default.
3. Click the Left Model option in the navigation pane to select the second model. In the Left Model
Selection dialog, click Load... to browse for your *.erwin file. When you select it, the model opens in the
workplace. This allows you to update the model in "real-time" as you continue the compare process.
4. Use the options in the Complete Compare Wizard to set the compare level and filter by objects for either
model.
5. Click Compare to start the compare process. The Resolve Differences dialog appears.
6. Use features in the Resolve Differences dialog to compare and reconcile any detected differences
between the models.
7. Click Finish. You are returned to the Complete Compare Wizard. You can change any options in the
wizard and begin the compare process again, or click Close to close the wizard.
8. You are prompted to close your models, or leave them open in the workspace for additional editing or
changes. If you select to close the models, you may see additional prompts to save your changes before
closing the model.


Select the Model


Select the Objects to compare in Left Object Selection

Select the Objects to compare in Right Object Selection


Click on Compare


Comparison Between Two reports

Select the Left Model

Choose Logical/Physical

Select Current user

Specify the user name, Password and Connecting String information of the Database

To view the report in Excel format, select the Report tab from the Tools menu and then select the option
Open With Microsoft Excel.


Report in Excel Format


Exercises
Exercise #3.1

Customers
Cust_No              Number         P.K.
First_Name           Varchar2(20)   Not Null
Last_Name            Varchar2(20)
Address              Varchar2(60)   Not Null
City                 Varchar2(20)   Not Null
State                Varchar2(20)   Not Null
Postal_Code          Varchar2(10)
Phone_No             Number         Not Null
Email                Varchar2(40)

Movies
Movie_No             Number         P.K.
Title                Varchar2(20)   Not Null
Director             Varchar2(20)
Rent                 Number         Not Null
Star_1               Varchar2(20)
Star_2               Varchar2(10)
Star_3               Varchar2(10)

Movie_Copies
Movie_Copy_No                 Number         P.K.
Movie_No                      Number         F.K.
Condition [GOOD/DAMAGED/]     Char(1)        Not Null
Format [CD/DVD]               Char(1)        Not Null
Rent                          Number         Not Null
Star_1                        Varchar2(20)
Star_2                        Varchar2(10)
Star_3                        Varchar2(10)

Movie_Rental_Record
Movie_Copy_No        Number         F.K.
Cust_No              Number         F.K.
Rented_Date          Date/Time      Not Null
Due_Date             Date/Time      Not Null
Returned_Date        Date/Time      Not Null
Rented_Condition     Char(1)        Not Null
Returned_Condition   Char(1)        Not Null
OverDue_Amount       Number         Not Null
Damaged_Amount       Number         Not Null
Total_Amount         Number         Not Null

Exercise #3.2

Departments
Department_ID        Number         P.K.
Department_Name      Varchar2(20)   Not Null
Manager_ID           Number         F.K.
Location_ID          Number         F.K.

Employees
Employee_ID          Number         P.K.
First_Name           Varchar2(20)   Not Null
Last_Name            Varchar2(20)
Phone_No             Number         Not Null
Email                Varchar(40)
Hire_Date            Date/Time      Not Null
Job_ID               Number         F.K.
Salary               Number         Not Null
Commission           Number
Manager_ID           Number         F.K.
Department_ID        Number         F.K.

Jobs
Job_ID               Number         P.K.
Job_Title            Varchar2(20)   Not Null
Min_Salary           Number         Not Null
Max_Salary           Number         Not Null

Job_History
Employee_ID          Number         Not Null
Start_Date           Date/Time      Not Null
End_Date             Date/Time      Not Null
Job_ID               Number         F.K.
Department_ID        Number         F.K.

Locations
Location_ID          Number         P.K.
Street_Address       Varchar2(20)   Not Null
City                 Varchar2(20)   Not Null
State                Varchar2(20)   Not Null
Country              Varchar2(20)   Not Null

NOTE
Students are advised to consider database structures in figure 2.3 (Star schema) and figure 2.4
(Snowflake schema) as additional exercises.