
Data Warehousing Concepts

BASIC DEFINITIONS
Data Warehousing :
A data warehouse (DWH) is a repository of integrated information, specifically structured for queries
and analysis. Data and information are extracted from heterogeneous sources as they are generated.
This makes it much easier and more efficient to run queries over data that originally came from
different sources.
"A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data
in support of management's decision-making process."
Subject-oriented: a DW is organized around major subjects and excludes data that is not
useful in the decision support process.
Integrated: a DW is constructed by integrating numerous data sources (relational databases, flat
files, legacy systems). A DW provides mechanisms for cleaning and standardizing the data.
Time-variant: data is stored to provide information from a historical perspective. Every key
structure in the data warehouse contains, either implicitly or explicitly, an element of time.
Nonvolatile: a DW is physically separated from the operational environment. Because of this
separation it does not require transaction processing, recovery, or concurrency control
mechanisms. It usually requires only two operations: the initial loading of data and access to the data.

A data warehouse is an architecture constructed by integrating data from multiple
heterogeneous sources to support structured and/or ad hoc queries, analytical reporting, and
decision making.
Data warehousing is the process of constructing and using data warehouses.
Data Warehouse is
A Multi-Subject Information Store
Typically 100s of Gigabytes to Terabytes


Data Mart : A collection of subject areas organized for decision support based on the needs of a
given department, for example sales or marketing. A data mart is designed to suit the needs of
that department and is much less granular than the warehouse data.
Data Mart is
A Single Subject Data Warehouse
Often Departmental or Line of Business Oriented
Typically Less Than 100 Gigabytes


Differences between DWH & Data Mart : A DWH is used at the enterprise level, while data marts are
used at the business division / department level. Data warehouses are arranged around the corporate
subject areas found in the corporate data model. Data warehouses contain more detailed information,
while most data marts contain more summarized or aggregated data.
OLTP : OLTP is Online Transaction Processing. It uses a standard, normalized database structure.
OLTP is designed for transactions, which means that inserts, updates and deletes must be fast.
OLAP : OLAP is Online Analytical Processing. It works on read-only, historical, aggregated data.
Difference between OLTP and OLAP:

Fact Table :
It contains the quantitative measures about the business.
Fact tables that contain aggregated facts are often called summary tables.
Dimension Table :
It contains descriptive data about the facts (business).
Aggregate tables :
Aggregate tables are pre-stored summarized tables. Using aggregates can improve query
performance several times over.
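As a minimal sketch of the idea, the snippet below builds a monthly summary table once and then reports from it instead of from the detail rows. It uses Python's standard sqlite3 module with an in-memory database. The table names, column names, and sample rows are assumptions made for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical detail-level table with one row per sale.
conn.execute("CREATE TABLE sales (sale_date TEXT, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("2024-01-05", "widget", 10.0),
        ("2024-01-17", "widget", 15.0),
        ("2024-02-02", "widget", 20.0),
        ("2024-02-09", "gadget", 5.0),
    ],
)

# Build the aggregate once, at load time: one row per (month, product).
conn.execute(
    """
    CREATE TABLE sales_monthly_agg AS
    SELECT substr(sale_date, 1, 7) AS month,
           product,
           SUM(amount) AS total_amount
    FROM sales
    GROUP BY substr(sale_date, 1, 7), product
    """
)

# Reports read the small pre-summarized table instead of scanning detail rows.
monthly_totals = conn.execute(
    "SELECT month, product, total_amount "
    "FROM sales_monthly_agg ORDER BY month, product"
).fetchall()
print(monthly_totals)
```

The trade-off is that the aggregate has to be refreshed whenever new detail rows are loaded.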
Conformed dimensions :
A conformed dimension is a dimension table shared by multiple fact tables. Conformed dimensions
connect separate star schemas into an enterprise star schema.
Schema :
A schema is a collection of database objects, including tables, views, indexes, and synonyms.
There are a variety of ways of arranging schema objects in the schema models designed for
data warehousing. Most data warehouses use a dimensional model.
Star Schema :
A star schema is a set of tables consisting of a single, central fact table surrounded by
denormalized dimensions. Star schemas implement dimensional data structures with denormalized
dimensions.
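The following sketch shows one possible star schema: a central fact table with a foreign key to each of two denormalized dimension tables. It uses Python's standard sqlite3 module. All table names, columns, and rows (dim_product, dim_store, fact_sales) are hypothetical and chosen only to illustrate the layout.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Denormalized dimensions: each keeps all descriptive attributes in one table.
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE dim_store   (store_key   INTEGER PRIMARY KEY, city TEXT, region TEXT);

    -- Central fact table: keys pointing at the dimensions plus a numeric measure.
    CREATE TABLE fact_sales (
        product_key INTEGER REFERENCES dim_product(product_key),
        store_key   INTEGER REFERENCES dim_store(store_key),
        amount      REAL
    );

    INSERT INTO dim_product VALUES (1, 'widget', 'hardware'), (2, 'gadget', 'hardware');
    INSERT INTO dim_store   VALUES (1, 'Pune', 'West'), (2, 'Chennai', 'South');
    INSERT INTO fact_sales  VALUES (1, 1, 10.0), (2, 1, 5.0), (1, 2, 20.0);
    """
)

# A typical star query: join the fact table to a dimension and aggregate a measure.
sales_by_region = conn.execute(
    """
    SELECT s.region, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_store AS s ON s.store_key = f.store_key
    GROUP BY s.region
    ORDER BY s.region
    """
).fetchall()
print(sales_by_region)
```

In a snowflake variant of the same model, region would move out of dim_store into its own table, and the query would need one more join.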
Snowflake Schema :
A snowflake schema is a set of tables consisting of a single, central fact table surrounded by
normalized dimension hierarchies. Snowflake schemas implement dimensional data structures
with fully normalized dimensions.
Queries :
A DWH serves two types of queries.
Fixed queries:
are clearly defined and well understood, such as regular reports.
Ad hoc queries:
are unpredictable, both in quantity and frequency. The ad hoc query is the starting point for any
analysis into a database. The ability to run any query when desired and expect a reasonable response
is what makes the data warehouse worthwhile and makes the design such a significant challenge.
The end-user access tools are capable of automatically generating the database
query that answers any question posed by the user.
Canned Queries:
are pre-defined queries. Canned queries contain prompts that allow you to customize the query for
your specific needs.
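One way to picture a canned query is as a fixed SQL statement with named placeholders that are filled in from the user's answers to the prompts. The sketch below uses Python's standard sqlite3 module. The sales table, the report text, and the function name run_canned_report are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, year INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("West", 2023, 10.0), ("West", 2024, 15.0), ("South", 2024, 20.0)],
)

# The query text is fixed in advance; only the prompted values change per run.
CANNED_SALES_REPORT = (
    "SELECT SUM(amount) FROM sales WHERE region = :region AND year = :year"
)


def run_canned_report(region, year):
    """Run the pre-defined report with the values supplied at the prompts."""
    row = conn.execute(CANNED_SALES_REPORT, {"region": region, "year": year}).fetchone()
    return row[0]


west_2024 = run_canned_report("West", 2024)
print(west_2024)
```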
Kimball (Bottom up) vs Inmon (Top down) approaches :
Bottom up:
According to Ralph Kimball, when you plan to design analytical solutions for an enterprise, start by
building data marts. Once you have three or four such data marts, you effectively have an
enterprise-wide data warehouse, without time and effort spent exclusively on building the EDWH,
because a data mart takes less time to build than an EDWH.
Top down:
According to Bill Inmon, build an enterprise-wide data warehouse first; all the data marts will then be
subsets of the EDWH. In his view, independent data marts cannot make up an enterprise data warehouse
under any circumstance; they remain isolated pieces of information (stovepipes).
ER Diagram : The ER model is a conceptual data model that views the real world
as entities and relationships. A basic component of the model is the Entity-Relationship diagram, which
is used to visually represent data objects.
ETL : Extraction, Transformation & Loading. ETL tools in the market include, for example, Informatica,
Ascential DataStage, Acta, and Oracle Warehouse Builder (OWB).
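The three ETL steps can be sketched in a few lines of standard-library Python. This is only an illustration of the flow, not how any of the tools named above work internally. The CSV content, the cleaning rules, and the target table are all assumptions.

```python
import csv
import io
import sqlite3

# Extract: read rows from a hypothetical CSV export of a source system.
raw_export = io.StringIO("customer,amount\n alice ,10.50\nBOB,4\n,7\n")
rows = list(csv.DictReader(raw_export))

# Transform: trim and standardize names, convert amounts to numbers,
# and drop rows that have no customer at all.
clean_rows = [
    (r["customer"].strip().title(), float(r["amount"]))
    for r in rows
    if r["customer"].strip()
]

# Load: write the cleaned rows into the target table.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales (customer TEXT, amount REAL)")
warehouse.executemany("INSERT INTO sales VALUES (?, ?)", clean_rows)

loaded = warehouse.execute(
    "SELECT customer, amount FROM sales ORDER BY customer"
).fetchall()
print(loaded)
```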

Staging Area :
It is the work place where raw data is brought in, cleaned, combined, archived and exported to one or
more data marts. The purpose of data staging area is to get data ready for loading into a presentation
layer.
Slowly Changing Dimensions :
Dimensions are said to be slowly changing when their attributes remain almost constant,
requiring only minor, infrequent alterations.
E.g. marital status.
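The text does not say how such changes are stored. One common technique, usually called a Type 2 slowly changing dimension, keeps history by closing the current row and adding a new version. The sketch below assumes that technique. The class name, field names, and sample customer are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class CustomerRow:
    customer_key: int          # surrogate key for this version of the row
    customer_id: str           # business key from the source system
    marital_status: str
    valid_from: date
    valid_to: Optional[date]   # None marks the current version


def apply_change(rows: List[CustomerRow], customer_id: str,
                 new_status: str, change_date: date) -> None:
    """Close the current row and append a new version if the attribute changed."""
    current = next(
        r for r in rows if r.customer_id == customer_id and r.valid_to is None
    )
    if current.marital_status == new_status:
        return  # nothing changed, keep the current version open
    current.valid_to = change_date
    next_key = max(r.customer_key for r in rows) + 1
    rows.append(CustomerRow(next_key, customer_id, new_status, change_date, None))


dimension = [CustomerRow(1, "C100", "single", date(2020, 1, 1), None)]
apply_change(dimension, "C100", "married", date(2023, 6, 15))
for row in dimension:
    print(row)
```

Facts recorded before the change keep pointing at the old row, so historical reports still show the status that applied at the time.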
Bitmap indexes and B-tree indexes are the indexing mechanisms used in a typical data warehouse.
OLAP, MOLAP, ROLAP, DOLAP, HOLAP :
OLAP:
Online Analytical Processing. OLAP tools in the market include, for example, Business Objects, Brio,
Cognos, MicroStrategy, Alphablox, and Crystal Reports.
ROLAP:
Relational OLAP: the users see cubes, but under the hood the data is held in purely relational tables.
MicroStrategy is a ROLAP product.
MOLAP:
Multidimensional OLAP: the users see cubes, and under the hood there is one big cube. Oracle Express
used to be a MOLAP product.
DOLAP:
Desktop OLAP: the users see many cubes, and under the hood there are many small cubes. Cognos
PowerPlay is an example.
HOLAP:
Hybrid OLAP: combines MOLAP and ROLAP. Essbase is an example.
Types of Facts:
a. Additive
   1. Able to add the facts along all the dimensions
   2. Discrete numerical measures, e.g. retail sales in $
b. Nonadditive
   1. Numeric measures that cannot be added across any dimensions
   2. Intensity measure averaged across all dimensions, e.g. room temperature
   3. Textual facts - AVOID THEM
c. Semi Additive
   1. Snapshot, taken at a point in time
   2. Measures of intensity
   3. Not additive along the time dimension, e.g. account balance, inventory balance
   4. Added and divided by the number of time periods to get a time-average
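A small worked example of a semi-additive fact, using made-up month-end balances for two hypothetical accounts: the balances can be added across accounts within one month, but across months they are averaged rather than summed.

```python
# Month-end balances for two hypothetical accounts (sample values only).
balances = {
    "acct_a": {"2024-01": 100.0, "2024-02": 120.0, "2024-03": 110.0},
    "acct_b": {"2024-01": 50.0, "2024-02": 40.0, "2024-03": 60.0},
}

# Valid: add across the account dimension for a single point in time.
total_jan = sum(by_month["2024-01"] for by_month in balances.values())

# Not meaningful: adding one account's balances across months.
# Instead, add them and divide by the number of periods to get a time-average.
acct_a = balances["acct_a"]
avg_balance_a = sum(acct_a.values()) / len(acct_a)

print(total_jan, avg_balance_a)
```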
Attributes :
A field represented by a column within an object (entity). An object may be a table, view or report. An
attribute is also associated with an SGML(HTML) tag used to further define the usage.
Business Activity Monitoring (BAM) :
BAM is a business solution that is supported by an advanced technical infrastructure that enables
rapid insight into new business strategies, the reduction of operating cost by real-time identification of
issues and improved process performance.
Business Intelligence (BI) :
Business intelligence is an environment in which business users
receive data that is reliable, consistent, understandable, easily manipulated and timely. With this data,
business users are able to conduct analyses that yield overall understanding of where the business
has been, where it is now and where it will be in the near future. Business intelligence serves two
main purposes. It monitors the financial and operational health of the organization (reports, alerts,
alarms, analysis tools, key performance indicators and dashboards). It also regulates the operation of
the organization by providing two-way integration with operational systems and information feedback
analysis.
Data Integration :
Pulling together and reconciling dispersed data for analytic purposes that organizations have
maintained in multiple, heterogeneous systems. Data needs to be accessed and extracted, moved
and loaded, validated and cleaned, and standardized and transformed.
Data Mapping :
The process of assigning a source data element to a target data element.
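In its simplest form, a data mapping can be written down as a lookup from source element names to target element names. The field names below are hypothetical, and dropping unmapped fields is an assumption of this sketch rather than a rule from the text.

```python
# Hypothetical source-to-target mapping: source field name -> target column name.
FIELD_MAP = {
    "cust_nm": "customer_name",
    "cust_dob": "date_of_birth",
    "st_cd": "state_code",
}


def map_record(source_record):
    """Rename mapped fields; source fields without a mapping are dropped."""
    return {
        target: source_record[source]
        for source, target in FIELD_MAP.items()
        if source in source_record
    }


mapped = map_record({"cust_nm": "Asha", "st_cd": "MH", "legacy_flag": "Y"})
print(mapped)
```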
Data Mining :
A technique using software tools geared for the user who typically does not know exactly what they are
searching for, but is looking for particular patterns or trends. Data mining is the process of sifting
through large amounts of data to produce data content relationships. It can predict future trends and
behaviors, allowing businesses to make proactive, knowledge-driven decisions. This is also known as
data surfing.
Data Modeling :
A method used to define and analyze data requirements needed to support the business functions of
an enterprise. These data requirements are recorded as a conceptual data model with associated
data definitions. Data modeling defines the relationships between data elements and structures.
Drill Down:
A method of exploring detailed data that was used in creating a summary level of data. Drill down
levels depend on the granularity of the data in the data warehouse.
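Drill down can be pictured as re-running the same summary at a finer level of a hierarchy. The sketch below assumes a year > quarter > month time hierarchy and made-up sales rows. Month is the lowest level stored here, so it is also the deepest level that can be reached.

```python
from collections import defaultdict

# Hypothetical rows stored at month grain: (year, quarter, month, amount).
sales = [
    ("2024", "Q1", "Jan", 10.0),
    ("2024", "Q1", "Feb", 15.0),
    ("2024", "Q2", "Apr", 20.0),
]


def summarize(rows, level):
    """Total amount grouped by the first `level` columns of the time hierarchy."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[:level]] += row[-1]
    return dict(totals)


by_year = summarize(sales, 1)      # summary level
by_quarter = summarize(sales, 2)   # drill down one level
by_month = summarize(sales, 3)     # lowest level available in this data

print(by_year)
print(by_quarter)
print(by_month)
```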
Meta Data:
Meta data is data that expresses the context or relativity of data. Examples of meta data include data
element descriptions, data type descriptions, attribute/property descriptions, range/domain
descriptions and process/method descriptions. The repository environment encompasses all
corporate meta data resources: database catalogs, data dictionaries and navigation services. Meta
data includes name, length, valid values and description of a data element. Meta data is stored in a
data dictionary and repository. It insulates the data warehouse from changes in the schema of
operational systems.
Normalization:
The process of reducing a complex data structure into its simplest, most stable structure. In general,
the process entails the removal of redundant attributes, keys, and relationships from a conceptual
data model.
Surrogate Key:
A surrogate key is a single-part, artificially established identifier for an entity. Surrogate key
assignment is a special case of derived data - one where the primary key is derived. A common way
of deriving surrogate key values is to assign integer values sequentially.
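A minimal sketch of sequential surrogate key assignment: each new natural (business) key gets the next integer, and a key seen before gets the same integer again. The class name and the sample natural keys are hypothetical.

```python
from itertools import count


class SurrogateKeyGenerator:
    """Assigns sequential integer surrogate keys to natural keys."""

    def __init__(self):
        self._next_key = count(1)   # 1, 2, 3, ...
        self._assigned = {}         # natural key -> surrogate key

    def key_for(self, natural_key):
        # Reuse the existing surrogate key if this natural key was seen before.
        if natural_key not in self._assigned:
            self._assigned[natural_key] = next(self._next_key)
        return self._assigned[natural_key]


keys = SurrogateKeyGenerator()
assigned = [keys.key_for(k) for k in ["CUST-A17", "CUST-B02", "CUST-A17"]]
print(assigned)
```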
MOLAP, ROLAP, and HOLAP
In the OLAP world, there are mainly two different types: Multidimensional OLAP (MOLAP) and
Relational OLAP (ROLAP). Hybrid OLAP (HOLAP) refers to technologies that combine MOLAP and
ROLAP.
MOLAP
This is the more traditional way of OLAP analysis. In MOLAP, data is stored in a multidimensional
cube. The storage is not in the relational database, but in proprietary formats.
Advantages:
1. Excellent performance: MOLAP cubes are built for fast data retrieval and are optimal for slicing
and dicing operations.
2. Can perform complex calculations: all calculations have been pre-generated when the cube is
created. Hence, complex calculations are not only doable, but they return quickly.
Disadvantages:
1. Limited in the amount of data it can handle: because all calculations are performed when the
cube is built, it is not possible to include a large amount of data in the cube itself. This is not to say
that the data in the cube cannot be derived from a large amount of data. Indeed, this is possible. But
in this case, only summary-level information will be included in the cube itself.
2. Requires additional investment: cube technology is often proprietary and does not already
exist in the organization. Therefore, to adopt MOLAP technology, chances are additional investments
in human and capital resources are needed.
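The trade-off described above, fast lookups paid for by pre-computing everything at build time, can be sketched with a toy cube. Real MOLAP engines use proprietary storage formats, so this is only an illustration of the pre-aggregation idea. The dimension names and fact rows are assumptions.

```python
from collections import defaultdict
from itertools import combinations

DIMENSIONS = ("region", "product")

# Hypothetical detail rows that feed the cube.
facts = [
    {"region": "West", "product": "widget", "amount": 10.0},
    {"region": "West", "product": "gadget", "amount": 5.0},
    {"region": "South", "product": "widget", "amount": 20.0},
]


def build_cube(rows):
    """Pre-compute totals for every subset of dimensions, including the grand total."""
    cube = defaultdict(float)
    for size in range(len(DIMENSIONS) + 1):
        for dims in combinations(DIMENSIONS, size):
            for row in rows:
                cell = tuple((d, row[d]) for d in dims)
                cube[cell] += row["amount"]
    return dict(cube)


cube = build_cube(facts)

# After the build, every answer is a dictionary lookup with no further calculation.
grand_total = cube[()]
west_total = cube[(("region", "West"),)]
west_widget = cube[(("region", "West"), ("product", "widget"))]
print(grand_total, west_total, west_widget, len(cube))
```

Even with two dimensions and three rows the cube holds eight cells. The number of cells grows quickly as dimensions and distinct values are added.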
ROLAP
This methodology relies on manipulating the data stored in the relational database to give the
appearance of traditional OLAP's slicing and dicing functionality. In essence, each action of slicing
and dicing is equivalent to adding a "WHERE" clause in the SQL statement.
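The "slice equals WHERE clause" idea can be sketched as a small function that turns each chosen dimension value into one WHERE condition. It uses Python's standard sqlite3 module. The sales table, its columns, and the function name slice_total are assumptions, and real ROLAP tools generate far more elaborate SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, year INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("West", "widget", 2023, 10.0),
        ("West", "widget", 2024, 15.0),
        ("West", "gadget", 2024, 5.0),
        ("South", "widget", 2024, 20.0),
    ],
)

# Only these column names may appear in the generated SQL.
ALLOWED_COLUMNS = {"region", "product", "year"}


def slice_total(**filters):
    """Total amount for a slice; each filter becomes one WHERE condition."""
    unknown = set(filters) - ALLOWED_COLUMNS
    if unknown:
        raise ValueError(f"unknown dimension(s): {sorted(unknown)}")
    sql = "SELECT SUM(amount) FROM sales"
    if filters:
        sql += " WHERE " + " AND ".join(f"{col} = ?" for col in filters)
    return conn.execute(sql, list(filters.values())).fetchone()[0]


all_sales = slice_total()                      # no slicing: whole table
west_only = slice_total(region="West")         # slice on one dimension
west_widgets_2024 = slice_total(region="West", product="widget", year=2024)
print(all_sales, west_only, west_widgets_2024)
```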

Advantages:
1. Can handle large amounts of data: the data size limitation of ROLAP technology is the
limitation on data size of the underlying relational database. In other words, ROLAP itself places no
limitation on data amount.
2. Can leverage functionalities inherent in the relational database: relational databases often
already come with a host of functionalities. ROLAP technologies, since they sit on top of the
relational database, can therefore leverage these functionalities.
Disadvantages:
1. Performance can be slow: because each ROLAP report is essentially a SQL query (or
multiple SQL queries) in the relational database, the query time can be long if the underlying data size
is large.
2. Limited by SQL functionalities: because ROLAP technology mainly relies on generating SQL
statements to query the relational database, and SQL statements do not fit all needs (for example, it is
difficult to perform complex calculations using SQL), ROLAP technologies are traditionally
limited by what SQL can do. ROLAP vendors have mitigated this risk by building into the tool
out-of-the-box complex functions as well as the ability to allow users to define their own functions.
HOLAP
HOLAP technologies attempt to combine the advantages of MOLAP and ROLAP. For summary-type
information, HOLAP leverages cube technology for faster performance. When detail information is
needed, HOLAP can "drill through" from the cube into the underlying relational data.
