
ISAS

Presentation
On
Data Warehousing

Submitted By:
Siddharth Gaur
Sajal Mathur
Anuj Dutta

Submitted To:
Kanchan Bhatnagar

Computer Fundamentals
Input Devices (keyboard), Output Devices (monitor) and the CPU
A model expression for computer operation is the following:

INPUT - PROCESS - OUTPUT


This means that information is delivered (input) to the computer,
the computer processes the information and then sends (outputs)
processed information to the user.
Thus, a computer is composed of hardware that performs these
three functions: input, process, output.
The most common input devices are the keyboard and
the mouse, although many other input devices exist. For example, a
computer may monitor a hospital patient's heartbeat via contacts
placed on the patient's chest. The contacts are an input device.
The most common output devices are the monitor and printer.
These enable the computer to present information in a format that
humans can understand.
The CPU (Central Processing Unit) has primary responsibility for processing
data. All the complex processing that a computer performs, no matter how
sophisticated, can be done with three simple functions (instructions). These
are: 1) add two numbers, 2) move a number from one location to another, and 3)
compare two numbers and act upon the result of the comparison.
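As a toy illustration (not a model of any real CPU), the three instructions above can be sketched as a tiny interpreter. The instruction names and the dictionary-based "memory" are invented for the example:

```python
# A toy illustration of the three basic instructions: add two numbers,
# move a number between locations, and compare-and-act-on-the-result.
def run(program, memory):
    """Execute a list of (op, args) tuples against a memory dict."""
    pc = 0  # program counter
    while pc < len(program):
        op, *args = program[pc]
        if op == "ADD":            # 1) add two numbers
            a, b, dest = args
            memory[dest] = memory[a] + memory[b]
        elif op == "MOVE":         # 2) move a number from one location to another
            src, dest = args
            memory[dest] = memory[src]
        elif op == "JUMP_IF_LESS": # 3) compare two numbers and act on the result
            a, b, target = args
            if memory[a] < memory[b]:
                pc = target
                continue
        pc += 1
    return memory

# Compute 2 + 3, then copy the result to another location.
mem = run([("ADD", "x", "y", "sum"), ("MOVE", "sum", "out")],
          {"x": 2, "y": 3, "sum": 0, "out": 0})
```

Even loops and decisions reduce to these primitives: the compare instruction changes which instruction runs next, and everything else is adds and moves.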

Data Warehousing
A Definition of Data Warehousing
By Michael Reed
The data warehousing market consists of tools, technologies, and
methodologies that allow for the construction, usage, management, and
maintenance of the hardware and software used for a data warehouse, as well
as the actual data itself.
In order to clear up some of the confusion that is rampant in the market, here
are some definitions:

Data Warehouse:
The term Data Warehouse was coined by Bill Inmon in 1990, who defined it
in the following way: "A warehouse is a subject-oriented, integrated, time-variant
and non-volatile collection of data in support of management's decision
making process". He defined the terms in the sentence as follows:

Subject Oriented:
Data that gives information about a particular subject instead of about a
company's ongoing operations.

Integrated:
Data that is gathered into the data warehouse from a variety of sources and
merged into a coherent whole.

Time-variant:
All data in the data warehouse is identified with a particular time period.

Non-volatile:
Data is stable in a data warehouse. More data is added but data is never
removed. This enables management to gain a consistent picture of the business.

Source System Identification:
In order to build the data warehouse, the appropriate data must be located.
Typically, this will involve both the current OLTP (On-Line Transaction
Processing) system where the "day-to-day" information about the business
resides, and historical data for prior periods, which may be contained in some
form of "legacy" system. Often these legacy systems are not relational
databases, so much effort is required to extract the appropriate data.

Data Warehouse Design and Creation:


This describes the process of designing the warehouse, with care taken to
ensure that the design supports the types of queries the warehouse will be used
for. This is an involved effort that requires both an understanding of the
database schema to be created, and a great deal of interaction with the user
community. The design is often an iterative process and it must be modified a

number of times before the model can be stabilized. Great care must be taken at
this stage, because once the model is populated with large amounts of data,
some of which may be very difficult to recreate, the model cannot easily be
changed.

Data Acquisition:
This is the process of moving company data from the source systems into the
warehouse. It is often the most time-consuming and costly effort in the data
warehousing project, and is performed with software products known as ETL
(Extract/Transform/Load) tools. There are currently over 50 ETL tools on the
market. The data acquisition phase can cost millions of dollars and take months
or even years to complete. Data acquisition is then an ongoing, scheduled
process, executed to keep the warehouse current to a pre-determined
point in time (e.g., the warehouse is refreshed monthly).
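A minimal sketch of the extract/transform/load cycle described above. The source format, the cleaning rules, and the in-memory "warehouse" are illustrative assumptions, not the behavior of any particular ETL product:

```python
# A minimal, hypothetical sketch of an ETL (Extract/Transform/Load) cycle.
# Real ETL tools add scheduling, error handling, and changed data capture.

def extract(source_rows):
    """Extract: pull raw records from a source system (here, a list)."""
    return list(source_rows)

def transform(rows):
    """Transform: clean and standardize records before loading."""
    out = []
    for r in rows:
        out.append({
            "customer": r["customer"].strip().upper(),  # normalize names
            "amount": round(float(r["amount"]), 2),     # normalize amounts
        })
    return out

def load(warehouse, rows):
    """Load: append the cleaned rows to the warehouse store."""
    warehouse.extend(rows)
    return warehouse

warehouse = []
source = [{"customer": "  acme ", "amount": "19.990"},
          {"customer": "Globex", "amount": "5"}]
load(warehouse, transform(extract(source)))
```

On a monthly refresh schedule, only the extract step would change: it would pull the records added or modified since the previous run rather than the full source.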

Changed Data Capture:


The periodic update of the warehouse from the transactional system(s) is
complicated by the difficulty of identifying which records in the source have
changed since the last update. This effort is referred to as "changed data
capture". Changed data capture is a field of endeavor in itself, and many
products are on the market to address it. Some of the technologies that are used
in this area are Replication servers, Publish/Subscribe, Triggers and Stored
Procedures, and Database Log Analysis.
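One of the simplest changed-data-capture techniques can be sketched as a filter on a last-modified timestamp; triggers, replication, and log analysis are the heavier alternatives named above. The field names here are hypothetical:

```python
# Simple changed data capture: compare each source record's
# "last_modified" timestamp against the time of the previous load.
from datetime import datetime

def changed_since(source_rows, last_load):
    """Return only the rows modified after the previous warehouse load."""
    return [r for r in source_rows if r["last_modified"] > last_load]

rows = [
    {"id": 1, "last_modified": datetime(2024, 1, 10)},
    {"id": 2, "last_modified": datetime(2024, 2, 20)},
]
delta = changed_since(rows, last_load=datetime(2024, 2, 1))
```

The timestamp approach only works when the source reliably stamps every change, which is one reason changed data capture is a field of endeavor in itself.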

Data Cleansing:
A data warehouse that contains incorrect data is not only useless, but also very
dangerous. The whole idea behind a data warehouse is to enable decision-making. If a
high level decision is made based on incorrect data in the warehouse, the company
could suffer severe consequences, or even complete failure. Data cleansing is a
complicated process that validates and, if necessary, corrects the data before it is
inserted into the warehouse. For example, the company could have three "Customer
Name" entries in its various source systems, one entered as "IBM", one as "I.B.M.",
and one as "International Business Machines". Obviously, these are all the same
customer. Someone in the organization must make a decision as to which is correct,
and then the data cleansing tool will change the others to match the rule. This process
is also referred to as "data scrubbing" or "data quality assurance". It can be an
extremely complex process, especially if some of the warehouse inputs are from older
mainframe file systems (commonly referred to as "flat files" or "sequential files").
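The "Customer Name" rule above might be sketched as a lookup table, assuming (for illustration only) that the organization chose the full spelled-out name as the correct form:

```python
# A hypothetical data-scrubbing rule for the "Customer Name" example:
# map the known variants to the value the organization chose as correct.
CANONICAL = {
    "IBM": "International Business Machines",
    "I.B.M.": "International Business Machines",
    "International Business Machines": "International Business Machines",
}

def cleanse(name):
    """Apply the cleansing rule; unknown names pass through unchanged."""
    return CANONICAL.get(name.strip(), name.strip())

cleaned = [cleanse(n) for n in
           ["IBM", "I.B.M.", "International Business Machines"]]
```

After cleansing, all three variants collapse to a single customer, so reports that group by customer name no longer triple-count.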

Business Intelligence (BI)


A very broad field indeed, it contains technologies such as Decision Support Systems
(DSS), Executive Information Systems (EIS), On-Line Analytical Processing
(OLAP), Relational OLAP (ROLAP), Multi-Dimensional OLAP (MOLAP), Hybrid
OLAP (HOLAP, a combination of MOLAP and ROLAP), and more. BI can be broken
down into four broad fields:
Multi-dimensional Analysis Tools:

Tools that allow the user to look at the data from a number of different "angles".
These tools often use a multi-dimensional database referred to as a "cube".

Query tools:
Tools that allow the user to issue SQL (Structured Query Language) queries against
the warehouse and get a result set back.

Data Mining Tools:


Tools that automatically search for patterns in data. These tools are usually driven by
complex statistical formulas. The easiest way to distinguish data mining from the
various forms of OLAP is that OLAP can only answer questions you know to ask,
data mining answers questions you didn't necessarily know to ask.

Data Visualization Tools:


Tools that show graphical representations of data, including complex three-dimensional
data pictures. The theory is that the user can "see" trends more effectively
in this manner than when looking at complex statistical graphs. Some vendors are
making progress in this area using the Virtual Reality Modeling Language (VRML).

Objectives of a Data warehouse and Data flow


The primary objective of data warehousing is to provide a consolidated, flexible,
meaningful data repository to the end user for reporting and analysis. All other
objectives of data warehousing are derived from this primary objective. The data
flow in the warehouse is also determined by the objectives of data warehousing.

The data in a data warehouse is extracted from a variety of sources. OLTP
databases, historical repositories and external data sources offload their data into
the data warehouse. Achieving a constant and efficient connection to the data
sources is one of the objectives of data warehousing. This process is known
as Data Source Interaction.

The data extracted from diverse sources must be checked for integrity, cleaned,
and then loaded into the warehouse for meaningful analysis. Therefore, harnessing
efficient data cleaning and loading technologies (ETL: Extraction, Transformation
and Loading) in the warehousing system is another objective of the data
warehouse. This process is known as Data Transformation service, or Data
preparation and staging.
The cleaned and stored data must then be partitioned, summarized and stored
for efficient query and analysis. Creating subject-oriented data marts and
dimensional models of data, and using data mining technologies, follows as
the next objective of data warehousing. This process is called Data Storage.
Finally, tools for query, analysis and reporting on data must be
built into the system to deliver a rich end-user experience. This
process is known as Data Presentation.
Users need to understand which rules were applied while cleaning and transforming
data before storage. This information is stored separately in a relational
database as metadata.

Metadata is data about data. Mapping rules and the maps between the data
sources and the warehouse; translation, transformation and cleaning rules; date
and time stamps, system of origin, type of filtering and matching; and pre-calculated
or derived fields and their rules are all stored in this database. In addition, the
metadata database contains a description of the data in the data warehouse; the
navigation paths and rules for browsing the data in the data warehouse; the data
directory; and the list of pre-designed and built-in queries available to the users.

Metadata Management
Throughout the entire process of identifying, acquiring, and querying the data,
metadata management takes place. Metadata is defined as "data about data". An
example is a column in a table. The datatype (for instance a string or integer) of the
column is one piece of metadata. The name of the column is another. The actual value
in the column for a particular row is not metadata - it is data. Metadata is stored in a
Metadata Repository and provides extremely useful information to all of the tools
mentioned previously. Metadata management has developed into an exacting science
that can provide huge returns to an organization. It can assist companies in analyzing
the impact of changes to database tables, tracking owners of individual data elements
("data stewards"), and much more. It is also required to build the warehouse, since the
ETL tool needs to know the metadata attributes of the sources and targets in order to
"map" the data properly. The BI tools need the metadata for similar reasons.

History of data warehousing

The concept of data warehousing dates back to the late 1980s [2] when IBM
researchers Barry Devlin and Paul Murphy developed the "business data warehouse".
In essence, the data warehousing concept was intended to provide an architectural
model for the flow of data from operational systems to decision support
environments. The concept attempted to address the various problems associated with
this flow - mainly, the high costs associated with it. In the absence of a data
warehousing architecture, an enormous amount of redundancy was required to
support multiple decision support environments. In larger corporations it was typical
for multiple decision support environments to operate independently. Each
environment served different users but often required much of the same data. The
process of gathering, cleaning and integrating data from various sources, usually long
existing operational systems (usually referred to as legacy systems), was typically in
part replicated for each environment. Moreover, the operational systems were
frequently reexamined as new decision support requirements emerged. Often new
requirements necessitated gathering, cleaning and integrating new data from the
operational systems that were logically related to prior gathered data.
Based on analogies with real-life warehouses, data warehouses were intended as
large-scale collection/storage/staging areas for corporate data. Data could be retrieved
from one central point or data could be distributed to "retail stores" or "data marts"
that were tailored for ready access by users.

Key developments in early years of data warehousing were:

1960s - General Mills and Dartmouth College, in a joint research project,
develop the terms dimensions and facts.[3]
1970s - ACNielsen and IRI provide dimensional data marts for retail sales.[3]
1983 - Teradata introduces a database management system specifically
designed for decision support.
1988 - Barry Devlin and Paul Murphy publish the article An architecture for a
business and information systems in IBM Systems Journal where they
introduce the term "business data warehouse".
1990 - Red Brick Systems introduces Red Brick Warehouse, a database
management system specifically for data warehousing.
1991 - Prism Solutions introduces Prism Warehouse Manager, software for
developing a data warehouse.
1991 - Bill Inmon publishes the book Building the Data Warehouse.
1995 - The Data Warehousing Institute, a for-profit organization that promotes
data warehousing, is founded.
1996 - Ralph Kimball publishes the book The Data Warehouse Toolkit.
1997 - Oracle 8, with support for star queries, is released.

Data warehouse architecture

Architecture, in the context of an organization's data warehousing efforts, is a
conceptualization of how the data warehouse is built. There is no right or wrong
architecture, rather multiple architectures exist to support various environments and
situations. The worthiness of an architecture can be judged by how well the
conceptualization aids in the building, maintenance, and usage of the data warehouse.
One possible simple conceptualization of a data warehouse architecture consists of the
following interconnected layers:
Operational database layer
The source data for the data warehouse - An organization's Enterprise
Resource Planning systems fall into this layer.
Data access layer
The interface between the operational and informational access layer - Tools to
extract, transform, load data into the warehouse fall into this layer.
Metadata layer
The data directory - This is usually more detailed than an operational system
data directory. There are dictionaries for the entire warehouse and sometimes
dictionaries for the data that can be accessed by a particular reporting and
analysis tool.
Informational access layer
The data accessed for reporting and analysis, and the tools for reporting and
analyzing data - Business intelligence tools fall into this layer. The Inmon-Kimball
differences over design methodology have to do with this layer.

Normalized versus dimensional approaches for storage of data

There are two leading approaches to storing data in a data warehouse - the
dimensional approach and the normalized approach.
In the dimensional approach, transaction data are partitioned into either "facts", which
are generally numeric transaction data, or "dimensions", which are the reference
information that gives context to the facts. For example, a sales transaction can be
broken up into facts such as the number of products ordered and the price paid for the
products, and into dimensions such as order date, customer name, product number,
order ship-to and bill-to locations, and salesperson responsible for receiving the order.
A key advantage of a dimensional approach is that the data warehouse is easier for the
user to understand and to use. Also, the retrieval of data from the data warehouse
tends to operate very quickly. The main disadvantages of the dimensional approach
are: 1) In order to maintain the integrity of facts and dimensions, loading the data
warehouse with data from different operational systems is complicated, and 2) It is
difficult to modify the data warehouse structure if the organization adopting the
dimensional approach changes the way in which it does business.
In the normalized approach, the data in the data warehouse are stored following, to a
degree, database normalization rules. Tables are grouped together by subject areas
that reflect general data categories (e.g., data on customers, products, finance, etc.)
The main advantage of this approach is that it is straightforward to add information
into the database. A disadvantage of this approach is that, because of the number of
tables involved, it can be difficult for users both to 1) join data from different sources
into meaningful information and then 2) access the information without a precise
understanding of the sources of data and of the data structure of the data warehouse.
These approaches are not mutually exclusive. Dimensional approaches can involve
normalizing data to a degree.
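The sales example from the dimensional approach can be sketched as a small star schema in SQLite: numeric facts in one table, reference dimensions in another. Table and column names are illustrative, not from any particular warehouse:

```python
# A minimal star-schema sketch: a fact table holding numeric measures,
# joined to a product dimension that gives the facts context.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE fact_sales (
        product_id INTEGER REFERENCES dim_product(product_id),
        order_date TEXT,      -- a dimension attribute, kept inline for brevity
        quantity   INTEGER,   -- fact: number of products ordered
        price_paid REAL       -- fact: price paid
    );
    INSERT INTO dim_product VALUES (1, 'Widget'), (2, 'Gadget');
    INSERT INTO fact_sales VALUES
        (1, '2024-01-05', 10, 99.50),
        (1, '2024-01-06',  5, 49.75),
        (2, '2024-01-05',  2, 30.00);
""")

# A typical warehouse query: total quantity sold per product.
rows = con.execute("""
    SELECT p.name, SUM(f.quantity)
    FROM fact_sales f JOIN dim_product p USING (product_id)
    GROUP BY p.name ORDER BY p.name
""").fetchall()
```

The query illustrates why dimensional models are easy to use: one join from the fact table to a dimension answers a business question directly, whereas a fully normalized schema might require joining many more tables.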

Conforming information
Another important decision in designing a data warehouse is which data to conform and
how to conform the data. For example, one operational system feeding data into the
data warehouse may use "M" and "F" to denote sex of an employee while another
operational system may use "Male" and "Female". Though this is a simple example,
much of the work in implementing a data warehouse is devoted to making similar
meaning data consistent when they are stored in the data warehouse. Typically,
extract, transform, load tools are used in this work.
Master Data Management has the aim of conforming data that could be considered
"dimensions".

APPLICATIONS

A Decision Support System (DSS) implemented by building a
data warehouse is an asset to an organization whose success
depends to a large extent on customer satisfaction. It would
therefore find application in the following sectors:

Government organizations

Banks

Insurance Companies

Utilities Providers

Health Care Providers

Financial Services Companies

Telecommunications Service Providers

Travel, Transport and Tourism Companies

Security Agencies

Evolution in organization use of data warehouses

Organizations generally start off with relatively simple use of data warehousing. Over
time, more sophisticated use of data warehousing evolves. The following general
stages of use of the data warehouse can be distinguished:

Off line Operational Database


Data warehouses in this initial stage are developed by simply copying the data
off an operational system to another server where the processing load of
reporting against the copied data does not impact the operational system's
performance.

Off line Data Warehouse


Data warehouses at this stage are updated from data in the operational systems
on a regular basis and the data warehouse data is stored in a data structure
designed to facilitate reporting.

Real Time Data Warehouse


Data warehouses at this stage are updated every time an operational system
performs a transaction (e.g. an order or a delivery or a booking.)

Integrated Data Warehouse


Data warehouses at this stage are updated every time an operational system
performs a transaction. The data warehouses then generate transactions that
are passed back into the operational systems.

Data Mining

Data Mining is the process of analyzing data from different perspectives and
summarizing it into useful information that can be used to increase revenue,
cut costs, or both.
Data mining software is one of a number of analytical tools for analyzing data.
It allows users to analyze data from many different dimensions or angles,
categorize it, and summarize the relationships identified.
Technically, data mining is the process of finding correlations or patterns
among dozens of fields in large relational databases.
Data mining tools support the OLAP concept and include query-and-reporting tools,
intelligent agents, multi-dimensional analysis tools, and statistical tools.
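A toy pattern-finding pass in the spirit of data mining: counting which items sell together without being told in advance which pairs to examine. The sales data is invented for the example:

```python
# Count co-occurring item pairs across sales transactions - a miniature
# version of the "which books sell together" question, found automatically
# rather than by asking about a specific pair.
from itertools import combinations
from collections import Counter

def frequent_pairs(transactions):
    """Count every pair of distinct items appearing in the same basket."""
    counts = Counter()
    for basket in transactions:
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts

sales = [["atlas", "pen"],
         ["atlas", "pen", "notebook"],
         ["notebook", "pen"]]
pairs = frequent_pairs(sales)
```

This is the OLAP-versus-mining distinction in miniature: an OLAP query could report sales of a pair you name, but this pass surfaces pairs you did not know to ask about.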

Use Of Data Mining & Data Warehousing


In order to obtain a competitive advantage and make the best use of our data, we
implemented data mining.
Throughout the year we gather huge amounts of information (data) such as book and
magazine sales, best sellers, and customer information from the customer loyalty
program. That information is very valuable as it can help with future pricing, stock
allocation and customer satisfaction. As this information is so large, every 3 months
we transfer the current data in our operational databases to our data warehouse,
where the information dates back for years. This enables our company to determine
relationships among "internal" factors such as price, product positioning, or staff
skills, and "external" factors such as suppliers, independent surveys, publishers,
economic indicators and competition.

How We Benefit

Seeing the time periods in which products sell (which month/season), e.g. more
educational books are sold in the months of August, September and October.

Gathering information about the activities and trends of our on-line users.
Comparing sales of a particular book/magazine to those of a competitor.
Identifying which books might sell together (even possibly which book and
magazine).
Customers' use of coupons and special offers.
Customer preferences and behaviours (from the customer loyalty program).

Data Mart
A data mart is a subset of an organizational data store,
usually oriented to a specific purpose or major data subject,
that may be distributed to support business needs. Data marts
are analytical data stores designed to focus on specific
business functions for a specific community within an
organization. Data marts are often derived from subsets of
data in a data warehouse, though in the bottom-up data
warehouse design methodology the data warehouse is created
from the union of organizational data marts.

Reasons for creating a data mart

Easy access to frequently needed data
Creates a collective view for a group of users
Improves end-user response time
Ease of creation
Lower cost than implementing a full data warehouse
Potential users are more clearly defined than in a full data warehouse

Dependent data mart


According to the Inmon school of data warehousing,
a dependent data mart is a logical subset (view) or
a physical subset (extract) of a larger data warehouse,
isolated for one of the following reasons:

A need for a special data model or schema: e.g., to
restructure the data for OLAP
Performance: to offload the data mart to a
separate computer for greater efficiency, or to obviate
the need to manage that workload on the centralized
data warehouse
Security: to separate an authorized data subset
selectively
Expediency: to bypass the data governance and
authorizations required to incorporate a new
application on the Enterprise Data Warehouse
Proving ground: to demonstrate the viability and
ROI (return on investment) potential of an
application prior to migrating it to the Enterprise
Data Warehouse
Politics: a coping strategy for IT (Information
Technology) in situations where a user group has
more influence than funding or is not a good citizen
on the centralized data warehouse
Politics: a coping strategy for consumers of data
in situations where a data warehouse team is
unable to create a usable data warehouse

According to the Inmon school of data warehousing,
tradeoffs inherent with data marts include limited
scalability, duplication of data, data inconsistency with
other silos of information, and inability to leverage
enterprise sources of data.

Benefits of data warehousing


Some of the benefits that a data warehouse provides are as follows:

A data warehouse provides a common data model for all data of interest
regardless of the data's source. This makes it easier to report and analyze
information than it would be if multiple data models were used to retrieve
information such as sales invoices, order receipts, general ledger charges, etc.
Prior to loading data into the data warehouse, inconsistencies are identified
and resolved. This greatly simplifies reporting and analysis.
Information in the data warehouse is under the control of data warehouse users
so that, even if the source system data is purged over time, the information in
the warehouse can be stored safely for extended periods of time.
Because they are separate from operational systems, data warehouses provide
retrieval of data without slowing down operational systems.
Data warehouses can work in conjunction with and, hence, enhance the value
of operational business applications, notably customer relationship
management (CRM) systems.
Data warehouses facilitate decision support system applications such as trend
reports (e.g., the items with the most sales in a particular area within the last
two years), exception reports, and reports that show actual performance versus
goals.

DATA WAREHOUSING IN eGOVERNANCE


Need for Data Warehouse for eGovernance

Information is one of the most valuable assets of any Government. When used
properly, it can help planners and decision makers make informed
decisions with a positive impact on the targeted group of citizens. However, to
use information to its fullest potential, planners and decision makers need
instant access to relevant data in a properly summarized form. In spite of
many computerization initiatives, Government decision makers
currently have difficulty obtaining meaningful information in a timely
manner because they have to request and depend on IT staff for
special reports, which often take a long time to generate. An Information
Warehouse can deliver strategic intelligence to the decision makers and
provide an insight into the overall situation. This greatly facilitates decision
makers in taking micro-level decisions in a timely manner without the need to
depend on their IT staff. By organizing person- and land-related data into a
meaningful Information Warehouse, Government decision makers can be
empowered with a flexible tool that enables them to make informed policy
decisions for citizen facilitation and to assess their impact on the intended
section of the population.

Benefits of a Data Warehouse for eGovernance

Citizen facilitation is the core objective of any Government body. For


facilitating the citizens of a state or a country, it is important to have the right
information about the people and the places of the concerned territory. Hence
a data warehouse built for eGovernance can typically have data related to
person and land. Such a data warehouse can be beneficial to both the
Government decision makers and citizens as well in the following manner:

How can decision makers benefit

They do not have to deal with the heterogeneous and sporadic information
generated by various state-level computerization projects as they can access
current data with a high granularity from the information warehouse.
They can take micro-level decisions in a timely manner without the need to
depend on their IT staff.
They can obtain easily decipherable and comprehensive information without
the need to use sophisticated tools.
They can perform extensive analysis of stored data to provide answers to the
exhaustive queries of the administrative cadre. This helps them formulate
more effective strategies and policies for citizen facilitation.

How can citizens benefit

They are the ultimate beneficiaries of the new policies formulated through the
decision makers' and policy planners' extensive analysis of person- and land-related
data.

They can view frequently asked queries whose results are already stored in
the database and are shown to the user immediately, saving the time
required for processing.

They can have easy access to the Government policies of the state.

The web access to Information Warehouse enables them to access the public
domain data from anywhere.

A Data Warehouse for eGovernance

The Centre for Development of Advanced Computing (C-DAC), in
collaboration with the Andhra Pradesh Technology Services (APTS), has
developed a data warehouse for aiding the state-level decision makers of the
Andhra Pradesh (AP) Government in their decision-making process. The main
objective of this effort is to organize the Multipurpose Household Survey
(MPHS) data and the land records data of the AP Government into a
meaningful information warehouse, enabling the decision makers to
make informed decisions and assess their impact on the intended
section of the population.

Disadvantages of data warehouses


There are also disadvantages to using a data warehouse. Some of them are:

Data warehouses are not the optimal environment for unstructured data.
Because data must be extracted, transformed and loaded into the warehouse,
there is an element of latency in data warehouse data.
Over their life, data warehouses can have high costs. The data warehouse is
usually not static. Maintenance costs are high.
Data warehouses can get outdated relatively quickly. There is a cost of
delivering suboptimal information to the organization.
There is often a fine line between data warehouses and operational systems.
Duplicate, expensive functionality may be developed. Or, functionality may be
developed in the data warehouse that, in retrospect, should have been
developed in the operational systems and vice versa.

The future of data warehousing


Data warehousing, like any technology niche, has a history of
innovations that did not receive market acceptance.
A 2009 Gartner Group paper predicted these developments in the
business intelligence/data warehousing market:

Because of lack of information, processes, and tools,
through 2012, more than 35 per cent of the top 5,000
global companies will regularly fail to make insightful
decisions about significant changes in their business and
markets.
By 2012, business units will control at least 40 per cent of
the total budget for business intelligence.
By 2010, 20 per cent of organizations will have an
industry-specific analytic application delivered via
software as a service as a standard component of their
business intelligence portfolio.
In 2009, collaborative decision making will emerge as a
new product category that combines social software with
business intelligence platform capabilities.
By 2012, one-third of analytic applications applied to
business processes will be delivered through coarse-grained application mashups.

Bibliography

B.Sc.(IT) Book, Kuvempu University (Karnataka)
