DATA MINING
Team Members
• Rohan Cambell (11)
• Vishal Chaudhari (12)
• Namitha Chitnis (13)
• Anurag Chivilkar (14)
• Poonam Dhakol (15)
Introduction to Data Warehousing
• A single, complete and consistent store of data obtained from a variety of different sources, made available to end users in a way they can understand and use in a business context.
A data warehouse is a
• subject-oriented
• integrated
• time-variant
• non-volatile
collection of data that is used primarily in organizational decision making. – Bill Inmon, Building the Data Warehouse, 1996
Subject Oriented
• Data is arranged and optimized to provide answers to questions from diverse functional areas.
• Data is organized and summarized by topic:
• Sales / Marketing / Finance / Distribution, etc.
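As a rough illustration of organizing and summarizing data by topic, the sketch below totals a few records per subject area. The records, the field names `subject` and `amount`, and the helper `summarize_by_subject` are hypothetical and chosen only for this example.

```python
from collections import defaultdict

# Hypothetical records; the subjects come from the slide, the amounts are made up.
records = [
    {"subject": "Sales", "amount": 120.0},
    {"subject": "Finance", "amount": 75.5},
    {"subject": "Sales", "amount": 30.0},
    {"subject": "Distribution", "amount": 12.5},
]


def summarize_by_subject(rows):
    """Return the total amount per subject area."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["subject"]] += row["amount"]
    return dict(totals)


summary = summarize_by_subject(records)
print(summary)
```

A real warehouse would do this kind of grouping inside the database; the point here is only that rows are keyed by subject rather than by the application that produced them.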
Time Variant
• The data warehouse represents the flow of data through time
• Can contain projected data from statistical models
• Data is periodically uploaded then time-dependent data is
Non Volatile
• Once data is entered it is NEVER removed
• Represents the company’s entire history
– Near-term history is continually added to it
– Always growing
– Must support terabyte databases and multiprocessors
• Read-Only database for data
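The non-volatile, time-stamped behaviour described above can be sketched as an append-only store. This is a minimal illustration under assumptions: the class `AppendOnlyStore`, its methods `load` and `as_of`, and the sample rows are all hypothetical and do not correspond to any particular warehouse product.

```python
from datetime import date


class AppendOnlyStore:
    """Toy store: rows are stamped with a snapshot date and never removed."""

    def __init__(self):
        self._rows = []

    def load(self, snapshot_date, rows):
        # Periodic upload: each row is copied and tagged with its snapshot date.
        for row in rows:
            self._rows.append({"snapshot_date": snapshot_date, **row})

    def as_of(self, day):
        # Read-only view of everything loaded on or before the given day.
        return [dict(r) for r in self._rows if r["snapshot_date"] <= day]

    def __len__(self):
        return len(self._rows)


store = AppendOnlyStore()
store.load(date(2024, 1, 31), [{"customer": "A", "balance": 100}])
store.load(date(2024, 2, 29), [{"customer": "A", "balance": 140},
                               {"customer": "B", "balance": 60}])
january_view = store.as_of(date(2024, 1, 31))
print(len(store), january_view)
```

There is deliberately no update or delete method: history only grows, and queries return copies so that readers cannot change stored rows.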
Why Data Warehousing?
A producer wants to know:
• Which are our lowest/highest margin customers?
• Who are my customers and what products are they buying?
• Which is the most effective distribution channel?
• Which product promotions have the biggest impact on revenue?
• Which customers are most likely to go to the competition?
• What impact will new products/services have on revenue?
Comparing a Data Warehouse and an Operational Database
Data Warehouse | Operational Database
Subject oriented | Application oriented
Metadata Warehouse
Integration
3. Metadata Layer
• The data directory: this is usually more detailed than an operational system data directory.
• There are dictionaries for the entire warehouse and sometimes dictionaries for the data that can be accessed by a
requirements as
• budgeting,
• forecasting,
• product line and customer profitability,
• sales analysis,
• financial consolidations,
• manufacturing mix analysis
• --applications that use historical, projected and
Technologies Involved In Data Warehousing
• source system identification
• data warehouse design and creation
• changed data capture
• data acquisition
• data cleansing
• data aggregation
• multi-dimensional analysis tools
• business intelligence (BI)
• metadata management
• data mining tools
• data visualization tools
• query tools
Benefits of Data Warehouse
• A data warehouse provides a common data model for all data of interest regardless of the data's source.
• Prior to loading data into the data warehouse, inconsistencies are identified and resolved. This greatly simplifies reporting and analysis.
• Information can be stored for long
• Data warehouses provide retrieval of data without slowing down operational systems, because they are separate from those systems.
• A data warehouse can work in conjunction with operational business applications such as CRM.
• Data warehouses facilitate decision support system applications such as
Disadvantages of Data Warehouse
• Data warehouses are not the optimal environment for unstructured data.
• Because data must be extracted, transformed and loaded into the warehouse, there is an element of latency in data warehouse data.
• Over their life, data
• Data warehouses can get outdated relatively quickly. There is a cost of delivering suboptimal information to the organization.
• There is a fine line between data warehouse and operational systems. So the functionality developed may be
Very Large Databases
Terabytes -- 10^12 bytes: Walmart -- 24 Terabytes
– Clustering / similarity matching
– Association rules and variants
– Deviation detection
Classification
• Given old data about customers and payments, predict a new applicant’s loan eligibility.
[Figure: previous customers' data (Age, Salary, Profession, Location, Customer Type) feeds a classifier, which produces decision rules (Salary > 5L, Prof. = Exec) that label a new applicant's data as Good / Bad.]
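The loan-eligibility example can be sketched as a small rule-based classifier. This is only an illustration under assumptions: the rules are hard-coded rather than learned from previous customers, "5L" is read as 5 lakh (500,000), and the way the two rules combine (either one is enough for "Good") is a guess, since the slide does not say. The names `Applicant` and `classify` and the sample values are hypothetical.

```python
from dataclasses import dataclass

LAKH = 100_000  # assumption: "5L" on the slide means 5 lakh


@dataclass
class Applicant:
    age: int
    salary: float
    profession: str
    location: str
    customer_type: str


def classify(applicant):
    """Apply the two decision rules named on the slide.

    Assumption: either rule alone marks the applicant as "Good".
    """
    if applicant.salary > 5 * LAKH:
        return "Good"
    if applicant.profession == "Exec":
        return "Good"
    return "Bad"


# Hypothetical new applicants.
high_earner = Applicant(30, 600_000, "Clerk", "Pune", "New")
executive = Applicant(45, 300_000, "Exec", "Mumbai", "New")
other = Applicant(25, 200_000, "Clerk", "Pune", "New")
print(classify(high_earner), classify(executive), classify(other))
```

In practice the thresholds and conditions would be derived by a learning algorithm from the old customer and payment data; here they are fixed by hand to show how the resulting rules are applied.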
• Regression: attempts to find a function which models the data with the least error.
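One common way to make "least error" concrete is ordinary least squares with a straight line. The sketch below assumes that reading: the function is restricted to y = slope * x + intercept and "error" means the sum of squared differences. The helper `fit_line` and the sample points are hypothetical.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Spread of x around its mean, and how x and y vary together.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data lying exactly on y = 2x + 1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

The function needs at least two distinct x values, otherwise the division by `sxx` fails. Other choices of function family or error measure lead to other regression methods.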