
Data Warehouse Models and OLAP Operations
Decision Support
Information technology to help the
knowledge worker (executive, manager,
analyst) make faster & better decisions
What were the sales volumes by region and product
category for the last year?
How did the share price of computer manufacturers correlate with quarterly profits over the past 10 years?
Which orders should we fill to maximize revenues?
On-line analytical processing (OLAP) is an
element of decision support systems
(DSS)

Three-Tier Decision Support Systems
Warehouse database server
Almost always a relational DBMS, rarely flat files
OLAP servers
Relational OLAP (ROLAP): extended relational
DBMS that maps operations on multidimensional
data to standard relational operators
Multidimensional OLAP (MOLAP): special-purpose
server that directly implements multidimensional
data and operations
Clients
Query and reporting tools
Analysis tools
Data mining tools

The Complete Decision Support System

(Figure: the three-tier architecture. Information Sources (operational DBs, semistructured sources) are extracted, transformed, loaded, and refreshed into the Data Warehouse Server (Tier 1), which also feeds Data Marts; OLAP Servers (Tier 2), e.g. MOLAP or ROLAP, serve the warehouse data to the Clients (Tier 3) for Analysis, Query/Reporting, and Data Mining.)
Approaches to OLAP Servers
Relational DBMS as Warehouse Servers
Two possibilities for OLAP servers
(1) Relational OLAP (ROLAP)
Relational and specialized relational DBMS to store
and manage warehouse data
OLAP middleware to support missing pieces
(2) Multidimensional OLAP (MOLAP)
Array-based storage structures
Direct access to array data structures
The greatest advantage of MOLAP systems in
comparison with ROLAP is that multidimensional
operations can be performed in an easy, natural
way with MOLAP without any need for complex
join operations. For this reason, MOLAP system performance is excellent.
Warehouse Models & Operators
Data Models
relations
stars & snowflakes
cubes
Operators
slice & dice
roll-up, drill down
pivoting
other

Multi-Dimensional Data
Measures - numerical data being tracked
Dimensions - business parameters that define a
transaction
Example: Analyst may want to view sales data
(measure) by geography, by time, and by
product (dimensions)
Dimensional modeling is a technique for
structuring data around the business concepts
ER models describe entities and
relationships
Dimensional models describe measures and
dimensions
The Multi-Dimensional Model
Sales by product line over the past six months
Sales by store between 1990 and 1995
(Figure: a fact table for measures with the columns Prod Code, Time Code, Store Code, and Sales Qty; the key columns join the fact table to the dimension tables Product Info, Store Info, and Time Info, while Sales Qty is the numerical measure.)
Dimensional Modeling

Dimensions are organized into hierarchies
E.g., Time dimension: days -> weeks -> quarters
E.g., Product dimension: product -> product line -> brand
Dimensions have attributes
ROLAP: Dimensional Modeling Using Relational DBMS
Special schema design: star, snowflake
Special indexes: bitmap, multi-table join
Special tuning: maximize query
throughput
Proven technology (relational model,
DBMS),
Products
IBM DB2, Oracle, Sybase IQ, RedBrick,
Informix

MOLAP: Dimensional Modeling Using the Multi-Dimensional Data Model
MDDB (multi-dimensional data model): a special-purpose data model
Facts stored in multi-dimensional arrays
Dimensions used to index array
Sometimes on top of relational DB
Products
Pilot, Arbor Essbase, Gentia

Star Schema (in RDBMS)

Star Schema Example

Star Schema with Sample Data

Star Schema
A single fact table, with detail and summary data
Fact table primary key has only one key column per dimension
Each key is generated
Each dimension is a single table, highly denormalized
(Figure: the fact table holds STORE KEY, PRODUCT KEY, PERIOD KEY and the measures Dollars, Units, Price. The Store dimension holds STORE KEY, Store Description, City, State, District ID, District Desc., Region_ID, Region Desc., Regional Mgr., Level; the Time dimension holds PERIOD KEY, Period Desc, Year, Quarter, Month, Day, Current Flag, Resolution, Sequence; the Product dimension holds PRODUCTKEY, Product Desc., Brand, Color, Size, Manufacturer, Level.)
Benefits: easy to understand, easy to define hierarchies, reduces # of physical joins, low maintenance, very simple metadata
Drawbacks: summary data in the fact table yields poorer performance for summary levels; huge dimension tables can be a problem
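To make the query pattern concrete, here is a minimal sketch (not from the slides; all table and column names are hypothetical) of a star join in pandas: the fact table is merged with each dimension table on its generated key and then aggregated by dimension attributes.

import pandas as pd

# Hypothetical star schema: one fact table plus two denormalized dimension tables
fact = pd.DataFrame({
    "store_key":  [1, 1, 2, 2],
    "period_key": [100, 100, 100, 101],
    "dollars":    [12.0, 11.0, 50.0, 8.0],
    "units":      [3, 2, 10, 1],
})
store_dim = pd.DataFrame({
    "store_key": [1, 2],
    "city":      ["Gainesville", "Miami"],
    "region":    ["NORTH", "SOUTH"],
})
period_dim = pd.DataFrame({
    "period_key": [100, 101],
    "year":       [1995, 1995],
    "quarter":    [1, 2],
})

# "Star join": fact table merged with each dimension on its key,
# then grouped by the dimension attributes of interest
result = (fact.merge(store_dim, on="store_key")
              .merge(period_dim, on="period_key")
              .groupby(["region", "quarter"], as_index=False)["dollars"].sum())
print(result)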
Star Schema
The biggest drawback: dimension tables must carry a level indicator for every record, and every query must use it. In the example below, without the level constraint, keys for all stores in the NORTH region, including aggregates for region and district, would be pulled from the fact table, resulting in error.
(Figure: the same star schema as before; the Store and Product dimension tables carry a Level column.)
Level is needed whenever aggregates are stored with detail facts.
Example:
Select A.STORE_KEY, A.PERIOD_KEY, A.dollars
from Fact_Table A
where A.STORE_KEY in (select STORE_KEY
                      from Store_Dimension B
                      where region = 'North' and Level = 2)
and etc...
The Snowflake Schema
(Figure: the Store dimension is normalized into separate Store, District (District_ID, District Desc., Region_ID) and Region (Region_ID, Region Desc., Regional Mgr.) tables, and there is a fact table at each level: a Store Fact Table keyed by STORE KEY, a District Fact Table keyed by District_ID, and a Region Fact Table keyed by Region_ID, each also carrying PRODUCT_KEY, PERIOD_KEY and the measures Dollars, Units, Price.)
Advantages of ROLAP
Dimensional Modeling

Define complex, multi-dimensional data with a simple model
Reduces the number of joins a query has to process
Allows the data warehouse to evolve with relatively low maintenance
HOWEVER! Star schema and relational
DBMS are not the magic solution
Query optimization is still problematic

Aggregates
Add up amounts for day 1
In SQL: SELECT sum(amt) FROM SALE
WHERE date = 1

sale:  prodId  storeId  date  amt
       p1      s1       1     12
       p2      s1       1     11
       p1      s3       1     50
       p2      s2       1     8
       p1      s1       2     44
       p1      s2       2     4

Result: 81
Aggregates
Add up amounts by day
In SQL: SELECT date, sum(amt) FROM SALE
GROUP BY date

sale:  prodId  storeId  date  amt        ans:  date  sum
       p1      s1       1     12               1     81
       p2      s1       1     11               2     48
       p1      s3       1     50
       p2      s2       1     8
       p1      s1       2     44
       p1      s2       2     4
Another Example
Add up amounts by day, product
In SQL: SELECT prodId, date, sum(amt) FROM SALE
        GROUP BY prodId, date

sale:  prodId  storeId  date  amt        result:  prodId  date  amt
       p1      s1       1     12                  p1      1     62
       p2      s1       1     11                  p2      1     19
       p1      s3       1     50                  p1      2     48
       p2      s2       1     8
       p1      s1       2     44
       p1      s2       2     4

rollup
drill-down
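The three aggregations above can be reproduced outside SQL as well; a minimal pandas sketch over the same SALE table (illustrative only, not part of the original slides):

import pandas as pd

# The SALE table from the slides
sale = pd.DataFrame({
    "prodId":  ["p1", "p2", "p1", "p2", "p1", "p1"],
    "storeId": ["s1", "s1", "s3", "s2", "s1", "s2"],
    "date":    [1, 1, 1, 1, 2, 2],
    "amt":     [12, 11, 50, 8, 44, 4],
})

# SELECT sum(amt) FROM SALE WHERE date = 1            -> 81
total_day1 = sale.loc[sale["date"] == 1, "amt"].sum()

# SELECT date, sum(amt) FROM SALE GROUP BY date       -> day 1: 81, day 2: 48
by_date = sale.groupby("date", as_index=False)["amt"].sum()

# SELECT prodId, date, sum(amt) FROM SALE GROUP BY prodId, date
by_prod_date = sale.groupby(["prodId", "date"], as_index=False)["amt"].sum()

print(total_day1)
print(by_date)
print(by_prod_date)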
Aggregates
Operators: sum, count, max, min, median, avg
Having clause
Using dimension hierarchy
average by region (within store)
maximum by month (within date)

ROLAP vs. MOLAP
ROLAP:
Relational On-Line Analytical Processing
MOLAP:
Multi-Dimensional On-Line Analytical
Processing

The MOLAP Cube

Fact table view:                 Multi-dimensional cube:
sale:  prodId  storeId  amt
       p1      s1       12             s1   s2   s3
       p2      s1       11       p1    12        50
       p1      s3       50       p2    11   8
       p2      s2       8

dimensions = 2
3-D Cube
Fact table view:
sale:  prodId  storeId  date  amt
       p1      s1       1     12
       p2      s1       1     11
       p1      s3       1     50
       p2      s2       1     8
       p1      s1       2     44
       p1      s2       2     4

Multi-dimensional cube:
day 2:        s1   s2   s3
        p1    44   4
        p2
day 1:        s1   s2   s3
        p1    12        50
        p2    11   8

dimensions = 3
Example
Dimensions: Time, Product, Store
Attributes: Product (upc, price, ...), Store (...), ...
Hierarchies: Product -> Brand; Day -> Week -> Quarter; Store -> Region -> Country
(Figure: a 3-D cube with a Store axis (NY, SF, LA, with roll-up to region), a Product axis (Juice, Milk, Coke, Cream, Soap, Bread, with roll-up to brand), and a Time axis (M, T, W, Th, F, S, S, with roll-up to week). Example cell: 56 units of bread sold in LA on Monday.)
Cube Aggregation: Roll-up
Example: computing sums

Starting cube:
day 2:        s1   s2   s3
        p1    44   4
        p2
day 1:        s1   s2   s3
        p1    12        50
        p2    11   8

Roll-up over time (sum across days):
              s1   s2   s3
        p1    56   4    50
        p2    11   8

Roll-up over products as well:
              s1   s2   s3
        sum   67   12   50

Roll-up over stores instead:
        p1    110
        p2    19

Rolling up everything gives the grand total: sum = 129

rollup
drill-down
Cube Operators for Roll-up
The roll-ups above can be written with cube operators, where * means "aggregated over that dimension":
sale(s1,*,*)  = 67   (total sales at store s1, over all products and days)
sale(s2,p2,*) = 8    (sales of product p2 at store s2, over all days)
sale(*,*,*)   = 129  (the grand total)
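One way to picture these operators is as sums over the axes of a dense array; a minimal numpy sketch of the 2 x 2 x 3 cube above (empty cells are stored as 0 purely for simplicity; this is illustrative, not a MOLAP engine):

import numpy as np

# cube[day, product, store]; days: day1, day2; products: p1, p2; stores: s1, s2, s3
cube = np.array([
    [[12, 0, 50],    # day 1, p1
     [11, 8,  0]],   # day 1, p2
    [[44, 4,  0],    # day 2, p1
     [ 0, 0,  0]],   # day 2, p2
])

# Roll-up over time: sale(*, p, s) = sum over the day axis
by_prod_store = cube.sum(axis=0)        # [[56, 4, 50], [11, 8, 0]]

# Roll-up over time and stores: sale(*, p, *)
by_prod = cube.sum(axis=(0, 2))         # [110, 19]

# Grand total: sale(*, *, *)
grand_total = cube.sum()                # 129

# Slice: fix one dimension, e.g. TIME = day 1
day1_slice = cube[0]                    # [[12, 0, 50], [11, 8, 0]]

print(by_prod_store, by_prod, grand_total, day1_slice, sep="\n")

Drill-down is simply the reverse direction: moving from the summed arrays back to the detailed cube.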
Extended Cube
The cube is extended with * rows and columns that hold the aggregates along each dimension:

day 1:        s1   s2   s3   *
        p1    12        50   62
        p2    11   8         19
        *     23   8    50   81

day 2:        s1   s2   s3   *
        p1    44   4         48
        p2
        *     44   4         48

*:            s1   s2   s3   *
        p1    56   4    50   110
        p2    11   8         19
        *     67   12   50   129

e.g., sale(*,p2,*) = 19
Aggregation Using Hierarchies
Stores roll up to regions, and regions to countries (store -> region -> country).

Starting cube:
day 2:        s1   s2   s3
        p1    44   4
        p2
day 1:        s1   s2   s3
        p1    12        50
        p2    11   8

Rolling stores up to regions (store s1 in region A; stores s2, s3 in region B):
              region A   region B
        p1    56         54
        p2    11         8
Slicing
Starting cube:
day 2:        s1   s2   s3
        p1    44   4
        p2
day 1:        s1   s2   s3
        p1    12        50
        p2    11   8

Slice on TIME = day 1:
              s1   s2   s3
        p1    12        50
        p2    11   8
Slicing & Pivoting
Sales ($ millions), products by store and time:

                            d1      d2
Store s1   Electronics      $5.2
           Toys             $1.9
           Clothing         $2.3
           Cosmetics        $1.1
Store s2   Electronics      $8.9
           Toys             $0.75
           Clothing         $4.6
           Cosmetics        $1.5

After slicing on d1 and pivoting (stores become the columns):

Sales ($ millions), d1:     Store s1   Store s2
           Electronics      $5.2       $8.9
           Toys             $1.9       $0.75
           Clothing         $2.3       $4.6
           Cosmetics        $1.1       $1.5
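A pivot of this kind is just a reshaping operation; a minimal pandas sketch using the d1 figures from the table above (illustrative only):

import pandas as pd

# Sales ($ millions) for period d1, as in the slide
sales = pd.DataFrame({
    "store":   ["s1"] * 4 + ["s2"] * 4,
    "product": ["Electronics", "Toys", "Clothing", "Cosmetics"] * 2,
    "dollars": [5.2, 1.9, 2.3, 1.1, 8.9, 0.75, 4.6, 1.5],
})

# Pivot: products stay on the rows, stores move to the columns
pivoted = sales.pivot_table(index="product", columns="store", values="dollars")
print(pivoted)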
Summary of Operations
Aggregation (roll-up)
aggregate (summarize) data to the next higher dimension element
e.g., total sales by city, year -> total sales by region, year
Navigation to detailed data (drill-down)
Selection (slice) defines a subcube
e.g., sales where city = Gainesville and date = 1/15/90
Calculation and ranking
e.g., top 3% of cities by average income
Visualization operations (e.g., Pivot)
Time functions
e.g., time average

Query & Analysis Tools
Query Building
Report Writers (comparisons, growth, graphs, ...)
Spreadsheet Systems
Web Interfaces
Data Mining

Fact Tables
Contains two or more foreign keys
Tend to have huge numbers of records
Useful facts tend to be numeric and
additive
Dimension Tables
Contain text and descriptive information
The "1" side of a 1-M relationship
Generally the source of interesting
constraints
Typically contain the attributes for the
SQL answer set.
Warehouse Models & Operators
Data Models
relations
stars & snowflakes
cubes
Operators
slice & dice
roll-up, drill down
pivoting
other
Summary of Operations
Aggregation (roll-up)
aggregate (summarize) data to the next higher
dimension element
e.g., total sales by city, year -> total sales by region, year
Navigation to detailed data (drill-down)
Selection (slice) defines a subcube
e.g., sales where city = Gainesville and date = 1/15/90
Calculation and ranking
e.g., top 3% of cities by average income
Visualization operations (e.g., Pivot)
Time functions
e.g., time average

The MOLAP Cube

Fact table view:                 Multi-dimensional cube:
sale:  prodId  storeId  amt
       p1      s1       12             s1   s2   s3
       p2      s1       11       p1    12        50
       p1      s3       50       p2    11   8
       p2      s2       8

dimensions = 2
3-D Cube
Fact table view:
sale:  prodId  storeId  date  amt
       p1      s1       1     12
       p2      s1       1     11
       p1      s3       1     50
       p2      s2       1     8
       p1      s1       2     44
       p1      s2       2     4

Multi-dimensional cube:
day 2:        s1   s2   s3
        p1    44   4
        p2
day 1:        s1   s2   s3
        p1    12        50
        p2    11   8

dimensions = 3
Slicing
Starting cube:
day 2:        s1   s2   s3
        p1    44   4
        p2
day 1:        s1   s2   s3
        p1    12        50
        p2    11   8

Slice on TIME = day 1:
              s1   s2   s3
        p1    12        50
        p2    11   8
Example
Add up amounts by day, product
In SQL: SELECT prodId, date, sum(amt) FROM SALE
        GROUP BY prodId, date

sale:  prodId  storeId  date  amt        result:  prodId  date  amt
       p1      s1       1     12                  p1      1     62
       p2      s1       1     11                  p2      1     19
       p1      s3       1     50                  p1      2     48
       p2      s2       1     8
       p1      s1       2     44
       p1      s2       2     4

rollup
drill-down

Outline
Overview of Data Mining
What is Data Mining?
Steps in Data Mining
Overview of Data Mining techniques
Points to Remember
DATA MINING

Data mining refers to the mining or discovery of new information, in terms of patterns or rules, from vast amounts of data.
To be practically useful, data mining must be
carried out efficiently on large files and databases.
This chapter briefly reviews the state-of-the-art of
this extensive field of data mining.
Data mining uses techniques from such areas as machine learning, statistics, neural networks, and genetic algorithms.

1. OVERVIEW OF DATA MINING
Data Mining as a Part of the Knowledge Discovery Process
Knowledge Discovery in Databases, abbreviated as KDD, encompasses more than data mining.
The knowledge discovery process comprises six phases: data selection, data cleansing, enrichment, data transformation or encoding, data mining, and the reporting and displaying of the discovered information.
Example
Consider a transaction database maintained by a specialty consumer goods retailer. Suppose the client data includes a
customer name, zip code, phone number, date of
purchase, item code, price, quantity, and total amount.
A variety of new knowledge can be discovered by KDD
processing on this client database.
During data selection, data about specific items or
categories of items, or from stores in a specific region or
area of the country, may be selected.
The data cleansing process then may correct invalid zip
codes or eliminate records with incorrect phone prefixes.
Enrichment enhances the data with additional sources of
information. For example, given the client names and
phone numbers, the store may purchase other data about
age, income, and credit rating and append them to each
record.
Data transformation and encoding may be done to reduce
the amount of data.
Example (cont.)
The result of mining may be to discover
the following type of new information:
Association rules e.g., whenever a customer buys video
equipment, he or she also buys another electronic gadget.
Sequential patterns e.g., suppose a customer buys a camera, and within three months he or she buys photographic supplies, then within six months he is likely to buy an accessory item. This defines a sequential pattern of transactions. A customer who buys more than twice in regular periods may be likely to buy at least once during the Christmas period.
Classification trees e.g., customers may be classified by
frequency of visits, by types of financing used, by amount
of purchase, or by affinity for types of items, and some
revealing statistics may be generated for such classes.

We can see that many possibilities exist for discovering
new knowledge about buying patterns, relating factors
such as age, income group, place of residence, to what
and how much the customers purchase.
This information can then be utilized
to plan additional store locations based on demographics,
to run store promotions,
to combine items in advertisements, or to plan seasonal
marketing strategies.
As this retail store example shows, data mining must be
preceded by significant data preparation before it can
yield useful information that can directly influence
business decisions.
The results of data mining may be reported in a variety of
formats, such as listings, graphic outputs, summary
tables, or visualization.

Goals of Data Mining and
Knowledge Discovery
Data mining is carried out with some end goals.
These goals fall into the following classes:
Prediction Data mining can show how certain
attributes within the data will behave in the future.
Identification Data patterns can be used to identify
the existence of an item, an event or an activity.
Classification Data mining can partition the data so
that different classes or categories can be identified
based on combinations of parameters.
Optimization One eventual goal of data mining may
be to optimize the use of limited resources such as
time, space, money, or materials and to maximize
output variables such as sales or profits under a given
set of constraints.

Data Mining: On What Kind
of Data?
Relational databases
Data warehouses
Transactional databases
Advanced DB and information repositories
Object-oriented and object-relational databases
Spatial databases
Time-series data and temporal data
Text databases and multimedia databases
Heterogeneous and legacy databases
World Wide Web

Types of Knowledge Discovered During Data Mining
Data mining addresses inductive knowledge, which
discovers new rules and patterns from the supplied
data.
Knowledge can be represented in many forms: In
an unstructured sense, it can be represented by
rules. In a structured form, it may be represented
in decision trees, semantic networks, or hierarchies
of classes or frames.
It is common to describe the knowledge discovered
during data mining in five ways:
Association rules These rules correlate the presence of a set of items with a range of values for another set of variables.

Types of Knowledge
Discovered (cont.)
Classification hierarchies The goal is to work
from an existing set of events or transactions to
create a hierarchy of classes.
Patterns within time series
Sequential patterns: A sequence of actions or
events is sought. Detection of sequential patterns is
equivalent to detecting associations among events
with certain temporal relationships.
Clustering A given population of events can be
partitioned into sets of similar elements.

Main functional phases of the KD process
Learning the application domain:
relevant prior knowledge and goals of application
Creating a target data set: data selection
Data cleaning and preprocessing: (may take 60% of effort!)
Data reduction and transformation:
Find useful features, dimensionality/variable reduction,
invariant representation.
Choosing functions of data mining
summarization, classification, regression, association,
clustering.
Choosing the mining algorithm(s)
Data mining: search for patterns of interest
Pattern evaluation and knowledge presentation
visualization, transformation, removing redundant patterns, etc.
Use of discovered knowledge

Main phases of data mining
(Figure: the knowledge discovery pipeline. Data Sources are combined by Data Integration and Data Cleaning into a Data Warehouse; Selection/Transformation produces the Task-relevant Data; Data Mining extracts Patterns; Pattern Evaluation/Presentation delivers the results.)
What is Data Mining?
Data mining is an analytic process designed to explore large
amounts of data in search of consistent patterns and/or
systematic relationships between variables, and then to validate
the findings by applying the detected patterns to new subsets of
data.
Data Mining is a process of torturing the data until
they confess
The typical goals of data mining projects are:
Identification of groups, clusters, strata, or dimensions
in data that display no obvious structure,
The identification of factors that are related to a particular
outcome of interest (root-cause analysis)
Accurate prediction of outcome variable(s) of interest (in
the future, or in new customers, clients, applicants, etc.; this
application is usually referred to as predictive data mining)
What is Data Mining?
Data mining is used to
Detect fraudulent patterns in credit card
transactions, insurance claims, etc.
Detect default patterns
Model customer buying patterns and behavior
for cross-selling, up selling, and customer
acquisition
Optimize engine performance and several other
complex manufacturing processes
Data mining can be utilized in any organization
that needs to find patterns or relationships in
their data.
DM: Overview

DM: Phases
Business Understanding
Understanding project objectives and requirements
Data mining problem definition
Data Understanding
Initial data collection and familiarization
Identify data quality issues
Initial, obvious results
Data Preparation
Record and attribute selection
Data cleansing
Modeling
Run the data analysis and data mining tools
Evaluation
Determine if results meet business objectives
Identify business issues that should have been addressed earlier
Deployment
Put the resulting models into practice
Set up for repeated/continuous mining of the data
Phases in the DM Process
(1)
Business
Understanding:
Statement of
Business Objective
Statement of Data
Mining objective
Statement of
Success Criteria

Phases in the DM Process
(2)
Data
Understanding
Collect data
Describe data
Explore the data
Verify the quality
and identify
outliers

Phases in the DM Process (3)
Data preparation:
Can take over 90% of the time
Consolidation and Cleaning
table links, aggregation
level, missing values,
etc
Data selection
Remove noisy data,
repetitions, etc
Remove outliers?
Select samples
visualization tools
Transformations - create
new variables, formats

Phases in the DM Process (3), cont.
Data preparation:
May take up to 90% of the time
Select Data
Rationale for Inclusion / Exclusion: if it isn't really from your domain, remove it
Clean Data
Remove repetitions
Remove headers, footers,
tables, pictures etc (BootCat
does this automatically)
Transform Data
Convert to plain text (ditto)
Reduce to word-frequency list,
keyword-freqs can be features
in machine-learning

Phases in the DM Process (4)
Model building
Selection of the
modeling
techniques is based
upon the data
mining objective
Modeling can be an
iterative process;
may model for
either description or
prediction

Phases in the DM Process (5)
Model Evaluation
Evaluation of model:
how well it performed,
how well it met business
needs
Methods and criteria
depend on model type:
e.g., confusion matrix
with classification
models, mean error rate
with regression models
Interpretation of model:
important or not, easy or
hard depends on
algorithm

Phases in the DM Process (6)
Deployment
Determine how the
results need to be utilized
Who needs to use them?
How often do they need
to be used
Deploy Data Mining
results by:
Utilizing results as
business rules
Publishing report for
users, with
recommendations to
improve their business

Why DM?: Concept
Description
Descriptive vs. predictive data mining
Descriptive mining: describes concepts or task-
relevant data sets in concise, summarative,
informative, discriminative forms
Predictive mining: Based on data and analysis,
constructs models from the data-set, and predicts
the trend and properties of unknown data
Concept description:
Characterization: provides a concise and succinct
summarization of the given collection of data
Comparison: provides descriptions comparing two
or more collections of data
DM vs. OLAP
Data Mining:
can handle complex data types of the
attributes and their aggregations
a more automated process
Online Analytic Processing (visualization):
restricted to a small number of dimension
and measure types
user-controlled process

DM: Summary
Business Understanding
Understanding project objectives and requirements
Data mining problem definition
Data Understanding
Initial data collection and familiarization
Identify data quality issues
Initial, obvious results
Data Preparation
Record and attribute selection
Data cleansing
Modeling
Run the data mining tools
Evaluation
Determine if results meet business objectives
Identify business issues that should have been addressed earlier
Deployment
Put the resulting models into practice
Set up for repeated/continuous mining of the data

Steps in Data Mining
Stage 1: Precise statement of the
problem.
Stage 2: Initial exploration.
Stage 3: Model building and validation.
Stage 4: Deployment.
Steps in Data Mining
Stage 1: Precise statement of the problem.

Before opening a software package and running an analysis, the


analyst must be clear as to what question he wants to answer. If
you have not given a precise formulation of the problem you are
trying to solve, then you are wasting time and money.

Stage 2: Initial exploration.

This stage usually starts with data preparation that may involve the
cleaning of the data (e.g., identification and removal of incorrectly
coded data, etc.), data transformations, selecting subsets of
records, and, in the case of data sets with large numbers of
variables (fields), performing preliminary feature selection. Data
description and visualization are key components of this stage (e.g.
descriptive statistics, correlations, scatterplots, box plots, etc.).
Steps in Data Mining
Stage 3: Model building and validation.

This stage involves considering various models and choosing


the best one based on their predictive performance.

Stage 4: Deployment.

When the goal of the data mining project is to predict or classify new cases (e.g., to predict the creditworthiness of individuals applying for loans), this final stage typically involves the application of the best model or models (determined in the previous stage) to generate predictions.
Initial exploration
Cleaning of data,
Identification and removal of incorrectly coded data,
e.g., Degree=Graduate, salary=100.

Data transformations,
Data may be skewed (that is, outliers in one direction or another
may be present). Log transformation, Box-Cox transformation, etc.

Data reduction, Selecting subsets of records, and, in the case of data sets
with large numbers of variables (fields), performing preliminary feature
selection.

Data description and visualization are key components of this stage (e.g.
descriptive statistics, correlations, scatterplots, box plots, brushing tools,
etc.)
Data description allows you to get a snapshot of the important
characteristics of the data (e.g. central tendency and dispersion).
Model building and validation.
A model is typically rated according to 2 aspects:
Accuracy
Understandability
These aspects often conflict with one another.
Decision trees and linear regression models are less
complicated and simpler than models such as neural
networks, boosted trees, etc. and thus easier to
understand, however, you might be giving up some
predictive accuracy.
Remember not to confuse the data mining model
with reality (a road map is not a perfect
representation of the road) but it can be used as a
useful guide.
Model building and validation.
Validation of the model requires that you
train the model on one set of data and
evaluate on another independent set of
data.
There are two main methods of validation
Split data into train/test datasets (75-25 split)
If you do not have enough data to have a
holdout sample, then use v-fold cross
validation.
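A minimal sketch of the 75-25 hold-out split described above, using plain numpy rather than any particular data mining package (function and variable names are illustrative):

import numpy as np

def train_test_split(X, y, test_fraction=0.25, seed=0):
    """Randomly hold out a fraction of the rows for evaluation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = int(len(X) * test_fraction)
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Example: 100 synthetic observations with 3 features
X = np.random.rand(100, 3)
y = np.random.randint(0, 2, size=100)
X_train, X_test, y_train, y_test = train_test_split(X, y)
print(len(X_train), len(X_test))   # 75 25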
Model building and validation.
Model Validation Measures
Possible validation measures
Classification accuracy
Total cost/benefit when different errors involve
different costs
Lift and Gains curves
Error in Numeric predictions
Error rate
Proportion of errors made over the whole set of
instances
Training set error rate: is way too optimistic!
You can find patterns even in random data
Deployment.
A model is built once, but can be used over and
over again.

Model should be easily deployable.


A linear regression is easily deployed. Simply gather
the regression coefficients
For example, if a new observed data vector comes in
{x1, x2, x3}, then simply plug into linear equation to
generate predicted value,
Prediction = B0 + B1*X1 + B2*X2 + B3*X3

What about for more complicated models, such


as neural networks?

Within STATISTICA, we will use Rapid Deployment


module in order to easily deploy models.
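Returning to the simple linear-regression case above, deploying the model really is just evaluating the equation; a minimal Python sketch with made-up coefficient values (B0-B3 are placeholders, not values from the slides):

# Hypothetical coefficients gathered from a fitted linear regression
B0, B1, B2, B3 = 1.5, 0.8, -0.2, 2.1

def predict(x1, x2, x3):
    """Prediction = B0 + B1*X1 + B2*X2 + B3*X3 (the deployment step)."""
    return B0 + B1 * x1 + B2 * x2 + B3 * x3

# Plug a new observed data vector {x1, x2, x3} into the linear equation
print(predict(2.0, 5.0, 1.0))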
Data Mining Techniques
Neural Networks
Generalized EM And K-means Cluster Analysis
General CART Models
General CHAID Models
Interactive Trees (C&RT and CHAID)
Boosted Tree Classifiers and Regression
Association Rules
MARSPlines
Machine Learning (Bayesian, Support Vectors and Nearest Neighbors)
Random Forests for Regression and Classification
Generalized Additive Models (GAM)
Feature Selection and Variable Screening
Data Mining techniques
Supervised Learning
Supervised learning is a machine learning technique for
deducing a function from training data.
The training data consist of pairs of input variables and desired
outputs. The task of the supervised learner is to predict the
value of the function for any valid input object after having seen
a number of training examples.
Classification and Regression are very popular techniques of
supervised learning.
Unsupervised Learning
In unsupervised learning, the training data set is not available in the form of input and output variables.
Unsupervised learning is a class of problems in which the researcher seeks to determine how the data are organized.
Cluster analysis, and Principal component analysis are very
popular techniques for unsupervised learning.
Points to Remember..
Data mining is a tool, not a magic box.

Data mining will not automatically discover


solutions without guidance.

To ensure meaningful results, it's vital that you understand your data.
User-centric interactive process which
leverages analytic technologies and
computing power.

Data mining central quest: Find true patterns


and avoid overfitting (finding random
patterns by searching too many possibilities)
Classification and Regression.
Databases are rich with hidden information that
can be used to make intelligent business
decisions.
Classification and Regression are two forms of data analysis that can be used to extract models describing important data classes or to predict future data trends.
Classification is used to predict or classify a categorical response variable, e.g., to predict the Iris type of a flower (Setosa, Virginica, Versicolor).
Regression is used to predict a quantitative response variable, e.g., the average income of a household.
Statistical learning plays a key role in many areas of
science, finance, industry many other applications.
Here are some examples of learning problems:
Predict whether a patient, hospitalized due to a
heart attack, will have a second heart attack. The
prediction is to be based on demographic, diet and
clinical measurements for that patient.
Predict the price of a stock in 6 months from now, on
the basis of company performance measures and
economic data.
Identify the customers who will be beneficial for the banker in loan applications.
Identify the numbers in a handwritten ZIP code, from a digitized image.
Estimate the amount of glucose in the blood of a diabetic person, from the infrared absorption spectrum of that person's blood.
Steps of Classification and Regression
models

Step 1: In the first step a model is built


describing a predetermined set of
data classes. (Supervised learning).

Step 2: In the second step the predictive accuracy


of the model is estimated.

Step 3: If the accuracy of the model is


considered acceptable, then the
model can be used to classify future
data for which the class label is unknown.
Techniques.
Different kind of Classification and Regression
techniques are available in STATISTICA,
including
1. Classification and Regression, through
STATISTICA Automated Neural Network.
2. General Classification and Regression tree.
3. General CHAID model.
4. Boosted Tree Classification and Regression.
5. Random Forest for Classification and
Regression, etc.
2. ASSOCIATION RULES
What Is Association Rule
Mining?
Association rule mining is finding frequent
patterns, associations, correlations, or causal
structures among sets of items or objects in
transaction databases, relational databases,
and other information repositories.
Applications:
Basket data analysis,
cross-marketing,
catalog design,
clustering, classification, etc.
Rule form: Body -> Head [support, confidence].
Association rule mining

Examples.
buys(x, "diapers") -> buys(x, "beers") [0.5%, 60%]
major(x, "CS") ^ takes(x, "DB") -> grade(x, "A") [1%, 75%]

Association Rule Mining Problem:


Given: (1) database of transactions, (2) each
transaction is a list of items (purchased by a
customer in a visit)
Find: all rules that correlate the presence of one
set of items with that of another set of items
E.g., 98% of people who purchase tires and auto
accessories also get automotive services done.
Rule Measures: Support and
Confidence
Let J = {i1, i2, ..., im} be a set of items. Let D, the task-relevant data, be a set of database transactions where each transaction T is a set of items such that T ⊆ J. Each transaction T is said to contain A if and only if A ⊆ T.
An association rule is an implication of the form A ⇒ B, where A ⊂ J, B ⊂ J and A ∩ B = ∅.
The rule A ⇒ B holds in the transaction set D with support s, where s is the percentage of transactions in D that contain A ∪ B (i.e., both A and B). This is taken to be the probability P(A ∪ B).
The rule A ⇒ B has the confidence c in the transaction set D if c is the percentage of transactions in D containing A that also contain B.

Support and confidence
That is:
support, s: probability that a transaction contains A ∪ B
    s = P(A ∪ B)
confidence, c: conditional probability that a transaction having A also contains B
    c = P(B | A)
Rules that satisfy both a minimum support threshold (min_sup) and a minimum confidence threshold (min_conf) are called strong.

Frequent item set
A set of items is referred as an itemset. An itemset
that contains k items is a k-itemset. The occurrence
frequency of an itemset is the number of transactions
that contain the itemset.
An itemset satisfies minimum support if the occurrence frequency of the itemset is greater than or equal to the product of min_sup and the total number of transactions in D. The number of transactions required for the itemset to satisfy minimum support is referred to as the minimum support count.
If an itemset satisfies minimum support, then it is a
frequent itemset. The set of frequent k-itemsets is
commonly denoted by Lk.

Example 2.1

Transaction-ID Items_bought
-------------------------------------------
2000 A, B, C
1000 A, C
4000 A, D
5000 B, E, F

With minimum support 50% and minimum confidence 50%, we have
A ⇒ C (50%, 66.6%)
C ⇒ A (50%, 100%)
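The two rules can be verified directly from the four transactions of Example 2.1; a short Python sketch (helper names are illustrative):

transactions = [
    {"A", "B", "C"},   # TID 2000
    {"A", "C"},        # TID 1000
    {"A", "D"},        # TID 4000
    {"B", "E", "F"},   # TID 5000
]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """support(lhs ∪ rhs) / support(lhs)."""
    return support(set(lhs) | set(rhs)) / support(lhs)

print(support({"A", "C"}))          # 0.5   -> 50%
print(confidence({"A"}, {"C"}))     # 0.666 -> A ⇒ C holds with 66.6% confidence
print(confidence({"C"}, {"A"}))     # 1.0   -> C ⇒ A holds with 100% confidence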
How to mine association rules
from large databases?
Association rule mining is a two-step process:
1. Find all frequent itemsets (the sets of items that have
minimum support)
A subset of a frequent itemset must also be a frequent itemset (the Apriori principle),
i.e., if {A, B} is a frequent itemset, both {A} and {B} must be frequent itemsets.
Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets).
2.Generate strong association rules from the frequent
itemsets.

The overall performance of mining association rules is


determined by the first step.

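A compact Python sketch of the two-step process: a simplified, Apriori-style level-wise search for frequent itemsets followed by rule generation (written for readability rather than efficiency; the helper names are illustrative, and the sample transactions are those of Example 2.1):

from itertools import combinations

def frequent_itemsets(transactions, min_sup):
    """Level-wise search: candidate k-itemsets are built only from frequent (k-1)-itemsets."""
    n = len(transactions)
    def sup(s):
        return sum(s <= t for t in transactions) / n
    level = {s for s in (frozenset([i]) for t in transactions for i in t) if sup(s) >= min_sup}
    frequent = set(level)
    while level:
        candidates = {a | b for a in level for b in level if len(a | b) == len(a) + 1}
        level = {c for c in candidates if sup(c) >= min_sup}
        frequent |= level
    return frequent

def strong_rules(transactions, frequent, min_conf):
    """Step 2: generate strong rules lhs => rhs from each frequent itemset."""
    n = len(transactions)
    def sup(s):
        return sum(s <= t for t in transactions) / n
    out = []
    for itemset in frequent:
        for r in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(itemset, r)):
                conf = sup(itemset) / sup(lhs)
                if conf >= min_conf:
                    out.append((set(lhs), set(itemset - lhs), sup(itemset), conf))
    return out

transactions = [frozenset("ABC"), frozenset("AC"), frozenset("AD"), frozenset("BEF")]
freq = frequent_itemsets(transactions, min_sup=0.5)
for lhs, rhs, s, c in strong_rules(transactions, freq, min_conf=0.5):
    print(lhs, "=>", rhs, f"support={s:.0%} confidence={c:.0%}")   # prints A => C and C => A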
3. CLASSIFICATION
Classification is the process of learning a model
that describes different classes of data. The
classes are predetermined.
Example: In a banking application, customers who apply for a credit card may be classified as a good risk, a fair risk, or a poor risk. Hence, this type of activity is also called supervised learning.
Once the model is built, then it can be used to
classify new data.

The first step, of learning the model, is accomplished by using
a training set of data that has already been classified. Each
record in the training data contains an attribute, called the
class label, that indicates which class the record belongs to.
The model that is produced is usually in the form of a decision
tree or a set of rules.
Some of the important issues with regard to the model and
the algorithm that produces the model include:
the models ability to predict the correct class of the new
data,
the computational cost associated with the algorithm
the scalability of the algorithm.
Let us examine the approach where the model is in the form of a decision tree.
A decision tree is simply a graphical representation of the
description of each class or in other words, a representation of
the classification rules.

Example 3.1
Example 3.1: Suppose that we have a database of customers on the AllElectronics mailing list. The database
describes attributes of the customers, such as their name,
age, income, occupation, and credit rating. The customers
can be classified as to whether or not they have purchased
a computer at AllElectronics.
Suppose that new customers are added to the database and
that you would like to notify these customers of an
upcoming computer sale. To send out promotional literature to every new customer in the database can be quite costly.
A more cost-efficient method would be to target only those
new customers who are likely to purchase a new computer.
A classification model can be constructed and used for this
purpose.
The figure 2 shows a decision tree for the concept
buys_computer, indicating whether or not a customer at
AllElectronics is likely to purchase a computer.

Each internal node
represents a test on
an attribute. Each leaf
node represents a
class.

(Figure: a decision tree for the concept buys_computer, indicating whether or not a customer at AllElectronics is likely to purchase a computer.)

Decision Trees
For example, consider the widely referenced Iris data
classification problem introduced by Fisher (1936).
The purpose of the analysis is to learn how one can discriminate
between the three types of flowers, based on the four measures
of width and length of petals and sepals.
A classification tree will determine a set of logical if-then
conditions (instead of linear equations) for predicting or
classifying cases.
Advantages of tree
methods.
Simplicity of results.
In most cases, the interpretation of results summarized in a tree is very simple. This simplicity is useful not only for purposes of rapid classification of new observations.
Tree methods often yield a much simpler "model" for explaining why observations are classified or predicted in a particular manner.
e.g., when analyzing business problems, it is much easier to present a few simple if-then statements to management than some elaborate equations.

Tree methods are nonparametric and nonlinear.


The final results of using tree methods for classification or
regression can be summarized in a series of logical if-then
conditions .
Therefore, there is no implicit assumption that the underlying
relationships between the predictor variables and the
dependent variable are linear, follow some specific non-linear
link function , or that they are even monotonic in nature.
General Classification and Regression tree
The STATISTICA General Classification and
Regression Trees module (GC&RT) will build
classification and regression trees for predicting
continuous dependent variables (regression) and
categorical dependent variables (classification).
The program supports the classic C&RT algorithm
and includes various methods for pruning and
cross-validation, as well as the powerful v-fold
cross-validation methods.
Classification and Regression Trees (C&RT)
In most general terms, the purpose of the analyses
via tree-building algorithms is to determine a set of
if-then logical (split) conditions that permit accurate
prediction or classification of cases.
Classification Trees

The example data file Irisdat.sta reports the lengths and widths of
sepals and petals of three types of irises (Setosa, Versicol, and Virginic). The
purpose of the analysis is to learn how one can discriminate between the
three types of flowers, based on the four measures of width and length of
petals and sepals.
Discriminant function analysis will estimate several linear combinations of
predictor variables for computing classification scores (or probabilities) that
allow the user to determine the predicted classification for each
observation.
A classification tree will determine a set of logical if-then conditions
(instead of linear equations) for predicting or classifying cases.

Regression Trees.

The general approach to derive predictions from few simple if-then


conditions can be applied to regression problems as well. Example 1 is
based on the data file Poverty.sta, which contains 1960 and 1970 Census
figures for a random selection of 30 counties. The research question (for
that example) was to determine the correlates of poverty, that is, the
variables that best predict the percent of families below the poverty line in
a county.
Extracting Classification Rules
from Trees
Represent the knowledge in the form of IF-THEN rules
One rule is created for each path from the root to a leaf
Each attribute-value pair along a path forms a conjunction
The leaf node holds the class prediction
Rules are easier for humans to understand.
Example
IF age = <=30 AND student = no THEN buys_computer = no
IF age = <=30 AND student = yes THEN buys_computer = yes
IF age = 31...40 THEN buys_computer = yes
IF age = >40 AND credit_rating = excellent THEN buys_computer = no
IF age = >40 AND credit_rating = fair THEN buys_computer = yes

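The rule list above maps directly onto nested conditionals; a minimal Python sketch (attribute values written exactly as in the rules, everything else hypothetical):

def buys_computer(age, student, credit_rating):
    """IF-THEN rules extracted from the buys_computer decision tree."""
    if age == "<=30":
        return "yes" if student == "yes" else "no"
    if age == "31...40":
        return "yes"
    # age = ">40": the decision depends on the credit rating
    return "no" if credit_rating == "excellent" else "yes"

print(buys_computer("<=30", "yes", "fair"))      # yes
print(buys_computer(">40", "no", "excellent"))   # no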
Neural Networks and Classification
Neural network is a technique derived from AI that
uses generalized approximation and provides an
iterative method to carry it out. ANNs use the
curve-fitting approach to infer a function from a
set of samples.
This technique provides a learning approach; it
is driven by a test sample that is used for the
initial inference and learning. With this kind of
learning method, responses to new inputs may be
able to be interpolated from the known samples.
This interpolation depends on the model
developed by the learning method.

ANN and classification
ANNs can be classified into 2 categories: supervised
and unsupervised networks. Adaptive methods that
attempt to reduce the output error are supervised
learning methods, whereas those that develop
internal representations without sample outputs are
called unsupervised learning methods.
ANNs can learn from information on a specific
problem. They perform well on classification tasks
and are therefore useful in data mining.

Information processing at a neuron
in an ANN

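The figure is not reproduced here, but the computation it depicts, a weighted sum of the inputs passed through an activation function, can be sketched in a few lines (the sigmoid is chosen only for illustration):

import math

def neuron(inputs, weights, bias):
    """Weighted sum of inputs plus bias, squashed by a sigmoid activation."""
    net = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-net))

print(neuron([0.5, 0.2, 0.9], weights=[0.4, -0.6, 0.1], bias=0.05))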
Machine Learning
Algorithms.
STATISTICA Machine Learning provides a
number of advanced statistical methods for
handling regression and classification tasks
with multiple dependent and independent
variables.
These methods include
Support Vector Machines (SVM)
( for regression and classification).

Naive Bayes (for classification)

K-Nearest Neighbors (KNN)


( for regression and classification.)
Support Vector Machines
STATISTICA Support Vector Machine (SVM) is primarily a classifier method that performs classification tasks by constructing hyperplanes in a multidimensional space that separate cases of different class labels.
STATISTICA SVM supports both regression and classification
tasks and can handle multiple continuous and categorical
variables.

To construct an optimal hyperplane, SVM employs an iterative training algorithm, which is used to minimize an error function. According to the form of the error function, SVM models can be classified into four distinct groups:

Classification SVM Type 1 (also known as C-SVM classification).


Classification SVM Type 2 (also known as nu-SVM classification).
Regression SVM Type 1 (also known as epsilon-SVM regression).
Regression SVM Type 2 (also known as nu-SVM regression).
Naive-Bayes Classification
Bayesian Classifiers are Statistical
classifiers, which can predict class
membership probabilities, such as the
probability that a given sample belongs to
a particular class .
Bayesian Classification is based on Bayes' theorem.
Bayesian classifiers also have high accuracy and speed when applied to large data sets.
Bayes Theorem.
Let X be a data sample whose class label is unknown.
Let H be some hypothesis, such as that the data sample
X belongs to a specified class C. For classification
problem we want to determine P(H|X),the probability
that the hypothesis H holds given the observed data
sample X.
P(H|X) is called the posterior probability.
Suppose the world of data samples consists of fruits, described by their color and shape.
Suppose X is red and round and that H is the hypothesis that X is an apple.
Then P(H|X) reflects our confidence that X is an apple
given that we have seen X is red and round.
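In symbols, the posterior is obtained from Bayes' theorem (not written out on the slide, but it is the standard form): P(H|X) = P(X|H) P(H) / P(X), where P(X|H) is the likelihood of the observed sample under the hypothesis, P(H) is the prior probability of the hypothesis, and P(X) is the prior probability of the data sample.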
K-Nearest Neighbors .
STATISTICA K-Nearest Neighbors (KNN) is a memory-based model defined by a set of objects known as examples for which the outcomes are known (i.e., the examples are labeled).

The independent and dependent variables can be


either continuous or categorical. For continuous
dependent variables, the task is regression; otherwise it
is a classification. Thus, STATISTICA KNN can handle
both regression and classification tasks.

Given a new case of dependent values (query point), we


would like to estimate the outcome based on the KNN
examples. STATISTICA KNN achieves this by finding K
examples that are closest in distance to the query point,
hence, the name K-Nearest Neighbors. For regression
problems, KNN predictions are based on averaging the
outcomes of the K nearest neighbors; for classification
problems, majority voting is used.
Cross-Validation
K can be regarded as one of the most
important factors of the model that can
strongly influence the quality of
predictions.
There should be an optimal value for K
that achieves the right trade off between
the bias and the variance of the model.
STATISTICA KNN can provide an estimate
of K using an algorithm known as Cross-
validation .
Cross-Validation
Cross-validation is a well established technique that can be used to
obtain estimates of model parameters that are unknown. Here we
discuss the applicability of this technique to estimating K.
The general idea of this method is to divide the data sample into a
number of v folds (randomly drawn, disjointed sub-samples or
segments).
For a fixed value of K, we apply the KNN model to make predictions
on the vth segment (i.e., use the v-1 segments as the examples) and
evaluate the error.
The most common choice for this error for regression is the sum of squared errors, and for classification it is most conveniently defined as the accuracy (the percentage of correctly classified cases).
This process is then successively applied to all possible choices of v.
At the end of the v folds (cycles), the computed errors are averaged to
yield a measure of the stability of the model (how well the model
predicts query points).
The above steps are then repeated for various K and the value
achieving the lowest error (or the highest classification accuracy) is then
selected as the optimal value for K (optimal in a cross-validation sense).

Note that cross-validation is computationally expensive and you


should be prepared to let the algorithm run for some time especially
when the size of the examples sample is large.
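A bare-bones Python sketch of this procedure for a nearest-neighbour classifier, using Euclidean distance and classification accuracy as the error measure (all names and the toy data are illustrative; STATISTICA's own implementation is not shown here):

import random
from collections import Counter

def knn_predict(train, query, k):
    """Majority vote among the k training examples closest to `query`."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    nearest = sorted(train, key=lambda ex: dist(ex[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def cv_accuracy(examples, k, v=5, seed=0):
    """Average accuracy over v folds: each fold is held out once as the test segment."""
    examples = examples[:]
    random.Random(seed).shuffle(examples)
    folds = [examples[i::v] for i in range(v)]
    accs = []
    for i in range(v):
        test = folds[i]
        train = [ex for j, f in enumerate(folds) if j != i for ex in f]
        correct = sum(knn_predict(train, x, k) == y for x, y in test)
        accs.append(correct / len(test))
    return sum(accs) / len(accs)

# Toy 2-D data: class "a" near the origin, class "b" near (5, 5)
data = [((random.gauss(0, 1), random.gauss(0, 1)), "a") for _ in range(30)] + \
       [((random.gauss(5, 1), random.gauss(5, 1)), "b") for _ in range(30)]
best_k = max([1, 3, 5, 7, 9], key=lambda k: cv_accuracy(data, k))
print("optimal K (in the cross-validation sense):", best_k)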
Association Rule.
The goal of the Association rule is to detect
relationships or associations among a large set of
data items.

It is an important data mining model studied


extensively by the database and data mining
community.
Assume all data are categorical.
Initially used for Market Basket Analysis to find
how items purchased by customers are related.

The discovery of such association rules can help people to develop marketing strategies by gaining insight into which items are frequently purchased together by customers.
Transaction data: supermarket
data
Market basket transactions:
t1: {bread, cheese, milk}
t2: {apple, eggs, salt, yogurt}

tn: {biscuit, eggs, milk}
Concepts:
An item: an item/article in a basket
I: the set of all items sold in the store
A transaction: items purchased in a basket; it
may have TID (transaction ID)
A transactional dataset: A set of transactions
The model: rules
A transaction t contains X, a set of items (itemset) in I, if X ⊆ t.
An association rule is an implication of the form:
X → Y, where X, Y ⊂ I, and X ∩ Y = ∅

An itemset is a set of items.


E.g., X = {milk, bread, cereal} is an itemset.
A k-itemset is an itemset with k items.
E.g., {milk, bread, cereal} is a 3-itemset
Rule strength measures
Support: The rule holds with support sup in T (the transaction data set) if sup% of transactions contain X ∪ Y.
    sup = Pr(X ∪ Y) = count(X ∪ Y) / total count
Confidence: The rule holds in T with confidence conf if conf% of transactions that contain X also contain Y.
    conf = Pr(Y | X) = support(X ∪ Y) / support(X)
An association rule is a pattern that states
when X occurs, Y occurs with certain
probability.
An Example.
Transaction data:
t1: Beef, Chicken, Milk
t2: Beef, Cheese
t3: Cheese, Boots
t4: Beef, Chicken, Cheese
t5: Beef, Chicken, Clothes, Cheese, Milk
t6: Chicken, Clothes, Milk
t7: Chicken, Milk, Clothes
Assume: minsup = 30%, minconf = 80%
An example frequent itemset: {Chicken, Clothes, Milk} [sup = 3/7]
Association rules from the itemset:
Clothes → Milk, Chicken [sup = 3/7, conf = 3/3]
Clothes, Chicken → Milk [sup = 3/7, conf = 3/3]
Cluster Analysis.
The process of grouping the data into
classes or clusters so that objects within a
cluster have high similarity in comparison
to one another, but are very dissimilar to
objects in other clusters.
Clustering is an example of unsupervised learning, where the learning does not rely on predefined classes and class-labeled training examples.
For the above reason, clustering is a form of learning by observation, rather than learning by example.
Area of Application.
Market Research.
Clustering can help marketers discover
distinct groups in their customer bases and
characterize customer groups based on
purchasing patterns.
Biology.
Biologists can use clustering to discover distinct groups of species depending on some useful parameters.
k-Means clustering. The basic operation of this algorithm is relatively simple:
Given a fixed number of (desired or hypothesized) k clusters, assign observations
to those clusters so that the means across clusters (for all variables) are as
different from each other as possible.

Extensions and generalizations. The methods implemented in the Generalized


EM and k-Means Cluster Analysis module of STATISTICA extend this basic approach
to clustering in three important ways:

Instead of assigning cases or observations to clusters so as to maximize the


differences in means for continuous variables, the EM (expectation maximization)
clustering algorithm rather computes probabilities of cluster memberships based
on one or more probability distributions. The goal of the clustering algorithm is to
maximize the overall probability or likelihood of the data, given the (final) clusters.

Unlike the classic implementation of k-Means clustering in the Cluster Analysis


module, the k-Means and EM algorithms in the Generalized EM and k-Means Cluster
Analysis module then can be applied to both continuous and categorical variables.

A major shortcoming of k-Means clustering has been that you need to specify the
number of clusters before starting the analysis (i.e., the number of clusters must
be known a priori); the Generalized EM and k-Means Cluster Analysis module uses
a modified v-fold cross-validation scheme , to determine the best number of
clusters from the data. This extension makes the Generalized EM and k-Means
Cluster Analysis module an extremely useful data mining tool for unsupervised
learning and pattern recognition.
5. CLUSTERING
What is Cluster Analysis?
Cluster: a collection of data objects
Similar to one another within the same cluster
Dissimilar to the objects in other clusters.
Cluster analysis
Grouping a set of data objects into clusters.
Clustering is unsupervised learning: no
predefined classes, no class-labeled training
samples.
Typical applications
As a stand-alone tool to get insight into data
distribution
As a preprocessing step for other algorithms
General Applications of
Clustering
Pattern Recognition
Spatial Data Analysis
create thematic maps in GIS by clustering feature
spaces
detect spatial clusters and explain them in spatial data
mining
Image Processing
Economic Science (especially market research)
World Wide Web
Document classification
Cluster Weblog data to discover groups of similar
access patterns
Examples of Clustering
Applications
Marketing: Help marketers discover distinct
groups in their customer bases, and then use
this knowledge to develop targeted marketing
programs.
Land use: Identification of areas of similar land
use in an earth observation database.
Insurance: Identifying groups of motor insurance
policy holders with a high average claim cost.
City-planning: Identifying groups of houses
according to their house type, value, and
geographical location.
Earth-quake studies: Observed earth quake
epicenters should be clustered along continent
faults.
Partitioning Algorithms:
Basic Concept
Partitioning method: Construct a partition of a database
D of n objects into a set of k clusters
Given a k, find a partition of k clusters that optimizes the
chosen partitioning criterion.
- Global optimal: exhaustively enumerate all
partitions
- Heuristic methods: k-means and k-medoids
algorithms
k-means (MacQueen '67): Each cluster is represented by the center of the cluster
k-medoids or PAM (Partition Around Medoids) (Kaufman & Rousseeuw '87): Each cluster is represented by one of the objects in the cluster
The K-Means Clustering Method
Input: a database D of m records, r1, r2, ..., rm, and a desired number of clusters k.
Output: set of k clusters that minimizes the
square error criterion.
Given k, the k-means algorithm is implemented
in 4 steps:
Step 1: Randomly choose k records as the initial
cluster centers.
Step 2: Assign each record ri to the cluster such that the distance between ri and the cluster centroid (mean) is the smallest among the k clusters.
Step 3: recalculate the centroid (mean) of each
cluster based on the records assigned to the cluster.
Step 4: Go back to Step 2, stop when no more new
assignment.

The algorithm begins by randomly choosing k records to represent the centroids (means), m1, m2, ..., mk, of the clusters, C1, C2, ..., Ck. All the records are placed in a given
cluster based on the distance between the record and
the cluster mean. If the distance between mi and record
rj is the smallest among all cluster means, then record is
placed in cluster Ci.
Once all records have been placed in a cluster, the mean
for each cluster is recomputed.
Then the process repeats, by examining each record
again and placing it in the cluster whose mean is closest.
Several iterations may be needed, but the algorithm will
converge, although it may terminate at a local optimum.

Example 4.1: Consider the K-means clustering algorithm that
works with the (2-dimensional) records in Table 2. Assume
that the number of desired clusters k is 2.
RID Age Years of Service
--------------------------------------
1 30 5
2 50 25
3 50 15
4 25 5
5 30 10
6 30 25
Let the algorithm choose the record with RID 3 for cluster C1 and the record with RID 6 for cluster C2 as the initial cluster centroids.
The first iteration:
distance(r1, C1) = sqrt((50-30)^2 + (15-5)^2) ≈ 22.4; distance(r1, C2) = 32.0, so r1 ∈ C1.
distance(r2, C1) = 10.0 and distance(r2, C2) = 5.0, so r2 ∈ C2.
distance(r4, C1) = 25.5 and distance(r4, C2) = 36.6, so r4 ∈ C1.
distance(r5, C1) = 20.6 and distance(r5, C2) = 29.2, so r5 ∈ C1.
Now the new means (centroids) for the two clusters are computed.

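A direct Python sketch of the four steps, run on the Table 2 data from Example 4.1 with k = 2 (here the initial centroids are drawn at random rather than fixed to RID 3 and RID 6; all names are illustrative):

import math, random

# (Age, Years of Service) for RID 1..6, as in Table 2
records = [(30, 5), (50, 25), (50, 15), (25, 5), (30, 10), (30, 25)]

def kmeans(points, k, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)                       # Step 1: random initial centers
    while True:
        # Step 2: assign each record to the nearest centroid (Euclidean distance)
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[i].append(p)
        # Step 3: recompute each centroid as the mean of its cluster
        new_centroids = [tuple(sum(x) / len(c) for x in zip(*c)) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        # Step 4: stop when the assignments (hence the centroids) no longer change
        if new_centroids == centroids:
            return centroids, clusters
        centroids = new_centroids

centroids, clusters = kmeans(records, k=2)
print(centroids)
print(clusters)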
Clustering of a set of objects based on the k-means method.

Hierarchical Clustering
A hierarchical clustering method works by grouping data
objects into a tree of clusters.
In general, there are two types of hierarchical clustering
methods:
Agglomerative hierarchical clustering: This bottom-up
strategy starts by placing each object in its own cluster and
then merges these atomic clusters into larger and larger
clusters, until all of the objects are in a single cluster or until certain termination conditions are satisfied. Most hierarchical
clustering methods belong to this category. They differ only in
their definition of intercluster similarity.
Divisive hierarchical clustering: This top-down strategy does
the reverse of agglomerative hierarchical clustering by starting
with all objects in one cluster. It subdivides the cluster into
smaller and smaller pieces, until each object forms a cluster on its own or until it satisfies certain termination conditions, such as a desired number of clusters being obtained or the distance between the two closest clusters rising above a certain threshold distance.

Agglomerative and divisive hierarchical clustering on data objects {a, b, c,
d, e}

Hierarchical Clustering

In DIANA, all of the objects are used to form one initial


cluster. The cluster is split according to some
principle, such as the maximum Euclidean distance
between the closest neighboring objects in the cluster.
The cluster splitting process repeats until, eventually,
each new cluster contains only a single object.
In general, divisive methods are more computationally
expensive and tend to be less widely used than
agglomerative methods.
There are a variety of methods for defining the
intercluster distance D(Ck, Ch). However, local
pairwise distance measures (i.e., between pairs of
clusters) are especially suited to hierarchical methods.

7. POTENTIAL APPLICATIONS OF DM

Database analysis and decision support


Market analysis and management
target marketing, customer relation management,
market basket analysis, cross selling, market
segmentation
Risk analysis and management
Forecasting, customer retention, improved
underwriting, quality control, competitive analysis
Fraud detection and management
Other Applications
Text mining (news group, email, documents) and
Web analysis.
Intelligent query answering

Market Analysis and
Management
Where are the data sources for analysis?
Credit card transactions, discount coupons,
customer complaint calls, plus (public) lifestyle
studies
Target marketing
Find clusters of model customers who share the
same characteristics: interest, income level,
spending habits, etc.
Determine customer purchasing patterns
over time
Conversion of single to a joint bank account:
marriage, etc.
Cross-market analysis
Associations/co-relations between product sales
Prediction based on the association information
Customer profiling
data mining can tell you what types of customers
buy what products (clustering or classification)
Identifying customer requirements
identifying the best products for different customers
use prediction to find what factors will attract new
customers
Provides summary information
various multidimensional summary reports
statistical summary information (data central
tendency and variation)

Fraud Detection and
Management
Applications
widely used in health care, retail, credit card services,
telecommunications (phone card fraud), etc.
Approach
use historical data to build models of fraudulent
behavior and use data mining to help identify similar
instances
Examples
auto insurance: detect a group of people who stage
accidents to collect on insurance
money laundering: detect suspicious money
transactions (US Treasury's Financial Crimes
Enforcement Network)
medical insurance: detect professional patients and ring
of doctors and ring of references

Some representative data
mining tools
Oracle (Oracle Data Mining) classification, prediction,
regression, clustering, association, feature selection, feature
extraction, anomaly selection.
Weka system (http://www.cs.waikato.ac.nz/ml/weka), University of Waikato, New Zealand. The system is written in Java. The platforms: Linux, Windows, Macintosh.

Acknosoft (Kate) Decision trees, case-based reasoning


DBMiner Technology (DBMiner) OLAP analysis, Associations,
classification, clustering.
IBM (Intelligent Miner) Classification, Association rules,
predictive models.
NCR (Management Discovery Tool) Association rules
SAS (Enterprise Miner) Decision trees, Association rules, neural
networks, Regression, clustering
Silicon Graphics (MineSet) Decision trees, Association rules

