Models and OLAP Operations
Decision Support
Information technology to help the
knowledge worker (executive, manager,
analyst) make faster & better decisions
What were the sales volumes by region and product
category for the last year?
How did the share price of computer manufacturers
correlate with quarterly profits over the past 10 years?
Which orders should we fill to maximize revenues?
On-line analytical processing (OLAP) is an
element of decision support systems
(DSS)
CS 336 2
Three-Tier Decision Support Systems
Warehouse database server
Almost always a relational DBMS, rarely flat files
OLAP servers
Relational OLAP (ROLAP): extended relational
DBMS that maps operations on multidimensional
data to standard relational operators
Multidimensional OLAP (MOLAP): special-purpose
server that directly implements multidimensional
data and operations
Clients
Query and reporting tools
Analysis tools
Data mining tools
The Complete Decision Support System
[Architecture diagram: Operational DBs and other sources feed the data warehouse via extract, transform, load, and refresh; the warehouse and Data Marts serve OLAP servers (e.g., ROLAP), which in turn serve Query/Reporting, Analysis, and Data Mining clients.]
Approaches to OLAP Servers
Relational DBMS as Warehouse Servers
Two possibilities for OLAP servers
(1) Relational OLAP (ROLAP)
Relational and specialized relational DBMS to store
and manage warehouse data
OLAP middleware to support missing pieces
(2) Multidimensional OLAP (MOLAP)
Array-based storage structures
Direct access to array data structures
The greatest advantage of MOLAP systems in comparison with ROLAP is that multidimensional operations can be performed in an easy, natural way, without any need for complex join operations. For this reason, MOLAP system performance is excellent.
Warehouse Models & Operators
Data Models
relations
stars & snowflakes
cubes
Operators
slice & dice
roll-up, drill down
pivoting
other
Multi-Dimensional Data
Measures - numerical data being tracked
Dimensions - business parameters that define a
transaction
Example: Analyst may want to view sales data
(measure) by geography, by time, and by
product (dimensions)
Dimensional modeling is a technique for
structuring data around the business concepts
ER models describe entities and
relationships
Dimensional models describe measures and
dimensions
The Multi-Dimensional Model
Sales by product line over the past six months
Sales by store between 1990 and 1995
...
Dimensional Modeling
ROLAP: Dimensional Modeling Using Relational DBMS
Special schema design: star, snowflake
Special indexes: bitmap, multi-table join
Special tuning: maximize query
throughput
Proven technology (relational model, DBMS)
Products
IBM DB2, Oracle, Sybase IQ, RedBrick,
Informix
MOLAP: Dimensional Modeling Using the Multi-Dimensional Model
MDDB (multi-dimensional database data model): a special-purpose data model
Facts stored in multi-dimensional arrays
Dimensions used to index the array
Sometimes implemented on top of a relational DB
Products
Pilot, Arbor Essbase, Gentia
Star Schema (in RDBMS)
Star Schema Example
Star Schema with Sample Data
Star Schema
A single fact table, with detail and summary data. The fact table primary key has only one key column per dimension; each key is generated. Each dimension is a single table, highly denormalized.
[Diagram: Fact Table (STORE KEY, PRODUCT KEY, PERIOD KEY, Dollars, Units, Price) joined to three dimension tables:
Store Dimension (STORE KEY, Store Description, City, State, District ID, District Desc., Region_ID, Region Desc., Regional Mgr., Level);
Time Dimension (PERIOD KEY, Period Desc, Year, Quarter, Month, Day, Current Flag, Resolution, Sequence);
Product Dimension (PRODUCT KEY, Product Desc., Brand, Color, Size, Manufacturer, Level).]
Benefits: Easy to understand, easy to define hierarchies, reduces # of physical joins, low maintenance, very simple metadata
Drawbacks: Summary data in the fact table yields poorer performance for summary levels; huge dimension tables are a problem
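As a runnable sketch of the star layout above, the following uses SQLite with pared-down dimension and fact tables; the table and column names loosely follow the diagram, and the sample rows are invented:

```python
import sqlite3

# A minimal star schema: one fact table keyed by generated dimension keys.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE store_dim   (store_key INTEGER PRIMARY KEY, city TEXT, region TEXT);
CREATE TABLE time_dim    (period_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, brand TEXT);
CREATE TABLE fact        (store_key INTEGER, product_key INTEGER, period_key INTEGER,
                          dollars REAL, units INTEGER);
""")
cur.executemany("INSERT INTO store_dim VALUES (?,?,?)",
                [(1, "NYC", "East"), (2, "LA", "West")])
cur.executemany("INSERT INTO time_dim VALUES (?,?,?)", [(1, 1995, 1), (2, 1995, 2)])
cur.executemany("INSERT INTO product_dim VALUES (?,?)", [(1, "Acme")])
cur.executemany("INSERT INTO fact VALUES (?,?,?,?,?)",
                [(1, 1, 1, 100.0, 10), (2, 1, 1, 50.0, 5), (1, 1, 2, 70.0, 7)])

# A typical star query: constrain on dimension attributes, aggregate the fact.
cur.execute("""
SELECT s.region, SUM(f.dollars)
FROM fact f JOIN store_dim s ON f.store_key = s.store_key
GROUP BY s.region ORDER BY s.region
""")
by_region = cur.fetchall()
print(by_region)   # [('East', 170.0), ('West', 50.0)]
```

Note the single physical join per dimension: the fact table joins directly to each (denormalized) dimension table.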
Star Schema
The biggest drawback: dimension tables must carry a level indicator for every record, and every query must use it. In the example below, without the level constraint, keys for all stores in the NORTH region, including aggregates for region and district, would be pulled from the fact table, resulting in error. Level is needed whenever aggregates are stored with detail facts.
[Diagram: same star schema as above, with Level columns in the Store and Product dimensions.]
Example:
Select A.STORE_KEY, A.PERIOD_KEY, A.dollars
from Fact_Table A
where A.STORE_KEY in (select STORE_KEY
                      from Store_Dimension B
                      where region = 'North' and Level = 2)
and etc...
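The double-counting pitfall can be demonstrated with a toy schema. The level coding below is hypothetical (1 = store detail, 2 = region aggregate), as are the table contents:

```python
import sqlite3

# Sketch of the level-indicator pitfall: the Store dimension holds both detail
# stores and a region-level aggregate row, all tagged 'North'.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE store_dimension (store_key INTEGER, region TEXT, level INTEGER);
CREATE TABLE fact_table      (store_key INTEGER, dollars REAL);
""")
cur.executemany("INSERT INTO store_dimension VALUES (?,?,?)",
                [(1, "North", 1), (2, "North", 1), (10, "North", 2)])
# The fact row for store_key 10 is the stored aggregate of stores 1 and 2.
cur.executemany("INSERT INTO fact_table VALUES (?,?)",
                [(1, 100.0), (2, 200.0), (10, 300.0)])

def north_total(use_level_filter):
    sql = """SELECT SUM(dollars) FROM fact_table
             WHERE store_key IN (SELECT store_key FROM store_dimension
                                 WHERE region = 'North'{})"""
    cur.execute(sql.format(" AND level = 1" if use_level_filter else ""))
    return cur.fetchone()[0]

print(north_total(False))  # 600.0 -- detail plus aggregate: double counted
print(north_total(True))   # 300.0 -- correct: detail rows only
```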
The Snowflake Schema
[Diagram: the Store dimension is normalized into Store (STORE KEY, Store Description, City, State, District ID, ...), District (District_ID, District Desc., Region_ID), and Region (Region_ID, Region Desc., Regional Mgr.) tables. Facts are kept at each level in separate fact tables: Store Fact Table (STORE KEY, PRODUCT_KEY, PERIOD_KEY, Dollars, Units, Price), District Fact Table (District_ID, PRODUCT_KEY, PERIOD_KEY, Dollars, Units, Price), and Region Fact Table (Region_ID, PRODUCT_KEY, PERIOD_KEY, Dollars, Units, Price).]
Advantages of ROLAP Dimensional Modeling
Aggregates
Add up amounts for day 1
In SQL: SELECT sum(amt) FROM SALE
WHERE date = 1
Aggregates
Add up amounts by day
In SQL: SELECT date, sum(amt) FROM SALE
GROUP BY date
Another Example
Add up amounts by day, product
In SQL: SELECT date, prodId, sum(amt) FROM SALE
GROUP BY date, prodId

sale: prodId storeId date amt      result: prodId date amt
      p1     s1      1    12               p1     1    62
      p2     s1      1    11               p2     1    19
      p1     s3      1    50               p1     2    48
      p2     s2      1     8
      p1     s1      2    44
      p1     s2      2     4

(rollup: detail → summary; drill-down: summary → detail)
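Both aggregate queries (and the day-1 sum from the earlier slide) can be run directly against the SALE data in SQLite; a minimal sketch:

```python
import sqlite3

# The slide's SALE table, loaded into SQLite so the aggregate queries can run.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE sale (prodId TEXT, storeId TEXT, date INTEGER, amt INTEGER)")
rows = [("p1", "s1", 1, 12), ("p2", "s1", 1, 11), ("p1", "s3", 1, 50),
        ("p2", "s2", 1, 8),  ("p1", "s1", 2, 44), ("p1", "s2", 2, 4)]
cur.executemany("INSERT INTO sale VALUES (?,?,?,?)", rows)

# Add up amounts for day 1.
day1_total = cur.execute("SELECT SUM(amt) FROM sale WHERE date = 1").fetchone()[0]
print(day1_total)  # 81

# Add up amounts by day and product (the roll-up shown above).
by_day_prod = cur.execute("""SELECT prodId, date, SUM(amt) FROM sale
                             GROUP BY prodId, date ORDER BY prodId, date""").fetchall()
print(by_day_prod)  # [('p1', 1, 62), ('p1', 2, 48), ('p2', 1, 19)]
```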
Aggregates
Operators: sum, count, max, min, median, avg
Having clause
Using dimension hierarchy
average by region (within store)
maximum by month (within date)
ROLAP vs. MOLAP
ROLAP:
Relational On-Line Analytical Processing
MOLAP:
Multi-Dimensional On-Line Analytical
Processing
The MOLAP Cube
dimensions = 2
3-D Cube
Fact table view: Multi-dimensional cube:
dimensions = 3
Example
[3-D cube: axes Store (NY, SF, LA), Product (Juice, Milk, Coke, Cream, Soap, Bread), Time (M T W Th F S S); sample cell values 10, 34, 56, 32, 12, 56; e.g., 56 units of bread sold in LA on Monday. Roll-ups: store → region, product → brand, day → week.]
Dimensions: Time, Product, Store
Attributes: Product (upc, price, ...)
Hierarchies: Product → Brand; Day → Week → Quarter; Store → Region → Country
56 units of bread sold in LA on Monday
Cube Aggregation: Roll-up
Example: computing sums

Base data (amt by product, store, day):
        s1  s2  s3
day 1
  p1    12   -  50
  p2    11   8   -
day 2
  p1    44   4   -
  p2     -   -   -

Roll up over day (by product, store):
  p1    56   4  50
  p2    11   8   -

Roll up over product (by store): 67  12  50
Roll up over store (by product): p1 110, p2 19
Roll up over everything (grand total): 129

(rollup: toward coarser summaries; drill-down: back toward detail)
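A minimal sketch of roll-up over this cube, using a sparse Python dict for the base data (empty cells simply absent):

```python
# Base cube keyed by (product, store, day); missing cells are absent.
cube = {("p1", "s1", 1): 12, ("p1", "s3", 1): 50,
        ("p2", "s1", 1): 11, ("p2", "s2", 1): 8,
        ("p1", "s1", 2): 44, ("p1", "s2", 2): 4}

def roll_up(cube, keep):
    """Sum out every dimension not listed in `keep` (0=product, 1=store, 2=day)."""
    out = {}
    for coords, amt in cube.items():
        key = tuple(coords[i] for i in keep)
        out[key] = out.get(key, 0) + amt
    return out

by_prod_store = roll_up(cube, (0, 1))   # sum over days
by_store = roll_up(cube, (1,))          # sum over products and days
by_prod = roll_up(cube, (0,))           # sum over stores and days
grand_total = roll_up(cube, ())[()]     # sum over everything

print(by_prod_store[("p1", "s1")])  # 56
print(by_store[("s1",)])            # 67
print(by_prod[("p1",)])             # 110
print(grand_total)                  # 129
```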
Cube Operators for Roll-up
Using the same base data, roll-ups can be written as point queries with the * ("all") operator:
sale(s1,*,*) = 67 (total over all products and days in store s1)
sale(s2,p2,*) = 8 (total for product p2 in store s2 over all days)
sale(*,*,*) = 129 (grand total)
Extended Cube
The * value is added to each dimension, so the cube stores aggregates alongside detail:

day 1         s1  s2  s3    *
  p1          12   -  50   62
  p2          11   8   -   19
  *           23   8  50   81

day 2         s1  s2  s3    *
  p1          44   4   -   48
  p2           -   -   -    -
  *           44   4   -   48

* (all days)  s1  s2  s3    *
  p1          56   4  50  110
  p2          11   8   -   19
  *           67  12  50  129

e.g., sale(*,p2,*) = 19
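A sketch of how the extended cube can be materialized: each base cell contributes to every combination of its own coordinates and *. Keys are ordered (store, product, day) to match the sale(s, p, d) notation:

```python
from itertools import product as cartesian

# Sparse base cube keyed by (store, product, day).
base = {("s1", "p1", 1): 12, ("s3", "p1", 1): 50,
        ("s1", "p2", 1): 11, ("s2", "p2", 1): 8,
        ("s1", "p1", 2): 44, ("s2", "p1", 2): 4}

extended = {}
for (s, p, d), amt in base.items():
    # Each base cell contributes to 2^3 extended cells: itself plus all its
    # '*' roll-ups along any subset of the three dimensions.
    for key in cartesian((s, "*"), (p, "*"), (d, "*")):
        extended[key] = extended.get(key, 0) + amt

print(extended[("s1", "*", "*")])   # 67   sale(s1,*,*)
print(extended[("s2", "p2", "*")])  # 8    sale(s2,p2,*)
print(extended[("*", "p2", "*")])   # 19   sale(*,p2,*)
print(extended[("*", "*", "*")])    # 129  sale(*,*,*)
```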
Aggregation Using Hierarchies
Using the store → region → country hierarchy (store s1 in region A; stores s2, s3 in region B), rolling the base data up from store to region gives:

        region A  region B
  p1        56        54
  p2        11         8
Slicing
        s1  s2  s3
day 1
  p1    12   -  50
  p2    11   8   -
day 2
  p1    44   4   -
  p2     -   -   -

Slice on TIME = day 1:
        s1  s2  s3
  p1    12   -  50
  p2    11   8   -
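Slicing is just selection on one dimension; a sketch over the same sparse cube:

```python
# Sparse cube keyed by (store, product, day).
cube = {("s1", "p1", 1): 12, ("s3", "p1", 1): 50,
        ("s1", "p2", 1): 11, ("s2", "p2", 1): 8,
        ("s1", "p1", 2): 44, ("s2", "p1", 2): 4}

def slice_time(cube, day):
    """Fix TIME = day and keep the remaining 2-D (store, product) subcube."""
    return {(s, p): amt for (s, p, d), amt in cube.items() if d == day}

day1 = slice_time(cube, 1)
print(day1)
```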
Slicing & Pivoting
Sales ($ millions) by store, product, and time (columns d1, d2; only the d1 values are shown):

Store s1  Electronics  $5.2
          Toys         $1.9
          Clothing     $2.3
          Cosmetics    $1.1
Store s2  Electronics  $8.9
          Toys         $0.75
          Clothing     $4.6
          Cosmetics    $1.5

After slicing on time = d1 and pivoting stores into columns:

Sales ($ millions), d1:
Products     Store s1  Store s2
Electronics  $5.2      $8.9
Toys         $1.9      $0.75
Clothing     $2.3      $4.6
Cosmetics    $1.1      $1.5
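The pivot itself is a re-keying of the d1 slice; a sketch (figures taken from the table above):

```python
# The d1 slice as (store, product) -> dollars ($ millions).
d1 = {("s1", "Electronics"): 5.2, ("s1", "Toys"): 1.9,
      ("s1", "Clothing"): 2.3, ("s1", "Cosmetics"): 1.1,
      ("s2", "Electronics"): 8.9, ("s2", "Toys"): 0.75,
      ("s2", "Clothing"): 4.6, ("s2", "Cosmetics"): 1.5}

def pivot(data):
    """Return {product: {store: dollars}}: rows by product, columns by store."""
    table = {}
    for (store, prod), dollars in data.items():
        table.setdefault(prod, {})[store] = dollars
    return table

rows = pivot(d1)
print(rows["Electronics"])  # {'s1': 5.2, 's2': 8.9}
```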
Summary of Operations
Aggregation (roll-up)
aggregate (summarize) data to the next higher dimension element
e.g., total sales by city, year → total sales by region, year
Navigation to detailed data (drill-down)
Selection (slice) defines a subcube
e.g., sales where city = "Gainesville" and date = 1/15/90
Calculation and ranking
e.g., top 3% of cities by average income
Visualization operations (e.g., Pivot)
Time functions
e.g., time average
Query & Analysis Tools
Query Building
Report Writers (comparisons, growth, graphs, ...)
Spreadsheet Systems
Web Interfaces
Data Mining
Fact Tables
Contain two or more foreign keys
Tend to have huge numbers of records
Useful facts tend to be numeric and additive
Dimension Tables
Contain text and descriptive information
Are the 1 side in a 1-M relationship with the fact table
Generally the source of interesting constraints
Typically contain the attributes for the SQL answer set
Outline
Overview of Data Mining
What is Data Mining?
Steps in Data Mining
Overview of Data Mining techniques
Points to Remember
DATA MINING
1. OVERVIEW OF DATA MINING
Example
Consider a transaction database maintained by a specially
consumer goods retails. Suppose the client data includes a
customer name, zip code, phone number, date of
purchase, item code, price, quantity, and total amount.
A variety of new knowledge can be discovered by KDD
processing on this client database.
During data selection, data about specific items or
categories of items, or from stores in a specific region or
area of the country, may be selected.
The data cleansing process then may correct invalid zip
codes or eliminate records with incorrect phone prefixes.
Enrichment enhances the data with additional sources of information. For example, given the client names and phone numbers, the store may purchase other data about age, income, and credit rating and append it to each record.
Data transformation and encoding may be done to reduce
the amount of data.
Example (cont.)
The result of mining may be to discover
the following type of new information:
Association rules, e.g., whenever a customer buys video equipment, he or she also buys another electronic gadget.
Sequential patterns, e.g., suppose a customer buys a camera, and within three months he or she buys photographic supplies; then within six months he or she is likely to buy an accessory item. This defines a sequential pattern of transactions. A customer who buys more than twice in the regular periods may be likely to buy at least once during the Christmas period.
Classification trees, e.g., customers may be classified by frequency of visits, by types of financing used, by amount of purchase, or by affinity for types of items, and some revealing statistics may be generated for such classes.
We can see that many possibilities exist for discovering
new knowledge about buying patterns, relating factors
such as age, income group, place of residence, to what
and how much the customers purchase.
This information can then be utilized
to plan additional store locations based on demographics,
to run store promotions,
to combine items in advertisements, or to plan seasonal
marketing strategies.
As this retail store example shows, data mining must be
preceded by significant data preparation before it can
yield useful information that can directly influence
business decisions.
The results of data mining may be reported in a variety of
formats, such as listings, graphic outputs, summary
tables, or visualization.
Goals of Data Mining and Knowledge Discovery
Data mining is carried out with some end goals.
These goals fall into the following classes:
Prediction: Data mining can show how certain attributes within the data will behave in the future.
Identification: Data patterns can be used to identify the existence of an item, an event, or an activity.
Classification: Data mining can partition the data so that different classes or categories can be identified based on combinations of parameters.
Optimization: One eventual goal of data mining may be to optimize the use of limited resources such as time, space, money, or materials and to maximize output variables such as sales or profits under a given set of constraints.
Data Mining: On What Kind of Data?
Relational databases
Data warehouses
Transactional databases
Advanced DB and information repositories
Object-oriented and object-relational databases
Spatial databases
Time-series data and temporal data
Text databases and multimedia databases
Heterogeneous and legacy databases
World Wide Web
Types of Knowledge Discovered During Data Mining
Data mining addresses inductive knowledge, which
discovers new rules and patterns from the supplied
data.
Knowledge can be represented in many forms: In
an unstructured sense, it can be represented by
rules. In a structured form, it may be represented
in decision trees, semantic networks, or hierarchies
of classes or frames.
It is common to describe the knowledge discovered
during data mining in five ways:
Association rules: These rules correlate the presence of a set of items with a range of values for another set of variables.
Types of Knowledge Discovered (cont.)
Classification hierarchies: The goal is to work from an existing set of events or transactions to create a hierarchy of classes.
Patterns within time series
Sequential patterns: A sequence of actions or events is sought. Detection of sequential patterns is equivalent to detecting associations among events with a certain temporal relationship.
Clustering: A given population of events can be partitioned into sets of similar elements.
Main phases of the KDD process
Learning the application domain:
relevant prior knowledge and goals of application
Creating a target data set: data selection
Data cleaning and preprocessing: (may take 60% of effort!)
Data reduction and transformation:
Find useful features, dimensionality/variable reduction,
invariant representation.
Choosing functions of data mining
summarization, classification, regression, association,
clustering.
Choosing the mining algorithm(s)
Data mining: search for patterns of interest
Pattern evaluation and knowledge presentation
visualization, transformation, removing redundant patterns, etc.
Use of discovered knowledge
Main phases of data mining
[Figure: Data Sources → Data Cleaning → Data Integration → Task-relevant Data → Data Mining → Pattern Evaluation/Presentation.]
What is Data Mining?
Data mining is an analytic process designed to explore large
amounts of data in search of consistent patterns and/or
systematic relationships between variables, and then to validate
the findings by applying the detected patterns to new subsets of
data.
"Data mining is a process of torturing the data until they confess."
The typical goals of data mining projects are:
Identification of groups, clusters, strata, or dimensions
in data that display no obvious structure,
The identification of factors that are related to a particular
outcome of interest (root-cause analysis)
Accurate prediction of outcome variable(s) of interest (in
the future, or in new customers, clients, applicants, etc.; this
application is usually referred to as predictive data mining)
What is Data Mining?
Data mining is used to
Detect fraudulent patterns in credit card
transactions, insurance claims, etc.
Detect default patterns
Model customer buying patterns and behavior for cross-selling, up-selling, and customer acquisition
Optimize engine performance and several other
complex manufacturing processes
Data mining can be utilized in any organization
that needs to find patterns or relationships in
their data.
DM: Overview
CS490D 57
DM: Phases
Business Understanding
Understanding project objectives and requirements
Data mining problem definition
Data Understanding
Initial data collection and familiarization
Identify data quality issues
Initial, obvious results
Data Preparation
Record and attribute selection
Data cleansing
Modeling
Run the data analysis and data mining tools
Evaluation
Determine if results meet business objectives
Identify business issues that should have been addressed earlier
Deployment
Put the resulting models into practice
Set up for repeated/continuous mining of the data
Phases in the DM Process
(1)
Business
Understanding:
Statement of
Business Objective
Statement of Data
Mining objective
Statement of
Success Criteria
Phases in the DM Process
(2)
Data
Understanding
Collect data
Describe data
Explore the data
Verify the quality
and identify
outliers
Phases in the DM Process (3)
Data preparation:
Can take over 90% of the time
Consolidation and Cleaning
table links, aggregation
level, missing values,
etc
Data selection
Remove noisy data,
repetitions, etc
Remove outliers?
Select samples
visualization tools
Transformations - create
new variables, formats
Phases in the DM Process (3)
Data preparation:
May take up to 90% of the time
Select Data
Rationale for inclusion / exclusion: if it isn't really from your domain, remove it
Clean Data
Remove repetitions
Remove headers, footers, tables, pictures, etc. (BootCat does this automatically)
Transform Data
Convert to plain text (ditto)
Reduce to a word-frequency list; keyword frequencies can be features in machine learning
Phases in the DM Process (4)
Model building
Selection of the
modeling
techniques is based
upon the data
mining objective
Modeling can be an
iterative process;
may model for
either description or
prediction
Phases in the DM Process (5)
Model Evaluation
Evaluation of model:
how well it performed,
how well it met business
needs
Methods and criteria
depend on model type:
e.g., confusion matrix
with classification
models, mean error rate
with regression models
Interpretation of model:
important or not, easy or
hard depends on
algorithm
Phases in the DM Process (6)
Deployment
Determine how the
results need to be utilized
Who needs to use them?
How often do they need
to be used
Deploy Data Mining
results by:
Utilizing results as
business rules
Publishing report for
users, with
recommendations to
improve their business
Why DM?: Concept Description
Descriptive vs. predictive data mining
Descriptive mining: describes concepts or task-
relevant data sets in concise, summarative,
informative, discriminative forms
Predictive mining: Based on data and analysis,
constructs models from the data-set, and predicts
the trend and properties of unknown data
Concept description:
Characterization: provides a concise and succinct
summarization of the given collection of data
Comparison: provides descriptions comparing two
or more collections of data
DM vs. OLAP
Data Mining:
can handle complex data types of the
attributes and their aggregations
a more automated process
Online Analytic Processing (visualization):
restricted to a small number of dimensions and measure types
user-controlled process
DM: Summary
Business Understanding
Understanding project objectives and requirements
Data mining problem definition
Data Understanding
Initial data collection and familiarization
Identify data quality issues
Initial, obvious results
Data Preparation
Record and attribute selection
Data cleansing
Modeling
Run the data mining tools
Evaluation
Determine if results meet business objectives
Identify business issues that should have been addressed earlier
Deployment
Put the resulting models into practice
Set up for repeated/continuous mining of the data
Steps in Data Mining
Stage 1: Precise statement of the
problem.
Stage 2: Initial exploration.
Stage 3: Model building and validation.
Stage 4: Deployment.
Steps in Data Mining
Stage 1: Precise statement of the problem.
This stage usually starts with data preparation that may involve the
cleaning of the data (e.g., identification and removal of incorrectly
coded data, etc.), data transformations, selecting subsets of
records, and, in the case of data sets with large numbers of
variables (fields), performing preliminary feature selection. Data
description and visualization are key components of this stage (e.g.
descriptive statistics, correlations, scatterplots, box plots, etc.).
Steps in Data Mining
Stage 3: Model building and validation.
Stage 4: Deployment.
Data transformations,
Data may be skewed (that is, outliers in one direction or another
may be present). Log transformation, Box-Cox transformation, etc.
Data reduction, Selecting subsets of records, and, in the case of data sets
with large numbers of variables (fields), performing preliminary feature
selection.
Data description and visualization are key components of this stage (e.g.
descriptive statistics, correlations, scatterplots, box plots, brushing tools,
etc.)
Data description allows you to get a snapshot of the important
characteristics of the data (e.g. central tendency and dispersion).
Model building and validation.
A model is typically rated according to 2 aspects:
Accuracy
Understandability
These aspects often conflict with one another.
Decision trees and linear regression models are less
complicated and simpler than models such as neural
networks, boosted trees, etc. and thus easier to
understand, however, you might be giving up some
predictive accuracy.
Remember not to confuse the data mining model
with reality (a road map is not a perfect
representation of the road) but it can be used as a
useful guide.
Model building and validation.
Validation of the model requires that you
train the model on one set of data and
evaluate on another independent set of
data.
There are two main methods of validation
Split data into train/test datasets (75-25 split)
If you do not have enough data to have a
holdout sample, then use v-fold cross
validation.
Model building and validation.
Model Validation Measures
Possible validation measures
Classification accuracy
Total cost/benefit when different errors involve
different costs
Lift and Gains curves
Error in Numeric predictions
Error rate
Proportion of errors made over the whole set of
instances
Training set error rate is way too optimistic!
You can find patterns even in random data
Deployment.
A model is built once, but can be used over and
over again.
Examples.
buys(x, "diapers") ⇒ buys(x, "beers") [0.5%, 60%]
major(x, "CS") ∧ takes(x, "DB") ⇒ grade(x, "A") [1%, 75%]
Support and confidence
That is:
support, s: probability that a transaction contains {A, B}
s = P(A ∪ B)
confidence, c: conditional probability that a transaction having A also contains B
c = P(B|A)
Rules that satisfy both a minimum support threshold (min_sup) and a minimum confidence threshold (min_conf) are called strong.
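A sketch computing support and confidence directly from these definitions, over the four transactions used in Example 2.1 below:

```python
# Toy transaction database (the four transactions of Example 2.1).
transactions = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(lhs, rhs):
    """P(rhs | lhs): support of the combined itemset over support of the LHS."""
    return support(lhs | rhs) / support(lhs)

# Rule A => C: {A, C} appears in 2 of 4 transactions; A appears in 3 of 4.
print(support({"A", "C"}))       # 0.5
print(confidence({"A"}, {"C"}))  # 0.666...
```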
Frequent item set
A set of items is referred to as an itemset. An itemset that contains k items is a k-itemset. The occurrence frequency of an itemset is the number of transactions that contain the itemset.
An itemset satisfies minimum support if the occurrence frequency of the itemset is greater than or equal to the product of min_sup and the total number of transactions in D. The number of transactions required for the itemset to satisfy minimum support is referred to as the minimum support count.
If an itemset satisfies minimum support, then it is a
frequent itemset. The set of frequent k-itemsets is
commonly denoted by Lk.
Example 2.1
Transaction-ID Items_bought
-------------------------------------------
2000 A, B, C
1000 A, C
4000 A, D
5000 B, E, F
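A brute-force sketch that enumerates the frequent itemsets of this transaction database, assuming min_sup = 50% (minimum support count 2 of the 4 transactions):

```python
from itertools import combinations

# Example 2.1's transactions (transaction IDs dropped; only items matter here).
transactions = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
min_count = 2  # minimum support count for min_sup = 50%
items = sorted(set().union(*transactions))

# Count every candidate itemset and keep those meeting the support count.
frequent = {}
for k in range(1, len(items) + 1):
    for cand in combinations(items, k):
        count = sum(1 for t in transactions if set(cand) <= t)
        if count >= min_count:
            frequent[frozenset(cand)] = count

print(sorted((sorted(s), c) for s, c in frequent.items()))
# [(['A'], 3), (['A', 'C'], 2), (['B'], 2), (['C'], 2)]
```

Real miners (e.g., Apriori) prune this search using the fact that every subset of a frequent itemset must itself be frequent.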
3. CLASSIFICATION
Classification is the process of learning a model
that describes different classes of data. The
classes are predetermined.
Example: In a banking application, customers who apply for a credit card may be classified as a good risk, a fair risk, or a poor risk. Hence, this type of activity is also called supervised learning.
Once the model is built, then it can be used to
classify new data.
The first step, of learning the model, is accomplished by using
a training set of data that has already been classified. Each
record in the training data contains an attribute, called the
class label, that indicates which class the record belongs to.
The model that is produced is usually in the form of a decision
tree or a set of rules.
Some of the important issues with regard to the model and the algorithm that produces the model include:
the model's ability to predict the correct class of new data,
the computational cost associated with the algorithm, and
the scalability of the algorithm.
Let's examine the approach where the model is in the form of a decision tree.
A decision tree is simply a graphical representation of the description of each class, or in other words, a representation of the classification rules.
A decision tree is simply a graphical representation of the
description of each class or in other words, a representation of
the classification rules.
Example 3.1
Example 3.1: Suppose that we have a database of customers on the AllElectronics mailing list. The database describes attributes of the customers, such as their name, age, income, occupation, and credit rating. The customers can be classified as to whether or not they have purchased a computer at AllElectronics.
Suppose that new customers are added to the database and that you would like to notify these customers of an upcoming computer sale. To send out promotional literature to every new customer in the database can be quite costly. A more cost-efficient method would be to target only those new customers who are likely to purchase a new computer. A classification model can be constructed and used for this purpose.
Figure 2 shows a decision tree for the concept buys_computer, indicating whether or not a customer at AllElectronics is likely to purchase a computer.
[Figure 2: decision tree for buys_computer. Each internal node represents a test on an attribute; each leaf node represents a class.]
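The tree in Figure 2 is not reproduced here, but a tree of this kind is easy to sketch in code. The splits and labels below are illustrative, loosely following the classic buys_computer textbook example (root split on age; students among the young buy; seniors depend on credit rating), not the actual figure:

```python
# Hand-written decision tree for a buys_computer-style classifier.
# Each `if` is an internal-node test; each return value is a leaf class.
def buys_computer(age, student, credit_rating):
    if age <= 30:                 # youth: decided by student status
        return "yes" if student else "no"
    elif age <= 40:               # middle-aged: always buys
        return "yes"
    else:                         # senior: decided by credit rating
        return "yes" if credit_rating == "fair" else "no"

print(buys_computer(25, True, "fair"))        # yes
print(buys_computer(35, False, "excellent"))  # yes
print(buys_computer(50, False, "excellent"))  # no
```

Each root-to-leaf path reads off as one classification rule, e.g., IF age <= 30 AND student THEN buys_computer = yes.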
Decision Trees
For example, consider the widely referenced Iris data
classification problem introduced by Fisher (1936).
The purpose of the analysis is to learn how one can discriminate
between the three types of flowers, based on the four measures
of width and length of petals and sepals.
A classification tree will determine a set of logical if-then
conditions (instead of linear equations) for predicting or
classifying cases.
Advantages of tree methods.
Simplicity of results.
In most cases, the interpretation of results summarized in a tree is very simple. This simplicity is useful not only for purposes of rapid classification of new observations; tree methods often yield a much simpler "model" for explaining why observations are classified or predicted in a particular manner.
e.g., when analyzing business problems, it is much easier to present a few simple if-then statements to management than some elaborate equations.
The example data file Irisdat.sta reports the lengths and widths of
sepals and petals of three types of irises (Setosa, Versicol, and Virginic). The
purpose of the analysis is to learn how one can discriminate between the
three types of flowers, based on the four measures of width and length of
petals and sepals.
Discriminant function analysis will estimate several linear combinations of
predictor variables for computing classification scores (or probabilities) that
allow the user to determine the predicted classification for each
observation.
A classification tree will determine a set of logical if-then conditions
(instead of linear equations) for predicting or classifying cases.
Regression Trees.
Neural Networks and Classification
A neural network is a technique derived from AI that uses generalized approximation and provides an iterative method to carry it out. ANNs use the curve-fitting approach to infer a function from a set of samples.
This technique provides a learning approach; it
is driven by a test sample that is used for the
initial inference and learning. With this kind of
learning method, responses to new inputs may be
able to be interpolated from the known samples.
This interpolation depends on the model
developed by the learning method.
ANN and classification
ANNs can be classified into 2 categories: supervised
and unsupervised networks. Adaptive methods that
attempt to reduce the output error are supervised
learning methods, whereas those that develop
internal representations without sample outputs are
called unsupervised learning methods.
ANNs can learn from information on a specific
problem. They perform well on classification tasks
and are therefore useful in data mining.
[Figure: information processing at a neuron in an ANN.]
Machine Learning Algorithms.
STATISTICA Machine Learning provides a
number of advanced statistical methods for
handling regression and classification tasks
with multiple dependent and independent
variables.
These methods include
Support Vector Machines (SVM)
( for regression and classification).
A major shortcoming of k-Means clustering has been that you need to specify the
number of clusters before starting the analysis (i.e., the number of clusters must
be known a priori); the Generalized EM and k-Means Cluster Analysis module uses
a modified v-fold cross-validation scheme , to determine the best number of
clusters from the data. This extension makes the Generalized EM and k-Means
Cluster Analysis module an extremely useful data mining tool for unsupervised
learning and pattern recognition.
5. CLUSTERING
What is Cluster Analysis?
Cluster: a collection of data objects
Similar to one another within the same cluster
Dissimilar to the objects in other clusters.
Cluster analysis
Grouping a set of data objects into clusters.
Clustering is unsupervised learning: no
predefined classes, no class-labeled training
samples.
Typical applications
As a stand-alone tool to get insight into data
distribution
As a preprocessing step for other algorithms
General Applications of
Clustering
Pattern Recognition
Spatial Data Analysis
create thematic maps in GIS by clustering feature
spaces
detect spatial clusters and explain them in spatial data
mining
Image Processing
Economic Science (especially market research)
World Wide Web
Document classification
Cluster Weblog data to discover groups of similar
access patterns
Examples of Clustering
Applications
Marketing: Help marketers discover distinct
groups in their customer bases, and then use
this knowledge to develop targeted marketing
programs.
Land use: Identification of areas of similar land
use in an earth observation database.
Insurance: Identifying groups of motor insurance
policy holders with a high average claim cost.
City-planning: Identifying groups of houses
according to their house type, value, and
geographical location.
Earthquake studies: Observed earthquake epicenters should be clustered along continent faults.
Partitioning Algorithms: Basic Concept
Partitioning method: Construct a partition of a database
D of n objects into a set of k clusters
Given a k, find a partition of k clusters that optimizes the
chosen partitioning criterion.
- Global optimal: exhaustively enumerate all
partitions
- Heuristic methods: k-means and k-medoids
algorithms
k-means (MacQueen '67): Each cluster is represented by the center of the cluster
k-medoids or PAM (Partition Around Medoids) (Kaufman & Rousseeuw '87): Each cluster is represented by one of the objects in the cluster
126
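The difference between the two heuristics above can be made concrete in code. The following sketch (illustrative Python, not from the slides; the function names are my own) computes both kinds of cluster representative: the k-means centroid, which need not be an actual record, and the k-medoids representative, which always is one.

```python
import math

def centroid(points):
    """k-means representative: the mean of the cluster
    (need not coincide with any actual data object)."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def medoid(points):
    """k-medoids representative: the cluster member that minimizes
    the total distance to all other members (always a real object)."""
    def total_dist(c):
        return sum(math.dist(c, p) for p in points)
    return min(points, key=total_dist)

cluster = [(25, 5), (30, 5), (30, 10)]
print(centroid(cluster))  # not necessarily one of the records
print(medoid(cluster))    # always one of the records
```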
The K-Means Clustering Method
Input: a database D of m records, r1, r2, ..., rm,
and a desired number of clusters k.
Output: set of k clusters that minimizes the
square error criterion.
Given k, the k-means algorithm is implemented
in 4 steps:
Step 1: Randomly choose k records as the initial
cluster centers.
Step 2: Assign each record ri to the cluster such
that the distance between ri and the cluster centroid
(mean) is the smallest among the k clusters.
Step 3: recalculate the centroid (mean) of each
cluster based on the records assigned to the cluster.
Step 4: Go back to Step 2; stop when no new
assignments are made.
127
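The four steps above can be sketched in Python. This is an illustrative implementation, not part of the original slides; the function name `k_means`, the `seed` parameter, and the empty-cluster guard are my own choices.

```python
import math
import random

def k_means(records, k, seed=None):
    """Plain k-means following the 4 steps above, using Euclidean distance.
    Returns the final centroids and the cluster index of each record."""
    rng = random.Random(seed)
    # Step 1: randomly choose k records as the initial centroids
    centroids = rng.sample(records, k)
    assignment = [None] * len(records)
    while True:
        # Step 2: assign each record to its nearest centroid
        new_assignment = [
            min(range(k), key=lambda j: math.dist(r, centroids[j]))
            for r in records
        ]
        # Step 4: stop when no assignment changes
        if new_assignment == assignment:
            return centroids, assignment
        assignment = new_assignment
        # Step 3: recompute each centroid as the mean of its records
        for j in range(k):
            members = [r for r, a in zip(records, assignment) if a == j]
            if members:  # guard against an empty cluster
                dims = len(members[0])
                centroids[j] = tuple(
                    sum(m[i] for m in members) / len(members) for i in range(dims)
                )
```

For example, `k_means([(1, 1), (1, 2), (8, 8), (9, 8)], 2)` separates the two tight groups of points into the two clusters regardless of which records are drawn as seeds.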
The algorithm begins by randomly choosing k records to
represent the centroids (means), m1, m2, ..., mk, of the
clusters, C1, C2, ..., Ck. Each record is placed in a given
cluster based on the distance between the record and
the cluster mean. If the distance between mi and record
rj is the smallest among all cluster means, then record rj is
placed in cluster Ci.
Once all records have been placed in a cluster, the mean
for each cluster is recomputed.
Then the process repeats, by examining each record
again and placing it in the cluster whose mean is closest.
Several iterations may be needed, but the algorithm will
converge, although it may terminate at a local optimum.
128
Example 4.1: Consider the K-means clustering algorithm that
works with the (2-dimensional) records in Table 2. Assume
that the number of desired clusters k is 2.
RID Age Years of Service
--------------------------------------
1 30 5
2 50 25
3 50 15
4 25 5
5 30 10
6 30 25
Let the algorithm choose the record with RID 3 as the initial centroid of
cluster C1 and the record with RID 6 as the initial centroid of cluster C2.
The first iteration:
distance(r1, C1) = sqrt((30-50)^2 + (5-15)^2) = 22.4 and
distance(r1, C2) = sqrt((30-30)^2 + (5-25)^2) = 20.0, so r1 ∈ C2.
distance(r2, C1) = 10.0 and distance(r2, C2) = 20.0, so r2 ∈ C1.
distance(r4, C1) = 26.9 and distance(r4, C2) = 20.6, so r4 ∈ C2.
distance(r5, C1) = 20.6 and distance(r5, C2) = 15.0, so r5 ∈ C2.
Now the new means (centroids) for the two clusters are computed:
m1 = (50, 20) for C1 = {r2, r3} and m2 = (28.75, 11.25) for C2 = {r1, r4, r5, r6}.
129
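The first iteration can also be checked mechanically. The sketch below (illustrative Python, assuming Euclidean distance and the RID 3 / RID 6 seeds stated above, with ties going to C1) assigns every record to its nearest seed and recomputes the two means.

```python
import math

records = {1: (30, 5), 2: (50, 25), 3: (50, 15),
           4: (25, 5), 5: (30, 10), 6: (30, 25)}
c1, c2 = records[3], records[6]  # initial centroids: RID 3 and RID 6

# assign each record to the nearest of the two seeds
clusters = {1: [], 2: []}
for rid, r in records.items():
    nearest = 1 if math.dist(r, c1) <= math.dist(r, c2) else 2
    clusters[nearest].append(r)

def mean(points):
    """Componentwise mean of a list of 2-d points."""
    return tuple(sum(p[i] for p in points) / len(points) for i in range(2))

print(mean(clusters[1]), mean(clusters[2]))  # (50.0, 20.0) (28.75, 11.25)
```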
Clustering of a set of objects based on the k-means method.
130
Hierarchical Clustering
A hierarchical clustering method works by grouping data
objects into a tree of clusters.
In general, there are two types of hierarchical clustering
methods:
Agglomerative hierarchical clustering: This bottom-up
strategy starts by placing each object in its own cluster and
then merges these atomic clusters into larger and larger
clusters, until all of the objects are in a single cluster or until
certain termination conditions are satisfied. Most hierarchical
clustering methods belong to this category. They differ only in
their definition of intercluster similarity.
Divisive hierarchical clustering: This top-down strategy does
the reverse of agglomerative hierarchical clustering by starting
with all objects in one cluster. It subdivides the cluster into
smaller and smaller pieces, until each object forms a cluster on
its own or until certain termination conditions are satisfied, such as
a desired number of clusters being obtained or the distance between
the two closest clusters rising above a certain threshold.
131
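The bottom-up strategy can be sketched in a few lines. This is an illustrative Python implementation, not from the slides; it uses single linkage (the distance between the closest pair of members) as the intercluster similarity, and stops when a desired number of clusters k remains, one of the termination conditions mentioned above.

```python
import math

def agglomerative(points, k):
    """Agglomerative clustering sketch: start with one cluster per object
    and repeatedly merge the two closest clusters until only k remain.
    Single linkage defines intercluster distance; other hierarchical
    methods differ only in this definition."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        # find the pair of clusters with the smallest single-link distance
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(math.dist(a, b)
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)  # merge the two closest clusters
    return clusters

print(agglomerative([(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)], 3))
# [[(0, 0), (0, 1)], [(5, 5), (5, 6)], [(10, 0)]]
```

Running the divisive strategy in reverse would instead start from the single all-object cluster and repeatedly split it.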
Agglomerative and divisive hierarchical clustering on data objects {a, b, c,
d, e}
132
Hierarchical Clustering
133
7. POTENTIAL APPLICATIONS OF DM
134
Market Analysis and
Management
Where are the data sources for analysis?
Credit card transactions, discount coupons,
customer complaint calls, plus (public) lifestyle
studies
Target marketing
Find clusters of model customers who share the
same characteristics: interest, income level,
spending habits, etc.
Determine customer purchasing patterns
over time
Conversion of a single bank account to a joint
account: marriage, etc.
135
Cross-market analysis
Associations/correlations between product sales
Prediction based on the association information
Customer profiling
data mining can tell you what types of customers
buy what products (clustering or classification)
Identifying customer requirements
identifying the best products for different customers
use prediction to find what factors will attract new
customers
Provides summary information
various multidimensional summary reports
statistical summary information (data central
tendency and variation)
136
Fraud Detection and
Management
Applications
widely used in health care, retail, credit card services,
telecommunications (phone card fraud), etc.
Approach
use historical data to build models of fraudulent
behavior and use data mining to help identify similar
instances
Examples
auto insurance: detect a group of people who stage
accidents to collect on insurance
money laundering: detect suspicious money
transactions (US Treasury's Financial Crimes
Enforcement Network)
medical insurance: detect "professional patients" and rings
of doctors and rings of referrals
137
Some representative data
mining tools
Oracle (Oracle Data Mining): classification, prediction,
regression, clustering, association, feature selection, feature
extraction, anomaly detection.
Weka system (http://www.cs.waikato.ac.nz/ml/weka), University
of Waikato, New Zealand. The system is written in Java and
runs on Linux, Windows, and Macintosh.
138