
Data Mining
Data Mining is the process of discovering meaningful
new correlations, patterns and trends by sifting
through large amounts of data stored in repositories,
by using pattern recognition technologies as well as
statistical and mathematical techniques.

Data Mining Functionalities


Discrimination
Generalize, summarize, and contrast data characteristics,
e.g., dry vs. wet regions (in image processing)

Association (correlation and causality)


age(X, "20..29") ∧ income(X, "20..29K") ⇒ buys(X, "PC")
buys(X, "computer") ⇒ buys(X, "software")

Classification and Prediction


Finding models that describe and distinguish classes or
concepts for future prediction
Prediction of some unknown or missing values

Data Mining Functionalities


Cluster detection
Group data to form new classes, e.g., cluster customers based on
similarity of some sort
Based on the principle of maximizing the intra-class similarity and
minimizing the inter-class similarity

Outlier analysis
Outlier: a data object that does not comply with the general
behavior of the data
It can be considered noise or an exception, but is quite useful in
fraud detection and rare-event analysis

Supervised vs. Unsupervised Learning


Supervised learning (e.g. classification)
Supervision: The training data (observations,
measurements, etc.) are accompanied by labels indicating
the class of the observations or actual outcome
New data is classified based on the model obtained from
the training set

Unsupervised learning (e.g. clustering)


The class labels of training data are unknown
Given a set of measurements, observations, etc., the aim is
to establish the existence of classes, clusters, links or
associations in the data

Classification

Classification vs. Prediction


Classification:
predicts categorical class labels
classifies data (constructs a model) based on the training set
and the values (class labels) of a target / class attribute and
uses it for classifying new data

Prediction:
models continuous-valued functions, i.e., predicts unknown
or missing values

Typical Applications of Classification


credit approval
target marketing
medical diagnosis

Classification: A Two-Step Process


Model construction: describing a set of predetermined classes
Each tuple/sample is assumed to belong to a predefined class, as
determined by the class label attribute
The set of tuples used for model construction is known as the training
set
The model is represented as classification rules, decision trees, a
trained neural network or mathematical formulae
Estimate accuracy of the model
The known label of test sample is compared with the classified
result from the model
Accuracy rate is the percentage of test set samples that are
correctly classified by the model
Test set is independent of training set, otherwise over-fitting can
occur

Model usage: for classifying future or unknown objects

Classification Process: Model Construction

Training Data:

NAME    RANK            YEARS   Ind. Chrg.
Mike    Assistant Prof  3       no
Mary    Assistant Prof  7       yes
Bill    Professor       2       yes
Jim     Associate Prof  7       yes
Dave    Assistant Prof  6       no
Anne    Associate Prof  3       no

The classification algorithm learns a classifier (model) from the training data, e.g.:

IF rank = Professor
OR years > 6
THEN Ind. Chrg. = yes

Classification Process: Use the Model (Testing & Prediction)

Testing Data:

NAME     RANK            YEARS   Ind. Chrg.
Tom      Assistant Prof  2       no
Merlisa  Associate Prof  7       no
George   Professor       5       yes
Joseph   Assistant Prof  7       yes

The classifier is first evaluated on the testing data, then applied to
unseen data, e.g., (Jeff, Professor, 4): Ind. Chrg.?
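
To make the two-step process concrete, here is a small, self-contained Python sketch of these two slides. The rule is the one shown above; the variable names and the accuracy computation are illustrative, not part of the original slides.

```python
# Step 1 (model construction): the classifier learned from the training data
def model(rank, years):
    """IF rank = Professor OR years > 6 THEN Ind. Chrg. = yes."""
    return "yes" if rank == "Professor" or years > 6 else "no"

# Step 2 (model usage): first estimate accuracy on the independent testing data
test_set = [("Tom",     "Assistant Prof", 2, "no"),
            ("Merlisa", "Associate Prof", 7, "no"),
            ("George",  "Professor",      5, "yes"),
            ("Joseph",  "Assistant Prof", 7, "yes")]
correct = sum(model(rank, years) == label for _, rank, years, label in test_set)
print("accuracy:", correct / len(test_set))   # 3 of 4 correct -> 0.75

# ... then classify unseen data
print("Jeff:", model("Professor", 4))          # -> yes
```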

Data Preparation for Classification and Prediction

Data cleaning
Preprocess data in order to reduce noise and handle missing values

Relevance analysis (feature selection)
Remove irrelevant or redundant attributes

Data transformation
Generalize and/or normalize data

Decision Tree Classification Algorithms

Decision Tree Algorithms


[Figure: an example decision tree that splits on income group and age, with leaves predicting which movie a customer saw (Independence Day, Birdcage, Courage Under Fire, or Nutty Professor).]

Classification by Decision Tree Induction


Decision tree

A tree structure
Internal node denotes a test on an attribute
Branch represents an outcome of the test
Leaf nodes represent class labels

Decision tree generation consists of two phases


Tree construction
At start, all the training examples are at the root
Partition examples recursively based on selected attributes
Tree pruning
Identify and remove branches that reflect noise or outliers

Use of decision tree: Classifying an unknown sample


Test the attribute values of the sample against the decision tree

Training Dataset

age      income   student   credit_rating   buys_computer
<=30     high     no        fair            no
<=30     high     no        excellent       no
31..40   high     no        fair            yes
>40      medium   no        fair            yes
>40      low      yes       fair            yes
>40      low      yes       excellent       no
31..40   low      yes       excellent       yes
<=30     medium   no        fair            no
<=30     low      yes       fair            yes
>40      medium   yes       fair            yes
<=30     medium   yes       excellent       yes
31..40   medium   no        excellent       yes
31..40   high     yes       fair            yes
>40      medium   no        excellent       no

Output: A Decision Tree for buys_computer

age?
  <=30:   student?
            no:  buys_computer = no
            yes: buys_computer = yes
  31..40: buys_computer = yes
  >40:    credit_rating?
            excellent: buys_computer = no
            fair:      buys_computer = yes
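
For readers who want to reproduce a tree like this programmatically, the sketch below fits scikit-learn's DecisionTreeClassifier with the entropy (information-gain) criterion on the 14 training tuples above. The integer encoding of the categorical attributes is an illustrative choice, and the resulting binary tree uses threshold splits, so it will not match the multiway tree above exactly.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Encoding: age <=30 -> 0, 31..40 -> 1, >40 -> 2; income low 0, medium 1, high 2;
# student no 0, yes 1; credit_rating fair 0, excellent 1
X = [[0, 2, 0, 0], [0, 2, 0, 1], [1, 2, 0, 0], [2, 1, 0, 0], [2, 0, 1, 0],
     [2, 0, 1, 1], [1, 0, 1, 1], [0, 1, 0, 0], [0, 0, 1, 0], [2, 1, 1, 0],
     [0, 1, 1, 1], [1, 1, 0, 1], [1, 2, 1, 0], [2, 1, 0, 1]]
y = ["no", "no", "yes", "yes", "yes", "no", "yes",
     "no", "yes", "yes", "yes", "yes", "yes", "no"]

tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
print(export_text(tree, feature_names=["age", "income", "student", "credit_rating"]))
```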

Algorithm for Decision Tree Induction


Basic algorithm (a greedy algorithm)
Tree is constructed in a top-down recursive divide-and-conquer manner
At start, all the training examples are at the root
Some algorithms assume attributes to be categorical or ordinal (if
continuous-valued, they are discretized in advance)
Examples are partitioned recursively based on selected attributes
Test attributes are selected on the basis of a heuristic or statistical
measure (e.g., information gain)

Conditions for stopping partitioning


All samples for a given node belong to the same class
There are no remaining attributes for further partitioning; majority
voting is then employed for classifying the leaf
There are no samples left
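
As a concrete illustration of this greedy, top-down recursion, here is a minimal Python sketch of ID3-style tree construction. The helper names (entropy, best_attribute, build_tree) and the dictionary-based tree representation are choices made for this sketch, not part of any particular library; each example row is assumed to be a dict mapping attribute names to values.

```python
import math
from collections import Counter

def entropy(labels):
    """Information needed to classify an example, for any number of classes."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def best_attribute(rows, labels, attributes):
    """Select the attribute with the highest information gain."""
    def expected_info(attr):
        subsets = {}
        for row, lab in zip(rows, labels):
            subsets.setdefault(row[attr], []).append(lab)
        return sum(len(s) / len(labels) * entropy(s) for s in subsets.values())
    return max(attributes, key=lambda a: entropy(labels) - expected_info(a))

def build_tree(rows, labels, attributes):
    if len(set(labels)) == 1:            # all samples belong to the same class
        return labels[0]
    if not attributes:                   # no attributes left: majority voting
        return Counter(labels).most_common(1)[0][0]
    attr = best_attribute(rows, labels, attributes)
    tree = {attr: {}}
    remaining = [a for a in attributes if a != attr]
    for value in {row[attr] for row in rows}:   # partition on the chosen attribute
        sub = [(r, l) for r, l in zip(rows, labels) if r[attr] == value]
        tree[attr][value] = build_tree([r for r, _ in sub], [l for _, l in sub], remaining)
    return tree
```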

Attribute Selection Measure


Information gain (ID3/C4.5)
All attributes are assumed to be categorical
Can be modified for continuous-valued
attributes

Gini index (IBM IntelligentMiner)


All attributes are assumed continuous-valued
Assume there exist several possible split values
for each attribute
Can be modified for categorical attributes

Information Gain (ID3/C4.5)

Select the attribute with the highest information gain
Assume there are two classes, P and N
Let the set of examples S contain p elements of class P and n
elements of class N
The amount of information needed to decide if an arbitrary
example in S belongs to P or N is defined as

    I(p, n) = - (p / (p + n)) log2 (p / (p + n)) - (n / (p + n)) log2 (n / (p + n))

Information Gain in Decision Tree Induction

Assume that using attribute A, a set S will be
partitioned into subsets {S1, S2, ..., Sk}
If Si contains pi examples of P and ni examples of N, the
entropy, or the expected information needed to classify
objects in all subsets Si, is

    E(A) = sum over i = 1..k of ((pi + ni) / (p + n)) * I(pi, ni)

The encoding information that would be gained by
branching on A is

    Gain(A) = I(p, n) - E(A)
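
A small Python sketch of these three quantities, kept to two classes to mirror the notation above (the function names info, expected_info, and gain are illustrative):

```python
import math

def info(p, n):
    """I(p, n): information needed to decide whether an example is in P or N."""
    total = p + n
    return sum(-(c / total) * math.log2(c / total) for c in (p, n) if c > 0)

def expected_info(subsets):
    """E(A), where subsets is a list of (pi, ni) pairs, one per value of A."""
    total = sum(p + n for p, n in subsets)
    return sum((p + n) / total * info(p, n) for p, n in subsets)

def gain(subsets):
    """Gain(A) = I(p, n) - E(A)."""
    p, n = sum(pi for pi, _ in subsets), sum(ni for _, ni in subsets)
    return info(p, n) - expected_info(subsets)
```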

Attribute Selection by Information Gain Computation

Class P: buys_computer = yes
Class N: buys_computer = no
I(p, n) = I(9, 5) = 0.940

Compute the entropy for age:

age      pi   ni   I(pi, ni)
<=30     2    3    0.971
31..40   4    0    0
>40      3    2    0.971

E(age) = (5/14) I(2, 3) + (4/14) I(4, 0) + (5/14) I(3, 2) = 0.69

Hence

Gain(age) = I(p, n) - E(age) = 0.94 - 0.69 = 0.25

Similarly,

Gain(income) = 0.029
Gain(student) = 0.151
Gain(credit_rating) = 0.048
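
The numbers on this slide can be checked with a few self-contained lines of Python (the counts come from the training dataset shown earlier):

```python
import math

def I(p, n):
    # two-class information; a class with zero count contributes nothing
    t = p + n
    return sum(-(c / t) * math.log2(c / t) for c in (p, n) if c)

E_age = 5/14 * I(2, 3) + 4/14 * I(4, 0) + 5/14 * I(3, 2)
print(round(I(9, 5), 3), round(E_age, 3), round(I(9, 5) - E_age, 3))
# -> 0.94 0.694 0.247 (the slide rounds E(age) to 0.69 and the gain to 0.25)
```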

Gini Index (IBM IntelligentMiner)

If a data set T contains examples from n classes, the gini index
gini(T) is defined as

    gini(T) = 1 - sum over j = 1..n of pj^2

where pj is the relative frequency of class j in T.

If a data set T is split into two subsets T1 and T2 with sizes N1
and N2 respectively, the gini index of the split data is defined as

    gini_split(T) = (N1 / N) gini(T1) + (N2 / N) gini(T2)

The attribute that provides the smallest gini_split(T) is chosen to
split the node
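
A minimal Python sketch of both quantities (the function names gini and gini_split are illustrative):

```python
def gini(labels):
    """gini(T) = 1 - sum of squared relative class frequencies."""
    total = len(labels)
    return 1 - sum((labels.count(c) / total) ** 2 for c in set(labels))

def gini_split(t1_labels, t2_labels):
    """Size-weighted gini index of splitting T into T1 and T2."""
    n = len(t1_labels) + len(t2_labels)
    return (len(t1_labels) / n * gini(t1_labels)
            + len(t2_labels) / n * gini(t2_labels))

# Example: 14 examples with 9 "yes" and 5 "no" labels
print(gini(["yes"] * 9 + ["no"] * 5))   # ~0.459
```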

Avoid Overfitting in Classification


The generated tree may overfit the training data
Too many branches, some of which may reflect anomalies due to
noise or outliers
The result is poor accuracy for unseen samples

Two approaches to avoid overfitting (see the sketch after this slide)

Prepruning: Halt tree construction early; do not split a
node if this would result in the goodness measure falling
below a threshold
Difficult to choose an appropriate threshold
Postpruning: Remove branches from a fully grown
tree to get a sequence of progressively pruned trees
Use a set of data different from the training data to
decide which is the best pruned tree
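
As a hedged, library-level illustration: scikit-learn's DecisionTreeClassifier exposes analogues of both approaches, prepruning via stopping thresholds and postpruning via cost-complexity pruning. The dataset and parameter values below are placeholders.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Prepruning: halt construction early via thresholds on depth / impurity decrease
pre = DecisionTreeClassifier(max_depth=4, min_impurity_decrease=0.01).fit(X_tr, y_tr)

# Postpruning: grow the tree fully, then prune with cost-complexity pruning;
# each ccp_alpha yields one tree in a sequence of progressively pruned trees
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)
pruned = [DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_tr, y_tr)
          for a in path.ccp_alphas]
best = max(pruned, key=lambda t: t.score(X_te, y_te))  # chosen on held-out data
print(pre.score(X_te, y_te), best.score(X_te, y_te))
```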

Approaches to Determine the Final Tree Size

Separate training (2/3) and testing (1/3) sets
Use cross-validation, e.g., 10-fold cross-validation
Use all the data for training
but apply a statistical test (e.g., chi-square) to estimate
whether expanding or pruning a node may improve the
entire distribution
Use the minimum description length (MDL) principle:
halt growth of the tree when the encoding is minimized

Classification Accuracy: Estimating Error Rates

Partition: training-and-testing
use two independent data sets, e.g., training set (2/3), test
set (1/3)
used for data sets with a large number of samples
Cross-validation
divide the data set into k subsamples
use k-1 subsamples as training data and one subsample as
test data (k-fold cross-validation)
for data sets of moderate size
Bootstrapping (leave-one-out)
for small data sets
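
For example, k-fold cross-validation of a decision tree can be run with scikit-learn's cross_val_score; the dataset here is a placeholder.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 10-fold cross-validation: each of the 10 subsamples serves once as the test set
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print(scores.mean(), scores.std())   # estimated accuracy and its variability
```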

Boosting
Boosting increases classification accuracy
Applicable to decision trees
Learn a series of classifiers, where each
classifier in the series pays more attention to the
examples misclassified by its predecessor
Boosting requires only linear time and constant
space
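
One widely used algorithm of this kind is AdaBoost. Below is a minimal scikit-learn sketch in which shallow decision trees are boosted; the dataset is a placeholder, and the reweighting details follow scikit-learn's implementation rather than the slide.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Each successive base tree is trained with larger weights on the examples
# misclassified by its predecessors, and the ensemble votes on the final class.
boosted = AdaBoostClassifier(n_estimators=100, random_state=0)
print(cross_val_score(boosted, X, y, cv=10).mean())
```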

Decision Trees (Strengths & Weaknesses)

Strengths:
Generates understandable rules
Scalability: fast classification of new cases; can classify data
sets with millions of examples and hundreds of attributes with
reasonable speed
Handles both continuous and categorical values
Indicates the best fields (attributes)
Classification accuracy comparable with other methods

Weaknesses:
Expensive to train (but still better than most other
classification methods)
Uses rectangular (axis-parallel) decision regions
Accuracy can decrease due to over-fitting of the data
