
Data Mining
Data Mining is the process of discovering meaningful
new correlations, patterns and trends by sifting
through large amounts of data stored in repositories,
by using pattern recognition technologies as well as
statistical and mathematical techniques.

Data Mining Functionalities


Discrimination
Generalize, summarize, and contrast data characteristics,
e.g., dry vs. wet regions (in image processing)

Association (correlation and causality)


age(X, "20..29") ∧ income(X, "20..29K") ⇒ buys(X, "PC")
buys(X, "computer") ⇒ buys(X, "software")

Classification and Prediction


Finding models that describe and distinguish classes or
concepts for future prediction
Prediction of some unknown or missing values

Data Mining Functionalities


Cluster detection
Group data to form new classes, e.g., cluster customers based on
similarity of some sort
Based on the principle of maximizing the intra-class similarity and
minimizing the inter-class similarity

Outlier analysis
Outlier: a data object that does not comply with the general
behavior of the data
It can be considered noise or an exception, but is quite useful in
fraud detection and rare-event analysis

Supervised vs. Unsupervised Learning


Supervised learning (e.g. classification)
Supervision: The training data (observations,
measurements, etc.) are accompanied by labels indicating
the class of the observations or actual outcome
New data is classified based on the model obtained from
the training set

Unsupervised learning (e.g. clustering)


The class labels of training data are unknown
Given a set of measurements, observations, etc., the aim is
to establish the existence of classes, clusters, links or
associations in the data

Classification

Classification vs. Prediction


Classification:
predicts categorical class labels
classifies data (constructs a model) based on the training set
and the values (class labels) of a target / class attribute and
uses it for classifying new data

Prediction:
models continuous-valued functions, i.e., predicts unknown
or missing values

Typical Applications of Classification


credit approval
target marketing
medical diagnosis

Classification: A Two-Step Process


Model construction: describing a set of predetermined classes
Each tuple/sample is assumed to belong to a predefined class, as
determined by the class label attribute
The set of tuples used for model construction is known as the training
set
The model is represented as classification rules, decision trees, a
trained neural network or mathematical formulae
Estimate accuracy of the model
The known label of test sample is compared with the classified
result from the model
Accuracy rate is the percentage of test set samples that are
correctly classified by the model
Test set is independent of training set, otherwise over-fitting can
occur

Model usage: for classifying future or unknown objects

Classification Process: Model Construction

Training Data:

NAME    RANK            YEARS   Ind. Chrg.
Mike    Assistant Prof  3       no
Mary    Assistant Prof  7       yes
Bill    Professor       2       yes
Jim     Associate Prof  7       yes
Dave    Assistant Prof  6       no
Anne    Associate Prof  3       no

The classification algorithm learns a classifier (model) from the training data, e.g.:

IF rank = Professor
OR years > 6
THEN Ind. Chrg. = yes

Classification Process: Use the Model (Testing & Prediction)

Testing Data:

NAME     RANK            YEARS   Ind. Chrg.
Tom      Assistant Prof  2       no
Merlisa  Associate Prof  7       no
George   Professor       5       yes
Joseph   Assistant Prof  7       yes

The classifier is first evaluated on the testing data, then applied to
unseen data, e.g., (Jeff, Professor, 4): Ind. Chrg.?
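
To make the two-step process concrete, here is a small, self-contained Python sketch of these two slides. The rule is the one shown above; the variable names and the accuracy computation are illustrative, not part of the original slides.

```python
# Step 1 (model construction): the classifier learned from the training data
def model(rank, years):
    """IF rank = Professor OR years > 6 THEN Ind. Chrg. = yes."""
    return "yes" if rank == "Professor" or years > 6 else "no"

# Step 2 (model usage): first estimate accuracy on the independent testing data
test_set = [("Tom",     "Assistant Prof", 2, "no"),
            ("Merlisa", "Associate Prof", 7, "no"),
            ("George",  "Professor",      5, "yes"),
            ("Joseph",  "Assistant Prof", 7, "yes")]
correct = sum(model(rank, years) == label for _, rank, years, label in test_set)
print("accuracy:", correct / len(test_set))   # 3 of 4 correct -> 0.75

# ... then classify unseen data
print("Jeff:", model("Professor", 4))          # -> yes
```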

Data Preparation for Classification and Prediction

Data cleaning
Preprocess data in order to reduce noise and handle missing values

Relevance analysis (feature selection)
Remove irrelevant or redundant attributes

Data transformation
Generalize and/or normalize data

Decision Tree Classification Algorithms

Decision Tree Algorithms


[Figure: an example decision tree that splits on income group and age, with leaves predicting which movie a customer saw (Independence Day, Birdcage, Courage Under Fire, or Nutty Professor).]

Classification by Decision Tree Induction


Decision tree

A tree structure
Internal node denotes a test on an attribute
Branch represents an outcome of the test
Leaf nodes represent class labels

Decision tree generation consists of two phases


Tree construction
At start, all the training examples are at the root
Partition examples recursively based on selected attributes
Tree pruning
Identify and remove branches that reflect noise or outliers

Use of decision tree: Classifying an unknown sample


Test the attribute values of the sample against the decision tree

Training Dataset

age      income   student   credit_rating   buys_computer
<=30     high     no        fair            no
<=30     high     no        excellent       no
31..40   high     no        fair            yes
>40      medium   no        fair            yes
>40      low      yes       fair            yes
>40      low      yes       excellent       no
31..40   low      yes       excellent       yes
<=30     medium   no        fair            no
<=30     low      yes       fair            yes
>40      medium   yes       fair            yes
<=30     medium   yes       excellent       yes
31..40   medium   no        excellent       yes
31..40   high     yes       fair            yes
>40      medium   no        excellent       no

Output: A Decision Tree for buys_computer

age?
  <=30:   student?
            no:  buys_computer = no
            yes: buys_computer = yes
  31..40: buys_computer = yes
  >40:    credit_rating?
            excellent: buys_computer = no
            fair:      buys_computer = yes
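
For readers who want to reproduce a tree like this programmatically, the sketch below fits scikit-learn's DecisionTreeClassifier with the entropy (information-gain) criterion on the 14 training tuples above. The integer encoding of the categorical attributes is an illustrative choice, and the resulting binary tree uses threshold splits, so it will not match the multiway tree above exactly.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Encoding: age <=30 -> 0, 31..40 -> 1, >40 -> 2; income low 0, medium 1, high 2;
# student no 0, yes 1; credit_rating fair 0, excellent 1
X = [[0, 2, 0, 0], [0, 2, 0, 1], [1, 2, 0, 0], [2, 1, 0, 0], [2, 0, 1, 0],
     [2, 0, 1, 1], [1, 0, 1, 1], [0, 1, 0, 0], [0, 0, 1, 0], [2, 1, 1, 0],
     [0, 1, 1, 1], [1, 1, 0, 1], [1, 2, 1, 0], [2, 1, 0, 1]]
y = ["no", "no", "yes", "yes", "yes", "no", "yes",
     "no", "yes", "yes", "yes", "yes", "yes", "no"]

tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
print(export_text(tree, feature_names=["age", "income", "student", "credit_rating"]))
```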

Algorithm for Decision Tree Induction


Basic algorithm (a greedy algorithm)
Tree is constructed in a top-down recursive divide-and-conquer manner
At start, all the training examples are at the root
Some algorithms assume attributes to be categorical or ordinal (if
continuous-valued, they are discretized in advance)
Examples are partitioned recursively based on selected attributes
Test attributes are selected on the basis of a heuristic or statistical
measure (e.g., information gain)

Conditions for stopping partitioning


All samples for a given node belong to the same class
There are no remaining attributes for further partitioning; majority
voting is then employed for classifying the leaf
There are no samples left
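
As a concrete illustration of this greedy, top-down recursion, here is a minimal Python sketch of ID3-style tree construction. The helper names (entropy, best_attribute, build_tree) and the dictionary-based tree representation are choices made for this sketch, not part of any particular library; each example row is assumed to be a dict mapping attribute names to values.

```python
import math
from collections import Counter

def entropy(labels):
    """Information needed to classify an example, for any number of classes."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def best_attribute(rows, labels, attributes):
    """Select the attribute with the highest information gain."""
    def expected_info(attr):
        subsets = {}
        for row, lab in zip(rows, labels):
            subsets.setdefault(row[attr], []).append(lab)
        return sum(len(s) / len(labels) * entropy(s) for s in subsets.values())
    return max(attributes, key=lambda a: entropy(labels) - expected_info(a))

def build_tree(rows, labels, attributes):
    if len(set(labels)) == 1:            # all samples belong to the same class
        return labels[0]
    if not attributes:                   # no attributes left: majority voting
        return Counter(labels).most_common(1)[0][0]
    attr = best_attribute(rows, labels, attributes)
    tree = {attr: {}}
    remaining = [a for a in attributes if a != attr]
    for value in {row[attr] for row in rows}:   # partition on the chosen attribute
        sub = [(r, l) for r, l in zip(rows, labels) if r[attr] == value]
        tree[attr][value] = build_tree([r for r, _ in sub], [l for _, l in sub], remaining)
    return tree
```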

Attribute Selection Measure


Information gain (ID3/C4.5)
All attributes are assumed to be categorical
Can be modified for continuous-valued
attributes

Gini index (IBM IntelligentMiner)


All attributes are assumed continuous-valued
Assume there exist several possible split values
for each attribute
Can be modified for categorical attributes

Information Gain (ID3/C4.5)

Select the attribute with the highest information gain
Assume there are two classes, P and N
Let the set of examples S contain p elements of class P and n
elements of class N
The amount of information needed to decide if an arbitrary
example in S belongs to P or N is defined as

    I(p, n) = - (p / (p + n)) log2 (p / (p + n)) - (n / (p + n)) log2 (n / (p + n))

Information Gain in Decision Tree Induction

Assume that using attribute A, a set S will be
partitioned into subsets {S1, S2, ..., Sk}
If Si contains pi examples of P and ni examples of N, the
entropy, or the expected information needed to classify
objects in all subsets Si, is

    E(A) = sum over i = 1..k of ((pi + ni) / (p + n)) * I(pi, ni)

The encoding information that would be gained by
branching on A is

    Gain(A) = I(p, n) - E(A)
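
A small Python sketch of these three quantities, kept to two classes to mirror the notation above (the function names info, expected_info, and gain are illustrative):

```python
import math

def info(p, n):
    """I(p, n): information needed to decide whether an example is in P or N."""
    total = p + n
    return sum(-(c / total) * math.log2(c / total) for c in (p, n) if c > 0)

def expected_info(subsets):
    """E(A), where subsets is a list of (pi, ni) pairs, one per value of A."""
    total = sum(p + n for p, n in subsets)
    return sum((p + n) / total * info(p, n) for p, n in subsets)

def gain(subsets):
    """Gain(A) = I(p, n) - E(A)."""
    p, n = sum(pi for pi, _ in subsets), sum(ni for _, ni in subsets)
    return info(p, n) - expected_info(subsets)
```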

Attribute Selection by Information Gain Computation

Class P: buys_computer = yes
Class N: buys_computer = no
I(p, n) = I(9, 5) = 0.940

Compute the entropy for age:

age      pi   ni   I(pi, ni)
<=30     2    3    0.971
31..40   4    0    0
>40      3    2    0.971

E(age) = (5/14) I(2, 3) + (4/14) I(4, 0) + (5/14) I(3, 2) = 0.69

Hence

Gain(age) = I(p, n) - E(age) = 0.94 - 0.69 = 0.25

Similarly,

Gain(income) = 0.029
Gain(student) = 0.151
Gain(credit_rating) = 0.048
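
The numbers on this slide can be checked with a few self-contained lines of Python (the counts come from the training dataset shown earlier):

```python
import math

def I(p, n):
    # two-class information; a class with zero count contributes nothing
    t = p + n
    return sum(-(c / t) * math.log2(c / t) for c in (p, n) if c)

E_age = 5/14 * I(2, 3) + 4/14 * I(4, 0) + 5/14 * I(3, 2)
print(round(I(9, 5), 3), round(E_age, 3), round(I(9, 5) - E_age, 3))
# -> 0.94 0.694 0.247 (the slide rounds E(age) to 0.69 and the gain to 0.25)
```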

Gini Index (IBM IntelligentMiner)

If a data set T contains examples from n classes, the gini index
gini(T) is defined as

    gini(T) = 1 - sum over j = 1..n of pj^2

where pj is the relative frequency of class j in T.

If a data set T is split into two subsets T1 and T2 with sizes N1
and N2 respectively, the gini index of the split data is defined as

    gini_split(T) = (N1 / N) gini(T1) + (N2 / N) gini(T2)

The attribute that provides the smallest gini_split(T) is chosen to
split the node
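
A minimal Python sketch of both quantities (the function names gini and gini_split are illustrative):

```python
def gini(labels):
    """gini(T) = 1 - sum of squared relative class frequencies."""
    total = len(labels)
    return 1 - sum((labels.count(c) / total) ** 2 for c in set(labels))

def gini_split(t1_labels, t2_labels):
    """Size-weighted gini index of splitting T into T1 and T2."""
    n = len(t1_labels) + len(t2_labels)
    return (len(t1_labels) / n * gini(t1_labels)
            + len(t2_labels) / n * gini(t2_labels))

# Example: 14 examples with 9 "yes" and 5 "no" labels
print(gini(["yes"] * 9 + ["no"] * 5))   # ~0.459
```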

Avoid Overfitting in Classification


The generated tree may overfit the training data
Too many branches, some of which may reflect anomalies due to
noise or outliers
The result is poor accuracy for unseen samples

Two approaches to avoid overfitting (see the sketch after this slide)

Prepruning: Halt tree construction early; do not split a
node if this would result in the goodness measure falling
below a threshold
Difficult to choose an appropriate threshold
Postpruning: Remove branches from a fully grown
tree to get a sequence of progressively pruned trees
Use a set of data different from the training data to
decide which is the best pruned tree
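
As a hedged, library-level illustration: scikit-learn's DecisionTreeClassifier exposes analogues of both approaches, prepruning via stopping thresholds and postpruning via cost-complexity pruning. The dataset and parameter values below are placeholders.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Prepruning: halt construction early via thresholds on depth / impurity decrease
pre = DecisionTreeClassifier(max_depth=4, min_impurity_decrease=0.01).fit(X_tr, y_tr)

# Postpruning: grow the tree fully, then prune with cost-complexity pruning;
# each ccp_alpha yields one tree in a sequence of progressively pruned trees
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)
pruned = [DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_tr, y_tr)
          for a in path.ccp_alphas]
best = max(pruned, key=lambda t: t.score(X_te, y_te))  # chosen on held-out data
print(pre.score(X_te, y_te), best.score(X_te, y_te))
```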

Approaches to Determine the Final Tree Size

Separate training (2/3) and testing (1/3) sets
Use cross-validation, e.g., 10-fold cross-validation
Use all the data for training
but apply a statistical test (e.g., chi-square) to estimate
whether expanding or pruning a node may improve the
entire distribution
Use the minimum description length (MDL) principle:
halt growth of the tree when the encoding is minimized

Classification Accuracy: Estimating Error Rates

Partition: training-and-testing
use two independent data sets, e.g., training set (2/3), test
set (1/3)
used for data sets with a large number of samples
Cross-validation
divide the data set into k subsamples
use k-1 subsamples as training data and one subsample as
test data (k-fold cross-validation)
for data sets of moderate size
Bootstrapping (leave-one-out)
for small data sets
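
For example, k-fold cross-validation of a decision tree can be run with scikit-learn's cross_val_score; the dataset here is a placeholder.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 10-fold cross-validation: each of the 10 subsamples serves once as the test set
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print(scores.mean(), scores.std())   # estimated accuracy and its variability
```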

Boosting
Boosting increases classification accuracy
Applicable to decision trees
Learn a series of classifiers, where each
classifier in the series pays more attention to the
examples misclassified by its predecessor
Boosting requires only linear time and constant
space
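
One widely used algorithm of this kind is AdaBoost. Below is a minimal scikit-learn sketch in which shallow decision trees are boosted; the dataset is a placeholder, and the reweighting details follow scikit-learn's implementation rather than the slide.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Each successive base tree is trained with larger weights on the examples
# misclassified by its predecessors, and the ensemble votes on the final class.
boosted = AdaBoostClassifier(n_estimators=100, random_state=0)
print(cross_val_score(boosted, X, y, cv=10).mean())
```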

Decision Trees (Strengths & Weaknesses)

Strengths:
Generates understandable rules
Scalability: fast classification of new cases; can classify data
sets with millions of examples and hundreds of attributes with
reasonable speed
Handles both continuous and categorical values
Indicates the best fields (attributes)
Classification accuracy comparable with other methods

Weaknesses:
Expensive to train (but still better than most other
classification methods)
Uses rectangular (axis-parallel) decision regions
Accuracy can decrease due to over-fitting of the data
