Fuzzyppt

Chapter 7: Classification
Introduction
Classification problem, evaluation of classifiers
Bayesian Classifiers
Optimal Bayes classifier, naive Bayes classifier, applications
Nearest Neighbor Classifier

Basic notions, choice of parameters, applications
Decision Tree Classifiers

Basic notions, split strategies, overfitting, pruning of decision
trees
Scalability to Large Databases
SLIQ, SPRINT, RainForest
Further Approaches to Classification

Neural networks, genetic algorithm, rough set approach, fuzzy
set approaches, support vector machines, prediction
WS 2003/04 Data Mining Algorithms 7 – 81
Scalability to Large Databases: Motivation
Construction of decision trees is one of the most important tasks in classification
We considered up to now
small data sets
main memory resident data
New requirements
larger and larger commercial databases
necessity to use secondary storage algorithms
Scalability for databases of arbitrary (i.e., unbounded) size

Approaches
Sampling
use a subset of the data as training set such that
sample fits into main memory
evaluate only a sample of all potential splits (for numerical attributes)
→ poor quality of resulting decision trees
Support by indexing structures (secondary storage)

Use all data as training set (not just a sample)
Management of the data by a database system
Indexing structures may provide high efficiency
→ no loss in the quality of decision trees


Storage and Indexing Structures
Identify expensive operations:

Evaluation of potential splits and selection of best split
for numerical attributes
sorting the attribute values

evaluation of attribute values as potential split points
for categorical attributes
O(2^m) potential binary splits for m distinct attribute values
Partitioning of training data
according to the selected split point
read and write operations to access the training data
Effort for growth phase dominates the overall effort

SLIQ: Introduction
[Mehta, Agrawal & Rissanen 1996]
SLIQ: Scalable decision tree classifier

Binary splits
Evaluation of the splits by using the Gini index
$gini(T) = 1 - \sum_{j=1}^{k} p_j^2$ for k classes $c_j$ with relative frequencies $p_j$
Special data structures

avoid sorting of the training data
for every node of the decision tree
for each numerical attribute
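The Gini index used above for evaluating splits can be sketched in a few lines. The function names are illustrative, and the size-weighted combination in `gini_split` is the usual way to score a binary split, which the slide does not spell out.

```python
from collections import Counter

def gini(labels):
    """Gini index of a list of class labels: 1 - sum_j p_j^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def gini_split(left, right):
    """Size-weighted Gini index of a binary split (assumed scoring rule)."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# A pure partition scores 0, an even two-class mix scores 0.5.
print(gini(["G", "G", "G"]), gini(["G", "B"]))
```

A lower `gini_split` value indicates a better split, so a split search keeps the candidate with the minimum value.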
SLIQ: Data Structures
Attribute lists
values of an attribute in ascending order
in combination with reference to respective entry in class list
sequential access
secondary storage resident
Class list
contains class label for each training object and
reference to the respective leaf node in the decision tree
random access
main memory resident
Histograms
for each leaf node of the decision tree
frequencies of the individual classes per partition

SLIQ: Example
Training data

Id  Age  Income  Class
1   30   65      G
2   23   15      B
3   40   75      G
4   55   40      B
5   55   100     G
6   45   60      G

Attribute lists

Age  Id        Income  Id
23   2         15      2
30   1         40      4
40   3         60      6
45   6         65      1
55   5         75      3
55   4         100     5

Class list

Id  Class  Leaf
1   G      N1
2   B      N1
3   G      N1
4   B      N1
5   G      N1
6   G      N1

Decision tree: single node N1
SLIQ: Algorithm
Breadth first strategy

For all leaf nodes on the same level of the decision
tree, evaluate all possible splits for all attributes

Standard decision tree classifiers follow a depth first strategy
Split of numerical attributes

Sequentially scan the attribute list of attribute a, and for each value v in the list do:
Determine the respective entry e in the class list
Let k be the value of the "leaf" attribute of e
Update the histogram of k based on the value of the "class" attribute of e
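The scan described above can be sketched as follows, under some assumptions that go beyond the slide: each leaf keeps two class histograms ("below" and "above" the current value), candidate splits have the form a <= v, and they are scored with the size-weighted Gini index. All function and variable names are illustrative.

```python
from collections import Counter, defaultdict

def gini_from_counts(counts):
    """Gini index 1 - sum_j p_j^2 computed from a class histogram."""
    n = sum(counts.values())
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_numeric_splits(attribute_list, class_list):
    """attribute_list: [(value, record_id)] in ascending order of value.
    class_list: {record_id: (class_label, leaf)}.
    Returns {leaf: (v, weighted_gini)} for the best split 'a <= v' per leaf."""
    below = defaultdict(Counter)
    above = defaultdict(Counter)
    for label, leaf in class_list.values():
        above[leaf][label] += 1          # initially every record is "above"
    candidates = defaultdict(dict)
    for value, record_id in attribute_list:
        label, leaf = class_list[record_id]   # entry e and its leaf k
        below[leaf][label] += 1               # update the histogram of k
        above[leaf][label] -= 1
        n_below = sum(below[leaf].values())
        n_above = sum(above[leaf].values())
        if n_above == 0:
            # 'a <= value' no longer separates anything in this leaf
            candidates[leaf].pop(value, None)
            continue
        n = n_below + n_above
        # a later record with the same value overwrites the earlier score
        candidates[leaf][value] = (n_below / n * gini_from_counts(below[leaf])
                                   + n_above / n * gini_from_counts(above[leaf]))
    return {leaf: min(c.items(), key=lambda item: item[1])
            for leaf, c in candidates.items() if c}

class_list = {1: ("G", "N1"), 2: ("B", "N1"), 3: ("G", "N1"),
              4: ("B", "N1"), 5: ("G", "N1"), 6: ("G", "N1")}
income_list = [(15, 2), (40, 4), (60, 6), (65, 1), (75, 3), (100, 5)]
print(best_numeric_splits(income_list, class_list))  # {'N1': (40, 0.0)}
```

On the SLIQ example data, the split Income <= 40 separates the two B records from the four G records, which is why its weighted Gini index is 0. Because all leaves of one level are handled in the same pass, a single scan of each attribute list suffices per tree level.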
SPRINT: Introduction
[Shafer, Agrawal & Mehta 1996]
Shortcomings of SLIQ
Size of class list linearly grows with the size of the
database, i.e. with the number of training examples
SLIQ scales well only if sufficient main memory for
the entire class list is available
Goals of SPRINT
Scalability for arbitrarily large databases
Simple parallelization of the method
SPRINT: Data Structures
Class list
there is no class list any longer
additional attribute "class" for the attribute lists (resident in secondary storage)
no main memory data structures any longer
scalable to arbitrarily large databases
Attribute lists
no single attribute list for the entire training set
separate attribute lists for each node of the decision tree instead
waiving central data structures supports a simple parallelization of SPRINT

SPRINT: Example
Attribute lists for node N1

Age  Class  Id        Car type  Class  Id
17   high   1         family    high   0
20   high   5         sportive  high   1
23   high   0         sportive  high   2
32   low    4         family    low    3
43   high   2         truck     low    4
68   low    3         family    high   5

Split at N1: Age ≤ 27.5 → N2, Age > 27.5 → N3

Attribute lists for node N2

Age  Class  Id        Car type  Class  Id
17   high   1         family    high   0
20   high   5         sportive  high   1
23   high   0         family    high   5

Attribute lists for node N3

Age  Class  Id        Car type  Class  Id
32   low    4         sportive  high   2
43   high   2         family    low    3
68   low    3         truck     low    4
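The partitioning step in the SPRINT example can be sketched as below. The approach shown (collect the record ids of one child while scanning the list of the splitting attribute, then probe that set for every other list) is an assumption about one reasonable implementation; the function name and the in-memory data layout are illustrative only.

```python
def split_attribute_lists(attribute_lists, split_attr, threshold):
    """attribute_lists: {attribute: [(value, class_label, record_id), ...]}.
    Splits every list of a node on 'split_attr <= threshold' and keeps the
    original order of each list, so no re-sorting is needed in the children."""
    # Scan the list of the splitting attribute and remember the left-child ids.
    left_ids = {rid for value, _, rid in attribute_lists[split_attr]
                if value <= threshold}
    left, right = {}, {}
    for attr, entries in attribute_lists.items():
        # Probe the id set to route each entry to the correct child.
        left[attr] = [e for e in entries if e[2] in left_ids]
        right[attr] = [e for e in entries if e[2] not in left_ids]
    return left, right

n1 = {
    "age": [(17, "high", 1), (20, "high", 5), (23, "high", 0),
            (32, "low", 4), (43, "high", 2), (68, "low", 3)],
    "car type": [("family", "high", 0), ("sportive", "high", 1),
                 ("sportive", "high", 2), ("family", "low", 3),
                 ("truck", "low", 4), ("family", "high", 5)],
}
n2, n3 = split_attribute_lists(n1, "age", 27.5)
print(n2["car type"])
```

The lists produced for N2 and N3 match the tables of the example, and the age lists remain sorted because filtering preserves order.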
SPRINT: Experimental Evaluation
[Figure: runtime (in seconds, 0 to 8000) plotted against the number of objects (in millions, 0 to 3.0) for SLIQ and SPRINT]
SLIQ is more efficient than SPRINT as long as the class list fits into main memory
SLIQ is not applicable for data sets with more than one million entries
RainForest: Introduction
[Gehrke, Ramakrishnan & Ganti 1998]
Shortcomings of SPRINT
Does not exploit the available main memory
Is applicable to breadth first decision tree construction only
Goals of RainForest
Exploits the available main memory to increase the efficiency
Applicable to all known algorithms
RainForest: Basic idea

Separate scalability aspects from quality aspects of a decision
tree classifier
RainForest: Data Structures
AVC set for attribute a and node k

Contains a class histogram for each value of a
For all training objects that belong to the partition of node k
Entries: (ai, cj, count)
AVC group for node k

Set of AVC sets of node k for all attributes
For categorical attributes:

AVC set is significantly smaller than attribute lists
At least one of the AVC sets fits into main memory
Potentially, the entire AVC group fits into main memory

RainForest: Example
Training data

Id  age     income  class
1   young   65      G
2   young   15      B
3   young   75      G
4   senior  40      B
5   senior  100     G
6   senior  60      G

AVC set "age" for N1        AVC set "income" for N1

value   class  count        value  class  count
young   B      1            15     B      1
young   G      2            40     B      1
senior  B      1            60     G      1
senior  G      2            65     G      1
                            75     G      1
                            100    G      1

Split at N1: age = young → N2, age = senior → N3

AVC set "income" for N2     AVC set "age" for N2

value  class  count         value  class  count
15     B      1             young  B      1
65     G      1             young  G      2
75     G      1
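The AVC sets of the example can be rebuilt with one pass over the records of a node. This is a minimal sketch; the function names and the dictionary-based record layout are assumptions made for illustration.

```python
from collections import Counter

def avc_set(records, attribute, class_attr="class"):
    """Counts (attribute value, class label) pairs over the records of a node."""
    return Counter((r[attribute], r[class_attr]) for r in records)

def avc_group(records, attributes, class_attr="class"):
    """AVC group of a node: one AVC set per attribute."""
    return {a: avc_set(records, a, class_attr) for a in attributes}

training = [
    {"id": 1, "age": "young", "income": 65, "class": "G"},
    {"id": 2, "age": "young", "income": 15, "class": "B"},
    {"id": 3, "age": "young", "income": 75, "class": "G"},
    {"id": 4, "age": "senior", "income": 40, "class": "B"},
    {"id": 5, "age": "senior", "income": 100, "class": "G"},
    {"id": 6, "age": "senior", "income": 60, "class": "G"},
]
group_n1 = avc_group(training, ["age", "income"])
n2_records = [r for r in training if r["age"] == "young"]
group_n2 = avc_group(n2_records, ["age", "income"])
print(group_n1["age"])
```

The size of an AVC set depends on the number of distinct attribute values and classes, not on the number of training objects, which is why it can fit into main memory even when the partition itself does not.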
RainForest: Algorithms
Assumption
The entire AVC group of the root node fits into main memory
Then, the AVC groups of each node also fit into main memory
Algorithm RF_Write
Construction of the AVC group of node k in main memory by
sequential scan over the training set
Determination of the optimal split for node k by using the AVC
group
Reading the training set and distribution (writing) to the
partitions
→ training set is read twice and written once

RainForest: Algorithms
Algorithm RF_Read
Avoids explicit writing of the partitions to secondary storage
Reading of desired partitions from the entire training data set
Simultaneous creation of AVC groups for as many partitions as
possible
Training database is read for each tree level multiple times
Algorithm RF_Hybrid
Usage of RF_Read as long as the AVC groups of all nodes from
the current level of the decision tree fit into main memory
Subsequent materialization of the partitions by using RF_Write
RainForest: Experimental Evaluation
[Figure: runtime (in seconds, up to 20,000) plotted against the number of training objects (in millions, 1.0 to 3.0) for SPRINT and RainForest]
For all RainForest algorithms, the runtime increases linearly with the number n of training objects
RainForest is significantly more efficient than SPRINT
Boosting and Bagging
Techniques to increase classification accuracy

Bagging
Basic idea: Learn a set of classifiers and decide the class prediction by following the majority of the individual votes
Boosting
Basic idea: Learn a series of classifiers, where each classifier in the series pays more attention to the examples misclassified by its predecessor
Applicable to decision trees or Bayesian classifiers
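The bagging idea can be sketched as follows. Drawing each training set as a bootstrap sample (with replacement) is the standard definition of bagging rather than something stated on the slide, and `learn` is a hypothetical callback that turns a training sample into a classifier function.

```python
import random
from collections import Counter

def bagging_fit(train, learn, n_models=11, seed=0):
    """Learns n_models classifiers, each on a bootstrap sample of 'train'.
    'learn' is a caller-supplied function: sample -> classifier(x) -> label."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        sample = [rng.choice(train) for _ in train]  # draw with replacement
        models.append(learn(sample))
    return models

def bagging_predict(models, x):
    """Majority vote over the individual predictions."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

# Three fixed toy classifiers: two vote "G", one votes "B".
toy_models = [lambda x: "G", lambda x: "G", lambda x: "B"]
print(bagging_predict(toy_models, x=None))  # G
```

Any base learner can be plugged in through `learn`, for example a decision tree or a naive Bayes classifier.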
Boosting: Algorithm
Algorithm
Assign every example an equal weight 1/N
For t = 1, 2, …, T do
Obtain a hypothesis (classifier) h(t) under w(t)

Calculate the error of h(t) and re-weight the examples based
on the error
Normalize w(t+1) to sum to 1.0
Output a weighted sum of all the hypotheses, with each hypothesis weighted according to its accuracy on the training set
Boosting requires only linear time and constant space
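The boosting loop above leaves the re-weighting formula open. The sketch below fills it in with AdaBoost, one well-known instance of this scheme, which is an assumption since the slides do not name a variant. The weak learner (a threshold "stump" on a single numeric attribute) and all function names are illustrative.

```python
import math

def stump_predict(stump, x):
    """A stump predicts 'polarity' for x <= threshold and -polarity otherwise."""
    threshold, polarity = stump
    return polarity if x <= threshold else -polarity

def best_stump(xs, ys, w):
    """Weak learner: the stump with the smallest weighted training error."""
    best, best_err = None, float("inf")
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            stump = (threshold, polarity)
            err = sum(wi for wi, x, y in zip(w, xs, ys)
                      if stump_predict(stump, x) != y)
            if err < best_err:
                best, best_err = stump, err
    return best, best_err

def adaboost(xs, ys, rounds):
    """xs: numbers, ys: labels in {-1, +1}. Returns [(alpha_t, stump_t)]."""
    n = len(xs)
    w = [1.0 / n] * n                      # equal initial weights 1/N
    ensemble = []
    for _ in range(rounds):
        stump, err = best_stump(xs, ys, w)  # hypothesis h(t) under w(t)
        if err >= 0.5:
            break                           # no better than guessing
        err = max(err, 1e-10)               # avoid division by zero
        alpha = 0.5 * math.log((1.0 - err) / err)
        ensemble.append((alpha, stump))
        # misclassified examples gain weight, correct ones lose weight
        w = [wi * math.exp(-alpha * y * stump_predict(stump, x))
             for wi, x, y in zip(w, xs, ys)]
        total = sum(w)
        w = [wi / total for wi in w]        # normalize w(t+1) to sum to 1.0
    return ensemble

def boosted_predict(ensemble, x):
    """Sign of the weighted sum of all hypotheses."""
    score = sum(alpha * stump_predict(stump, x) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1

xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, -1, -1, 1, 1]   # no single stump classifies this correctly
ensemble = adaboost(xs, ys, rounds=3)
print([boosted_predict(ensemble, x) for x in xs])
```

In this toy data set every single stump misclassifies at least two examples, while the weighted vote of three stumps reproduces all six labels.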

Neural Networks
Advantages
prediction accuracy is generally high
robust, works when training examples contain errors
output may be discrete, real-valued, or a vector of
several discrete or real-valued attributes

fast evaluation of the learned target function
Criticism
long training time
difficult to understand the learned function (weights),
no explicit knowledge generated

not easy to incorporate domain knowledge

A Neuron
[Figure: the input vector x = (x1, x2, …, xn) and the weight vector w = (w1, w2, …, wn) feed a weighted sum Σ with bias µk; the activation function f produces the output y]
The n-dimensional input vector x = (x1, x2, …, xn) is mapped into the variable y by means of the scalar product and a nonlinear function mapping
Network Training
The ultimate objective of training
obtain a set of weights that classifies almost all the tuples in the training data correctly
Steps
Initialize weights with random values
Feed the input tuples into the network one by one
For each unit
Compute the net input to the unit as a linear combination of all the inputs to the unit
Compute the output value using the activation function
Compute the error
Update the weights and the bias
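Multi-Layer Perceptron
[Figure: feed-forward network; the input vector xi enters at the input nodes, passes through the hidden nodes via the weights wij, and the output nodes produce the output vector]
Net input of unit j: $I_j = \sum_i w_{ij} O_i + \theta_j$
Output of unit j: $O_j = \frac{1}{1 + e^{-I_j}}$
Error of output node j: $Err_j = O_j (1 - O_j)(T_j - O_j)$
Error of hidden node j: $Err_j = O_j (1 - O_j) \sum_k Err_k w_{jk}$
Weight update: $w_{ij} = w_{ij} + (l) \, Err_j \, O_i$
Bias update: $\theta_j = \theta_j + (l) \, Err_j$
(T_j: target value of output node j, l: learning rate)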
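The update rules of this slide can be traced in a small sketch for a network with one hidden layer. The function name, the nested-list weight layout and the default learning rate are assumptions made for illustration.

```python
import math

def sigmoid(i):
    """Output of a unit for net input i: 1 / (1 + e^-i)."""
    return 1.0 / (1.0 + math.exp(-i))

def train_step(x, target, w_ih, theta_h, w_ho, theta_o, l=0.5):
    """One backpropagation step for a single training tuple.
    w_ih[i][j]: weight from input i to hidden unit j,
    w_ho[j][k]: weight from hidden unit j to output unit k.
    Updates weights and biases in place; returns the outputs before the update."""
    n_in, n_hid, n_out = len(x), len(theta_h), len(theta_o)
    # forward pass: net input and output of every unit
    o_h = [sigmoid(sum(w_ih[i][j] * x[i] for i in range(n_in)) + theta_h[j])
           for j in range(n_hid)]
    o_o = [sigmoid(sum(w_ho[j][k] * o_h[j] for j in range(n_hid)) + theta_o[k])
           for k in range(n_out)]
    # errors of the output nodes, then of the hidden nodes
    err_o = [o_o[k] * (1 - o_o[k]) * (target[k] - o_o[k]) for k in range(n_out)]
    err_h = [o_h[j] * (1 - o_h[j]) * sum(err_o[k] * w_ho[j][k] for k in range(n_out))
             for j in range(n_hid)]
    # weight and bias updates
    for j in range(n_hid):
        for k in range(n_out):
            w_ho[j][k] += l * err_o[k] * o_h[j]
    for k in range(n_out):
        theta_o[k] += l * err_o[k]
    for i in range(n_in):
        for j in range(n_hid):
            w_ih[i][j] += l * err_h[j] * x[i]
    for j in range(n_hid):
        theta_h[j] += l * err_h[j]
    return o_o

# Two inputs, two hidden units, one output, all weights starting at zero.
w_ih = [[0.0, 0.0], [0.0, 0.0]]
theta_h = [0.0, 0.0]
w_ho = [[0.0], [0.0]]
theta_o = [0.0]
first = train_step([1.0, 0.0], [1.0], w_ih, theta_h, w_ho, theta_o)
second = train_step([1.0, 0.0], [1.0], w_ih, theta_h, w_ho, theta_o)
print(first, second)
```

With all weights at zero the first output is 0.5; after one update the output for the same tuple moves towards the target value 1. In practice the weights are initialized randomly, as the training steps state, because identical initial weights keep the hidden units indistinguishable.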
Network Pruning and Rule Extraction
Network pruning
Fully connected network will be hard to articulate
N input nodes, h hidden nodes and m output nodes lead to
h⋅(m+N) weights
Pruning: Remove some of the links without affecting classification
accuracy of the network
Extracting rules from a trained network
Discretize activation values; replace individual activation value by
the cluster average maintaining the network accuracy
Enumerate the output from the discretized activation values to
find rules between activation value and output
Find the relationship between the input and activation value
Combine the above two to have rules relating the output to input
Genetic Algorithms
GA: based on an analogy to biological evolution
Each rule is represented by a string of bits
An initial population is created consisting of randomly
generated rules
e.g., “If A1 and Not A2 then C2” can be encoded as 100
Based on the evolutionary notion of survival of the fittest, a new population is formed that consists of the fittest rules and their offspring
The fitness of a rule is represented by its classification accuracy on a set of training examples
Offspring are generated by crossover and mutation
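The evolutionary loop on bit-string rules can be sketched as follows. Keeping the fitter half of the population, single-point crossover and the mutation rate are illustrative choices, not taken from the slides, and `fitness` is a caller-supplied function (on real data it would be the classification accuracy of the rule on the training examples).

```python
import random

def crossover(a, b, rng):
    """Single-point crossover: swap the tails of two equally long bit strings."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:], b[:point] + a[point:]

def mutate(rule, rng, rate=0.05):
    """Flip each bit independently with probability 'rate'."""
    return "".join(("1" if bit == "0" else "0") if rng.random() < rate else bit
                   for bit in rule)

def evolve(population, fitness, generations=20, seed=0):
    """Needs at least four rules so that two parents survive each round."""
    rng = random.Random(seed)
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        survivors = ranked[: len(ranked) // 2]      # survival of the fittest
        children = []
        while len(survivors) + len(children) < len(population):
            parent1, parent2 = rng.sample(survivors, 2)
            child1, child2 = crossover(parent1, parent2, rng)
            children += [mutate(child1, rng), mutate(child2, rng)]
        population = survivors + children[: len(population) - len(survivors)]
    return max(population, key=fitness)

# Placeholder fitness for demonstration only: number of 1 bits in the rule.
count_ones = lambda rule: rule.count("1")
best = evolve(["000", "001", "010", "100"], count_ones)
print(best)
```

Because the fittest rules are carried over unchanged, the best fitness in the population never decreases from one generation to the next.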
Rough Set Approach

Rough sets are used to approximately or "roughly" define equivalence classes
A rough set for a given class C is approximated by two
sets: a lower approximation (certain to be in C) and an
upper approximation (cannot be described as not
belonging to C)
Finding the minimal subsets (reducts) of attributes (for
feature reduction) is NP-hard but a discernibility matrix
is used to reduce the computation intensity

Fuzzy Set Approaches
Fuzzy logic uses truth values between 0.0 and 1.0 to represent the degree of membership (for example, by means of a fuzzy membership graph)
Attribute values are converted to fuzzy values
e.g., income is mapped into the discrete categories
{low, medium, high} with fuzzy values calculated
For a given new sample, more than one fuzzy value may
apply
Each applicable rule contributes a vote for membership
in the categories
Typically, the truth values for each predicted category
are summed
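A small sketch of the fuzzy procedure: attribute values are converted to fuzzy values, each applicable rule votes, and the truth values per category are summed. The piecewise-linear membership functions, their breakpoints (20, 40, 60, 80) and the three rules are hypothetical and chosen only for illustration.

```python
def falling(x, a, b):
    """Membership 1 below a, 0 above b, linear in between."""
    if x <= a:
        return 1.0
    if x >= b:
        return 0.0
    return (b - x) / (b - a)

def rising(x, a, b):
    """Membership 0 below a, 1 above b, linear in between."""
    return 1.0 - falling(x, a, b)

def income_memberships(income):
    """Fuzzy values of an income for {low, medium, high} (hypothetical breakpoints)."""
    return {
        "low": falling(income, 20, 40),
        "medium": min(rising(income, 20, 40), falling(income, 60, 80)),
        "high": rising(income, 60, 80),
    }

# Hypothetical rules: fuzzy income category -> predicted class.
RULES = {"low": "reject", "medium": "accept", "high": "accept"}

def fuzzy_predict(income):
    """Sums the truth values of all applicable rules per predicted class."""
    votes = {}
    for category, truth in income_memberships(income).items():
        predicted = RULES[category]
        votes[predicted] = votes.get(predicted, 0.0) + truth
    return max(votes, key=votes.get), votes

print(income_memberships(30))  # {'low': 0.5, 'medium': 0.5, 'high': 0.0}
print(fuzzy_predict(70))
```

An income of 30 belongs partly to "low" and partly to "medium", so more than one fuzzy value applies; an income of 70 is partly "medium" and partly "high", and both applicable rules add their truth values to the same class.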
© and acknowledgements: Prof. Dr. Hans-Peter Kriegel and Matthias Schubert (LMU Munich)
and Dr. Thorsten Joachims (U Dortmund and Cornell U)
Support Vector Machines (SVM)
Motivation: Linear Separation
Vectors in $\mathbb{R}^d$ represent objects
Objects belong to exactly one of two respective classes
For the sake of simpler formulas, the used class labels are: y = –1 and y = +1
Classification by linear separation: determine the hyperplane which separates both vector sets with a "maximal stability"
Assign unknown elements to the halfspace in which they reside
[Figure: two point sets divided by a separating hyperplane]

Support Vector Machines
Problems of linear separation

Definition and efficient determination of the maximum stable hyperplane
Classes are not always linearly separable
Computation of selected hyperplanes is very expensive
Restriction to two classes
…
Approach to solve these problems

Support Vector Machines (SVMs) [Vapnik 1979, 1995]
Maximum Margin Hyperplane

Observation: There is no unique hyperplane to separate p1 from p2
Question: which hyperplane separates the classes best?
[Figure: two different hyperplanes, each separating the point sets p1 and p2]
Criteria
Stability at insertion
Distance to the objects of both classes

Support Vector Machines: Principle
Basic idea: Linear separation with the Maximum Margin Hyperplane (MMH)
Distance to points from any of the two sets is maximal, i.e. at least ξ
Minimal probability that the separating hyperplane has to be moved due to an insertion
Best generalization behaviour
MMH is "maximally stable"
MMH only depends on points pi whose distance to the hyperplane is exactly ξ
pi is called a support vector
[Figure: maximum margin hyperplane between the point sets p1 and p2, with a margin of width ξ on either side]

Recall some algebraic notions for feature space FS
Inner product of two vectors $\vec{x}, \vec{y} \in FS$: $\langle \vec{x}, \vec{y} \rangle$
e.g., canonical scalar product: $\langle \vec{x}, \vec{y} \rangle = \sum_{i=1}^{d} (x_i \cdot y_i)$
Hyperplane H(w,b) with normal vector w and value b:
$H(\vec{w}, b) = \{ \vec{x} \in FS \mid \langle \vec{w}, \vec{x} \rangle + b = 0 \}$
Distance of a vector x to the hyperplane H(w,b):
$dist(\vec{x}, H(\vec{w}, b)) = \frac{1}{\sqrt{\langle \vec{w}, \vec{w} \rangle}} \cdot (\langle \vec{w}, \vec{x} \rangle + b)$
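Computation of the Maximum Margin Hyperplane
Two assumptions for classifying xi (class 1: yi = +1, class 2: yi = –1):
1) The classification error is zero
$y_i = -1 \Rightarrow \langle \vec{w}, \vec{x}_i \rangle + b < 0$ and $y_i = +1 \Rightarrow \langle \vec{w}, \vec{x}_i \rangle + b > 0$
$\Leftrightarrow \; y_i \cdot (\langle \vec{w}, \vec{x}_i \rangle + b) > 0$
2) The margin is maximal
Let ξ denote the minimum distance of any training object xi to the hyperplane H(w,b):
$\xi = \min_{\vec{x}_i \in TR} \frac{1}{\sqrt{\langle \vec{w}, \vec{w} \rangle}} \cdot (\langle \vec{w}, \vec{x}_i \rangle + b)$
Then: Maximize ξ subject to $\forall i \in [1..n]: y_i \cdot \frac{1}{\sqrt{\langle \vec{w}, \vec{w} \rangle}} \cdot (\langle \vec{w}, \vec{x}_i \rangle + b) \ge \xi$
Scaling w and b by a common positive factor does not change the hyperplane H(w,b), so they can be normalized such that
$\xi = \frac{1}{\sqrt{\langle \vec{w}, \vec{w} \rangle}}$
and the condition can be reformulated:
$\forall i \in [1..n]: y_i \cdot \xi \cdot (\langle \vec{w}, \vec{x}_i \rangle + b) \ge \xi$
$\forall i \in [1..n]: y_i \cdot (\langle \vec{w}, \vec{x}_i \rangle + b) \ge 1$
Maximization of $\frac{1}{\sqrt{\langle \vec{w}, \vec{w} \rangle}}$ corresponds to a minimization of $\langle \vec{w}, \vec{w} \rangle$
Primary optimization problem:
Find a vector w that minimizes $\langle \vec{w}, \vec{w} \rangle$
subject to $\forall i \in [1..n]: y_i \cdot (\langle \vec{w}, \vec{x}_i \rangle + b) \ge 1$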
Dual Optimization Problem
For computational purposes, transform the primary optimization problem into a dual one by using Lagrange multipliers
Dual optimization problem: Find parameters αi that
maximize $L(\vec{\alpha}) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \cdot \alpha_j \cdot y_i \cdot y_j \cdot \langle \vec{x}_i, \vec{x}_j \rangle$
subject to $\sum_{i=1}^{n} \alpha_i \cdot y_i = 0$ and $0 \le \alpha_i$
For the solution, use algorithms from optimization theory
Up to now only linearly separable data
If data is not linearly separable: Soft Margin Optimization
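The step from the primary to the dual problem can be made explicit with the following standard derivation. It assumes the objective is written as one half of the inner product of w with itself, which has the same minimizer as the inner product itself and produces the factor 1/2 of the dual.

```latex
% Lagrangian of the primary problem with multipliers \alpha_i \ge 0
L(\vec{w}, b, \vec{\alpha}) = \frac{1}{2} \langle \vec{w}, \vec{w} \rangle
  - \sum_{i=1}^{n} \alpha_i \left( y_i \cdot ( \langle \vec{w}, \vec{x}_i \rangle + b ) - 1 \right)

% Setting the partial derivatives to zero
\frac{\partial L}{\partial \vec{w}} = 0 \;\Rightarrow\; \vec{w} = \sum_{i=1}^{n} \alpha_i \cdot y_i \cdot \vec{x}_i
\qquad
\frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{n} \alpha_i \cdot y_i = 0

% Substituting both conditions back eliminates \vec{w} and b
L(\vec{\alpha}) = \sum_{i=1}^{n} \alpha_i
  - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \cdot \alpha_j \cdot y_i \cdot y_j \cdot \langle \vec{x}_i, \vec{x}_j \rangle
```

The dual depends on the training vectors only through their inner products, which is what later allows a kernel to be substituted.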
Soft Margin Optimization
Problem of Maximum Margin Optimization: How to treat data that is not linearly separable?
Two typical problems:
data points are not separable
complete separation is not optimal
[Figure: two example data sets illustrating the two problems]
Trade-off between training error and size of margin
Additionally regard the number of training errors when optimizing:
ξi is the distance from pi to the margin (often called slack variable)
C controls the influence of single training vectors
[Figure: points p1 and p2 with their slack ξ1 and ξ2]
Primary optimization problem with soft margin:
Find a w that minimizes $\frac{1}{2} \langle \vec{w}, \vec{w} \rangle + C \cdot \sum_{i=1}^{n} \xi_i$
subject to $\forall i \in [1..n]: y_i \cdot (\langle \vec{w}, \vec{x}_i \rangle + b) \ge 1 - \xi_i$ and $\xi_i \ge 0$
Dual optimization problem with Lagrange multipliers:
Maximize $L(\vec{\alpha}) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \cdot \alpha_j \cdot y_i \cdot y_j \cdot \langle \vec{x}_i, \vec{x}_j \rangle$
subject to $\sum_{i=1}^{n} \alpha_i \cdot y_i = 0$ and $0 \le \alpha_i \le C$
0 < αi < C: pi is a support vector with ξi = 0
αi = C: pi is a support vector with ξi > 0
αi = 0: pi is no support vector
Decision rule:
$h(\vec{x}) = sign\left( \sum_{\vec{x}_i \in SV} \alpha_i \cdot y_i \cdot \langle \vec{x}_i, \vec{x} \rangle + b \right)$
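The decision rule only needs the support vectors, their multipliers and b. The sketch below evaluates it on a hand-made toy problem: two training points (0, 0) with y = -1 and (2, 0) with y = +1, for which the maximum margin hyperplane is x1 = 1. The values alpha = 0.5 and b = -1 were worked out by hand for this example; the function names are illustrative.

```python
def dot(u, v):
    """Canonical scalar product of two vectors."""
    return sum(a * b for a, b in zip(u, v))

def svm_decide(x, support_vectors, b, kernel=dot):
    """support_vectors: list of (alpha_i, y_i, x_i).
    Returns the sign of sum_i alpha_i * y_i * <x_i, x> + b
    (a score of exactly 0 is mapped to +1 here, an arbitrary choice)."""
    score = sum(alpha * y * kernel(xi, x) for alpha, y, xi in support_vectors) + b
    return 1 if score >= 0 else -1

# Hand-derived toy model: w = sum_i alpha_i * y_i * x_i = (1, 0), b = -1.
support_vectors = [(0.5, +1, (2.0, 0.0)), (0.5, -1, (0.0, 0.0))]
b = -1.0
print(svm_decide((3.0, 5.0), support_vectors, b))   # 1
print(svm_decide((0.5, -2.0), support_vectors, b))  # -1
```

Passing a different `kernel` function replaces the scalar product without changing the rest of the rule.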
Kernel Machines:
Non-Linearly Separable Data Sets
Problem: For real data sets, a linear separation with a high
classification accuracy often is not possible
Idea: Transform the data non-linearly into a new space, and try to
separate the data in the new space linearly (extension of the
hypotheses space)
Example for a quadratically separable data set
Kernel Machines:
Extension of the Hypotheses Space
Principle
φ: input space → extended feature space
Try to separate in the extended feature space linearly
Example
φ: (x, y, z) → (x, y, z, x^2, xy, xz, y^2, yz, z^2)
Here: a hyperplane in the extended feature space is a polynomial of degree 2 in the input space

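Kernel Machines: Example
Input space (2 attributes): $\vec{x} = (x_1, x_2)$
Extended space (6 attributes): $\phi(\vec{x}) = (x_1^2, x_2^2, \sqrt{2} \cdot x_1, \sqrt{2} \cdot x_2, \sqrt{2} \cdot x_1 \cdot x_2, 1)$
[Figure: the data plotted in the input space (axes x1, x2) and in the extended space (axes x1^2, x2)]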
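Kernel Machines: Example (2)
Input space (2 attributes): $\vec{x} = (x_1, x_2)$
Extended space (3 attributes): $\phi(\vec{x}) = (x_1^2, x_2^2, \sqrt{2} \cdot x_1 \cdot x_2)$
[Figure: the data plotted in the input space (axes x1, x2) and in the extended space (axes x1^2, x2^2)]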
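The two feature maps can be checked numerically. The sketch assumes the usual √2 factors in the mixed and linear components; under that assumption the scalar product in the extended space equals a simple function of the scalar product in the input space, so the extended vectors never have to be built explicitly. Function names are illustrative.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def phi3(x):
    """3-attribute map, assuming phi(x) = (x1^2, x2^2, sqrt(2)*x1*x2)."""
    x1, x2 = x
    return (x1 * x1, x2 * x2, math.sqrt(2) * x1 * x2)

def phi6(x):
    """6-attribute map, assuming sqrt(2) factors on the linear and mixed terms."""
    x1, x2 = x
    r = math.sqrt(2)
    return (x1 * x1, x2 * x2, r * x1, r * x2, r * x1 * x2, 1.0)

x, y = (1.0, 2.0), (3.0, -1.0)
# <phi3(x), phi3(y)> equals <x, y>^2
print(dot(phi3(x), phi3(y)), dot(x, y) ** 2)
# <phi6(x), phi6(y)> equals (<x, y> + 1)^2
print(dot(phi6(x), phi6(y)), (dot(x, y) + 1) ** 2)
```

Both printed pairs agree up to floating-point rounding, which is the observation behind replacing the explicit transformation by a kernel function.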

Kernel Machines
Introduction of a kernel corresponds to a feature transformation
$\phi(\vec{x}): FS_{old} \rightarrow FS_{new}$
Dual optimization problem:
Maximize $L(\vec{\alpha}) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \cdot \alpha_j \cdot y_i \cdot y_j \cdot \langle \phi(\vec{x}_i), \phi(\vec{x}_j) \rangle$
subject to $\sum_{i=1}^{n} \alpha_i \cdot y_i = 0$ and $0 \le \alpha_i \le C$
Feature transform φ only affects the scalar product of training vectors
Kernel K is a function: $K_\phi(\vec{x}_i, \vec{x}_j) = \langle \phi(\vec{x}_i), \phi(\vec{x}_j) \rangle$
Kernel Machines: Examples
Radial basis kernel: $K(\vec{x}, \vec{y}) = \exp(-\gamma \cdot \|\vec{x} - \vec{y}\|^2)$
Polynomial kernel (degree d; the slide illustrates d = 2): $K(\vec{x}, \vec{y}) = (\langle \vec{x}, \vec{y} \rangle + 1)^d$
Support Vector Machines: Discussion
+ generate classifiers with a high classification accuracy

+ relatively weak tendency to overfitting (generalization
theory)
+ efficient classification of new objects
+ compact models
– training times may be long (appropriate feature space may be very high-dimensional)
– expensive implementation
– resulting models rarely provide an intuition
What Is Prediction?
Prediction is similar to classification

First, construct a model
Second, use model to predict unknown value
Major method for prediction is regression
Linear and multiple regression
Non-linear regression
Prediction is different from classification
Classification predicts categorical class labels
Prediction models continuous-valued functions

Predictive Modeling in Databases
Predictive modeling: Predict data values or construct
generalized linear models based on the database data.
One can only predict value ranges or category distributions
Method outline:
Minimal generalization
Attribute relevance analysis
Generalized linear model construction
Prediction
Determine the major factors which influence the prediction
Data relevance analysis: uncertainty measurement,
entropy analysis, expert judgement, etc.
Multi-level prediction: drill-down and roll-up analysis
Regression Analysis and Log-Linear Models in Prediction
Linear regression: Y = α + β X
Two parameters, α and β, specify the line and are to be estimated by using the data at hand,
applying the least squares criterion to the known values of Y1, Y2, …, X1, X2, …
Multiple regression: Y = b0 + b1 X1 + b2 X2
Many nonlinear functions can be transformed into the above.
Log-linear models:
The multi-way table of joint probabilities is approximated by a product of lower-order tables.
Probability: $p(a, b, c, d) = \alpha_{ab} \beta_{ac} \chi_{ad} \delta_{bcd}$
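For the linear regression Y = α + β X named at the top of this slide, the least squares estimates have a closed form. The sketch below computes them for one predictor; the function name and the sample values are illustrative.

```python
def fit_line(xs, ys):
    """Least squares estimates (alpha, beta) for Y = alpha + beta * X."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    beta = s_xy / s_xx                 # slope
    alpha = mean_y - beta * mean_x     # intercept
    return alpha, beta

# Points lying exactly on Y = 1 + 2 X
alpha, beta = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(alpha, beta)  # 1.0 2.0
```

The fitted line can then be used for prediction by evaluating `alpha + beta * x` for a new value x.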
Locally Weighted Regression
Construct an explicit approximation to f over a local region surrounding query instance xq
Locally weighted linear regression:
The target function f is approximated near xq using the linear function:
$\hat{f}(x) = w_0 + w_1 a_1(x) + \dots + w_n a_n(x)$
minimize the squared error, weighted by a distance-decreasing weight K:
$E(x_q) \equiv \frac{1}{2} \sum_{x \in nearest\_neighbors(x_q, k)} (f(x) - \hat{f}(x))^2 \cdot K(d(x_q, x))$
the gradient descent training rule:
$\Delta w_j \equiv \eta \sum_{x \in nearest\_neighbors(x_q, k)} K(d(x_q, x)) \cdot (f(x) - \hat{f}(x)) \cdot a_j(x)$
In most cases, the target function is approximated by a constant, linear, or quadratic function.
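The gradient descent rule can be sketched for a single numeric attribute. Several choices below are assumptions made for illustration: the weight function K(d) = exp(-d^2), the attribute a(x) = x - xq (centring on the query keeps gradient descent well conditioned and makes w0 the prediction at xq), and the values of k, eta and the number of steps.

```python
import math

def lwr_predict(xs, ys, xq, k=3, eta=0.5, steps=200):
    """Locally weighted linear regression at query xq, fitted by gradient
    descent on the k nearest neighbours. Returns the estimate of f(xq)."""
    neighbors = sorted(zip(xs, ys), key=lambda p: abs(p[0] - xq))[:k]
    w0, w1 = 0.0, 0.0
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in neighbors:
            a = x - xq                       # centred attribute (assumption)
            weight = math.exp(-(a * a))      # K(d(xq, x)), assumed Gaussian-like
            residual = y - (w0 + w1 * a)     # f(x) - f_hat(x)
            g0 += weight * residual          # contribution for w0 (a_0 = 1)
            g1 += weight * residual * a      # contribution for w1
        w0 += eta * g0                       # delta w_j = eta * sum(...)
        w1 += eta * g1
    return w0                                # f_hat(xq) = w0 because a(xq) = 0

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2 * x + 1 for x in xs]                 # noise-free line Y = 1 + 2 X
print(lwr_predict(xs, ys, xq=2.0))
```

On noise-free linear data the local fit recovers the line, so the estimate at xq = 2 converges to 5. With real data the step size eta has to be small enough for the chosen weights, otherwise gradient descent diverges.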
Prediction: Numerical Data

Prediction: Categorical Data
Chapter 7 – Conclusions
Classification is an extensively studied problem (mainly in statistics, machine learning & neural networks)
Classification is probably one of the most widely used
data mining techniques with a lot of extensions
Scalability is still an important issue for database
applications: thus combining classification with database
techniques should be a promising topic
Research directions: classification of non-relational data,
e.g., text, spatial, multimedia, etc.

References (I)
C. Apte and S. Weiss. Data mining with decision trees and decision rules. Future
Generation Computer Systems, 13, 1997.
L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees.
Wadsworth International Group, 1984.
P. K. Chan and S. J. Stolfo. Learning arbiter and combiner trees from partitioned data
for scaling machine learning. In Proc. 1st Int. Conf. Knowledge Discovery and Data
Mining (KDD'95), pages 39-44, Montreal, Canada, August 1995.
U. M. Fayyad. Branching on attribute values in decision tree generation. In Proc. 1994
AAAI Conf., pages 601-606, AAAI Press, 1994.
J. Gehrke, R. Ramakrishnan, and V. Ganti. Rainforest: A framework for fast decision
tree construction of large datasets. In Proc. 1998 Int. Conf. Very Large Data Bases,
pages 416-427, New York, NY, August 1998.
T. Joachims: Learning to Classify Text using Support Vector Machines. Kluwer, 2002.
M. Kamber, L. Winstone, W. Gong, S. Cheng, and J. Han. Generalization and decision
tree induction: Efficient classification in data mining. In Proc. 1997 Int. Workshop
Research Issues on Data Engineering (RIDE'97), pages 111-120, Birmingham,
England, April 1997.
References (II)
J. Magidson. The Chaid approach to segmentation modeling: Chi-squared automatic
interaction detection. In R. P. Bagozzi, editor, Advanced Methods of Marketing Research,
pages 118-159. Blackwell Business, Cambridge Massachusetts, 1994.
M. Mehta, R. Agrawal, and J. Rissanen. SLIQ : A fast scalable classifier for data mining.
In Proc. 1996 Int. Conf. Extending Database Technology (EDBT'96), Avignon, France,
March 1996.
S. K. Murthy, Automatic Construction of Decision Trees from Data: A Multi-Disciplinary
Survey, Data Mining and Knowledge Discovery 2(4): 345-389, 1998
J. R. Quinlan. Bagging, boosting, and c4.5. In Proc. 13th Natl. Conf. on Artificial
Intelligence (AAAI'96), 725-730, Portland, OR, Aug. 1996.
R. Rastogi and K. Shim. Public: A decision tree classifier that integrates building and
pruning. In Proc. 1998 Int. Conf. Very Large Data Bases, 404-415, New York, NY, August
1998.
J. Shafer, R. Agrawal, and M. Mehta. SPRINT: A scalable parallel classifier for data
mining. In Proc. 1996 Int. Conf. Very Large Data Bases, 544-555, Bombay, India, Sept.
1996.
S. M. Weiss and C. A. Kulikowski. Computer Systems that Learn: Classification and
Prediction Methods from Statistics, Neural Nets, Machine Learning, and Expert Systems.
Morgan Kaufmann, 1991.
