
Chapter 7: Classification

„ Introduction
„ Classification problem, evaluation of classifiers

„ Bayesian Classifiers
„ Optimal Bayes classifier, naive Bayes classifier, applications

„ Nearest Neighbor Classifier


„ Basic notions, choice of parameters, applications

„ Decision Tree Classifiers


„ Basic notions, split strategies, overfitting, pruning of decision
trees
„ Scalability to Large Databases
„ SLIQ, SPRINT, RainForest

„ Further Approaches to Classification


„ Neural networks, genetic algorithm, rough set approach, fuzzy
set approaches, support vector machines, prediction
WS 2003/04 Data Mining Algorithms 7 – 81

Scalability to Large Databases:
Motivation

„ Construction of decision trees is one of the most important tasks in classification

„ We considered up to now
„ small data sets

„ main memory resident data

„ New requirements
„ larger and larger commercial databases

„ necessity to use secondary storage algorithms

„ Scalability for databases of arbitrary (i.e., unbounded) size



Scalability to Large Databases:
Approaches

„ Sampling
„ use a subset of the data as training set such that
sample fits into main memory
„ evaluate only a sample of all potential splits (for numerical attributes)
→ poor quality of resulting decision trees

„ Support by indexing structures (secondary storage)


„ Use all data as training set (not just a sample)

„ Management of the data by a database system

„ Indexing structures may provide high efficiency

→ no loss in the quality of decision trees



Scalability to Large Databases:
Storage and Indexing Structures

Identify expensive operations:

„ Evaluation of potential splits and selection of best split

„ for numerical attributes

„ sorting the attribute values


„ evaluation of attribute values as potential split points
„ for categorical attributes
„ O(2^m) potential binary splits for m distinct attribute values
„ Partitioning of training data
„ according to the selected split point

„ read and write operations to access the training data

Effort for growth phase dominates the overall effort


SLIQ: Introduction

„ [Mehta, Agrawal & Rissanen 1996]

„ SLIQ: Scalable decision tree classifier


„ Binary splits
„ Evaluation of the splits by using the Gini index
  $gini(T) = 1 - \sum_{j=1}^{k} p_j^2$ for k classes $c_j$ with frequencies $p_j$

„ Special data structures
„ avoid re-sorting the training data for every node of the decision tree and for each numerical attribute
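
The Gini index can be computed directly from the class labels of a partition. The following is a minimal sketch; the helper `gini_split` (the size-weighted Gini index of a binary split) is an assumption added for illustration and is not defined on the slide.

```python
from collections import Counter

def gini(labels):
    """Gini index of a collection of class labels: 1 - sum_j p_j^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_split(left, right):
    """Size-weighted Gini index of a binary split (illustrative helper)."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

labels = ["G", "B", "G", "B", "G", "G"]
impurity = gini(labels)                               # 1 - (4/6)^2 - (2/6)^2
pure_split = gini_split(["B", "B"], ["G"] * 4)        # both partitions are pure
```

A split that separates the classes perfectly yields a weighted Gini index of 0; larger values indicate more mixed partitions.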


SLIQ: Data Structures

„ Attribute lists
„ values of an attribute in ascending order

„ in combination with reference to respective entry in class list

„ sequential access
„ secondary storage resident
„ Class list
„ contains class label for each training object and

„ reference to the respective leaf node in the decision tree

„ random access
„ main memory resident
„ Histograms
„ for each leaf node of the decision tree

„ frequencies of the individual classes per partition



SLIQ: Example
Training data
Id   Age   Income   Class
1    30    65       G
2    23    15       B
3    40    75       G
4    55    40       B
5    55    100      G
6    45    60       G

Attribute lists
Age   Id
23    2
30    1
40    3
45    6
55    5
55    4

Income   Id
15       2
40       4
60       6
65       1
75       3
100      5

Class list (the decision tree consists of the root node N1 only)
Id   Class   Leaf
1    G       N1
2    B       N1
3    G       N1
4    B       N1
5    G       N1
6    G       N1


SLIQ: Algorithm

„ Breadth first strategy
„ For all leaf nodes on the same level of the decision tree, evaluate all possible splits for all attributes
„ Standard decision tree classifiers follow a depth first strategy
„ Split of numerical attributes
„ Sequentially scan the attribute list of attribute a, and for each value v in the list do:
„ Determine the respective entry e in the class list
„ Let k be the value of the "leaf" attribute of e
„ Update the histogram of k based on the value of the "class" attribute of e
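
The scan described above can be sketched on the data of the SLIQ example. This is a simplified illustration, not the original implementation: the function name `best_splits`, the size-weighted Gini criterion, and the use of the attribute value v itself as the test "a <= v" are assumptions.

```python
from collections import Counter

# Class list (main memory): record id -> [class label, current leaf].
class_list = {1: ["G", "N1"], 2: ["B", "N1"], 3: ["G", "N1"],
              4: ["B", "N1"], 5: ["G", "N1"], 6: ["G", "N1"]}

# Attribute lists, sorted once by value: (value, record id).
attribute_lists = {
    "Age":    [(23, 2), (30, 1), (40, 3), (45, 6), (55, 5), (55, 4)],
    "Income": [(15, 2), (40, 4), (60, 6), (65, 1), (75, 3), (100, 5)],
}

def gini(hist):
    n = sum(hist.values())
    return 1.0 - sum((c / n) ** 2 for c in hist.values()) if n else 0.0

def best_splits(attr_list, class_list):
    """One sequential scan; returns leaf -> (gini, v) for the best test 'a <= v'."""
    total = {}                                    # leaf -> histogram of all its records
    for label, leaf in class_list.values():
        total.setdefault(leaf, Counter())[label] += 1
    below = {leaf: Counter() for leaf in total}   # leaf -> histogram of values seen so far
    best, touched = {}, set()
    for pos, (value, rid) in enumerate(attr_list):
        label, leaf = class_list[rid]             # random access into the class list
        below[leaf][label] += 1                   # update the histogram of that leaf
        touched.add(leaf)
        last_of_value = pos + 1 == len(attr_list) or attr_list[pos + 1][0] != value
        if not last_of_value:                     # evaluate only between distinct values
            continue
        for k in touched:
            above = total[k] - below[k]
            n_below, n_above = sum(below[k].values()), sum(above.values())
            if n_above == 0:                      # everything on one side: no split
                continue
            n = n_below + n_above
            g = n_below / n * gini(below[k]) + n_above / n * gini(above)
            if k not in best or g < best[k][0]:
                best[k] = (g, value)
        touched.clear()
    return best

age_split = best_splits(attribute_lists["Age"], class_list)["N1"]
income_split = best_splits(attribute_lists["Income"], class_list)["N1"]
```

On this data the test Income <= 40 separates the two classes completely, so it would be preferred over every split on Age.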
SPRINT: Introduction

„ [Shafer, Agrawal & Mehta 1996]

„ Shortcomings of SLIQ
„ Size of class list linearly grows with the size of the
database, i.e. with the number of training examples
„ SLIQ scales well only if sufficient main memory for
the entire class list is available

„ Goals of SPRINT
„ Scalability for arbitrarily large databases

„ Simple parallelization of the method


SPRINT: Data Structures

„ Class list
„ there is no class list any longer
„ additional attribute "class" for the attribute lists (resident in secondary storage)
„ no main memory data structures any longer
„ scalable to arbitrarily large databases

„ Attribute lists
„ no single attribute list for the entire training set
„ separate attribute lists for each node of the decision tree instead
„ dispensing with central data structures supports a simple parallelization of SPRINT



SPRINT: Example
Attribute lists for node N1
Age   Class   Id
17    high    1
20    high    5
23    high    0
32    low     4
43    high    2
68    low     3

Car type   Class   Id
family     high    0
sportive   high    1
sportive   high    2
family     low     3
truck      low     4
family     high    5

Split of N1: Age ≤ 27.5 leads to N2, Age > 27.5 leads to N3

Attribute lists for node N2
Age   Class   Id
17    high    1
20    high    5
23    high    0

Car type   Class   Id
family     high    0
sportive   high    1
family     high    5

Attribute lists for node N3
Age   Class   Id
32    low     4
43    high    2
68    low     3

Car type   Class   Id
sportive   high    2
family     low     3
truck      low     4
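
The partitioning in this example can be sketched as follows. The function `split_node` is a hypothetical name, and collecting the record ids of one side in a set in order to split the remaining attribute lists is an assumption about how the lists are kept consistent; the slides do not spell out this step.

```python
# Attribute lists of node N1: (value, class label, record id).
age_list = [(17, "high", 1), (20, "high", 5), (23, "high", 0),
            (32, "low", 4), (43, "high", 2), (68, "low", 3)]
car_list = [("family", "high", 0), ("sportive", "high", 1), ("sportive", "high", 2),
            ("family", "low", 3), ("truck", "low", 4), ("family", "high", 5)]

def split_node(split_list, other_lists, goes_left):
    """Partition all attribute lists of a node according to a test on one attribute."""
    left = [entry for entry in split_list if goes_left(entry[0])]
    right = [entry for entry in split_list if not goes_left(entry[0])]
    left_ids = {rid for _, _, rid in left}        # record ids sent to the left child
    others_left = [[e for e in lst if e[2] in left_ids] for lst in other_lists]
    others_right = [[e for e in lst if e[2] not in left_ids] for lst in other_lists]
    return (left, others_left), (right, others_right)

n2, n3 = split_node(age_list, [car_list], lambda age: age <= 27.5)
n2_age, [n2_car] = n2
n3_age, [n3_car] = n3
```

Because each list is filtered in its existing order, the child lists stay sorted and no re-sorting is needed below N1.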


SPRINT: Experimental Evaluation


[Figure: runtime (in seconds, 0 to 8000) over the number of objects (in millions, 0 to 3.0) for SPRINT and SLIQ]

„ SLIQ is more efficient than SPRINT as long as the class list fits into main memory
„ In this setting, SLIQ is not applicable for data sets with more than one million entries
RainForest: Introduction

„ [Gehrke, Ramakrishnan & Ganti 1998]

„ Shortcomings of SPRINT
„ Does not exploit the available main memory

„ Is applicable to breadth first decision tree construction only

„ Goals of RainForest
„ Exploits the available main memory to increase the efficiency

„ Applicable to all known algorithms

„ RainForest: Basic idea


„ Separate scalability aspects from quality aspects of a decision
tree classifier


RainForest: Data Structures

„ AVC set for attribute a and node k


„ Contains a class histogram for each value of a

„ For all training objects that belong to the partition of node k

„ Entries: (ai, cj, count)

„ AVC group for node k


„ Set of AVC sets of node k for all attributes

„ For categorical attributes:


„ AVC set is significantly smaller than attribute lists

„ At least one of the AVC sets fits into main memory

„ Potentially, the entire AVC group fits into main memory



RainForest: Example
Training data
Id   age      income   class
1    young    65       G
2    young    15       B
3    young    75       G
4    senior   40       B
5    senior   100      G
6    senior   60       G

AVC set "age" for N1
value    class   count
young    B       1
young    G       2
senior   B       1
senior   G       2

AVC set "income" for N1
value   class   count
15      B       1
40      B       1
60      G       1
65      G       1
75      G       1
100     G       1

Split of N1: age = young leads to N2, age = senior leads to N3

AVC set "income" for N2
value   class   count
15      B       1
65      G       1
75      G       1

AVC set "age" for N2
value   class   count
young   B       1
young   G       2
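
An AVC group is just a set of counters over (attribute value, class) pairs, so it can be built in one sequential scan. The following sketch reproduces the AVC sets of the example; the names `avc_group`, `ATTRIBUTES` and `CLASS_COLUMN` are illustrative assumptions.

```python
from collections import Counter

training_data = [  # (id, age, income, class)
    (1, "young", 65, "G"), (2, "young", 15, "B"), (3, "young", 75, "G"),
    (4, "senior", 40, "B"), (5, "senior", 100, "G"), (6, "senior", 60, "G"),
]
ATTRIBUTES = {"age": 1, "income": 2}   # attribute name -> column index
CLASS_COLUMN = 3

def avc_group(records):
    """AVC group of a node: one AVC set (value, class) -> count per attribute."""
    group = {name: Counter() for name in ATTRIBUTES}
    for record in records:                         # single sequential scan
        for name, column in ATTRIBUTES.items():
            group[name][(record[column], record[CLASS_COLUMN])] += 1
    return group

n1 = avc_group(training_data)
n2 = avc_group(r for r in training_data if r[1] == "young")
```

The size of an AVC set depends on the number of distinct attribute values and classes, not on the number of training objects, which is why it is often much smaller than the corresponding attribute list.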


RainForest: Algorithms

„ Assumption
„ The entire AVC group of the root node fits into main memory

„ Then, the AVC groups of each node also fit into main memory

„ Algorithm RF_Write
„ Construction of the AVC group of node k in main memory by
sequential scan over the training set
„ Determination of the optimal split for node k by using the AVC
group
„ Reading the training set and distribution (writing) to the
partitions

→ training set is read twice and written once



RainForest: Algorithms

„ Algorithm RF_Read
„ Avoids explicit writing of the partitions to secondary storage
„ Reading of desired partitions from the entire training data set
„ Simultaneous creation of AVC groups for as many partitions as possible
„ Training database is read for each tree level multiple times

„ Algorithm RF_Hybrid
„ Usage of RF_Read as long as the AVC groups of all nodes from the current level of the decision tree fit into main memory
„ Subsequent materialization of the partitions by using RF_Write


RainForest: Experimental Evaluation


[Figure: runtime (in seconds, up to 20,000) over the number of training objects (in millions, 1.0 to 3.0) for SPRINT and RainForest]

„ for all RainForest algorithms, the runtime increases linearly with the number n of training objects
„ RainForest is significantly more efficient than SPRINT
Boosting and Bagging

„ Techniques to increase classification accuracy


„ Bagging
„ Basic idea: Learn a set of classifiers and decide the class prediction by following the majority of the individual votes
„ Boosting
„ Basic idea: Learn a series of classifiers, where each classifier in the series pays more attention to the examples misclassified by its predecessor
„ Applicable to decision trees or Bayesian classifiers


Boosting: Algorithm

„ Algorithm
„ Assign every example an equal weight 1/N
„ For t = 1, 2, …, T do
„ Obtain a hypothesis (classifier) h(t) under w(t)
„ Calculate the error of h(t) and re-weight the examples based on the error
„ Normalize w(t+1) to sum to 1.0
„ Output a weighted sum of all the hypotheses, with each hypothesis weighted according to its accuracy on the training set
„ Boosting requires only linear time and constant space
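
The loop above leaves open how the weights are updated and how the hypotheses are combined. The sketch below fills these gaps with the AdaBoost update rule and one-dimensional threshold classifiers ("stumps"); both choices, as well as the toy data, are assumptions made for illustration and are not prescribed by the slide.

```python
import math

def stump(x, t, s):
    """Threshold classifier: predicts s for x <= t and -s otherwise."""
    return s if x <= t else -s

def train_stump(xs, ys, w):
    """Stump with the smallest weighted training error under the weights w."""
    best = None
    for t in sorted(set(xs)):
        for s in (+1, -1):
            err = sum(wi for x, y, wi in zip(xs, ys, w) if stump(x, t, s) != y)
            if best is None or err < best[0]:
                best = (err, t, s)
    return best

def boost(xs, ys, rounds):
    n = len(xs)
    w = [1.0 / n] * n                            # equal initial weights 1/N
    ensemble = []                                # one (alpha, t, s) per round
    for _ in range(rounds):
        err, t, s = train_stump(xs, ys, w)       # hypothesis h(t) under w(t)
        err = min(max(err, 1e-10), 1 - 1e-10)    # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)  # weight of this hypothesis
        # Misclassified examples (y * h(x) = -1) gain weight, the others lose weight.
        w = [wi * math.exp(-alpha * y * stump(x, t, s))
             for x, y, wi in zip(xs, ys, w)]
        total = sum(w)
        w = [wi / total for wi in w]             # normalize w(t+1) to sum to 1.0
        ensemble.append((alpha, t, s))
    return ensemble

def predict(ensemble, x):
    """Sign of the weighted sum of all hypotheses."""
    score = sum(alpha * stump(x, t, s) for alpha, t, s in ensemble)
    return 1 if score >= 0 else -1

xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
ensemble = boost(xs, ys, rounds=3)
training_errors = sum(predict(ensemble, x) != y for x, y in zip(xs, ys))
```

No single stump classifies this data set correctly, but the weighted vote of three stumps does, which illustrates how the re-weighting directs later hypotheses to the examples missed earlier.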




Neural Networks

„ Advantages
„ prediction accuracy is generally high

„ robust, works when training examples contain errors

„ output may be discrete, real-valued, or a vector of several discrete or real-valued attributes


„ fast evaluation of the learned target function

„ Criticism
„ long training time

„ difficult to understand the learned function (weights), no explicit knowledge generated


„ not easy to incorporate domain knowledge



A Neuron
[Figure: a neuron. The components x1, …, xn of the input vector x are multiplied by the components w1, …, wn of the weight vector w and combined with the bias µk in a weighted sum Σ; the activation function f maps this sum to the output y]

„ The n-dimensional input vector x = (x1, x2, …, xn) is mapped into variable y by means of the scalar product and a nonlinear function mapping

Network Training

„ The ultimate objective of training
„ obtain a set of weights that classifies almost all the tuples in the training data correctly
„ Steps
„ Initialize weights with random values
„ Feed the input tuples into the network one by one
„ For each unit
„ Compute the net input to the unit as a linear combination of all the inputs to the unit
„ Compute the output value using the activation function
„ Compute the error
„ Update the weights and the bias
Multi-Layer Perceptron

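[Figure: multi-layer perceptron. The input vector x_i enters at the input nodes, which are connected by weights w_ij to the hidden nodes; the hidden nodes are connected to the output nodes, which produce the output vector]

Net input and output of a unit j:
  $I_j = \sum_i w_{ij} O_i + \theta_j$
  $O_j = \frac{1}{1 + e^{-I_j}}$
Error of an output node j (with target value T_j):
  $Err_j = O_j (1 - O_j)(T_j - O_j)$
Error of a hidden node j:
  $Err_j = O_j (1 - O_j) \sum_k Err_k w_{jk}$
Update of weights and biases (with learning rate l):
  $w_{ij} = w_{ij} + (l) Err_j O_i$
  $\theta_j = \theta_j + (l) Err_j$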
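
The formulas on this slide can be turned into a small training step. The sketch below uses one hidden layer; the network size, the initial weights, the learning rate l = 0.5 and the single training tuple are arbitrary assumptions chosen only to make the example reproducible.

```python
import math

def sigmoid(i):
    return 1.0 / (1.0 + math.exp(-i))

def forward(x, w_hidden, theta_hidden, w_out, theta_out):
    """w_hidden[i][j]: input i -> hidden j; w_out[j][k]: hidden j -> output k."""
    hidden = [sigmoid(sum(w_hidden[i][j] * x[i] for i in range(len(x))) + theta_hidden[j])
              for j in range(len(theta_hidden))]
    out = [sigmoid(sum(w_out[j][k] * hidden[j] for j in range(len(hidden))) + theta_out[k])
           for k in range(len(theta_out))]
    return hidden, out

def backprop_step(x, target, w_hidden, theta_hidden, w_out, theta_out, l=0.5):
    """One update for a single training tuple; modifies the weights in place."""
    hidden, out = forward(x, w_hidden, theta_hidden, w_out, theta_out)
    # Errors are computed with the old weights before any update is applied.
    err_out = [o * (1 - o) * (t - o) for o, t in zip(out, target)]
    err_hidden = [h * (1 - h) * sum(err_out[k] * w_out[j][k] for k in range(len(out)))
                  for j, h in enumerate(hidden)]
    for j in range(len(hidden)):
        for k in range(len(out)):
            w_out[j][k] += l * err_out[k] * hidden[j]
    for k in range(len(out)):
        theta_out[k] += l * err_out[k]
    for i in range(len(x)):
        for j in range(len(hidden)):
            w_hidden[i][j] += l * err_hidden[j] * x[i]
    for j in range(len(hidden)):
        theta_hidden[j] += l * err_hidden[j]

w_hidden = [[0.2, -0.3], [0.4, 0.1]]
theta_hidden = [-0.4, 0.2]
w_out = [[-0.3], [-0.2]]
theta_out = [0.1]
x, target = [1.0, 0.0], [1.0]

before = forward(x, w_hidden, theta_hidden, w_out, theta_out)[1][0]
for _ in range(100):
    backprop_step(x, target, w_hidden, theta_hidden, w_out, theta_out)
after = forward(x, w_hidden, theta_hidden, w_out, theta_out)[1][0]
```

After repeated updates on the same tuple, the output `after` lies closer to the target value than `before`; a real training run would iterate over all training tuples instead.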

Network Pruning and Rule Extraction

„ Network pruning
„ Fully connected network will be hard to articulate
„ N input nodes, h hidden nodes and m output nodes lead to
h⋅(m+N) weights
„ Pruning: Remove some of the links without affecting classification
accuracy of the network
„ Extracting rules from a trained network
„ Discretize activation values; replace individual activation value by
the cluster average maintaining the network accuracy
„ Enumerate the output from the discretized activation values to
find rules between activation value and output
„ Find the relationship between the input and activation value
„ Combine the above two to have rules relating the output to input
Genetic Algorithms
„ GA: based on an analogy to biological evolution
„ Each rule is represented by a string of bits
„ e.g., “If A1 and Not A2 then C2” can be encoded as 100 (the two leftmost bits stand for A1 and A2, the rightmost bit for the class)
„ An initial population is created consisting of randomly generated rules
„ Based on the evolutionary notion of survival of the fittest, a new population is formed that consists of the fittest rules and their offspring
„ The fitness of a rule is represented by its classification accuracy on a set of training examples
„ Offspring are generated by crossover and mutation
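
The genetic operators can be illustrated on bit strings such as 100. In the sketch below, the fitness definition (accuracy on the examples covered by the rule), the fixed crossover point and the fixed mutation position are assumptions; a real genetic algorithm would choose the latter two at random.

```python
def crossover(parent_a, parent_b, point):
    """Single-point crossover: the two parents swap their tails."""
    return parent_a[:point] + parent_b[point:], parent_b[:point] + parent_a[point:]

def mutate(rule, position):
    """Invert the bit at the given position."""
    flipped = "1" if rule[position] == "0" else "0"
    return rule[:position] + flipped + rule[position + 1:]

def fitness(rule, examples):
    """Accuracy of the rule on the training examples it covers (0.0 if none)."""
    covered = [e for e in examples if e[:-1] == rule[:-1]]   # same attribute bits
    if not covered:
        return 0.0
    return sum(e[-1] == rule[-1] for e in covered) / len(covered)

def next_generation(population, examples):
    """Keep the two fittest rules and add two offspring derived from them."""
    ranked = sorted(population, key=lambda r: fitness(r, examples), reverse=True)
    parents = ranked[:2]
    child_a, child_b = crossover(parents[0], parents[1], 1)
    return parents + [mutate(child_a, 2), child_b]

# Training examples in the same encoding: bits for A1 and A2, then the class bit.
examples = ["100", "100", "101", "011"]
population = ["100", "011", "101", "110"]
new_population = next_generation(population, examples)
```

The population size stays constant here; the two least fit rules are replaced by offspring of the two fittest ones.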


Rough Set Approach


„ Rough sets are used to approximately or “roughly” define equivalence classes
„ A rough set for a given class C is approximated by two sets: a lower approximation (objects certain to be in C) and an upper approximation (objects that cannot be described as not belonging to C)
„ Finding the minimal subsets (reducts) of attributes (for feature reduction) is NP-hard, but a discernibility matrix can be used to reduce the computational effort



Fuzzy Set Approaches

„ Fuzzy logic uses truth values between 0.0 and 1.0 to represent the degree of membership (e.g., by means of a fuzzy membership graph)
„ Attribute values are converted to fuzzy values
„ e.g., income is mapped into the discrete categories {low, medium, high} with fuzzy values calculated
„ For a given new sample, more than one fuzzy value may apply
„ Each applicable rule contributes a vote for membership in the categories
„ Typically, the truth values for each predicted category are summed

© and acknowledgements: Prof. Dr. Hans-Peter Kriegel and Matthias Schubert (LMU Munich)
and Dr. Thorsten Joachims (U Dortmund and Cornell U)

Support Vector Machines (SVM)

Motivation: Linear Separation

„ Vectors in ℜ^d represent objects
„ Objects belong to exactly one of two respective classes
„ For the sake of simpler formulas, the class labels used are y = –1 and y = +1
„ Classification by linear separation: determine the hyperplane which separates both vector sets with "maximal stability"
„ Assign unknown elements to the halfspace in which they reside

[Figure: two sets of vectors divided by a separating hyperplane]



Support Vector Machines

„ Problems of linear separation
„ Definition and efficient determination of the maximum stable hyperplane
„ Classes are not always linearly separable
„ Computation of selected hyperplanes is very expensive
„ Restriction to two classes
„ …

„ Approach to solve these problems
„ Support Vector Machines (SVMs) [Vapnik 1979, 1995]


Maximum Margin Hyperplane


„ Observation: There is no unique hyperplane to separate p1 from p2
„ Question: which hyperplane separates the classes best?

[Figure: two different hyperplanes, each separating the classes p1 and p2]

„ Criteria
„ Stability at insertion
„ Distance to the objects of both classes



Support Vector Machines: Principle

„ Basic idea: Linear separation with the Maximum Margin Hyperplane (MMH)
„ Distance to points from any of the two sets is maximal, i.e. at least ξ
„ Minimal probability that the separating hyperplane has to be moved due to an insertion
„ Best generalization behaviour
„ MMH is "maximally stable"
„ MMH only depends on points pi whose distance to the hyperplane is exactly ξ
„ pi is called a support vector

[Figure: maximum margin hyperplane between the classes p1 and p2 with margin ξ on both sides]


Maximum Margin Hyperplane


„ Recall some algebraic notions for feature space FS
„ Inner product of two vectors $\vec{x}, \vec{y} \in FS$: $\langle \vec{x}, \vec{y} \rangle$
„ e.g., canonical scalar product: $\langle \vec{x}, \vec{y} \rangle = \sum_{i=1}^{d} (x_i \cdot y_i)$
„ Hyperplane H(w,b) with normal vector w and value b:
  $H(\vec{w}, b) = \{ \vec{x} \in FS \mid \langle \vec{w}, \vec{x} \rangle + b = 0 \}$
„ Distance of a vector x to the hyperplane H(w,b):
  $dist(\vec{x}, H(\vec{w}, b)) = \frac{1}{\sqrt{\langle \vec{w}, \vec{w} \rangle}} \cdot (\langle \vec{w}, \vec{x} \rangle + b)$
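
As a small numeric illustration of the distance formula, the following sketch uses the canonical scalar product; the vectors and the value b are made-up example values. Note that the formula yields a signed value whose sign indicates the halfspace.

```python
import math

def inner(x, y):
    """Canonical scalar product <x, y>."""
    return sum(xi * yi for xi, yi in zip(x, y))

def distance(x, w, b):
    """Signed distance of x to the hyperplane H(w, b): (<w, x> + b) / sqrt(<w, w>)."""
    return (inner(w, x) + b) / math.sqrt(inner(w, w))

w, b = (3.0, 4.0), -5.0
d_positive = distance((3.0, 4.0), w, b)   # (9 + 16 - 5) / 5
d_negative = distance((0.0, 0.0), w, b)   # -5 / 5
d_on_plane = distance((3.0, -1.0), w, b)  # (9 - 4 - 5) / 5
```

Points with a positive value lie in the halfspace the normal vector w points to, points with a negative value in the opposite one, and points with value 0 on the hyperplane itself.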
Computation of the
Maximum Margin Hyperplane

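„ Two assumptions for classifying xi (class 1: yi = +1, class 2: yi = –1):

1) The classification error is zero
  $y_i = -1 \Rightarrow \langle \vec{w}, \vec{x}_i \rangle + b < 0$ and $y_i = +1 \Rightarrow \langle \vec{w}, \vec{x}_i \rangle + b > 0$
  $\Leftrightarrow y_i \cdot (\langle \vec{w}, \vec{x}_i \rangle + b) > 0$
2) The margin is maximal
„ Let ξ denote the minimum distance of any training object xi to the hyperplane H(w,b):
  $\xi = \min_{\vec{x}_i \in TR} \frac{1}{\sqrt{\langle \vec{w}, \vec{w} \rangle}} \cdot (\langle \vec{w}, \vec{x}_i \rangle + b)$
„ Then: Maximize ξ subject to
  $y_i \cdot \frac{1}{\sqrt{\langle \vec{w}, \vec{w} \rangle}} \cdot (\langle \vec{w}, \vec{x}_i \rangle + b) \geq \xi$ for $i \in [1..n]$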

Maximum Margin Hyperplane


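„ Maximize ξ subject to $\forall i \in [1..n]$: $y_i \cdot \frac{1}{\sqrt{\langle \vec{w}, \vec{w} \rangle}} \cdot (\langle \vec{w}, \vec{x}_i \rangle + b) \geq \xi$
„ Let $\xi = \frac{1}{\sqrt{\langle \vec{w}, \vec{w} \rangle}}$ and reformulate the condition:
  $\forall i \in [1..n]$: $y_i \cdot \xi \cdot (\langle \vec{w}, \vec{x}_i \rangle + b) \geq \xi$
  $\forall i \in [1..n]$: $y_i \cdot (\langle \vec{w}, \vec{x}_i \rangle + b) \geq 1$
„ Maximization of $\frac{1}{\sqrt{\langle \vec{w}, \vec{w} \rangle}}$ corresponds to a minimization of $\langle \vec{w}, \vec{w} \rangle$

Primary optimization problem:
  Find a vector $\vec{w}$ that minimizes $\langle \vec{w}, \vec{w} \rangle$
  subject to $\forall i \in [1..n]$: $y_i \cdot (\langle \vec{w}, \vec{x}_i \rangle + b) \geq 1$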
Dual Optimization Problem

„ For computational purposes, transform the primary optimization problem into a dual one by using Lagrange multipliers

Dual optimization problem: Find parameters αi that
  maximize $L(\vec{\alpha}) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \cdot \alpha_j \cdot y_i \cdot y_j \cdot \langle \vec{x}_i, \vec{x}_j \rangle$
  subject to $\sum_{i=1}^{n} \alpha_i \cdot y_i = 0$ and $0 \leq \alpha_i$
„ For the solution, use algorithms from optimization theory


„ Up to now only linearly separable data
„ If data is not linearly separable: Soft Margin Optimization


Soft Margin Optimization


„ Problem of Maximum Margin Optimization: How to treat non-linearly separable data?
„ Two typical problems:
„ data points are not separable
„ complete separation is not optimal
„ Trade-off between training error and size of margin

Soft Margin Optimization
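„ Additionally regard the number of training errors when optimizing:
„ ξi is the distance from pi to the margin (often called slack variable)
„ C controls the influence of single training vectors

[Figure: classes p1 and p2 with the slack distances ξ1 and ξ2 of individual points to the margin]

Primary optimization problem with soft margin:
  Find a $\vec{w}$ that minimizes $\frac{1}{2} \langle \vec{w}, \vec{w} \rangle + C \cdot \sum_{i=1}^{n} \xi_i$
  subject to $\forall i \in [1..n]$: $y_i \cdot (\langle \vec{w}, \vec{x}_i \rangle + b) \geq 1 - \xi_i$ and $\xi_i \geq 0$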

Soft Margin Optimization


Dual optimization problem with Lagrange multipliers:
  Maximize $L(\vec{\alpha}) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \cdot \alpha_j \cdot y_i \cdot y_j \cdot \langle \vec{x}_i, \vec{x}_j \rangle$
  subject to $\sum_{i=1}^{n} \alpha_i \cdot y_i = 0$ and $0 \leq \alpha_i \leq C$

0 < αi < C: pi is a support vector with ξi = 0
αi = C: pi is a support vector with ξi > 0
αi = 0: pi is no support vector

[Figure: classes p1 and p2 with the slack distances ξ1 and ξ2]

Decision rule:
  $h(\vec{x}) = sign\left( \sum_{\vec{x}_i \in SV} \alpha_i \cdot y_i \cdot \langle \vec{x}_i, \vec{x} \rangle + b \right)$
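
The decision rule of the soft margin SVM only needs the support vectors, their multipliers αi and the value b. The sketch below applies it to a hand-made toy problem with two support vectors; the values αi = 0.25 and b = 0 were derived by hand for this toy data (they give w = (0.5, 0.5) and satisfy yi ⋅ (⟨w, xi⟩ + b) = 1 for both points) and are not the output of a solver. Returning +1 for a score of exactly 0 is an arbitrary convention.

```python
def inner(x, y):
    """Canonical scalar product <x, y>."""
    return sum(xi * yi for xi, yi in zip(x, y))

def decide(x, support_vectors, b, kernel=inner):
    """h(x) = sign(sum_i alpha_i * y_i * <x_i, x> + b) over the support vectors."""
    score = sum(alpha * y * kernel(x_i, x) for x_i, y, alpha in support_vectors) + b
    return 1 if score >= 0 else -1

# Toy model: (support vector x_i, class label y_i, multiplier alpha_i).
support_vectors = [((1.0, 1.0), +1, 0.25), ((-1.0, -1.0), -1, 0.25)]
b = 0.0

label_a = decide((2.0, 0.0), support_vectors, b)    # score = 0.5 + 0.5 = 1.0
label_b = decide((-3.0, 1.0), support_vectors, b)   # score = -0.5 - 0.5 = -1.0
```

The `kernel` parameter defaults to the scalar product; passing another function in its place is a hypothetical extension point and changes nothing else in the rule.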
Kernel Machines:
Non-Linearly Separable Data Sets
„ Problem: For real data sets, a linear separation with a high
classification accuracy often is not possible
„ Idea: Transform the data non-linearly into a new space, and try to
separate the data in the new space linearly (extension of the
hypotheses space)

Example for a quadratically separable data set


Kernel Machines:
Extension of the Hypotheses Space

„ Principle
  $\phi$: input space $\rightarrow$ extended feature space
„ Try to separate in the extended feature space linearly
„ Example
  $(x, y, z) \xrightarrow{\phi} (x, y, z, x^2, xy, xz, y^2, yz, z^2)$
„ Here: a hyperplane in the extended feature space is a polynomial of degree 2 in the input space



Kernel Machines: Example

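Input space (2 attributes):
  $\vec{x} = (x_1, x_2)$
Extended space (6 attributes):
  $\phi(\vec{x}) = (x_1^2, x_2^2, \sqrt{2} \cdot x_1, \sqrt{2} \cdot x_2, \sqrt{2} \cdot x_1 \cdot x_2, 1)$

[Figure: the data set plotted over the axes x1 and x2 in the input space and over the axes x1^2 and x2 in the extended space]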


Kernel Machines: Example (2)

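Input space (2 attributes):
  $\vec{x} = (x_1, x_2)$
Extended space (3 attributes):
  $\phi(\vec{x}) = (x_1^2, x_2^2, \sqrt{2} \cdot x_1 \cdot x_2)$

[Figure: the data set plotted over the axes x1 and x2 in the input space and over the axes x1^2 and x2^2 in the extended space]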



Kernel Machines
„ Introduction of a kernel corresponds to a feature transformation
  $\phi(\vec{x}): FS_{old} \rightarrow FS_{new}$

Dual optimization problem:
  Maximize $L(\vec{\alpha}) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \cdot \alpha_j \cdot y_i \cdot y_j \cdot \langle \phi(\vec{x}_i), \phi(\vec{x}_j) \rangle$
  subject to $\sum_{i=1}^{n} \alpha_i \cdot y_i = 0$ and $0 \leq \alpha_i \leq C$

„ Feature transform φ only affects the scalar product of training vectors
„ Kernel K is a function: $K_\phi(\vec{x}_i, \vec{x}_j) = \langle \phi(\vec{x}_i), \phi(\vec{x}_j) \rangle$


Kernel Machines: Examples

Radial basis kernel:
  $K(\vec{x}, \vec{y}) = \exp(-\gamma \cdot |\vec{x} - \vec{y}|^2)$
Polynomial kernel of degree d (the figure shows degree 2):
  $K(\vec{x}, \vec{y}) = (\langle \vec{x}, \vec{y} \rangle + 1)^d$
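
The point of a kernel is that the scalar product in the extended space can be computed without constructing φ(x). The sketch below checks this numerically for the polynomial kernel of degree 2 on two-dimensional inputs; the explicit map `phi` with the factors √2 is the standard feature map for this kernel and is stated here as an assumption, and the sample vectors are arbitrary.

```python
import math

def inner(x, y):
    """Canonical scalar product <x, y>."""
    return sum(xi * yi for xi, yi in zip(x, y))

def phi(x):
    """Explicit feature map for the polynomial kernel of degree 2 (2 -> 6 attributes)."""
    x1, x2 = x
    r2 = math.sqrt(2.0)
    return (x1 * x1, x2 * x2, r2 * x1, r2 * x2, r2 * x1 * x2, 1.0)

def poly_kernel(x, y, d=2):
    return (inner(x, y) + 1.0) ** d

def rbf_kernel(x, y, gamma=1.0):
    return math.exp(-gamma * sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

x, y = (1.0, 2.0), (3.0, -1.0)
via_kernel = poly_kernel(x, y)          # (<x, y> + 1)^2 = (1 + 1)^2
via_phi = inner(phi(x), phi(y))         # scalar product in the extended space
```

Both values agree up to rounding, while the kernel needs only a scalar product in the two-dimensional input space. For the radial basis kernel no finite-dimensional `phi` can be written down, so the kernel form is the only practical option.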



Support Vector Machines: Discussion

+ generate classifiers with a high classification accuracy
+ relatively weak tendency to overfit (generalization theory)
+ efficient classification of new objects
+ compact models

– training times may be long (appropriate feature space may be very high-dimensional)
– expensive implementation
– resulting models rarely provide an intuition


What Is Prediction?

„ Prediction is similar to classification


„ First, construct a model
„ Second, use model to predict unknown value
„ Major method for prediction is regression
„ Linear and multiple regression
„ Non-linear regression
„ Prediction is different from classification
„ Classification predicts categorical class labels
„ Prediction models continuous-valued functions



Predictive Modeling in Databases
„ Predictive modeling: Predict data values or construct
generalized linear models based on the database data.
„ One can only predict value ranges or category distributions
„ Method outline:
„ Minimal generalization
„ Attribute relevance analysis
„ Generalized linear model construction
„ Prediction
„ Determine the major factors which influence the prediction
„ Data relevance analysis: uncertainty measurement,
entropy analysis, expert judgement, etc.
„ Multi-level prediction: drill-down and roll-up analysis

Regression Analysis and Log-Linear
Models in Prediction
„ Linear regression: Y = α + β X
„ Two parameters, α and β, specify the line and are to be estimated by using the data at hand.
„ The estimation applies the least squares criterion to the known values of Y1, Y2, …, X1, X2, …
„ Multiple regression: Y = b0 + b1 X1 + b2 X2
„ Many nonlinear functions can be transformed into the above.
„ Log-linear models:
„ The multi-way table of joint probabilities is approximated by a product of lower-order tables.
„ Probability: $p(a, b, c, d) = \alpha_{ab} \beta_{ac} \chi_{ad} \delta_{bcd}$
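
For the simple linear regression Y = α + β X, the least squares estimates have a closed form. The sketch below uses the usual formulas β = Σ (xi − x̄)(yi − ȳ) / Σ (xi − x̄)² and α = ȳ − β x̄; the function name `fit_line` and the sample values are illustrative assumptions.

```python
def fit_line(xs, ys):
    """Least squares estimates (alpha, beta) for the line Y = alpha + beta * X."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    beta = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
            / sum((x - mean_x) ** 2 for x in xs))
    alpha = mean_y - beta * mean_x
    return alpha, beta

xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]          # generated from Y = 1 + 2 X without noise
alpha, beta = fit_line(xs, ys)
prediction = alpha + beta * 5.0    # predicted value for X = 5
```

With noise-free data the estimates recover the generating line exactly; with real data they give the line with the smallest sum of squared residuals.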
Locally Weighted Regression
„ Construct an explicit approximation to f over a local region surrounding query instance xq
„ Locally weighted linear regression:
„ The target function f is approximated near xq using the linear function:
  $\hat{f}(x) = w_0 + w_1 a_1(x) + \cdots + w_n a_n(x)$
„ minimize the squared error, weighted by a distance-decreasing weight K:
  $E(x_q) \equiv \frac{1}{2} \sum_{x \in nearest\_neighbors(x_q, k)} (f(x) - \hat{f}(x))^2 \cdot K(d(x_q, x))$
„ the gradient descent training rule:
  $\Delta w_j \equiv \eta \sum_{x \in nearest\_neighbors(x_q, k)} K(d(x_q, x)) \cdot (f(x) - \hat{f}(x)) \cdot a_j(x)$
„ In most cases, the target function is approximated by a constant, linear, or quadratic function.
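
The gradient descent rule can be sketched for a single attribute a1(x) = x. The kernel K(d) = exp(−d²), the number of neighbors k = 3, the learning rate η = 0.1, the number of steps and the sample data are all assumptions chosen so that the iteration converges on this small example.

```python
import math

def lwr_predict(xq, data, k=3, eta=0.1, steps=2000):
    """Locally weighted linear regression at query xq; data is a list of (x, f(x))."""
    neighbors = sorted(data, key=lambda p: abs(p[0] - xq))[:k]   # k nearest neighbors
    kern = [math.exp(-(x - xq) ** 2) for x, _ in neighbors]      # K(d(xq, x))
    w0, w1 = 0.0, 0.0
    for _ in range(steps):
        residuals = [fx - (w0 + w1 * x) for x, fx in neighbors]  # f(x) - f_hat(x)
        # Training rule with a_0(x) = 1 for w0 and a_1(x) = x for w1.
        dw0 = eta * sum(kv * r for kv, r in zip(kern, residuals))
        dw1 = eta * sum(kv * r * x for kv, r, (x, _) in zip(kern, residuals, neighbors))
        w0, w1 = w0 + dw0, w1 + dw1
    return w0 + w1 * xq

data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0)]   # f(x) = 2x + 1
prediction = lwr_predict(2.0, data)
```

Only the k nearest neighbors of the query enter the fit, and closer neighbors receive a larger weight; a new local model is fitted for every query instance.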

Prediction: Numerical Data



Prediction: Categorical Data


Chapter 7 – Conclusions

„ Classification is an extensively studied problem (mainly in statistics, machine learning & neural networks)
„ Classification is probably one of the most widely used
data mining techniques with a lot of extensions
„ Scalability is still an important issue for database
applications: thus combining classification with database
techniques should be a promising topic
„ Research directions: classification of non-relational data,
e.g., text, spatial, multimedia, etc.



References (I)
„ C. Apte and S. Weiss. Data mining with decision trees and decision rules. Future
Generation Computer Systems, 13, 1997.
„ L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees.
Wadsworth International Group, 1984.
„ P. K. Chan and S. J. Stolfo. Learning arbiter and combiner trees from partitioned data
for scaling machine learning. In Proc. 1st Int. Conf. Knowledge Discovery and Data
Mining (KDD'95), pages 39-44, Montreal, Canada, August 1995.
„ U. M. Fayyad. Branching on attribute values in decision tree generation. In Proc. 1994
AAAI Conf., pages 601-606, AAAI Press, 1994.
„ J. Gehrke, R. Ramakrishnan, and V. Ganti. RainForest: A framework for fast decision
tree construction of large datasets. In Proc. 1998 Int. Conf. Very Large Data Bases,
pages 416-427, New York, NY, August 1998.
„ T. Joachims. Learning to Classify Text Using Support Vector Machines. Kluwer, 2002.
„ M. Kamber, L. Winstone, W. Gong, S. Cheng, and J. Han. Generalization and decision
tree induction: Efficient classification in data mining. In Proc. 1997 Int. Workshop
Research Issues on Data Engineering (RIDE'97), pages 111-120, Birmingham,
England, April 1997.

References (II)
„ J. Magidson. The CHAID approach to segmentation modeling: Chi-squared automatic
interaction detection. In R. P. Bagozzi, editor, Advanced Methods of Marketing Research,
pages 118-159. Blackwell Business, Cambridge, Massachusetts, 1994.
„ M. Mehta, R. Agrawal, and J. Rissanen. SLIQ: A fast scalable classifier for data mining.
In Proc. 1996 Int. Conf. Extending Database Technology (EDBT'96), Avignon, France,
March 1996.
„ S. K. Murthy. Automatic construction of decision trees from data: A multi-disciplinary
survey. Data Mining and Knowledge Discovery, 2(4):345-389, 1998.
„ J. R. Quinlan. Bagging, boosting, and C4.5. In Proc. 13th Natl. Conf. on Artificial
Intelligence (AAAI'96), pages 725-730, Portland, OR, Aug. 1996.
„ R. Rastogi and K. Shim. PUBLIC: A decision tree classifier that integrates building and
pruning. In Proc. 1998 Int. Conf. Very Large Data Bases, pages 404-415, New York, NY,
August 1998.
„ J. Shafer, R. Agrawal, and M. Mehta. SPRINT: A scalable parallel classifier for data
mining. In Proc. 1996 Int. Conf. Very Large Data Bases, pages 544-555, Bombay, India,
Sept. 1996.
„ S. M. Weiss and C. A. Kulikowski. Computer Systems that Learn: Classification and
Prediction Methods from Statistics, Neural Nets, Machine Learning, and Expert Systems.
Morgan Kaufmann, 1991.
