
Ing. Leonel D. Rozo C., M.Sc., PhD(c)
ing.leonelrozo@gmail.com

2010
Decision tree learning is one of the most widely used and practical
methods for inductive inference. It is a method for approximating
discrete-valued functions that is robust to noisy data and capable of
learning disjunctive expressions.
1. Decision tree representation

 Decision trees classify instances by sorting them down the tree from the root to some leaf node, which provides the classification of the instance.

 Each node in the tree specifies a test of some attribute of the instance, and each branch descending from that node corresponds to one of the possible values for this attribute.
2. Appropriate problems for decision tree learning

• Instances are represented by attribute-value pairs - Instances are described by a fixed set of attributes (e.g., Temperature) and their values (e.g., Hot).

• The target function has discrete output values - The decision tree
assigns a boolean classification (e.g., yes or no) to each
example.

• Disjunctive descriptions may be required.

• The training data may contain errors.

• The training data may contain missing attribute values.


3. The basic decision tree learning algorithm

1. Which attribute should be tested at the root of the tree?

2. The best attribute is selected and used as the test at the root
node of the tree.

3. A descendant of the root node is then created for each possible value of this attribute, and the training examples are sorted to the appropriate descendant node.

4. The entire process is then repeated using the training examples associated with each descendant node to select the best attribute to test at that point in the tree.
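The four steps above are the core of the ID3 algorithm. A minimal Python sketch, assuming examples are dicts mapping attribute names to values, and using the information-gain attribute selector that section 3.1 defines:

```python
import math

def entropy(labels):
    """Entropy of a list of class labels."""
    return -sum((labels.count(c) / len(labels)) * math.log2(labels.count(c) / len(labels))
                for c in set(labels))

def gain(examples, labels, a):
    """Expected reduction in entropy from splitting on attribute a."""
    rem = 0.0
    for v in {ex[a] for ex in examples}:
        sub = [lab for ex, lab in zip(examples, labels) if ex[a] == v]
        rem += len(sub) / len(labels) * entropy(sub)
    return entropy(labels) - rem

def id3(examples, labels, attributes):
    """Recursive tree construction: pick the best attribute, branch on its values."""
    if len(set(labels)) == 1:            # all examples share one class: leaf node
        return labels[0]
    if not attributes:                   # no attributes left: majority-class leaf
        return max(set(labels), key=labels.count)
    best = max(attributes, key=lambda a: gain(examples, labels, a))
    return {"attr": best,
            "branches": {v: id3([ex for ex in examples if ex[best] == v],
                                [lab for ex, lab in zip(examples, labels)
                                 if ex[best] == v],
                                [a for a in attributes if a != best])
                         for v in {ex[best] for ex in examples}}}
```

This is a sketch, not a full implementation: it has no handling of missing values or of attribute values unseen at a node.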

3.1. Which attribute is the best classifier?

The central choice in the algorithm is selecting which attribute to test at each node in the tree. We would like to select the attribute that is most useful for classifying examples.

 Entropy measures homogeneity of examples

A measure commonly used in information theory, called entropy, characterizes the (im)purity of an arbitrary collection of examples.

Given a collection S, containing positive and negative examples of some target concept, the entropy of S relative to this boolean classification is:

Entropy(S) = -p⊕ log2 p⊕ - p⊖ log2 p⊖

where p⊕ is the proportion of positive examples in S and p⊖ is the proportion of negative examples in S (with 0 log2 0 taken to be 0).
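In Python, this measure can be sketched as a small function over the positive and negative example counts:

```python
import math

def entropy(pos, neg):
    """Entropy of a collection with pos positive and neg negative examples."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count > 0:                 # 0 * log2(0) is taken to be 0
            p = count / total
            result -= p * math.log2(p)
    return result
```

A pure collection (all one class) has entropy 0 and a 50/50 split has entropy 1; for example, entropy(9, 5) is about 0.940.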

 Information gain measures the expected reduction in entropy


Information gain is simply the expected reduction in entropy caused by partitioning the examples according to this attribute. More precisely, the information gain, Gain(S, A), of an attribute A relative to a collection of examples S, is defined as:

Gain(S, A) = Entropy(S) - Σ_{v ∈ Values(A)} (|S_v| / |S|) Entropy(S_v)

where Values(A) is the set of all possible values for attribute A, and S_v is the subset of S for which attribute A has value v.
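A sketch of Gain(S, A) in Python, assuming examples are dicts mapping attribute names to values:

```python
import math

def entropy(labels):
    """Entropy of a list of class labels (boolean or otherwise)."""
    return -sum((labels.count(c) / len(labels)) * math.log2(labels.count(c) / len(labels))
                for c in set(labels))

def gain(examples, labels, attribute):
    """Gain(S, A): entropy of S minus the weighted entropy of each subset S_v."""
    remainder = 0.0
    for v in {ex[attribute] for ex in examples}:     # Values(A)
        subset = [lab for ex, lab in zip(examples, labels)
                  if ex[attribute] == v]             # labels of S_v
        remainder += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - remainder
```

For instance, with 14 examples (9 positive, 5 negative) split by a two-valued attribute into subsets of 8 (6 positive, 2 negative) and 6 (3 positive, 3 negative), the gain is about 0.048.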
4. Issues in decision tree learning

4.1. Avoiding overfitting the data

The algorithm described before grows each branch of the tree just deeply enough to perfectly classify the training examples. While this is sometimes a reasonable strategy, in fact it can lead to difficulties when:

 There is noise in the data.

 The number of training examples is too small to produce a representative sample of the true target function.

A hypothesis overfits the training examples if some other hypothesis that fits the training examples less well actually performs better over the entire distribution of instances (i.e., including instances beyond the training set).

There are several approaches to avoiding overfitting in decision tree learning. These can be grouped into two classes:

 Approaches that stop growing the tree earlier, before it reaches the
point where it perfectly classifies the training data.

 Approaches that allow the tree to overfit the data, and then post-prune
the tree.

 Reduced error pruning


Consider each of the decision nodes in the tree to be candidates for
pruning. Pruning a decision node consists of removing the subtree
rooted at that node, making it a leaf node, and assigning it the most
common classification of the training examples affiliated with that
node.

• Nodes are removed only if the resulting pruned tree performs no worse than the original over the validation set.

• Nodes are pruned iteratively, always choosing the node whose removal most increases the decision tree accuracy over the validation set.
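A minimal sketch of this procedure, assuming the tree is a nested dict with keys "attr", "branches", and "majority" (the most common classification of the training examples affiliated with each node):

```python
def classify(tree, ex):
    """Follow branches until a leaf (any non-dict value) is reached."""
    while isinstance(tree, dict):
        tree = tree["branches"][ex[tree["attr"]]]
    return tree

def accuracy(tree, examples, labels):
    return sum(classify(tree, ex) == lab
               for ex, lab in zip(examples, labels)) / len(labels)

def internal_nodes(tree, path=()):
    """Yield the branch-value path to every internal (decision) node."""
    if isinstance(tree, dict):
        yield path
        for v, sub in tree["branches"].items():
            yield from internal_nodes(sub, path + (v,))

def node_at(tree, path):
    for v in path:
        tree = tree["branches"][v]
    return tree

def replaced(tree, path, leaf):
    """Copy of tree with the node at `path` turned into the given leaf."""
    if not path:
        return leaf
    new = dict(tree)
    new["branches"] = dict(tree["branches"])
    new["branches"][path[0]] = replaced(tree["branches"][path[0]], path[1:], leaf)
    return new

def reduced_error_prune(tree, val_examples, val_labels):
    """Greedily prune while the pruned tree performs no worse on validation data."""
    while isinstance(tree, dict):
        best, best_acc = None, accuracy(tree, val_examples, val_labels)
        for path in internal_nodes(tree):
            candidate = replaced(tree, path, node_at(tree, path)["majority"])
            acc = accuracy(candidate, val_examples, val_labels)
            if acc >= best_acc:          # "performs no worse" over the validation set
                best, best_acc = candidate, acc
        if best is None:
            return tree
        tree = best
    return tree
```

Each iteration removes at least one decision node, so the loop terminates; pruning stops as soon as every candidate prune would reduce validation accuracy.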

 Rule post-pruning

i. Infer the decision tree from the training set.

ii. Convert the learned tree into an equivalent set of rules.

iii. Prune (generalize) each rule by removing any preconditions that result in improving its estimated accuracy.

iv. Sort the pruned rules by their estimated accuracy, and consider them
in this sequence when classifying subsequent instances.
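Step ii (converting the tree into rules) can be sketched as follows, assuming the nested-dict tree representation; each root-to-leaf path becomes one rule, a list of (attribute, value) preconditions plus a classification:

```python
def tree_to_rules(tree, conditions=()):
    """Convert each root-to-leaf path of a decision tree into a rule."""
    if not isinstance(tree, dict):                 # leaf: emit one finished rule
        return [(list(conditions), tree)]
    rules = []
    for v, sub in tree["branches"].items():        # extend the precondition list
        rules += tree_to_rules(sub, conditions + ((tree["attr"], v),))
    return rules
```

Once in rule form, each precondition can be tested for removal independently (step iii), which is more fine-grained than removing whole subtrees.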

4.2. Incorporating continuous-valued attributes

This can be accomplished by dynamically defining new discrete-valued attributes that partition the continuous attribute value into a discrete set of intervals.

 In particular, for an attribute A that is continuous-valued, the algorithm can dynamically create a new boolean attribute A_c that is true if A < c and false otherwise.
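A sketch of how candidate thresholds c might be generated, using the common heuristic of taking midpoints between adjacent (sorted) examples whose classifications differ; the Temperature values below are illustrative toy data:

```python
def candidate_thresholds(values, labels):
    """Candidate cut points c: midpoints between adjacent sorted examples
    whose class labels differ."""
    pairs = sorted(zip(values, labels))
    return [(pairs[i][0] + pairs[i + 1][0]) / 2
            for i in range(len(pairs) - 1)
            if pairs[i][1] != pairs[i + 1][1]]

def discretize(value, c):
    """The derived boolean attribute A_c: true when A < c."""
    return value < c

# Toy Temperature data: 40 No, 48 No, 60 Yes, 72 Yes, 80 Yes, 90 No
# yields candidate thresholds at 54.0 and 85.0.
```

Each candidate threshold can then be evaluated with information gain like any other boolean attribute, and the best one kept.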
Many tasks involving intelligence or
pattern recognition are extremely
difficult to automate, but appear to be
performed very easily by animals.

 For instance, animals recognize various objects and make sense out of the large amount of visual information in their surroundings, apparently requiring very little effort.
1. Introduction

The neural network of an animal is part of its nervous system, containing a large number of interconnected neurons (nerve cells).

 “Neural” is an adjective for neuron

 “Network” denotes a graph-like structure.

Artificial neural networks refer to computing systems whose central theme is borrowed from the analogy of biological neural networks.
2. History of neural networks

“The amount of activity at any given point in the brain cortex is the sum
of the tendencies of all other points to discharge into it, such tendencies
being proportionate…” (William James)

1. To the number of times the excitement of other points may have accompanied that of the point in question.

2. To the intensities of such excitements.

3. To the absence of any rival point functionally disconnected with the first point, into which the discharges may be diverted.

1938 Rashevsky initiated studies of neurodynamics, also known as neural field theory, representing activation and propagation in neural networks in terms of differential equations.

1943 McCulloch and Pitts invented the first artificial model for biological neurons using simple binary threshold functions.

1949 Hebb introduced his famous learning rule: repeated activation of one neuron by another, across a particular synapse, increases its conductance.

1954 Gabor invented the “learning filter” that uses gradient descent to obtain “optimal” weights that minimize the MSE between the observed output signal and a signal generated based upon past information.

1958 Rosenblatt invented the “perceptron”, introducing a learning method for the McCulloch and Pitts neuron model.

1960 Widrow and Hoff introduced the “Adaline”.

1961 Rosenblatt proposed the “backpropagation” scheme for training multilayer networks.

1969 The limits of simple perceptrons were demonstrated.


3. Structure and function of a single neuron

3.1. Biological neurons

A typical biological neuron is composed of a cell body, a tubular axon, and a multitude of hair-like dendrites.

The small gap between an end bulb and a dendrite is called a synapse,
across which information is propagated. The axon of a single neuron
forms synaptic connections with many other neurons.

Inhibitory or excitatory signals from other neurons are transmitted to a neuron at its dendrites’ synapses. The magnitude of the signal received by a neuron (from another) depends on the efficiency of the synaptic transmission.

 The cell membrane becomes electrically active when sufficiently excited by the neurons making synapses onto this neuron.

 A neuron will fire if sufficient signals from other neurons fall upon its
dendrites in a short period of time, called the period of latent
summation.

3.2. Artificial neuron models

 The position on the neuron (node) of the incoming synapse (connection) is irrelevant.

 Each node has a single output value, distributed to other nodes via outgoing links, irrespective of their positions.

 All inputs arrive at the same time, or remain activated at the same level long enough for the computation of f to occur.

The next level of specialization is to assume that different weighted inputs are summed.
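Such a weighted-sum node can be sketched as:

```python
def neuron(inputs, weights, f):
    """Compute f(net), where net is the weighted sum of the inputs."""
    net = sum(w * x for w, x in zip(weights, inputs))
    return f(net)

# A simple binary threshold node, in the style of McCulloch and Pitts:
step = lambda net: 1 if net >= 0 else 0
```

The node function f is left as a parameter here; the candidates for f are discussed next.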

Now, it is necessary to establish which function f the neuron uses…

 Ramp functions

 Step functions

 Sigmoid functions

 Piecewise linear and Gaussian functions
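These families of node functions can be sketched as follows; the parameter choices (thresholds, interval endpoints, mean, and width) are illustrative defaults, not prescribed values:

```python
import math

def step(net, threshold=0.0):
    """Step function: output jumps from 0 to 1 at the threshold."""
    return 1.0 if net >= threshold else 0.0

def ramp(net, a=-1.0, b=1.0):
    """Ramp (piecewise linear): linear between a and b, clipped to [0, 1] outside."""
    return min(max((net - a) / (b - a), 0.0), 1.0)

def sigmoid(net):
    """Sigmoid: smooth, bounded, S-shaped."""
    return 1.0 / (1.0 + math.exp(-net))

def gaussian(net, mu=0.0, sigma=1.0):
    """Gaussian: bell-shaped response centered at mu."""
    return math.exp(-((net - mu) ** 2) / (2 * sigma ** 2))
```

The step function is discontinuous, while the sigmoid is differentiable everywhere, which is what gradient-based training schemes require.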


4. Neural net architectures

A single node is insufficient for many practical problems, and networks with a large number of nodes are frequently used. The way nodes are connected determines how computations proceed and constitutes an important early design decision by a neural network developer.

 Fully connected networks



 Layered networks

 Acyclic networks

 Feedforward networks

 Modular networks
5. Neural learning

 Correlation learning

When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A’s efficiency, as one of the cells firing B, is increased.
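Hebb's principle is commonly formalized as a correlation rule, where a weight grows in proportion to the product of the two activities it connects; the learning rate eta below is an assumed parameter:

```python
def hebbian_update(weight, x_pre, x_post, eta=0.1):
    """Hebb's rule sketch: delta_w = eta * (presynaptic activity) * (postsynaptic activity)."""
    return weight + eta * x_pre * x_post
```

When both units are active together the connection strengthens; if either is silent, the weight is unchanged.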

 Competitive learning

Another principle for neural computation is that when an input pattern is presented to a network, different nodes compete to be "winners" with high levels of activity. The competitive process involves self-excitation and mutual inhibition among nodes, until a single winner emerges.

 The connections between input nodes and the winner node are then modified, increasing the likelihood that the same winner continues to win in future competitions.

 The converse of competition is cooperation, found in some neural network models.
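A winner-take-all update can be sketched as follows; only the winning node's weight vector moves toward the input, which makes the same winner likelier for similar future inputs. Choosing the winner by Euclidean distance and the learning rate eta are assumptions of this sketch:

```python
def winner(x, weight_rows):
    """Index of the node whose weight vector is closest to input x."""
    dists = [sum((wi - xi) ** 2 for wi, xi in zip(w, x)) for w in weight_rows]
    return dists.index(min(dists))

def competitive_update(x, weight_rows, eta=0.5):
    """Move only the winner's weights toward x; all other nodes are untouched."""
    k = winner(x, weight_rows)
    weight_rows[k] = [wi + eta * (xi - wi) for wi, xi in zip(weight_rows[k], x)]
    return k
```

Repeated presentations of similar inputs therefore pull one node's weight vector toward the center of that input cluster.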

 Feedback-based weight adaptation

If increasing a particular weight leads to diminished performance or larger error, then that weight is decreased as the network is trained to perform better.

 The amount of change made at every step is very small in most networks, to ensure that a network does not stray too far from its partially evolved state, and so that the network withstands some mistakes made by the teacher, feedback, or performance evaluation mechanism.
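This feedback principle can be sketched with a finite-difference probe of the error, a stand-in for the gradient that real training rules compute analytically; the step sizes eta and h are assumed parameters:

```python
def adapt_weight(w, error_fn, eta=0.01, h=1e-5):
    """If increasing w raises the error, decrease w (and vice versa)."""
    slope = (error_fn(w + h) - error_fn(w - h)) / (2 * h)
    return w - eta * slope    # small step against the direction of increasing error
```

Repeated small steps of this kind drive the weight toward a minimum of the error, as the text describes.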
6. What can neural networks be used for?

 Classification

 Clustering

Clustering requires grouping together objects that are similar to each other…

 Pattern association

In pattern association, another important task that can be performed by neural networks, the presentation of an input sample should trigger the generation of a specific output pattern.

 Function approximation

Many computational models can be described as functions mapping some numerical input vectors to numerical outputs. The outputs corresponding to some input vectors may be known from training data, but we may not know the mathematical function describing the actual process that generates the outputs from the input vectors.

 Forecasting

There are many real-life problems in which future events must be predicted on the basis of past history. An example task is that of predicting the behavior of stock market indices.

 Control applications

Control addresses the task of determining the values for input variables in order to achieve desired values for output variables.
7. Evaluation of networks

 Quality of results

The performance of a neural network is frequently gauged in terms of an error measure.

• Euclidean distance

• Manhattan or Hamming distance

In classification problems, another possible error measure is the fraction of misclassified samples.
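These error measures can be sketched as follows, comparing a desired output vector d with an observed output vector o (for binary vectors, the Manhattan distance coincides with the Hamming distance):

```python
def euclidean(d, o):
    """Euclidean distance between desired and observed output vectors."""
    return sum((di - oi) ** 2 for di, oi in zip(d, o)) ** 0.5

def manhattan(d, o):
    """Manhattan distance; equals the Hamming distance on binary vectors."""
    return sum(abs(di - oi) for di, oi in zip(d, o))

def misclassification_rate(desired, observed):
    """Fraction of samples whose predicted class differs from the desired class."""
    return sum(a != b for a, b in zip(desired, observed)) / len(desired)
```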

 Generalizability

It is not surprising for a system to perform well on the data on which it has been trained. But good generalizability is also necessary, i.e., the system must perform well on new test data distinct from training data.

 Computational resources

Once training is complete, many neural networks generally take up very little time in their execution or application to a specific problem. However, training the networks or applying a learning algorithm can take a very long time.
8. Real applications of neural networks
