
Ing. Leonel D. Rozo C., M.Sc., PhD(c)
ing.leonelrozo@gmail.com

2010
Decision tree learning is one of the most widely used and practical
methods for inductive inference. It is a method for approximating
discrete-valued functions that is robust to noisy data and capable of
learning disjunctive expressions.
1. Decision tree representation

 Decision trees classify instances by sorting them down the tree from the root to some leaf node, which provides the classification of the instance.

 Each node in the tree specifies a test of some attribute of the instance, and each branch descending from that node corresponds to one of the possible values for this attribute.
2. Appropriate problems for decision tree learning

• Instances are represented by attribute-value pairs - Instances are described by a fixed set of attributes (e.g., Temperature) and their values (e.g., Hot).

• The target function has discrete output values - The decision tree
assigns a boolean classification (e.g., yes or no) to each
example.

• Disjunctive descriptions may be required.

• The training data may contain errors.

• The training data may contain missing attribute values.


3. The basic decision tree learning algorithm

1. Which attribute should be tested at the root of the tree?

2. The best attribute is selected and used as the test at the root
node of the tree.

3. A descendant of the root node is then created for each possible value of this attribute, and the training examples are sorted to the appropriate descendant node.

4. The entire process is then repeated using the training examples associated with each descendant node to select the best attribute to test at that point in the tree.
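The four steps above are the core of the ID3 algorithm. A minimal Python sketch, assuming examples are dicts mapping attribute names to values, and using the information-gain attribute selector that section 3.1 defines:

```python
import math

def entropy(labels):
    """Entropy of a list of class labels."""
    return -sum((labels.count(c) / len(labels)) * math.log2(labels.count(c) / len(labels))
                for c in set(labels))

def gain(examples, labels, a):
    """Expected reduction in entropy from splitting on attribute a."""
    rem = 0.0
    for v in {ex[a] for ex in examples}:
        sub = [lab for ex, lab in zip(examples, labels) if ex[a] == v]
        rem += len(sub) / len(labels) * entropy(sub)
    return entropy(labels) - rem

def id3(examples, labels, attributes):
    """Recursive tree construction: pick the best attribute, branch on its values."""
    if len(set(labels)) == 1:            # all examples share one class: leaf node
        return labels[0]
    if not attributes:                   # no attributes left: majority-class leaf
        return max(set(labels), key=labels.count)
    best = max(attributes, key=lambda a: gain(examples, labels, a))
    return {"attr": best,
            "branches": {v: id3([ex for ex in examples if ex[best] == v],
                                [lab for ex, lab in zip(examples, labels)
                                 if ex[best] == v],
                                [a for a in attributes if a != best])
                         for v in {ex[best] for ex in examples}}}
```

This is a sketch, not a full implementation: it has no handling of missing values or of attribute values unseen at a node.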

3.1. Which attribute is the best classifier?

The central choice in the algorithm is selecting which attribute to test at each node in the tree. We would like to select the attribute that is most useful for classifying examples.

 Entropy measures homogeneity of examples

A measure commonly used in information theory, called entropy, characterizes the (im)purity of an arbitrary collection of examples.

Given a collection S, containing positive and negative examples of some target concept, the entropy of S relative to this boolean classification is:

Entropy(S) = -p⊕ log2 p⊕ - p⊖ log2 p⊖

where p⊕ is the proportion of positive examples in S and p⊖ is the proportion of negative examples in S (with 0 log2 0 taken to be 0).
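In Python, this measure can be sketched as a small function over the positive and negative example counts:

```python
import math

def entropy(pos, neg):
    """Entropy of a collection with pos positive and neg negative examples."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count > 0:                 # 0 * log2(0) is taken to be 0
            p = count / total
            result -= p * math.log2(p)
    return result
```

A pure collection (all one class) has entropy 0 and a 50/50 split has entropy 1; for example, entropy(9, 5) is about 0.940.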

 Information gain measures the expected reduction in entropy


Information gain is simply the expected reduction in entropy caused by partitioning the examples according to this attribute. More precisely, the information gain, Gain(S, A), of an attribute A relative to a collection of examples S, is defined as:

Gain(S, A) = Entropy(S) - Σ_{v ∈ Values(A)} (|S_v| / |S|) Entropy(S_v)

where Values(A) is the set of all possible values for attribute A, and S_v is the subset of S for which attribute A has value v.
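A sketch of Gain(S, A) in Python, assuming examples are dicts mapping attribute names to values:

```python
import math

def entropy(labels):
    """Entropy of a list of class labels (boolean or otherwise)."""
    return -sum((labels.count(c) / len(labels)) * math.log2(labels.count(c) / len(labels))
                for c in set(labels))

def gain(examples, labels, attribute):
    """Gain(S, A): entropy of S minus the weighted entropy of each subset S_v."""
    remainder = 0.0
    for v in {ex[attribute] for ex in examples}:     # Values(A)
        subset = [lab for ex, lab in zip(examples, labels)
                  if ex[attribute] == v]             # labels of S_v
        remainder += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - remainder
```

For instance, with 14 examples (9 positive, 5 negative) split by a two-valued attribute into subsets of 8 (6 positive, 2 negative) and 6 (3 positive, 3 negative), the gain is about 0.048.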
4. Issues in decision tree learning

4.1. Avoiding overfitting the data

The algorithm described before grows each branch of the tree just deeply enough to perfectly classify the training examples. While this is sometimes a reasonable strategy, in fact it can lead to difficulties when:

 There is noise in the data.

 The number of training examples is too small to produce a representative sample of the true target function.

A hypothesis overfits the training examples if some other hypothesis that fits the training examples less well actually performs better over the entire distribution of instances (i.e., including instances beyond the training set).

There are several approaches to avoiding overfitting in decision tree learning. These can be grouped into two classes:

 Approaches that stop growing the tree earlier, before it reaches the
point where it perfectly classifies the training data.

 Approaches that allow the tree to overfit the data, and then post-prune
the tree.

 Reduced error pruning


Consider each of the decision nodes in the tree to be candidates for
pruning. Pruning a decision node consists of removing the subtree
rooted at that node, making it a leaf node, and assigning it the most
common classification of the training examples affiliated with that
node.

• Nodes are removed only if the resulting pruned tree performs no worse than the original over the validation set.

• Nodes are pruned iteratively, always choosing the node whose removal most increases the decision tree accuracy over the validation set.
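A minimal sketch of this procedure, assuming the tree is a nested dict with keys "attr", "branches", and "majority" (the most common classification of the training examples affiliated with each node):

```python
def classify(tree, ex):
    """Follow branches until a leaf (any non-dict value) is reached."""
    while isinstance(tree, dict):
        tree = tree["branches"][ex[tree["attr"]]]
    return tree

def accuracy(tree, examples, labels):
    return sum(classify(tree, ex) == lab
               for ex, lab in zip(examples, labels)) / len(labels)

def internal_nodes(tree, path=()):
    """Yield the branch-value path to every internal (decision) node."""
    if isinstance(tree, dict):
        yield path
        for v, sub in tree["branches"].items():
            yield from internal_nodes(sub, path + (v,))

def node_at(tree, path):
    for v in path:
        tree = tree["branches"][v]
    return tree

def replaced(tree, path, leaf):
    """Copy of tree with the node at `path` turned into the given leaf."""
    if not path:
        return leaf
    new = dict(tree)
    new["branches"] = dict(tree["branches"])
    new["branches"][path[0]] = replaced(tree["branches"][path[0]], path[1:], leaf)
    return new

def reduced_error_prune(tree, val_examples, val_labels):
    """Greedily prune while the pruned tree performs no worse on validation data."""
    while isinstance(tree, dict):
        best, best_acc = None, accuracy(tree, val_examples, val_labels)
        for path in internal_nodes(tree):
            candidate = replaced(tree, path, node_at(tree, path)["majority"])
            acc = accuracy(candidate, val_examples, val_labels)
            if acc >= best_acc:          # "performs no worse" over the validation set
                best, best_acc = candidate, acc
        if best is None:
            return tree
        tree = best
    return tree
```

Each iteration removes at least one decision node, so the loop terminates; pruning stops as soon as every candidate prune would reduce validation accuracy.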

 Rule post-pruning

i. Infer the decision tree from the training set.

ii. Convert the learned tree into an equivalent set of rules.

iii. Prune (generalize) each rule by removing any preconditions that result in improving its estimated accuracy.

iv. Sort the pruned rules by their estimated accuracy, and consider them
in this sequence when classifying subsequent instances.
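Step ii (converting the tree into rules) can be sketched as follows, assuming the nested-dict tree representation; each root-to-leaf path becomes one rule, a list of (attribute, value) preconditions plus a classification:

```python
def tree_to_rules(tree, conditions=()):
    """Convert each root-to-leaf path of a decision tree into a rule."""
    if not isinstance(tree, dict):                 # leaf: emit one finished rule
        return [(list(conditions), tree)]
    rules = []
    for v, sub in tree["branches"].items():        # extend the precondition list
        rules += tree_to_rules(sub, conditions + ((tree["attr"], v),))
    return rules
```

Once in rule form, each precondition can be tested for removal independently (step iii), which is more fine-grained than removing whole subtrees.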

4.2. Incorporating continuous-valued attributes

This can be accomplished by dynamically defining new discrete-valued attributes that partition the continuous attribute value into a discrete set of intervals.

 In particular, for an attribute A that is continuous-valued, the algorithm can dynamically create a new boolean attribute A_c that is true if A < c and false otherwise.
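A sketch of how candidate thresholds c might be generated, using the common heuristic of taking midpoints between adjacent (sorted) examples whose classifications differ; the Temperature values below are illustrative toy data:

```python
def candidate_thresholds(values, labels):
    """Candidate cut points c: midpoints between adjacent sorted examples
    whose class labels differ."""
    pairs = sorted(zip(values, labels))
    return [(pairs[i][0] + pairs[i + 1][0]) / 2
            for i in range(len(pairs) - 1)
            if pairs[i][1] != pairs[i + 1][1]]

def discretize(value, c):
    """The derived boolean attribute A_c: true when A < c."""
    return value < c

# Toy Temperature data: 40 No, 48 No, 60 Yes, 72 Yes, 80 Yes, 90 No
# yields candidate thresholds at 54.0 and 85.0.
```

Each candidate threshold can then be evaluated with information gain like any other boolean attribute, and the best one kept.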
Many tasks involving intelligence or
pattern recognition are extremely
difficult to automate, but appear to be
performed very easily by animals.

 For instance, animals recognize various objects and make sense out of the large amount of visual information in their surroundings, apparently requiring very little effort.
1. Introduction

The neural network of an animal is part of its nervous system, containing a large number of interconnected neurons (nerve cells).

 “Neural” is an adjective for neuron

 “Network” denotes a graph-like structure.

Artificial neural networks refer to computing systems whose central theme is borrowed from the analogy of biological neural networks.
2. History of neural networks

“The amount of activity at any given point in the brain cortex is the sum
of the tendencies of all other points to discharge into it, such tendencies
being proportionate…” (William James)

1. To the number of times the excitement of other points may have accompanied that of the point in question.

2. To the intensities of such excitements.

3. To the absence of any rival point functionally disconnected with the first point, into which the discharges may be diverted.

1938 Rashevsky initiated studies of neurodynamics, also known as neural field theory, representing activation and propagation in neural networks in terms of differential equations.

1943 McCulloch and Pitts invented the first artificial model for biological neurons using simple binary threshold functions.

1949 Hebb introduced his famous learning rule: repeated activation of one neuron by another, across a particular synapse, increases its conductance.

1954 Gabor invented the “learning filter” that uses gradient descent to obtain “optimal” weights that minimize the MSE between the observed output signal and a signal generated based upon past information.

1958 Rosenblatt invented the “perceptron”, introducing a learning method for the McCulloch and Pitts neuron model.

1960 Widrow and Hoff introduced the “Adaline”.

1961 Rosenblatt proposed the “backpropagation” scheme for training multilayer networks.

1969 The limits of simple perceptrons were demonstrated.


3. Structure and function of a single neuron

3.1. Biological neurons

A typical biological neuron is composed of a cell body, a tubular axon, and a multitude of hair-like dendrites.

The small gap between an end bulb and a dendrite is called a synapse,
across which information is propagated. The axon of a single neuron
forms synaptic connections with many other neurons.

Inhibitory or excitatory signals from other neurons are transmitted to a neuron at its dendrites’ synapses. The magnitude of the signal received by a neuron (from another) depends on the efficiency of the synaptic transmission.

 The cell membrane becomes electrically active when sufficiently excited by the neurons making synapses onto this neuron.

 A neuron will fire if sufficient signals from other neurons fall upon its
dendrites in a short period of time, called the period of latent
summation.

3.2. Artificial neuron models

 The position on the neuron (node) of the incoming synapse (connection) is irrelevant.

 Each node has a single output value, distributed to other nodes via outgoing links, irrespective of their positions.

 All inputs arrive at the same time, or remain activated at the same level long enough for the computation of f to occur.

The next level of specialization is to assume that different weighted inputs are summed.
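Such a weighted-sum node can be sketched as:

```python
def neuron(inputs, weights, f):
    """Compute f(net), where net is the weighted sum of the inputs."""
    net = sum(w * x for w, x in zip(weights, inputs))
    return f(net)

# A simple binary threshold node, in the style of McCulloch and Pitts:
step = lambda net: 1 if net >= 0 else 0
```

The node function f is left as a parameter here; the candidates for f are discussed next.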

Now, it is necessary to establish which function f the neuron uses…

 Ramp functions

 Step functions

 Sigmoid functions

 Piecewise linear and Gaussian functions
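These families of node functions can be sketched as follows; the parameter choices (thresholds, interval endpoints, mean, and width) are illustrative defaults, not prescribed values:

```python
import math

def step(net, threshold=0.0):
    """Step function: output jumps from 0 to 1 at the threshold."""
    return 1.0 if net >= threshold else 0.0

def ramp(net, a=-1.0, b=1.0):
    """Ramp (piecewise linear): linear between a and b, clipped to [0, 1] outside."""
    return min(max((net - a) / (b - a), 0.0), 1.0)

def sigmoid(net):
    """Sigmoid: smooth, bounded, S-shaped."""
    return 1.0 / (1.0 + math.exp(-net))

def gaussian(net, mu=0.0, sigma=1.0):
    """Gaussian: bell-shaped response centered at mu."""
    return math.exp(-((net - mu) ** 2) / (2 * sigma ** 2))
```

The step function is discontinuous, while the sigmoid is differentiable everywhere, which is what gradient-based training schemes require.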


4. Neural net architectures

A single node is insufficient for many practical problems, and networks with a large number of nodes are frequently used. The way nodes are connected determines how computations proceed and constitutes an important early design decision by a neural network developer.

 Fully connected networks



 Layered networks

 Acyclic networks

 Feedforward networks

 Modular networks
5. Neural learning

 Correlation learning

When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A’s efficiency, as one of the cells firing B, is increased.
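Hebb's principle is commonly formalized as a correlation rule, where a weight grows in proportion to the product of the two activities it connects; the learning rate eta below is an assumed parameter:

```python
def hebbian_update(weight, x_pre, x_post, eta=0.1):
    """Hebb's rule sketch: delta_w = eta * (presynaptic activity) * (postsynaptic activity)."""
    return weight + eta * x_pre * x_post
```

When both units are active together the connection strengthens; if either is silent, the weight is unchanged.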

 Competitive learning

Another principle for neural computation is that when an input pattern is presented to a network, different nodes compete to be "winners" with high levels of activity. The competitive process involves self-excitation and mutual inhibition among nodes, until a single winner emerges.

 The connections between input nodes and the winner node are then modified, increasing the likelihood that the same winner continues to win in future competitions.

 The converse of competition is cooperation, found in some neural network models.
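A winner-take-all update can be sketched as follows; only the winning node's weight vector moves toward the input, which makes the same winner likelier for similar future inputs. Choosing the winner by Euclidean distance and the learning rate eta are assumptions of this sketch:

```python
def winner(x, weight_rows):
    """Index of the node whose weight vector is closest to input x."""
    dists = [sum((wi - xi) ** 2 for wi, xi in zip(w, x)) for w in weight_rows]
    return dists.index(min(dists))

def competitive_update(x, weight_rows, eta=0.5):
    """Move only the winner's weights toward x; all other nodes are untouched."""
    k = winner(x, weight_rows)
    weight_rows[k] = [wi + eta * (xi - wi) for wi, xi in zip(weight_rows[k], x)]
    return k
```

Repeated presentations of similar inputs therefore pull one node's weight vector toward the center of that input cluster.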

 Feedback-based weight adaptation

If increasing a particular weight leads to diminished performance or larger error, then that weight is decreased as the network is trained to perform better.

 The amount of change made at every step is very small in most networks, to ensure that a network does not stray too far from its partially evolved state, and so that the network withstands some mistakes made by the teacher, feedback, or performance evaluation mechanism.
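This feedback principle can be sketched with a finite-difference probe of the error, a stand-in for the gradient that real training rules compute analytically; the step sizes eta and h are assumed parameters:

```python
def adapt_weight(w, error_fn, eta=0.01, h=1e-5):
    """If increasing w raises the error, decrease w (and vice versa)."""
    slope = (error_fn(w + h) - error_fn(w - h)) / (2 * h)
    return w - eta * slope    # small step against the direction of increasing error
```

Repeated small steps of this kind drive the weight toward a minimum of the error, as the text describes.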
6. What can neural networks be used for?

 Classification

 Clustering

Clustering requires grouping together objects that are similar to each other…

 Pattern association

In pattern association, another important task that can be performed by neural networks, the presentation of an input sample should trigger the generation of a specific output pattern.

 Function approximation

Many computational models can be described as functions mapping some numerical input vectors to numerical outputs. The outputs corresponding to some input vectors may be known from training data, but we may not know the mathematical function describing the actual process that generates the outputs from the input vectors.

 Forecasting

There are many real-life problems in which future events must be predicted on the basis of past history. An example task is that of predicting the behavior of stock market indices.

 Control applications

Control addresses the task of determining the values for input variables in order to achieve desired values for output variables.
7. Evaluation of networks

 Quality of results

The performance of a neural network is frequently gauged in terms of an error measure.

• Euclidean distance

• Manhattan or Hamming distance

In classification problems, another possible error measure is the fraction of misclassified samples.
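These error measures can be sketched as follows, comparing a desired output vector d with an observed output vector o (for binary vectors, the Manhattan distance coincides with the Hamming distance):

```python
def euclidean(d, o):
    """Euclidean distance between desired and observed output vectors."""
    return sum((di - oi) ** 2 for di, oi in zip(d, o)) ** 0.5

def manhattan(d, o):
    """Manhattan distance; equals the Hamming distance on binary vectors."""
    return sum(abs(di - oi) for di, oi in zip(d, o))

def misclassification_rate(desired, observed):
    """Fraction of samples whose predicted class differs from the desired class."""
    return sum(a != b for a, b in zip(desired, observed)) / len(desired)
```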

 Generalizability

It is not surprising for a system to perform well on the data on which it has been trained. But good generalizability is also necessary, i.e., the system must perform well on new test data distinct from training data.

 Computational resources

Once training is complete, many neural networks generally take up very little time in their execution or application to a specific problem. However, training the networks or applying a learning algorithm can take a very long time.
8. Real applications of neural networks
