
FGCS

Future Generation Computer Systems 13 (1997) 211-229

Using neural networks for data mining


Mark W. Craven a,*, Jude W. Shavlik b
a School of Computer Science, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA 15213-3891, USA
b Computer Sciences Department, University of Wisconsin - Madison, 1210 West Dayton Street, Madison, WI 53706-1685, USA

Received 21 February 1997; accepted 15 April 1997

Abstract

Neural networks have been successfully applied in a wide range of supervised and unsupervised learning applications.
Neural-network methods are not commonly used for data-mining tasks, however, because they often produce incomprehensible
models and require long training times. In this article, we describe neural-network learning algorithms that are able to
produce comprehensible models, and that do not require excessive training times. Specifically, we discuss two classes of
approaches for data mining with neural networks. The first type of approach, often called rule extraction, involves extracting
symbolic models from trained neural networks. The second approach is to directly learn simple, easy-to-understand networks.
We argue that, given the current state-of-the-art, neural-network methods deserve a place in the tool boxes of data-mining
specialists.

Keywords: Machine learning; Neural networks; Rule extraction; Comprehensible models; Decision trees; Perceptrons

1. Introduction

The central focus of the data-mining enterprise is to gain insight into large collections of data. Often, achieving this goal involves applying machine-learning methods to inductively construct models of the data at hand. In this article, we provide an introduction to the topic of using neural-network methods for data mining. Neural networks have been applied to a wide variety of problem domains to learn models that are able to perform such interesting tasks as steering a motor vehicle, recognizing genes in uncharacterized DNA sequences, scheduling payloads for the space shuttle, and predicting exchange rates. Although neural-network learning algorithms have been successfully applied to a wide range of supervised and unsupervised learning problems, they have not often been applied in data-mining settings, in which two fundamental considerations are the comprehensibility of learned models and the time required to induce models from large data sets. We discuss new developments in neural-network learning that effectively address the comprehensibility and speed issues which often are of prime importance in the data-mining community. Specifically, we describe algorithms that are able to extract symbolic rules from trained neural networks, and algorithms that are able to directly learn comprehensible models.

* Corresponding author. E-mail: mark.craven@cs.cmu.edu.
E-mail: shavlik@cs.wisc.edu.

0167-739X/97/$17.00 Copyright © 1997 Elsevier Science B.V. All rights reserved
PII S0167-739X(97)00022-8

Inductive learning is a central task in data mining, since building descriptive models of a collection of data provides one way of gaining insight into it. Such models can be learned by either supervised or unsupervised methods, depending on the nature of the task. In supervised learning, the learner is given a set of instances of the form (x, y), where y represents the variable that we want the system to predict, and x is a vector of values that represent features thought to be relevant to determining y. The goal in supervised learning is to induce a general mapping from x vectors to y values. That is, the learner must build a model, ŷ = f̂(x), of the unknown function f, that allows it to predict y values for previously unseen examples. In unsupervised learning, the learner is also given a set of training examples, but each instance consists only of the x part; it does not include the y value. The goal in unsupervised learning is to build a model that accounts for regularities in the training set.

In both the supervised and unsupervised case, learning algorithms differ considerably in how they represent their induced models. Many learning methods represent their models using languages that are based on, or closely related to, logical formulae. Neural-network learning methods, on the other hand, represent their learned solutions using real-valued parameters in a network of simple processing units. We do not provide an introduction to neural-network models in this article, but instead refer the interested reader to one of the good textbooks in the field (e.g., [2]). A detailed survey of real-world neural-network applications can be found elsewhere [21].

The rest of this article is organized as follows. In Section 2, we consider the applicability of neural-network methods to the task of data mining. Specifically, we discuss why one might want to consider using neural networks for such tasks, and we discuss why trained neural networks are usually hard to understand. The two succeeding sections cover two different types of approaches for learning comprehensible models using neural networks. Section 3 discusses methods for extracting comprehensible models from trained neural networks, and Section 4 describes neural-network learning methods that directly learn simple, and hopefully comprehensible, models. Finally, Section 5 provides conclusions.

2. The suitability of neural networks for data mining

Before describing particular methods for data mining with neural networks, we first make an argument for why one might want to consider using neural networks for the task. The essence of the argument is that, for some problems, neural networks provide a more suitable inductive bias than competing algorithms. Let us briefly discuss the meaning of the term inductive bias. Given a fixed set of training examples, there are infinitely many models that could account for the data, and every learning algorithm has an inductive bias that determines the models that it is likely to return. There are two aspects to the inductive bias of an algorithm: its restricted hypothesis space bias and its preference bias. The restricted hypothesis space bias refers to the constraints that a learning algorithm places on the hypotheses that it is able to construct. For example, the hypothesis space of a perceptron is limited to linear discriminant functions. The preference bias of a learning algorithm refers to the preference ordering it places on the models that are within its hypothesis space. For example, most learning algorithms initially try to fit a simple hypothesis to a given training set and then explore progressively more complex hypotheses until they find an acceptable fit.

In some cases, neural networks have a more appropriate restricted hypothesis space bias than other learning algorithms. For example, sequential and temporal prediction tasks represent a class of problems for which neural networks often provide the most appropriate hypothesis space. Recurrent networks, which are often applied to these problems, are able to maintain state information from one time step to the next. This means that recurrent networks can use their hidden units to learn derived features relevant to the task at hand, and they can use the state of these derived features at one instant to help make a prediction for the next instance.

In other cases, neural networks are the preferred learning method not because of the class of hypotheses that they are able to represent, but simply because they induce hypotheses that generalize better than those of competing algorithms.

Several empirical studies have pointed out that there are some problem domains in which neural networks provide superior predictive accuracy to commonly used symbolic learning algorithms (e.g., [18]).

Although neural networks have an appropriate inductive bias for a wide range of problems, they are not commonly used for data-mining tasks. As stated previously, there are two primary explanations for this fact: trained neural networks are usually not comprehensible, and many neural-network learning methods are slow, making them impractical for very large data sets. We discuss these two issues in turn before moving on to the core part of the article.

The hypothesis represented by a trained neural network is defined by (a) the topology of the network, (b) the transfer functions used for the hidden and output units, and (c) the real-valued parameters associated with the network connections (i.e., the weights) and units (e.g., the biases of sigmoid units). Such hypotheses are difficult to comprehend for several reasons. First, typical networks have hundreds or thousands of real-valued parameters. These parameters encode the relationships between the input features, x, and the target value, y. Although single-parameter encodings of this type are usually not hard to understand, the sheer number of parameters in a typical network can make the task of understanding them quite difficult. Second, in multi-layer networks, these parameters may represent nonlinear, nonmonotonic relationships between the input features and the target values. Thus it is usually not possible to determine, in isolation, the effect of a given feature on the target value, because this effect may be mediated by the values of other features.

These nonlinear, nonmonotonic relationships are represented by the hidden units in a network, which combine the inputs of multiple features, thus allowing the model to take advantage of dependencies among the features. Hidden units can be thought of as representing higher-level, derived features. Understanding hidden units is often difficult because they learn distributed representations. In a distributed representation, individual hidden units do not correspond to well-understood features in the problem domain. Instead, features which are meaningful in the context of the problem domain are often encoded by patterns of activation across many hidden units. Similarly, each hidden unit may play a part in representing numerous derived features.

Now let us consider the issue of the learning time required for neural networks. The process of learning, in most neural-network methods, involves using some type of gradient-based optimization method to adjust the network's parameters. Such optimization methods iteratively execute two basic steps: calculating the gradient of the error function (with respect to the network's adjustable parameters), and adjusting the network's parameters in the direction suggested by the gradient. Learning can be quite slow with such methods because the optimization procedure often involves a large number of small steps, and the cost of calculating the gradient at each step can be relatively expensive.

One appealing aspect of many neural-network learning methods, however, is that they are on-line algorithms, meaning that they update their hypotheses after every example is presented. Because they update their parameters frequently, on-line neural-network learning algorithms often converge much faster than batch algorithms. This is especially the case for large data sets. Often, a reasonably good solution can be found in only one pass through a large training set! For this reason, we argue that the training-time performance of neural-network learning methods may often be acceptable for data-mining tasks, especially given the availability of high-performance desktop computers.

3. Extraction methods

One approach to understanding a hypothesis represented by a trained neural network is to translate the hypothesis into a more comprehensible language. Various approaches using this strategy have been investigated under the rubric of rule extraction. In this section, we give an overview of various rule-extraction approaches, and discuss a few of the successful applications of such methods.
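Section 2's speed argument — that on-line methods, which update after every example, can often reach a reasonable hypothesis in a single pass through a large training set, while a batch method spends an entire pass computing just one gradient step — can be sketched as follows. The data and the one-parameter model are invented for illustration:

```python
import random

# Task: fit y = w*x to data generated with w_true = 3.0 (synthetic data,
# for illustration only), minimizing squared error.
random.seed(0)
data = [(x, 3.0 * x + random.gauss(0, 0.1))
        for x in [random.uniform(-1, 1) for _ in range(10_000)]]

def online_pass(data, lr=0.05):
    """On-line (stochastic) gradient descent: one update per example."""
    w = 0.0
    for x, y in data:
        w -= lr * 2 * (w * x - y) * x   # gradient of (w*x - y)**2 wrt w
    return w

def batch_steps(data, steps, lr=0.05):
    """Batch gradient descent: each step averages the gradient over ALL data."""
    w = 0.0
    n = len(data)
    for _ in range(steps):
        g = sum(2 * (w * x - y) * x for x, y in data) / n
        w -= lr * g
    return w

# One on-line pass (10,000 cheap updates) lands near w_true = 3.0;
# one batch step (the same number of gradient evaluations) barely moves w.
print(online_pass(data))
print(batch_steps(data, steps=1))
```

Both methods touch the data the same number of times here, but the on-line learner has taken ten thousand small steps where the batch learner has taken one.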

The methods that we discuss in this section differ along several primary dimensions:
- Representation language: The language that is used by the extraction method to describe the neural network's learned model. The languages that have been used by various methods include conjunctive inference (if-then) rules, m-of-n rules, fuzzy rules, decision trees, and finite state automata.
- Extraction strategy: The strategy used by the extraction method to map the model represented by the trained network into a model in the new representation language. Specifically, how does the method explore a space of candidate descriptions, and what level of description does it use to characterize the given neural network? That is, do the rules extracted by the method describe the behavior of the network as a whole, the behavior of individual units in the network, or something in between these two cases? We use the term global methods to refer to the first case, and the term local methods to refer to the second case.
- Network requirements: The architectural and training requirements that the extraction method imposes on neural networks. In other words, the range of networks to which the method is applicable.
Throughout this section, as we describe various rule-extraction methods, we will evaluate them with respect to these three dimensions.

extracted rules:  y ← x1 ∧ x2 ∧ x3
                  y ← x1 ∧ x2 ∧ ¬x5
                  y ← x1 ∧ x3 ∧ ¬x5

Fig. 1. A network and extracted rules. The network has five input units representing five Boolean features. The rules describe the settings of the input features that result in the output unit having an activation of 1.

3.1. The rule-extraction task

Fig. 1 illustrates the task of rule extraction with a very simple network. This one-layer network has five Boolean inputs and one Boolean output. Any network, such as this one, which has discrete output classes and discrete-valued input features, can be exactly described by a finite set of symbolic if-then rules, since there is a finite number of possible input vectors. The extracted symbolic rules specify conditions on the input features that, when satisfied, guarantee a given output state. In our example, we assume that the value false for a Boolean input feature is represented by an activation of 0, and the value true is represented by an activation of 1. Also, we assume that the output unit employs a threshold function to compute its activation:

    a_y = 1  if Σ_i w_i a_i + θ > 0,
    a_y = 0  otherwise,

where a_y is the activation of the output unit, a_i the activation of the ith input unit, w_i the weight from the ith input to the output unit, and θ is the threshold parameter of the output unit. We use x_i to refer to the value of the ith feature, and a_i to refer to the activation of the corresponding input unit. For example, if x_i = true then a_i = 1.

Fig. 1 shows three conjunctive rules which describe the most general conditions under which the output unit has an activation of unity. Consider the rule:

    y ← x1 ∧ x2 ∧ ¬x5.

This rule states that when x1 = true, x2 = true, and x5 = false, then the output unit representing y will have an activation of 1 (i.e., the network predicts y = true). To see that this is a valid rule, consider that for the cases covered by this rule the inputs fixed by the rule already give w1 + w2 + θ > 0 (the numeric inequality, which depends on the weights shown in Fig. 1, is omitted here). Thus, the weighted sum exceeds zero. But what effect can the other features have on the output unit's activation in this case? It can be seen that

    0 ≤ a3 w3 + a4 w4 ≤ 4.

extracted rules:  y ← h1 ∨ h2 ∨ h3
                  h1 ← x1 ∧ x2
                  h2 ← x2 ∧ x3 ∧ x4
                  h3 ← x5

Fig. 2. The local approach to rule extraction. A multi-layer neural network is decomposed into a set of single-layer networks. Rules are extracted to describe each of the constituent networks, and the rule sets are combined to describe the multi-layer network.

No matter what values the features x3 and x4 have, the output unit will have an activation of 1. Thus the rule is valid; it accurately describes the behavior of the network for those instances that match its antecedent. To see that the rule is maximally general, consider that if we drop any one of the literals from the rule's antecedent, then the rule no longer accurately describes the behavior of the network. For example, if we drop the literal ¬x5 from the rule, then for some of the examples covered by the rule the weighted sum no longer exceeds zero (again, the inequality depends on the weights shown in Fig. 1), and thus the network does not predict that y = true for all of the covered examples.

So far, we have defined an extracted rule in the context of a very simple neural network. What does a rule mean in the context of networks that have continuous transfer functions, hidden units, and multiple output units? Whenever a neural network is used for a classification problem, there is always an implicit decision procedure that is used to decide which class is predicted by the network for a given case. In the simple example above, the decision procedure was simply to predict y = true when the activation of the output unit was 1, and to predict y = false when it was 0. If we used a logistic transfer function instead of a threshold function at the output unit, then the decision procedure might be to predict y = true when the activation exceeds a specified value, say 0.5. If we were using one output unit per class for a multi-class learning problem (i.e., a problem with more than two classes), then our decision procedure might be to predict the class associated with the output unit that has the greatest activation. In general, an extracted rule (approximately) describes a set of conditions under which the network, coupled with its decision procedure, predicts a given class.

As discussed at the beginning of this section, one of the dimensions along which rule-extraction methods can be characterized is their level of description. One approach is to extract a set of global rules that characterize the output classes directly in terms of the inputs. An alternative approach is to extract local rules by decomposing the multi-layer network into a collection of single-layer networks. A set of rules is extracted to describe each individual hidden and output unit in terms of the units that have weighted connections to it. The rules for the individual units are then combined into a set of rules that describes the network as a whole. The local approach to rule extraction is illustrated in Fig. 2.

3.2. Search-based rule-extraction methods

Many rule-extraction algorithms have set up the task as a search problem which involves exploring a space of candidate rules and testing individual candidates against the network to see if they are valid rules.
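Rule testing of the kind described above can be made concrete for a small threshold network like the one in Fig. 1. The weights and threshold below are hypothetical (the figure itself is not reproduced here); they were chosen only to be consistent with the rules discussed in the text. The checker verifies a conjunctive rule by enumerating every setting of the inputs the rule leaves unconstrained:

```python
from itertools import product

# Hypothetical weights/threshold for a Fig. 1-style network: five Boolean
# inputs feeding one threshold output unit. NOT taken from the article's
# figure -- invented so that exactly the rules discussed in the text hold.
W = [7, 5, 3, 1, -5]
THETA = -9

def network_predicts(a):
    """Threshold unit: 1 if sum_i w_i a_i + theta > 0, else 0."""
    return 1 if sum(w * ai for w, ai in zip(W, a)) + THETA > 0 else 0

def rule_is_valid(pos, neg):
    """A rule y <- (AND of x_i for i in pos) AND (AND of not x_i for i in neg)
    is valid iff the network outputs 1 for EVERY setting of the free inputs."""
    free = [i for i in range(len(W)) if i not in pos and i not in neg]
    for values in product([0, 1], repeat=len(free)):
        a = [0] * len(W)
        for i in pos:
            a[i] = 1
        for i, v in zip(free, values):
            a[i] = v
        if network_predicts(a) == 0:
            return False
    return True

# The three extracted rules from the text (0-indexed: x1 -> index 0, ...):
print(rule_is_valid(pos={0, 1, 2}, neg=set()))   # y <- x1 ^ x2 ^ x3   : True
print(rule_is_valid(pos={0, 1}, neg={4}))        # y <- x1 ^ x2 ^ ~x5  : True
print(rule_is_valid(pos={0, 2}, neg={4}))        # y <- x1 ^ x3 ^ ~x5  : True
# Dropping ~x5 yields an overly general, invalid rule, as argued in the text:
print(rule_is_valid(pos={0, 1}, neg=set()))      # y <- x1 ^ x2        : False
```

The enumeration is exponential in the number of free inputs, which is exactly why the heuristics discussed in the following pages matter.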

Fig. 3. A rule search space. Each node in the space represents a possible rule antecedent. Edges between nodes indicate specialization relationships (in the downward direction). The thicker lines depict one possible search tree for this space.

In this section we consider both global and local methods which approach the rule-extraction task in this way.

Most of these algorithms conduct their search through a space of conjunctive rules. Fig. 3 shows a rule search space for a problem with three Boolean features. Each node in the tree corresponds to the antecedent of a possible rule, and the edges indicate specialization relationships (in the downward direction) between nodes. The node at the top of the graph represents the most general rule (i.e., all instances are members of the class y), and the nodes at the bottom of the tree represent the most specific rules, which cover only one example each. Unlike most search processes, which continue until the first goal node is found, a rule-extraction search continues until all (or most) of the maximally general rules have been found. Notice that rules with more than one literal in their antecedent have multiple ancestors in the graph. Obviously, when exploring a rule space, it is inefficient for the search procedure to visit a node multiple times. In order to avoid this inefficiency, we can impose an ordering on the literals, thereby transforming the search graph into a tree. The thicker lines in Fig. 3 depict one possible search tree for the given rule space.

One of the problematic issues that arises in search-based approaches to rule extraction is that the size of the rule space can be very large. For a problem with n binary features, there are 3^n possible conjunctive rules (since each feature can be absent from a rule antecedent, or it can occur as a positive or a negative literal in the antecedent). To address this issue, a number of heuristics have been employed to limit the combinatorics of the rule-exploration process.

Several rule-extraction algorithms manage the combinatorics of the task by limiting the number of literals that can be in the antecedents of extracted rules [9,16]. For example, the algorithm of Saito and Nakano [16] uses two parameters, k_pos and k_neg, that specify the maximum number of positive and negative literals, respectively, that can be in an antecedent. By restricting the search to a depth of k, the rule space considered is limited to a size given by the following expression:

    Σ_{i=0}^{k} C(n, i) 2^i.

For fixed k, this expression is polynomial in n, but obviously, it is exponential in the depth k. This means that exploring a space of rules might still be intractable, since, for some networks, it may be necessary to search deep in the tree in order to find valid rules.

The second heuristic employed by Saito and Nakano is to limit the search to combinations of literals that occur in the training set used for the network. Thus, if the training set did not contain an example for which x1 = true and x2 = true, then the rule search would not consider the rule y ← x1 ∧ x2 or any of its specializations.

Exploring a space of candidate rules is only one part of the task for a search-based rule-extraction method.
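The counting claims above are easy to check by brute force. This illustrative sketch enumerates conjunctive antecedents over n binary features and confirms both the 3^n total and the depth-limited count Σ_{i=0}^{k} C(n, i) 2^i:

```python
from itertools import combinations, product
from math import comb

def antecedents(n, max_literals=None):
    """Enumerate conjunctive antecedents over n binary features.
    Each antecedent picks a subset of features; every picked feature
    appears either as a positive or as a negated literal."""
    k = n if max_literals is None else max_literals
    for size in range(k + 1):
        for feats in combinations(range(n), size):
            for signs in product([True, False], repeat=size):
                yield tuple(zip(feats, signs))

n = 5
total = sum(1 for _ in antecedents(n))
print(total == 3**n)            # True: each feature is absent, positive, or negated

k = 2
depth_limited = sum(1 for _ in antecedents(n, max_literals=k))
bound = sum(comb(n, i) * 2**i for i in range(k + 1))
print(depth_limited == bound)   # True: matches the depth-limited count
```

The identity Σ_{i=0}^{n} C(n, i) 2^i = 3^n (a case of the binomial theorem) ties the two counts together: the depth limit simply truncates the sum.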

The other part of the task is testing candidate rules against the network. The method developed by Gallant operates by propagating activation intervals through the network. The first step in testing a rule using this method is to set the activations of the input units that correspond to the literals in the candidate rule. The next step is to propagate activations through the network. The key idea of this second step, however, is the assumption that input units whose activations are not specified by the rule could possibly take on any allowable value, and thus intervals of activations are propagated to the units in the next layer. Effectively, the network computes, for the examples covered by the rule, the range of possible activations in the next layer. Activation intervals are then further propagated from the hidden units to the output units. At this point, the range of possible activations for the output units can be determined and the procedure can decide whether to accept the rule or not. Although this algorithm is guaranteed to accept only rules that are valid, it may fail to accept maximally general rules, and instead may return overly specific rules. The reason for this deficiency is that, in propagating activation intervals from the hidden units onward, the procedure assumes that the activations of the hidden units are independent of one another. In most networks this assumption is unlikely to hold.

Thrun [19] developed a method called validity interval analysis (VIA) that is a generalized and more powerful version of this technique. Like Gallant's method, VIA tests rules by propagating activation intervals through a network after constraining some of the input and output units. The key difference is that Thrun frames the problem of determining validity intervals (i.e., valid activation ranges for each unit) as a linear programming problem. This is an important insight because it allows activation intervals to be propagated backward, as well as forward, through the network, and it allows arbitrary linear constraints to be incorporated into the computation of validity intervals. Backward propagation of activation intervals enables the calculation of tighter validity intervals than forward propagation alone. Thus, Thrun's method will detect valid rules that Gallant's algorithm is not able to confirm. The ability to incorporate arbitrary linear constraints into the extraction process means that the method can be used to test rules that specify very general conditions on the output units. For example, it can extract rules that describe when one output unit has a greater activation than all of the other output units. Although the VIA approach is better at detecting general rules than Gallant's algorithm, it may still fail to confirm maximally general rules, because it also assumes that the hidden units in a layer act independently.

The rule-extraction methods we have discussed so far extract rules that describe the behavior of the output units in terms of the input units. Another approach to the rule-extraction problem is to decompose the network into a collection of networks, and then to extract a set of rules describing each of the constituent networks.

There are a number of local rule-extraction methods for networks that use sigmoidal transfer functions for their hidden and output units. In these methods, the assumption is made that the hidden and output units can be approximated by threshold functions, and thus each unit can be described by a binary variable indicating whether it is on (activation ≈ 1) or off (activation ≈ 0). Given this assumption, we can extract a set of rules to describe each individual hidden and output unit in terms of the units that have weighted connections to it. The rules for each unit can then be combined into a single rule set that describes the network as a whole.

If the activations of the input and hidden units in a network are limited to the interval [0, 1], then the local approach can significantly simplify the rule search space. The key fact that simplifies the search combinatorics in this case is that the relationship between any input to a unit and its output is a monotonic one. That is, we can look at the sign of the weight connecting the ith input to the unit of interest to determine how this variable influences the activation of the unit. If the sign is positive, then we know that this input can only push the unit's activation towards 1; it cannot push it away from 1. Likewise, if the sign of the weight is negative, then the input can only push the unit's activation away from 1.
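Interval propagation of the kind used by Gallant's method can be sketched for a single sigmoid unit; this is an illustrative simplification with invented weights (a real implementation would propagate intervals layer by layer and apply a decision procedure at the outputs). Inputs fixed by a candidate rule get point intervals, free inputs get [0, 1], and monotonicity of the sigmoid yields the output interval:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def propagate_intervals(weights, bias, in_intervals):
    """Compute the interval of possible activations of one sigmoid unit,
    given an interval [lo, hi] for each input activation. Because the
    sigmoid is monotonic, the extremes of the weighted sum give the
    extremes of the output."""
    lo = bias + sum(w * (l if w > 0 else h)
                    for w, (l, h) in zip(weights, in_intervals))
    hi = bias + sum(w * (h if w > 0 else l)
                    for w, (l, h) in zip(weights, in_intervals))
    return (sigmoid(lo), sigmoid(hi))

# Hypothetical unit with three inputs. A candidate rule fixes x1 = 1 and
# x3 = 0; x2 is unspecified, so it may take any activation in [0, 1].
weights, bias = [4.0, 2.0, -3.0], -3.0
intervals = [(1, 1), (0, 1), (0, 0)]   # x1 fixed on, x2 free, x3 fixed off

lo, hi = propagate_intervals(weights, bias, intervals)
print(lo > 0.5)   # True: even in the worst case the activation exceeds 0.5,
                  # so a rule-testing procedure would accept this rule
```

Note that applying this unit-by-unit to the hidden layer and then to the outputs silently treats the hidden activations as independent, which is exactly the over-conservatism the text attributes to Gallant's method.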

Fig. 4. A search tree for the network in Fig. 1. Each node in the space represents a possible rule antecedent. Edges between nodes indicate specialization relationships (in the downward direction). Shaded nodes correspond to the extracted rules shown in Fig. 1.

Thus, if we are extracting rules to explain when the unit has an activation of 1, we need to consider ¬xi literals only for those inputs xi that have negative weights, and we need to consider nonnegated xi literals only for those inputs that have positive weights. When a search space is limited to including either xi or ¬xi, but not both, the number of rules in the space is 2^n for a task with n binary features. Recall that when this monotonicity condition does not hold, the size of the rule space is 3^n.

Fig. 4 shows a rule search space for the network in Fig. 1. The shaded nodes in the graph correspond to the extracted rules shown in Fig. 1. Note that this tree exploits the monotonicity condition, and thus does not show all possible conjunctive rules for the network.

A number of research groups have developed local rule-extraction methods that search for conjunctive rules [8,9,17]. Like the global methods described previously, the local methods developed by Fu [8] and Gallant [9] manage search combinatorics by limiting the depth of the rule search. When the monotonicity condition holds, the number of rules considered in a search of depth k is bounded above by

    Σ_{i=0}^{k} C(n, i).

There is another factor that simplifies the rule search when the monotonicity condition is true. Because the relationship between each input and the output unit in a perceptron is specified by a single parameter (i.e., the weight on the connection between the two), we know not only the sign of the input's contribution to the output, but also the possible magnitude of the contribution. This information can be used to order the search tree in a manner that can save effort. For example, when searching the rule space for the network in Fig. 1, after determining that y ← x1 is not a valid rule, we do not have to consider other rules that have only one literal in their antecedent. Since the weight connecting x1 to the output unit is larger than the weight connecting any other input unit, we can conclude that if x1 alone cannot guarantee that the output unit will have an activation of 1, then no other single input unit can do so either. Sethi and Yoo [17] have shown that, when this heuristic is employed, the number of nodes explored in the search is on the order of

    2^n / √n.

Notice that even with this heuristic, the number of nodes that might need to be visited in the search is still exponential in the number of variables.
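The effect of the monotonicity condition can be demonstrated directly (illustrative code; the weights and threshold are invented, chosen only to be consistent with the rules discussed in the text, not taken from the figure). For a threshold unit, candidate antecedents need only contain xi for positive weights and ¬xi for negative weights, so the candidate space shrinks from 3^n to 2^n:

```python
from itertools import combinations

# Hypothetical five-input threshold unit (weights/threshold invented,
# consistent with the rules discussed in the text, not from the figure).
W = [7, 5, 3, 1, -5]
THETA = -9

def valid(antecedent):
    """antecedent: set of feature indices whose sign-consistent literal is
    asserted (xi for W[i] > 0, ~xi for W[i] < 0). The rule is valid iff the
    worst-case weighted sum over the remaining free inputs stays positive."""
    total = THETA + sum(W[i] if W[i] > 0 else 0 for i in antecedent)
    free_min = sum(min(W[i], 0) for i in range(len(W)) if i not in antecedent)
    return total + free_min > 0

# Under the monotonicity condition, only 2^n candidate antecedents exist
# (each feature is either absent or present with its sign-consistent literal).
candidates = [set(c) for size in range(len(W) + 1)
              for c in combinations(range(len(W)), size)]
print(len(candidates))   # 32 = 2^5, versus 3^5 = 243 without the condition

# Keep only the maximally general valid rules (no valid proper subset):
maximal = [a for a in candidates if valid(a)
           and not any(valid(b) for b in candidates if b < a)]
print(sorted(sorted(a) for a in maximal))
# [[0, 1, 2], [0, 1, 4], [0, 2, 4]], i.e. the three rules of the text:
# y <- x1 ^ x2 ^ x3,  y <- x1 ^ x2 ^ ~x5,  y <- x1 ^ x3 ^ ~x5
```

The worst-case bound of the `valid` test also shows why no enumeration of input settings is needed here: under monotonicity, checking a single extreme point suffices.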

It can be seen that one advantage of local search-based methods, in comparison to global methods, is that the worst-case complexity of the search is less daunting. Another advantage of local methods is that the process of testing candidate rules is simpler.

A local method developed by Towell and Shavlik [20] searches not for conjunctive rules, but instead for rules that include m-of-n expressions. An m-of-n expression is a Boolean expression that is specified by an integer threshold, m, and a set of n Boolean literals. Such an expression is satisfied when at least m of its n literals are satisfied. For example, suppose we have three Boolean features, x1, x2, and x3; the m-of-n expression 2-of-{x1, ¬x2, x3} is logically equivalent to (x1 ∧ ¬x2) ∨ (x1 ∧ x3) ∨ (¬x2 ∧ x3).

There are two advantages to extracting m-of-n rules instead of conjunctive rules. The first advantage is that m-of-n rule sets are often much more concise and comprehensible than their conjunctive counterparts. The second advantage is that, when using a local approach, the combinatorics of the rule search can be simplified. The approach developed by Towell and Shavlik extracts m-of-n rules for a unit by first clustering weights and then treating weight clusters as equivalence classes. This clustering reduces the search problem from one defined by n weights to one defined by c (c < n) clusters. This approach, which assumes that the weights are fairly well clustered after training, was initially developed for knowledge-based neural networks [20], in which the initial weights of the network are specified by a set of symbolic inference rules. Since they correspond to the symbolic rules, the weights in these networks are initially well clustered, and empirical results indicate that the weights remain fairly clustered after training. The applicability of this approach was later extended to ordinary neural networks by using a special cost function for network training [5].

3.3. A learning-based rule-extraction method

In contrast to the previously discussed methods, we have developed a rule-extraction algorithm called TREPAN [4,6] that views the problem of extracting a comprehensible model from a trained network as an inductive learning problem. In this learning task, the target concept is the function represented by the network, and the hypothesis produced by the learning algorithm is a decision tree that approximates the network.

TREPAN differs from other rule-extraction methods in that it does not directly test hypothesized rules against a network, nor does it translate individual hidden and output units into rules. Instead, TREPAN's extraction process involves progressively refining a model of the entire network. The model, in this case, is a decision tree which is grown in a best-first manner.

The TREPAN algorithm, as shown in Table 1, is similar to conventional decision-tree algorithms, such as CART [3] and C4.5 [14], which learn directly from a training set. These algorithms build decision trees by recursively partitioning the input space. Each internal node in such a tree represents a splitting criterion that partitions some part of the input space, and each leaf represents a predicted class.

As TREPAN grows a tree, it maintains a queue of leaves which are expanded into subtrees as they are removed from the queue. With each node in the queue, TREPAN stores (i) a subset of the training examples, (ii) another set of instances which we shall refer to as query instances, and (iii) a set of constraints. The stored subset of training examples consists simply of those examples that reach the node. The query instances are used, along with the training examples, to select the splitting test if the node is an internal node, or to determine the class label if it is a leaf. The constraint set describes the conditions that instances must satisfy in order to reach the node; this information is used when drawing a set of query instances for a newly created node.

Although TREPAN has many similarities to conventional decision-tree algorithms, it is substantially different in a number of respects, which we detail below.

3.3.1. Membership queries and the oracle

When inducing a decision tree to describe the given network, TREPAN takes advantage of the fact that it can make membership queries. A membership query is a question to an oracle that consists of an instance
a comprehensible hypothesis from a trained network from the learners instance space. Given a membership
as an inductive learning task. In this learning task, query, the role of the oracle is to return the class label
the target concept is the function represented by the for the instance. Recall that, in this context, the target
220 M.W. Craven, J.W. Shavlik / Future Generation Computer Systems 13 (1997) 211-229

Table 1
The TREPAN algorithm

TREPAN
Input: Oracle(), training set S, feature set F, min_sample

initialize the root of the tree, R, as a leaf node

/* get a sample of instances */
use S to construct a model M_R of the distribution of instances covered by node R
q := max(0, min_sample - |S|)
query_instances_R := a set of q instances generated using model M_R

/* use the network to label all instances */
for each example x ∈ (S ∪ query_instances_R)
    class label for x := Oracle(x)

/* do a best-first expansion of the tree */
initialize Queue with tuple ⟨R, S, query_instances_R, {}⟩
while Queue is not empty and global stopping criteria not satisfied

    /* make node at head of Queue into an internal node */
    remove ⟨node N, S_N, query_instances_N, constraints_N⟩ from head of Queue
    use F, S_N, and query_instances_N to construct a splitting test T at node N

    /* make children nodes */
    for each outcome, t, of test T
        make C, a new child node of N
        constraints_C := constraints_N ∪ {T = t}

        /* get a sample of instances for the node C */
        S_C := members of S_N with outcome t on test T
        construct a model M_C of the distribution of instances covered by node C
        q := max(0, min_sample - |S_C|)
        query_instances_C := a set of q instances generated using model M_C and constraints_C
        for each example x ∈ query_instances_C
            class label for x := Oracle(x)

        /* make node C a leaf for now */
        use S_C and query_instances_C to determine the class label for C

        /* determine if node C should be expanded */
        if local stopping criteria not satisfied then
            put ⟨C, S_C, query_instances_C, constraints_C⟩ into Queue

Return: tree with root R
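For concreteness, the control flow of Table 1 can be sketched in Python. This is our own simplified illustration, not the authors' implementation: the instance-distribution models are replaced by independent resampling of training-feature values, the splitting tests are single Boolean features rather than m-of-n expressions, and the best-first queue is approximated by a FIFO queue. All function and variable names below are ours.

```python
import random
from collections import deque

def trepan_sketch(oracle, train, features, min_sample=50, max_internal=8):
    # Nodes are dicts: leaves carry 'label'; internal nodes also carry
    # 'split' (a Boolean feature name) and 'children'.
    def draw(q, constraints):
        # Stand-in for TREPAN's distribution model M: resample each feature
        # value independently from the training data, then force the values
        # fixed by the path constraints.
        instances = []
        for _ in range(q):
            x = {f: random.choice(train)[f] for f in features}
            x.update(constraints)
            instances.append(x)
        return instances

    def majority(instances):
        labels = [oracle(x) for x in instances]
        return max(set(labels), key=labels.count)

    sample = train + draw(max(0, min_sample - len(train)), {})
    root = {'label': majority(sample)}
    queue = deque([(root, sample, {})])          # FIFO stand-in for best-first
    n_internal = 0
    while queue and n_internal < max_internal:   # global stopping criterion
        node, sample, cons = queue.popleft()
        if len({oracle(x) for x in sample}) == 1:  # local stopping criterion
            continue
        free = [f for f in features if f not in cons]
        if not free:
            continue
        def purity(f):
            # how well a split on f separates the oracle's labels
            sides = ([oracle(x) for x in sample if x[f]],
                     [oracle(x) for x in sample if not x[f]])
            return sum(max(s.count(c) for c in set(s)) for s in sides if s)
        best = max(free, key=purity)
        node['split'], node['children'] = best, {}
        n_internal += 1
        for v in (True, False):
            child_cons = {**cons, best: v}
            sub = [x for x in sample if x[best] == v]
            sub += draw(max(0, min_sample - len(sub)), child_cons)
            child = {'label': majority(sub)}
            node['children'][v] = child
            queue.append((child, sub, child_cons))
    return root

def tree_predict(node, x):
    while 'split' in node:
        node = node['children'][x[node['split']]]
    return node['label']
```

With a trained network's classification function passed as `oracle`, the returned tree can be inspected directly or rendered with any decision-tree pretty-printer.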

concept we are trying to learn is the function represented by the network. Thus, the network itself serves as the oracle, and to answer a membership query it simply classifies the given instance.

The instances that TREPAN uses for membership queries come from two sources. First, the examples that were used to train the network are used as membership queries. Second, TREPAN also uses the training data to construct models of the underlying data distribution, and then uses these models to generate new instances - the query instances - for membership queries. The ability to make membership queries means that whenever TREPAN selects a splitting test for an internal node or selects a class label for a leaf, it is able to base these decisions on large samples of data.

3.3.2. Tree expansion
Unlike most decision-tree algorithms, which grow trees in a depth-first manner, TREPAN grows trees using a best-first expansion. The notion of the best node, in this case, is the one at which there is the greatest

potential to increase the fidelity of the extracted tree to the network. By fidelity, we mean the extent to which the tree agrees with the network in its classifications. The function used to evaluate node N is:

    f(N) = reach(N) × (1 - fidelity(N)),

where reach(N) is the estimated fraction of instances that reach N when passed through the tree, and fidelity(N) is the estimated fidelity of the tree to the network for those instances. The motivation for expanding an extracted tree in a best-first manner is that it gives the user a fine degree of control over the size of the tree to be returned: the tree-expansion process can be stopped at any point.

3.3.3. Splitting tests
Like some of the rule-extraction methods discussed earlier, TREPAN exploits m-of-n expressions to produce more compact extracted descriptions. Specifically, TREPAN uses a heuristic search process to construct m-of-n expressions for the splitting tests at its internal nodes in a tree.

3.3.4. Stopping criteria
TREPAN uses three criteria to decide when to stop growing an extracted tree. First, TREPAN uses a statistical test to decide if, with high probability, a node covers only instances of a single class. If it does, then TREPAN does not expand this node further. Second, TREPAN employs a parameter that allows the user to place a limit on the size of the tree that it should return. This parameter, which is specified in terms of internal nodes, gives the user some control over the comprehensibility of the tree produced by enabling a user to specify the largest tree that would be acceptable. Third, TREPAN can use a validation set, in conjunction with the size-limit parameter, to decide on the tree to be returned. Since TREPAN grows trees in a best-first manner, it can be thought of as producing a nested sequence of trees in which each tree in the sequence differs from its predecessor only by the subtree that corresponds to the node expanded at the last step. When given a validation set, TREPAN uses it to measure the fidelity of each tree in this sequence, and then returns the tree that has the highest level of fidelity to the network.

The principal advantages of the TREPAN approach, in comparison to other rule-extraction methods, are twofold. First, TREPAN can be applied to a wide class of networks. The generality of TREPAN derives from the fact that its interaction with the network consists solely of membership queries. Since answering a membership query involves simply classifying an instance, TREPAN does not require a special network architecture or training method. In fact, TREPAN does not even require that the model be a neural network. TREPAN can be applied to a wide variety of hard-to-understand models including ensembles (or committees) of classifiers that act in concert to produce predictions.

The other principal advantage of TREPAN is that it gives the user fine control over the complexity of the hypotheses returned by the rule-extraction process. This capability derives from the fact that TREPAN represents its extracted hypotheses using decision trees, and it expands these trees in a best-first manner. TREPAN first extracts a very simple (i.e., one-node) description of a trained network, and then successively refines this description to improve its fidelity to the network. In this way, TREPAN explores increasingly more complex, but higher fidelity, descriptions of the given network.

TREPAN has been used to extract rules from networks trained in a wide variety of problem domains including: gene and promoter identification in DNA, telephone-circuit fault diagnosis, exchange-rate prediction, and elevator control. Table 2 shows test-set accuracy and tree complexity results for five such problem domains. The table shows the predictive accuracy of feed-forward neural networks, decision trees extracted from the networks using TREPAN, and decision trees learned directly from the data using the C4.5 algorithm [14]. It can be seen that, for every data set, neural networks provide better predictive accuracy than the decision trees learned by C4.5. This result indicates that these are domains for which neural networks have a more suitable inductive bias than C4.5. Indeed, these problem domains were selected for this reason, since it is in cases where

Table 2
Test-set accuracy (%) and tree complexity (# feature references)

Problem domain                         Accuracy                    Tree complexity
                                       Networks  TREPAN  C4.5      TREPAN  C4.5
Protein-coding region recognition      94.1      93.1    90.4      70.5    153.3
Heart-disease diagnosis                84.5      83.2    74.6      24.4    15.5
Promoter recognition                   90.6      87.4    85.0      105.5   7.0
Telephone-circuit fault diagnosis      65.3      63.3    60.7      26.3    35.0
Exchange-rate prediction               61.6      60.6    54.6      14.0    53.0

neural networks provide superior predictive accuracy to symbolic learning approaches that it makes sense to apply a rule-extraction method. Moreover, for all five domains, the trees extracted from the neural networks by TREPAN are more accurate than the C4.5 trees. This result indicates that in a wide range of problem domains in which neural networks provide better predictive accuracy than conventional decision-tree algorithms, TREPAN is able to extract decision trees that closely approximate the hypotheses learned by the networks, and thus provide superior predictive accuracy to trees learned directly by algorithms such as C4.5.

The two rightmost columns in Table 2 show tree complexity measurements for the trees produced by TREPAN and C4.5 in these domains. The measure of complexity used here is the number of feature references used in the splitting tests in the trees. An ordinary, single-feature splitting test, like those used by C4.5, is counted as one feature reference. An m-of-n test, like those used at times by TREPAN, is counted as n feature references, since such a split lists n feature values. We contend that this measure of syntactic complexity is a good indicator of the comprehensibility of trees. The results in this table indicate that, in general, the trees produced by the two algorithms are roughly comparable in terms of size. The results presented in this table are described in greater detail elsewhere [4].

3.4. Finite state automata extraction methods

One specialized case of rule extraction is the extraction of finite state automata (FSA) from recurrent neural networks. A recurrent network is one that has links from a set of its hidden or output units to a set of its input units. Such links enable recurrent networks to maintain state information from one input instance to the next. Like an FSA, each time a recurrent network is presented with an instance, it calculates a new state which is a function of both the previous state and the given instance. A state in a recurrent network is not a predefined, discrete entity, but instead corresponds to a vector of activation values across the units in the network that have outgoing recurrent connections - the so-called state units. Another way to think of such a state is as a point in an s-dimensional, real-valued space defined by the activations of the s state units.

Recurrent networks are usually trained on sequences of input vectors. In such a sequence, the order in which the input vectors are presented to the network represents a temporal order, or some other natural sequential order. As a recurrent network processes such an input sequence, its state-unit activations trace a path in the s-dimensional state-unit space. If similar input sequences produce similar paths, then the continuous-state space can be closely approximated by a finite state space in which each state corresponds to a region, as opposed to a point, in the space. This idea is illustrated in Fig. 5, which shows an FSA and the two-dimensional state-unit space of a recurrent network trained to accept the same strings as the FSA. The path traced in the space illustrates the state changes of the state-unit activations as the network processes a sequence of inputs. The nonshaded regions of the space correspond to the states of the FSA.

Several research groups have developed algorithms for extracting FSA from trained recurrent networks. The key issue in such algorithms is deciding how to partition the s-dimensional real-valued space into a set of discrete states. The method of Giles et al. [10], which is representative of this class of algorithms,

Fig. 5. The correspondence between a recurrent network and an FSA. Depicted on the left is a recurrent network that has three input
units and two state units. The two units that are to the right of the input units represent the activations of the state units at time t - 1.
Shown in the middle is the two-dimensional, real-valued space defined by the activations of the two hidden units. The path traced
in the space illustrates the state changes of the hidden-unit activations as the network processes some sequence of inputs. Each of
the three arrow styles represents one of the possible inputs to the recurrent network. Depicted on the right is a finite state automaton
that corresponds to the network when the state space is discretized as shown in the middle figure. The shade of each node in the
FSA represents the output value produced by the network when it is in the corresponding state.
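The discretization step at the heart of these FSA-extraction algorithms can be illustrated with a small sketch. This is our own toy illustration (the names and data structures are not from Giles et al.): each s-dimensional activation vector, with values in [0, 1], is mapped to one of q^s equal-width grid cells, and the observed cell-to-cell transitions are recorded as a transition table.

```python
def discretize(state, q):
    # Map an s-dimensional activation vector (values in [0, 1]) to one of
    # q**s grid cells, identified by a tuple of interval indices.
    return tuple(min(int(a * q), q - 1) for a in state)

def extract_transitions(runs, q):
    # runs: sequences of (input_symbol, state_vector) pairs recorded while
    # the recurrent network processes its training strings.
    # Returns a map from (cell, input_symbol) to the successor cell.
    transitions = {}
    for run in runs:
        prev = None
        for symbol, state in run:
            cell = discretize(state, q)
            if prev is not None:
                transitions[(prev, symbol)] = cell
            prev = cell
    return transitions
```

The resulting transition table, together with the output value the network produces in each visited cell, is the raw automaton that standard FSA-minimization algorithms can then reduce.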

proceeds as follows. First, the algorithm partitions each state unit's activation range into q intervals of equal width, thus dividing the s-dimensional space into q^s partitions. The method initially sets q = 2, but increases its value if it cannot extract an FSA that correctly describes the network's training set. The next step is to run the input sequences through the network, keeping track of (i) the state transitions, (ii) the input vector associated with each transition, and (iii) the output value produced by the network. It is then a simple task to express this record of the network's processing as an FSA. Finally, the FSA can be minimized using standard algorithms.

Recently, this approach has been applied to the task of exchange-rate prediction. Lawrence et al. [13] trained recurrent neural networks to predict the daily change in five foreign exchange rates, and then extracted FSA from these networks in order to characterize the learned models. More specifically, the task they addressed was to predict the next (log-transformed) change in a daily exchange rate, x(t + 1), given the previous four values of the same time series:

    X(t) = (x(t), x(t - 1), x(t - 2), x(t - 3)).

Their solution to this task involves three main components. The first component is a neural network called a self-organizing map (SOM) [12], which is trained by an unsupervised learning process. An SOM learns a mapping from its input space to its output space that preserves the topological ordering of points in the input space. That is, the similarity of points in the input space, as measured using a metric, is preserved in the output space. In their exchange-rate prediction architecture, Lawrence et al. [13] used SOMs to map from a continuous, four-dimensional input space into a discrete output space. The input space, in this case, represents X(t), and the output space represents a three-valued discrete variable that characterizes the trend in the exchange rate.

The second component of the system is a neural network that has a set of recurrent connections from each of its hidden units to all of the other hidden units. The input to the recurrent network is a three-dimensional vector consisting of the last three discrete values output by the SOM. The output of the network is the predicted probabilities that the next daily movement of the exchange rate will be upward or downward. In other words, the recurrent network learns a mapping from the SOM's discrete characterization of the time series to the predicted direction of the next value in the time series.

The third major part of the system is the rule-extraction component. Using the method described above, FSA are extracted from the recurrent networks. The states in the FSA correspond to regions in the space of activations of the state units. Each state is

labeled by the corresponding network prediction (up or down), and each state transition is labeled by the value of the discrete variable that characterizes the time series at time t.

After extracting automata from the recurrent networks, Lawrence et al. compared their predictive accuracy to that of the neural networks and found that the FSA were only slightly less accurate. On average, the accuracy of the recurrent networks was 53.4%, and the accuracy of the FSA was 53.1% (both of which are statistically distinguishable from random guessing).

3.5. Discussion

As we stated at the beginning of this section, there are three primary dimensions along which rule-extraction methods differ: representation language, extraction strategy, and network requirements. The algorithms that we have discussed in this section provide some indication of the diversity of rule-extraction methods with respect to these three aspects.

The representation languages used by the methods we have covered include conjunctive inference rules, m-of-n inference rules, decision trees with m-of-n tests, and finite state automata. In addition to these representations, there are rule-extraction methods that use fuzzy rules, rules with confidence factors, majority-vote rules, and condition/action rules that perform rewrite operations on string-based inputs. This multiplicity of languages is due to several factors. One factor is that different representation languages are well suited for different types of networks and tasks. A second reason is that researchers in the field have found that it is often hard to concisely describe the concept represented by a neural network to a high level of fidelity. Thus, some of the described representations, such as m-of-n rules, have gained currency because they often help to simplify extracted representations.

The extraction strategies employed by various algorithms also exhibit similar diversity. As discussed earlier, one aspect of extraction strategy that distinguishes methods is whether they extract global or local rules. Recall that global methods produce rules which describe a network as a whole, whereas local methods extract rules which describe individual hidden and output units in the network. Another key aspect of extraction strategies is the way in which they explore a space of rules. In this section we described (i) methods that use search-like procedures to explore rule spaces, (ii) a method that iteratively refines a decision-tree description of a network, and (iii) a method that extracts FSA by first clustering unit activations and then mapping the clusters into an automaton. In addition to these rule-exploration strategies, there are also algorithms that extract rules by matching the network's weight vectors against templates representing canonical rules, and methods that are able to directly map hidden units into rules when the networks use transfer functions, such as radial basis functions, that respond to localized regions of their input space.

Another key dimension we have considered is the extent to which methods place requirements on the networks to which they can be applied. Some methods require that a special training procedure be used for the network. Other methods impose restrictions on the network architecture, or require that hidden units use sigmoidal transfer functions. Some of the methods we have discussed place restrictions on both the network's architecture and its training regime. Another limitation of many rule-extraction methods is that they are designed for problems that have only discrete-valued features. The trade-off that is involved in these requirements is that, although they may simplify the rule-extraction process, they reduce the generality of the rule-extraction method.

Readers who are interested in more detailed descriptions of these rule-extraction methods, as well as pointers to the literature, are referred elsewhere [1,4].

4. Methods that learn simple hypotheses

The previous section discussed methods that are designed to extract comprehensible hypotheses from trained neural networks. An alternative approach to data mining with neural networks is to use learning methods that directly learn comprehensible hypotheses by producing simple neural networks. Although we have assumed in our discussion so far that the hypotheses learned by neural networks are incomprehensible,

Table 3
The BBP algorithm

BBP
Input: training set S of m examples, set C of candidate inputs that map to {-1, +1},
       number of iterations T

/* set the initial distribution to be uniform */
for all x ∈ S
    D_1(x) := 1/m

for t := 1 to T do
    /* add another feature */
    h_t := argmax_{c_i ∈ C} | E_{D_t}[f(x) c_i(x)] |
    /* determine the error of this feature */
    ε_t := 0
    for all x ∈ S
        if h_t(x) ≠ f(x) then ε_t := ε_t + D_t(x)
    /* update the distribution */
    β_t := ε_t / (1 - ε_t)
    for all x ∈ S
        if h_t(x) = f(x) then
            D_{t+1}(x) := β_t D_t(x)
        else
            D_{t+1}(x) := D_t(x)
    /* re-normalize the distribution */
    Z_t := Σ_x D_{t+1}(x)
    for all x ∈ S
        D_{t+1}(x) := D_{t+1}(x) / Z_t

Return: h(x) = sign( Σ_{t=1}^{T} ln(1/β_t) h_t(x) )
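A direct Python transcription of Table 3 might look like the following sketch. The helper names are ours, candidate inputs are represented as callables returning -1 or +1, and the zero-error case is guarded in an ad hoc way that the table does not specify.

```python
import math

def bbp_sketch(examples, labels, candidates, T):
    # examples: list of instances; labels: target values f(x) in {-1, +1}
    # candidates: callables c(x) -> {-1, +1}
    # Returns the chosen (vote weight, input) pairs and a predictor.
    m = len(examples)
    D = [1.0 / m] * m                        # D_1: uniform distribution
    chosen = []
    for _ in range(T):
        # h_t := the candidate whose correlation with f under D_t
        # has the greatest magnitude
        h = max(candidates,
                key=lambda c: abs(sum(d * y * c(x)
                                      for d, x, y in zip(D, examples, labels))))
        eps = sum(d for d, x, y in zip(D, examples, labels) if h(x) != y)
        if eps == 0:
            # guard not in Table 3: zero weighted error, stop with a large vote
            chosen.append((math.log(1e12), h))
            break
        beta = eps / (1.0 - eps)
        chosen.append((math.log(1.0 / beta), h))   # vote weight ln(1/beta)
        # down-weight the examples h_t got right, then re-normalize
        D = [d * beta if h(x) == y else d
             for d, x, y in zip(D, examples, labels)]
        Z = sum(D)
        D = [d / Z for d in D]

    def predict(x):
        total = sum(a * h(x) for a, h in chosen)
        return 1 if total > 0 else -1        # the perceptron's sign function
    return chosen, predict
```

Run on the 2-of-3 majority function with the three coordinate inputs as candidates, three rounds suffice to reproduce the target exactly.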

the methods we present in this section are different in that they learn networks that have a single layer of weights. In contrast to multi-layer networks, the hypotheses represented by single-layer networks are usually much easier to understand because each parameter describes a simple (i.e., linear) relationship between an input feature and an output category.

4.1. A supervised method

There is a wide variety of methods for learning single-layer neural networks in a supervised learning setting. In this section we focus on one particular algorithm that is appealing for data-mining applications because it incrementally constructs its learned networks. This algorithm, called BBP [11], is unlike traditional neural-network methods in that it does not involve training with a gradient-based optimization method. The hypotheses it learns, however, are perceptrons, and thus we consider it to be a neural-network method.

The BBP algorithm is shown in Table 3. The basic idea of the method is to repeatedly add new input units to a learned hypothesis, using different probability distributions over the training set to select each one. Because the algorithm adds weighted inputs to hypotheses incrementally, the complexity of these hypotheses can be easily controlled.

The inputs incorporated by BBP into a hypothesis represent Boolean functions that map to {-1, +1}. In other words, the inputs are binary units that have an activation of either -1 or +1. These inputs may correspond directly to Boolean features, or they may represent tests on nominal or numerical features (e.g., color = red, x1 > 0.5), or logical combinations

Table 4
Test-set accuracy (%) and hypothesis complexity (# features used)

Problem domain                         Accuracy                   Hypothesis complexity
                                       Networks  BBP   C4.5       Networks  BBP  C4.5
Protein-coding region recognition      93.6      93.6  84.9       464       171  150
Promoter recognition                   90.6      92.7  85.0       57        30   10
Splice-junction recognition            95.4      94.6  94.5       60        56   22

of features (e.g., [color = red] ∧ [shape = round]). Additionally, the algorithm may also incorporate an input representing the identically true function. The weight associated with such an input corresponds to the threshold of the perceptron.

On each iteration of the BBP algorithm, an input is selected from the pool of candidates and added to the hypothesis under construction. BBP measures the correlation of each input with the target function being learned, and then selects the input whose correlation has the greatest magnitude. The correlation between a given candidate and the target function varies from iteration to iteration because it is measured with respect to a changing distribution over the training examples. Initially, the BBP algorithm assumes a uniform distribution over the training set. That is, when selecting the first input to be added to a perceptron, BBP assigns equal importance to the various instances of the training set. After each input is added, however, the distribution is adjusted so that more weight is given to the examples that the input did not correctly predict. In this way, the learner's attention is focused on those examples that the current hypothesis does not explain well.

The algorithm stops adding weighted inputs to the hypothesis after a pre-specified number of iterations has been reached, or after the training-set error has been reduced to zero. Since only one input is added to the network on each iteration, the size of the final perceptron can be controlled by limiting the number of iterations. The hypothesis returned by BBP is a perceptron in which the weight associated with each input is a function of the error of the input. The perceptron uses the sign function to decide which class to return:

    sign(x) =  1 if x > 0,
              -1 if x ≤ 0.

The BBP algorithm has two primary limitations. First, it is designed for learning binary classification tasks. The algorithm can be applied to multi-class learning problems, however, by learning a perceptron for each class. The other limitation of the method is that it assumes that the inputs are Boolean functions. As discussed above, however, domains with real-valued features can be handled by discretizing the features.

The BBP method is based on an algorithm called AdaBoost [7], which is a hypothesis-boosting algorithm. Informally, a boosting algorithm learns a set of constituent hypotheses and then combines them into a composite hypothesis in such a way that, even if each of the constituent hypotheses is only slightly more accurate than random guessing, the composite hypothesis has an arbitrarily high level of accuracy on the training data. In short, a set of weak hypotheses is boosted into a strong one. This is done by carefully determining the distribution over the training data that is used for learning each weak hypothesis. The weak hypotheses in a BBP perceptron are simply the individual inputs. Although the more general AdaBoost algorithm can use arbitrarily complex functions as its weak hypotheses, BBP uses very simple functions for its weak hypotheses in order to facilitate comprehensibility.
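As a small illustration of the discretization workaround mentioned above, real-valued features can be turned into the Boolean {-1, +1} candidate inputs that BBP expects by thresholding. The sketch below is our own (the function name, the evenly spaced cut points, and the `cuts` parameter are assumptions, not part of the BBP algorithm).

```python
def threshold_candidates(examples, n_features, cuts=3):
    # Build Boolean {-1, +1} inputs of the form [x_j > theta], with
    # thresholds taken at evenly spaced positions in the sorted values
    # of each feature.
    candidates = []
    for j in range(n_features):
        values = sorted(x[j] for x in examples)
        for k in range(1, cuts + 1):
            theta = values[k * len(values) // (cuts + 1)]
            candidates.append(
                lambda x, j=j, theta=theta: 1 if x[j] > theta else -1)
    return candidates
```

The resulting candidate pool can be passed directly to a BBP-style learner, which will then select the few thresholds that correlate best with the target.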

The three rightmost columns of Table 4 show one measure of the complexity of the models learned by the three methods - the total number of features incorporated into their hypotheses. These results illustrate that, like decision-tree algorithms, the BBP algorithm is able to selectively incorporate input features into its hypotheses. Thus, in many cases, the BBP hypotheses use significantly fewer features, and have significantly fewer weights, than ordinary multi-layer networks. It should also be emphasized that multi-layer networks are usually much more difficult to interpret than BBP hypotheses because their weights may encode nonlinear, nonmonotonic relationships between the input features and the class predictions.

In summary, these results suggest the utility of the BBP algorithm for data-mining tasks: it provides good predictive accuracy on a variety of interesting, real-world tasks, and it produces syntactically simple hypotheses, thereby facilitating human comprehension of what it has learned. Additional details concerning these experiments can be found elsewhere [4].

4.2. An unsupervised method

As stated in Section 1, unsupervised learning involves the use of inductive methods to discover regularities that are present in a data set. Although there is a wide variety of neural-network algorithms for unsupervised learning, we discuss only one of them here: competitive learning [15]. Competitive learning is arguably the unsupervised neural-network algorithm that is most appropriate for data mining, and it is illustrative of the utility of single-layer neural-network methods.

The learning task addressed by competitive learning is to partition a given set of training examples into a finite set of clusters. The clusters should represent regularities present in the data such that similar examples are mapped into similar classes.

The variant of competitive learning that we consider here, which is sometimes called simple competitive learning, involves learning in a single-layer network. The input units in such a network represent the relevant features of the problem domain, and the k output units represent the k classes into which examples are clustered.

The net input to each output unit in this method is a linear combination of the input activations:

    net_i = Σ_j w_ij a_j.

Here, a_j is the activation of the jth input unit, and w_ij is the weight linking the jth input unit to the ith output. The name competitive learning is derived from the process used to determine the activations of the output units. The output unit that has the greatest net input is deemed the winner, and its activation is set to one. The activations of the other output units are set to zero:

    a_i = 1 if Σ_j w_ij a_j > Σ_j w_hj a_j for all other units h ≠ i,
          0 otherwise.

The training process for competitive learning involves minimizing the cost function:

    C = (1/2) Σ_i Σ_j a_i (a_j - w_ij)²,

where a_i is the activation of the ith output unit, a_j the activation of the jth input unit, and w_ij is the weight from the jth input unit to the ith output unit. The update rule for the weights is then:

    Δw_ij = -η ∂C/∂w_ij = η a_i (a_j - w_ij),

where η is a learning-rate parameter.

The basic idea of competitive learning is that each output unit takes responsibility for a subset of the training examples. Only one output unit is the winner for a given example, and the weight vector for the winning unit is moved towards the input vector for this example. As training progresses, therefore, the weight vector of each output unit moves towards the centroid of the examples for which the output has taken responsibility. After training, each output unit represents a cluster of examples, and the weight vector for the unit corresponds to the centroid of the cluster.

Competitive learning is closely related to the statistical method known as k-means clustering. The
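With a_i taken to be 1 for the winning output unit and 0 for all others, the update rule above moves only the winner's weight vector a fraction η of the way towards each presented example. The following minimal sketch (our own illustration, not code from the article; the function name and parameters are invented) shows simple competitive learning applied to two well-separated clusters:

```python
import random

def simple_competitive_learning(examples, k, eta=0.1, epochs=20, seed=0):
    # One weight vector per output unit; winner-take-all updates.
    rng = random.Random(seed)
    examples = [list(x) for x in examples]
    # Initialize each output unit's weights to a randomly chosen example.
    weights = [list(rng.choice(examples)) for _ in range(k)]
    for _ in range(epochs):
        rng.shuffle(examples)  # on-line: present one example at a time
        for x in examples:
            # The winner is the output unit whose weight vector is closest to x.
            winner = min(range(k),
                         key=lambda i: sum((xj - wj) ** 2
                                           for xj, wj in zip(x, weights[i])))
            # Update rule with a_i = 1 for the winner, 0 otherwise:
            # w_ij <- w_ij + eta * (x_j - w_ij)
            weights[winner] = [wj + eta * (xj - wj)
                               for xj, wj in zip(x, weights[winner])]
    return weights

# Two well-separated clusters; each learned weight vector should end up
# near the centroid of one cluster.
data = [(0.0, 0.1), (0.1, 0.0), (0.1, 0.1),
        (1.0, 0.9), (0.9, 1.0), (0.9, 0.9)]
centroids = sorted(simple_competitive_learning(data, k=2), key=lambda w: w[0])
```

Because the weights are adjusted after every example rather than after a full pass through the data, the sketch is an on-line procedure, in keeping with the discussion below.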
The principal difference between the two methods is that competitive learning is an on-line algorithm, meaning that during training it updates the network's weights after every example is presented, instead of after all of the examples have been presented. The on-line nature of competitive learning makes it more suitable for very large data sets, since on-line algorithms usually converge to a solution faster in such cases.

5. Conclusion

We began this article by arguing that neural-network methods deserve a place in the tool box of the data miner. Our argument rests on the premise that, for some problems, neural networks have a more suitable inductive bias (i.e., they do a better job of learning the target concept) than other commonly used data-mining methods. However, neural-network methods are thought to have two limitations that make them poorly suited to data-mining tasks: their learned hypotheses are often incomprehensible, and training times are often excessive. As the discussion in this article shows, however, there is a wide variety of neural-network algorithms that avoid one or both of these limitations.

Specifically, we discussed two types of approaches that use neural networks to learn comprehensible models. First, we described rule-extraction algorithms. These methods promote comprehensibility by translating the functions represented by trained neural networks into languages that are easier to understand. A broad range of rule-extraction methods has been developed. The primary dimensions along which these methods vary are their (i) representation languages, (ii) strategies for mapping networks into the representation language, and (iii) the range of networks to which they are applicable.

In addition to rule-extraction algorithms, we described both supervised and unsupervised methods that directly learn simple networks. These networks are often humanly interpretable because they are limited to a single layer of weighted connections, thereby ensuring that the relationship between each input and each output is a simple one. Moreover, some of these methods, such as BBP, have a bias towards incorporating relatively few weights into their hypotheses.

We have not attempted to provide an exhaustive survey of the available neural-network algorithms that are suitable for data mining. Instead, we have described a subset of these methods, selected to illustrate the breadth of relevant approaches as well as the key issues that arise in applying neural networks in a data-mining setting. It is our hope that our discussion of neural-network approaches will serve to inspire some interesting applications of these methods to challenging data-mining problems.

Acknowledgements

The authors have been partially supported by ONR grant N00014-93-1-0998 and NSF grant IRI-9502990. Mark Craven is currently supported by DARPA grant F33615-93-1-1330.

References

[1] R. Andrews, J. Diederich and A.B. Tickle, A survey and critique of techniques for extracting rules from trained artificial neural networks, Knowledge-Based Syst. 8 (6) (1995).
[2] C.M. Bishop, Neural Networks for Pattern Recognition (Oxford University Press, Oxford, UK, 1996).
[3] L. Breiman, J. Friedman, R. Olshen and C. Stone, Classification and Regression Trees (Wadsworth and Brooks, Monterey, CA, 1984).
[4] M.W. Craven, Extracting comprehensible models from trained neural networks, Ph.D. Thesis, Computer Sciences Department, University of Wisconsin, Madison, WI, 1996; available as CS Technical Report 1326; available by WWW as ftp://ftp.cs.wisc.edu/machine-learning/shavlik-group/craven.thesis.ps.Z.
[5] M. Craven and J. Shavlik, Learning symbolic rules using artificial neural networks, in: Proc. 10th Internat. Conf. on Machine Learning, Amherst, MA (Morgan Kaufmann, Los Altos, CA, 1993) 73-80.
[6] M.W. Craven and J.W. Shavlik, Extracting tree-structured representations of trained networks, in: Advances in Neural Information Processing Systems, eds. D. Touretzky, M. Mozer and M. Hasselmo, Vol. 8 (MIT Press, Cambridge, MA, 1996).
[7] Y. Freund and R.E. Schapire, Experiments with a new boosting algorithm, in: Proc. 13th Internat. Conf. on Machine Learning, Bari, Italy (Morgan Kaufmann, Los Altos, CA, 1996) 148-156.
[8] L. Fu, Rule learning by searching on adapted nets, in: Proc. 9th National Conf. on Artificial Intelligence, Anaheim, CA (AAAI, Menlo Park, CA/MIT Press, Cambridge, MA, 1991) 590-595.
[9] S.I. Gallant, Neural Network Learning and Expert Systems (MIT Press, Cambridge, MA, 1993).
[10] C.L. Giles, C.B. Miller, D. Chen, H.H. Chen, G.Z. Sun and Y.C. Lee, Learning and extracting finite state automata with second-order recurrent neural networks, Neural Comput. 4 (1992) 393-405.
[11] J.C. Jackson and M.W. Craven, Learning sparse perceptrons, in: Advances in Neural Information Processing Systems, eds. D. Touretzky, M. Mozer and M. Hasselmo, Vol. 8 (MIT Press, Cambridge, MA, 1996).
[12] T. Kohonen, Self-Organizing Maps (Springer, Berlin, 1995).
[13] S. Lawrence, C.L. Giles and A.C. Tsoi, Symbolic conversion, grammatical inference and rule extraction for foreign exchange rate prediction, in: Neural Networks in the Capital Markets, eds. Y. Abu-Mostafa, A.S. Weigend and P.N. Refenes (World Scientific, Singapore, 1997).
[14] J. Quinlan, C4.5: Programs for Machine Learning (Morgan Kaufmann, San Mateo, CA, 1993).
[15] D.E. Rumelhart and D. Zipser, Feature discovery by competitive learning, Cognitive Sci. 9 (1985) 75-112.
[16] K. Saito and R. Nakano, Medical diagnostic expert system based on PDP model, in: Proc. IEEE Internat. Conf. on Neural Networks, San Diego, CA (IEEE Press, New York, 1988) 255-262.
[17] I.K. Sethi and J.H. Yoo, Symbolic approximation of feedforward neural networks, in: Pattern Recognition in Practice, eds. E.S. Gelsema and L.N. Kanal, Vol. 4 (North-Holland, New York, 1994).
[18] J. Shavlik, R. Mooney and G. Towell, Symbolic and neural net learning algorithms: An empirical comparison, Mach. Learning 6 (1991) 111-143.
[19] S. Thrun, Extracting rules from artificial neural networks with distributed representations, in: Advances in Neural Information Processing Systems, eds. G. Tesauro, D. Touretzky and T. Leen, Vol. 7 (MIT Press, Cambridge, MA, 1995).
[20] G. Towell and J. Shavlik, Extracting refined rules from knowledge-based neural networks, Mach. Learning 13 (1) (1993) 71-101.
[21] B. Widrow, D.E. Rumelhart and M.A. Lehr, Neural networks: Applications in industry, business, and science, Commun. ACM 37 (3) (1994) 93-105.

Mark Craven is currently a Postdoctoral Research Fellow in the Department of Computer Science at Carnegie Mellon University. He received the M.S. and Ph.D. in computer science from the University of Wisconsin in 1991 and 1996, respectively, and his B.A. in political science from the University of Colorado in 1987. His Ph.D. thesis research focused on extracting comprehensible models from trained neural networks. His other research interests include machine-learning methods for extracting structured symbolic representations from text and hypertext, and machine-learning applications in molecular biology.

Jude Shavlik is an Associate Professor in Computer Sciences at the University of Wisconsin - Madison, where he has been on the faculty since 1988. He received B.S.s in electrical engineering and in biology from MIT in 1979. In 1980 he received an M.S. in biophysics from Yale University. After working for the Mitre Corporation for several years he received his Ph.D. in computer science from the University of Illinois in 1988. His research interests include machine learning, neural networks, information retrieval, and computational biology. He is editor-in-chief of the AI Magazine and serves on the editorial board of several journals.
