
Adaptive Neuro-Fuzzy Inference with Statistical Distance Metrics and

Hybrid Learning algorithms for Data Mining of Symbolic Data

S. PAPADIMITRIOU 1,2, S. MAVROUDI 1, L. VLADUTU 1, G. PAVLIDES 2, A. BEZERIANOS 1

1. Department of Medical Physics, School of Medicine, University of Patras,

26500 Patras, Greece, tel: +30-61-996115,

email: bezer@heart.med.upatras.gr

2. Dept. of Computer Engineering and Informatics,

University of Patras, 26500 Patras, Greece, tel: +30-61-997804,

email: stergios@heart.med.upatras.gr

Abstract The application of neuro-fuzzy systems to domains involving prediction and classification of symbolic data
requires a reconsideration and a careful definition of the concept of distance between patterns. Traditional distances are
inadequate for capturing the proximity between symbolic patterns. This work proposes a new neuro-fuzzy architecture,
the Symbolic Adaptive Neuro Fuzzy Inference System (SANFIS), that effectively utilizes a statistically extracted distance
measure. The learning approach is a hybrid one, consisting of a sequence of steps, some of which are essential and some
of which further optimize the performance. Initially, a Statistical Distance Metric space is computed from the information
provided by the training set. The premise parameters are subsequently evaluated with a three-phase Instance Based
Learning (IBL) scheme that estimates the input membership function centers and spreads and constructs the
corresponding fuzzy rules. The first phase of the IBL scheme explores heuristic approaches that can uncover information
about the relative importance and reliability of the examples. The second phase exploits this information and extracts an
adequate subset of the training patterns for the construction of the fuzzy rules. The concept of fuzzy adaptive subsethood
is used in the third phase to reduce the number of the fuzzy sets used as input membership functions. The consequent
parameters are estimated with an efficient linear least squares formulation. The performance obtained from SANFIS
trained with the hybrid learning methods is significantly better than that of traditional nearest neighbour Instance Based
Learning schemes on many data mining problems, and the system offers enhanced explanation ability.

Keywords: Neuro-fuzzy Learning, Data Mining, Symbolic Data Classification, Radial Basis Functions, Heuristic
Learning, Instance Based Learning.

1 Introduction

The emergence of neuro-fuzzy network technology [1,4] offers valuable tools for confronting complicated data
mining problems. In this context, neuro-fuzzy networks can be viewed as advanced mathematical models for
discovering complex dependencies between the variables of physical processes from a set of perturbed
observations. In designing a neuro-fuzzy network model of a complex physical process, we are in effect
building a nonlinear model of the process that generates the attribute sets and the corresponding
outcomes [2,3,4]. However, although the application of neuro-fuzzy networks has proven very successful in
problem domains such as signal processing and pattern recognition, it has not yielded adequate performance in
the domain of data mining of symbolic data. Patterns arising both from commercial databases and from many
engineering databases (such as those describing biosequences [10]) involve data defined over a space that lacks
the fundamental properties of distance metric spaces. To confront these difficulties, this paper presents the
Symbolic Adaptive Neuro Fuzzy Inference System (SANFIS) architecture and its learning algorithms.

This work first adapts, to the peculiarities of SANFIS, a distance metric for expressing the distance between
values of features in symbolic domains. This metric was initially proposed in the context of nearest neighbor
schemes as a means of effectively capturing the proximity information of symbolic patterns [7,14,16]. The data
mining techniques that the paper presents have a wide span of possible applications, since they are adapted to
heterogeneous data records having both symbolic and numeric attributes. For the symbolic attributes, the
Statistical Distance Metric adapted from the Modified Value Difference Metric (MVDM) [7,16] has yielded the
best performance. However, for the numeric attributes the optimal policy for evaluating the distance is not
obvious. In many cases a generalization improvement is obtained by handling the numeric values with a
normalized numeric distance, e.g. a Euclidean distance metric normalized by the standard deviation.
Nevertheless, numeric attribute domains are frequently better handled with a discretized numeric distance
metric augmented with an interpolation that alleviates the discretization effects (e.g. the Interpolated Value
Difference Metric [16]).

After the computation of a Statistical Distance metric space for the symbolic attributes the main steps of the
presented approach for the construction of the SANFIS system for data mining can be summarized as follows:

• Initial estimation of input Membership Functions (MFs) and rules: Estimation of the reliability and
importance of each training example with Instance Based Learning techniques. The outcome of this
learning phase is an ordering of the examples in terms of their significance and reliability. Weight
parameters are computed that quantify the importance of the examples. The SANFIS network can be
designed to exploit effectively the irregularity of the problem's state space through the selection of the
proper training examples as rule centers and the determination of parameters for each such rule center.
These parameters account for the significance and reliability of the corresponding example.
We describe three different Instance Based Learning (IBL) algorithms for the implementation of this
learning step.

• Incremental construction of the fuzzy rule base: Each example is eligible to form a fuzzy rule, with the
spreading of its membership functions determined in accordance with the significance and reliability
of the example. The fuzzy system is constructed incrementally by considering each example in the
order of its significance and forming a new fuzzy rule only if the currently available fuzzy system is
inadequate to explain the considered example.

• Merging of MFs: The number of membership functions along each dimension is reduced according to a
measure of fuzzy mutual subsethood and the fuzzy rule base is modified according to the reduced set of
membership functions.

• Computation of consequent parameters: The consequent parameters are estimated with an efficient
linear least squares formulation that can be solved with the stable Singular Value Decomposition
algorithm.

The paper proceeds as follows: Section 2 briefly outlines the structure of SANFIS. Section 3 presents the
proposed Statistical Distance Metric (SDM), which has proven quite effective for coping with symbolic
attributes. Section 4 discusses how the SDM is fitted into the context of SANFIS. Section 5 describes the three
aforementioned Instance Based Learning phases for the construction of SANFIS, each one in its own
subsection; a fourth subsection deals with the estimation of the consequent parameters via a linear least squares
formulation. Section 6 discusses the results obtained by applying the new algorithms to the UCI data sets and
compares them with the performance of some nearest neighbor approaches. Finally, in Section 7 the
conclusions are presented along with directions for future work.

2. The architecture of the NeuroFuzzy network

The search for a proper neurofuzzy architecture for adaptive training with symbolic data has led to a
structure similar to the ANFIS system [23], called the Symbolic Adaptive Neuro Fuzzy Inference System
(SANFIS). It is important to emphasize beforehand that SANFIS retains the efficiency of handling
numeric attributes and extends it to the symbolic domain. The SANFIS architecture is presented in Figure 1,
where the usual convention of denoting a node with parameters by a square and a node without parameters by a
circle is adopted. Specifically, the layers of the neurofuzzy inference network have the following functions.

Layer 1 The input layer. This layer does not perform any processing of the inputs (it simply buffers the values
for input to the next layer).

Layer 2 The membership function layer. For numeric features it performs the function

$$O_i^1(x) = \mu_{A_i}(x)$$

where we use Gaussian membership functions, i.e. $\mu_{A_i}(x) = \exp\left(-\left(\frac{x - m_i}{\sigma_i}\right)^2\right)$, and $m_i, \sigma_i$ are the premise
parameters that correspond to the centers and spreads of the Gaussians. We should note here that for numeric
features, $x$ is simply the numeric value as it is fed from Layer 1.

However, for symbolic features the statistical distance $SD(x, m_i)$ between the input feature value $x$ and the
symbolic value $m_i$ that forms the center of the corresponding MF is computed. The distance $SD(x, m_i)$ and
the spreading $\sigma_i$ are computed according to the methods described in the next section. For symbolic features,
Layer 2 of SANFIS evaluates the following Gaussian membership function:

$$\mu_{A_i}(x) = \exp\left(-\left(\frac{SD(x, m_i)}{\sigma_i}\right)^2\right).$$

Layer 3 The conjunction layer computes the firing strength of each rule. The algebraic product is used as the
conjunction operator. Each node of this layer is a circle node labeled $\Pi$ and forms the product of the
incoming signals, i.e. for Figure 1, $w_i = \mu_{A_i}(x_1) \cdot \mu_{B_i}(x_2)$, $i = 1, 2$.

Layer 4 The normalization layer normalizes the firing strengths of the rules by dividing each by the sum of these
strengths, i.e. $\bar{w}_i = \frac{w_i}{w_1 + w_2}$, $i = 1, 2$. The output values of this layer are called normalized firing strengths.

Layer 5 A zeroth-order Takagi-Sugeno type is used for the specification of the outputs of the fuzzy rules [24],
i.e. $O_j^5(x) = \bar{w}_j \cdot b_j$, where the consequent $b_j$ is a scalar constant.

Layer 6 A summation operation performed by this layer computes the total output of the fuzzy system as the
sum of the local rule outputs.

3. The Statistical Distance Metric (SDM)

The key problem for applications involving symbolic features is the definition of the distance metric. In
domains where features are numeric, it is straightforward to compute the distance between two points in the
pattern space in terms of a geometric distance (e.g. Euclidean, Manhattan). Indeed, the traditional neurofuzzy
learning algorithms have been formulated on these distance metrics and operate effectively in numeric domains
with such distances. However, when the features are symbolic (as usually happens in bioinformatics and in
commercial data mining applications) the utilization of the traditional types of distances yields inadequate
performance. Two common approaches for handling symbolic information are the overlap method and the
orthogonal representation [5,7]. The overlap method simply counts the number of feature values that two
instances have in common. This distance metric oversimplifies the pattern space, ignores all the information
present within the training set and yields (as expected) poor performance. The orthogonal representation
encodes the symbolic features as binary vectors in such a way that the numerical distance between different
feature values is the same. This method suffers from the same inability to extract the information embedded
within the training set.

In order to obtain an effective formulation of the distances between patterns with symbolic feature
values we have adapted distance measures of the type proposed in [7,16]. The statistical distance measure takes
into account the overall similarity of classification of all instances for each possible value of each feature. With
a statistical approach, this method extracts from the training set a matrix that defines the distances between all
possible values of a given feature; a separate matrix is therefore obtained for each feature. The distance
measure for a specific feature $f$ is defined according to the following equation:

$$SD_f(V_A, V_B) = \sum_{c=1}^{N_c} \left| \frac{C_{A_c}}{C_A} - \frac{C_{B_c}}{C_B} \right|^k \qquad (1)$$

In the equation above, $V_A$ and $V_B$ denote two possible values of the feature $f$; for the DNA
promoter data, for example, they would be two nucleotides. The distance between the values is a sum over all the $N_c$
classes. For the DNA promoter example (discussed below) there are two classes: either the
sequence is a promoter (i.e. a sequence that initiates a process called transcription) or it is not. The number of
patterns whose feature $f$ has value $V_A$ ($V_B$) and that are classified to class $c$ is denoted by $C_{A_c}$
($C_{B_c}$). Also, the total number of patterns that have value $V_A$ ($V_B$) for feature $f$ is denoted by $C_A$
($C_B$), and $k$ is a constant usually set to 1. These counts are computed over all patterns of the training set. It
becomes evident from (1) that the distance between feature values exhibiting the same relative
frequency for all possible classes is zero.

Furthermore, the more correlated the classifications pertaining to two values of a feature are, the smaller their
statistical distance computed with equation (1). Therefore, a small statistical distance will be computed for
feature values with similar classifications. Equation (1) accounts for the overall similarity of classification of all
training instances over all possible classes.
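To make the construction of the SDM concrete, the following Python sketch (with hypothetical helper names; not the authors' implementation) computes the per-feature distance tables of equation (1) from a training set of symbolic patterns:

from collections import Counter, defaultdict

def sdm_tables(patterns, labels, k=1):
    # patterns: list of symbolic feature-value tuples; labels: class of each.
    classes = sorted(set(labels))
    tables = []
    for f in range(len(patterns[0])):
        total = Counter()                 # value V_A -> C_A
        per_class = defaultdict(Counter)  # value V_A -> {class c -> C_Ac}
        for x, y in zip(patterns, labels):
            total[x[f]] += 1
            per_class[x[f]][y] += 1
        table = {}
        for va in total:
            for vb in total:
                # equation (1): sum over classes of |C_Ac/C_A - C_Bc/C_B|^k
                table[(va, vb)] = sum(
                    abs(per_class[va][c] / total[va]
                        - per_class[vb][c] / total[vb]) ** k
                    for c in classes)
        tables.append(table)
    return tables

Two values with identical class-conditional relative frequencies obtain a distance of zero, in agreement with the remark above.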

The distance $D(X, Y)$ between two patterns $X, Y$ is obtained as a weighted sum of the distances between the
values of the individual features of these patterns (obtained from equation (1)):

$$D(X, Y) = \sum_{i=1}^{F} w_{f_i} \, SD_{f_i}(V_{X_i}, V_{Y_i})^r \qquad (2)$$

where $F$ is the number of features, $w_{f_i}$ is the weight assigned to feature $f_i$ reflecting its
significance, and $r$ is a parameter that controls how the distances between individual features scale in the
computation of the total pattern distance (usually $r = 1$ or $2$). Also, $V_{X_i}$ and $V_{Y_i}$ denote the values of the
$i$th feature of $X$ and $Y$.

The SDM is defined for symbolic attributes in a way that resembles the Modified Value Difference Metric
(MVDM) used in the PEBLS system [19]. However, for numeric features we have two major choices:

a) to adopt a numerical distance measure along these dimensions, therefore obtaining a distance metric of
heterogeneous type [16];

b) to discretize the numerical ranges and to extend the functionality of the SDM. A parameter of
particular importance in this discretization is the number $s$ of equal-width intervals. Although it is
difficult to define general guidelines, the heuristic rule of setting $s$ to the larger of 5 and $C$, where $C$
is the number of output classes of the problem domain, proves effective in practice; a sketch is given below.

Experience gained from real data sets indicates that it is difficult to judge beforehand which of these approaches
is better.
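As a minimal sketch of option (b), assuming only that equal-width intervals are used (the interpolation of the IVDM is omitted here), the discretization heuristic can be written as:

import numpy as np

def discretize_equal_width(column, n_classes):
    s = max(5, n_classes)         # heuristic: the larger of 5 and C
    edges = np.linspace(np.min(column), np.max(column), s + 1)
    return np.digitize(column, edges[1:-1])   # interval index in 0 .. s-1

The discretized indices can then be treated as symbolic values and fed into the SDM tables of equation (1).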

4. Neurofuzzy inference with the Statistical Distance Metric (SDM)

In contrast to example based nearest neighbor learning schemes, SANFIS learns a smooth functional that
weights the contribution of each exemplar. This provides some intuition for the superior performance of
SANFIS networks relative to simple Instance Based Learning (IBL) schemes. Nevertheless, in order for the
SANFIS solution to obtain better generalization performance than the nearest neighbour schemes, a careful
design of its parameters is necessary. This section discusses how the peculiarities of the Statistical Distance
Metric space can be described properly through the tuning of the SANFIS parameters. Actually, the extraction
of the Statistical Distance Metric space (described in the previous section) and the corresponding tuning of the
SANFIS parameters (described in the current section) can be viewed as a preprocessing stage that extracts,
from the representation of the symbolic features, distance measures useful for neurofuzzy inference.

A parameter of particular importance is the region of influence of the Gaussian Radial Basis Function MFs,
which is determined by their spread parameter. The determination of proper spread parameters for the
membership functions becomes more complicated within the domain of statistical distances. In the neural
network literature significant work exists on the estimation of spreads for Radial Basis Function (RBF)
networks [2,3,6,8,9]. The heuristic suggestion of [1], which computes these spreads for RBF networks as
$\sigma = d_{max} / \sqrt{2m}$, where $d_{max}$ is the maximum distance between patterns and $m$ is the number of RBF
centers, has not proven very effective in practice in the SDM case. One basic reason for the inefficiency of this
formula is that symbolic features can differ significantly from one another. Therefore, the spreading of the MFs
corresponding to each feature dimension should be computed by designing the corresponding distance metric
independently of the other features. Thus, the design of the SANFIS network proceeds by computing different
feature spreading scaling factors for each feature and different weights for the rule centers. The weight
parameters adjust the region of influence of a rule center, which relates to the importance and the reliability of
the example; their computation is considered in the next section.

The feature spreading scaling factors adjust the spread of the Gaussian kernels along the dimension that
corresponds to the feature. In order to obtain an effective general setting for the computation of the feature
spreading factors, a sensible approach is to first obtain an estimate of the average distance $d_{av,f}$ of
patterns within the space defined by the SDM, independently for each feature $f$. The parameter $d_{av,f}$ is
used as the feature spreading scaling factor, since it is learned from the peculiarities of the particular feature and
effectively normalizes the distances along the dimension determined by the feature. Then the region of
influence of the MF kernels is designed by requiring that at a particular distance $Spread$ from a MF center,
expressed in units of $d_{av,f}$, the influence is attenuated to $a$. The meaning of these parameters in
practice is illustrated in Figure 2. The figure shows the envelope of the Gaussian that is used as
membership function for a feature dimension $f$ with an average distance parameter $d_{av,f} = 0.5$. The
requirement fulfilled by this MF is that the set membership reduces to 10% (i.e. the attenuation parameter is
$a = 0.1$) at a distance three times $d_{av,f}$ (i.e. $Spread = 3$).

We should note that with this method for the estimation of the MF spreads, the number of rules is considered
only implicitly, since this number determines to a large extent the average distance parameter $d_{av,f}$.
Mathematically, the requirements for the rate of decay of the MFs along each feature dimension are designed by
seeking a parameter $\beta_f$ such that:

$$\exp(-\beta_f \cdot d_{av,f} \cdot Spread) = a \qquad (3)$$

and therefore the required parameter $\beta_f$ is derived as:

$$\beta_f = -\log(a) / (Spread \cdot d_{av,f}) \qquad (4)$$

Values of these parameters that yield good results are those of the example presented in Figure 2, i.e.
$Spread = 3$ and $a = 0.1$. These parameters imply that at a distance from a rule center along the corresponding
feature dimension 3 times larger than the average distance between patterns, the influence of the MF for the
particular feature $f$ is attenuated by a factor of 0.1. For these parameters we obtain
$\beta_f = -\log(a) / (Spread \cdot d_{av,f}) = 1.5351$, and the Gaussian envelope that is plotted in Figure 2
corresponds to $\exp(-1.5351 \cdot x)$.
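A small sketch of the spread design of equations (3) and (4); the function name is illustrative, and the printed value reproduces the worked example of Figure 2:

import math

def beta_f(d_av_f, spread=3.0, a=0.1):
    # equation (4): beta_f = -log(a) / (Spread * d_av,f)
    return -math.log(a) / (spread * d_av_f)

# Worked example of Figure 2: d_av,f = 0.5, Spread = 3, a = 0.1
print(round(beta_f(0.5), 4))   # -> 1.5351, i.e. the envelope exp(-1.5351 * x)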

The evaluation of rule activation proceeds by computing the response for the conjunctive premise conditions
along each feature dimension independently and by multiplying the individual responses. The MF evaluation
corresponding to a distance $x_f$ from the center of MF $i$ for feature $f$ is $MF_{i,f}(x_f) = \exp(-\beta_f \cdot x_f)$.

Note that all the MFs for a feature dimension $f$ have the same spreading $\beta_f$ at all the learning steps except
the last, i.e. the MF merging step, which generally modifies the Gaussian MF spreading by merging the
appropriate MFs. Finally, the total activation of the rule, $RULE(x)$, is obtained by the multiplicative
conjunction rule as:

$$RULE(x) = \prod_f \exp(-\beta_f \cdot x_f) = \exp\left(-\sum_f \beta_f \cdot x_f\right).$$

This scheme is easily augmented to account for feature weighting as:

$$RULE(x) = \prod_f w_f \exp(-\beta_f \cdot x_f) = \prod_f \exp\left(-(\beta_f x_f - w'_f)\right) = \exp\left(-\sum_f (\beta_f x_f - w'_f)\right), \quad \text{where } w'_f = \log w_f.$$

We can elaborate further on the design of the SANFIS neurofuzzy network by exploiting the results of the
Instance Based Learning step, described in the next section, regarding the significance and reliability of each
example. Based on these results, for each example $x_r$, $r = 1, \ldots, R$, used for the construction of the fuzzy
system, a weight parameter $w_r$ is determined. This weight parameter is assigned to the corresponding rule $r$,
such that the spreading $\sigma_{f,r}$ of each MF corresponding to feature $f$ and belonging to the premise part of the
same rule $r$ is further defined as $\sigma_{f,r} = \frac{\beta_f}{w_r}$. The designed MF centers exert an influence at a distance $x$ from
their center formulated by $\exp(-\sigma_{f,r} \cdot x)$ for each feature dimension. Clearly, the larger the value of the
weight parameter, the smaller the parameter $\sigma_{f,r}$ and therefore the more "extended" the region
of influence of the corresponding Gaussian MF. Consequently, this means that the rule created from
example $x_r$ is a reliable classifier.

Therefore, the above scheme, with the spreads $\sigma_{f,r}$ dependent on both rule $r$ and feature $f$,
trains the spreads of the MFs locally, accounting for the peculiarities and irregularities of the state space. The
Instance Based Learning (IBL) step can estimate the relative importance of each rule, summarized by the
parameter $w_r$, and therefore can improve the performance of the designed neurofuzzy solution.

5. SANFIS Learning

Although the architecture of SANFIS is similar to ANFIS [20,21,22,23], the training algorithms are
significantly different. Instead of using gradient descent optimization of the premise and consequent
parameters, we use a two stage learning strategy. The first stage performs structure learning with a multistep
Instance Based Learning (IBL) approach. At this stage the reliability and importance of the examples is
estimated as described in Subsection 5.1, and the examples are ordered in a list according to their significance.
The next step, described in Subsection 5.2, is an incremental fuzzy system construction process that exploits
the information gathered about the reliability of the examples and about the peculiarities of the statistical
distance space along each feature dimension. Exploiting the similarity of the membership functions with a
fuzzy mutual subsethood measure reduces their number and can simplify the SANFIS structure, generally
without a degradation of the generalization performance; this pruning process is described in Subsection 5.3.
Finally, in Subsection 5.4 the consequent parameters are learned with a linear least squares formulation that
can be computed efficiently with the pseudoinverse solution.

5.1 Instance Based Learning (IBL) for the selection of MF centers and the determination of their
parameters

Some examples of the training set are more reliable classifiers than others. It is highly desirable to detect the
reliable examples and to exploit them for the rule construction. Also, the more reliable an example is, the larger
should be the region of influence of the corresponding MF when the example is used as a rule center. The extent
of the region of influence is expressed by the spreading parameter $\sigma$ of the rule center. Instead of using
clustering self-organizing techniques [2] or an approach that exploits the principle of structural risk
minimization [15], a heuristically driven learning strategy is adopted for the determination of the examples that
should be used as rules and of their widths. The former approaches are well suited to the effective learning of
smooth functionals, even ones lying in high dimensional spaces, but the irregularities of the statistical distance
metric space extracted from symbolic data, and the need to cope with many artefacted examples, pose several
problems for the application of their mathematical framework.

The proposed neurofuzzy training approach consists of two phases. The first phase, the Instance Based
Learning (IBL) step, evaluates the potential of each example for serving as a rule (i.e. how representative and
reliable the example is). This step is of a heuristic type: it tries to discover the reliability and the importance
of the training examples with an Instance Based Learning (IBL) scheme that resembles the functionality of
PEBLS [7]. The IBL learning step offers the potential to detect the noisy examples of the designed
classification system in the initial approximate solution, and therefore the possibility of removing them from
the second neurofuzzy learning phase. The IBL step is implemented with nearest neighbor based schemes [14,16].
The classification function of IBL, viewed as an input-output mapping, tends to have many class boundaries and
discontinuous "islands" of misclassified regions placed near erroneously classified examples. The structure of
the decision boundaries is smoothed and most of the artefacted regions are removed, in order to reject the
influence of noisy examples on the designed classification system. The examples that do not yield satisfactory
performance in the initial IBL learning step are detected and marked in order to avoid their use in the rule
discovery process. This learning step detects with Instance Based Learning the "good" examples and uses them
to construct rules, by placing the MF centers and by tuning the MF spreads according to the structure of the
SDM along the corresponding dimension. This can be viewed as a structure identification step.

The second learning phase is the neurofuzzy training phase. Its first learning step, premise parameter
identification, is accomplished with Instance Based Learning techniques. These techniques construct
the MFs, place their centers and compute their spreads. All these parameters are computed within the Statistical
Distance Metric space. This step can be considered to perform an initial approximation to the structure of the
solution space, analogous to the one performed with Instance Based Learning nearest neighbor schemes. For
every example $r$ and feature $f$, a membership function of the form $close\_to(r, f)$ is constructed along the
dimension of $f$ in order to describe how the outcome depends on the closeness to example $r$. Thus, the
rules have the following form:

if $close\_to(r,1) \wedge close\_to(r,2) \wedge \ldots \wedge close\_to(r,F)$ then $class\_of\_example(r)$

where the consequent $class\_of\_example(r)$ is expressed with a scalar constant $b_r$ and $F$ denotes the number of
features. Clearly, these rules constitute a zeroth-order Takagi-Sugeno system.
For the second learning step we avoid the formulation of gradient descent based techniques to further tune the
premise parameters and to compute the consequent parameters: the approximate structure and the lack of
pattern coordinates in the Statistical Distance Metric space significantly complicate the formulation of the
gradient. In this step, the consequent parameter learning step, the fuzzy system is designed to
construct a smooth solution that fits the training set well.

Three basic approaches have been explored for the implementation of the structure identification learning pass,
all based on Instance Based Learning techniques. The one pass approach is an exemplar weighting method that
is used in combination with the nearest neighbor parameter $k$, which must be larger than one. The
learning is accomplished with only one pass through the training examples. In this pass, for each training
instance, its $k$ nearest neighbors are detected among the remaining training set. If $j$ neighbors have a
matching class, then a weight is assigned to the current instance according to the simple formula
$weight = 1 + k - j$. Therefore, the more the class of the exemplar is reinforced by its neighbors, the smaller the

weight (i.e. the more reliable the exemplar is). Algorithmically, the one pass instance based learning algorithm
takes the form:

for each pattern P of the training set do

Detect the k nearest neighbors to P from the training set according to the Statistical Distance Metric;

Let j = number of nearest neighbors with the same class label as the actual class of P;

Set the weight parameter that quantifies the reliability of the exemplar as weight =1 + k − j ;

endfor;

As a particular example, consider the case where the nearest neighborhood parameter is set to six. The exemplar
weights will then range from one to seven, depending on the number of neighbors that have a matching class. A
weight of one means that all the neighbors reinforce the class assignment of the example (i.e. $j = k$); therefore
this example is reliable. Conversely, at the other extreme, a weight of seven implies with high
probability that the isolated exemplar is artefacted and therefore should not contribute too much to the final
solution.

The one pass technique succeeds in attenuating significantly the effect of artefacted exemplars. Evidently, it
also attenuates the exemplars placed near class separation boundaries. Even so, since the strength of the
boundary exemplars is symmetrically reduced, the algorithm is not biased towards favoring a particular class.
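A Python rendering of the one pass scheme above; the pattern_distance function, assumed here, would implement equation (2) over the SDM tables:

def one_pass_weights(patterns, labels, pattern_distance, k=6):
    # pattern_distance(X, Y) is assumed to implement equation (2).
    weights = []
    for i, (x, y) in enumerate(zip(patterns, labels)):
        neighbors = sorted(
            ((pattern_distance(x, patterns[m]), labels[m])
             for m in range(len(patterns)) if m != i),
            key=lambda t: t[0])
        j = sum(1 for _, c in neighbors[:k] if c == y)  # matching-class count
        weights.append(1 + k - j)   # 1 = fully reinforced, 1 + k = isolated
    return weights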

An alternative IBL heuristic weighting approach for the determination of the influence of the rule centers
evaluates the performance history of each exemplar. The performance history is evaluated by running a
large number of classification trials; usually, all the patterns of the training set are used in these trials. At each
such trial a sample pattern $P$ is picked from the training set and its classification $class(P)$ is evaluated as
the classification of its nearest neighbor $P_n$, i.e. $class(P) = class(P_n)$. The classification $class(P)$ assigned
by the nearest neighbor rule is compared with the actual class of the pattern. Exemplars that have
been used successfully to classify their neighbors are assigned a small weight (correspondingly, a large region of
influence). Therefore, the spreading of the MFs of an exemplar is adjusted by considering its success ratio, as
evaluated from the percentage of times that it was used to classify correctly. This percentage evaluates the
effectiveness of a rule as a classifier. Denoting by $used(i)$ the number of times the exemplar $i$ (which serves as
the rule center) is used to classify (as the nearest neighbour example), and by $correct(i)$ the number of times it
is used correctly, the formula for weighting becomes:

$$weight = used(i) / correct(i)$$

We should note that in the evaluation of the nearest neighbour, a weighting scheme for the exemplars is not
used, i.e. all examples are treated equally in the distance computations.

The algorithm for the used-correct approach is formulated as:

for all patterns p of the training set do

used[p] = correct[p] = 1;

endfor;

for all patterns p of the training set do

Cp = class(p); // actual class of pattern p

Pnearest = nearest_neighbor(p); // nearest neighbour of p, without using weighting schemes for the exemplars

Cnearest = class(Pnearest); // class of the nearest neighbour pattern

used[Pnearest]++; // increment the count of times the pattern Pnearest is used to perform a classification

if Cnearest == Cp then

correct[Pnearest]++; // pattern Pnearest has been used to classify correctly

endif

endfor;

for all patterns p of the training set do

weight[p] = used[p] / correct[p];

endfor;

Finally, another effective exemplar weighting method for the implementation of the IBL learning step is the
increment method. This method initially assigns to all the weights a value of 1.00. Then, the single nearest
neighbor of each training instance is determined; the distance is computed ignoring the weighting (equivalent to
assuming that each instance has a weight of 1.00). The exemplar's weight then becomes the number of
times that the particular example is used in the training process minus the number of times that it is used correctly.
This approach differs from the used-correct approach only in the final step, which assigns the weights differently,
i.e.

for all patterns p of the training set do

weight[p] = used[p] - correct[p];

endfor;
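Both history-based weightings can be sketched with the same loop, again assuming the hypothetical pattern_distance of the previous sketch; the two approaches differ only in the final weight assignment:

def history_weights(patterns, labels, pattern_distance, mode="used-correct"):
    n = len(patterns)
    used = [1] * n      # initialized to 1 as in the pseudocode above
    correct = [1] * n
    for i in range(n):
        # unweighted nearest neighbour of pattern i among the rest
        nearest = min((m for m in range(n) if m != i),
                      key=lambda m: pattern_distance(patterns[i], patterns[m]))
        used[nearest] += 1
        if labels[nearest] == labels[i]:
            correct[nearest] += 1
    if mode == "used-correct":
        return [u / c for u, c in zip(used, correct)]   # ratio weighting
    return [u - c for u, c in zip(used, correct)]       # increment weighting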

5.2 Adding Rules, Modifying Membership Functions

Initially, the SANFIS system consists of only one fuzzy rule, constructed from the most reliable example.
Then examples are repeatedly retrieved in the order of their significance and the fuzzy system is expanded by
incorporating their information. Criteria are defined according to which this repetitive process stops when the
fuzzy system attains sufficient generalization potential. For each example, a number of membership functions
equal to the number of input variables (features) is constructed, and their spreading is tuned to fit the
peculiarities of the statistical distance space along each dimension according to equation (4). These membership
functions can later be reduced with a clustering algorithm based on the concept of fuzzy adaptive subsethood,
described in Subsection 5.3.

The function that the SANFIS system implements takes the form

$$F(x \mid \theta) = \frac{\displaystyle\sum_{r=1}^{R} b_r \prod_{f=1}^{F} \exp\left(-\frac{1}{2}\left(\frac{SD(x_f, m_{f,r})}{\sigma_{f,r}}\right)^2\right)}{\displaystyle\sum_{r=1}^{R} \prod_{f=1}^{F} \exp\left(-\frac{1}{2}\left(\frac{SD(x_f, m_{f,r})}{\sigma_{f,r}}\right)^2\right)} \qquad (5)$$

where

• $x = [x_1\ x_2\ \ldots\ x_F]$ is the input vector.

• $\theta = (b_r, m_{f,r}, \sigma_{f,r})$, $r = 1, \ldots, R$, $f = 1, \ldots, F$, denotes the parameter set of the SANFIS system.

• $R$ is the number of fuzzy rules.

• $F$ is the number of input variables, i.e. the number of features.

• $m_{f,r}$ is a numeric value assigned to the symbolic feature where the $f$th membership function of the
$r$th fuzzy rule achieves its maximum. The actual numeric value of $m_{f,r}$ has no particular
meaning for the application, because it is usually an integer mapped to the symbolic feature arbitrarily
(e.g. consecutive integers can be used for "tokenizing" features).

• $\sigma_{f,r}$ is the width (spread) of the membership function for the $f$th feature and $r$th fuzzy rule.

• $SD(x_f, m_{f,r})$ is the statistical distance between the symbolic feature value $x_f$ and the symbolic
rule center $m_{f,r}$, computed according to equation (1).

• $b_r$ is the scalar output of the $r$th zeroth-order Takagi-Sugeno fuzzy rule.
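A sketch of the evaluation of equation (5), assuming a helper sd(f, xf, mfr) that returns the statistical distance of equation (1) for feature f (the helper and array layout are illustrative assumptions):

import numpy as np

def sanfis_output(x, centers, sigmas, b, sd):
    # centers, sigmas: R x F arrays; b: the R consequents; x: F feature values.
    R, F = centers.shape
    firing = np.ones(R)
    for r in range(R):
        for f in range(F):
            d = sd(f, x[f], centers[r, f])           # SD(x_f, m_{f,r})
            firing[r] *= np.exp(-0.5 * (d / sigmas[r, f]) ** 2)
    return float(np.dot(firing, b) / np.sum(firing)) # normalized TS output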


The algorithmic steps for the fuzzy system construction are as follows:

1. Retrieve the most significant example $(x, d)$, where $x$ denotes the attribute vector and
$d$ the corresponding class label, and construct the first fuzzy rule with the
ExampleToFuzzyRule construction algorithm.

2. while the list of examples is not empty do

3. Retrieve the next most significant example $(x, d)$.

4. if the currently constructed fuzzy system cannot explain adequately the mapping $(x, d)$
then

5. Construct a new fuzzy rule for $x$ and expand the fuzzy system.

endif

endwhile

The fuzzy system construction proceeds by checking, before expansion, whether the current set of fuzzy rules is
adequate to explain the currently considered example $x$. If

$$\left| F(x \mid \theta) - y \right| \le \varepsilon_F$$

then the fuzzy system $F$ already represents satisfactorily the information in the corresponding example;
hence no rule is added to $F$ and the next training example is considered by performing the same type of $\varepsilon_F$ test.

Suppose instead that

$$\left| F(x \mid \theta) - y \right| > \varepsilon_F.$$

Then a new fuzzy rule $r_{new}$ is added to represent the information contained in $(x, y)$. The widths $\sigma_{f,r_{new}}$
of the new rule $r_{new}$ are adapted in order to adjust the spacing between the membership functions so that

• the new rule does not distort what has already been learned by the fuzzy system;

• a smooth interpolation is performed between training points.

This modification of $\sigma_{f,r_{new}}$, for each feature dimension $f$, is accomplished by determining the nearest
neighbour MF center $m_{f,r_{near}}$ among all the MF centers for the same feature $f$ of the fuzzy system
constructed up to this point, according to the statistical distance metric. The spreading $\sigma_{f,r_{new}}$ is then updated
according to

$$\sigma_{f,r_{new}} = \frac{1}{\Lambda}\left| m_{f,r_{new}} - m_{f,r_{near}} \right|$$

for $f = 1, \ldots, F$, where $\Lambda$ is a weighting factor that determines the amount of overlap of the membership
functions. The weighting factor $\Lambda$ and the widths $\sigma_{f,r_{new}}$ have an inverse relationship. A value of $\Lambda = 2$
attains good results in practice.
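The construction loop can be sketched as follows, reusing the sanfis_output and sd helpers assumed above; the tolerance EPS_F and the default width of the first rule are assumptions, since the text leaves them problem dependent:

import numpy as np

LAMBDA = 2.0      # overlap factor, Lambda = 2 as suggested in the text
EPS_F = 0.5       # tolerance of the epsilon_F test (assumed value)

def build_rules(examples, sd, n_features):
    # examples: list of (x, d) pairs, sorted by decreasing significance.
    centers, sigmas, b = [], [], []
    for x, d in examples:
        if centers:
            out = sanfis_output(x, np.array(centers), np.array(sigmas),
                                np.array(b), sd)
            if abs(out - d) <= EPS_F:
                continue              # example already explained adequately
        # New rule: per-feature width from the nearest existing MF center,
        # sigma = distance / LAMBDA; the first rule gets an assumed default.
        sigma = []
        for f in range(n_features):
            if centers:
                nearest = min(sd(f, x[f], c[f]) for c in centers)
            else:
                nearest = 1.0         # assumed default width for the first rule
            sigma.append(max(nearest, 1e-6) / LAMBDA)  # guard against zero width
        centers.append(list(x))
        sigmas.append(sigma)
        b.append(float(d))
    return centers, sigmas, b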

5.3 The Mutual Subsethood measure and Membership Function aggregation

Membership functions $\mu_{r,f}$, $r = 1, \ldots, R$, $f = 1, \ldots, F$, are constructed for all the $R$ reliable examples in order to
provide the region of influence of each example along the corresponding attribute (i.e. dimension) coordinates
$f = 1, \ldots, F$ within the Statistical Distance Metric space. Each membership function $\mu_{r,f}$ has its center at
the corresponding feature coordinate and a dispersion computed by evaluating the dispersion of the
statistical distance metric along the corresponding feature dimension according to the formulation of (3).

However, this technique has a tendency to produce a large number of MFs. Clearly, for $R$ reliable examples of
feature dimensionality $F$ used in the SANFIS construction, the resulting fuzzy system has $R \cdot F$ MFs and $R$
rules. The initially large fuzzy system can in most cases be significantly reduced by agglomerating adjacent
Gaussian MFs, belonging to the same feature $f$ and having a large degree of overlap, into equivalent ones. This
aggregation is controlled by computing a measure that evaluates the degree of equivalence among the
fuzzy sets corresponding to the MFs. The algorithms for the reduction of MFs based on the mutual subsethood
measure are adapted from [25], Chapter 13.

The measure of mutual subsethood [25] quantifies the degree to which two fuzzy sets are equal, or how similar
they are. Suppose $a: X \to [0,1]$ and $b: X \to [0,1]$ are the set functions of fuzzy sets $A$ and $B$.
Fuzzy sets $A$ and $B$ are equal if $a(x) = b(x)$ for all $x \in X$. If $A = B$ then $A \subseteq B$ and $B \subseteq A$; here
$A \subseteq B$ if $a(x) \le b(x)$ for every $x \in X$. A fuzzy measure of equality, or mutual subsethood measure,
that quantifies the degree to which fuzzy set $A$ equals fuzzy set $B$ is then defined as

$$E(A, B) = Degree(A = B) = Degree(A \subseteq B \text{ and } B \subseteq A).$$

To evaluate this definition, the notion of the size or cardinality of a fuzzy set $A$, denoted by $c(A)$, needs to be
defined. It equals the sum of the fit values of $A$, i.e. $c(A) = \sum_{x \in X} a(x)$.

The mutual subsethood measure $E(A, B)$ can be computed in terms of a ratio of counts as in [25]:

$$E(A, B) = \frac{c(A \cap B)}{c(A \cup B)}.$$

An approximate mutual subsethood measure is derived in [25] for the quantification of the
similarity between two Gaussian membership functions. According to [25], for Gaussian MFs with centers
$m_1, m_2$ and spreads $\sigma_1, \sigma_2$ this measure can be approximated with the following formulas:

$$E(A, B) = \frac{c(A \cap B)}{c(A \cup B)} = \frac{c(A \cap B)}{\sigma_1\sqrt{\pi} + \sigma_2\sqrt{\pi} - c(A \cap B)} \qquad (6)$$

where

$$c(A \cap B) = \frac{1}{2}\,\frac{h^2\!\left(m_2 - m_1 + \sqrt{\pi}(\sigma_1 + \sigma_2)\right)}{\sqrt{\pi}(\sigma_1 + \sigma_2)}
+ \frac{1}{2}\,\frac{h^2\!\left(m_2 - m_1 + \sqrt{\pi}(\sigma_1 - \sigma_2)\right)}{\sqrt{\pi}(\sigma_2 - \sigma_1)}
+ \frac{1}{2}\,\frac{h^2\!\left(m_2 - m_1 - \sqrt{\pi}(\sigma_1 - \sigma_2)\right)}{\sqrt{\pi}(\sigma_1 - \sigma_2)}$$

and $h(x) = \max(0, x)$.

The approach for the reduction of the number of MFs is simple. Specifically, the mutual subsethood measure
$E(A, B)$ is computed for all pairs of Gaussian MFs with equation (6). If it is above a threshold
value (a well behaved threshold is 0.8), then the two Gaussian MFs with means $m_1, m_2$ and
spreads $\sigma_1, \sigma_2$ are replaced by a single one with mean $m = \frac{m_1 + m_2}{2}$ and spread $\sigma = \frac{\sigma_1 + \sigma_2}{2}$.

5.4 Consequent parameter identification

The consequent parameters are evaluated with a least squares formulation that is solved efficiently with the
well-known pseudoinverse solution [1]. We define $y = [y_1\ y_2\ \ldots\ y_P]^T$ as the vector of desired outputs for
the $P$ training patterns, and let $W = [w_1^T\ w_2^T\ \ldots\ w_P^T]^T$ be a $P \times R$ matrix whose rows consist of
the outputs of Layer 4. Each $w_p^T$ is the vector of normalized firing strengths produced by the network up to
Layer 4 for the input vector $x_p$. Clearly we can identify a linear least squares problem whose solution can be
obtained as

$$b = \left(W^T W\right)^{-1} W^T y.$$

The above solution is the pseudoinverse solution for least squares problems and is particularly effective for
training the consequent part of the SANFIS system; in practice it is computed with the stable Singular Value
Decomposition (SVD) algorithm.
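A minimal sketch of the consequent estimation; as mentioned above, an SVD-based solver is preferable to forming the normal equations explicitly:

import numpy as np

def consequents(W, y):
    # W: P x R normalized firing strengths; y: the P desired outputs.
    # lstsq solves min ||W b - y|| via SVD, the stable pseudoinverse route.
    b, *_ = np.linalg.lstsq(W, y, rcond=None)
    return b

With R rules and P training patterns, W is assembled row by row from the Layer 4 outputs for each training input.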

6. Results

The SDM is used as a distance metric for patterns with symbolic attributes in order to fit a neurofuzzy solution
to the hidden dependencies involving attributes of symbolic domains. One prerequisite for the application of
this distance type, which is statistical in nature, is to have enough training data for the accurate construction of
the SDM space. However, training sets large enough to provide the essential information for generalization
also provide the necessary information for the computation of an effective distance matrix. Although we do not
attempt to establish formal bounds on the sufficiency of the size of the training set, the excellent results
obtained in many experiments with the neurofuzzy network operating on statistical distances provide strong
empirical support for its validity.

We have applied the neurofuzzy based solutions to many standard data sets from the UCI machine learning
repository. The results are presented in Table 1. This table compares the PEBLS performance with that
obtained with the proposed neurofuzzy solution. The 3rd column displays the average generalization
performance of the neurofuzzy system without the Instance Based Learning (IBL) estimation of parameters.
The next three columns (4th to 6th) summarize the corresponding average classification performance obtained
by applying the neurofuzzy system with the three IBL approaches described, i.e. the one-pass, the
used-correct and the increment approaches. The improvements achieved with the IBL learning are evident. It is
important to note at this point that the utilization of IBL learning allows the construction of the
fuzzy rules to consider only a small percentage (about 5%-20%, depending on the particular application) of the
training examples, namely the most important and reliable ones. Therefore, we can obtain a small SANFIS
system that generalizes well.

Table 2 presents results for the same data sets as in Table 1 with the application of the mutual subsethood
pruning of the SANFIS system. The additional column displays the percentage of reduction of the MFs. This
percentage depends on the distribution of the training data points and tends to increase with the size of the
training set. We conclude that the merging of MFs results in a simpler SANFIS system of similar overall
performance. Clearly, from the results displayed in Tables 1 and 2 we can conclude that the neurofuzzy
solutions outperform the simple nearest neighbor schemes.

Database          PEBLS    NeuroFuzzy    NeuroFuzzy+IBL    NeuroFuzzy+IBL    NeuroFuzzy+IBL
                                         One-pass          Used-Correct      Increment
Hypothyroid       97.90    98.14         98.33             98.44             98.37
Iris              94.62    95.40         96.32             96.8              96.12
Hepatitis         76.59    78.33         79.56             84.37             81.34
Breast Cancer     94.23    95.90         96.11             96.22             96.18
Heart Disease     81.90    83.41         82.15             83.87             85.21
Audiology         77.9     78.12         81.26             81.14             79.94
Liver Disorders   63.45    63.20         65.91             72.54             74.58

Table 1 Illustration of the performance of the proposed NeuroFuzzy + IBL data mining algorithms compared
to the plain IBL as implemented with the PEBLS learning system. We can observe that the utilization of IBL
within the framework improves the generalization results. However, we cannot easily conclude that a
particular IBL learning approach is better.

Database          PEBLS    NeuroFuzzy    NeuroFuzzy+IBL    NeuroFuzzy+IBL    NeuroFuzzy+IBL    Reduction of MFs
                                         One-pass          Used-Correct      Increment
Hypothyroid       97.90    98.04         98.46             98.52             97.73             42 %
Iris              94.62    95.30         96.32             96.8              96.14             36 %
Hepatitis         76.59    78.43         79.56             85.11             82.14             29 %
Breast Cancer     94.23    95.93         96.11             96.13             96.18             41 %
Heart Disease     81.90    84.52         82.18             84.12             85.10             43 %
Audiology         77.9     79.21         81.15             82.14             80.12             45 %
Liver Disorders   63.45    63.25         64.92             71.89             73.67             47 %

Table 2 The results for the same data sets as in Table 1 with the application of the mutual subsethood pruning
of the SANFIS system. The additional column displays the percentage of reduction of the MFs. We conclude
that the merging of MFs results in a simpler SANFIS system of similar overall performance.

Below we describe in more detail one application from bioinformatics and one from the data mining of
commercial databases. The bioinformatics application concerns the prediction of promoter sequences
[5,10]. This task involves predicting whether or not a given subsequence of a DNA sequence is a promoter, i.e.
a sequence that initiates a process called transcription. The data set contains 106 examples, 53 of which
are positive examples (promoters) and the rest negative ones. A training pattern consists of a sequence of 57
nucleotides (features) from the alphabet a, c, g and t, together with the respective classification (promoter or
not promoter). Since the available number of patterns was small, the classification performance was tested with
the leave-one-out methodology, i.e. trials were performed repeatedly by training on 105 examples and testing
on the remaining one. The computed performance was 2/106 (i.e. an average of 2 errors over 106 trials), versus
4/106 for a competitive experiment that used the KBANN neural network model [12].

Another application concerns a different domain: the data mining of commercial databases. A large training set
was extracted from a database kept by a mobile telecommunications company. This set contains 62000 records
covering some attributes of its customers (e.g. job, area, sex, and pay method) and their classification in terms
of their quality as customers. This classification is into six classes, ranging from 0 to 5 in order of increasing
customer quality. The objective of the learning system was to predict well the class of a new
customer from its attributes and therefore to guide the company strategy. For this problem the presented
statistical distance with neurofuzzy learning obtained the best classification performance, i.e. around 60%
success in the prediction of the customer class. In contrast, a nearest neighbourhood classification scheme based
on the same distance obtains only around 30%, and a Self-Organizing Map (SOM) operating with the traditional
distance types on a numerical coding of the feature values (it is not easy to adapt the statistical distance to the
context of the SOM) obtains a performance around 42%.

7. Conclusions

Neuro-fuzzy network learning algorithms are very effective in domains in which all features have
numeric values. In these domains, the examples are treated as points and the distance metrics obey standard
definitions. However, the usual domain of data mining applications is the symbolic domain, in which the
utilization of the traditional distance metrics usually yields poor results.

This work has adapted a Statistical Distance Metric (SDM) for expressing the distance between values of
features in symbolic domains, originally proposed for nearest neighbor schemes, to the context of the
peculiarities of the Symbolic Adaptive Neuro-Fuzzy Inference System (SANFIS). The potential of this distance
metric to regularize the solution of the neurofuzzy network is the theoretical justification of the improved
performance relative to the simple nearest neighbor schemes.

Additional performance improvement has been obtained by exploiting the fact that the examples of the training
set are not all of the same importance and reliability. Therefore, a learning mechanism is implemented for the
estimation of the reliability and significance of each example. SANFIS can be designed to exploit
effectively the irregularity of the problem's state space through the selection of the proper training examples for
the construction of the fuzzy rules with an Instance Based Learning (IBL) scheme. Also, the weight parameters
for each rule are determined with the IBL learning pass; these weight parameters account for the
significance and reliability of the corresponding example. We have described three different Instance Based
Learning (IBL) algorithms for the implementation of this learning step. The fuzzy rules are constructed with an
incremental process that creates a new rule only when the currently assembled system is inadequate
to explain the new example. The concept of fuzzy adaptive subsethood is used in the third phase for the
reduction of the number of the fuzzy sets used as membership functions. Finally, the consequent parameters are
estimated with an efficient linear least squares formulation. The performances obtained with this multilevel
learning method are significantly better than those of the traditional nearest neighbour schemes on many data
mining problems, and the system offers enhanced explanation ability through the learned fuzzy rules.

The neurofuzzy data mining system presented in this paper has a wide span of applications and has been
adapted to heterogeneous data records having both symbolic and numeric attributes. For the symbolic
attributes the Statistical Distance Metric has yielded the best performance. However, for the numeric attributes
the optimal policy for evaluating the distance is not obvious. In many cases a generalization improvement is
obtained by handling the numeric values with a normalized numeric distance, e.g. a Euclidean distance
metric normalized by the standard deviation. Furthermore, some attribute domains are better handled with a
discretized numeric distance metric augmented with an interpolation that alleviates the discretization effects
(e.g. the Interpolated Value Difference Metric [16]).

Future work to further improve the proposed SANFIS and IBL hybrid data mining algorithms can proceed along
many different directions. Specifically, more elaborate schemes for treating numerical attributes by finding
optimal multisplits [13] can be incorporated within the context of the current work in order to further enhance
the performance. Also, another approach for the effective discretization of continuous attributes, using a
simulated annealing algorithm [11], can improve the results obtained by treating numeric attributes as
discretized symbolic ones. Furthermore, a nonlinear optimization of the consequent parameters with an
approach like the Levenberg-Marquardt algorithm [18] can (at least theoretically) obtain better performance at
the cost of a more complex (and perhaps more unstable) implementation.

Acknowledgment

The authors wish to thank the Research Committee of the University of Patras for the partial financial
support of this research under the contract KARATHEODORIS, 2454.

References
[1] Simon Haykin, Neural Networks, Macmillan College Publishing Company, Second Edition, 1999.

[2] T. Poggio and F. Girosi, "Regularization algorithms for learning that are equivalent to multilayer networks", Science 247 (1990), pp. 978-982.

[3] T. Poggio and F. Girosi, "Networks for approximation and learning", Proceedings of the IEEE, vol. 78, pp. 1481-1497, 1990.

[4] V. N. Vapnik, Statistical Learning Theory, New York, Wiley, 1998.

[5] Pierre Baldi, Soren Brunak, Bioinformatics, MIT Press, 1998.

[6] Federico Girosi, "An Equivalence Between Sparse Approximation and Support Vector Machines", Neural Computation, 10:6 (1998), pp. 1455-1480.

[7] C. Stanfill, D. Waltz, "Toward memory-based reasoning", Communications of the ACM, 29:12 (1986), pp. 1213-1228.

[8] S. Papadimitriou, A. Bezerianos, A. Bountis, "Radial Basis Function Networks as Chaotic Generators for Secure Communication Systems", International Journal on Bifurcation and Chaos, Vol. 9, No. 1 (1999), pp. 221-232.

[9] A. Bezerianos, S. Papadimitriou, D. Alexopoulos, "Radial Basis Function Neural Networks for the Characterization of Heart Rate Variability Dynamics", Artificial Intelligence in Medicine 15 (1999), pp. 215-234.

[10] Vladimir Brusic and John Zeleznikow, "Knowledge discovery and data mining in biological databases", The Knowledge Engineering Review, Vol. 14:3 (1999), pp. 257-277.

[11] Justin C. W. Debuse and Victor J. Rayward-Smith, "Discretisation of Continuous Commercial Database Features for a Simulated Annealing Data Mining Algorithm", Applied Intelligence 11 (1999), pp. 285-295.

[12] G. Towell, J. Shavlik, M. Noordewier, "Refinement of approximate domain theories by knowledge-based neural networks", Proceedings of the Eighth National Conference on Artificial Intelligence, pp. 861-866, Menlo Park, CA: AAAI Press, 1990.

[13] Tapio Elomaa, Juho Rousu, "General and Efficient Multisplitting of Numerical Attributes", Machine Learning, 36 (1999), pp. 201-244.

[14] Stefan Berchtold, Daniel A. Keim, Hans-Peter Kriegel, Thomas Seidl, "Indexing the Solution Space: A New Technique for Nearest Neighbor Search in High-Dimensional Space", IEEE Transactions on Knowledge and Data Engineering, Vol. 12, No. 1, January/February 2000.

[15] Peter Bartlett and John Shawe-Taylor, "Generalization Performance of Support Vector Machines and Other Pattern Classifiers", in Advances in Kernel Methods: Support Vector Learning, MIT Press, 1999, pp. 43-54.

[16] D. Randall Wilson, Tony R. Martinez, "Improved Heterogeneous Distance Functions", Journal of Artificial Intelligence Research 6 (1997), pp. 1-34.

[17] J. Barry Gomm, Ding Li Yu, "Selecting Radial Basis Function Network Centers with Recursive Orthogonal Least Squares Training", IEEE Transactions on Neural Networks, Vol. 11, No. 2, March 2000, pp. 306-314.

[18] Christopher M. Bishop, Neural Networks for Pattern Recognition, Clarendon Press, Oxford, 1996.

[19] Scott Cost and Steven Salzberg, "A Weighted Nearest Neighbor Algorithm for Learning with Symbolic Features", Machine Learning, Vol. 10 (1993), pp. 57-78.

[20] J.-S. R. Jang, C.-T. Sun, and E. Mizutani, Neuro-Fuzzy and Soft Computing: A Computational Approach to Learning and Machine Intelligence, Prentice Hall, 1996.

[21] J.-S. R. Jang and C.-T. Sun, "Neuro-Fuzzy Modeling and Control", Proceedings of the IEEE, vol. 83, pp. 378-406, March 1995.

[22] J.-S. R. Jang, "ANFIS: Adaptive-Network-based Fuzzy Inference Systems", IEEE Transactions on Systems, Man, and Cybernetics, vol. 23, pp. 665-685, May 1993.

[23] J.-S. R. Jang, "Self-Learning Fuzzy Controllers Based on Temporal Back-Propagation", IEEE Transactions on Neural Networks, vol. 3, pp. 714-723, September 1992.

[24] T. Takagi and M. Sugeno, "Fuzzy identification of systems and its application to modeling and control", IEEE Transactions on Systems, Man, and Cybernetics, vol. 15, pp. 116-132, January 1985.

[25] Bart Kosko, Fuzzy Engineering, Prentice Hall, 1997.

[26] Hans Hellendoorn, Dimiter Driankov (Eds.), Fuzzy Model Identification, Springer-Verlag, 1997.

Figure 1: The architecture of the Symbolic Adaptive Neuro Fuzzy System (SANFIS).

Figure 2: Illustration of the design of the Gaussian MF spreading along each feature dimension.

Figure 3: The mutual subsethood measure quantifies the amount of overlap of Gaussian MFs. In the left subplot the
amount of overlap is small and the corresponding mutual subsethood measure takes small values. The opposite case is
illustrated in the right subplot.

