
Applied Intelligence 16, 185–203, 2002

© 2002 Kluwer Academic Publishers. Manufactured in The Netherlands.


The Supervised Network Self-Organizing Map for Classification
of Large Data Sets
STERGIOS PAPADIMITRIOU
Department of Medical Physics, School of Medicine, University of Patras, 26500 Patras, Greece;
Department of Computer Engineering and Informatics, University of Patras, 26500 Patras, Greece
stergios@heart.med.upatras.gr
SEFERINA MAVROUDI AND LIVIU VLADUTU
Department of Medical Physics, School of Medicine, University of Patras, 26500 Patras, Greece
G. PAVLIDES
Department of Computer Engineering and Informatics, University of Patras, 26500 Patras, Greece
ANASTASIOS BEZERIANOS
Department of Medical Physics, School of Medicine, University of Patras, 26500 Patras, Greece
bezer@patreas.upatras.gr
Abstract. Complex application domains involve difficult pattern classification problems. The state space of these
problems consists of regions that lie near class separation boundaries and require the construction of complex
discriminants, while for the remaining regions the classification task is significantly simpler. The motivation for
developing the Supervised Network Self-Organizing Map (SNet-SOM) model is to exploit this fact for designing
computationally effective solutions. Specifically, the SNet-SOM utilizes unsupervised learning for classifying the
simple regions and supervised learning for the difficult ones in a two-stage learning process. The unsupervised
learning approach is based on the Self-Organizing Map (SOM) of Kohonen. The basic SOM is modified with a
dynamic node insertion/deletion process controlled by an entropy-based criterion that allows an adaptive extension
of the SOM. This extension proceeds until the total number of training patterns that are mapped to neurons with
high entropy (and therefore with ambiguous classification) is reduced to a size that is numerically manageable by a
capable supervised model. The second learning phase (the supervised training) has the objective of constructing
better decision boundaries at the ambiguous regions. At this phase, a special supervised network is trained for the
computationally reduced task of performing the classification at the ambiguous regions only. The performance of
the SNet-SOM has been evaluated on both synthetic data and on an ischemia detection application with data
extracted from the European ST-T database. In all cases, the utilization of the SNet-SOM with supervised learning
based on both Radial Basis Functions and Support Vector Machines has improved the results significantly relative
to those obtained with the unsupervised SOM and has enhanced the scalability of the supervised learning schemes.
The highly disciplined design of the generalization performance of the Support Vector Machine allows the proper
model to be designed for the particular training set.
Keywords: neural networks, data mining, self-organizing maps, learning vector quantization, divide and conquer
algorithms, radial basis functions, support vector machines, computational complexity, ischemia detection
1. Introduction
Application domains involving large data sets, such as the data mining of large commercial databases [1–3], the
protein secondary structure prediction problem [4] and the intelligent processing of biomedical signals [5, 6],
are frequently characterized by the existence of a large number of noisy observations of a set of variables and
of the related instantiations of outcomes. The objective is to uncover hidden, complex and often fuzzy relations
between the variables and the outcomes in the form of input/output dependencies. The emergence of neural
network technology [7, 8] offers valuable insight for confronting these complicated problems. In this context,
neural networks can be viewed as advanced mathematical models for discovering complex correlations between
variables of physical processes from a set of perturbed observations.
Advanced neural network techniques are based on solid formal theories for building the generalization potential
into their design. For example, Radial Basis Function networks exploit Tikhonov's regularization [9–11], while
Support Vector Machines are derived as implementations of the robust Statistical Learning Theory of Vapnik [12],
although equivalences between these two seemingly different approaches exist [11]. However, the computational
requirements for the numerical evaluation of these models become prohibitive as the size of the problem
increases [13].
In contrast, unsupervised learning schemes demand significantly less computational resources and scale better
to the problem size, but they usually cannot discriminate well over parts of the state space that require the
enforcement of complex decision boundaries [7, 14]. The motivation for developing the Supervised Network
Self-Organizing Map (SNet-SOM) model is to provide a simple and effective framework that exploits these facts
for the design of computationally effective solutions. Specifically, the SNet-SOM utilizes computationally
efficient unsupervised learning for identifying and classifying the simple regions (where a non-ambiguous
classification can be established) and supervised learning for the difficult (ambiguous) ones in a two-stage
learning process.
The unsupervised learning approach that is utilized by the SNet-SOM is based on the Self-Organizing Map
(SOM) of Kohonen [7, 15, 16]. The Self-Organizing Map is an unsupervised neural network model with the
capability of converging to a configuration that is indicative of the intrinsic statistical features contained in the
input patterns. In essence, the SOM acts as a powerful feature detector that can derive a representation of the
original data with a significantly lower dimensionality. Formally, the SOM constructs a principal curve that
approximates smoothly the nonlinearly related training data [16, 17]. This construction is a nonlinear analog of
the linear Principal Component Analysis (PCA) method [7]. The regularization of the approximating function in
the context of the SOM is therefore handled implicitly through the construction of the principal curve. This
regularization potential is exploited well in the context of the Supervised Network Self-Organizing Map
(SNet-SOM) for the derivation of high quality results over the non-ambiguous parts of the state space. The basic
SOM algorithm is modified with a dynamic node insertion/deletion process controlled by an entropy-based
criterion that allows an adaptive extension of the SOM. This extension proceeds until the total number of training
patterns that are mapped to neurons with high entropy (and therefore with ambiguous classification) is reduced to
a size that is numerically manageable by a capable supervised model.
Recently, neural network models with a strong mathematical basis for obtaining outstanding, near optimal
performance have been developed. A notable example capable of obtaining good generalization performance on
the basis of the training patterns alone (without incorporating a priori knowledge of the problem) is the Support
Vector Machine [7, 12]. These approaches, however, require a form of extensive optimization (e.g. solving a
quadratic programming problem [12, 18, 19]) with a computational complexity that does not scale well with the
size of the problem. The poor scaling behaviour seems to be a common feature of all the neural models that
attempt near optimal performance. The root of this fact is the well-known curse of dimensionality and the
corresponding combinatorial explosion of the problem state space [20]. Therefore, a device for dividing a
complex application domain into a part that corresponds to regions placed near class boundaries (where it is
difficult to perform classification decisions) and a part that is unambiguous (i.e. within class regions) becomes of
particular importance. Usually the unambiguous part accounts for most of the state space. Since it can be
managed with the computationally efficient adaptation of the SOM algorithm in the context of the SNet-SOM, the
computational requirements for the total problem are reduced substantially without a reduction of the
classification accuracy. Moreover, since the performance of many supervised learning algorithms deteriorates for
large problems (e.g. due to the local minima trapping problem in gradient descent algorithms [7, 8]), the
SNet-SOM in these cases improves the pure supervised learning solution in addition to reducing substantially the
computation time.
The paper proceeds as follows. Section 2 describes the proposed Supervised Network Self-Organizing Map
(SNet-SOM). Specifically, Subsection 2.1 presents the modifications to the basic Self-Organizing Map (SOM) for
the task of separating the regions of the state space over which complex decision boundaries should be enforced
(i.e. the ambiguous regions). Subsection 2.2 deals with the design of the supervised part of the SNet-SOM and
proposes the Support Vector Machines as one of the most effective models. Section 3 discusses the results of the
classification effectiveness of the plain SOM and compares them with the performance obtained after the
utilization of the additional supervised stage. The results discussed concern both simulated synthetic data and an
ischemia detection application. Finally, Section 4 presents the conclusions along with some directions onto which
further research can proceed for improvements.
2. The Supervised Network Self-Organizing Map
The Supervised Network Self-Organizing Map (SNet-SOM) is the proposed extension to the Self-Organizing
Map (SOM) of Kohonen [15–17], designed in order to cope with complex application domains by a flexible
combination of the SOM approach with supervised learning schemes. The SNet-SOM improves, for large
problems, the pure supervised learning solution in addition to reducing substantially the computation time.

The SNet-SOM consists of two components:

The Classification Partition SOM (CP-SOM).
The supervised expert network.

The size of the CP-SOM is dynamically expanded with an adaptive process optimized for the task of the
detection of the difficult ambiguous regions of the state space. It is trained over the whole training set. The
dynamic growth is based on the criterion of neuron ambiguity (i.e. uncertainty about class assignment), which is
quantified with the entropy. This is in contrast to the local quantization error approach of [21] that grows the
map at the nodes that accumulate the largest quantization error. Class ambiguity is accounted for much better
with the entropy measure than with the accumulated local error. It is easy to understand that the local error can
be large even with no class ambiguity, while the entropy directly and objectively quantifies the ambiguity.
Classification decisions are performed with the CP-SOM only at the unambiguous part of the state space that
corresponds to the neurons of small entropy. The supervised expert handles the ambiguous subspace. Below we
discuss the CP-SOM and the supervised expert network in detail.
2.1. The Classification Partition SOM (CP-SOM)

The Classification Partition SOM (CP-SOM) is initialized with a few nodes (usually four) and grows nodes to
represent the input data. Weight values of the nodes are self-organized according to a method inspired by the
SOM algorithm. This approach, however, has significantly lighter computational demands than the standard
SOM, because the fine-tuning of the weights is avoided and the CP-SOM is usually of small size.

The CP-SOM learning algorithm is as follows:

(1) Initialization Phase

Initialization of the weight vectors of the starting nodes (usually four) with random numbers within the domain
of feature values.

(2) Adaptation Phase

1. Present an input to the network and determine the weight vector that is closest to the input vector mapped to
the current feature map (the winner), using usually either the Manhattan or the Euclidean distance.
2. Adapt the weight vectors only in the neighborhood of the winner, and for the winner itself, according to the
following formula:
$$
\mathbf{w}_j(k+1) =
\begin{cases}
\mathbf{w}_j(k), & j \notin N_k \\
\mathbf{w}_j(k) + \eta(k)\,\Lambda_k\,\bigl(\mathbf{x}_k - \mathbf{w}_j(k)\bigr), & j \in N_k
\end{cases}
\qquad (1)
$$

where $\eta(k)$ is the learning rate, a monotonically decreasing sequence of positive parameters, $N_k$ is the
neighborhood at the $k$th learning step and $\Lambda_k$ is the neighborhood function.

The neighborhood function $\Lambda_k$ decreases monotonically with increasing distance from the winning neuron
(i.e. nodes closer to the winner are adapted more), as in the standard SOM algorithm, and the initial
neighborhood includes the whole map. However, both of these parameters (i.e. $N_k$, $\Lambda_k$) need not shrink with
time. This is explained by the following: initially, the neighborhood is large enough to include the whole map.
The CP-SOM has initially a much smaller size than a usual SOM; thus a large neighborhood is not required to
train the whole map during the first learning steps (e.g. with 4 nodes initially in the map, a neighborhood of only
1 is required). In contrast to the standard SOM algorithm, the neighborhood for each training epoch during the
dynamic expansion phase takes small starting values. A training epoch is defined as the training of the CP-SOM
with a fixed number of neurons at its lattice. Since the CP-SOM is initially of a very small size, a small
neighborhood parameter covers most of (or all) the map. As training proceeds, during subsequent training
epochs, the area defined by the neighborhood becomes localized near the winning neuron, not by shrinking the
vicinity radius (as in the standard SOM) but by enlarging the SOM with the dynamic growing.
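The following sketch illustrates one adaptation step of Eq. (1). The Gaussian form of the neighborhood
function $\Lambda_k$, the use of the Manhattan distance on the lattice and all function names are illustrative
assumptions; the paper does not prescribe them.

```python
import numpy as np

def adapt_step(weights, x, eta, radius):
    """One CP-SOM adaptation step, following Eq. (1) (illustrative sketch).

    weights : (rows, cols, dim) array of node weight vectors
    x       : input pattern, shape (dim,)
    eta     : learning rate eta(k)
    radius  : neighborhood radius N_k, in lattice units
    """
    # Winner = node with the smallest (Euclidean) distance to the input.
    d = np.linalg.norm(weights - x, axis=2)
    wi, wj = np.unravel_index(np.argmin(d), d.shape)

    rows, cols, _ = weights.shape
    for i in range(rows):
        for j in range(cols):
            lattice_dist = abs(i - wi) + abs(j - wj)   # Manhattan distance on the lattice
            if lattice_dist <= radius:                 # j in N_k: update, otherwise leave unchanged
                # Gaussian neighborhood function Lambda_k (an assumed form).
                lam = np.exp(-lattice_dist**2 / (2.0 * max(radius, 1)**2))
                weights[i, j] += eta * lam * (x - weights[i, j])
    return wi, wj
```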
(3) Expansion Phase

The objective of controlling the number of training patterns that correspond to the ambiguous regions (and that
therefore will be handled by the supervised expert) is the motivation for a modification of the basic SOM
algorithm that leads to its dynamic expansion. The number of these training patterns should adhere to the
computational limitations of the supervised learning algorithm (upper bound). Moreover, there should be enough
patterns to provide the essential information for establishing good generalization over the ambiguous regions
(lower bound). The expansion phase follows the adaptation phase. This phase is controlled by two parameters
playing the role of upper and lower bounds. The first one, the parameter SupervisedExpertMaxPatterns, specifies
the training set size above which the supervised expert can no longer be trained effectively. The other parameter,
SupervisedExpertMinPatterns, controls the lower bound on the size of the training set required for obtaining
effective generalization. The expansion phase consists of the following steps.
3.1 Calibration of the map with a majority-voting scheme. At this step not only the class of each node, but also
a parameter $HN_i$ characterizing the entropy of the nodes (NodeEntropy), is computed for every node $i$. This
parameter is computed according to Eq. (2), which is discussed below.
3.2 Detection of the neurons whose class assignments are ambiguous, referred to as the ambiguous neurons. A
neuron is ambiguous if its node entropy $HN_i$ exceeds a threshold value.
3.3 Evaluation of the map over the whole training set in order to compute the number of training patterns that
correspond to the ambiguous neurons. This number is denoted by the variable NumTrainingSetAtAmbiguous.
3.4 if NumTrainingSetAtAmbiguous > SupervisedExpertMaxPatterns then
// the number of patterns remaining for supervised training should not exceed the
// limitations of the supervised solution for effective training
3.4.1.a. Perform map expansion by inserting smoothly at the neighborhood of each ambiguous neuron a number
of neurons that depends on its fuzziness (i.e. the more uncertain the class of a neuron, the more neurons are
inserted over its neighborhood). The neuron insertion process is not described in detail since it follows the
guidelines of [21].
3.4.1.b. Repeat the adaptation phase after the dynamic extension of the map in order to adjust the new neurons
to the appropriate positions of the lattice, and re-execute the previous steps of the expansion phase, re-entering
to test the conditions of 3.4.
else
if NumTrainingSetAtAmbiguous < SupervisedExpertMinPatterns then
// too few training patterns are passed to the supervised expert
3.4.2. Reduce the parameter NodeEntropyThresholdForConsideringAmbiguous that controls whether or not a
node will be considered ambiguous. Thus, restarting from Step 3.2, more nodes will be considered ambiguous
and the size of the supervised expert training set will be increased.
else
// the number of patterns transferred to the supervised expert is within the desired limits
3.4.3. Generate training and testing sets for the supervised expert. Further supervised training will be performed
with these sets by the supervised learning algorithm in order to better resolve the ambiguous parts of the state
space.
endif
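A minimal control-flow sketch of this expansion loop is given below. The `winner` method of the map object,
the three callables and the factor used to relax the entropy threshold are assumptions introduced for illustration;
the paper specifies only the logic of steps 3.1–3.4.

```python
def expansion_phase(som, data, labels, max_patterns, min_patterns,
                    entropy_threshold, entropy_fn, grow_fn, adapt_fn):
    """Expansion-phase control loop (sketch of steps 3.1-3.4).

    entropy_fn(som, data, labels) : dict node -> HN_i          (step 3.1)
    grow_fn(som, ambiguous_nodes) : inserts nodes around ambiguous neurons
    adapt_fn(som, data)           : re-runs the adaptation phase
    som.winner(x)                 : id of the winning node for pattern x (assumed interface)
    """
    while True:
        node_entropy = entropy_fn(som, data, labels)                    # 3.1
        ambiguous = {n for n, h in node_entropy.items()
                     if h > entropy_threshold}                          # 3.2
        n_ambig = sum(1 for x in data if som.winner(x) in ambiguous)    # 3.3
        if n_ambig > max_patterns:            # 3.4.1: grow the map and re-adapt
            grow_fn(som, ambiguous)
            adapt_fn(som, data)
        elif n_ambig < min_patterns:          # 3.4.2: relax the entropy threshold
            entropy_threshold *= 0.9          # reduction factor is an assumption
        else:                                 # 3.4.3: hand the ambiguous patterns over
            return [(x, y) for x, y in zip(data, labels)
                    if som.winner(x) in ambiguous]
```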
This algorithm exploits well the topological ordering that the basic SOM provides and increases the resolution of
the representation over regions of the state space that lie between class boundaries. At this point, it should be
emphasized that simply increasing the SOM size with the adaptive extension algorithm until each neuron
represents a class unambiguously yields a SOM configuration that, although it fits the training set, fails to
generalize well.

The classification task proceeds by feeding the pattern to the CP-SOM. If the winning neuron is not an
ambiguous one, the CP-SOM classifies by using the class of the winning neuron. In the other case, the supervised
expert is used to perform the classification decision.
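This decision rule can be summarized in a few lines. The `winner` and `node_class` attributes of the map and
the `predict` interface of the trained expert are hypothetical names used only for illustration.

```python
def snet_som_classify(pattern, som, ambiguous_nodes, expert):
    """Hybrid SNet-SOM classification decision (sketch)."""
    winner = som.winner(pattern)
    if winner not in ambiguous_nodes:
        return som.node_class[winner]       # unambiguous region: the CP-SOM decides
    return expert.predict([pattern])[0]     # ambiguous region: the supervised expert decides
```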
It should be noted here that the CP-SOM can be used to confront the whole classification problem directly. The
only parameter that requires a significant change is the size of the map, which needs to be enlarged in order to
provide better class resolution over the state space of the problem. The attained performance is directly
comparable to those achieved for this task with other network types proposed in the literature [5, 6, 22].
Nevertheless, using solely the CP-SOM, there remain regions of the state space where complex decision
boundaries should be enforced in order to separate effectively between the different classes. The SNet-SOM
obtains further generalization performance by separating the patterns at these regions with supervised learning
schemes.
The assignment of a class label to each neuron of the CP-SOM is performed according to a majority-voting
scheme [17]. This scheme acts as a local averaging operator defined over the class labels of all the patterns that
activate that neuron as the winner (and accordingly are located at the neighborhood of that neuron). The typical
majority-voting scheme considers one vote for each winning occurrence. An alternative, more analog, weighted
majority voting scheme weights the votes each by a factor that decays with the distance of the voting pattern from
the winner (i.e. the larger the distance, the weaker the vote). The averaging operation of the majority and
weighted majority voting schemes effectively attenuates the artifacts of the training set patterns. An alternative
CP-SOM calibration method selects as the class of the neuron the class of the pattern nearest to that neuron
(i.e. according to the nearest neighbor rule [7]). This method does not perform the task of canceling noise well
and leads to lower overall classification performance. The performances of the majority voting and the weighted
majority voting schemes are similar both for the task of rejecting artifacts and for the classification task.
In the context of the SNet-SOM the utilization of either majority or weighted majority voting for the CP-SOM is
essential. These schemes allow the degree of class discrepancy for a particular neuron to be readily estimated.
Indeed, by counting the votes at each SOM neuron for every class, an entropy criterion for the uncertainty of the
class label of neuron $m$ can be directly evaluated as [23]:

$$
HN(m) = -\sum_{k=1}^{N_c} p_k \log p_k \qquad (2)
$$

where $N_c$ denotes the number of classes and $p_k = V_k / V_{\mathrm{total}}$ is the ratio of the votes $V_k$ for class $k$ to the
total number of votes $V_{\mathrm{total}}$ for neuron $m$. Clearly, the entropy is zero for unambiguous neurons and increases
as the uncertainty about the class label of the neuron increases. The upper bound of $HN(m)$ is $\log N_c$, and
corresponds to the situation where all classes are equiprobable (i.e. the voting mechanism does not favor a
particular class). Therefore, with these voting schemes the regions of the SOM that are placed at ambiguous
regions of the state space can be readily identified. For these regions the supervised expert is designed and
optimized for obtaining adequate generalization performance.
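A small sketch of the per-neuron entropy computation of Eq. (2) follows; the majority-voting counts are
accumulated from the winning occurrences, and the function name and data layout are illustrative assumptions.

```python
import numpy as np
from collections import Counter, defaultdict

def node_entropies(winners, labels):
    """Entropy HN(m) of Eq. (2) for every neuron, from majority-voting counts.

    winners : winning-node id for each training pattern
    labels  : class label for each training pattern
    """
    votes = defaultdict(Counter)
    for node, label in zip(winners, labels):
        votes[node][label] += 1                          # one vote per winning occurrence
    entropies = {}
    for node, counts in votes.items():
        p = np.array(list(counts.values()), dtype=float)
        p /= p.sum()                                     # p_k = V_k / V_total
        entropies[node] = float(-(p * np.log(p)).sum())  # zero for unambiguous neurons
    return entropies
```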
The computation time for each iteration of the adaptation phase of the SOM algorithm (which is the most time
consuming) scales with a factor that is almost linear in the size of the training set. As a general rule, the number
of iterations performed at the adaptation phase of the CP-SOM is much smaller than at the corresponding phase
of the standard SOM algorithm. Specifically, we have determined experimentally that a good initial value of the
learning rate is $\eta(k) = 0.2$, and having it decrease in 10 steps to 0.02 yields good results (thus the factor for
decreasing the learning rate is $\exp(\log(0.1)/10) = 0.7943$). The CP-SOM only separates the state spaces without
extensive fine-tuning. Moreover, the required number of neurons depends mainly on the number of classes that
are to be separated (i.e. it is independent of the training set size). Since the size of the CP-SOM is small, and the
number of training epochs (i.e. map expansions) is also very small (3 to 5 map expansions are usually sufficient),
the unsupervised phase of the SNet-SOM algorithm scales linearly with the size of the pattern set (although with
a large scaling factor).
In contrast, supervised learning approaches such as the Radial Basis Function networks have computational
demands, in terms of memory and processing resources, that scale with a factor that is about cubic in the size of
the training set. Practically, it was infeasible in our computing environment to train Radial Basis Function
networks with a size of more than 2000 × 500, where 2000 is the training set size and 500 the number of hidden
units (i.e. RBF centers). Therefore, for the applications considered, with more than 9000 patterns in the training
set, it is very difficult to train effectively an RBF network to accomplish the classification task directly.
2.2. The Supervised Expert Network
The supervised expert network has the task of discriminating over the state space regions where the class
decision boundaries are complex. The supervised expert network should be of a local approximation type and it
should incorporate formalism in its design for obtaining adequate generalization performance. Appropriate
neural network models that fulfill these requirements are the Radial Basis Function (RBF) networks and the
Support Vector Machines (SVM). The SOM prototype vectors create piecewise linear class boundaries [17] that
are usually not effective for resolving the class ambiguity over all the regions of the state space. Moreover, even
if the learning procedure enlarges the SOM adaptively until each of its neurons represents a single class
unambiguously, this solution addresses only the minimization of the training error and ignores the generalization
performance. On account of this, in the absence of a formal setting for designing generalization, the decision
boundaries that the SOM constructs for the ambiguous regions are not expected to cope well with discriminating
new patterns. On the contrary, a Support Vector Machine implementation of the supervised expert offers the
potential to construct near perfect decision boundaries. For example, the results discussed in [7] illustrate close
to optimal separation for a classification problem involving overlapping Gaussian distributions. Below we
discuss the implementation of the supervised expert with a Radial Basis Function network and with a Support
Vector Machine, emphasizing the latter choice, which provides better performance and a disciplined design.
2.2.1. Radial Basis Function Supervising Expert. The Radial Basis Function networks have been used
successfully in many applications [24, 25]. They exploit Tikhonov's regularization theory for obtaining
generalization performance, which establishes a tradeoff between a term that measures the fitness of the solution
to the training set and one that evaluates the smoothness of the solution. Denoting by $\mathbf{x}_i$, $d_i$, $F(\mathbf{x}_i)$ the input
vectors, the desired responses and the corresponding realizations of the network respectively, this tradeoff can be
formulated with a cost function as [7, 9]:

$$
C(F) = C_s(F) + \lambda\, C_r(F) \qquad (3)
$$

where

$$
C_s(F) = \frac{1}{2} \sum_{i=1}^{l} [d_i - F(\mathbf{x}_i)]^2 \qquad (4)
$$

$$
C_r(F) = \frac{1}{2} \|DF\|^2
$$
The $C_s(F)$ is the standard error term that accounts for the fitting to the training set in the least squares error
sense, $\lambda$ is a positive real number called the regularization parameter, and $C_r(F)$ is the regularization term that
favors the smoothness of the solution. The latter term is the most important from the point of view of
generalization performance. The operator $D$ is a stabilizer, because it stabilizes the solution by providing
smoothness. In turn, a smooth solution is significantly more robust to erroneous examples of the training set.
Nonetheless, the design of a proper generalization performance for RBF networks still remains a difficult and
complex issue that involves heuristic criteria for the selection of the centers and of their parameters [26].
The SNet-SOM with an RBF network as supervised expert has been configured to grow adaptively until about
2000 patterns (i.e. SupervisedExpertMaxPatterns = 2000) map onto the ambiguous neurons. This size of the
training set is appropriate for an RBF solution by means of a numerically effective approach. Specifically,
$m = 500$ fixed centers are selected at random from the training patterns. Their spread $\sigma$ is common and is
computed according to the empirical formula $\sigma = d_{\max}/\sqrt{2m}$, where $d_{\max}$ is the maximum distance between
the chosen centers, which for the normalized 5-dimensional input vector is $\sigma = \sqrt{5}/\sqrt{2 \cdot 500} = 0.707$. The only
parameters that need to be learned are the linear weights of the output layer, which are computed with the
pseudoinverse method [7]. The number of centers that yielded good generalization performance is $m = 500$, and
the regularization parameter $\lambda$ is in the range 0.1 to 0.3.
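The following is a minimal sketch of this RBF expert: fixed centers drawn at random from the training patterns,
a common spread $d_{\max}/\sqrt{2m}$, and output weights obtained by a regularized least-squares (pseudoinverse)
solution. The Gaussian form of the basis functions, the ridge form of the regularization and the function names
are assumptions made for illustration.

```python
import numpy as np
from scipy.spatial.distance import cdist, pdist

def train_rbf_expert(X, d, m=500, lam=0.2, seed=0):
    """RBF expert with fixed random centers and pseudoinverse output weights (sketch).

    X : (l, n) training inputs routed to the ambiguous neurons
    d : (l,) desired responses (e.g. +1 / -1 class labels)
    """
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=m, replace=False)]   # m fixed centers (requires l >= m)
    sigma = pdist(centers).max() / np.sqrt(2 * m)             # common spread d_max / sqrt(2m)
    G = np.exp(-cdist(X, centers) ** 2 / (2 * sigma ** 2))    # design matrix of Gaussian units
    # Regularized least squares for the linear output-layer weights.
    w = np.linalg.solve(G.T @ G + lam * np.eye(m), G.T @ d)
    return centers, sigma, w

def rbf_predict(X, centers, sigma, w):
    G = np.exp(-cdist(X, centers) ** 2 / (2 * sigma ** 2))
    return np.sign(G @ w)
```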
The RBF supervised expert is simple to design and to implement. However, it is not an easy task to estimate the
important parameter SupervisedExpertMinPatterns. The RBF networks do not include sufficient tools to assess
the generalization performance, and (time consuming) empirical cross-validation methods are usually
needed [7]. A simple, yet effective heuristic is to transfer to the RBF supervised expert approximately the
maximum number of patterns that it can handle effectively.
2.2.2. Support Vector Machine Supervising Expert. Since the main objective of the SNet-SOM is to use the
supervised expert in order to generalize effectively at the state space regions where the plain SOM cannot
generalize well, we have fitted a supervised model with improved generalization abilities, the Support Vector
Machine (SVM) network [12, 27]. The SVM obtains high generalization performance without the need to add
a priori knowledge, even when the dimension of the input space is high. Moreover, it is perhaps the model that
allows the most accurate formal assessment of the generalization performance. This fits well within the
framework of the SNet-SOM. Below we attempt a rigorous assessment of the generalization performance of the
supervised expert. Also, we provide a methodology for the selection of an SVM model well fitted to the number
of transferred patterns (i.e. model order selection). But first, some notation and key concepts should be defined.
The problem of classification is to estimate a function $f: \mathbb{R}^N \to \{\pm 1\}$ using input–output training data
$(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_l, y_l) \in \mathbb{R}^N \times \{\pm 1\}$ such that $f$ will correctly classify unseen examples $(\mathbf{x}, y)$,
i.e. $f(\mathbf{x}) = y$. These examples $(\mathbf{x}, y)$ are generated from the same underlying probability distribution $P(\mathbf{x}, y)$
as the training data. The quantity to be minimized in order to obtain generalization performance is the risk
(or prediction risk)

$$
R[f] = \int \frac{1}{2}\, |f(\mathbf{x}) - y| \, dP(\mathbf{x}, y) \qquad (5)
$$

Since $P(\mathbf{x}, y)$ is unknown we can only minimize the empirical risk

$$
R_{\mathrm{emp}}[f] = \frac{1}{l} \sum_{i=1}^{l} \frac{1}{2}\, |f(\mathbf{x}_i) - y_i| \qquad (6)
$$
The Support Vector Machine model of machine learning establishes formal bounds on the generalization
error [28, 29]. These results make it possible to bound the generalization error for a chosen significance level.
They relate the number of examples, the training set error and the complexity of the hypothesis space to the
generalization error [30].

Vapnik–Chervonenkis (VC) theory [12] shows that it is imperative to restrict the class of functions from which
$f$ is chosen to one that has a capacity suitable for the amount of the available training data. The developed
theory provides bounds on the test error. The inductive principle of structural risk minimization [29] introduces
a systematic method to minimize these bounds by considering both the empirical risk and the capacity of the
function class, which is accounted for by the Vapnik–Chervonenkis (VC) dimension $h$. For binary classification,
$h$ is the maximal number of points that can be separated into two classes in all possible $2^h$ ways by the functions
that the learning machine can implement. An important VC bound is the one outlined by the following theorem
(Theorem 1), which provides a bound on the rate of uniform convergence of the training error to the classification
error, for a set of classification functions with VC dimension $h$ [31].
Theorem 1. If $h < l$ is the VC dimension of the class of functions that the learning machine can implement,
then for all functions of that class, with probability of at least $1 - p$, the bound

$$
R[f] \le R_{\mathrm{emp}}[f] + \Phi\!\left(\frac{h}{l}, \frac{\log(p)}{l}\right)
$$

holds, where the confidence term is defined as

$$
\Phi\!\left(\frac{h}{l}, \frac{\log(p)}{l}\right) = \sqrt{\frac{h\left(\log\frac{2l}{h} + 1\right) - \log\frac{p}{4}}{l}}
$$
This theorem states clearly the dependence of the generalization error $R[f]$ on the VC dimension parameter $h$.
The Support Vector Machine is perhaps the only model where this important parameter can be explicitly
controlled. This is done by enforcing maximum separation between the patterns of different classes with the
construction of the optimal linear separating hyperplane, usually in a very high-dimensional feature space where
the input data are mapped by means of a kernel function.

The key concepts of the Support Vector Machine approach to the implementation of the principle of structural
risk minimization are briefly presented. Details can be found in [12]. This material serves also as the
theoretical basis for the comprehension of the results
of the SVM supervised expert implementation.
The Support Vector (SV) algorithm implements Structural Risk Minimization based on a structure of separating
hyperplanes imposed on a dot product space $X$. For a set of pattern vectors $\mathbf{x}_1, \ldots, \mathbf{x}_l \in X$, these hyperplanes
can be written as $\{\mathbf{x} \in X : \mathbf{w} \cdot \mathbf{x} + b = 0\}$, where $\mathbf{w}$ is an adjustable weight vector and $b$ is a bias. In order to
enforce the uniqueness of the hyperplane we require

$$
\min_{i=1,\ldots,l} |\mathbf{w} \cdot \mathbf{x}_i + b| = 1 \qquad (7)
$$

i.e. the data point closest to the hyperplane has a distance of $1/\|\mathbf{w}\|$. This distance is referred to as the margin of
separation. Hyperplanes constrained by (7) are termed canonical hyperplanes.
The following important theorem provides a rigorous way of controlling the SVM generalization performance
with the computation of the appropriate weight vector $\mathbf{w}$ for the maximization of the margin [12, 31].
Theorem 2. Let the $l$ training set vectors $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_l \in X$ belong to a sphere of radius $R$ and center $\mathbf{a}$,
i.e. $B_R(\mathbf{a}) = \{\mathbf{x} \in X : \|\mathbf{x} - \mathbf{a}\| < R\}$, $\mathbf{a} \in X$. Also, let $f_{\mathbf{w},b} = \mathrm{sgn}(\mathbf{w} \cdot \mathbf{x} + b)$ be canonical hyperplane decision
functions, defined on these points. Then the set $\{f_{\mathbf{w},b} : \|\mathbf{w}\| \le A\}$ has a VC-dimension $h$ satisfying

$$
h \le \min(R^2 A^2, n) + 1 \qquad (8)
$$

Theorem 2 states that control over the VC dimension (i.e. complexity) of the optimal hyperplane can be exercised
independently of the dimensionality $n$ of the input space, by properly choosing the margin of separation
$\rho = \frac{1}{\|\mathbf{w}\|} \ge \frac{1}{A}$. Applying the framework of structural risk minimization for linear machines, a set of separating
hyperplanes of varying VC dimension is constructed such that the decrease of the VC dimension occurs at the
expense of the smallest possible increase in training error. The SVM imposes a structure on the set of separating
hyperplanes by constraining the Euclidean norm of the weight vector $\mathbf{w}$, in order to minimize the VC dimension
of the learning machine, according to Theorem 2.
Suppose we are given a set of examples $(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_l, y_l)$, $\mathbf{x}_i \in X$, $y_i \in \{\pm 1\}$, and we assume that the two
classes of the classification problem are linearly separable. In this case, we can find an optimal weight vector
$\mathbf{w}_0$ such that $\|\mathbf{w}_0\|^2$ is minimum (in order to maximize the margin $\rho = \frac{1}{\|\mathbf{w}\|}$ of Theorem 2) and
$y_i(\mathbf{w}_0 \cdot \mathbf{x}_i + b) \ge 1$.

The support vectors are those training examples that satisfy the equality, i.e. $y_i(\mathbf{w}_0 \cdot \mathbf{x}_i + b) = 1$. The support
vectors define two hyperplanes. The one hyperplane goes through the support vectors of one class and the other
through the support vectors of the other class. The distance between the two hyperplanes defines the margin of
separation, which is maximized when the norm of the weight vector $\|\mathbf{w}_0\|$ is minimum. This minimization can
proceed by maximizing the following function with respect to the variables $\alpha_i$ (Lagrange multipliers) [12]:

$$
W(\alpha) = \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_i \alpha_j \, (\mathbf{x}_i \cdot \mathbf{x}_j) \, y_i y_j \qquad (9)
$$

subject to the constraint $0 \le \alpha_i$. If $\alpha_i > 0$ then $\mathbf{x}_i$ corresponds to a support vector.
The classification of an unknown vector $\mathbf{x}$ is obtained by computing

$$
F(\mathbf{x}) = \mathrm{sgn}\{\mathbf{w}_0 \cdot \mathbf{x} + b\}, \quad \text{where} \quad \mathbf{w}_0 = \sum_{i=1}^{l} \alpha_i y_i \mathbf{x}_i,
$$

where the sum accounts for only the $N_s \le l$ nonzero support vectors (i.e. training set vectors $\mathbf{x}_i$ whose $\alpha_i$ are
nonzero). Clearly, after the training, the classification can be accomplished efficiently by taking the dot product
of the optimum weight vector $\mathbf{w}_0$ with the input vector $\mathbf{x}$.
The case where the data are not linearly separable is handled by introducing slack variables $(\xi_1, \xi_2, \ldots, \xi_l)$ with
$\xi_i \ge 0$ [27] such that $y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1 - \xi_i$, $i = 1, \ldots, l$. The introduction of the variables $\xi_i$ allows
misclassified points, which have their corresponding $\xi_i > 1$. Thus, $\sum_{i=1}^{l} \xi_i$ is an upper bound on the number of
training errors. The corresponding generalization of the concept of the optimal separating hyperplane is obtained
by the solution of the following optimization problem:

$$
\text{minimize} \quad \frac{1}{2}\, \mathbf{w} \cdot \mathbf{w} + C \sum_{i=1}^{l} \xi_i \qquad (10)
$$

$$
\text{subject to:} \quad y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1 - \xi_i \quad \text{and} \quad \xi_i \ge 0, \quad i = 1, \ldots, l \qquad (11)
$$
The control of the learning capacity is achieved by the minimization of the first term of (10), while the purpose
of the second term is to penalize misclassification errors. The parameter $C$ is a kind of regularization parameter
that controls the tradeoff between learning capacity and training set errors. Clearly, a large $C$ corresponds to
assigning a higher penalty to errors.
Finally, the case of nonlinear Support Vector Machines should be considered. The input data in this case are
mapped into a high dimensional feature space through some nonlinear mapping $\Phi$ chosen a priori [12]. The
optimal separating hyperplane is then constructed in this space. The corresponding optimization problem is
obtained from (9) by substituting $\mathbf{x}$ by its mapping $\mathbf{z} = \Phi(\mathbf{x})$ in the feature space:

$$
\text{maximize} \quad W(\alpha) = \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_i \alpha_j \, (\Phi(\mathbf{x}_i) \cdot \Phi(\mathbf{x}_j)) \, y_i y_j \qquad (12)
$$
Also, the constraint $0 \le \alpha_i$ becomes $0 \le \alpha_i \le C$ (assuming the nonseparable case). When it is possible to derive
a proper kernel function $K$ such that $K(\mathbf{x}_i, \mathbf{x}_j) = \Phi(\mathbf{x}_i) \cdot \Phi(\mathbf{x}_j)$, the mapping $\Phi$ is not explicitly used.
Conversely, given a symmetric positive kernel $K(\mathbf{x}, \mathbf{y})$, Mercer's theorem [12] states that there exists a mapping
$\Phi$ such that $K(\mathbf{x}, \mathbf{y}) = \Phi(\mathbf{x}) \cdot \Phi(\mathbf{y})$. By designing a kernel $K$ that satisfies Mercer's condition, the training
algorithm is reformulated as the following optimization:

$$
W(\alpha) = \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_i \alpha_j \, K(\mathbf{x}_i, \mathbf{x}_j) \, y_i y_j \qquad (13)
$$
with the constraint $0 \le \alpha_i \le C$, and the decision function becomes

$$
F(\mathbf{x}) = \mathrm{sgn}\left( \sum_{i=1}^{l} \alpha_i y_i K(\mathbf{x}_i, \mathbf{x}) + b \right)
$$

With different expressions for the inner products $K(\mathbf{x}_i, \mathbf{x})$ we can construct different learning machines with
arbitrary types of decision surfaces (nonlinear in input space). The best known kernel types are the polynomial
and the radial basis. Polynomial kernels specify polynomials of any fixed order $d$ for the inner product in the
corresponding feature space, i.e.

$$
K(\mathbf{x}_i, \mathbf{x}) = ((\mathbf{x}_i \cdot \mathbf{x}) + 1)^d \qquad (14)
$$

Radial Basis Function (RBF) kernels construct decision functions of the form

$$
F(\mathbf{x}) = \mathrm{sgn}\left( \sum_{i=1}^{l} \alpha_i y_i \exp\!\left(-\frac{\|\mathbf{x} - \mathbf{x}_i\|^2}{\sigma^2}\right) + b \right),
$$

with a kernel of the type

$$
K(\mathbf{x}_i, \mathbf{x}) = \exp\!\left(-\frac{\|\mathbf{x} - \mathbf{x}_i\|^2}{\sigma^2}\right).
$$

In the RBF case, the SVM training algorithm determines the centers (support vectors) $\mathbf{x}_i$, the corresponding
weights $\alpha_i$ and the threshold $b$. Another kernel of the RBF type is the Laplacian RBF that uses

$$
K(\mathbf{x}_i, \mathbf{x}) = \exp\!\left(-\frac{\|\mathbf{x} - \mathbf{x}_i\|}{\sigma^2}\right).
$$
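As a concrete illustration, the two kernel families above can be instantiated with a modern SVM library; the
snippet below uses scikit-learn (not used in the original work), and the value of $C$ and the data names
`X_amb`, `y_amb` are hypothetical.

```python
from sklearn.svm import SVC

# Polynomial kernel K(x_i, x) = ((x_i . x) + 1)^d of Eq. (14):
# scikit-learn's form (gamma * <x_i, x> + coef0)^degree matches it with gamma = 1, coef0 = 1.
poly_svm = SVC(kernel="poly", degree=3, gamma=1.0, coef0=1.0, C=10.0)

# Gaussian RBF kernel K(x_i, x) = exp(-|x - x_i|^2 / sigma^2):
# scikit-learn's exp(-gamma * |x - x_i|^2) matches it with gamma = 1 / sigma^2.
sigma = 0.7
rbf_svm = SVC(kernel="rbf", gamma=1.0 / sigma**2, C=10.0)

# X_amb, y_amb would be the patterns routed to the ambiguous neurons:
# poly_svm.fit(X_amb, y_amb); rbf_svm.fit(X_amb, y_amb)
```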
The SVM is perhaps the only model which permits disciplined model order selection for the optimization of the
generalization performance. This makes it highly suited for the implementation of the supervising expert in the
context of the SNet-SOM.

In order to formulate a means for obtaining the best possible generalization by controlling characteristics of the
learning machine, we utilize the ideas of [32] in the context of determining the kernel degree which yields the
best generalization from the training data transferred to the supervising expert. We work with a polynomial type
of kernel, i.e. a kernel of type (14). Following the results of [12, pp. 428–430, 32], the VC-dimension can be
estimated as

$$
h \approx c_1 h_{\mathrm{est}}, \qquad h_{\mathrm{est}} = R^2 \|\mathbf{w}\|^2
$$

with $c_1 < 1$ independent of the kernel. Thus, in order to compute $h_{\mathrm{est}}$, we need to compute $R$, the radius of the
smallest sphere enclosing the training data in feature space. This task is formulated with the following quadratic
programming problem:

$$
\text{minimize } R^2 \quad \text{subject to} \quad \|\mathbf{z}_i - \mathbf{z}^*\|^2 \le R^2 \qquad (15)
$$

where $\mathbf{z}_i = \Phi(\mathbf{x}_i)$ and $\mathbf{z}^*$ is the center of the sphere enclosing the images of all the training data in feature
space, which is to be determined. This optimization problem can be solved as in [32] and yields the estimation of
the optimal support vector machine model for the supervising expert.
Table 1. Generalization performance of SVM classifiers with polynomial kernels of various degrees. We can
observe that the estimated VC dimension of the polynomial SVM model, h_est, is much smaller than the
dimensionality of the feature space. The testing set performance attains its maximum when the VC dimension of
the learning machine (estimated with h_est) is at its minimum.

Chosen degree d of       Dimensionality of      Estimated VC dimension of the      Testing set
polynomial classifier    the feature space      polynomial SVM model, h_est        performance (%)
2                        630                    17                                 72.4
3                        7770                   12                                 81.3
4                        73,815                 27                                 65.9
5                        575,757                38                                 64.7
6                        3,838,380              65                                 62.4
7                        22,481,940             161                                58.6
For example, for N = 2045 patterns transferred to a supervised expert in the ischemia detection application
described in the applications section, some results concerning the polynomial SVM model selection are
presented in Table 1. We observe that the model with the smallest estimated VC dimension $h_{\mathrm{est}}$ obtains the
best generalization performance.
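A rough sketch of this model order selection criterion is shown below, assuming scikit-learn for the SVM
training. The radius $R$ is approximated with the feature-space sphere centered at the mean of the mapped data
rather than by solving the quadratic programming problem of Eq. (15) exactly, so the returned value is only an
approximation of $h_{\mathrm{est}} = R^2\|\mathbf{w}\|^2$; the function name and the value of $C$ are illustrative.

```python
import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel
from sklearn.svm import SVC

def estimate_h(X, y, degree, C=10.0):
    """Approximate h_est = R^2 * ||w||^2 for a polynomial SVM of a given degree (sketch)."""
    K = polynomial_kernel(X, X, degree=degree, gamma=1.0, coef0=1.0)   # ((x_i . x_j) + 1)^d
    svm = SVC(kernel="precomputed", C=C).fit(K, y)
    # ||w||^2 = sum_ij (alpha_i y_i)(alpha_j y_j) K(x_i, x_j); dual_coef_ holds alpha_i * y_i.
    a = np.zeros(len(X))
    a[svm.support_] = svm.dual_coef_[0]
    w_norm2 = a @ K @ a
    # R^2 approximated by the largest squared distance of a mapped point from the feature-space mean.
    R2 = np.max(np.diag(K) - 2 * K.mean(axis=1) + K.mean())
    return R2 * w_norm2

# The degree with the smallest estimated h_est would be selected for the supervising expert.
```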
The training of Support Vector Machines involves the solution of a quadratic programming optimization problem
which generally has a worst-case computational complexity of order $N_s^3$, where $N_s$ is the number of support
vectors [33]. This complexity arises from the need to perform an inversion of an $N_s \times N_s$ Hessian matrix and
limits the applicability of SVMs to small scale problems. Currently, some decomposition algorithms have been
proposed that extend the support vector optimization to large data sets by exploiting decomposition
schemes [13, 19]. However, although these approaches alleviate the problem, they still require significant
computational resources and demand special properties from the training set in order to obtain computational
efficiency (e.g. the approach of [19] relies on many of the support vectors having their corresponding Lagrange
multipliers $\alpha_i$ at the upper bound, $\alpha_i = C$). Therefore, it is far better to limit the size of the support vector
optimization problem, with a device like the proposed SNet-SOM, without sacrificing the accuracy of the final
solution.
3. Applications
This section presents the results of the application of the SNet-SOM to three distinct types of applications. The
first data set is a synthetic one and it illustrates the potential of the SNet-SOM to uncover hidden dependencies
in large data sets. The second experiment deals with the distinction of chaos from noise. The chaotic samples
have been obtained from the simulation of the Lorenz chaotic attractor. Finally, the third one involves the
classification of real data. It concerns the detection of ischemic episodes from the information available in the
European ST-T Database. In all the cases the SNet-SOM has improved significantly the SOM results and has
presented an architecture scalable to the problem size.
3.1. Synthetic Data
The performance of the suggested neural models can be evaluated more precisely with synthetic data, for which
all the parameters (e.g. sample size, noise levels, and degree of input/output dependence) and the form of the
dependencies can be controlled. In contrast, for real data it is very difficult to obtain objective criteria for
evaluating the effectiveness of the constructed models.
Consider for simplicity that we observe the outcome of only one variable Y (e.g. heart disease). The observed
data consist of a number N of attributes (e.g. age, weight, working habits, athletic activity), $A_i$, $i = 1, \ldots, N$,
that relate (possibly) to the outcome. These attributes are described with Random Variables (R.V.) $A_i$,
$i = 1, \ldots, N$. It is assumed that the dependence of the outcome on the attribute variables can be expressed by a
set of (unknown) smooth functions. This smoothness requirement more formally implies that the dependence
between the expected values of the outcome $E(Y)$ and the expected values of the attributes $E(A_i)$ can be
approximated with continuous functions.
Therefore, the explored model has the form:

$$
Y = f(A_1, A_2, \ldots, A_N) \qquad (16)
$$

where the function $f$ can be decomposed into $N$ continuous, usually nonlinear, functions $f_i$, i.e.

$$
Y = f(A_1, A_2, \ldots, A_N) = \sum_i f_i(A_i) \qquad (17)
$$
In the simulations, most of the functions $f_i$ are not implemented explicitly with a mathematical formula; rather,
their input/output relationship is tabulated in the form of an input/output mapping. This choice is selected to
mimic the difficulties of modeling real-world problems with closed-form functions.

The training set is derived by inducing noise both at the observation variables $A_i$, $i = 1, \ldots, N$, and at the
observed outcome variable $Y$. This is in accordance with the fact that usually, in practice, in data mining
problems both the observed variables and the outcomes are subject to inaccuracies. Therefore, the construction
of the training set takes the following steps (a code sketch of the procedure is given after the list):
1. Generation (randomly) of some proper values $V_i$, $i = 1, \ldots, N$, for the attributes (i.e. for the input
variables) $A_i$, $i = 1, \ldots, N$.
2. Induction of observation noise to these attribute values. Denote by $V'_i$, $i = 1, \ldots, N$, the corresponding
values that model disturbed observations of the attributes.
3. Computation of the values of the outcome variables. The case of one outcome variable $Y$ is considered for
simplicity (the consideration of more outcome variables can be performed similarly). The value $O$ of the
outcome variable $Y$ is computed as $O = f(V_1, V_2, \ldots, V_N) = \sum_i f_i(V_i)$, i.e. it is expressed as a function of
the correct values of the attributes and not the observed ones. The observed disturbed attribute values $V'_i$
have not been used, since the observation inaccuracy does not affect the operation (and therefore the
outcomes) of the system.
4. Induction of observation noise to the outcome variables. For the case of the outcome variable $O$, denote by
$O'$ the measured inaccurate outcome. Therefore, one training set sample consists of the values
$V'_1, V'_2, \ldots, V'_N, O'$.
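A minimal sketch of this construction, under assumed choices (uniform attribute values, additive Gaussian
observation noise, a median-based binarization of the outcome into two classes, and illustrative hidden
functions $f_i$), is:

```python
import numpy as np

rng = np.random.default_rng(1)

def make_noisy_training_set(f_list, n_samples, sigma_in=0.1, sigma_out=0.1):
    """Steps 1-4 of the synthetic training set construction (sketch)."""
    n_attr = len(f_list)
    V = rng.uniform(0.0, 1.0, size=(n_samples, n_attr))        # step 1: clean attribute values V_i
    V_noisy = V + rng.normal(0.0, sigma_in, size=V.shape)       # step 2: disturbed observations V'_i
    O = sum(f(V[:, i]) for i, f in enumerate(f_list))           # step 3: outcome from the clean values
    O_noisy = O + rng.normal(0.0, sigma_out, size=O.shape)      # step 4: disturbed outcome O'
    return V_noisy, O_noisy, V, O

# Illustrative hidden dependencies f_i (the paper tabulates them instead of using formulas).
fs = [np.sin, np.cos, lambda a: a**2, np.tanh, lambda a: np.exp(-a)]
Xn, On, Xc, Oc = make_noisy_training_set(fs, n_samples=1000)
labels = np.where(On > np.median(On), 1, -1)   # an assumed binarization into two classes
```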
The clean sample $V_1, V_2, \ldots, V_N, O$ serves to estimate the noise level of the training set. That is, for every
pattern $V_1, V_2, \ldots, V_N$, the closest noisy pattern $V'_1, V'_2, \ldots, V'_N$ in the training set is detected and their class
labels are compared. Therefore, an estimate of the training set misclassification ratio due to observation noise is
obtained. The regularization networks are expected not only to uncover the hidden dependencies and thereby to
be able to predict the outcome variable from the input observations, but also to reduce the misclassification ratio
on new testing patterns (testing set misclassification ratio). This comes as a consequence of their capability to
regularize the input/output mapping by capturing the smooth dependence of the outcomes on the attributes.

Thereafter, random testing patterns are generated, their classifications are computed (according to the hidden
dependencies) and a similar smallest-distance-over-the-training-set classification procedure is followed. The
resulting misclassification ratio is referred to as the raw training set misclassification ratio. This parameter serves
as a reference for the estimation of the generalization performance that the networks obtain from their robust
regularization framework. The regularization networks are able, under the proper conditions, not only to uncover
the laws that govern the dynamics of the physical process but also to improve substantially upon the raw training
set misclassification ratio. This fact becomes evident from the results of the simulation experiments presented
below. The most prominent factors for their successful application are a sufficient size of the training set and the
sufficiency of the set of approximation functions realizable by the network to match the complexity of the
environment from which the training data are obtained. The learning capacity of the machine can be incorporated
within the mathematical framework of the Vapnik–Chervonenkis (VC) dimension [7, 12].
Table 2 presents some illustrative results from the simulation experiments. The first column is the number of
patterns in the (disturbed) training set. The second column quantifies the level of noise induced at the training
set using the methodology described above, in order to perturb both the attribute values and the related outcomes.
The next four columns list the misclassification ratios for the corresponding neural networks that are evaluated.
Specifically, the third column displays the classification performance of the plain unsupervised SOM model and
the fourth incorporates the LVQ supervised fine-tuning. It becomes evident that
Table 2. Some indicative results from the evaluation of SOM, SOM/LVQ, RBF and SNet-SOM designs for the
problem of extracting information from a noisy training set. For the RBF case, the numbers in brackets denote
the number of training patterns and the number of RBF centers respectively. For the 15,000- and 30,000-pattern
training sets, a numerical solution of the RBF network was not achievable.

Training    Raw training set misclassification     Testing set misclassification ratio
set size    ratio (disturbance of the              SOM (%)   SOM with    RBF (σ = 0.095,        SNet-SOM
            training set) (%)                                LVQ (%)     λ = 0.1) (%)           (4 × 4)
1000        10.5                                   9.1       8.7         7.2 [1000 × 500]       7.02
1000        15.4                                   14.3      12.9        10.1 [1000 × 500]      9.35
1000        18.5                                   16.9      15.5        13.2 [1000 × 500]      13.1
1000        22.9                                   19.4      18.7        16.1 [1000 × 500]      16.1
1000        29.7                                   24.7      23.1        21.2 [1000 × 500]      20.2
2000        12.3                                   10.1      8.4         7.1 [2000 × 400]       7.2
2000        18.2                                   12.3      9.9         9.8 [2000 × 50]        9.3
2000        26.7                                   15.7      13.2        12.1 [2000 × 200]      11.3
2000        34.6                                   17.5      14.2        12.9 [2000 × 200]      9.8
5000        14.3                                   11.5      9.3         9.4 [5000 × 100]       8.7
5000        17.8                                   12.1      9.6         10.18 [5000 × 100]     10.5
5000        28.9                                   14.9      13.2        14.2 [5000 × 100]      13.1
5000        38.66                                  16.5      14          14.8 [5000 × 100]      13.2
15,000      12.4                                   11.2      11.1        not achievable         10.7
15,000      18.9                                   13.3      12.9        not achievable         11.1
15,000      27.9                                   14.8      14.3        not achievable         13.3
15,000      40.2                                   17.8      17.2        not achievable         15.9
30,000      11.5                                   10.1      9.8         not achievable         9.0
30,000      19.8                                   13.9      13.2        not achievable         12.2
30,000      29.8                                   15.2      14.9        not achievable         13.7
30,000      35.5                                   16.9      16.7        not achievable         15.1
LVQ succeeds in offering a reduction of the testing set misclassification ratio. Further improvement (at least for
relatively small training sets) is obtained with the RBF networks. The results are those contained in the fifth
column of Table 2. In the brackets, the dimensionality of the RBF network in terms of the number of training set
patterns and the number of RBF centers is shown.

The RBF networks incorporate regularization in their design. Therefore, we expect them to achieve superior
performance over designs that do not incorporate regularization techniques explicitly. Indeed, the simulation
experiments have yielded superior performances for relatively small training sets (i.e. of about 1000–2000
training patterns). However, the RBF designs do not scale well, and therefore for large training sets SOM based
schemes have demonstrated a better performance. Indeed, for the larger training sets (i.e. 15,000 and 30,000
patterns) it was impossible to train an RBF network.

The SNet-SOM consisted of a 4 × 4 CP-SOM trained with the Manhattan type of distance. As a supervised
expert an RBF network has been selected. It can be observed that the SNet-SOM results outperform those
obtained with the single net types in terms of the reduction of the misclassification ratio.
3.2. Distinction of Chaos from Noise
The second application of the SNet-SOM addresses the problem of separating noise from chaos. Specifically, the
Lorenz chaotic system has been used to generate a chaotic trajectory. This trajectory evolves over a chaotic
attractor lying in the three-dimensional space. The Lorenz chaotic system is described according
to [34]:

$$
\frac{dx(t)}{dt} = \sigma\,(y(t) - x(t))
$$
$$
\frac{dy(t)}{dt} = -x(t)\,z(t) + r\,x(t) - y(t)
$$
$$
\frac{dz(t)}{dt} = x(t)\,y(t) - b\,z(t)
$$

where $\sigma = 10$, $b = 8/3$, $r = 28$ is a typical configuration of the parameters.
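A sketch of the data generation for this experiment is given below: the Lorenz trajectory is obtained by numerical
integration, and the noise class is drawn from a Gaussian generator scaled to the attractor's domain. The
integration grid, the initial condition and the scaling of the noise are assumptions made for illustration.

```python
import numpy as np
from scipy.integrate import odeint

def lorenz(state, t, sigma=10.0, b=8.0 / 3.0, r=28.0):
    """Lorenz system with the typical parameters sigma = 10, b = 8/3, r = 28."""
    x, y, z = state
    return [sigma * (y - x), -x * z + r * x - y, x * y - b * z]

t = np.linspace(0.0, 100.0, 10000)
trajectory = odeint(lorenz, [1.0, 1.0, 1.0], t)               # samples of the chaotic class
lo, hi = trajectory.min(axis=0), trajectory.max(axis=0)
rng = np.random.default_rng(2)
noise = rng.normal(loc=(lo + hi) / 2, scale=(hi - lo) / 6,    # Gaussian noise normalized to the
                   size=trajectory.shape)                     # attractor's domain (assumed scale)
X = np.vstack([trajectory, noise])
y = np.hstack([np.ones(len(trajectory)), -np.ones(len(noise))])
```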
The dynamics of the Lorenz chaotic system evolve to the well-known Lorenz chaotic attractor [34]. This attractor
is characterized by complex nonlinear dynamics. The objective of the experiment is to design a classification
system that is able to distinguish between a three-dimensional vector from the evolution of the Lorenz chaotic
system and random Gaussian noise. Since the Lorenz attractor evolves over a significant portion of the
three-dimensional state space, the theoretical performance of any classifier will be limited in proportion to the
correlation dimension of the Lorenz attractor [35]. However, since noise covers the 3-D space uniformly, there
exist regions that separate well from the attractor and others that are placed near the attractor boundaries.
Moreover, since the attractor is fractal, there exist regions within it that are not visited by the trajectories of the
Lorenz system dynamics. Therefore, the difficulty of distinguishing noise from the Lorenz state vectors is
dependent on the state space region. The peculiarities of this problem are well fitted to the philosophy of the
SNet-SOM. The regions of the state space far from the attractor can be handled effectively with the CP-SOM
classification. The remaining regions (near the attractor or within its fractal structure) are difficult and require
the construction of complex decision boundaries. This is accomplished with the incorporation of effective
supervised models. Specifically, we have experimented with the Radial Basis Functions and the Support Vector
Machines as supervised experts. Finally, the regions of the three-dimensional space that are covered by the
attractor cannot be distinguished since these are the regions where the classes overlap.
The results of the simulation experiments have been
obtained by generating 20,000 training patterns, half of
which are computed from the numerical integration
of the Lorenz system and the other half are samples
obtained from a Gaussian noise generator normalized
within the domain of evolution of the Lorenz attrac-
tor. The testing set, consisting of another 20,000 values, is constructed similarly. The size of the ambiguous
pattern set was near 2000, with the entropy threshold criterion set to 0.2. We have tested on this classification
problem the plain SOM, the SNet-SOM with RBF (SNet-SOM/RBF) and the SNet-SOM with SVM
(SNet-SOM/SVM) supervised experts. The average performances obtained were 79% for the SOM, 81% for the
SNet-SOM/RBF and 82% for the SNet-SOM/SVM.
3.3. Ischemia Detection
The third application of the SNet-SOM concerns the classification of real data obtained from the biomedical
domain. Specifically, the problem of maximizing the performance of the detection of ischemic episodes is
addressed. This is a difficult pattern classification problem [5, 6].

Myocardial ischemia is caused by a lack of oxygen and nutrients to the contractile cells. Frequently, it may lead
to myocardial infarction with its severe consequences of heart failure and arrhythmia that may even lead to
patient death. The ST-T Complex of the ECG represents the time period from the end of the ventricular
depolarization to the end of the corresponding repolarization in the electrical cardiac cycle. Changes in the values
of measured amplitudes, times, and durations on the ST-T complex are used to detect and quantify ischemia
noninvasively from the standard ECG [5, 6, 36, 37]. The ECG is a widespread noninvasive examination that
provides information about the electrical activity of the heart tissue. Different degrees of the severity of ischemic
progression can be described in terms of ECG features [38]. The first stage of ischemia is characterized by a
T-wave amplitude increase without simultaneous ST segment change. As the ischemia extends transmurally
through the myocardium, the intracellular action potential shortens and the injured cells become hyperpolarized.
This hyperpolarization in turn produces an injury current which is reflected at the ECG as a horizontal ST
segment deviation [39]. At the final stage the ischemia is so extensive that the terminal portion of the active
depolarization waveform, represented at the ECG by the QRS complex, is altered [38]. This stage is usually
associated with myocardial necrosis.
We used for our study the ECG signals of the European ST-T Database, which are a set of long-term Holter
recordings provided by eight countries. From the samples composing each beat, a window of 400 milliseconds is
selected (100 samples at the 250 Hz sampling frequency). This signal component forms the input to the Principal
Component Analysis [7, 37] in order to describe most of its content within a few (i.e. five) coefficients. The
original data space undergoes a dimensionality reduction as the feature space is constructed. The term
dimensionality reduction refers to the fact that each 100-dimensional data vector x of the original data space is
represented with a vector of a much smaller dimensionality (i.e. with a 5-dimensional vector), yet most of the
intrinsic information content of the data is retained.
The PCA transformation describes the original vectors (ST-T segments) according to the directions of maximum
variance in the training set. The latter information is obtained by analyzing the data covariance matrix. PCA
selects as basis functions the orthogonal eigenvectors $\mathbf{q}_k$ of the covariance matrix for the signal projection
operation. The corresponding eigenvalues $\lambda_k$ represent the average dispersion of the projection of the input
vectors onto the corresponding eigenvectors (basis functions) $\mathbf{q}_k$. The numerical value of each eigenvalue $\lambda_k$
quantifies the amount of variance that is accounted for by projecting the signal onto the corresponding
eigenvector $\mathbf{q}_k$. Therefore it represents the contribution of the eigenvector's analysis direction to the signal
reconstruction in the mean squared error sense.
Table 3. The results obtained from the CP-SOM of the SNet-SOM as the main classification tool (average classification performance: 73.31%).

Record   Beat classification   Record   Beat classification   Record   Beat classification
number   performance (%)       number   performance (%)       number   performance (%)
E0103    66.5                  E0122    75.1                  E0151    85.5
E0106    65                    E0127    81.9                  E0154    82.7
E0108    64                    E0129    66                    E0159    75
E0111    81.8                  E0139    78                    E0166    79
E0118    63.2                  E0147    71                    E0202    65
Table 4. The results obtained from the SNet-SOM with a Radial Basis Function network as supervised expert (average performance: 76.6%).

Record   Beat classification   Record   Beat classification   Record   Beat classification
number   performance (%)       number   performance (%)       number   performance (%)
E0103    81                    E0122    78.4                  E0151    92.7
E0106    73.4                  E0127    74.2                  E0154    74.6
E0108    65.5                  E0129    69.5                  E0159    69.3
E0111    88.5                  E0139    72.1                  E0166    91.2
E0118    69                    E0147    78.2                  E0202    71
Therefore each eigenvalue represents the contribution of its eigenvector's direction to the signal reconstruction in the mean squared error sense.
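As an illustration of this projection step (not part of the original system description), the following Python/NumPy sketch computes the leading eigenvectors of the covariance matrix and the 5-dimensional projection coefficients; the array name beats and the helper function are hypothetical.

```python
import numpy as np

def pca_project(beats, n_components=5):
    """Project 100-sample ST-T windows onto the leading principal components.

    beats: (N, 100) array, one 400 ms window (100 samples at 250 Hz) per row.
    Returns the (N, n_components) projection coefficients together with the
    eigenvectors q_k and eigenvalues lambda_k of the data covariance matrix.
    """
    centered = beats - beats.mean(axis=0)        # remove the mean waveform
    cov = np.cov(centered, rowvar=False)         # 100 x 100 covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    q = eigvecs[:, order]                        # basis functions q_k
    lam = eigvals[order]                         # variances lambda_k
    coeffs = centered @ q                        # 5-dimensional feature vectors
    return coeffs, q, lam
```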
After obtaining these PCs, a wavelet based denoising technique [40] based on Lipschitz regularization theory is applied, in order to improve the signal-to-noise ratio of the five coefficients. These denoised Principal Components constitute the input space for the SNet-SOM. The utilization of the wavelet denoising at the domain of Principal Components has improved the classification performance.
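For concreteness, a minimal wavelet-shrinkage sketch (Python, using the PyWavelets package) is given below. It applies generic soft thresholding with the universal threshold to one PC time series and is only a simple stand-in for the Lipschitz-regularization technique of [40]; the function name and parameter choices are assumptions made here for illustration.

```python
import numpy as np
import pywt  # PyWavelets

def denoise_pc_series(pc_series, wavelet="db4", level=4):
    """Generic wavelet-shrinkage denoising of one principal-component time series.

    pc_series: 1-D array holding one PC coefficient per beat.
    Soft thresholding with the universal threshold is used; this is not the
    Lipschitz-regularization method of [40], only a simple substitute.
    """
    coeffs = pywt.wavedec(pc_series, wavelet, level=level)
    # Estimate the noise level from the finest-scale detail coefficients.
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745
    thr = sigma * np.sqrt(2.0 * np.log(len(pc_series)))
    denoised = [coeffs[0]] + [pywt.threshold(c, thr, mode="soft") for c in coeffs[1:]]
    return pywt.waverec(denoised, wavelet)[: len(pc_series)]
```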
The training set consists of 15,000 ST-T segment data extracted from about 44,000 beats. This set is constructed by using samples taken from 6 records (different from those used for the testing sets). The training set patterns are selected from the relatively flat regions of the PC time series representations. Also, the two main classes (i.e. normal and ischemic) are represented by an approximately equal number of samples.
The evaluation of the SOM and the SNet-SOM models has been performed on another 15 records out of the 90 records of the European ST-T database. From these records testing sets have been constructed. The whole test set contains principal component projection coefficients from approximately 120,000 ECG beats.
Tables 3-5 present the classification performances for ischemic beat classification.
Table 5. The results obtained from the SNet-SOM with a Support Vector Machine as supervised expert (average performance: 78.8%).

Record   Beat classification   Record   Beat classification   Record   Beat classification
number   performance (%)       number   performance (%)       number   performance (%)
E0103    82.5                  E0122    80.1                  E0151    92.6
E0106    75.6                  E0127    78.2                  E0154    75.1
E0108    65.9                  E0129    72.5                  E0159    76.9
E0111    87.5                  E0139    74.3                  E0166    92.8
E0118    71                    E0147    81.7                  E0202    75.3
The classification performance ratio is a global one: it expresses the ratio of correct classifications to the total ones. The CP-SOM already performs well given that it has an increased size in order to perform the classification directly. The related performances are described in Table 3. These results have been obtained by using a CP-SOM organized as a 10 x 10 lattice of neurons. The performed experiments have illustrated that this size yields the best results for direct classification. Also, with the utilization of the Manhattan distance measure [17] better results are obtained in comparison to the alternative Euclidean measure. Although the CP-SOM is trained with the usual SOM unsupervised training algorithm [17], it has the potential to obtain beat classification accuracy close to those reported in [5, 6, 36] with supervised neural models. Table 4 presents the results for
ischemic beat classification obtained from the SNet-SOM with a Radial Basis Function network as a supervised expert. The training set size, corresponding to the number of training set patterns mapped to ambiguous neurons, is about 2000 and the number of centers is 500. Also, the regularization parameter is 0.1. The CP-SOM consists of a two-dimensional lattice of neurons of size 4 x 4. The average beat classification accuracy of the RBF network as a supervised expert is
76.6%. Table 5 displays the corresponding results with a Support Vector Machine (SVM) as a supervised expert. The training set for the SVM case is the same as for the RBF. The inner-product kernel of the SVM is based on Radial Basis Functions with spread sigma^2 = 8 and a regularization parameter of a = 0.1. The CP-SOM is of the same size (i.e. 4 x 4 lattice). The average beat classification performance has been improved to 78.8%.
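A minimal sketch of how such an SVM expert could be configured with scikit-learn is given below; the stand-in data arrays and the constant C are illustrative assumptions, since the regularization constant a = 0.1 quoted above belongs to a different formulation and does not map directly onto scikit-learn's C parameter.

```python
import numpy as np
from sklearn.svm import SVC

# Stand-in data: in the paper these would be the roughly 2000 five-dimensional
# (denoised PC) training patterns mapped to ambiguous CP-SOM neurons, with
# labels normal (0) / ischemic (1).
rng = np.random.default_rng(0)
X_amb = rng.normal(size=(2000, 5))
y_amb = (X_amb[:, 0] + 0.5 * X_amb[:, 1] > 0).astype(int)

# The RBF kernel spread sigma^2 = 8 maps to gamma = 1 / (2 * sigma^2).
# C = 10.0 is only illustrative and is not taken from the paper.
svm_expert = SVC(kernel="rbf", gamma=1.0 / (2.0 * 8.0), C=10.0)
svm_expert.fit(X_amb, y_amb)

# At recall time only beats routed to ambiguous (high-entropy) CP-SOM neurons
# would be classified by this expert.
print(svm_expert.score(X_amb, y_amb))
```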
The detection of ischemic episodes from the detected ischemic beats follows the approach of [5, 36]: the ECG recordings are broken into groups of ten beats. For each such group the number of normal and abnormal beats is counted. Groups with a number of normal beats larger than the number of abnormal ones (i.e. more than five) are assigned to the normal state (i.e. all the beats of that group are considered normal). In the opposite case the group is considered abnormal. Ischemic episodes are considered as those for which consecutive beat groups are ischemic for at least 30 sec (i.e. three or more consecutive groups are ischemic).
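The grouping rule can be stated procedurally as in the following sketch (Python; the function name and the 0/1 label encoding are assumptions made here for illustration).

```python
import numpy as np

def detect_ischemic_episodes(beat_labels, group_size=10, min_groups=3):
    """Turn per-beat decisions into ischemic episodes, as described above.

    beat_labels: sequence of 0 (normal) / 1 (ischemic) beat classifications.
    Each group of `group_size` beats is labelled by majority vote, and runs of
    at least `min_groups` consecutive ischemic groups (roughly 30 s) are
    reported as episodes, given as (start_group, end_group) index pairs.
    """
    labels = np.asarray(beat_labels)
    n_groups = len(labels) // group_size
    groups = labels[: n_groups * group_size].reshape(n_groups, group_size)
    ischemic_group = groups.sum(axis=1) > group_size // 2   # majority vote

    episodes, run_start = [], None
    for i, flag in enumerate(ischemic_group):
        if flag and run_start is None:
            run_start = i
        elif not flag and run_start is not None:
            if i - run_start >= min_groups:
                episodes.append((run_start, i - 1))
            run_start = None
    if run_start is not None and n_groups - run_start >= min_groups:
        episodes.append((run_start, n_groups - 1))
    return episodes
```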
Correctly detected episodes are termed True Positive (TP) episodes. Missed episodes are termed False Negatives (FN). Also, when a nonischemic episode is detected as ischemic, a False Positive (FP) situation has occurred. The ST Episode Sensitivity is defined as the ratio of the number of detected episodes matching the database annotations to the number of annotated episodes. In terms of the above definitions:
Ischemia Episode Sensitivity = TP / (TP + FN)
Another important index is the ST Episode Predictivity, which is defined as the ratio of the number of correctly detected episodes to the total number of detected episodes, i.e.

Ischemia Episode Predictivity = TP / (TP + FP)
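These two indices are straightforward to compute from the episode counts, as in the following sketch (the numbers shown are illustrative only and are not taken from the paper's evaluation).

```python
def episode_statistics(true_positives, false_negatives, false_positives):
    """Sensitivity and predictivity of episode detection, as defined above."""
    sensitivity = true_positives / (true_positives + false_negatives)
    predictivity = true_positives / (true_positives + false_positives)
    return sensitivity, predictivity

# Example with illustrative counts: 29 annotated episodes detected,
# 6 missed, 7 spurious detections.
print(episode_statistics(29, 6, 7))
```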
Table 6 displays the results of the average ischemia episode detection performance evaluated with the three network types. The second column displays the sensitivity while the third one the predictivity of episode detection. As expected from the beat classification results, the SNet-SOM with SVM as supervised expert yields the best average episode detection performance.
Table 6. The average ischemia episode detection performance evaluated with the corresponding networks (i.e. SOM, SNet-SOM with RBF as supervised expert and SNet-SOM with SVM as supervised expert).

Network type    Ischemia episode sensitivity (%)   Ischemia episode predictivity (%)
SOM             74.9                               73.7
SNet-SOM/RBF    79.5                               77.6
SNet-SOM/SVM    82.8                               81.3
4. Conclusions
This work has proposed a new supervised extension to the Self-Organizing Map (SOM) model [15-17] that is called the Supervised Network Self-Organizing Map (SNet-SOM). This model exploits the ordering potential of the SOM in order to split the global state space into two subspaces. The first subspace corresponds to regions over which the classification task can be performed directly with the unsupervised SOM algorithm. However, for the second subspace complex decision boundaries should be enforced and the generalization performance should be explicitly designed. The SOM algorithm is not appropriate for this task and therefore supervised training networks capable of achieving good generalization performance (i.e. Radial Basis Functions and Support Vector Machines) are used. We have developed the SNet-SOM with Radial Basis Function networks [7, 9] and Support Vector Machines as supervised experts [7, 12]. All these designs construct approximations that involve local fitting to the dynamics of the target function. The locality of these networks fits well with the locality of the subspaces that constitute the ambiguous region. The RBF networks address the issue of regularization in a disciplined mathematical way through the Tikhonov regularization theory [7-9]. The Support Vector Machines have obtained the best discrimination capability for the ambiguous regions (Table 5).
The main objective of using the SNet-SOM for difficult pattern classification tasks is to obtain significant computational benefits in large scale problems. The SNet-SOM utilizes the computationally effective SOM algorithm for resolving most of the regions of the state space, while it uses advanced supervised learning algorithms to confront the difficulties of enforcing complex decision boundaries over regions characterized by class ambiguity (quantified with the entropy criterion). Moreover, without a kind of divide and conquer approach (as the one of the SNet-SOM) it is difficult to approach directly some large problems with nearly optimal models, such as the Support Vector Machines, due to the computational complexity of their numerical solution.
The SNet-SOM is a modular architecture that can be improved along many directions. The utilization of different frameworks for self-organization, such as the Adaptive Subspace Self-Organizing Map (ASSOM) [17] and information theoretic frameworks for self-organization [7, 23], can improve the phase of the state space partitioning. Also, we currently formulate the SOM algorithm within the framework of the Modified Value Difference Metric (MVDM) [41, 42]. This will permit the use of its partitioning potential for coping with complex data mining applications, including symbolic features (e.g. protein secondary structure prediction [4]). All these research efforts on the SNet-SOM follow the general philosophy that the best network architecture depends on the structure of the problem that is confronted. Therefore, for complex problems with irregular state spaces a device capable of effectively integrating multiple architectures, such as the presented SNet-SOM, can perform better than individual architectures.
Acknowledgment
The authors wish to thank the Greek Scholarship Foundation for the financial support of this research with the scholarships for the Ph.D. student L. Vladutu and the postdoctoral researcher S. Papadimitriou.
References
1. C.S. Herrmann, Symbolical reasoning about numerical data: A hybrid approach, Applied Intelligence, vol. 7, pp. 339-354, 1997.
2. H. Liu and R. Setiono, Incremental feature selection, Applied Intelligence, vol. 9, pp. 217-230, 1998.
3. G. Piatetsky-Shapiro, R. Brachman, T. Khabaza, W. Kloesgen, and E. Simoudis, An overview of issues in developing industrial data mining and knowledge discovery applications, in Proceedings, Second International Conference on Knowledge Discovery and Data Mining, AAAI Press: Menlo Park, CA, 1996.
4. B. Rost and S. O'Donoghue, Sisyphus and protein structure prediction, BioInformatics, vol. 13, pp. 345-356, 1997.
5. N. Maglaveras, T. Stamkopoulos, C. Pappas, and M.G. Strintzis, An adaptive backpropagation neural network for real-time ischemia episodes detection: Development and performance analysis using the European ST-T database, IEEE Transactions on Biomedical Engineering, vol. 45, no. 7, 1998.
6. R. Silipo and C. Marchesi, Artificial neural networks for automatic ECG analysis, IEEE Transactions on Signal Processing, vol. 46, no. 5, 1998.
7. S. Haykin, Neural Networks, 2nd edn., Macmillan College Publishing Company: London, 1999.
8. C.M. Bishop, Neural Networks for Pattern Recognition, Clarendon Press: Oxford, 1996.
9. T. Poggio and F. Girosi, Regularization algorithms for learning that are equivalent to multilayer networks, Science, vol. 247, pp. 978-982, 1990.
10. T. Poggio and F. Girosi, Networks for approximation and learning, in Proceedings of the IEEE, vol. 78, pp. 1481-1497, 1990.
11. F. Girosi, An equivalence between sparse approximation and support vector machines, Neural Computation, vol. 10, no. 6, pp. 1455-1480, 1998.
12. V.N. Vapnik, Statistical Learning Theory, Wiley: New York, 1998.
13. T. Joachims, Making large-scale SVM learning practical, in Advances in Kernel Methods - Support Vector Learning, edited by B. Schölkopf, C.J.C. Burges, and A.J. Smola, MIT Press: Cambridge, MA, 1998.
14. B. Kosko, Fuzzy Engineering, Prentice Hall: Upper Saddle River, NJ, 1997.
15. T. Kohonen, The self-organizing map, in Proceedings of the IEEE, vol. 78, pp. 1464-1480, 1990.
16. H. Ritter, T. Martinetz, and K. Schulten, Neural Computation and Self-Organizing Maps, Addison-Wesley: Reading, MA, 1992.
17. T. Kohonen, Self-Organizing Maps, Springer-Verlag: Berlin, 1997.
18. O.L. Mangasarian and D.R. Musicant, Successive overrelaxation for support vector machines, IEEE Transactions on Neural Networks, vol. 10, no. 5, pp. 1032-1037, 1999.
19. E. Osuna, R. Freund, and F. Girosi, An improved training algorithm for support vector machines, in Neural Networks for Signal Processing VII, Proceedings of the 1997 IEEE Workshop, Amelia Island, FL, 1997, pp. 276-285.
20. D.P. Bertsekas, Dynamic Programming and Optimal Control, vols. I and II, Athena Scientific: Belmont, MA, 1995.
21. D. Alahakoon, S.K. Halgamuge, and B. Srinivasan, Dynamic self-organizing maps with controlled growth for knowledge discovery, IEEE Transactions on Neural Networks, vol. 11, no. 3, 2000.
22. R. Silipo, P. Laguna, C. Marchesi, and R.G. Mark, ST-T segment change recognition using artificial neural networks and principal component analysis, Computers in Cardiology, pp. 213-216, 1995.
23. T.-W. Lee, Independent Component Analysis, Theory and Applications, Kluwer Academic Publishers: Dordrecht, 1998.
24. S. Papadimitriou, A. Bezerianos, and A. Bountis, Radial basis function networks as chaotic generators for secure communication systems, International Journal of Bifurcation and Chaos, vol. 9, no. 1, pp. 221-232, 1999.
25. A. Bezerianos, S. Papadimitriou, and D. Alexopoulos, Radial basis function neural networks for the characterization of heart rate variability dynamics, Artificial Intelligence in Medicine, vol. 15, pp. 215-234, 1999.
26. Z. Uykan, C. Guzelis, M.E. Celebi, and H.N. Koivo, Analysis of input-output clustering for determining centers of RBFN, IEEE Transactions on Neural Networks, vol. 11, no. 4, pp. 851-858, 2000.
27. C. Cortes and V. Vapnik, Support vector networks, Machine Learning, vol. 20, pp. 273-297, 1995.
28. P. Bartlett and J. Shawe-Taylor, Generalization performance of support vector machines and other pattern classifiers, in Advances in Kernel Methods - Support Vector Learning, The MIT Press: Cambridge, pp. 43-55, 1999.
29. V.N. Vapnik, Three remarks on the support vector method of function estimation, in Advances in Kernel Methods - Support Vector Learning, The MIT Press: Cambridge, pp. 25-41, 1999.
30. V. Cherkassky, X. Shao, F.M. Mulier, and V.N. Vapnik, Model complexity control for regression using VC generalization bounds, IEEE Trans. on Neural Networks, vol. 10, no. 5, pp. 1075-1089, 1999.
31. V.N. Vapnik, An overview of statistical learning theory, IEEE Trans. on Neural Networks, vol. 10, no. 5, pp. 988-999, 1999.
32. B. Schölkopf, C. Burges, and V. Vapnik, Extracting support data for a given task, in Proceedings, First International Conference on Knowledge Discovery & Data Mining, AAAI Press, 1995, pp. 252-257.
33. J.J. Moré and G. Toraldo, On the solution of large quadratic programming problems with bound constraints, SIAM J. Optimization, vol. 1, no. 1, pp. 93-113, 1991.
34. E. Ott, Chaos in Dynamical Systems, Cambridge University Press: Cambridge, 1993.
35. A.A. Tsonis, Chaos: From Theory to Applications, Plenum Press: New York, 1992.
36. T. Stamkopoulos, K. Diamantaras, N. Maglaveras, and M. Strintzis, ECG analysis using nonlinear PCA neural networks for ischemia detection, IEEE Transactions on Signal Processing, vol. 46, no. 11, 1998.
37. J. Garcia, P. Lander, L. Sörnmo, S. Olmos, G. Wagner, and P. Laguna, Comparative study of local and Karhunen-Loève-based ST-T indexes in recordings from human subjects with induced myocardial ischemia, Computers and Biomedical Research, vol. 31, pp. 271-297, 1998.
38. Y. Birnbaum, S. Sclarovsky, A. Blum, A. Mager, and U. Cabby, Prognostic significance of the initial electrocardiographic pattern in a first acute anterior wall myocardial infarction, Chest, vol. 103, no. 6, pp. 1681-1687, 1993.
39. R.L. Verrier and B.D. Nearing, T wave alternans as a harbinger of ischemia-induced sudden cardiac death, in Cardiac Electrophysiology: From Cell to Bedside, 2nd edn., edited by D. Zipes and J. Jalife, W.B. Saunders Company: Philadelphia, 1995.
40. S. Mallat and Wen Liang Hwang, Singularity detection and processing with wavelets, IEEE Transactions on Information Theory, vol. 38, no. 2, pp. 617-643, 1992.
41. D.R. Wilson and T.R. Martinez, Improved heterogeneous distance functions, Journal of Artificial Intelligence Research, vol. 6, pp. 1-34, 1997.
42. S. Cost and S. Salzberg, A weighted nearest neighbor algorithm for learning with symbolic features, Machine Learning, vol. 10, pp. 57-78, 1993.
Stergios Papadimitriou received the Dipl. Eng. degree and the Ph.D. degree from the Computer Engineering and Informatics Department, University of Patras, Greece, in 1990 and 1996, respectively. He worked for five years as a Research Assistant at the Institute of Computer Technology, Patras, Greece. His main research interests are neuro-fuzzy computing, chaotic dynamics, chaotic encryption, computational and artificial intelligence, support vector learning and recurrent neural-network architectures. He is now a Senior Researcher at the Biomedical Signal Processing Laboratory of the Medical Physics Department and at the Artificial Intelligence Laboratory of the Computer Engineering Department of the University of Patras.
Seferina Mavroudi received the Dipl. Eng. degree in electrical engineering from the Department of Electrical and Computer Engineering of the Aristotle University of Thessaloniki, Greece, in 1998. She participated in an inter-departmental post-graduate program, where she received the Master's degree in biomedical engineering from the Medical School of the University of Patras, the Department of Electrical and Computer Engineering and the Department of Mechanical Engineering of the National Technical University of Athens, Greece, in 2000. She is now working as a Ph.D. Fellow at the Biomedical Signal Processing Laboratory of the Medical Physics Department of the University of Patras. Her main research interests include neural networks, neuro-fuzzy architectures, artificial intelligence and nonlinear dynamics, mainly for biomedical applications.
Liviu Vladutu (S'01) received the B.S. degree (Hons.) in automation and computer science in 1987 from Craiova University, Romania, and the M.S. degree in biomedical engineering from the University of Patras, Greece, in 1999. He is currently working as a Ph.D. Fellow at the Department of Medical Physics, University of Patras, Greece. His current work involves the application of computational intelligence methods for biomedical signal processing. His other interests include statistical learning theory and multiresolution analysis.
Prof. G. Pavlides has been working in several universities in Greece and abroad for about 20 years. He has also been working as a senior consultant in the private industry and financial institutions. He is the director of the Information Systems and Artificial Intelligence Laboratory of the University of Patras. His areas of expertise and interests include Data Warehousing, Encryption Algorithms, Security, Workflow systems, OLTP, OLCP, and OLAP. He is a member of international professional organisations, such as IEEE and ACM. He has been the leader in several projects, some of which are funded by the European Union, as well as large-scale software engineering projects in Information Systems. They include prototyping environments, document filing and retrieval systems, hierarchical data replication techniques, etc.
Anastasios Bezerianos (M'97) was born in Patras, Greece, in 1953.
He received B.Sc. in physics from Patras University in 1976, the
M.Sc. degree in telecommunications and electronics from Athens
University, and the Ph.D. degree in Medical Physics from Patras
University. He is currently Associate Professor in Medical School
of Patras University. His main research interests are concentrated
in biomedical signal processing and medical image processing, as
follows: 1) data acquisition and on line processing using digital signal
processors, 2) nonlinear time series analysis of the electrocardiogram, 3)
wavelet analysis of high resolution ECG, 4) modeling of heart muscle
and heart rate variability, and 5) wavelet analysis of medical images.