H(m) = −Σ_{k=1}^{N_c} p_k log p_k    (2)

where N_c denotes the number of classes and p_k = V_k / V_total is
the ratio of votes V_k for class k to the total number of votes
V_total cast for neuron m. Clearly, the entropy is zero for
unambiguous neurons and increases as the uncertainty about the
class label of the neuron increases. The upper bound of H(m) is
log N_c,
and corresponds to the situation where all classes are
equiprobable (i.e. the voting mechanism does not fa-
vor a particular class). Therefore, with these voting
schemes the regions of the SOM that are placed at
ambiguous regions of the state space can be readily
identified. For these regions the supervised expert is
designed and optimized for obtaining adequate gener-
alization performance.
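The entropy test for ambiguous neurons can be sketched as follows; the vote counts and the threshold value used below are illustrative assumptions, not values from the text:

```python
import math

def neuron_entropy(votes):
    """Entropy H(m) of a neuron's class votes: H = -sum_k p_k log p_k."""
    total = sum(votes)
    h = 0.0
    for v in votes:
        if v > 0:
            p = v / total
            h -= p * math.log(p)
    return h

def is_ambiguous(votes, threshold):
    """Flag a neuron whose vote entropy exceeds the given threshold."""
    return neuron_entropy(votes) > threshold

# An unambiguous neuron (all votes for one class) has zero entropy;
# equiprobable votes reach the upper bound log(N_c).
print(neuron_entropy([10, 0, 0]))    # 0.0
print(neuron_entropy([5, 5, 5]))     # log(3) ~= 1.0986
print(is_ambiguous([4, 3, 3], 0.2))  # True
```

Neurons flagged this way delimit the ambiguous regions that are handed to the supervised expert.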
The computation time for each iteration of the adap-
tation phase of the SOM algorithm (which is the most
time consuming) scales almost linearly with the size of the
training set. As a general rule, the number of iterations
performed in the adaptation phase of the CP-SOM is much smaller
than in the corresponding phase of the standard SOM algorithm.
Specifically, we have determined experimentally that a good
initial value of the learning rate is η(k) = 0.2, and letting it
decrease in 10 steps to 0.02 yields good results (thus the factor
for decreasing the learning rate is exp(log(0.1)/10) = 0.7943).
The CP-SOM only separates the state space without extensive
fine-tuning. However, the required number of neurons de-
pends mainly on the number of classes that are to be
separated (i.e. is independent of the training set size).
Since the size of CP-SOM is small, and the number
of training epochs (i.e. map expansions) is also very
small (3 to 5 map expansions are usually sufficient) the
unsupervised phase of the SNet-SOM algorithm scales
linearly with the size of the pattern set (although with
a large scaling factor).
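The learning rate schedule quoted above (0.2 decaying to 0.02 in 10 steps) is a geometric decay, and can be reproduced as follows; the function name and parameter defaults are illustrative:

```python
import math

def learning_rate_schedule(eta0=0.2, eta_final=0.02, steps=10):
    """Geometric decay from eta0 to eta_final in `steps` steps.
    The per-step factor is exp(log(eta_final/eta0)/steps), which is
    approximately 0.7943 for the values quoted in the text."""
    factor = math.exp(math.log(eta_final / eta0) / steps)
    eta = eta0
    rates = [eta]
    for _ in range(steps):
        eta *= factor
        rates.append(eta)
    return factor, rates

factor, rates = learning_rate_schedule()
print(round(factor, 4))     # 0.7943
print(round(rates[-1], 4))  # 0.02
```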
190 Papadimitriou et al.
In contrast, supervised learning approaches such as the Radial
Basis Function networks have computational demands, in terms of
memory and processing resources, that scale roughly cubically
with the size of the training set. In practice, it was infeasible
in our computing environment to train Radial Basis Function
networks of a size beyond 2000 × 500, where 2000 is the training
set size and 500 the number of hidden units (i.e. RBF centers).
Therefore, for the applications considered, with more than 9000
patterns in the training set, it is very difficult to train an
RBF network effectively to accomplish the classification task
directly.
2.2. The Supervised Expert Network
The supervised expert network has the task of discrim-
inating over the state space regions where the class de-
cision boundaries are complex. The supervised expert
network should be of a local approximation type and it should
incorporate a formalism in its design for obtaining adequate
generalization performance. Appropriate neural network models
that fulfill these requirements are the Radial Basis Function
(RBF) networks and the Support Vector Machines (SVM). The SOM
prototype vectors create piecewise linear class boundaries [17]
that are usually not effective for resolving the class ambiguity
over all the regions of the state space. Moreover, even if the
learning procedure enlarges the SOM adaptively until each of its
neurons represents a single class unambiguously, this solution
addresses only the minimization of the training error and ignores
the generalization performance. On account of this, in the
absence of a formal setting for designing for generalization, the
decision boundaries that the SOM constructs for the ambiguous
regions cannot be expected to discriminate new patterns well. On
the contrary, a Support Vector Machine implementation of the
supervised expert offers the potential to construct near-perfect
decision boundaries. For example, the results discussed in [7]
illustrate close to optimal separation for a classification
problem involving overlapping Gaussian distributions. Below we
discuss the implementation of the supervised expert with a Radial
Basis Function network and with a Support Vector Machine,
emphasizing the latter choice, which provides better performance
and a disciplined design.
2.2.1. Radial Basis Function Supervising Expert.
The Radial Basis Function networks have been used successfully in
many applications [24, 25]. They exploit Tikhonov's
regularization theory for obtaining generalization performance,
by which they obtain a tradeoff between a term that measures the
fitness of the solution to the training set and one that
evaluates the smoothness of the solution. Denoting by x_i, d_i,
F(x_i) the input vectors, the desired responses and the
corresponding realizations of the network respectively, this
tradeoff can be formulated with a cost function as [7, 9]:

C(F) = C_s(F) + λ C_r(F)    (3)

where,

C_s(F) = (1/2) Σ_{i=1}^{l} [d_i − F(x_i)]²    (4)

C_r(F) = (1/2) ‖DF‖²    (5)
Here C_s(F) is the standard error term that accounts for the
fitting to the training set in the least-squares sense, λ is a
positive real number called the regularization parameter, and
C_r(F) is the regularization term that favors the smoothness of
the solution. The latter term is the most important from the
point of view of generalization performance. The operator D is
called a stabilizer because it stabilizes the solution by
enforcing smoothness. In turn, a smooth solution is significantly
more robust to erroneous examples in the training set.
Nonetheless, the design of proper generalization performance for
RBF networks still remains a difficult and complex issue that
involves heuristic criteria for the selection of the centers and
of their parameters [26].
The SNet-SOM with an RBF network as supervised expert has been
configured to grow adaptively until about 2000 patterns (i.e.
SupervisedExpertMaxPatterns = 2000) map onto the ambiguous
neurons. This size of the training set is appropriate for an RBF
solution by means of a numerically effective approach.
Specifically, m = 500 fixed centers are selected at random from
the training patterns. Their spread σ is common and is computed
according to the empirical formula σ = d_max/√(2m), where d_max
is the maximum distance between the chosen centers, which for the
normalized 5-dimensional input vectors is √5. The empirical risk
of a classifier f on the training set is measured as

R_emp[f] = (1/l) Σ_{i=1}^{l} (1/2) |f(x_i) − y_i|    (6)
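The fixed-center selection and the empirical spread formula σ = d_max/√(2m) can be sketched as follows; the stand-in data set and the reduced sizes are illustrative assumptions, not the paper's configuration:

```python
import math
import random

def choose_centers_and_spread(patterns, m):
    """Select m fixed centers at random and compute the common spread
    sigma = d_max / sqrt(2*m), where d_max is the maximum pairwise
    distance between the chosen centers."""
    centers = random.sample(patterns, m)
    d_max = max(
        math.dist(a, b) for i, a in enumerate(centers) for b in centers[i + 1:]
    )
    return centers, d_max / math.sqrt(2 * m)

# Illustrative stand-in data: normalized 5-dimensional input vectors.
random.seed(0)
data = [tuple(random.random() for _ in range(5)) for _ in range(200)]
centers, sigma = choose_centers_and_spread(data, m=50)
print(len(centers), round(sigma, 4))
```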
The Support Vector Machine model of machine learning establishes
formal bounds on the generalization error [28, 29]. These results
allow one to bound the generalization error for a chosen
significance level. They relate the number of examples, the
training set error and the complexity of the hypothesis space to
the generalization error [30].
Vapnik–Chervonenkis (VC) theory [12] shows that it is imperative
to restrict the class of functions from which f is chosen to one
that has a capacity suitable for the amount of the available
training data. The developed theory provides bounds on the test
error. The inductive principle of structural risk minimization
[29] introduces a systematic method to minimize these bounds by
considering both the empirical risk and the capacity of the
function class, which is accounted for by the Vapnik–Chervonenkis
(VC) dimension h. For binary classification, h is the maximal
number of points that can be separated into two classes in all
possible 2^h ways by the functions that the learning machine can
implement. An important VC bound is the one outlined by the
following theorem (Theorem 1), which provides a bound on the rate
of uniform convergence of the training error to the
classification error, for a set of classification functions with
VC dimension h [31].
Theorem 1. If h < l is the VC dimension of the class of functions
that the learning machine can implement, then for all functions
of that class, with probability of at least 1 − p, the bound

R[f] ≤ R_emp[f] + Φ(h/l, log(p)/l)

holds, where the confidence term Φ is defined as

Φ(h/l, log(p)/l) = √( ( h (log(2l/h) + 1) − log(p/4) ) / l )
This theorem states clearly the dependence of the gen-
eralization error R[ f ] on the VC dimension parameter
h. The Support Vector Machine is perhaps the only
model where this important parameter can be explic-
itly controlled. This is done by enforcing maximum
separation between the patterns of different classes
with the construction of the optimal linear separating
hyperplane, usually in a very high-dimensional feature
space where the input data are mapped by means of a
kernel function.
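The confidence term of Theorem 1 is straightforward to evaluate numerically; the following sketch, with arbitrary illustrative values of h, l and p, shows how the bound loosens as h grows and tightens as l grows:

```python
import math

def vc_confidence(h, l, p):
    """Confidence term of the VC bound (Theorem 1):
    sqrt( (h*(log(2l/h) + 1) - log(p/4)) / l )."""
    return math.sqrt((h * (math.log(2 * l / h) + 1) - math.log(p / 4)) / l)

# More examples (larger l) tighten the bound; a larger VC dimension
# h loosens it.
print(round(vc_confidence(h=10, l=1000, p=0.05), 3))
print(round(vc_confidence(h=10, l=10000, p=0.05), 3))
print(round(vc_confidence(h=50, l=1000, p=0.05), 3))
```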
The key concepts of the Support Vector Machine approach to the
implementation of the principle of structural risk minimization
are briefly presented below. Details
can be found in [12]. This material serves also as the
theoretical basis for the comprehension of the results
of the SVM supervised expert implementation.
The Support Vector (SV) algorithm implements Structural Risk
Minimization based on a structure of separating hyperplanes
imposed on a dot product space X. For a set of pattern vectors
x_1, …, x_l ∈ X, these hyperplanes can be written as
{x ∈ X : w · x + b = 0}, where w is an adjustable weight vector
and b is a bias. In order to enforce the uniqueness of the
hyperplane we require

min_{i=1,…,l} |w · x_i + b| = 1    (7)

i.e. the data point closest to the hyperplane has a distance of
1/‖w‖. This distance is referred to as the margin of separation.
Hyperplanes constrained by (7) are termed canonical hyperplanes.
The following important theorem provides a rigorous way of
controlling the SVM generalization performance through the
computation of the appropriate weight vector w for the
maximization of the margin [12, 31].

Theorem 2. Let the l training set vectors x_1, x_2, …, x_l ∈ X
belong to a sphere of radius R and center a, i.e.
B_R(a) = {x ∈ X : ‖x − a‖ < R}, a ∈ X. Also, let
f_{w,b} = sgn(w · x + b) be canonical hyperplane decision
functions, defined on these points. Then the set
{f_{w,b} : ‖w‖ ≤ A} has a VC-dimension h satisfying

h ≤ min(R² A², n) + 1    (8)
Theorem 2 states that control over the VC dimension (i.e.
complexity) of the optimal hyperplane can be exercised
independently of the dimensionality n of the input space, by
properly choosing the margin of separation ρ = 1/‖w‖ ≥ 1/A.
Applying the framework of structural risk minimization for linear
machines, a set of separating hyperplanes of varying VC dimension
is constructed such that the decrease of the VC dimension occurs
at the expense of the smallest possible increase in training
error. The SVM imposes a structure on the set of separating
hyperplanes by constraining the Euclidean norm of the weight
vector w, in order to minimize the VC dimension of the learning
machine, according to Theorem 2.
Suppose we are given a set of examples (x_1, y_1), …, (x_l, y_l),
x_i ∈ X, y_i ∈ {±1}, and we assume that the two classes of the
classification problem are linearly separable. In this case, we
can find an optimal weight vector w_0 such that ‖w_0‖² is minimum
(in order to maximize the margin ρ = 1/‖w‖ of Theorem 2) and
y_i (w_0 · x_i + b) ≥ 1.
The support vectors are those training examples that satisfy the
equality, i.e. y_i (w_0 · x_i + b) = 1. The support vectors
define two hyperplanes. The one hyperplane goes through the
support vectors of one class and the other through the support
vectors of the other class. The distance between the two
hyperplanes defines the margin of separation, which is maximized
when the norm ‖w_0‖ of the weight vector is minimum. This
minimization can proceed by maximizing the following function
with respect to the variables α_i (Lagrange multipliers) [12]:

W(α) = Σ_{i=1}^{l} α_i − (1/2) Σ_{i=1}^{l} Σ_{j=1}^{l} α_i α_j (x_i · x_j) y_i y_j    (9)

subject to the constraint 0 ≤ α_i. If α_i > 0 then x_i
corresponds to a support vector.
The classification of an unknown vector x is obtained by
computing

F(x) = sgn(w_0 · x + b),  where  w_0 = Σ_{i=1}^{l} α_i y_i x_i

and the sum accounts for only the N_s ≤ l nonzero support vectors
(i.e. training set vectors x_i whose α_i are nonzero). Clearly,
after the training, the classification can be accomplished
efficiently by taking the dot product of the optimum weight
vector w_0 with the input vector x.
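The construction of w_0 from the multipliers and the resulting classifier can be sketched as follows. The two-point problem and its hand-computed multipliers (α_1 = α_2 = 0.25) are illustrative; in practice the α_i come from maximizing (9):

```python
def linear_svm_classifier(support_vectors, labels, alphas, b):
    """Build F(x) = sgn(w0 . x + b) with w0 = sum_i alpha_i y_i x_i,
    the sum running only over the support vectors."""
    dim = len(support_vectors[0])
    w0 = [0.0] * dim
    for x, y, a in zip(support_vectors, labels, alphas):
        for j in range(dim):
            w0[j] += a * y * x[j]

    def F(x):
        s = sum(wj * xj for wj, xj in zip(w0, x)) + b
        return 1 if s >= 0 else -1

    return w0, F

# Two support vectors whose multipliers are known analytically:
# x1 = (1, 1), y1 = +1 and x2 = (-1, -1), y2 = -1 give
# w0 = (0.5, 0.5), b = 0 and alpha1 = alpha2 = 0.25.
w0, F = linear_svm_classifier([(1, 1), (-1, -1)], [1, -1], [0.25, 0.25], b=0.0)
print(w0)           # [0.5, 0.5]
print(F((2, 3)))    # 1
print(F((-1, -2)))  # -1
```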
The case where the data is not linearly separable is handled by
introducing slack variables (ξ_1, ξ_2, …, ξ_l) with ξ_i ≥ 0 [27]
such that

y_i (w · x_i + b) ≥ 1 − ξ_i,  i = 1, …, l.

The introduction of the variables ξ_i allows misclassified
points, which have their corresponding ξ_i > 1. Thus,
Σ_{i=1}^{l} ξ_i is an upper bound on the number of training
errors. The corresponding generalization of the concept of the
optimal separating hyperplane is obtained by the solution of the
following optimization problem:

minimize    (1/2) w · w + C Σ_{i=1}^{l} ξ_i    (10)

subject to: y_i (w · x_i + b) ≥ 1 − ξ_i
and         ξ_i ≥ 0,  i = 1, …, l    (11)
Supervised Network Self-Organizing Map 193
The control of the learning capacity is achieved by the
minimization of the first term of (10), while the purpose of the
second term is to penalize misclassification errors. The
parameter C is a kind of regularization parameter that controls
the tradeoff between learning capacity and training set errors.
Clearly, a large C corresponds to assigning a higher penalty to
errors.
Finally, the case of nonlinear Support Vector Machines should be
considered. The input data in this case are mapped into a
high-dimensional feature space through some nonlinear mapping Φ
chosen a priori [12]. The optimal separating hyperplane is then
constructed in this space. The corresponding optimization problem
is obtained from (9) by substituting x by its mapping z = Φ(x) in
the feature space:

maximize W(α) = Σ_{i=1}^{l} α_i − (1/2) Σ_{i=1}^{l} Σ_{j=1}^{l} α_i α_j (Φ(x_i) · Φ(x_j)) y_i y_j    (12)
Also, the constraint 0 ≤ α_i becomes 0 ≤ α_i ≤ C (assuming the
nonseparable case). When it is possible to derive a proper kernel
function K such that K(x_i, x_j) = Φ(x_i) · Φ(x_j), the mapping Φ
is not explicitly used. Conversely, given a symmetric positive
kernel K(x, y), Mercer's theorem [12] states that there exists a
mapping Φ such that K(x, y) = Φ(x) · Φ(y). By designing a kernel
K that satisfies Mercer's condition, the training algorithm is
reformulated to the following optimization:
W(α) = Σ_{i=1}^{l} α_i − (1/2) Σ_{i=1}^{l} Σ_{j=1}^{l} α_i α_j K(x_i, x_j) y_i y_j    (13)

with the constraint 0 ≤ α_i ≤ C, and the decision function
becomes

F(x) = sgn( Σ_{i=1}^{l} α_i y_i K(x_i, x) + b )
With different expressions for the inner product K(x_i, x) we can
construct different learning machines with arbitrary types of
decision surfaces (nonlinear in input space). The best known
kernel types are the polynomial and the radial basis. Polynomial
kernels specify polynomials of any fixed order d for the inner
product in the corresponding feature space, i.e.

K(x_i, x) = ((x_i · x) + 1)^d    (14)
Radial Basis Function (RBF) kernels construct decision functions
of the form:

F(x) = sgn( Σ_{i=1}^{l} α_i y_i exp(−‖x − x_i‖² / (2σ²)) + b ),

with a kernel of the type:

K(x_i, x) = exp(−‖x − x_i‖² / (2σ²)).

In the RBF case, the SVM training algorithm determines the
centers (support vectors) x_i, the corresponding weights α_i and
the threshold b. Another kernel of the RBF type is the Laplacian
RBF that uses

K(x_i, x) = exp(−γ ‖x − x_i‖).
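As a minimal sketch of the kernel formulation, the objective of (13) can be maximized with projected gradient ascent over the box 0 ≤ α_i ≤ C. This toy solver is an illustration, not the training algorithm used in the paper; it fixes the bias b to 0 so that the equality constraint that normally accompanies the bias can be dropped:

```python
import math

def rbf_kernel(a, b, sigma=1.0):
    """Gaussian RBF kernel K(a, b) = exp(-||a - b||^2 / (2 sigma^2))."""
    d2 = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-d2 / (2 * sigma ** 2))

def train_dual(X, y, C=10.0, kernel=rbf_kernel, lr=0.01, iters=2000):
    """Maximize W(alpha) of Eq. (13) by projected gradient ascent,
    clipping each alpha_i to the box [0, C]. The bias b is fixed to
    0, so the usual constraint sum_i alpha_i y_i = 0 is dropped."""
    l = len(X)
    K = [[kernel(X[i], X[j]) for j in range(l)] for i in range(l)]
    alpha = [0.0] * l
    for _ in range(iters):
        for i in range(l):
            grad = 1.0 - y[i] * sum(alpha[j] * y[j] * K[i][j] for j in range(l))
            alpha[i] = min(C, max(0.0, alpha[i] + lr * grad))

    def F(x):
        s = sum(alpha[i] * y[i] * kernel(X[i], x) for i in range(l))
        return 1 if s >= 0 else -1

    return alpha, F

# Toy XOR-like problem that is not linearly separable in input space.
X = [(0, 0), (1, 1), (0, 1), (1, 0)]
y = [1, 1, -1, -1]
alpha, F = train_dual(X, y)
print([F(x) for x in X])  # recovers the training labels [1, 1, -1, -1]
```

The RBF kernel lets the machine carve nonlinear decision boundaries in the input space, which is exactly why the problem above becomes separable.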
The SVM is perhaps the only model which permits disciplined model
order selection for the optimization of the generalization
performance. This makes it highly suited for the implementation
of the supervising expert in the context of the SNet-SOM.
In order to formulate a means for obtaining the best possible
generalization by controlling characteristics of the learning
machine, we utilize the ideas of [32] in the context of
determining the kernel degree that yields the best generalization
from the training data transferred to the supervising expert. We
work with a polynomial type of kernel, i.e. a kernel of type
(14). Following the results of [12, pp. 428–430] and [32], the
VC-dimension can be estimated as

h ≈ c_1 h_est,  h_est = R² ‖w‖²

with c_1 < 1 independent of the kernel.
Thus, in order to compute h_est, we need to compute R, the radius
of the smallest sphere enclosing the training data in feature
space. This task is formulated with the following quadratic
programming problem:

minimize R²   subject to   ‖z_i − z*‖² ≤ R²    (15)

where z_i = Φ(x_i) and z* is the center of the sphere.

Y = f(A_1, …, A_N) = Σ_i f_i(A_i)    (17)
In the simulations, most of the functions f_i are not implemented
explicitly with a mathematical formula; rather, their
input/output relationship is tabulated in the form of an
input/output mapping. This choice is made to mimic the
difficulties of modeling real-world problems with closed-form
functions.
The training set is derived by inducing noise both in the
observation variables A_i, i = 1, …, N, and in the observed
outcome variable Y. This is in accordance with the fact that in
data mining problems both the observed variables and the outcomes
are usually subject to inaccuracies in practice. Therefore, the
construction of the training set takes the following steps:
1. Random generation of some proper values V_i, i = 1, …, N, for
   the attributes (i.e. for the input variables) A_i,
   i = 1, …, N.
2. Induction of observation noise into these attribute values.
   Denote by V'_i, i = 1, …, N, the corresponding values that
   model disturbed observations of the attributes.
3. Computation of the values of the outcome variables. The case
   of one outcome variable Y is considered for simplicity (more
   outcome variables can be handled similarly). The value O of
   the outcome variable Y is computed as
   O = f(V_1, V_2, …, V_N) = Σ_i f_i(V_i), i.e. it is expressed
   as a function of the correct values of the attributes and not
   the observed ones. The observed disturbed attribute values
   V'_i have not been used, since the observation inaccuracy does
   not affect the operation (and therefore the outcomes) of the
   system.
4. Induction of observation noise into the outcome variables. For
   the outcome variable O, denote by O' the measured inaccurate
   outcome. Therefore, one training set sample consists of the
   values V'_1, V'_2, …, V'_N, O'.
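The four construction steps above can be sketched as follows; the attribute ranges, the Gaussian noise model, the threshold-at-zero class labeling and the example dependencies f_i are all illustrative assumptions:

```python
import random

def make_training_set(n_samples, n_attrs, funcs, attr_noise, out_noise, seed=0):
    """Generate (noisy attributes, noisy class label) pairs following
    the four steps above. `funcs` are the hidden per-attribute
    dependencies f_i; the outcome O = sum_i f_i(V_i) is computed from
    the CLEAN attribute values, then noise is added to both sides."""
    rng = random.Random(seed)
    clean, noisy = [], []
    for _ in range(n_samples):
        V = [rng.uniform(-1, 1) for _ in range(n_attrs)]           # step 1
        V_obs = [v + rng.gauss(0, attr_noise) for v in V]          # step 2
        O = sum(f(v) for f, v in zip(funcs, V))                    # step 3
        O_obs = O + rng.gauss(0, out_noise)                        # step 4
        clean.append((V, 1 if O >= 0 else -1))
        noisy.append((V_obs, 1 if O_obs >= 0 else -1))
    return clean, noisy

# Hypothetical hidden dependencies f_i.
funcs = [lambda v: v, lambda v: v ** 3, lambda v: 2 * v]
clean, noisy = make_training_set(1000, 3, funcs, attr_noise=0.1, out_noise=0.2)
print(len(noisy), noisy[0][1] in (1, -1))
```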
The clean sample V_1, V_2, …, V_N, O serves to estimate the noise
level of the training set. That is, for every pattern
V_1, V_2, …, V_N, the closest noisy pattern V'_1, V'_2, …, V'_N
in the training set is detected and their class labels are
compared. Thereby, an estimate of the training set
misclassification ratio due to observation noise is obtained. The
regularization networks are expected not only to uncover the
hidden dependencies, and thereby be able to predict the outcome
variable from the input observations, but also to reduce the
misclassification ratio on new testing patterns (testing set
misclassification ratio). This comes as a consequence of their
capability to regularize the input/output mapping by capturing
the smooth dependence of the outcomes on the attributes.
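The closest-noisy-pattern comparison that estimates the noise level can be sketched as a nearest-neighbor label check; the tiny data set below is purely illustrative:

```python
import math

def noise_misclassification_ratio(clean, noisy):
    """For every clean pattern, find the closest noisy pattern in the
    training set and compare class labels; the fraction of mismatches
    estimates the misclassification ratio due to observation noise."""
    mismatches = 0
    for v, label in clean:
        _, nearest_label = min(noisy, key=lambda p: math.dist(p[0], v))
        if nearest_label != label:
            mismatches += 1
    return mismatches / len(clean)

# With consistent labels the estimated ratio is zero.
clean = [([0.0, 0.0], 1), ([1.0, 1.0], -1)]
noisy = [([0.1, 0.0], 1), ([0.9, 1.0], -1)]
print(noise_misclassification_ratio(clean, noisy))  # 0.0
```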
Thereafter, random testing patterns are generated, their
classifications are computed (according to the hidden
dependencies) and a similar smallest-distance classification
procedure over the training set is followed. The resulting
misclassification ratio is referred to as the raw training set
misclassification ratio. This parameter serves as a reference for
the estimation of the generalization performance that the
networks obtain from their robust regularization framework. Under
the proper conditions, the regularization networks are able not
only to uncover the laws that govern the dynamics of the physical
process but also to improve substantially on the raw training set
misclassification ratio. This fact will become evident from the
results of the simulation experiments of Section 5. The most
prominent factors for their successful application are a
sufficient size of the training set and the sufficiency of the
set of approximation functions realizable by the network to match
the complexity of the environment from which the training data
are obtained. The learning capacity of the machine can be
incorporated within the mathematical framework of the
Vapnik–Chervonenkis (VC) dimension [7, 12].
Table 2 presents some illustrative results from the simulation
experiments. The first column is the number of patterns in the
(disturbed) training set. The second column quantifies the level
of noise induced in the training set using the methodology
described in Section 2 to perturb both the attribute values and
the related outcomes. The next four columns list the
misclassification ratio for the corresponding neural networks
that are evaluated. Specifically, the third column displays the
classification performance of the plain unsupervised SOM model
and the fourth incorporates the LVQ supervised fine-tuning. It
becomes evident that
Table 2. Some indicative results from the evaluation of the SOM,
SOM/LVQ, RBF and SNet-SOM designs for the problem of extracting
information from a noisy training set. For the RBF case, the
numbers in brackets denote the number of training patterns and
the number of RBF centers respectively.

Training  Raw training set               Testing set misclassification ratio
set size  misclassification ratio    SOM (%)  SOM with  RBF (σ = 0.095,      SNet-SOM
          (disturbance) (%)                   LVQ (%)   λ = 0.1) (%)         (4 × 4) (%)
1000      10.5                       9.1      8.7       7.2   [1000 × 500]   7.02
1000      15.4                       14.3     12.9      10.1  [1000 × 500]   9.35
1000      18.5                       16.9     15.5      13.2  [1000 × 500]   13.1
1000      22.9                       19.4     18.7      16.1  [1000 × 500]   16.1
1000      29.7                       24.7     23.1      21.2  [1000 × 500]   20.2
2000      12.3                       10.1     8.4       7.1   [2000 × 400]   7.2
2000      18.2                       12.3     9.9       9.8   [2000 × 50]    9.3
2000      26.7                       15.7     13.2      12.1  [2000 × 200]   11.3
2000      34.6                       17.5     14.2      12.9  [2000 × 200]   9.8
5000      14.3                       11.5     9.3       9.4   [5000 × 100]   8.7
5000      17.8                       12.1     9.6       10.18 [5000 × 100]   10.5
5000      28.9                       14.9     13.2      14.2  [5000 × 100]   13.1
5000      38.66                      16.5     14        14.8  [5000 × 100]   13.2
15,000    12.4                       11.2     11.1      (*)                  10.7
15,000    18.9                       13.3     12.9      (*)                  11.1
15,000    27.9                       14.8     14.3      (*)                  13.3
15,000    40.2                       17.8     17.2      (*)                  15.9
30,000    11.5                       10.1     9.8       (*)                  9.0
30,000    19.8                       13.9     13.2      (*)                  12.2
30,000    29.8                       15.2     14.9      (*)                  13.7
30,000    35.5                       16.9     16.7      (*)                  15.1

(*) Numerical solution of the RBF net not achievable for this
size of the training set.
LVQ succeeds in offering a reduction in the testing set
misclassification ratio. Further improvement (at least for
relatively small training sets) is obtained with the RBF
networks. These results are contained in the fifth column of
Table 2. The brackets give the dimensionality of the RBF network
in terms of the number of training set patterns and the number of
RBF centers.
The RBF networks incorporate regularization in their design.
Therefore, we expect them to achieve superior performance over
designs that do not explicitly incorporate regularization
techniques. Indeed, the simulation experiments yielded superior
performance for relatively small training sets (i.e. of about
1000–2000 training patterns). However, the RBF designs do not
scale well, and therefore for large training sets the SOM based
schemes have demonstrated better performance. Indeed, for the
larger training sets (i.e. 15,000 and 30,000 patterns) it was
impossible to train an RBF network at all.
The SNet-SOM consisted of a 4 × 4 CP-SOM trained with the
Manhattan type of distance. As supervised expert an RBF network
has been selected. It can be observed that the SNet-SOM results
outperform those obtained with the single net types in terms of
the reduction of the misclassification ratio.
3.2. Distinction of Chaos from Noise
The second application of the SNet-SOM aims at the problem of
separating noise from chaos. Specifically, the Lorenz chaotic
system has been used to generate a chaotic trajectory. This
trajectory evolves over a chaotic attractor lying in the
three-dimensional space. The Lorenz chaotic system is described
according to [34]:

dx(t)/dt = σ (y(t) − x(t))
dy(t)/dt = −x(t) z(t) + r x(t) − y(t)
dz(t)/dt = x(t) y(t) − b z(t)

where σ = 10, b = 8/3, r = 28 is a typical configuration of the
parameters.
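A trajectory on the Lorenz attractor with the quoted parameters can be generated, for instance, with a standard fourth-order Runge–Kutta integrator; the step size, initial state and transient length below are illustrative choices, not values from the text:

```python
def lorenz_trajectory(n_steps, dt=0.01, sigma=10.0, b=8.0 / 3.0, r=28.0,
                      x0=(1.0, 1.0, 1.0), transient=1000):
    """Integrate the Lorenz system with a simple 4th-order Runge-Kutta
    scheme and return state vectors on the attractor (the first
    `transient` steps are discarded)."""
    def deriv(s):
        x, y, z = s
        return (sigma * (y - x), -x * z + r * x - y, x * y - b * z)

    def rk4(s):
        k1 = deriv(s)
        k2 = deriv(tuple(si + 0.5 * dt * ki for si, ki in zip(s, k1)))
        k3 = deriv(tuple(si + 0.5 * dt * ki for si, ki in zip(s, k2)))
        k4 = deriv(tuple(si + dt * ki for si, ki in zip(s, k3)))
        return tuple(si + dt * (a + 2 * b2 + 2 * c + d) / 6
                     for si, a, b2, c, d in zip(s, k1, k2, k3, k4))

    s = x0
    out = []
    for i in range(n_steps + transient):
        s = rk4(s)
        if i >= transient:
            out.append(s)
    return out

traj = lorenz_trajectory(10000)
print(len(traj))  # 10000
# The trajectory remains bounded on the attractor.
print(all(abs(z) < 100 for _, _, z in traj))
```

Patterns for the noise class can then be drawn from a Gaussian generator normalized to the same domain, as the text describes.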
The dynamics of the Lorenz system evolve on the well-known Lorenz
chaotic attractor [34]. This attractor is characterized by
complex nonlinear dynamics. The objective of the experiment is to
design a classification system that is capable of distinguishing
between a three-dimensional vector from the evolution of the
Lorenz chaotic system and random Gaussian noise. Since the Lorenz
attractor occupies a significant portion of the three-dimensional
state space, the theoretical performance of any classifier will
be limited in proportion to the correlation dimension of the
Lorenz attractor [35]. However, since noise covers the 3-D space
uniformly, there exist regions that separate well from the
attractor and others that are placed near the attractor
boundaries. Moreover, since the attractor is fractal, there exist
regions within it that are not visited by the trajectories of the
Lorenz system dynamics. Therefore, the difficulty of
distinguishing noise from the Lorenz state vectors depends on the
state space region. The peculiarities of this problem are well
fitted to the philosophy of the SNet-SOM. The regions of the
state space far from the attractor can be handled effectively
with the CP-SOM classification. The remaining regions (near the
attractor or within its fractal structure) are difficult and
require the construction of complex decision boundaries. This is
accomplished with the incorporation of effective supervised
models. Specifically, we have experimented with the Radial Basis
Functions and the Support Vector Machines as supervised experts.
Finally, the regions of the three-dimensional space that are
covered by the attractor cannot be distinguished, since these are
the regions where the classes overlap.
The results of the simulation experiments have been obtained by
generating 20,000 training patterns, half of which are computed
from the numerical integration of the Lorenz system while the
other half are samples obtained from a Gaussian noise generator
normalized within the domain of evolution of the Lorenz
attractor. The testing set, consisting of another 20,000 values,
is constructed similarly. The size of the ambiguous pattern set
was near 2000, with the entropy threshold criterion set to 0.2.
We have tested on this classification problem the plain SOM and
the SNet-SOM with RBF (SNet-SOM/RBF) and with SVM (SNet-SOM/SVM)
supervised experts. The average performances obtained were 79%
for the SOM, 81% for the SNet-SOM/RBF and 82% for the
SNet-SOM/SVM.
3.3. Ischemia Detection
The third application of the SNet-SOM concerns the classification
of real data obtained from the biomedical domain. Specifically,
the problem of maximizing the performance of the detection of
ischemic episodes is addressed. This is a difficult pattern
classification problem [5, 6].
Myocardial ischemia is caused by a lack of oxygen and nutrients
to the contractile cells. Frequently, it may lead to myocardial
infarction with its severe consequences of heart failure and
arrhythmia that may even lead to patient death. The ST-T complex
of the ECG represents the time period from the end of ventricular
depolarization to the end of the corresponding repolarization in
the electrical cardiac cycle. Changes in the values of measured
amplitudes, times, and durations on the ST-T complex are used to
detect and quantify ischemia noninvasively from the standard ECG
[5, 6, 36, 37]. The ECG is a widespread noninvasive examination
that provides information about the electrical activity of the
heart tissue. Different degrees of severity of the ischemic
progression can be described in terms of ECG features [38]. The
first stage of ischemia is characterized by a T-wave amplitude
increase without a simultaneous ST segment change. As the
ischemia extends transmurally through the myocardium, the
intracellular action potential shortens and the injured cells
become hyperpolarized. This hyperpolarization in turn produces an
injury current which is reflected in the ECG as a horizontal ST
segment deviation [39]. At the final stage the ischemia is so
extensive that the terminal portion of the active depolarization
waveform, represented in the ECG by the QRS complex, is altered
[38]. This stage is usually associated with myocardial necrosis.
We used for our study the ECG signals of the European ST-T
Database, which is a set of long-term Holter recordings provided
by eight countries. From the samples composing each beat, a
window of 400 milliseconds is selected (100 samples at the 250 Hz
sampling frequency). This signal component forms the input to the
Principal Component Analysis [7, 37], in order to describe most
of its content within a few (i.e. five) coefficients. The
original data space undergoes a dimensionality reduction as the
feature space is constructed. The term dimensionality reduction
refers to the fact that each 100-dimensional data vector x of the
original data space is represented by a vector of much smaller
dimensionality (i.e. a 5-dimensional vector), yet most of the
intrinsic information content of the data is retained.
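The 100-to-5 dimensionality reduction can be sketched with a covariance-eigenvector PCA; the synthetic stand-in signals below only illustrate the mechanics and are not ST-T data:

```python
import numpy as np

def pca_reduce(signals, n_components=5):
    """Project each signal onto the eigenvectors q_k of the data
    covariance matrix with the largest eigenvalues lambda_k.
    `signals` is an (n_samples, n_dims) array, e.g. 100-sample ST-T
    windows reduced to 5 coefficients."""
    X = np.asarray(signals, dtype=float)
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)        # ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    basis = eigvecs[:, order]                     # q_1 ... q_k
    return Xc @ basis, eigvals[order]

# Synthetic stand-in for ST-T segments: 200 windows of 100 samples.
rng = np.random.default_rng(0)
signals = rng.standard_normal((200, 100))
coeffs, lams = pca_reduce(signals)
print(coeffs.shape)               # (200, 5)
print(np.all(np.diff(lams) <= 0)) # eigenvalues sorted descending
```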
The PCA transformation describes the original vectors (ST-T
segments) according to the directions of maximum variance in the
training set. The latter information is obtained by analyzing the
data covariance matrix. PCA selects as basis functions the
orthogonal eigenvectors q_k of the covariance matrix for the
signal projection operation. The corresponding eigenvalues λ_k
represent the average dispersion of the projection of the input
vectors onto the corresponding eigenvectors (basis functions)
q_k. The numerical value of each eigenvalue λ_k quantifies the
amount of variance that is accounted for by projecting the signal
onto the corresponding eigenvector q_k. Therefore it represents
the contribution of the eigenvector's analysis direction
Table 3. The results obtained from the CP-SOM of the SNet-SOM as
the main classification tool (average classification performance:
73.31%).

Record   Beat classification   Record   Beat classification   Record   Beat classification
number   performance (%)       number   performance (%)       number   performance (%)
E0103    66.5                  E0122    75.1                  E0151    85.5
E0106    65                    E0127    81.9                  E0154    82.7
E0108    64                    E0129    66                    E0159    75
E0111    81.8                  E0139    78                    E0166    79
E0118    63.2                  E0147    71                    E0202    65
Table 4. The results obtained from the SNet-SOM with a Radial
Basis Function network as supervised expert (average performance:
76.6%).

Record   Beat classification   Record   Beat classification   Record   Beat classification
number   performance (%)       number   performance (%)       number   performance (%)
E0103    81                    E0122    78.4                  E0151    92.7
E0106    73.4                  E0127    74.2                  E0154    74.6
E0108    65.5                  E0129    69.5                  E0159    69.3
E0111    88.5                  E0139    72.1                  E0166    91.2
E0118    69                    E0147    78.2                  E0202    71
to the signal reconstruction in the mean squared error
sense.
After obtaining these PCs, a wavelet-based denoising technique
[40] based on Lipschitz regularization theory is applied, in
order to improve the signal-to-noise ratio of the five
coefficients. These denoised Principal Components constitute the
input space for the SNet-SOM. The utilization of the Wavelet
Denoising in the domain of the Principal Components has improved
the classification performance.
The training set consists of 15,000 ST-T segment data extracted
from about 44,000 beats. This set is constructed by using samples
taken from 6 records (different from those used in the testing
sets). The training set patterns are selected from the relatively
flat regions of the PC time series representations. Also, the two
main classes (i.e. normal and ischemic) are represented by an
approximately equal number of samples.
The evaluation of the SOM and the SNet-SOM models has been
performed on another 15 records out of the 90 records of the
European ST-T database. From these records testing sets have been
constructed. The whole test set contains principal component
projection coefficients from approximately 120,000 ECG beats.
Tables 3–5 present the classification performances for ischemic
beat classification. The classification
Table 5. The results obtained from the SNet-SOM with a Support Vector Machine as supervised expert (average performance: 78.8%).
Beat classification Beat classification Beat classification
Record number performance (%) Record number performance (%) Record number performance (%)
E0103 82.5 E0122 80.1 E0151 92.6
E0106 75.6 E0127 78.2 E0154 75.1
E0108 65.9 E0129 72.5 E0159 76.9
E0111 87.5 E0139 74.3 E0166 92.8
E0118 71 E0147 81.7 E0202 75.3
performance ratio is a global one: it expresses the ratio of correct classifications to the total number of classifications. The CP-SOM already performs well, given that it has an increased size in order to perform the classification directly. The related performances are described in Table 3. These results have been obtained by using a CP-SOM organized as a 10 x 10 lattice of neurons. The performed experiments have illustrated that this size yields the best results for direct classification. Also, with the utilization of the Manhattan distance measure [17], better results are obtained in comparison to the alternative Euclidean measure. Although the CP-SOM is trained with the usual SOM unsupervised training algorithm [17], it has the potential to obtain beat classification accuracy close to that reported in [5, 6, 36] with supervised neural models. Table 4 presents the results for ischemia beat classification obtained from the SNet-SOM with a Radial Basis Function network as a supervised expert. The training set size corresponding to the number of training set patterns mapped to ambiguous neurons is about 2000, and the number of centers is 500. Also, the regularization parameter is 0.1. The CP-SOM consists of a two-dimensional lattice of neurons of size 4 x 4. The average beat classification accuracy of the RBF network as a supervised expert is 76.6%. Table 5 displays the corresponding results with a Support Vector Machine (SVM) as a supervised expert. The training set for the SVM case is the same as for the RBF. The inner-product kernel of the SVM is based on Radial Basis Functions with spread sigma^2 = 8 and a regularization parameter of a = 0.1. The CP-SOM is of the same size (i.e. a 4 x 4 lattice). The average beat classification performance has been improved to 78.8%.
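The SVM expert's inner-product kernel and dual-form decision value can be written out directly. The sketch below uses the spread sigma^2 = 8 quoted above; the function names and the decision helper are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def rbf_kernel(X, Y, sigma2=8.0):
    # Kernel matrix K[i, j] = exp(-||X_i - Y_j||^2 / (2 * sigma2)),
    # with the spread sigma^2 = 8 used for the SVM supervised expert.
    X, Y = np.atleast_2d(X), np.atleast_2d(Y)
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2.0 * sigma2))

def svm_decision(x, support_vectors, alphas, labels, bias=0.0):
    # Dual-form decision value f(x) = sum_i alpha_i y_i K(s_i, x) + b;
    # the sign of f(x) gives the predicted class.
    k = rbf_kernel(support_vectors, np.atleast_2d(x))[:, 0]
    return float(np.dot(np.asarray(alphas) * np.asarray(labels), k) + bias)
```

For example, two points at Euclidean distance 4 have kernel value exp(-16/16) = exp(-1) under this spread.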
The detection of ischemic episodes from the detected ischemic beats follows the approach of [5, 36]: for the detection of ischemic episodes, the ECG recordings are broken into ten-beat groups. For each such group the number of normal and abnormal beats is counted. Groups with a number of normal beats larger than the number of abnormal ones (i.e. more than five) are assigned to the normal state (i.e. all the beats for that group are considered normal). In the opposite case the group is considered abnormal. Ischemic episodes are considered as those for which consecutive beat groups are ischemic for at least 30 sec (i.e. three or more consecutive groups are ischemic).
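The grouping rule above can be sketched in a few lines. The function and parameter names are illustrative; ties (five normal, five abnormal beats) are counted as abnormal, following the text:

```python
def detect_ischemic_episodes(beat_labels, group_size=10, min_groups=3):
    # Detect ischemic episodes from per-beat labels (1 = ischemic, 0 = normal).
    # Complete ten-beat groups are marked ischemic when the abnormal beats are
    # not outnumbered by the normal ones; a run of at least three consecutive
    # ischemic groups (>= 30 sec) is reported as one episode.
    groups = [
        2 * sum(beat_labels[i:i + group_size]) >= group_size
        for i in range(0, len(beat_labels) - group_size + 1, group_size)
    ]
    episodes, run_start = [], None
    for g, ischemic in enumerate(groups + [False]):  # sentinel closes final run
        if ischemic and run_start is None:
            run_start = g
        elif not ischemic and run_start is not None:
            if g - run_start >= min_groups:
                episodes.append((run_start, g - 1))  # inclusive group indices
            run_start = None
    return episodes
```

For instance, 40 ischemic beats followed by 20 normal beats yield four consecutive ischemic groups and hence a single episode spanning groups 0 through 3.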
Correctly detected episodes are termed True Positive (TP) episodes. Missed episodes are termed False Negatives (FN). Also, when a nonischemic episode is detected as ischemic, a False Positive (FP) situation has occurred. The ST Episode Sensitivity is defined as the ratio of the number of detected episodes matching the database annotations to the number of annotated episodes. In terms of the above definitions:

Ischemia Episode Sensitivity = TP/(TP + FN)

Another important index is the ST Episode Predictivity, which is defined as the ratio of the number of correctly detected episodes to the total number of episodes detected, i.e.

Ischemia Episode Predictivity = TP/(TP + FP)
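In code, the two indices are one-liners (the helper names are hypothetical):

```python
def episode_sensitivity(tp, fn):
    # ST Episode Sensitivity = TP / (TP + FN): fraction of annotated
    # episodes that were detected.
    return tp / (tp + fn)

def episode_predictivity(tp, fp):
    # ST Episode Predictivity = TP / (TP + FP): fraction of detected
    # episodes that were correct.
    return tp / (tp + fp)
```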
Table 6 displays the results of the average ischemia episode detection performance evaluated with the three network types. The second column displays the sensitivity, while the third one displays the predictivity of episode detection. As expected from the beat classification results, the SNet-SOM with an SVM as supervised expert yields a better average episode detection performance.
Table 6. The average ischemia episode detection performance evaluated with the corresponding networks (i.e. SOM, SNet-SOM with RBF as supervised expert and SNet-SOM with SVM as supervised expert).
Network type Ischemia episode sensitivity (%) Ischemia episode predictivity (%)
SOM 74.9 73.7
SNet-SOM/RBF 79.5 77.6
SNet-SOM/SVM 82.8 81.3
4. Conclusions
This work has proposed a new supervised extension to the Self-Organizing Map (SOM) model [15-17] that is called the Supervised Network Self-Organizing Map (SNet-SOM). This model exploits the ordering potential of the SOM in order to split the global state space into two subspaces. The first subspace corresponds to regions over which the classification task can be performed directly with the unsupervised SOM algorithm. However, for the second subspace complex decision boundaries should be enforced and the generalization performance should be explicitly designed. The SOM algorithm is not appropriate for this task, and therefore supervised training networks capable of achieving good generalization performance (i.e. Radial Basis Functions and Support Vector Machines) are used. We have developed the SNet-SOM with Radial Basis Function networks [7, 9] and Support Vector Machines as supervised experts [7, 12]. All these designs construct approximations that involve local fitting to the dynamics of the target function. The locality of these networks fits well with the locality of the subspaces that constitute the ambiguous region. The RBF networks address the issue of regularization in a disciplined mathematical way through Tikhonov regularization theory [7-9]. The Support Vector Machines have obtained the best discrimination capability for the ambiguous regions (Table 5).
The main objective of using the SNet-SOM for difficult pattern classification tasks is to obtain significant computational benefits in large-scale problems. The SNet-SOM utilizes the computationally effective SOM algorithm for resolving most of the regions of the state space, while it uses advanced supervised learning algorithms to confront the difficulties of enforcing complex decision boundaries over regions characterized by class ambiguity (quantified with the entropy criterion). Moreover, without a divide-and-conquer approach such as that of the SNet-SOM, it is difficult to approach some large problems directly with nearly optimal models, such as the Support Vector Machines, due to the computational complexity of their numerical solution.
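The entropy criterion that quantifies class ambiguity, defined earlier as H(m) = -sum_k p_k log p_k over the class-vote ratios p_k of neuron m, can be sketched as follows (the function name is an illustrative assumption):

```python
import math

def neuron_entropy(votes):
    # H(m) = -sum_k p_k log p_k, where votes[k] counts the training
    # patterns of class k mapped to (voting for) neuron m.
    # Zero for unambiguous neurons; maximal (log N_c) when all
    # classes are equiprobable.
    total = sum(votes)
    return -sum((v / total) * math.log(v / total) for v in votes if v > 0)
```

A neuron whose votes all come from one class scores zero, while an even split over two classes scores log 2, marking it as a candidate for the supervised expert.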
The SNet-SOM is a modular architecture that can be improved along many directions. The utilization of different frameworks for self-organization, such as the Adaptive-Subspace Self-Organizing Map (ASSOM) [17] and information-theoretic frameworks for self-organization [7, 23], can improve the state space partitioning phase. Also, we are currently formulating the SOM algorithm within the framework of the Modified Value Difference Metric (MVDM) [41, 42]. This will permit the use of its partitioning potential for coping with complex data mining applications that include symbolic features (e.g. protein secondary structure prediction [4]). All these research efforts on the SNet-SOM follow the general philosophy that the best network architecture depends on the structure of the problem that is confronted. Therefore, for complex problems with irregular state spaces, a device capable of effectively integrating multiple architectures, such as the presented SNet-SOM, can perform better than individual architectures.
Acknowledgment
The authors wish to thank the Greek Scholarship Foundation for the financial support of this research through the scholarships for the PhD student L. Vladutu and the postdoctoral researcher S. Papadimitriou.
References
1. C.S. Herrmann, Symbolical reasoning about numerical data: A hybrid approach, Applied Intelligence, vol. 7, pp. 339-354, 1997.
2. H. Liu and R. Setiono, Incremental feature selection, Applied Intelligence, vol. 9, pp. 217-230, 1998.
3. G. Piatetsky-Shapiro, R. Brachman, T. Khabaza, W. Kloesgen, and E. Simoudis, An overview of issues in developing industrial data mining and knowledge discovery applications, in Proceedings, Second International Conference on Knowledge Discovery and Data Mining, AAAI Press: Menlo Park, CA, 1996.
4. B. Rost and S. O'Donoghue, Sisyphus and protein structure prediction, BioInformatics, vol. 13, pp. 345-356, 1997.
5. N. Maglaveras, T. Stamkopoulos, C. Pappas, and M.G. Strintzis, An adaptive backpropagation neural network for real-time ischemia episodes detection: Development and performance analysis using the European ST-T database, IEEE Transactions on Biomedical Engineering, vol. 45, no. 7, 1998.
6. R. Silipo and C. Marchesi, Artificial neural networks for automatic ECG analysis, IEEE Transactions on Signal Processing, vol. 46, no. 5, 1998.
7. S. Haykin, Neural Networks, 2nd edn., Macmillan College Publishing Company: London, 1999.
8. C.M. Bishop, Neural Networks for Pattern Recognition, Clarendon Press: Oxford, 1996.
9. T. Poggio and F. Girosi, Regularization algorithms for learning that are equivalent to multilayer perceptrons, Science, vol. 247, pp. 978-982, 1990.
10. T. Poggio and F. Girosi, Networks for approximation and learning, in Proceedings of the IEEE, vol. 78, pp. 1481-1497, 1990.
11. F. Girosi, An equivalence between sparse approximation and support vector machines, Neural Computation, vol. 10, no. 6, pp. 1455-1480, 1998.
12. V.N. Vapnik, Statistical Learning Theory, Wiley: New York, 1998.
13. T. Joachims, Making large-scale SVM learning practical, in Advances in Kernel Methods: Support Vector Learning, edited by B. Schölkopf, C.J.C. Burges, and A.J. Smola, MIT Press: Cambridge, USA, 1998.
14. B. Kosko, Fuzzy Engineering, Prentice Hall: Upper Saddle River, NJ, 1997.
15. T. Kohonen, The self-organizing map, in Proceedings of the Institute of Electrical and Electronics Engineers, vol. 78, pp. 1464-1480, 1990.
16. H. Ritter, T. Martinetz, and K. Schulten, Neural Computation and Self-Organizing Maps, Addison-Wesley: Reading, MA, 1992.
17. T. Kohonen, Self-Organizing Maps, Springer-Verlag: Berlin, 1997.
18. O.L. Mangasarian and D.R. Musicant, Successive overrelaxation for support vector machine expansions, IEEE Transactions on Neural Networks, vol. 10, no. 5, pp. 1032-1037, 1999.
19. E. Osuna, R. Freund, and F. Girosi, An improved training algorithm for support vector machines, in Neural Networks for Signal Processing VII, Proceedings of the 1997 IEEE Workshop, Amelia Island, FL, 1997, pp. 276-285.
20. D.P. Bertsekas, Dynamic Programming and Optimal Control, vols. I and II, Athena Scientific: Belmont, MA, 1995.
21. D. Alahakoon, S.K. Halgamuge, and B. Srinivasan, Dynamic self-organizing maps with controlled growth for knowledge discovery, IEEE Transactions on Neural Networks, vol. 11, no. 3, 2000.
22. R. Silipo, P. Laguna, C. Marchesi, and R.G. Mark, ST-T segment change recognition using artificial neural networks and principal component analysis, Computers in Cardiology, pp. 213-216, 1995.
23. T.-W. Lee, Independent Component Analysis: Theory and Applications, Kluwer Academic Publishers: Dordrecht, 1998.
24. S. Papadimitriou, A. Bezerianos, and A. Bountis, Radial basis function networks as chaotic generators for secure communication systems, International Journal of Bifurcation and Chaos, vol. 9, no. 1, pp. 221-232, 1999.
25. A. Bezerianos, S. Papadimitriou, and D. Alexopoulos, Radial basis function neural networks for the characterization of heart rate variability dynamics, Artificial Intelligence in Medicine, vol. 15, pp. 215-234, 1999.
26. Z. Uykan, C. Guzelis, M.E. Celebi, and H.N. Koivo, Analysis of input-output clustering for determining centers of RBFN, IEEE Transactions on Neural Networks, vol. 11, no. 4, pp. 851-858, 2000.
27. C. Cortes and V. Vapnik, Support vector networks, Machine Learning, vol. 20, pp. 273-297, 1995.
28. P. Bartlett and J. Shawe-Taylor, Generalization performance of support vector machines and other pattern classifiers, in Advances in Kernel Methods: Support Vector Learning, The MIT Press: Cambridge, pp. 43-55, 1999.
29. V.N. Vapnik, Three remarks on the support vector method of function estimation, in Advances in Kernel Methods: Support Vector Learning, The MIT Press: Cambridge, pp. 25-41, 1999.
30. V. Cherkassky, X. Shao, F.M. Mulier, and V.N. Vapnik, Model complexity control for regression using VC generalization bounds, IEEE Trans. on Neural Networks, vol. 10, no. 5, pp. 1075-1089, 1999.
31. V.N. Vapnik, An overview of statistical learning theory, IEEE Trans. on Neural Networks, vol. 10, no. 5, pp. 988-999, 1999.
32. B. Schölkopf, C. Burges, and V. Vapnik, Extracting support data for a given task, in Proceedings, First International Conference on Knowledge Discovery & Data Mining, AAAI Press, 1995, pp. 252-257.
33. J.J. More and G. Toraldo, On the solution of large quadratic programming problems with bound constraints, SIAM J. Optimization, vol. 1, no. 1, pp. 93-113, 1991.
34. E. Ott, Chaos in Dynamical Systems, Cambridge University Press: Cambridge, 1993.
35. A.A. Tsonis, Chaos: From Theory to Applications, Plenum Press: New York, 1992.
36. T. Stamkopoulos, K. Diamantaras, N. Maglaveras, and M. Strintzis, ECG analysis using nonlinear PCA neural networks for ischemia detection, IEEE Transactions on Signal Processing, vol. 46, no. 11, 1998.
37. J. Garcia, P. Lander, L. Sornmo, S. Olmos, G. Wagner, and P. Laguna, Comparative study of local and Karhunen-Loève-based ST-T indexes in recordings from human subjects with induced myocardial ischemia, Computers and Biomedical Research, vol. 31, pp. 271-297, 1998.
38. Y. Birnbaum, S. Sclarovsky, A. Blum, A. Mager, and U. Cabby, Prognostic significance of the initial electrocardiographic pattern in a first acute anterior wall myocardial infarction, Chest, vol. 103, no. 6, pp. 1681-1687, 1993.
39. R.L. Verrier and B.D. Nearing, T wave alternans as a harbinger of ischemia-induced sudden cardiac death, in Cardiac Electrophysiology: From Cell to Bedside, 2nd edn., edited by D. Zipes and J. Jalife, W.B. Saunders Company: Philadelphia, 1995.
40. S. Mallat and W.L. Hwang, Singularity detection and processing with wavelets, IEEE Transactions on Information Theory, vol. 38, no. 2, pp. 617-643, 1992.
41. D.R. Wilson and T.R. Martinez, Improved heterogeneous distance functions, Journal of Artificial Intelligence Research, pp. 1-34, 1997.
42. S. Cost and S. Salzberg, A weighted nearest neighbor algorithm for learning with symbolic features, Machine Learning, vol. 10, pp. 57-78, 1993.
Stergios Papadimitriou received the Dipl. Eng. degree and the Ph.D. degree from the Computer Engineering and Informatics Department, University of Patras, Greece, in 1990 and 1996, respectively. He worked for five years as a Research Assistant at the Institute of Computer Technology, Patras, Greece. His main research interests are neuro-fuzzy computing, chaotic dynamics, chaotic encryption, computational and artificial intelligence, support vector learning and recurrent neural-network architectures. He is now a Senior Researcher at the Biomedical Signal Processing Laboratory of the Medical Physics Department and at the Artificial Intelligence Laboratory of the Computer Engineering Department of the University of Patras.
Seferina Mavroudi received the Dipl. Eng. degree in electrical engineering from the Department of Electrical and Computer Engineering of the Aristotle University of Thessaloniki, Greece, in 1998. She participated in an inter-departmental post-graduate program, where she received the Master's degree in biomedical engineering from the Medical School of the University of Patras, the Department of Electrical and Computer Engineering and the Department of Mechanical Engineering of the National Technical University of Athens, Greece, in 2000. She is now working as a Ph.D. Fellow at the Biomedical Signal Processing Laboratory of the Medical Physics Department of the University of Patras. Her main research interests include neural networks, neuro-fuzzy architectures, artificial intelligence and nonlinear dynamics, mainly for biomedical applications.
Liviu Vladutu (S'01) received the B.S. degree (Hons.) in automation and computer science in 1987 from Craiova University, Romania, and the M.S. degree in biomedical engineering from the University of Patras, Greece, in 1999. He is currently working as a Ph.D. Fellow at the Department of Medical Physics, University of Patras, Greece. His current work involves the application of computational intelligence methods to biomedical signal processing. His other interests include statistical learning theory and multiresolution analysis.
Prof. G. Pavlides has been working in several universities in Greece and abroad for about 20 years. He has also been working as a senior consultant in private industry and financial institutions. He is the director of the Information Systems and Artificial Intelligence Laboratory of the University of Patras. His areas of expertise and interest include Data Warehousing, Encryption Algorithms, Security, Workflow systems, OLTP, OLCP, and OLAP. He is a member of international professional organisations, such as IEEE and ACM. He has been the leader in several projects, some of which are funded by the European Union, as well as large-scale software engineering projects in Information Systems. They include prototyping environments, document filing and retrieval systems, hierarchical data replication techniques, etc.
Anastasios Bezerianos (M'97) was born in Patras, Greece, in 1953. He received the B.Sc. in physics from Patras University in 1976, the M.Sc. degree in telecommunications and electronics from Athens University, and the Ph.D. degree in Medical Physics from Patras University. He is currently Associate Professor in the Medical School of Patras University. His main research interests are concentrated in biomedical signal processing and medical image processing, as follows: 1) data acquisition and on-line processing using digital signal processors, 2) nonlinear time series analysis of the electrocardiogram, 3) wavelet analysis of high resolution ECG, 4) modeling of heart muscle and heart rate variability, and 5) wavelet analysis of medical images.