
Gene Expression Analysis with a Dynamically Extended Self-Organized Map that Exploits Class Information

Seferina Mavroudi, Stergios Papadimitriou, Liviu Vladutu, Anastasios Bezerianos

Department of Medical Physics, School of Medicine, University of Patras,
26500 Patras, Greece, tel: +30-61-996115,
email: severina@heart.med.upatras.gr, stergios@heart.med.upatras.gr

ABSTRACT
Motivation: Currently, the most popular approach to the analysis of genome-wide expression data is clustering. One of the major drawbacks of most existing clustering methods is that the number of clusters has to be specified a priori. Furthermore, pure unsupervised algorithms ignore prior biological knowledge entirely, e.g. there is no simple means to handle genes of known similar function that are allocated to different clusters on the basis of their expression profiles. Moreover, most current tools lack an effective framework for the tight integration of unsupervised and supervised learning in the analysis of high-dimensional expression data.
Results: The paper adapts a novel self-organizing map, called the supervised Network Self-Organized Map (sNet-SOM), to the peculiarities of gene expression data. The sNet-SOM determines the number of clusters adaptively, with a dynamic extension process that is able to exploit class information whenever it exists. Specifically, the sNet-SOM uses the available class information to control the dynamic extension process with an entropy criterion. This process extracts information about the structure of the decision boundaries. A supervised network can additionally be connected in order to resolve better the difficult parts of the state space. When no classification information is available, a similar dynamic extension is controlled with criteria based on the computation of local variances or resource counts.
The sNet-SOM grows within a rectangular grid that provides effective visualization while at the same time allowing the implementation of efficient training algorithms. The expansion of the sNet-SOM is based on an adaptive process. This process grows nodes at the boundary of the map, ripples weights from the internal nodes towards the outer nodes of the grid, and inserts whole columns within the map. The growing process determines the appropriate level of expansion automatically, with criteria that depend upon whether unsupervised or supervised training is used. For unsupervised training, the criterion is that the similarity between the gene expression patterns of the same cluster fulfills a designer-definable statistical confidence level of not being a random event. The supervised mode of training grows the map until criteria defined on the approximation/generalization performance are fulfilled. The voting schemes for the winner node have been designed to amplify the representation of rare gene expression patterns.
The results indicate that the sNet-SOM yields performance competitive with other recently proposed approaches to supervised classification at a significantly reduced computational cost, and that it provides extensive exploratory analysis potential within the unsupervised analysis framework. Furthermore, it relies on simple design decisions that are easy to comprehend and computationally efficient.
Availability: The source code of the algorithms presented in
the paper can be downloaded from
http://heart.med.upatras.gr. The implementation is in
Borland C++ Builder 4.0.
Contact: severina@heart.med.upatras.gr,
stergios@heart.med.upatras.gr








1. Introduction
The recent development of DNA microarray technology
provides the ability to measure the expression levels of
thousands of genes in a single experiment [ , , ]. The
interpretation of such massive expression data is a new
challenge for bioinformatics and opens new perspectives for
functional genomics. A key question within this context is whether, given some expression data for a gene, this gene belongs to a particular functional class (i.e. whether it encodes a protein of interest).
Currently, the most popular analysis of gene expression data, aimed at providing insight into the structure of the data and at aiding the discovery of functional classes, is clustering, i.e. the grouping of genes with similar expression patterns into clusters [ , ]. Such approaches unravel relations between genes and help to deduce their biological role, since genes of similar function tend to display similar expression patterns.
Most of the algorithms developed so far perform the clustering of the expression patterns in an unsupervised manner [ , , ]. However, genes of similar function frequently become allocated to different clusters. In this case, a pure unsupervised approach is unable to deduce the correct "rule" for the characterization of the gene class. On the other hand, there already exists valuable biological knowledge, manifested in the form of collections of genes known to encode proteins of similar biological function, e.g. genes that code for ribosomal proteins [ ].
Some of the clustering algorithms used so far for the clustering of gene expression data include hierarchical clustering [ ], K-means clustering, Bayesian clustering [ ] and the Self-Organizing Map (SOM) [ ].
Besides ignoring existing class information, the widely adopted clustering methods, such as K-means and the SOM, share another major drawback: they require an a priori decision on the number and structure of the distinct clusters. Moreover, most of the proposed models do not incorporate flexible means for coupling the unsupervised phase effectively with a complementary supervised phase, in order to benefit the most from both approaches.
A major drawback of hierarchical clustering is that, although the data points are organized into a strict hierarchy of nested subsets, there is no reason to believe that expression data actually follow a true hierarchical descent like, for example, the evolution of the species [ , ]. Furthermore, decisions made early about grouping points to specific clusters cannot be reevaluated and often adversely affect the result. This latter disadvantage is shared by the dynamic non-fuzzy hierarchical schemes proposed recently [ , ]. Also, traditional hierarchical clustering schemes suffer from a lack of robustness and from nonuniqueness and inversion problems.
Bayesian clustering is a highly structured approach, which imposes a strong prior hypothesis on the data [8]. However, such prior hypotheses are usually not available for expression data.
K-means clustering, on the other hand, imposes no structure at all on the data, proceeds in a local fashion and produces an unorganized collection of clusters that is not conducive to interpretation [ ].
In contrast, the standard SOM algorithm has a number of properties that render it a candidate of particular interest. SOMs can be implemented easily, are fast, robust, and scale well to large data sets. They allow one to impose partial structure on the clusters and they facilitate visualization and interpretation. In case hierarchical information is required, it can be implemented on top of the SOM, as in [27]. However, there is still an inherent requirement of the standard SOM algorithm that constitutes a major drawback: the number of distinct clusters has to be specified a priori, although there is no means to objectively predetermine the optimum number in the case of gene expression data.
Recently, several dynamically extended schemes have been proposed that overcome the limitation of the fixed, non-adaptable architecture of the SOM. Some examples are the Dynamic Topology Representing structures [23], the Growing Cell Structures [14, 9], Self-Organized Tree Algorithms [7, 17] and the Adaptive Resonance Theory [2]. The presented approach has many similarities to these dynamically extended schemes. However, in contrast to the complexity of these schemes, we have built simple algorithms that, through the restriction of growing to a rectangular grid, can be implemented easily, and the training of the models is very efficient. Also, the benefits of the more complex alternatives to the dynamic extension are still retained.
We call the proposed model sNet-SOM, from supervised Network SOM, since, although it is SOM based, it incorporates many provisions for the supervised complementation of learning. These provisions start with the supervised versions of the map growing process and extend to the possibility of integrating a pure supervised model.
Specifically, our clustering algorithm modifies the original SOM algorithm with a dynamic expansion process controlled by an entropy-based measure whenever gene functional class information exists. The latter measure quantifies to which extent the available information about the biological function (i.e. the class) of a gene is represented accurately by the cluster (i.e. the SOM node) to which the gene is allocated. Accordingly, the model is adapted dynamically in order to minimize the entropy within the generated clusters. This approach effectively detects the regions where the decision boundaries between different classes lie. At these regions the classification task becomes difficult, and a special supervised network can be connected to the sNet-SOM in order to resolve better at the class boundaries. Usually, the dynamic expansion is controlled by local variance or resource count criteria only when class information is lacking: the entropy criterion concentrates on the resolution of the regions characterized by class ambiguity and is therefore more effective.
The sNet-SOM has been designed to detect the appropriate level of expansion automatically. In the unsupervised case, the distance threshold between patterns below which two genes can be considered co-expressed is estimated. Then the map is grown automatically until its nodes correspond to gene clusters with distances that adhere to this limit. In the supervised case, the criteria for stopping the network expansion can be expressed either in terms of the approximation or in terms of the classification performance.
Furthermore, the sNet-SOM overcomes the problem of irrelevant (flat) profiles, which can populate many more clusters than necessary in the traditional SOM. The solution we adopted is a careful redesign of the voting mechanism.
The paper is outlined as follows: initially, Section 2 summarizes the microarray expression experiments and the associated data used to evaluate the presented computational learning schemes. Section 3 describes the extensions to the SOM that lead to the sNet-SOM and the overall architecture of the latter. Section 4 deals with the learning algorithms that adapt both the structure and the parameters of the sNet-SOM. The expansion phase of the sNet-SOM learning is described in separate sections, since it is rather complicated and depends on whether the learning is supervised or unsupervised. Specifically, Section 5 elaborates on the details of the expansion phase for the unsupervised case and Section 6 for the supervised one. Section 7 discusses results obtained from an application to yeast expression microarray data. Finally, Section 8 presents the conclusions, along with some directions in which further research can proceed.

2. Microarray expression experiments
Recently, new approaches have been developed for accessing large scale gene expression data. One of the most effective is the DNA microarray technology [10]. In this method, thousands of distinct DNA probes are attached to a microarray. These probes can be Polymerase Chain Reaction (PCR) products or oligonucleotides whose sequences correspond to target genes or Expressed Sequence
Tags (ESTs) of the genome being studied. RNA is extracted from the sample tissue or cells and reverse transcribed into cDNA labeled with fluorescent dyes, which is then allowed to hybridize with the probes on the microarray. The cDNA corresponds to transcripts produced by genes in the samples, and the amount of a particular cDNA sequence present will be in proportion to the expression level of its corresponding gene. The microarray is washed to remove non-specific hybridization, and the level of hybridization for each probe is calculated. An expression level for the genes corresponding to the probes is derived from these measurements. This level represents the ratio between the expression of the gene under some condition relative to the reference condition.
Gene expression data obtained in this way are usually arranged in tables whose rows correspond to the genes and whose columns hold the individual expression values of each gene under the particular experimental condition represented by the column. The raw data are characterized by highly asymmetrical distributions, which makes it difficult to realize any distance metric for the assessment of the differences among them. Therefore, a logarithmic transformation is used as a preprocessing step; it expands the scale for small values and compresses it for large values. An additional desirable effect of the logarithmic transformation is that it provides a symmetrical scale around 0.
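For illustration, a minimal sketch of this preprocessing step (a hypothetical helper; base 2 is assumed here, as the paper specifies only a logarithmic transformation):

#include <cmath>
#include <vector>

// Log-transform raw expression ratios so that a ratio r and its
// reciprocal 1/r map to values of equal magnitude around 0.
std::vector<double> logTransform(const std::vector<double>& ratios) {
    std::vector<double> out;
    out.reserve(ratios.size());
    for (double r : ratios)
        out.push_back(std::log2(r));    // log2(2.0) = 1, log2(0.5) = -1
    return out;
}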
The gene expression patterns reflect a cell's internal state and microenvironment, creating a molecular "picture" of the cell's state. Thus DNA microarrays can be used to capture these molecular pictures and deduce the condition of the cells. Furthermore, since the expression profile of a gene is correlated with its biological role, systematic microarray studies of global gene expression can provide remarkably detailed clues to the functions of specific genes. This is important, since currently fewer than 5% of the functions of the genes in the human genome are known.


3. The sNet-SOM model
The sNet-SOM is based on the standard SOM algorithm, but it is dynamically extendable, so that the number of clusters is controlled by a properly defined measure of the algorithm itself, with no need for any a priori specification. Because all the previously mentioned clustering algorithms are purely unsupervised, they ignore any available a priori biological information. This means not only that existing information is not exploited in order to deduce the correct expression characteristics that make genes part of functional groups, but also that genes known to be erroneously grouped to a cluster cannot be handled. Following the basic design principle of including existing prior knowledge, we manage to consider simultaneously both the gene expression data and the class information (whenever available) in the sNet-SOM training algorithms.
However, so far class annotation for gene expression data is limited and not always available. In order to account for this case as well, we additionally developed a second, similar algorithm, so that the algorithms for the two cases differ only in the criteria that control the dynamic expansion of the map. Specifically, depending on the availability of class information, we design two variants of the sNet-SOM.
The first variant, the unsupervised sNet-SOM, performs node expansion in the absence of class labels by exploiting either a local variance measure that depends on the SOM quantization performance or the node resource counts. These criteria are also used in the Growing Cell Structures (GCS) algorithms for growing cells [14, 9]. The convergence criteria are defined by a statistical assessment of the randomness of the distance between gene expression patterns.
The second variant, the supervised sNet-SOM, performs the growing by exploiting the class information with an entropy measure. The dynamic growth is based on the criterion of neuron ambiguity (i.e. uncertainty about the class assignment), which is quantified with an entropy measure defined over the sNet-SOM nodes. This approach differs from the local quantization error approach of [1] and from the resource value of [14], which grow the map at the nodes accumulating the largest local variances and resource counts, as in the unsupervised sNet-SOM. In the absence of class information these are reasonable and well performing criteria. However, these measures can be large even with no class ambiguity, while the entropy measure directly and objectively quantifies the ambiguity. For that reason the entropy based growing technique is preferable for the supervised sNet-SOM.
We initially developed the supervised sNet-SOM within the context of an ischemia detection application [22, 3]. In that application it is used in combination with capable supervised models in order to maximize the performance of the detection of ischemic episodes. However, the peculiarities of the gene expression data made a significant redesign of the algorithms mandatory. Below we discuss the sNet-SOM learning algorithms in detail.


4. Learning algorithms
The sNet-SOM is initialized with four nodes arranged in a 2X2 rectangular grid and grows nodes to represent the input data. The weight values of the nodes are self-organized according to a new method inspired by the SOM algorithm. The self-organization process maps properties of the original high-dimensional data space onto the lattice consisting of the sNet-SOM nodes. The map is expanded to represent the input space by creating new nodes, either from the boundary nodes with a boundary extension, or by inserting whole columns (or rows) of new units with a column extension (or row extension).
The decision to grow either with the boundary or with the column (row) extension does not limit the potential of the model for dimensionality reduction or its modeling effectiveness, while its implementation is easier and the training becomes more efficient. The latter advantage is important for the large data sets produced by microarray experiments. Usually, new nodes are created by expanding the map at its boundaries. However, when the expansion focus becomes a node placed deep in the interior of a large map, far from the boundary nodes, the adaptive expansion process inserts a whole column of nodes directly adjacent to this node. The node thereby directly becomes a boundary node, and the expansion process can generate new nodes in its neighborhood. The implementation of this exception to the general grow-from-boundary rule has accelerated the training of large maps significantly (2 to 4 times faster computation for maps of about 100 nodes).
The growing structure takes the form of a nonuniform rectangular grid. It develops within a large N x M grid that provides slots for the new dynamically created nodes. Generally, we require M > N, e.g. M = 2N, since the insertion of whole columns results in a faster expansion rate along the columns (note that the opposite is true when we implement the alternative of row insertion instead of column insertion).
A training epoch consists of the presentation of all the training patterns to the sNet-SOM. A training run is defined as the training of the sNet-SOM with a fixed number of neurons in its lattice, i.e. the training between successive node insertions/deletions.
After this preliminary discussion we can now proceed to describe the sNet-SOM learning algorithms in more detail. The top-level sNet-SOM learning algorithm is the same for both the unsupervised and the supervised case. In algorithmic form it can be described as:

Top-level sNet-SOM learning algorithm
1. <Initialization phase>
While <global criteria for convergence of learning not
satisfied> do
2. <Training Run Adaptation phase>
3. <Expansion phase>
End While
4. <Fine Tuning Adaptation phase>

The details of the algorithm, i.e. the initialization, adaptation, expansion and fine tuning phases and the convergence criteria, are described below.
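A hypothetical skeleton of this top-level loop; the phases are passed in as callables, since the unsupervised and the supervised variants share the driver and differ only in their expansion and convergence criteria (the names are assumptions, not the paper's actual code):

#include <functional>

// Top-level sNet-SOM learning loop (sketch).
void trainSNetSOM(const std::function<void()>& initialize,
                  const std::function<bool()>& globalCriteriaSatisfied,
                  const std::function<void()>& trainingRunAdaptation,
                  const std::function<void()>& expansion,
                  const std::function<void()>& fineTuningAdaptation) {
    initialize();                        // 1. 2X2 grid, random weights
    while (!globalCriteriaSatisfied()) { // global convergence of learning
        trainingRunAdaptation();         // 2. stabilize the current map
        expansion();                     // 3. grow nodes / insert columns
    }
    fineTuningAdaptation();              // 4. small learning rate polish
}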

A. Initialization phase.
The weight vectors of the four starting nodes that are arranged in a 2X2 grid are initialized with random numbers within the domain of the feature values (i.e. of the normalized fluorescent ratio coefficients).
B. Training Run Adaptation phase
The purpose of this phase is to stabilize the current map configuration in order to be able to evaluate its effectiveness and the requirements for further expansion. During this phase, the input patterns are repeatedly presented and the corresponding self-organization actions are performed until the map converges sufficiently. The training run adaptation phase takes the following algorithmic form:

<Training Run Adaptation Phase>:
MapConverged := false;
while MapConverged = false do
    for all input patterns x_k do
        present x_k and adapt the map by applying the map adaptation rules
    endfor
    Evaluate the map training run convergence condition and set MapConverged accordingly
endwhile

Map adaptation rules
The map adaptation rules that govern the processing of each input pattern $\mathbf{x}_k$ are as follows:
1. Determination of the weight vector $\mathbf{w}_i$ that is closest to the input vector $\mathbf{x}_k$ (i.e. of the winner node $i$).
2. Adaptation of the weight vectors $\mathbf{w}_j$ only for the four nodes $j$ in the direct neighborhood of the winner $i$ and for the winner itself, according to the following formula:

$$\mathbf{w}_j(k+1) = \begin{cases} \mathbf{w}_j(k) + \eta(k)\,\Lambda_k(d(j,i))\,(\mathbf{x}_k - \mathbf{w}_j(k)), & j \in N_k \\ \mathbf{w}_j(k), & j \notin N_k \end{cases}$$

where the learning rate $\eta(k)$ is a monotonically decreasing sequence of positive parameters, $N_k$ is the neighborhood at the $k$-th learning step, and $\Lambda_k(d(j,i))$ is the neighborhood function, which implements different adaptation rates even within the same neighborhood.
The learning rate starts from a value of 0.1 and decreases down to 0.02. These values are specified with the empirical criterion of achieving relatively fast convergence without sacrificing the stability of the map.
The neighborhood function $\Lambda_k(d(j,i))$ depends on the distance $d(j,i)$ between node $j$ and the winning node $i$. It decreases monotonically with increasing distance from the winning neuron (i.e. nodes closer to the winner are adapted more), as in the standard SOM algorithm. The initial neighborhood, $N_0$, includes the entire map.
Unlike the standard SOM, these parameters (i.e. $N_k$ and $\Lambda_k(d(j,i))$) do not need to shrink with time and can be kept constant, i.e. $N_k = N_0$ and $\Lambda_k(d(j,i)) = \Lambda_0(d(j,i))$. This is explained as follows: initially, the neighborhood is large enough to include the whole map. The sNet-SOM starts with a much smaller size than a usual SOM; thus a large neighborhood is not required to train the whole map at the first learning steps (e.g. with 4 nodes initially in the map, a neighborhood of only 1 is required). As training proceeds, during subsequent training epochs, the area defined by the neighborhood becomes localized near the winning neuron, not by shrinking the vicinity radius (as in the standard SOM) but by enlarging the SOM with the dynamic growing.
Usually, we use the following simple and efficiently computed formula for the neighborhood function $\Lambda$ (where $i_r$, $i_c$ denote the row and column of node $i$, respectively):

$$\Lambda_k(d(j,i)) = \begin{cases} 1, & \text{if } j = i \\ \alpha, \ 0 < \alpha < 1, & \text{if } |i_r - j_r| + |i_c - j_c| = 1 \\ 0, & \text{otherwise} \end{cases}$$

An alternative rectangular neighborhood, which also updates the diagonal nodes with a smaller learning rate, yields appropriate results as well:

$$\Lambda_k(d(j,i)) = \begin{cases} 1, & \text{if } j = i \\ \alpha_1, \ 0 < \alpha_1 < 1, & \text{if } |i_r - j_r| + |i_c - j_c| = 1 \\ \alpha_2, \ 0 < \alpha_2 < \alpha_1, & \text{if } |i_r - j_r| + |i_c - j_c| = 2 \\ 0, & \text{otherwise} \end{cases}$$
Evaluation of the map training run convergence condition
The map training run convergence condition is tested by evaluating the reduction of the total quantization error for the unsupervised case, and of the total entropy for the supervised one, before and after the presentation of all the input patterns (i.e. one training epoch). Specifically, denote by $E_b$ and $E_a$ the errors before and after the presentation of the patterns (the formulation for the entropies is similar). Then the map converges when the relative change of the error between successive epochs drops below a threshold value, i.e.

$$\text{MapConverged} := \left( \frac{|E_b - E_a|}{E_a} < \text{ConvergenceErrorThreshold} \right).$$
The setting of the ConvergenceErrorThreshold is somewhat empirical, but a value in the range 0.01 - 0.02 performs well, assuring sufficient convergence without excessive computation.
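In code the test is a one-liner; a sketch:

#include <cmath>

// Relative-change convergence test between successive epochs. errBefore
// and errAfter are the total quantization errors (or the total entropies
// in the supervised mode) before and after one training epoch.
bool mapConverged(double errBefore, double errAfter,
                  double threshold = 0.02) {   // typical range 0.01 - 0.02
    return std::fabs(errBefore - errAfter) / errAfter < threshold;
}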

C. Fine Tuning Adaptation Phase
The fine tuning phase aims to optimize the final sNet-SOM
configuration. This phase is similar to the training run
adaptation phase described previously with two differences:
a. The criterion for map convergence is more
elaborate. We require much smaller change of the
total quantization error (unsupervised case) or of
the total entropy (supervised case) for accepting the
condition for map convergence.
b. The learning rate decreases to a smaller value in
order to allow fine adjustments to the final
structure of the map.
Typically, the ConvergenceErrorThreshold for the fine
tuning phase is about 0.00001 and the learning rate is set to
0.01 (or to an even smaller value).

D. Expansion Phase
The dynamic expansion of the sNet-SOM depends on the availability of class labels, and is therefore referred to as supervised expansion when class labels are available and as unsupervised expansion when they are not. These processes are described separately below, since they explore different expansion criteria. Moreover, the objective underlying their development is different for each case. The unsupervised expansion has the task of revealing insight into the groups of genes with correlated expression patterns, while the supervised, entropy based expansion has the objective of reducing the computational requirements of a "pure" supervised solution. The latter objective follows a design principle of the sNet-SOM: the partitioning of a complex learning problem into domains that can be learned effectively with simple and computationally effective unsupervised models, and into ones that require the utilization of a capable supervised model since they are characterized by complex decision boundaries [22].
To avoid misconceptions, we should note that in our scheme the term supervised refers mainly to the fact that class information is a decisive factor in determining the expansion criterion. As we shall describe in the next section, though, class information can be exploited even in what we term the unsupervised expansion process. The reason for not always using the supervised expansion mode when class information is available is simply explained by the two different objectives outlined above: if insight into the structure of the gene expression data is more important than the classification task itself, the unsupervised approach is used even though class information is available.
Each of the two approaches to map expansion, the
unsupervised and the supervised one, is described below in
its own section.

5. The Unsupervised Expansion Process
The unsupervised expansion is based on the detection of the neurons with a large local error, referred to as the unresolved neurons. A neuron is considered unresolved if its local error $LE_i$ exceeds a threshold value, denoted by the parameter NodeErrorForConsideringUnresolved. Denote by $S_i$ the set of gene expression profiles $\mathbf{p}$ mapped to node $i$. Also, let $\mathbf{w}_i$ be the weight vector of node $i$, which corresponds to the average expression profile of $S_i$. Then the local error $LE_i$ is defined as:

$$LE_i = \sum_{\mathbf{p} \in S_i} \left\| \mathbf{p} - \mathbf{w}_i \right\|^2 .$$
The local error is commonly used for implementing dynamically growing schemes [1, 17]. However, the peculiarities of the gene expression data motivated two significant modifications to the classic local error measure. Specifically:
1. Instead of the local error measure we use the average local error $AV_i$ per pattern, i.e.

$$AV_i = \frac{LE_i}{|S_i|} \qquad (1)$$

This measure does not increase when many similar patterns are mapped to the same node. Therefore, the objective of assigning all the functionally similar genes to the same node is more easily achievable, even when there are many such genes. In contrast, the accumulated local error increases monotonically as more genes are mapped to the same node. This in turn can cause an undesired spreading of functionally similar genes to different nodes.
2. The second provision applies when we have class information available (either complete or partial) and we want to exploit it in order to improve the expansion. The local error that accumulates at a winner node is amplified by a factor that is inversely proportional to the square root of the frequency ratio $r_c$ of its corresponding class $c$. Specifically, let

$$r_c = \frac{\#\text{ patterns of class } c}{\#\text{ total patterns}}$$

be the frequency ratio of class $c$. Then the amplification factor is $1/\sqrt{r_c}$. Therefore, the errors on the low frequency classes account for more. As a consequence, the representation of these low frequency classes is improved. We should note that these classes are usually of the most biological significance. The utilization of the square root prevents the overrepresentation of the very low frequency classes (e.g. if class A is 100 times less frequent than class B, it is amplified only 10 times more). The error measure computed after this additional class frequency dependent weighting is called the Class Frequency Average Local Error (CFALE). In the absence of class information the CFALE denotes the same quantity as the average local error $AV_i$ defined above with equation (1).
This provision also confronts, to some extent, the serious problem of the creation by noise of false positives for the low frequency classes. Probabilistically, most of the noisy patterns will belong to the high frequency classes. However, the effect of these erroneously classified patterns will be attenuated significantly, because they are derived from the high frequency classes. The final result is an enhanced robustness to noisy patterns.
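A sketch of the CFALE measure of one node, assuming the frequency ratio r_c of each mapped pattern's class has been precomputed (with r_c = 1 for all patterns in the absence of class information, which reduces the CFALE to the average local error AV_i):

#include <cmath>
#include <vector>

// CFALE of node i: class-frequency-weighted squared distances of the
// profiles in S_i to the node weight vector w_i, averaged per pattern.
double cfale(const std::vector<std::vector<double>>& profiles,  // S_i
             const std::vector<double>& classFreqRatios,        // r_c per profile
             const std::vector<double>& w) {                    // w_i
    if (profiles.empty()) return 0.0;
    double sum = 0.0;
    for (std::size_t p = 0; p < profiles.size(); ++p) {
        double err = 0.0;
        for (std::size_t t = 0; t < w.size(); ++t) {
            double d = profiles[p][t] - w[t];
            err += d * d;                            // ||p - w_i||^2
        }
        sum += err / std::sqrt(classFreqRatios[p]);  // amplify rare classes
    }
    return sum / profiles.size();                    // average per pattern
}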

Nodes that are selected as winners for very few (usually one or two) training patterns, termed uncolonized nodes, are not deleted by our scheme, although they probably correspond to noisy outliers. The gene expression patterns that are consistently (three times or more) mapped to uncolonized nodes are very unique: they can either be artifacts or, if not, they have the potential to provide biological knowledge, and they are therefore amenable to further consideration. These patterns are therefore marked and isolated for further study.
Nodes that are not selected as winners for any pattern are removed from the map in order to keep it compact.

The steps of the unsupervised expansion process are as
follows:

<Unsupervised Expansion Phase:>
U.1. Computation of the CFALE measures for every node i.
repeat
U.2.     let i = the node with the maximum CFALE measure
U.3.     if IsBoundaryNode(i) then
             // expand at the boundary nodes
U.4.         JoinSmoothlyNeighbours(i)
U.5.     elseif IsNearBoundaryNode(i) then
U.6.         RippleWeightsToNeighbours(i)
U.7.     else InsertWholeColumn(i);
         endif
U.8.     Reset the local error measures.
U.9.     Re-execute the Training Run Adaptation Phase for the expanded map by presenting all the training patterns.
until not RandomLikeClustersRemain();











Figure 1 Illustration of the weight initialization of a new node with the function JoinSmoothlyNeighbours(). The new node $(r, c-1)$ is allocated to join the map from the boundary nodes. The weight of the new node is initialized by computing the average weight $W_{av}$ near the node $(r, c)$, which initiates the expansion, according to the empirical formula:

$$W_{av} = \frac{ h_f\,AN_{r,c+1}\,W_{r,c+1} + v_f\,AN_{r-1,c}\,W_{r-1,c} + v_f\,AN_{r+1,c}\,W_{r+1,c} }{ h_f\,AN_{r,c+1} + v_f\,AN_{r-1,c} + v_f\,AN_{r+1,c} }$$

where $AN_{i,j}$ is a boolean flag that denotes that the node $(i,j)$ has been allocated to the growing structure, and $h_f$ and $v_f$ are the horizontal and vertical factors of weighting across the corresponding dimensions. Then the weight of the new node $N$ is estimated as:

$$W_N = W_{r,c} + \left( W_{r,c} - W_{av} \right).$$

The concept of a direction of weight growing is maintained by weighting more the node in the direction of growth (horizontal in the case illustrated), i.e. $h_f > v_f$; typically $h_f = 1$, $v_f = 0.5$.

We briefly describe below the main issues involved in these steps. The repeat loop controls the sNet-SOM expansion. The criteria for the establishment of the proper level of expansion are described in the section that follows. The function IsBoundaryNode() checks whether a node is a boundary node. Training efficiency and implementation simplicity were the motivations for the decision to expand mostly from the boundary nodes. The expansion of the map at the boundary nodes is straightforward: one to three nodes are created, and the weights of the new nodes are adjusted heuristically to retain the weight flow with the function JoinSmoothlyNeighbours(), whose operation is illustrated in Fig. 1.
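A sketch of this weight initialization, as reconstructed from the caption of Fig. 1 (the neighbour layout and the handling of the allocation flags AN are assumptions based on that caption):

#include <vector>

// Initialize the weights of a new boundary node next to the expansion-
// initiating node (r,c). Wrc is W_{r,c}; Wright, Wup, Wdown are the
// weights of (r,c+1), (r-1,c), (r+1,c); the an* flags are their AN values.
std::vector<double> joinSmoothlyNeighbours(
        const std::vector<double>& Wrc,
        const std::vector<double>& Wright, bool anRight,
        const std::vector<double>& Wup,    bool anUp,
        const std::vector<double>& Wdown,  bool anDown,
        double hf = 1.0, double vf = 0.5) {  // h_f > v_f: growth direction
    double denom = hf * anRight + vf * anUp + vf * anDown;
    if (denom == 0.0) return Wrc;            // no allocated neighbours
    std::vector<double> w(Wrc.size());
    for (std::size_t t = 0; t < w.size(); ++t) {
        double av = (hf * anRight * Wright[t] + vf * anUp * Wup[t]
                     + vf * anDown * Wdown[t]) / denom;   // W_av
        w[t] = Wrc[t] + (Wrc[t] - av);       // extrapolate away from W_av
    }
    return w;
}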
The map configuration is only slightly disturbed when the winner node is not a boundary node but a near-boundary node. A node is considered near-boundary (declared by the function IsNearBoundaryNode()) when the boundary of the map can be reached from this node by traversing at most two nodes in some direction. For a near-boundary node, a percentage (usually 20-50%) of the weight of the winner node is shifted towards the outer nodes (with the function RippleWeightsToNeighbours()). This operation alters the Voronoi regions locally, and usually after a few weight rippling operations the winner node is propagated to a nearby boundary node.
Finally, if the winner is a node that is neither a boundary nor a near-boundary node, the alternative of inserting a whole empty column is used. The rippling of weights is avoided in these cases, because excessive computation times are usually required before the winner propagates from a node placed deep in the map to a boundary node. Instead of inserting whole new columns, we can alternatively insert whole new rows, or we can perform a combination of row and column insertion. The operation of the corresponding InsertWholeColumn() function is illustrated by Figure 2.




















Figure 2 Grow by grid insertion in the direction of largest total error: if

$$\left( E_{i-1,j-1} + E_{i,j-1} + E_{i+1,j-1} \right) > \left( E_{i-1,j+1} + E_{i,j+1} + E_{i+1,j+1} \right),$$

where $E_{i,j}$ is the error measure (i.e. the CFALE) of node $(i,j)$, then insert the new column at the left of column $j$, else insert the new column at the right of column $j$.


Criteria for controlling the sNet-SOM dynamic growing

One of the most critical tasks for the effective analysis of gene expression data with the sNet-SOM is the proper definition of the parameters that control the growing process. The objective is to reach the appropriate level of expansion automatically and then to allow some fine tuning of the resolution level by the molecular biologists.
Systematically, the design of the criteria for stopping the growing process can be approached by evaluating a statistical distance threshold for gene expression patterns, below which two genes can be considered functionally similar. When the average distance between patterns in a cluster drops below this value, the clustering together of these particular gene expression patterns corresponds to nonrandom behavior, and therefore interesting information can be extracted by analyzing them.
To this end, we define a confidence level $a$ from which we derive a threshold $D_{thr}$ for the distance between gene expression patterns. The confidence level has the meaning that the probability of taking two random, unrelated expression profiles as functionally similar (i.e. of allocating them to the same cluster) is lower than $a$ if the distance between them is smaller than the threshold $D_{thr}$.
Obviously, the definition of a statistical confidence level would only be possible if the distribution of the distance between random expression vectors were known. Practically, although the distribution is unknown, it is easy to approximate it. Specifically, we shuffle the experiment points of every expression profile randomly, both by gene and by experiment point. This randomization destroys the correlation between the different profiles, while it retains the other characteristics of the data set (e.g. the ranges and the histogram distribution of the values). In this way, we compute an approximation of the distribution of the distance between random patterns.
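A sketch of this randomization, assuming the Manhattan distance (the shuffle here is within each profile, i.e. by experiment point; an additional shuffle across genes can be applied in the same way):

#include <algorithm>
#include <cmath>
#include <random>
#include <vector>

// Approximate the null distribution of distances between unrelated
// profiles by shuffling the experiment points of every profile.
std::vector<double> nullDistanceDistribution(
        std::vector<std::vector<double>> profiles,  // copied, then shuffled
        std::mt19937& rng) {
    for (auto& p : profiles)
        std::shuffle(p.begin(), p.end(), rng);      // destroys correlations
    std::vector<double> dists;
    for (std::size_t a = 0; a < profiles.size(); ++a)
        for (std::size_t b = a + 1; b < profiles.size(); ++b) {
            double d = 0.0;
            for (std::size_t t = 0; t < profiles[a].size(); ++t)
                d += std::fabs(profiles[a][t] - profiles[b][t]);
            dists.push_back(d);                     // Manhattan distance
        }
    return dists;  // histogram these to locate the random band [v_l, v_h]
}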
It is evident that a larger (smaller) correlation between genes corresponds to a smaller (larger) Euclidean or Manhattan distance. Assume that we have chosen the Manhattan distance measure. Then, as Figure 3 illustrates, the distances lying in the interval $[v_l, v_h]$ are considered random. Also, for distances smaller than $v_l$ a positive correlation between the genes is implied, while it is reasonable to assume that the converse holds for distances larger than $v_h$.

Figure 3 The results of the data shuffling illustrate that the distances between the randomized data occupy a distinct distribution; the thresholds $v_l$ and $v_h$ delimit the band of distances considered random. For the gene expression data positive correlation is favored, while for the randomized data the distribution has a normal form.

The distributions of the randomized and the original gene expression patterns displayed in Figure 3 are used to implement the criteria on which the function RandomLikeClustersRemain() is based. This function evaluates the randomness of the genes allocated to one cluster by computing all the pairwise distances between them. If a considerable number of these distances are random according to the specified significance level, then the cluster is considered to contain unrelated genes, and therefore further decomposition is required. The percentage of random pairwise distances above which we consider the cluster random is specified empirically at a value of 5%. Clearly, the smaller this percentage parameter, the larger the decomposition level becomes. The aforementioned value (i.e. 5%) produces well behaved, from a biological perspective, extensions of the map.
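A sketch of the per-cluster test behind RandomLikeClustersRemain(), assuming the random band [v_l, v_h] has been estimated from the shuffled data as above:

#include <cmath>
#include <vector>

// A cluster is "random-like" when more than maxRandomFraction (5% in the
// paper) of its pairwise Manhattan distances fall in the random band.
bool clusterIsRandomLike(const std::vector<std::vector<double>>& cluster,
                         double vl, double vh,
                         double maxRandomFraction = 0.05) {
    int total = 0, randomLike = 0;
    for (std::size_t a = 0; a < cluster.size(); ++a)
        for (std::size_t b = a + 1; b < cluster.size(); ++b) {
            double d = 0.0;
            for (std::size_t t = 0; t < cluster[a].size(); ++t)
                d += std::fabs(cluster[a][t] - cluster[b][t]);
            ++total;
            if (d >= vl && d <= vh) ++randomLike;   // random distance
        }
    return total > 0 &&
           static_cast<double>(randomLike) / total > maxRandomFraction;
}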

6. The Supervised Expansion Process

The supervised expansion is based on the computation of the class assignment for each node i and of a parameter HN_i characterizing the entropy of this assignment. This parameter is derived according to Equation (2), which is discussed below. An advantage of the entropy is that it is relatively insensitive to the overrepresentation of classes, i.e. independently of how many patterns of a class are mapped to the same node, if the node does not represent other classes its entropy is zero.

The expansion phase consists of the following steps:

<Supervised Expansion Phase:>
S.1. Computation of the class labels and of the entropies HN_i for the map nodes. The ambiguity of the class assignment for the genes of node i is quantified by HN_i.
repeat
S.2.     Evaluation of the map over the whole training set in order to compute the approximation performance CurrentApproximationPerformance.
S.3.     if CurrentApproximationPerformance < ThresholdOnApproximationPerformance then
             // resolve better the difficult regions of the state space on
             // which classification decisions cannot be deduced easily
S.3.1.       let i = the node of highest ambiguity (i.e. largest entropy parameter).
S.3.2.       if IsBoundaryNode(i) then
                 JoinSmoothlyNeighbours(i)
             elseif IsNearBoundaryNode(i) then
                 RippleWeightsToNeighbours(i)
             else InsertWholeColumn(i);
             endif
             Reset the node entropy measures.
             Apply the Map Adaptation phase to the new expanded map.
         endif
until CurrentApproximationPerformance > ThresholdOnApproximationPerformance
S.4. Generate training and testing sets for the supervised expert. Further supervised training will be performed with these sets by the supervised learning algorithm in order to better resolve the ambiguous parts of the state space.

As already mentioned, the objectives of the supervised expansion differ from those of the unsupervised one. While the unsupervised expansion aims to insert nodes in order to detect interesting clusters of genes, the supervised extension concentrates on the task of revealing the class decision boundaries.
The supervised expansion exploits well the topological ordering that the basic SOM provides, and increases the resolution of the representation over the regions of the state space that lie near class boundaries. At this point, it should be emphasized that simply increasing the SOM size with the adaptive extension algorithm until each neuron represents a class unambiguously (i.e. zero HN_i for all the nodes) yields a SOM configuration that, although it fits the training set, fails to generalize well.
The ambiguous neurons, i.e. those neurons for which the uncertainty of the class assignment is significant, are identified with the entropy criterion. The dynamic expansion phase of the sNet-SOM is executed until the approximation performance reaches the required level. Afterwards, training and testing sets are created for the supervised expert. These sets consist only of the patterns that are represented by the ambiguous neurons. These neurons correspond to state space regions on which classification decisions cannot be deduced easily.
The classification task proceeds by feeding the pattern to the sNet-SOM. If the winning neuron is not ambiguous, the sNet-SOM classifies by using the class of the winning neuron. In the opposite case, the supervised expert is used to perform the classification decision.
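The decision rule itself is simple; a sketch, with the supervised expert abstracted as a callable:

#include <functional>

// Hybrid classification: an unambiguous winner classifies directly; an
// ambiguous winner defers the decision to the supervised expert.
struct WinnerInfo { int classLabel; bool ambiguous; };

int classify(const WinnerInfo& winner,
             const std::function<int()>& supervisedExpert) {
    return winner.ambiguous ? supervisedExpert() : winner.classLabel;
}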
The assignment of a class label to each neuron of the sNet-SOM is performed according to a majority-voting scheme [20]. This scheme acts as a local averaging operator defined over the class labels of all the patterns that activate that neuron as the winner (and accordingly are located in the neighborhood of that neuron). The typical majority-voting scheme considers one vote for each winning occurrence. An alternative, more "analog", weighted majority voting scheme weights each vote by a factor that decays with the distance of the voting pattern from the winner (i.e. the larger the distance, the weaker the vote). The averaging operation of the majority and weighted majority voting schemes effectively attenuates the artifacts of the training set patterns. As noted, in order to enhance the representation of rare gene expression patterns, we amplify the vote of each pattern with a coefficient that is proportional to the inverse of the frequency of appearance of its class.
In the context of the sNet-SOM the utilization of either majority or weighted majority voting is essential. These schemes allow the degree of class discrepancy for a particular neuron to be readily estimated. Indeed, by counting the votes at each SOM neuron for every class, an entropy criterion that quantifies the uncertainty of the class label of neuron $m$ can be directly evaluated, as [16]:

$$HN(m) = - \sum_{k=1}^{N_c} p_k \log p_k, \qquad p_k = \frac{V_k}{V_{total}} \qquad (2)$$

where $N_c$ denotes the number of classes and $p_k$ is the ratio of the votes $V_k$ for class $k$ to the total number of votes $V_{total}$ for neuron $m$.
Clearly, the entropy $HN(m)$ is zero for unambiguous neurons and increases as the uncertainty about the class label of the neuron increases. Its upper bound is $\log(N_c)$, which corresponds to the situation where all classes are equiprobable (i.e. the voting mechanism does not favor a particular class). Consequently, within the framework posed by these voting schemes, the regions of the SOM that are placed at ambiguous regions of the state space can be easily identified. For these regions, the supervised expert is designed and optimized for obtaining adequate generalization performance.



7. Results and Discussion
We have applied the sNet-SOM to analyze publicly available microarray expression data from the budding yeast Saccharomyces cerevisiae. This fully sequenced organism was studied during the diauxic shift, the mitotic cell division cycle, sporulation, and temperature and reducing shocks, by using microarrays containing essentially every Open Reading Frame (ORF). The data set on which we performed extensive experiments consists of 2467 genes for which functional annotation currently exists in the Saccharomyces Genome Database. The weighted K-nearest neighbors imputation method presented in [25] is applied in order to fill in the missing values systematically.
Microarray gene expression data sets are large and complex, contain many attributes, and have an unknown internal structure. For this reason, gaining insight into the structure of the data, rather than the classification of the data itself, is the initial objective of most analysis methods. The sNet-SOM meets this objective by:
- Achieving high quality and computationally efficient clustering of the gene expression profiles with the exploitation of either supervised or unsupervised clustering criteria.
- Offering extensive visualization capabilities with the irregular two-dimensional growable grid that the basic structure provides, which can be complemented with Sammon's nonlinear mapping [21].
The sNet-SOM is not only a clustering tool but additionally a classification tool, although in this case the sNet-SOM does not claim to compete directly with capable supervised models like the Radial Basis Support Vector Machine [16, 26, 6]. The sNet-SOM rather aims to complement them by reducing the complexity of the problem that remains for the pure supervised solution. Taking into account the size and the complexity of the gene expression data set, this reduction proves essential. The results of the last two rows of Table 1 demonstrate this computational benefit.

Learning Model                                            Training Time
SOM 5X5                                                   8 min
SOM 10X10                                                 65 min
Unsupervised sNet-SOM                                     25 min
Supervised sNet-SOM                                       28 min
Unsupervised sNet-SOM with column insertion               19 min
Supervised sNet-SOM with column insertion                 23 min
SVM                                                       3 hours 15 min
Supervised sNet-SOM with SVM for the ambiguous patterns   50 min

Table 1 A comparison of the execution times for the learning of the gene expression data set. The results were obtained with a 450 MHz Pentium III PC.

Table 1 illustrates that the sNet-SOM trains faster than the conventional SOM. The first two rows give the execution times for a SOM with a grid of size 5X5 and 10X10, respectively. Also, the unsupervised sNet-SOM trains slightly faster than the supervised one (3rd and 4th rows). Furthermore, the utilization of column (row) insertion provides further performance advantages (5th and 6th rows). The Support Vector Machine takes the longest time for training on the whole data set (7th row). The implementation approach of [19], as implemented with the SVMLight software package, was used for the SVM solution. Finally, the supervised sNet-SOM combined with the SVM resolution of the difficult parts of the state space obtains significantly better learning times without sacrificing the quality of the results. The SVM classification results are similar to those published in [6] and are therefore not repeated here.
The supervised phase was trained with the same functional classes as in [6]. This allows us to perform some comparisons relating the performance of the sNet-SOM to that of other methods. These classes are summarized in Table 2. The functional classifications were obtained from the Munich Information Center for Protein Sequences (MIPS) yeast genome database (http://www.mips.biochem.mpg.de/proj/yeast).


1. Tricarboxylic-acid pathway (TCA)
2. Respiration-chain complexes (Resp)
3. Cytoplasmic ribosomal proteins (Cyto)
4. Proteasome (Proteas)
5. Histones (Hist)
6. Helix-turn-helix (HTH)
Table 2 Functional classes used for supervised sNet-SOM training. The tricarboxylic-acid pathway, also known as the Krebs cycle, consists of genes that encode enzymes that break down pyruvate (produced from glucose) by oxidation. The respiration-chain complexes perform oxidation-reduction reactions that capture the energy present in NADH through electron transport and the chemiosmotic synthesis of ATP. The cytoplasmic ribosomal proteins are a class of proteins required to make the ribosome. The proteasome consists of proteins that perform the degradation of other proteins. Histones interact with the DNA backbone to form nucleosomes, which are an essential part of the chromatin of the cell. Finally, the helix-turn-helix class is not a functional class: it consists of genes that code for proteins containing the helix-turn-helix structural motif, and it is included as a control class.


In the presented supervised sNet-SOM training experiment we used six functional classes from the MIPS Yeast Genome Database: the tricarboxylic acid (TCA) cycle, respiration, cytoplasmic ribosomes, proteasome, histones and helix-turn-helix (HTH) proteins. The first five classes represent categories of genes that, on biological grounds, are expected to display similar expression characteristics. The sixth class, the helix-turn-helix proteins, is used as a control group. Since there is no biological justification for a mechanism that enforces the same patterns of expression on the genes of this class, we expect these genes to be spread over diverse clusters by the sNet-SOM.
The measure of the Entropy of Class Representation is evaluated over the sNet-SOM nodes in order to quantify the dispersion of the class representation. We expect this measure to be large in the case of HTH, expressing the diversity of the HTH gene expression patterns. Indeed, the results of Table 3 support this intuitive expectation. The high entropy of the Unassigned class is due to the fact that this class accumulates hundreds of other known functional classes and all the unknown ones.
Table 3 The entropies of the class representations at the sNet-SOM configuration of Figure 7.

Class                                                     Entropy
1. Tricarboxylic-acid pathway (TCA)                       1.96
2. Respiration-chain complexes (Resp)                     1.82
3. Cytoplasmic ribosomal proteins (Ribo)                  1.21
4. Proteasome (Proteas)                                   0.51
5. Histones (Hist)                                        0.60
6. Helix-turn-helix (HTH)                                 2.78
7. Rest and functionally unknown classes (Unassigned)     4.13

Many functional classes of genes present a strong similarity of their gene expression patterns. This is evident in Figure 4, where we can observe a high similarity of the gene expression patterns of the class Ribo. The identities (identifier, description and functional class) of the genes of Figure 4 are displayed in Figure 6.
Figure 7 illustrates a snapshot of the progress of the learning process. Each sNet-SOM node is colored according to its predominant class. Also, for each node three numbers are continuously updated. The first is the numeric identifier of the prevailing class. The second depends on the type of training: for supervised training it is the entropy of the node; nodes with high entropy lie near class separation boundaries, and their patterns can be used to train efficient supervised models, like the Support Vector Machines, for the effective discrimination of these parts of the state space. For the unsupervised expansion this number is a resource count (usually the local quantization error) that controls the positions of the dynamic expansion. Finally, the third number is the number of patterns mapped to the node.
Figure 8 displays a listbox with the characteristics of all the nodes of the sNet-SOM. The first two columns are the grid coordinates of the node. The third column is the entropy of the node, and the fourth is the number of genes mapped to the node. Finally, the last column is the name of the class that the node represents. The biologist can obtain further information about the genes mapped to a node by selecting the corresponding element of the listbox. The main parameters that control the operation of the sNet-SOM are user defined with the parameter setting screen illustrated in Figure 9.


8. Conclusions and future work

This work has presented a new self-growing adaptive neural network model for the analysis of genome-wide expression data. This model, called the sNet-SOM, elegantly overcomes the main drawback of most of the existing clustering methods, which impose an a priori specification of the number of clusters. The sNet-SOM determines the number of clusters adaptively, with a dynamic extension process that is able to exploit class information whenever available.
The sNet-SOM grows within a rectangular grid that provides the potential for the implementation of efficient training algorithms. The expansion of the sNet-SOM is based on an adaptive process. This process grows nodes at the boundary of the map, ripples weights from the internal nodes towards the outer nodes of the grid, and inserts whole columns within the map. The growing algorithm is simple and computationally effective. It prefers to grow from the boundary nodes in order to minimize the map readjustment operations. However, a mechanism for whole column (row) insertion is implemented in order to deal with the case where a large map should be expanded around a point deep within its interior. The growing process automatically determines the level of expansion required for the similarity between the gene expression patterns of the same cluster to fulfill a designer-definable statistical confidence level of not being a random event. The voting schemes for the winner node have been designed to amplify the representation of rare gene expression patterns.
A novel feature of the sNet-SOM compared with other related approaches is the potential for the effective exploitation of the available class information with an entropy based measure that controls the dynamic extension process. This process extracts information about the structure of the decision boundaries. A supervised network can additionally be connected in order to resolve better the difficult parts of the state space. This hybrid approach (i.e. unsupervised competitive learning for the simple parts of the state space and supervised learning for the difficult ones) can compete in performance with advanced supervised learning models at a much lower computational cost. In essence, the sNet-SOM can utilize the pure supervised machinery only where it is needed, i.e. for the construction of complex decision boundaries over the regions of the state space where the patterns cannot be separated easily.
Another way to incorporate supervised learning into the sNet-SOM is to use the nodes as Radial Basis Function centers and to model the classification of a gene as a nonlinear function of the gene expression templates represented by the adjacent nodes. This approach resembles qualitatively the supervised harvesting approach of [15]. The node average profiles can be used as inputs to a supervised phase. This reduces the redundancy of the information and prevents an overfitting of the training set. Proper parameters of these centers can be estimated by heuristic criteria like signal counters, local errors, and node entropies, which provide local information of much importance.
The sNet-SOM dynamical extension algorithm is similar for
the more usual case in the context of gene expression
analysis, where there is no classification information
available. In this case criteria based on the computation of
local variances or resource counts are implemented.
Moreover, in order to enhance the exploratory potential of
the sNet-SOM for the analysis of the gene expression data,
we have adapted the Sammon distance preserving nonlinear
mapping. The Sammon mapping allows an effective
visualization of the intrinsic structure of the sNet-SOM
codebook vectors even at the unsupervised case. We will
provide an extensive discussion of the application of the
Sammon mapping at the context of sNetSOM for the
effective visualization of gene expression data in a
forthcoming work. Also, another main direction for the
Also, another main direction for the improvement of the
sNet-SOM performance is the incorporation of more
advanced distance metrics into its algorithms, such as the
Bayesian similarity metric proposed in [18].
The incorporation of the presented sNet-SOM dynamic
growing algorithms as a front-end processing stage within
Bayesian network structure learning algorithms [13] is also
an open area for future work.

ACKNOWLEDGEMENTS
The authors wish to thank the Research Committee of the
University of Patras for the partial financial support of this
research with the contract Karatheodoris 2454.

References
[1] Alahakoon Damminda, Halgamuge Saman K., Srinivasan Bala,
"Dynamic Self-Organizing Maps with Controlled Growth for Knowledge
Discovery", IEEE Transactions On Neural Networks, Vol. 11, No. 3, pp
601-614, May 2000.
[2] Azuaje Francisco, "A Computational Neural Approach to Support the
Discovery of Gene Function and Classes of Cancer", IEEE Trans. Biomed.
Eng., Vol. 48, No. 3, March 2001, pp. 332-339
[3] Bezerianos A., Vladutu L., Papadimitriou S., "Hierarchical State
Space Partitioning with the Network Self-Organizing Map for the effective
recognition of the ST-T Segment Change", IEEE Medical & Biological
Engineering & Computing, Vol. 38, 2000, pp. 406-415
[4] Bishop C. M., Neural Networks for Pattern Recognition, Clarendon
Press-Oxford, 1996
[5] Brazma Alvis, Vilo Jaak, "Gene expression data analysis", FEBS
Letters, 480 (2000) 17-24
[6] Brown Michael P.S., Grundy William Noble, Lin David, Cristianini
Nello, Sugnet Charles Walsh, Furey Terrence S., Ares Manuel Jr., Haussler
David, "Knowledge-based Analysis of Microarray Gene Expression
Data By Using Support Vector Machines", Proceedings of the National
Academy of Sciences, Vol. 97, No. 1, pp. 262-267, 2000
[7] Campos Marcos M., Carpenter Gail A., S-TREE: self-organizing trees
for data clustering and online vector quantization, Neural Networks 14
(2001), pp. 505-525
[8] Cheeseman, P., and Stutz, J. (1995) Bayesian Classification
(AutoClass): Theory and results, In Fayyad, U., Piatesky-Shapiro, G.,
Smyth, P., and Uthurusamy, R. editors, Advances in Knowledge Discovery
and Data Mining, pp. 153-180, AAAI Press, Menlo Park, CA
[9] Cheng Guojian and Zell Andreas, "Externally Growing Cell Structures
for Data Evaluation of Chemical Gas Sensors", Neural Computing &
Applications, 10, pp. 89-97, Springer-Verlag, 2001
[10] Cheung Vivian G., Morley Michael, Aguilar Francisco, Massimi Aldo,
Kucherlapati Raju, Childs Geoffrey, "Making and reading microarrays",
Nature genetics supplement, Vol. 21, January 1999
[11] Durbin R., Eddy S., Krogh A., Mitchison G., Biological Sequence
Analysis, Cambridge, University Press, 1998
[12] Eisen Michael B., Spellman Paul T., Patrick O. Brown, and David
Botstein, "Cluster analysis and display of genome-wide expression
patterns", Proc. Natl. Acad. Sci. USA, Vol. 95, pp. 14863-14868,
December 1998
[13] Friedman, N., M. Linial, I. Nachman, and D. Pe'er, Using Bayesian
networks to analyze expression data, J. Comp. Bio. 7, 2000, pp. 601-620
[14] Fritzke Bernd, "Growing Grid - a self organizing network with
constant neighborhood range and adaptation strength", Neural Processing
Letters, Vol. 2, No. 5, pp. 9-13, 1995
[15] Hastie Trevor, Tibshirani Robert, Botstein David, Brown Patrick,
Supervised Harvesting of expression trees, Genome Biology 2001, 2 (1),
http://genomebiology.com/2001/2/I
[16] Haykin S., Neural Networks, Prentice Hall International, Second
Edition, 1999.
[17] Herrero Javier, Valencia Alfonso, and Dopazo Joaquin, A
hierarchical unsupervised growing neural network for clustering gene
expression patterns, Bioinformatics, (2001) Vol. 17, no. 2, pp. 126-136
[18] Hunter Lawrence, Taylor Ronald C., Leach Sonia M., Simon Richard,
GEST: a gene expression search tool based on a novel Bayesian similarity
metric, Bioinformatics, Vol. 17, Suppl 1, pp. 115-122, 2001
[19] Joachims Thorsten, Making Large-Scale SVM Learning Practical,
Advances in Kernel Methods Support Vector Learning, Bernhard
Scholkopf, Christopher J. C. Burges, and Alexander J. Smola (eds), MIT
Press, Cambridge, USA, 1998
[20] Kohonen T., Self-Organizing Maps, Springer-Verlag, Second
Edition, 1997
[21] Pal Nikhil R., Eluri Vijay Kumar, Two Efficient Connectionist
Schemes for Structure Preserving Dimensionality Reduction, IEEE Trans.
Neural Networks, Vol. 9, No. 6, November 1998, pp. 1142-1154
[22] Papadimitriou S., Mavroudi S., Vladutu L., Bezerianos A., Ischemia
Detection with a Self Organizing Map Supplemented by Supervised
Learning, IEEE Trans. on Neural Networks, Vol. 12, No. 3, May 2001, pp.
503-515
[23] Si J., Lin S., Vuong M. A., "Dynamic topology representing
networks", Neural Networks, 13, pp. 617-627, 2000
[24] Tamayo, P., Slonim, D., Mesirov, J., Zhu, Q., Kitareewan, S.,
Dmitrovsky, E., Lander, E.S. and Golub, T.R. (1999) Interpreting patterns
of gene expression with self-organizing maps: methods and application to
hematopoietic differentiation, Proc. Natl. Acad. Sci. USA, 96, pp. 2907-
2912
[25] Troyanskaya Olga, Cantor Michael, Sherlock Gavin, Brown Pat, Hastie
Trevor, Tibshirani Robert, Botstein David, Altman Russ B., Missing value
estimation methods for DNA microarrays, Bioinformatics, Vol. 17, no 6,
2001
[26] Vapnik V. N., Statistical Learning Theory, New York, John Wiley &
Sons, 1998.
[27] Vesanto Juha, Alhoniemi Esa, Clustering of the Self-Organizing
Map, IEEE Transactions on Neural Networks, Vol. 11, No. 3, May 2000,
pp. 586-600






Figure 4 The expression profiles of the genes clustered at an sNet-SOM node of class Ribo. A few patterns of the remaining classes that
present very similar expression profiles also map to this node.


Figure 5 The average expression profile of the genes plotted in Figure 4


Figure 6 The identities of the genes as plotted in Figure 4, from the back of the figure towards its front (in the 3D view). The biologist
can easily extract useful information about which genes of unassigned class present expression profiles similar to those of the genes of
class Ribo.





Figure 7 The outline of the configuration of the growing sNet-SOM is displayed graphically and illustrates the progress of the learning
process to the user. The nodes that represent the Helix-Turn-Helix class are colored blue. It is visually evident that these nodes are
much more dispersed than the differently colored nodes that represent other classes.





Figure 8 The listbox that displays the characteristics of the nodes of the sNet-SOM. The first two columns are the grid coordinates of the
node, the third column is the entropy of the node, and the fourth is the number of genes mapped to the node. Finally, the last column is
the name of the class that the node represents.







Figure 9 The parameter configuration screen allows direct control of the main parameters of the sNet-SOM.
