Gene Expression Analysis with a Dynamically Extended Self-Organized Map that Exploits Class Information

Seferina Mavroudi, Stergios Papadimitriou, Liviu Vladutu, Anastasios Bezerianos

Department of Medical Physics, School of Medicine, University of Patras, 26500 Patras, Greece
tel: +30-61-996115, email: severina@heart.med.upatras.gr, stergios@heart.med.upatras.gr

ABSTRACT

Motivation: Currently the most popular approach to analyse genome-wide expression data is clustering. One of the major drawbacks of most of the existing clustering methods is that the number of clusters has to be specified a priori. Furthermore, pure unsupervised algorithms ignore prior biological knowledge entirely, e.g. there is no simple means to handle genes of known similar function that are allocated to different clusters on the basis of their expression profiles. Moreover, most current tools lack an effective framework for the tight integration of unsupervised and supervised learning for the analysis of high-dimensional expression data.

Results: The paper adapts a novel self-organizing map, called the supervised Network Self-Organized Map (sNet-SOM), to the peculiarities of gene expression data. The sNet-SOM determines the number of clusters adaptively with a dynamic extension process that is able to exploit class information whenever it exists. Specifically, the sNet-SOM accepts available class information to control the dynamic extension process with an entropy criterion. This process extracts information about the structure of the decision boundaries. A supervised network can additionally be connected in order to resolve better the difficult parts of the state space. In the case that no classification is available, a similar dynamic extension is controlled with criteria based on the computation of local variances or resource counts.

The sNet-SOM grows within a rectangular grid that provides effective visualization while at the same time allowing the implementation of efficient training algorithms. The expansion of the sNet-SOM is based on an adaptive process. This process grows nodes at the boundary nodes, ripples weights from the internal nodes towards the outer nodes of the grid, and inserts whole columns within the map. The growing process determines the appropriate level of expansion automatically, with criteria that depend upon whether unsupervised or supervised training is used. For unsupervised training the criterion is that the similarity between the gene expression patterns of the same cluster fulfills a designer-definable statistical confidence level of not being a random event. The supervised mode of training grows the map until criteria defined on approximation/generalization performance are fulfilled. The voting schemes for the winner node have been designed to amplify the representation of rare gene expression patterns.

The results indicate that the sNet-SOM yields performance competitive with other recently proposed approaches to supervised classification at a significantly reduced computational cost, and that it provides extensive exploratory analysis capability within the unsupervised analysis framework. Furthermore, it builds on simple design decisions that are easy to comprehend and computationally efficient.

Availability: The source code of the algorithms presented in the paper can be downloaded from http://heart.med.upatras.gr. The implementation is in Borland C++ Builder 4.0.

Contact: severina@heart.med.upatras.gr, stergios@heart.med.upatras.gr


1. Introduction

The recent development of DNA microarray technology provides the ability to measure the expression levels of thousands of genes in a single experiment [12,6,5]. The interpretation of such massive expression data is a new challenge for bioinformatics and opens new perspectives for functional genomics. A key question within this context is, given some expression data for a gene, whether this gene belongs to a particular functional class (i.e. whether it encodes a protein of interest).

Currently, the most popular analysis of gene expression data, performed in order to provide insight into the structure of the data and to aid the discovery of functional classes, is clustering, i.e. the grouping of genes with similar expression patterns into clusters [12,5]. Such approaches unravel relations between genes and help to deduce their biological role, since genes of similar function tend to display similar expression patterns.

Most of the algorithms developed so far perform the clustering of the expression patterns in an unsupervised manner [12,17,24]. However, genes of similar function frequently become allocated to different clusters. In this case, a purely unsupervised approach is unable to deduce the correct "rule" for the characterization of the gene class. On the other hand, there already exists valuable biological knowledge, manifested in the form of collections of genes known to encode proteins of similar biological function, e.g. genes that code for ribosomal proteins [6].

Some of the clustering algorithms used so far for the clustering of gene expression data include hierarchical clustering [12], K-means clustering, Bayesian clustering [13] and the Self-Organizing Map (SOM) [24]. Besides the fact that most of the widely adopted clustering methods, such as K-means and SOM, ignore existing class information, another major drawback of these methods is that they require an a priori decision on the number and structure of the distinct clusters. Moreover, most of the proposed models do not incorporate flexible means for coupling the unsupervised phase effectively with a complementary supervised phase, in order to benefit the most from both of these approaches.

A major drawback of hierarchical clustering is that although the data points are organized into a strict hierarchy of nested subsets, there is no reason to believe that expression data actually follow a true hierarchical descent like, for example, the evolution of the species [11,24]. Furthermore, decisions made early about grouping points into specific clusters cannot be reevaluated and often adversely affect the result. This latter disadvantage is shared by the dynamic non-fuzzy hierarchical schemes proposed recently [7,17]. Also, the traditional hierarchical clustering schemes suffer from lack of robustness and from nonuniqueness and inversion problems.

Bayesian clustering is a highly structured approach, which imposes a strong prior hypothesis on the data [8]. However, a prior hypothesis on expression data is usually not available.

K-means clustering, on the other hand, imposes no structure at all on the data, proceeds in a local fashion and produces an unorganized collection of clusters that is not conducive to interpretation [12].

On the contrary, the standard SOM algorithm has a number of properties which render it a candidate of particular interest. SOMs can be implemented easily, are fast, robust and scale well to large data sets. They allow one to impose partial structure on the clusters and facilitate visualization and interpretation. In the case that hierarchical information is required, it can be implemented on top of the SOM, as in [27]. However, there is still an inherent requirement of the standard SOM algorithm which constitutes a major drawback: the number of distinct clusters has to be specified a priori, although there is no means to objectively predetermine the optimum number in the case of gene expression data.


Recently, several dynamically extended schemes have been proposed that overcome the limitation of the fixed, non-adaptable architecture of the SOM. Some examples are the Dynamic Topology Representing structures [23], the Growing Cell Structures [14,9], Self-Organized Tree Algorithms [7,17] and the Adaptive Resonance Theory [2]. The presented approach has many similarities to these dynamically extended schemes. However, in contrast to the complexity of these schemes, we built simple algorithms that, through the restriction of growing on a rectangular grid, can be implemented easily, while the training of the models is very efficient. Also, the benefits of the more complex alternatives to the dynamical extension are still retained.

We call the proposed model sNet-SOM, from supervised Network SOM, since although it is SOM based it incorporates many provisions for supervised complementation of learning. These provisions start with the supervised versions of the map growing process and extend to the possibility of integrating a pure supervised model.

Specifically, our clustering algorithm modifies the original SOM algorithm with a dynamic expansion process controlled by an entropy-based measure whenever gene functional class information exists. The latter measure quantifies to what extent the available information about the biological function (i.e. class) of a gene is represented accurately by the cluster (i.e. the SOM node) to which the gene is allocated. Accordingly, the model is adapted dynamically in order to minimize the entropy within the generated clusters. This approach detects effectively the regions where the decision boundaries between different classes lie. At these regions the classification task becomes difficult, and a special supervised network can be connected to the sNet-SOM in order to resolve better at the class boundaries. Usually, only in the case of lack of class information is the dynamic expansion controlled by local variance or resource count criteria. The entropy criterion concentrates on the resolution of the regions characterized by class ambiguity and is therefore more effective.

The sNet-SOM has been designed to detect the appropriate level of expansion automatically. In the unsupervised case, the distance threshold between patterns below which two genes can be considered co-expressed is estimated. Then the map is grown automatically until its nodes correspond to gene clusters with distances that adhere to this limit. In the supervised case, the criteria for stopping the network expansion can be expressed either in terms of the approximation or in terms of the classification performance.

Furthermore, the sNet-SOM overcomes the problem of irrelevant ("flat") profiles, which can populate many more clusters than necessary in the traditional SOM. The solution we adopted is a careful redesign of the voting mechanism.

The paper is outlined as follows: Initially, Section 2 summarizes the microarray expression experiments and the associated data used to evaluate the presented computational learning schemes. Section 3 describes the extensions to the SOM that lead to the sNet-SOM and the overall architecture of the latter. Section 4 deals with the learning algorithms that adapt both the structure and the parameters of the sNet-SOM. The expansion phase of the sNet-SOM learning is described in separate sections, since it is rather complicated and depends on whether the learning is supervised or unsupervised. Specifically, Section 5 elaborates on the details of the expansion phase for the unsupervised case and Section 6 for the supervised one. Section 7 discusses results obtained from an application to yeast expression microarray data. Finally, Section 8 presents the conclusions along with some directions in which further research can proceed for improvements.

2. Microarray expression experiments

Recently, new approaches have been developed for accessing large scale gene expression data. One of the most effective is the DNA microarray technology [10]. In this method, thousands of distinct DNA probes are attached to a microarray. These probes can be Polymerase Chain Reaction (PCR) products or oligonucleotides whose sequences correspond to target genes or Expressed Sequence


Tags (ESTs) of the genome being studied. RNA is extracted from the sample tissue or cells and reverse transcribed into cDNA labeled with fluorescent dyes, which is then allowed to hybridize with the probes on the microarray. The cDNA corresponds to transcripts produced by genes in the samples, and the amount of a particular cDNA sequence present will be in proportion to the level of expression of its corresponding gene. The microarray is washed to remove non-specific hybridization, and the level of hybridization for each probe is calculated. An expression level for the genes corresponding to the probes is derived from these measurements. This level represents a ratio between the expression of the gene under some condition relative to the expression under the reference condition.

Gene expression data obtained in this way are usually arranged in tables whose rows correspond to the genes and whose columns hold the individual expression values of each gene under the particular experimental condition represented by the column. These raw data are characterized by highly asymmetrical distributions, which makes it difficult to apply any distance metric for the assessment of the differences among them. Therefore, the logarithmic transformation is used as a preprocessing step; it expands the scale for small values and compresses it for large values. An additional desirable effect of the logarithmic transformation is that it provides a symmetrical scale around 0.
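A minimal sketch of this preprocessing step (in Python; the actual implementation of the paper is in Borland C++ Builder 4.0, so this and the later code fragments are illustrative sketches with hypothetical names rather than the distributed code):

import numpy as np

# Hypothetical raw table: rows = genes, columns = experimental conditions.
# Each entry is the fluorescence ratio of the test condition relative to
# the reference condition.
raw_ratios = np.array([[4.0, 0.25, 1.0],
                       [2.0, 0.50, 1.0]])

# The base-2 logarithm expands the scale for small ratios, compresses it
# for large ones, and yields a scale symmetric around 0: a 4-fold
# induction (+2) and a 4-fold repression (-2) become equidistant from 0.
log_ratios = np.log2(raw_ratios)
print(log_ratios)  # [[ 2. -2.  0.]
                   #  [ 1. -1.  0.]]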
The gene expression patterns reflect a cell's internal state and microenvironment, creating a molecular "picture" of the cell's state. Thus, DNA microarrays can be used to capture these molecular pictures and to deduce the condition of the cells. Furthermore, since the expression profile of a gene is correlated with its biological role, systematic microarray studies of global gene expression can provide remarkably detailed clues to the functions of specific genes. This is important, since currently fewer than 5% of the functions of the genes in the human genome are known.

3. The sNet-SOM model

The sNet-SOM is based on the standard SOM algorithm, but it is dynamically extendable, so that the number of clusters is controlled by a properly defined measure of the algorithm itself, with no need for any a priori specification. Because all the previously mentioned clustering algorithms are purely unsupervised, they ignore any available a priori biological information. This means not only that existing information is not exploited in order to deduce the correct expression characteristics of the genes that make them part of functional groups, but also that genes known to be erroneously grouped into a cluster cannot be handled.

Following the basic design principle of including existing prior knowledge, we manage to consider simultaneously both gene expression data and class information (whenever available) in the sNet-SOM training algorithms. However, so far class annotation for gene expression data is limited and not always available. In order to account for this case as well, we additionally developed a second, similar algorithm, so that for the two cases the algorithms differ only in the criteria that control the dynamic expansion of the map. Specifically, depending on the availability of class information, we design two variants of the sNet-SOM.

The first variant, the unsupervised sNet-SOM, performs node expansion in the absence of class labels by exploiting either a local variance measure that depends on the SOM quantization performance or node resource counts. These criteria are also used in the Growing Cell Structures (GCS) algorithms for growing cells [14,9]. The convergence criteria are defined by a statistical assessment of the randomness of the distance between gene expression patterns.

The second variant, the supervised sNet-SOM, performs the growing by exploiting the class information with an entropy measure. The dynamic growth is based on the criterion of neuron ambiguity (i.e. uncertainty about class assignment), which is quantified with an entropy measure defined over the sNet-SOM nodes. This approach differs from the local quantization error approach of [1] and the resource value of [14], which grow the map at the nodes accumulating the largest local variances and resource counts, as does the unsupervised sNet-SOM. In the absence of class information


these are reasonable and well performing criteria. However, these measures can be large even when there is no class ambiguity, whereas the entropy measure quantifies the ambiguity directly and objectively. For that reason, for the supervised sNet-SOM the entropy-based growing technique is preferable.

We initially developed the supervised sNet-SOM within the context of an ischemia detection application [22,3]. In that application it is used in combination with capable supervised models in order to maximize the performance of the detection of ischemic episodes. However, the peculiarities of gene expression data made a significant redesign of the algorithms mandatory. Below we discuss the sNet-SOM learning algorithms in detail.

4. Learning algorithms

The sNet-SOM is initialized with four nodes arranged in a 2x2 rectangular grid and grows nodes to represent the input data. Weight values of the nodes are self-organized according to a new method inspired by the SOM algorithm. The self-organization process maps properties of the original high-dimensional data space onto the lattice consisting of sNet-SOM nodes. The map is expanded to represent the input space by creating new nodes, either from the boundary nodes, performing boundary extension, or by inserting whole columns (or rows) of new units with a column extension (or row extension).

The decision to grow either with the boundary or with the column (row) extension does not limit the potential of the model for dimensionality reduction or its modeling effectiveness, while its implementation is easier and the training becomes more efficient. The latter advantage is important for the large data sets produced by microarray experiments. Usually, new nodes are created by expanding the map at its boundaries. However, when the expansion focus becomes a node placed deep in the interior of a large map, far from the boundary nodes, the adaptive expansion process inserts a whole column of nodes directly adjacent to this node. Therefore, the node directly becomes a boundary node, and the expansion process can generate new nodes in its neighborhood. The implementation of this exception to the general "grow from boundary" rule has accelerated the training of large maps significantly (2 to 4 times faster computation for maps of about 100 nodes).

The growing structure takes the form of a nonuniform rectangular grid. It develops within a large M x N grid that provides "slots" for the new dynamically created nodes. Generally we require N > M, e.g. N = 2M, since the insertion of whole columns results in a faster expansion rate along the columns (note that the opposite holds when we implement the alternative of row insertion instead of column insertion).

A training epoch consists of the presentation of all the training patterns to the sNet-SOM. A training run is defined as the training of the sNet-SOM with a fixed number of neurons at its lattice, i.e. the training between successive node insertions/deletions.

After this preliminary discussion we can now proceed to describe the sNet-SOM learning algorithms in more detail. The top-level sNet-SOM learning algorithm is the same for both the unsupervised and the supervised case. In algorithmic form it can be described as:

Top-level sNet-SOM learning algorithm
1. <Initialization phase>
While <global criteria for convergence of learning not satisfied> do
  2. <Training Run Adaptation phase>
  3. <Expansion phase>
End While
4. <Fine Tuning Adaptation phase>

The details of the algorithm, i.e. the initialization, adaptation, expansion and fine tuning phases and the convergence criteria, are described below.
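A minimal Python sketch of how this top-level loop can be organized; the five phase procedures are passed in as callables and correspond to the numbered phases above (all names are our own, not part of the paper's code):

from typing import Callable, Sequence

def train_snet_som(patterns: Sequence,
                   initialize: Callable, converged: Callable,
                   adapt_run: Callable, expand: Callable,
                   fine_tune: Callable):
    """Top-level sNet-SOM learning loop."""
    som = initialize()                 # phase 1: four nodes in a 2x2 grid
    while not converged(som, patterns):
        adapt_run(som, patterns)       # phase 2: training run adaptation
        expand(som, patterns)          # phase 3: expansion (Sections 5 and 6)
    fine_tune(som, patterns)           # phase 4: fine tuning adaptation
    return som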
A. Initialization phase
The weight vectors of the four starting nodes that are arranged in a 2x2 grid are initialized with random numbers


within the domain of the feature values (i.e. of the normalized fluorescence ratio coefficients).

B. Training Run Adaptation phase
The purpose of this phase is to stabilize the current map configuration in order to be able to evaluate its effectiveness and the requirements for further expansion. During this phase, the input patterns are repeatedly presented and the corresponding self-organization actions are performed until the map converges sufficiently. The training run adaptation phase takes the following algorithmic form.

<Training Run Adaptation Phase>:
MapConverged := false;
while MapConverged = false do
  for all input patterns x_k do
    present x_k and adapt the map by applying the map adaptation rules
  endfor
  Evaluate the map training run convergence condition and set MapConverged accordingly
endwhile

Map adaptation rules
The map adaptation rules that govern the processing of each input pattern x_k are as follows:
1. Determination of the weight vector w_i that is closest to the input vector x_k (i.e. of the winner node).
2. Adaptation of the weight vectors w_j only for the four nodes j in the direct neighborhood of the winner i and for the winner itself, according to the following formula:

w_j(k+1) = \begin{cases} w_j(k), & j \notin N_k \\ w_j(k) + n(k)\,\Lambda_k(d(j,i))\,(x_k - w_j(k)), & j \in N_k \end{cases}

where the learning rate n(k), k \in N, is a monotonically decreasing sequence of positive parameters, N_k is the neighborhood at the k-th learning step and \Lambda_k(d(j,i)) is the neighborhood function, implementing different adaptation rates even within the same neighborhood.

The learning rate starts from a value of 0.1 and decreases down to 0.02. These values are specified with the empirical criterion of achieving relatively fast convergence without sacrificing the stability of the map.

The neighborhood function \Lambda_k(d(j,i)) depends on the distance d(j,i) between node j and the winning node i. It decreases monotonically with increasing distance from the winning neuron (i.e. nodes closer to the winner are adapted more), as in the standard SOM algorithm. The initial neighborhood, N_0, includes the entire map. Unlike the standard SOM, these parameters (i.e. N_k, \Lambda_k(d(j,i))) do not need to shrink with time and can be kept constant, i.e. N_k = N_0 and \Lambda_k(d(j,i)) = \Lambda_0(d(j,i)). This is explained by the following: Initially, the neighborhood is large enough to include the whole map. The sNet-SOM starts with a much smaller size than a usual SOM: thus a large neighborhood is not required to train the whole map at the first learning steps (e.g. with 4 nodes initially in the map, a neighborhood of 1 only is required). As training proceeds, during subsequent training epochs, the area defined by the neighborhood becomes localized near the winning neuron, not by shrinking the vicinity radius (as in the standard SOM) but by enlarging the SOM with the dynamic growing.

Usually we use the following simple and efficiently computed formula for the neighborhood function \Lambda (where i_r, i_c denote the row and column of node i respectively):

\Lambda_k(d(j,i)) = \begin{cases} 1, & \text{if } j = i \\ \alpha, \ 0 < \alpha < 1, & \text{if } |i_r - j_r| + |i_c - j_c| = 1 \\ 0, & \text{otherwise} \end{cases}

An alternative rectangular neighborhood that also updates the diagonal nodes with a smaller learning rate yields appropriate results as well:

\Lambda_k(d(j,i)) = \begin{cases} 1, & \text{if } j = i \\ a, \ 0 < a < 1, & \text{if } |i_r - j_r| + |i_c - j_c| = 1 \\ \beta, \ 0 < \beta < a < 1, & \text{if } |i_r - j_r| + |i_c - j_c| = 2 \\ 0, & \text{otherwise} \end{cases}
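A short Python sketch of these adaptation rules, assuming the map weights are held in a dictionary keyed by grid coordinates (all names are illustrative):

import numpy as np

def neighborhood(node, winner, alpha=0.5):
    """First neighborhood formula: 1 for the winner, alpha for the four
    directly adjacent grid nodes, 0 elsewhere."""
    if node == winner:
        return 1.0
    d = abs(node[0] - winner[0]) + abs(node[1] - winner[1])
    return alpha if d == 1 else 0.0

def present_pattern(weights, x, rate=0.1, alpha=0.5):
    """One presentation of pattern x: find the winner node and move the
    winner and its direct neighbors towards x."""
    winner = min(weights, key=lambda n: np.linalg.norm(x - weights[n]))
    for node, w in weights.items():
        h = neighborhood(node, winner, alpha)
        if h > 0.0:
            weights[node] = w + rate * h * (x - w)
    return winner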


Evaluation of the map training run convergence condition
The map training run convergence condition is tested by evaluating the reduction of the total quantization error for the unsupervised case, and of the total entropy for the supervised one, before and after the presentation of all the input patterns (i.e. one training epoch). Specifically, denote by E_b and E_a the errors before and after the presentation of the patterns (the formulation for the entropies is similar). Then the map converges when the relative change of the error between successive epochs drops below a threshold value, i.e.

MapConverged := \left( \frac{|E_b - E_a|}{E_a} < \text{ConvergenceErrorThreshold} \right).

The setting of the ConvergenceErrorThreshold is somewhat empirical, but a value in the range 0.01-0.02 performs well in assuring sufficient convergence without excessive computation.
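In code this test reduces to a one-line predicate (a sketch; the total error is assumed to be accumulated elsewhere):

def map_converged(e_before, e_after, threshold=0.02):
    # Relative change of the total quantization error (or of the total
    # entropy in the supervised case) between successive epochs; 0.01-0.02
    # works well for training runs, about 1e-5 for the fine tuning phase.
    return abs(e_before - e_after) / e_after < threshold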
information about class information can be exploited even in
C. Fine Tuning Adaptation Phase
The fine tuning phase aims to optimize the final sNet-SOM configuration. This phase is similar to the training run adaptation phase described previously, with two differences:
a. The criterion for map convergence is more elaborate. We require a much smaller change of the total quantization error (unsupervised case) or of the total entropy (supervised case) for accepting the condition for map convergence.
b. The learning rate decreases to a smaller value in order to allow fine adjustments to the final structure of the map.
Typically, the ConvergenceErrorThreshold for the fine tuning phase is about 0.00001 and the learning rate is set to 0.01 (or to an even smaller value).

D. Expansion Phase
The dynamic expansion of the sNet-SOM depends on the availability of class labels and is therefore referred to as supervised expansion when class labels are available and as unsupervised expansion when they are not. These processes are described separately below, since they explore different expansion criteria. Moreover, the objective underlying their development is different for each case. The unsupervised expansion has the task of revealing insight into the groups of genes with correlated expression patterns, while the supervised, entropy-based expansion has the objective of reducing the computational requirements for a "pure" supervised solution. The latter objective follows a design principle of the sNet-SOM: the partitioning of a complex learning problem into domains that can be learned effectively with simple and computationally effective unsupervised models and into domains that require the utilization of a capable supervised model, since they are characterized by complex decision boundaries [22].

To avoid misconception, we should note that in our scheme the term "supervised" refers mainly to the fact that class information is a decisive factor in determining the expansion criterion. As we shall describe in the next section though, class information can be exploited even in what we term the "unsupervised expansion process". The reason for not always using the supervised expansion mode when class information is available is simply explained by the two different objectives outlined above: if insight into the structure of the gene expression data is more important than the classification task itself, the unsupervised approach is used even though class information is available.

Each of the two approaches to map expansion, the unsupervised and the supervised one, is described below in its own section.

5. The Unsupervised Expansion Process
The unsupervised expansion is based on the detection of the neurons with large local error, referred to as the unresolved neurons. A neuron is considered unresolved if its local error LE_i exceeds a threshold value, denoted by the parameter NodeErrorForConsideringUnresolved. Denote by S_i the set of gene expression profiles p mapped to node i. Also, let w_i be the weight vector of node i, which corresponds to the average expression profile of S_i. Then the local error LE_i is defined as:
expansion criteria. Moreover, the objective underlying their


LE_i = \sum_{p \in S_i} \| p - w_i \|^2 .

The local error is commonly used for implementing dynamically growing schemes [1,17]. However, the peculiarities of the gene expression data motivated two significant modifications to the classic local error measure. Specifically:

1. Instead of the local error measure we use the average local error AV_i per pattern, i.e.

AV_i = \frac{LE_i}{|S_i|}    (1)

This measure does not increase when many similar patterns are mapped to the same node. Therefore, the objective of assigning all the functionally similar genes to the same node is more easily achievable, even when there are many such genes. In contrast, the accumulated local error increases monotonically as more genes are mapped to the same node. This in turn can cause an undesired spreading of functionally similar genes to different nodes.

2. The second provision applies when we have class information available (either complete or partial) and we want to exploit it in order to improve the expansion. The local error that accumulates at a winner node is amplified by a factor that is inversely proportional to the square root of the frequency ratio r_c of its corresponding class c. Specifically, let

r_c = \frac{\#\,\text{patterns of class } c}{\#\,\text{total patterns}}

be the frequency ratio of class c. Then the amplification factor is r_c^{-1/2}. Therefore, the errors on the low frequency classes count more. As a consequence, the representation of these low frequency classes is improved. We should note that these classes are usually of the most biological significance. The utilization of the square root prevents the overrepresentation of the very low frequency classes (e.g. if class A is 100 times less frequent than B, it is "amplified" only 10 times more). The error measure computed after this additional class-frequency-dependent weighting is called the Class Frequency Average Local Error (CFALE). In the absence of class information the CFALE denotes the same quantity as the average local error AV_i defined above with equation (1).

This provision also confronts, to some extent, the serious problem of the creation of false positives for the low frequency classes by noise. Probabilistically, most of these noisy patterns will belong to the high frequency classes. However, the effect of these erroneously classified patterns will be attenuated significantly, because they are derived from the high frequency classes. The final result is an enhanced robustness to noisy patterns.
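The following Python sketch combines equation (1) with the class-frequency amplification; the data structures are illustrative:

import numpy as np

def cfale(profiles, labels, weight, class_freq):
    """Class Frequency Average Local Error of one node.
    profiles:   expression vectors mapped to the node
    labels:     class of each profile (None if unannotated)
    weight:     the node's weight vector (average expression profile)
    class_freq: dict class -> frequency ratio r_c over the whole data set"""
    if not profiles:
        return 0.0
    total = 0.0
    for p, c in zip(profiles, labels):
        err = np.sum((p - weight) ** 2)       # squared quantization error
        if c is not None:
            err *= class_freq[c] ** -0.5      # amplify rare classes by r_c^(-1/2)
        total += err
    return total / len(profiles)              # average per mapped pattern (eq. 1)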

Nodes that are selected as winners for very few (usually one or two) training patterns, termed uncolonized nodes, are not deleted by our scheme, although they probably correspond to noisy outliers. The gene expression patterns that are consistently (three times or more) mapped to uncolonized nodes are very unique; they can either be artifacts or, if not, they have the potential to provide biological knowledge. Therefore they are amenable to further consideration. These patterns are therefore marked and isolated for further study. Nodes that are not selected as winners for any pattern are removed from the map in order to keep it compact.

The steps of the unsupervised expansion process are as follows:

<Unsupervised Expansion Phase:>
U.1. Computation of the CFALE measures for every node i.
repeat
  U.2. let i = the node with the maximum CFALE measure
  U.3. if IsBoundaryNode(i) then
    // expand at the neighbouring boundary nodes
    U.4. JoinSmoothlyNeighbours(i)
  U.5. elseif IsNearBoundaryNode(i)


  U.6. RippleWeightsToNeighbours(i)
  U.7. else InsertWholeColumn(i);
  endif
  U.8. Reset the local error measures.
  U.9. Re-execute the Training Run Adaptation Phase for the expanded map by presenting all the training patterns.
until not RandomLikeClustersRemain();

We briefly describe below the main issues involved in these steps. The repeat ... until loop controls the sNet-SOM expansion. The criteria for the establishment of the proper level of expansion are described in the section that follows. The function IsBoundaryNode() checks whether a node is a boundary node. Training efficiency and implementation simplicity were the motivations for the decision to expand mostly from the boundary nodes. The expansion of the map at the boundary nodes is straightforward: one to three nodes are created, and the weights of the new nodes are adjusted heuristically to retain the "weight flow" with the function JoinSmoothlyNeighbours(), whose operation is illustrated in Fig. 1.

Figure 1 [schematic of the grid around the expansion-initiating node (r, c), its allocated neighbors (r-1, c), (r+1, c) and (r, c-1) with the vertical and horizontal weighting factors v_f and h_f, and the new node (r, c+1)] Illustration of the weight initialization of a new node with the function JoinSmoothlyNeighbours(). The new node (r, c+1) is allocated to join the map from the boundary nodes. The weight of the new node is initialized by computing the average weight "nearby" the node (r, c), which initiates the expansion, according to the empirical formula:

W_{av,c} = \frac{h_f \, AN_{r,c-1} \, W_{r,c-1} + v_f \, AN_{r-1,c} \, W_{r-1,c} + v_f \, AN_{r+1,c} \, W_{r+1,c}}{h_f \, AN_{r,c-1} + v_f \, AN_{r-1,c} + v_f \, AN_{r+1,c}}

where AN_{i,j} is a boolean flag denoting that the node (i,j) has been allocated to the growing structure, and h_f and v_f are the horizontal and vertical weighting factors across the corresponding dimensions. Then the weight of the new node N is estimated as:

W_N = W_{r,c+1} = W_{r,c} + (W_{r,c} - W_{av,c}).

The concept of "direction" of weight growing is maintained by weighting more the node in the direction of growth (horizontal in the case illustrated), i.e. h_f > v_f, typically h_f = 1, v_f = 0.5.

The map configuration is slightly disturbed when the winner node is not a boundary node but is a near-boundary node. A node is considered near-boundary (declared by the function IsNearBoundaryNode()) when the boundary of the map can be reached from it by traversing at most two nodes in any direction. For a near-boundary node, a percentage (usually 20-50%) of the weight of the winner node is shifted towards the outer nodes (with the function RippleWeightsToNeighbours()). This operation alters the Voronoi regions locally, and usually after a few weight "rippling" operations the winner node is propagated to a nearby boundary node.

Finally, if the winner is a node that is neither a boundary nor a near-boundary node, the alternative of inserting a whole empty column is used. The rippling of weights is avoided in these cases, because usually excessive computation times are required before the winner propagates from a node placed deep in the map to a boundary node. Instead of inserting whole new columns we can alternatively insert whole new rows, or we can perform a combination of row and column insertion. The operation of the corresponding InsertWholeColumn() function is illustrated by Figure 2.
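A sketch of this initialization under the stated assumptions (weights in a dictionary keyed by (row, column), growth towards column c+1, and at least one allocated neighbor of the expanding node):

def new_boundary_weight(weights, r, c, hf=1.0, vf=0.5):
    """Weight for a new node grown at (r, c+1) from the expanding node
    (r, c): a weighted average W_av over the allocated neighbors of (r, c),
    then extrapolation W_N = W(r,c) + (W(r,c) - W_av) to retain the
    'weight flow'."""
    num, den = 0.0, 0.0
    for node, f in (((r, c - 1), hf), ((r - 1, c), vf), ((r + 1, c), vf)):
        if node in weights:            # the AN flag: neighbor already allocated
            num = num + f * weights[node]
            den += f
    w_av = num / den
    return weights[(r, c)] + (weights[(r, c)] - w_av)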


Figure 2 [schematic: a new column of nodes inserted into the grid next to node (i, j), among the error measures E of its neighboring nodes] Grow by grid insertion in the direction of the largest total error, i.e. if

E_{i-1,j-1} + E_{i,j-1} + E_{i+1,j-1} > E_{i-1,j+1} + E_{i,j+1} + E_{i+1,j+1},

where E_{i,j} is the error measure of node (i,j) (i.e. the CFALE), then insert the new column at the left of column j, else insert the new column at the right of column j.
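In code, the direction decision of Figure 2 can be sketched as follows (E is assumed to be a dictionary mapping grid coordinates to CFALE values, with 0 for missing nodes):

def insert_column_side(E, i, j):
    """Compare the summed error measure of the three neighbors on each
    side of node (i, j) and return the side for the new column."""
    left  = sum(E.get((i + di, j - 1), 0.0) for di in (-1, 0, 1))
    right = sum(E.get((i + di, j + 1), 0.0) for di in (-1, 0, 1))
    return 'left' if left > right else 'right'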
Criteria for controlling the sNet-SOM dynamic growing

One of the most critical tasks for the effective analysis of gene expression data with the sNet-SOM is the proper definition of the parameters that control the growing process. The objective is to reach the appropriate level of expansion automatically and then to allow some "fine tuning" of the resolution level by the molecular biologists. Systematically, the design of the criteria for stopping the growing process can be approached by evaluating a statistical distance threshold for gene expression patterns, below which two genes can be considered functionally similar. When the average distance between patterns in a cluster drops below this value, the clustering together of these particular gene expression patterns corresponds to nonrandom behavior, and therefore interesting information can be extracted by analyzing them.

To this end, we define a confidence level a, from which we derive a threshold D_thr for the distance between gene expression patterns. The confidence level a has the meaning that the probability of taking two random unrelated expression profiles as functionally similar (i.e. of allocating them to the same cluster) is lower than a if the distance between them is smaller than the threshold.

Obviously, the definition of a statistical confidence level would only be possible if the distribution of the distance between random expression vectors were known. Practically, although the distribution is unknown, it is easy to approximate it. Specifically, we shuffle randomly the experiment points of every expression profile, both by gene and by experiment point. This randomization destroys the correlation between the different profiles, while it retains the other characteristics of the data set (e.g. ranges and histogram distribution of values). In this way, we compute an approximation of the distribution of the distance between random patterns.

It is evident that larger (smaller) correlation between genes corresponds to smaller (larger) Euclidean or Manhattan distance. Assume that we have chosen the Manhattan distance measure. Then, as Figure 3 illustrates, the distances lying in the interval [v_l, v_h] are considered random. Also, for distances smaller than v_l a positive correlation between genes is implied, while it is reasonable to assume that the converse holds for distances larger than v_h.

Figure 3 [histograms of the pairwise Manhattan distances for the original and for the randomized gene expression data, with the interval limits v_l and v_h marked] The results of the data shuffling illustrate that the distances between the randomized data occupy a distinct distribution. For the gene expression data positive correlation is favored, while for the random data the distribution has a normal form.
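A sketch of this randomization test; the percentile choice for the interval [v_l, v_h] is our illustrative assumption, not a value prescribed by the paper:

import numpy as np

def random_distance_interval(X, n_pairs=10000, seed=0):
    """Approximate the distribution of Manhattan distances between
    unrelated profiles: permute each gene's values across the experiment
    points (destroying inter-profile correlation while keeping the ranges
    and value histograms), then sample pairwise distances.
    X: array of shape (genes, experiments)."""
    rng = np.random.default_rng(seed)
    shuffled = X.copy()
    for row in shuffled:
        rng.shuffle(row)                  # shuffle experiment points per gene
    i = rng.integers(0, len(X), n_pairs)
    j = rng.integers(0, len(X), n_pairs)
    keep = i != j
    d = np.abs(shuffled[i[keep]] - shuffled[j[keep]]).sum(axis=1)
    return np.percentile(d, 5), np.percentile(d, 95)   # (v_l, v_h)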


The distributions of the randomized and of the original gene expression patterns displayed in Figure 3 are used to implement the criteria on which the function RandomLikeClustersRemain() is based. This function evaluates the "randomness" of the genes allocated to one cluster by computing all the pairwise distances between them. If a considerable number of these distances are random according to the specified significance level, then the cluster is considered to contain unrelated genes and therefore further decomposition is required. The percentage of random pairwise distances above which we consider the cluster as random is specified empirically at a value of 5%. Clearly, the smaller this percentage parameter, the larger the decomposition level becomes. The aforementioned value (i.e. 5%) produces well behaved, from a biological perspective, extensions of the map.

6. The Supervised Expansion Process

The supervised expansion is based on the computation of the class assignment for each node i, and of a parameter HN_i characterizing the entropy of this assignment. This parameter is derived according to Equation 2, which is discussed below. An advantage of the entropy is that it is relatively insensitive to the overrepresentation of classes, i.e. independently of how many patterns of a class are mapped to the same node, if the node does not represent other classes its entropy is zero.

The expansion phase consists of the following steps:

S.1. Computation of the class labels and entropies HN_i for the map nodes. The ambiguity of class assignment for the genes of node i is quantified by HN_i.
repeat
  S.2. Evaluation of the map over the whole training set in order to compute the approximation performance CurrentApproximationPerformance.
  S.3. if CurrentApproximationPerformance < ThresholdOnApproximationPerformance then
    // resolve better the difficult regions of the state space on which classification decisions cannot be deduced easily
    S.3.1. let i = the node of highest ambiguity (i.e. largest entropy parameter).
    S.3.2. if IsBoundaryNode(i) then
      join smoothly the neighbours to the map
    elseif node i is near the boundary then
      RippleWeightsToNeighbours(i)
    else InsertWholeColumn(i);
    endif
    Reset the node entropy measures.
    Apply the Map Adaptation phase to the newly expanded map.
  endif
until CurrentApproximationPerformance > ThresholdOnApproximationPerformance
S.4. Generate training and testing sets for the supervised expert. Further supervised training will be performed with these sets by the supervised learning algorithm in order to better resolve the ambiguous parts of the state space.
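Step S.4 can be sketched as follows (a hypothetical helper; the entropy threshold and the train/test split are illustrative assumptions):

import math

def harvest_ambiguous(node_patterns, node_entropy, h_thresh, test_fraction=0.25):
    """Collect the patterns of the high-entropy (ambiguous) nodes into
    training and testing sets for the supervised expert.
    node_patterns: dict node -> list of (pattern, label)
    node_entropy:  dict node -> HN_i"""
    ambiguous = [p for n, pats in node_patterns.items()
                 if node_entropy[n] > h_thresh for p in pats]
    n_test = math.ceil(test_fraction * len(ambiguous))
    return ambiguous[n_test:], ambiguous[:n_test]    # (training set, testing set)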
As already mentioned, the objectives of the supervised expansion differ from those of the unsupervised one. While the unsupervised expansion aims to insert nodes in order to detect interesting clusters of genes, the supervised extension concentrates on the task of revealing the class decision boundaries.

The supervised expansion exploits well the topological ordering that the basic SOM provides and increases the resolution of the representation over the regions of the state space that lie near class boundaries. At this point, it should be emphasized that simply increasing the SOM size with the adaptive extension algorithm until each neuron represents a class unambiguously (i.e., zero HN_i for all the nodes)
repeat
unambiguously a class (i.e., zero HN i for all the nodes)


yields a SOM configuration that, although it fits the training set, fails to generalize well. The ambiguous neurons, i.e. those neurons for which the uncertainty of class assignment is significant, are identified with the entropy criterion. The dynamic expansion phase of the sNet-SOM is executed until the approximation performance reaches the required level. Afterwards, training and testing sets are created for the supervised expert. These sets consist only of the patterns that are represented by the ambiguous neurons. These neurons correspond to state space regions on which classification decisions cannot be deduced easily.

The classification task proceeds by feeding the pattern to the sNet-SOM. If the winning neuron is not ambiguous, the sNet-SOM classifies by using the class of the winning neuron. In the opposite case, the supervised expert is used to perform the classification decision.

The assignment of a class label to each neuron of the sNet-SOM is performed according to a majority-voting scheme [20]. This scheme acts as a local averaging operator defined over the class labels of all the patterns that activate that neuron as the winner (and accordingly are located in the neighborhood of that neuron). The typical majority-voting scheme considers one vote for each winning occurrence. An alternative, more "analog" weighted majority voting scheme weights each vote by a factor that decays with the distance of the voting pattern from the winner (i.e. the larger the distance, the weaker the vote). The averaging operation of the majority and weighted majority voting schemes effectively attenuates the artifacts of the training set patterns. As noted, in order to enhance the representation of rare gene expression patterns, we amplify the vote of each pattern with a coefficient that is proportional to the inverse of the frequency of appearance of its class.

In the context of the sNet-SOM the utilization of either majority or weighted majority voting is essential. These schemes allow the degree of class discrepancy for a particular neuron to be readily estimated. Indeed, by counting the votes at each SOM neuron for every class, an entropy criterion that quantifies the uncertainty of the class label of neuron m can be directly evaluated as [16]:

HN(m) = - \sum_{k=1}^{N_c} p_k \log p_k ,    (2)

where N_c denotes the number of classes and p_k = V_k / V_total is the ratio of the votes V_k for class k to the total number of votes V_total at neuron m.

Clearly, the entropy is zero for unambiguous neurons and increases as the uncertainty about the class label of the neuron increases. The upper bound of HN(m) is log(N_c), and it corresponds to the situation where all classes are equiprobable (i.e. the voting mechanism does not favor a particular class). Consequently, within the framework posed by these voting schemes, the regions of the SOM that are placed at ambiguous regions of the state space can be easily identified. For these regions, the supervised expert is designed and optimized for obtaining adequate generalization performance.
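Equation (2) translates directly into code (a sketch; the votes may carry the inverse-class-frequency amplification described above):

import math

def node_entropy(votes):
    """HN(m) = -sum_k p_k log p_k over the vote mass accumulated at
    neuron m. votes: dict class -> accumulated (possibly weighted) votes."""
    total = sum(votes.values())
    if total == 0.0:
        return 0.0
    h = 0.0
    for v in votes.values():
        if v > 0.0:
            p = v / total
            h -= p * math.log(p)
    return h    # zero for unambiguous nodes, at most log(N_c)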
7. Results and Discussion

We have applied the sNet-SOM to analyze publicly available microarray expression data from the budding yeast Saccharomyces cerevisiae. This fully sequenced organism was studied during the diauxic shift, the mitotic cell division cycle, sporulation, and temperature and reducing shocks, using microarrays containing essentially every Open Reading Frame (ORF). The data set on which we performed extensive experiments consists of 2467 genes for which functional annotation currently exists in the Saccharomyces Genome Database. The weighted K-nearest neighbors imputation method presented in [25] was applied in order to fill in the missing values systematically.

Microarray gene expression data sets are large and complex, contain many attributes and have an unknown internal structure. Consequently, the initial objective of most analysis methods is gaining insight into the structure of the data rather than

the classification of the data itself. The sNet-SOM meets this objective by:
• Achieving high quality and computationally efficient clustering of the gene expression profiles with the exploitation of either supervised or unsupervised clustering criteria.
• Offering extensive visualization capabilities with the irregular two-dimensional growable grid that the basic structure provides, which can be complemented with Sammon's nonlinear mapping [21].

The sNet-SOM is not only a clustering tool but additionally a classification tool, although in this role the sNet-SOM does not claim to compete directly with capable supervised models like the Radial Basis Support Vector Machine [16,26,6]. The sNet-SOM rather aims to complement them by reducing the complexity of the problem that remains for the pure supervised solution. Taking into account the size and the complexity of the gene expression data set, this reduction proves essential. The results of the last two rows of Table 1 demonstrate this computational benefit.

Learning Model                                          | Training Time
SOM 5x5                                                 | 8 min
SOM 10x10                                               | 65 min
Unsupervised sNet-SOM                                   | 25 min
Supervised sNet-SOM                                     | 28 min
Unsupervised sNet-SOM with column insertion             | 19 min
Supervised sNet-SOM with column insertion               | 23 min
SVM                                                     | 3 hours 15 min
Supervised sNet-SOM with SVM for the ambiguous patterns | 50 min

Table 1 A comparison of the execution times for the learning of the gene expression data set. The results were obtained with a 450 MHz Pentium III PC.

Table 1 illustrates that the sNet-SOM trains faster than the conventional SOM. The first two rows are the execution times for a SOM with a grid of size 5x5 and 10x10 respectively. Also, the unsupervised sNet-SOM trains slightly faster than the supervised one (3rd and 4th rows). Furthermore, the utilization of column (row) insertion provides further performance advantages (5th and 6th rows). The Support Vector Machine takes the longest time for training on the whole data set (7th row). The approach of [19], as implemented in the SVMLight software package, was used for the SVM solution. Finally, the supervised sNet-SOM combined with the SVM resolution of the difficult parts of the state space obtains significantly better learning times without sacrificing the quality of the results. The SVM classification results are similar to those published in [6] and are therefore not repeated here.

The supervised phase was trained with the same functional classes as in [6]. This allows us to perform some comparisons relating the performance of the sNet-SOM to that of other methods. These classes are summarized in Table 2. The functional classifications were obtained from the Munich Information Center for Protein Sequences yeast genome database (http://www.mips.biochem.mpg.de/proj/yeast).

1. Tricarboxylic-acid pathway (TCA)
2. Respiration-chain complexes (Resp)
3. Cytoplasmic ribosomal proteins (Cyto)
4. Proteasome (Proteas)
5. Histones (Hist)
6. Helix-turn-helix (HTH)

Table 2 Functional classes used for supervised sNet-SOM training. The tricarboxylic-acid pathway, also known as the Krebs cycle, consists of genes that encode enzymes that break down pyruvate (produced from glucose) by oxidation. The respiration chain complexes perform oxidation-reduction reactions that capture the energy present in NADH through electron transport and the chemiosmotic synthesis of ATP. The cytoplasmic ribosomal proteins are a


class of proteins required to make the ribosome. The proteasome consists of proteins that perform the degradation of other proteins. Histones interact with the DNA backbone to form nucleosomes. These nucleosomes are an essential part of the chromatin of the cell. Finally, the helix-turn-helix class is not a functional class. It consists of genes that code for proteins containing the helix-turn-helix structural motif. This class is included as a control class.

In the presented supervised sNet-SOM training experiment we used six functional classes from the MIPS Yeast Genome Database: tricarboxylic acid (TCA) cycle, respiration, cytoplasmic ribosomes, proteasome, histones and helix-turn-helix (HTH) proteins. The first five classes represent categories of genes that on biological grounds are expected to induce similar expression characteristics. The sixth class, i.e. the helix-turn-helix proteins, is used as a control group. Since there is no biological justification for a mechanism that enforces the genes of this class to the same patterns of expression, we expect these genes to be spread over diverse clusters by the sNet-SOM.

The measure of Entropy of Class Representation is evaluated over the sNet-SOM nodes in order to quantify the dispersion of class representation. We expect this measure to be large in the case of HTH, expressing the diversity of the HTH gene expression patterns. Indeed, the results of Table 3 support this intuitive expectation. The high entropy of the Unassigned class is due to the fact that this class accumulates hundreds of other known functional classes and all the unknown ones.

Class                                                  | Entropy
1. Tricarboxylic-acid pathway (TCA)                    | 1.96
2. Respiration-chain complexes (Resp)                  | 1.82
3. Cytoplasmic ribosomal proteins (Ribo)               | 1.21
4. Proteasome (Proteas)                                | 0.51
5. Histones                                            | 0.60
6. Helix-turn-helix                                    | 2.78
7. Rest and functionally unknown classes (Unassigned)  | 4.13

Table 3 The entropies of the class representations at the sNet-SOM configuration of Figure 7.

Many functional classes of genes present strong similarity of their gene expression patterns. This is evident in Figure 4, where we can observe a high similarity of the gene expression patterns of the class Ribo. The identities (identifier, description and functional class) of the genes of Figure 4 are displayed in Figure 6.

Figure 7 illustrates a snapshot of the progress of the learning process. Each sNet-SOM node is colored according to its predominant class. Also, for each node three numbers are continuously updated. The first is the numeric identifier of the prevailing class. The second depends on the type of training: for supervised training it is the entropy of the node; nodes with high entropy lie near class separation boundaries and their patterns can be used to train efficient supervised models, like the Support Vector Machines, for the effective discrimination of these parts of the state space. For the unsupervised expansion this number is a resource count (usually the local quantization error) that controls the positions of the dynamic expansion. Finally, the third number is the number of patterns mapped to the node.

Figure 8 displays a listbox with the characteristics of all the nodes of the sNet-SOM. The first two columns are the grid coordinates of the node. The third column is the entropy of the node and the fourth is the number of genes mapped to the node. Finally, the last column is the name of the class that the node represents. The biologist can obtain further information about the genes mapped to a node by selecting the corresponding element of the listbox. The main parameters that control the operation of the sNet-SOM are user-defined with a parameter setting screen, illustrated in Figure 9.


of the state space and supervised for the difficult ones) can
compete in performance advanced supervised learning
8. Conclusions and future work models at a much less computational cost. In essence, the
sNet-SOM can utilize the pure supervised machinery only
This work has presented a new self-growing adaptive neural where it is needed, i.e. for the construction of complex
network model for the analysis of genome-wide expression decision boundaries over regions of the state space where
data. This model, called sNet-SOM overcomes elegantly the patterns cannot be separated easily.
main drawback of most of the existing clustering methods Another way to incorporate supervised learning to the sNet-
that impose an a priori specification at the number of SOM is to use the nodes as Radial Basis Function centers
clusters. The sNet-SOM determines adaptively the number and to model the classification of a gene as a nonlinear
of clusters with a dynamic extension process which is able function of the gene expression “templates” represented by
to exploit class information whenever available. the adjacent nodes. This approach resembles qualitatively
The sNet-SOM grows within a rectangular grid that the supervised harvesting approach of [15]. The node
provides the potential for the implementation of efficient average profiles can be used as inputs to a supervised phase.
training algorithms. The expansion of the sNet-SOM is This reduces the redundancy of information and prevents an
based on an adaptive process. This process grows nodes at overfitting of the training set. Proper parameters of these
the boundary nodes, ripples weights from the internal nodes centers can be estimated by heuristic criteria like signal
towards the outer nodes of the grid, and inserts whole counters, local errors, and node entropies providing local
columns within the map. The growing algorithm is simple information of much importance.
and computationally effective. It prefers to grow from the The sNet-SOM dynamical extension algorithm is similar for
boundary nodes in order to minimize the map readjustment the more usual case in the context of gene expression
operations. However, a mechanism for whole column (row) analysis, where there is no classification information
insertion is implemented in order to deal with the case that a available. In this case criteria based on the computation of
large map should be expanded around a point that is deep local variances or resource counts are implemented.
within its interior. The growing process determines Moreover, in order to enhance the exploratory potential of
automatically the appropriate level of expansion in order the the sNet-SOM for the analysis of the gene expression data,
similarity between the gene expression patterns of the same we have adapted the Sammon distance preserving nonlinear
cluster to fulfill a designer definable statistical confidence mapping. The Sammon mapping allows an effective
level of not being a random event. The voting schemes for visualization of the intrinsic structure of the sNet-SOM
the winner node have been designed in order to amplify the codebook vectors even at the unsupervised case. We will
representation of rare gene expression patterns. provide an extensive discussion of the application of the
A novel feature of the sNet-SOM compared with other Sammon mapping at the context of sNetSOM for the
related approaches is the potentiality for the effective effective visualization of gene expression data in a
exploitation of the available class information with an forthcoming work. Also, another main direction for the
entropy based measure that controls the dynamical extension improvement of the sNetSOM performance is the
process. This process extracts information about the incorporation of more advanced distance metrics to its
structure of the decision boundaries. A supervised network algorithms, as the Bayesian one proposed in [18]. The
can be connected additionally in order to resolve better at incorporation of the presented sNet-SOM dynamic growing
the difficult parts of the state space. This hybrid approach algorithms as a front end processing within Bayesian
Another way to incorporate supervised learning into the sNet-SOM is to use the nodes as Radial Basis Function centers and to model the classification of a gene as a nonlinear function of the gene expression "templates" represented by the adjacent nodes. This approach resembles qualitatively the supervised harvesting approach of [15]. The node average profiles can be used as inputs to a supervised phase. This reduces the redundancy of information and prevents overfitting of the training set. Proper parameters of these centers can be estimated with heuristic criteria like signal counters, local errors, and node entropies, which provide important local information.
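A minimal sketch of such an RBF-style supervised phase, with the node average profiles as centers, could take the following form; the structure RbfCenter and the function classScore are hypothetical names introduced only for this illustration.

  #include <cmath>
  #include <cstddef>
  #include <vector>

  // Hypothetical sketch: the score of one class for an expression vector x,
  // modeled as a nonlinear (Gaussian) function of the node average profiles.
  struct RbfCenter {
      std::vector<double> profile; // average expression profile of a node
      double sigma;                // width, e.g. derived from the node's local error
      double weight;               // trained output weight for the class
  };

  double classScore(const std::vector<double>& x,
                    const std::vector<RbfCenter>& centers) {
      double score = 0.0;
      for (const RbfCenter& c : centers) {
          double d2 = 0.0;
          for (std::size_t i = 0; i < x.size(); ++i) {
              const double diff = x[i] - c.profile[i];
              d2 += diff * diff;
          }
          score += c.weight * std::exp(-d2 / (2.0 * c.sigma * c.sigma));
      }
      return score;
  }

The widths and output weights would be fitted during the supervised phase; the sketch only illustrates how the node templates enter the model as centers.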

The sNet-SOM dynamical extension algorithm operates similarly in the more usual case in the context of gene expression analysis, where no classification information is available. In this case, criteria based on the computation of local variances or resource counts are implemented.
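As a sketch of the first of these criteria, the local variance of a node can be estimated as the mean squared distance between its template and the expression patterns mapped onto it; this is an illustrative form, and the implemented criterion may be normalized differently.

  #include <cstddef>
  #include <vector>

  // Mean squared distance between a node's template and the expression
  // patterns mapped onto it; a large value marks the node as a candidate
  // for expansion when no class information is available.
  double localVariance(const std::vector<double>& prototype,
                       const std::vector<std::vector<double>>& mapped) {
      if (mapped.empty()) return 0.0;
      double sum = 0.0;
      for (const std::vector<double>& x : mapped) {
          double d2 = 0.0;
          for (std::size_t i = 0; i < prototype.size(); ++i) {
              const double diff = x[i] - prototype[i];
              d2 += diff * diff;
          }
          sum += d2;
      }
      return sum / static_cast<double>(mapped.size());
  }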
Moreover, in order to enhance the exploratory potential of the sNet-SOM for the analysis of gene expression data, we have adapted the Sammon distance-preserving nonlinear mapping. The Sammon mapping allows an effective visualization of the intrinsic structure of the sNet-SOM codebook vectors even in the unsupervised case. We will provide an extensive discussion of the application of the Sammon mapping in the context of the sNet-SOM for the effective visualization of gene expression data in a forthcoming work.
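For reference, the Sammon mapping minimizes the standard stress measure between the pairwise distances d*_{ij} of the codebook vectors in the input space and the distances d_{ij} of their two-dimensional images:

  E = \frac{1}{\sum_{i<j} d^{*}_{ij}} \sum_{i<j} \frac{\left( d^{*}_{ij} - d_{ij} \right)^{2}}{d^{*}_{ij}}

Minimizing E preserves the pairwise distance structure of the codebook vectors, which is what makes the planar display faithful to the intrinsic structure mentioned above.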
Also, another main direction for the improvement of the sNet-SOM performance is the incorporation of more advanced distance metrics into its algorithms, such as the Bayesian one proposed in [18]. The incorporation of the presented sNet-SOM dynamic growing algorithms as a front-end processing stage within Bayesian network structure learning algorithms [4] is also an open area for future work.

ACKNOWLEDGEMENTS
The authors wish to thank the Research Committee of the University of Patras for the partial financial support of this research with the contract Karatheodoris 2454.

References
[1] Alahakoon D., Halgamuge S. K., Srinivasan B., "Dynamic Self-Organizing Maps with Controlled Growth for Knowledge Discovery", IEEE Transactions on Neural Networks, Vol. 11, No. 3, pp. 601-614, May 2000.
[2] Azuaje F., "A Computational Neural Approach to Support the Discovery of Gene Function and Classes of Cancer", IEEE Transactions on Biomedical Engineering, Vol. 48, No. 3, pp. 332-339, March 2001.
[3] Bezerianos A., Vladutu L., Papadimitriou S., "Hierarchical State Space Partitioning with the Network Self-Organizing Map for the effective recognition of the ST-T Segment Change", Medical & Biological Engineering & Computing, Vol. 38, pp. 406-415, 2000.
[4] Bishop C. M., Neural Networks for Pattern Recognition, Clarendon Press, Oxford, 1996.
[5] Brazma A., Vilo J., "Gene expression data analysis", FEBS Letters, 480, pp. 17-24, 2000.
[6] Brown M. P. S., Grundy W. N., Lin D., Cristianini N., Sugnet C. W., Furey T. S., Ares M. Jr., Haussler D., "Knowledge-based Analysis of Microarray Gene Expression Data By Using Support Vector Machines", Proceedings of the National Academy of Sciences USA, Vol. 97, No. 1, pp. 262-267, 2000.
[7] Campos M. M., Carpenter G. A., "S-TREE: self-organizing trees for data clustering and online vector quantization", Neural Networks, 14, pp. 505-525, 2001.
[8] Cheeseman P., Stutz J., "Bayesian Classification (AutoClass): Theory and results", in Fayyad U., Piatetsky-Shapiro G., Smyth P., Uthurusamy R. (eds), Advances in Knowledge Discovery and Data Mining, pp. 153-180, AAAI Press, Menlo Park, CA, 1995.
[9] Cheng G., Zell A., "Externally Growing Cell Structures for Data Evaluation of Chemical Gas Sensors", Neural Computing & Applications, 10, pp. 89-97, 2001.
[10] Cheung V. G., Morley M., Aguilar F., Massimi A., Kucherlapati R., Childs G., "Making and reading microarrays", Nature Genetics Supplement, Vol. 21, January 1999.
[11] Durbin R., Eddy S., Krogh A., Mitchison G., Biological Sequence Analysis, Cambridge University Press, 1998.
[12] Eisen M. B., Spellman P. T., Brown P. O., Botstein D., "Cluster analysis and display of genome-wide expression patterns", Proc. Natl. Acad. Sci. USA, Vol. 95, pp. 14863-14868, December 1998.
[13] Friedman N., Linial M., Nachman I., Pe'er D., "Using Bayesian networks to analyze expression data", Journal of Computational Biology, 7, pp. 601-620, 2000.
[14] Fritzke B., "Growing Grid - a self-organizing network with constant neighborhood range and adaptation strength", Neural Processing Letters, Vol. 2, No. 5, pp. 9-13, 1995.
[15] Hastie T., Tibshirani R., Botstein D., Brown P., "Supervised Harvesting of expression trees", Genome Biology, 2 (1), 2001, http://genomebiology.com/2001/2/I
[16] Haykin S., Neural Networks, Prentice Hall International, Second Edition, 1999.
[17] Herrero J., Valencia A., Dopazo J., "A hierarchical unsupervised growing neural network for clustering gene expression patterns", Bioinformatics, Vol. 17, No. 2, pp. 126-136, 2001.
[18] Hunter L., Taylor R. C., Leach S. M., Simon R., "GEST: a gene expression search tool based on a novel Bayesian similarity metric", Bioinformatics, Vol. 17, Suppl. 1, pp. 115-122, 2001.
[19] Joachims T., "Making Large-Scale SVM Learning Practical", in Scholkopf B., Burges C. J. C., Smola A. J. (eds), Advances in Kernel Methods - Support Vector Learning, MIT Press, Cambridge, USA, 1998.
[20] Kohonen T., Self-Organizing Maps, Springer-Verlag, Second Edition, 1997.
[21] Pal N. R., Eluri V. K., "Two Efficient Connectionist Schemes for Structure Preserving Dimensionality Reduction", IEEE Transactions on Neural Networks, Vol. 9, No. 6, pp. 1142-1154, November 1998.
[22] Papadimitriou S., Mavroudi S., Vladutu L., Bezerianos A., "Ischemia Detection with a Self-Organizing Map Supplemented by Supervised Learning", IEEE Transactions on Neural Networks, Vol. 12, No. 3, pp. 503-515, May 2001.
[23] Si J., Lin S., Vuong M. A., "Dynamic topology representing networks", Neural Networks, 13, pp. 617-627, 2000.
[24] Tamayo P., Slonim D., Mesirov J., Zhu Q., Kitareewan S., Dmitrovsky E., Lander E. S., Golub T. R., "Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation", Proc. Natl. Acad. Sci. USA, Vol. 96, pp. 2907-2912, 1999.
[25] Troyanskaya O., Cantor M., Sherlock G., Brown P., Hastie T., Tibshirani R., Botstein D., Altman R. B., "Missing value estimation methods for DNA microarrays", Bioinformatics, Vol. 17, No. 6, 2001.
[26] Vapnik V. N., Statistical Learning Theory, John Wiley & Sons, New York, 1998.
[27] Vesanto J., Alhoniemi E., "Clustering of the Self-Organizing Map", IEEE Transactions on Neural Networks, Vol. 11, No. 3, pp. 586-600, May 2000.


Figure 4 The expression profiles of the genes clustered at an sNet-SOM node of class Ribo. A few patterns of the remaining classes that present very similar expression profiles also map to this node.


Figure 5 The average expression profile of the genes plotted in Figure 4.


Figure 6 The identities of the genes plotted in Figure 4, ordered from the back of the figure towards its front (in the 3D view). The biologist can easily extract useful information about which genes of unassigned class present expression profiles similar to those of the class Ribo.


Figure 7 The outline of the configuration of the growing sNet-SOM is displayed graphically and illustrates the progress of the learning process to the user. The nodes that represent the Helix-Turn-Helix class are colored blue. It is visually evident that these nodes are much more dispersed than the differently colored nodes that represent the other classes.


Figure 8 The listbox that displays the characteristics of the nodes of the sNet-SOM. The first two columns are the grid coordinates of the node. The third column is the entropy of the node and the fourth is the number of genes mapped to the node. Finally, the last column is the name of the class that the node represents.


Figure 9 The parameter configuration screen allows the user to control directly the main parameters of the sNet-SOM.
