
Generalized Relevance Learning Vector Quantization

Barbara Hammer and Thomas Villmann

March 11, 2002

Abstract. We propose a new scheme for enlarging generalized learning vector quantization (GLVQ) with weighting factors for the input dimensions. The factors allow an appropriate scaling of the input dimensions according to their relevance. They are adapted automatically during training according to the specific classification task, whereby training can be interpreted as stochastic gradient descent on an appropriate error function. This method leads to a more powerful classifier and to an adaptive metric with little extra cost compared to standard GLVQ. Moreover, the size of the weighting factors indicates the relevance of the input dimensions, which yields a scheme for automatically pruning irrelevant input dimensions. The algorithm is verified on artificial data sets and the Iris data from the UCI repository. Afterwards, the method is compared to several well known algorithms which determine the intrinsic data dimension on real world satellite image data. Keywords: clustering, learning vector quantization, adaptive metric, relevance determination.

1 Introduction
Self-organizing methods such as the self-organizing map (SOM) or vector quantization (VQ) as introduced by Kohonen provide a successful and intuitive way of processing data for easy access [18]. Provided the data are labeled, an automatic clustering can be learned by attaching labels to the SOM prototypes or by enlarging VQ with a supervised component, yielding so-called learning vector quantization (LVQ) [19, 23]. Various modifications of LVQ exist which ensure faster convergence, a better adaptation of the receptive fields to the optimum Bayesian decision, or an adaptation to complex data structures, to name just a few [19, 29, 33]. A common feature of unsupervised algorithms and LVQ is that the information is provided by the distance structure between the data points, which is determined by the chosen metric. Learning heavily relies on the commonly used Euclidean
University of Osnabrück, Department of Mathematics/Computer Science, Albrechtstraße 28, 49069 Osnabrück, Germany. University of Leipzig, Clinic for Psychotherapy and Psychosomatic Medicine, Karl-Tauchnitz-Straße 25, 04107 Leipzig, Germany.

metric and hence crucially depends on the fact that the Euclidean metric is appropriate for the respective learning task. Therefore data have to be preprocessed and scaled appropriately such that the input dimensions have approximately the same importance for the classification. In particular, the important features for the respective problem have to be found, which is usually done by experts or with rules of thumb. Of course, this may be time consuming and requires prior knowledge which is often not available. Hence methods have been proposed which adapt the metric during training. Distinction sensitive LVQ (DSLVQ), as an example, automatically determines weighting factors for the input dimensions of the training data [26]. The algorithm adapts LVQ3 for the weighting factors according to plausible heuristics. The approaches [17, 32] enhance unsupervised clustering algorithms by the possibility of integrating auxiliary information such as a labeling into the metric structure. Alternatively, one could use information geometric methods in order to adapt the metric, such as in [14].

Concerning SOM, another major problem consists in finding an appropriate topology of the initial lattice of prototypes such that the prior topology of the neural architecture mirrors the intrinsic topology of the data. Hence various heuristics exist to measure the degree of topology preservation, to adapt the topology to the data, to define the lattice a posteriori, or to evolve structures which are appropriate for real world data [2, 7, 20, 27, 37]. In all these tasks the intrinsic dimensionality of the data plays a crucial role, since it determines an important aspect of the optimum neural network: the topological structure, i.e., the lattice for SOM. Moreover, superfluous data dimensions slow down the training for LVQ as well. They may even cause a decrease in accuracy since they add possibly noisy or misleading terms to the Euclidean metric on which LVQ is based. Hence a data dimension as small as possible is desirable in general for the above mentioned methods, for the sake of efficiency, accuracy, and simplicity of neural network processing.

Therefore various algorithms exist which allow estimating the intrinsic dimension of the data: PCA and ICA constitute well established methods which are often used for adequate preprocessing of data and which can be implemented with neural methods [15, 25]. A Grassberger-Procaccia analysis estimates the dimensionality of attractors in a dynamic system [12]. SOMs which adapt the dimensionality of the lattice during training, like the growing SOM (GSOM), automatically determine the approximate dimensionality of the data [2]. Naturally, all adaptation schemes which determine weighting factors or relevance terms for the input dimensions constitute an alternative method for determining the dimensionality: the dimensions which are ranked as least important, i.e., which possess the smallest relevance terms, can be dropped. The intrinsic dimensionality is reached when an appropriate quality measure such as an error term changes significantly. There exists a wide variety of input relevance determination methods in statistics and the field of supervised neural networks, e.g., pruning algorithms for feedforward networks as proposed in [10], the application of automatic relevance determination for the support vector machine or Gaussian processes [9, 24, 31], or adaptive ridge regression and the incorporation of penalizing functions as proposed in [11, 28, 30].
However, note that our focus lies on improving metric based algorithms by an adaptive metric which allows dimensionality reduction as a byproduct. The above mentioned methods do not yield a metric which could be used in self-organizing algorithms but primarily pursue the goal of sparsity and dimensionality reduction in neural network architectures or

alternative classifiers. In the following, we will focus on LVQ since it combines the elegance of simple and intuitive updates in unsupervised algorithms with the accuracy of supervised methods. We will propose a possibility of automatically scaling the input dimensions and hence adapting the Euclidean metric to the specific training problem. As a byproduct, this leads to a pruning algorithm for irrelevant data dimensions and to the possibility of computing the intrinsic data dimension. Approaches like [16] clearly indicate that often a considerable reduction of the data dimension is possible without loss of information. The main idea of our approach is to introduce weighting factors for the data dimensions which are adapted automatically such that the classification error becomes minimal. Like in LVQ, the resulting formulas are intuitive and can be interpreted as Hebbian learning. From a mathematical point of view, the dynamics constitute a stochastic gradient descent on an appropriate error surface. Small factors in the result indicate that the respective data dimension is irrelevant and can be pruned. This idea can be applied to any generalized LVQ (GLVQ) scheme as introduced in [29] or to other plausible error measures such as the Kullback-Leibler divergence. With the error measure of GLVQ, a robust and efficient method results which can push the classification borders near to the optimum Bayesian decision. This method, generalized relevance LVQ (GRLVQ), generalizes relevance LVQ (RLVQ) [3], which is based on simple Hebbian learning and leads to worse and unstable results in the case of noisy real life data. However, like RLVQ, GRLVQ has the advantage of an intuitive update rule and allows efficient input pruning compared to other approaches which adapt the metric to the data involving additional transformations, as proposed in [8, 13, 34], or which depend on less intuitive differentiable approximations of the original dynamics [21]. Moreover, it is based on a gradient dynamics, in contrast to heuristic methods like DSLVQ [26].

We will verify our method on various small data sets. Moreover, we will apply GRLVQ to classify a real life satellite image with millions of data points. As already mentioned, the weighting factors allow us to approximately determine the intrinsic data dimensionality. An alternative method is the growing SOM (GSOM), which automatically adapts the lattice of neurons to the data and hence gives hints about the intrinsic dimensionality as well. We compare our GRLVQ experiments to the results provided by GSOM. In addition, we relate them to a Grassberger-Procaccia analysis. We obtain comparable results concerning the intrinsic dimensionality of our data. In the following, we will first introduce our method GRLVQ, present applications to simple artificial and real life data, and finally discuss the results for the satellite data.

2 The GRLVQ Algorithm


Assume a finite set $X = \{(x^i, y^i) \in \mathbb{R}^n \times \{1, \ldots, C\} \mid i = 1, \ldots, m\}$ of training data is given and the clustering of the data into $C$ classes is to be learned. We denote the components of a vector $x \in \mathbb{R}^n$ by $(x_1, \ldots, x_n)$ in the following. GLVQ chooses a fixed number of vectors in $\mathbb{R}^n$ for each class, so-called prototypes. Denote the set of prototypes by $\{w^1, \ldots, w^K\}$ and assign the label $c_r = c$ to $w^r$ iff $w^r$ belongs to the $c$th class, $c \in \{1, \ldots, C\}$. The receptive field of $w^r$ is defined by
$$R^r = \{\,x \in X \mid \forall w^s \ (s \ne r): \ |x - w^r| \le |x - w^s|\,\}.$$

The training algorithm adapts the prototypes such that for each class $c$, the corresponding prototypes represent the class as accurately as possible. That means the difference of the set of points belonging to the $c$th class, $\{x^i \mid y^i = c\}$, and the union of the receptive fields of the corresponding prototypes, $\bigcup_{c_r = c} R^r$, should be as small as possible for each class. For a given data point $x^i$, denote by $\mu(x^i)$ some function which is negative if $x^i$ is classified correctly, i.e., it belongs to a receptive field $R^r$ with $c_r = y^i$, and which is positive if $x^i$ is classified wrong, i.e., it belongs to a receptive field $R^r$ with $c_r \ne y^i$. Denote by $f$ some monotonically increasing function. The general scheme of GLVQ consists in minimizing the error term
$$\sum_{i=1}^m f(\mu(x^i)) \tag{1}$$
via a stochastic gradient descent.

Given an example $(x^i, y^i)$, the update rule of LVQ2.1 is
$$w^{r^+} := w^{r^+} + \epsilon\,(x^i - w^{r^+}), \qquad w^{r^-} := w^{r^-} - \epsilon\,(x^i - w^{r^-}),$$
where $\epsilon > 0$ is the so-called learning rate, $w^{r^+}$ is the nearest prototype with the correct label, and $w^{r^-}$ is the nearest prototype with an incorrect label. Usually, this update is only performed if $x^i$ falls within a certain window around the decision border. This update can be obtained as a stochastic gradient descent on the error function (1) if we choose $\mu(x^i)$ as $d_{r^+} - d_{r^-}$, $d_{r^+}$ and $d_{r^-}$ being the squared Euclidean distances of $x^i$ to the nearest correct or wrong prototype, respectively, and $f$ as the identity restricted to the window of interest and constant outside.

The concrete choice of $f$ as the identity and $\mu(x^i) = d_r$ if $x^i$ is classified correctly, $\mu(x^i) = -d_r$ if $x^i$ is classified wrong, $d_r$ being the squared Euclidean distance of $x^i$ to the nearest prototype, say $w^r$, would yield the standard LVQ update
$$w^r := \begin{cases} w^r + \epsilon\,(x^i - w^r) & \text{if } x^i \text{ is classified correctly},\\ w^r - \epsilon\,(x^i - w^r) & \text{otherwise}, \end{cases} \tag{2}$$
where $w^r$ is the nearest prototype. Note that the condition on $\mu$ of being negative iff $x^i$ is classified correctly is violated here. Consequently, the resulting error function is highly discontinuous. Hence the usefulness of this error function can be doubted, and the corresponding gradient descent method will likely show unstable behavior.

The choice of $f$ as the sigmoidal function $\mathrm{sgd}(t) = (1 + \exp(-t))^{-1}$ and
$$\mu(x^i) = \frac{d_{r^+} - d_{r^-}}{d_{r^+} + d_{r^-}},$$
where $d_{r^+}$ is the squared Euclidean distance to the next prototype labeled with $y^i$, say $w^{r^+}$, and $d_{r^-}$ is the squared Euclidean distance to the next prototype labeled with a label not equal to $y^i$, say $w^{r^-}$, yields a particularly powerful and noise tolerant behavior, since it combines adaptation near the optimum Bayesian borders like LVQ2.1 while prohibiting the possible divergence of LVQ2.1 as reported in [29]. We refer to the resulting update as GLVQ:
$$w^{r^+} := w^{r^+} + \epsilon \cdot \mathrm{sgd}'(\mu(x^i)) \cdot \frac{2\,d_{r^-}}{(d_{r^+} + d_{r^-})^2}\,(x^i - w^{r^+}), \qquad
w^{r^-} := w^{r^-} - \epsilon \cdot \mathrm{sgd}'(\mu(x^i)) \cdot \frac{2\,d_{r^+}}{(d_{r^+} + d_{r^-})^2}\,(x^i - w^{r^-}). \tag{3}$$
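For concreteness, the following Python sketch implements one stochastic GLVQ step along the lines of (3). The function and variable names (glvq_step, prototypes, labels) are our own illustration rather than part of the original formulation, and constant factors of the gradient are absorbed into the learning rate.

```python
import numpy as np

def sgd(t):
    """Logistic (sigmoidal) function used by GLVQ."""
    return 1.0 / (1.0 + np.exp(-t))

def glvq_step(x, y, prototypes, labels, eps=0.05):
    """One stochastic GLVQ update for a sample (x, y).

    prototypes: float array of shape (K, n); labels: array of shape (K,).
    Assumes at least one prototype with the correct and one with a wrong label.
    """
    d = np.sum((prototypes - x) ** 2, axis=1)                # squared Euclidean distances
    correct = labels == y
    rp = np.flatnonzero(correct)[np.argmin(d[correct])]      # nearest correct prototype
    rm = np.flatnonzero(~correct)[np.argmin(d[~correct])]    # nearest wrong prototype
    dp, dm = d[rp], d[rm]
    mu = (dp - dm) / (dp + dm)
    scale = sgd(mu) * (1.0 - sgd(mu))                        # sgd'(mu)
    prototypes[rp] += eps * scale * (2 * dm / (dp + dm) ** 2) * (x - prototypes[rp])
    prototypes[rm] -= eps * scale * (2 * dp / (dp + dm) ** 2) * (x - prototypes[rm])
    return mu                                                # negative iff x is classified correctly
```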

Obviously, the success of GLVQ crucially depends on the fact that the Euclidean metric is appropriate for the data and that the input dimensions are approximately equally scaled and equally important. Here, we introduce input weights $\lambda = (\lambda_1, \ldots, \lambda_n)$ with $\lambda_j \ge 0$ in order to allow a different scaling of the input dimensions, hence making possibly time consuming preprocessing of the data superfluous. Substituting the Euclidean metric by its scaled variant
$$|x - w|^2_\lambda = \sum_{j=1}^n \lambda_j\,(x_j - w_j)^2, \tag{4}$$
the receptive field of prototype $w^r$ becomes
$$R^r_\lambda = \{\,x \in X \mid \forall w^s \ (s \ne r): \ |x - w^r|_\lambda \le |x - w^s|_\lambda\,\}.$$
Replacing $R^r$ by $R^r_\lambda$ in the error function (1) yields a different weighting of the input dimensions and hence an adaptive metric. Appropriate weighting factors can be determined automatically via a stochastic gradient descent as well. Hence the rule (2), in which the relevance factors $\lambda_j$ of the metric are integrated via the weighted distance, is accompanied by the update
$$\lambda_j := \begin{cases} \lambda_j - \epsilon_\lambda\,(x^i_j - w^r_j)^2 & \text{if } x^i \text{ is classified correctly},\\ \lambda_j + \epsilon_\lambda\,(x^i_j - w^r_j)^2 & \text{otherwise}, \end{cases} \tag{5}$$
for each $j$, where $w^r$ is the nearest prototype and $\epsilon_\lambda > 0$ is the learning rate for the weighting factors. We add a normalization to obtain $\sum_j \lambda_j = 1$ such that we avoid numerical instabilities for the weighting factors. This update constitutes RLVQ as proposed in [3].

We remark that this update can be interpreted in a Hebbian way: Assume the nearest prototype $w^r$ is correct; then those weighting factors are decreased only slightly for which the term $(x^i_j - w^r_j)^2$ is small. Taking the normalization of the weighting factors into account, the weighting factors are increased in this situation iff they contribute to the correct classification. Conversely, those factors are increased most for which the term $(x^i_j - w^r_j)^2$ is large if the classification is wrong. Hence if the classification is wrong, precisely those weighting factors are increased which do not contribute to the wrong classification. Since the error function is not continuous in this case, this yields merely a plausible explanation of the update rule. Hence it is not surprising that the method shows instabilities for large data sets which are subject to noise, as we will see later.

We can apply the same idea to GLVQ. Then the modification of (3) which involves the relevance factors $\lambda_j$ of the metric is accompanied by
$$\lambda_j := \lambda_j - \epsilon_\lambda \cdot \mathrm{sgd}'(\mu_\lambda(x^i)) \cdot
\left(\frac{2\,d^\lambda_{r^+ \!-}}{\big(d^\lambda_{r^+} + d^\lambda_{r^-}\big)^2}\,(x^i_j - w^{r^+}_j)^2
\;-\; \frac{2\,d^\lambda_{r^+}}{\big(d^\lambda_{r^+} + d^\lambda_{r^-}\big)^2}\,(x^i_j - w^{r^-}_j)^2\right)
\tag{6}$$
where $\mu_\lambda$ denotes $\mu$ computed with the weighted distances and $d^\lambda_{r^+ \!-}$ abbreviates $d^\lambda_{r^-}$ in the first term,
for each $j$, where $w^{r^+}$ and $w^{r^-}$ are the closest correct and wrong prototype, respectively, and $d^\lambda_{r^+}$ and $d^\lambda_{r^-}$ the respective squared distances in the weighted Euclidean metric. Again, this is followed by normalization. We term this generalization of RLVQ and GLVQ generalized relevance learning vector quantization, or GRLVQ for short. Note that the update can be motivated intuitively by the Hebb paradigm, taking the normalization into account: it comprises the same terms as in (5). Hence those weighting factors are reinforced most for which the prototype coefficients are closest to the respective data point if this point is classified correctly; otherwise, if the point is classified wrong, those factors are reinforced most for which the coefficients are farthest away. The difference of (6) compared to (5) consists in appropriate situation dependent weightings of the two terms and in the simultaneous update according to the next correct and next wrong prototype. Besides, the update rule obeys a gradient dynamics on the corresponding error function (1), as we show in the appendix. Obviously, the same idea could be applied to any gradient dynamics. We could, for example, minimize a different error function such as the Kullback-Leibler divergence of the distribution which is to be learned and the distribution which is implemented by the vector quantizer. Moreover, this approach is not limited to supervised tasks; we could enlarge unsupervised methods which obey a gradient dynamics, like the neural gas algorithm [20], with weighting factors in order to obtain an adaptive metric.
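A minimal Python sketch of one GRLVQ step, combining the weighted metric (4), the prototype adaptation, and the relevance update (6), might look as follows. The names (grlvq_step, lam for $\lambda$) and the explicit clipping of the relevance factors at zero are our own choices for the illustration, and constant factors are again absorbed into the learning rates.

```python
import numpy as np

def grlvq_step(x, y, prototypes, labels, lam, eps=0.05, eps_lam=0.005):
    """One stochastic GRLVQ update: adapts the two closest prototypes and
    the relevance vector lam (lambda in the text), followed by normalization."""
    diff = prototypes - x                              # shape (K, n)
    d = diff ** 2 @ lam                                # weighted squared distances, cf. (4)
    correct = labels == y
    rp = np.flatnonzero(correct)[np.argmin(d[correct])]    # closest correct prototype
    rm = np.flatnonzero(~correct)[np.argmin(d[~correct])]  # closest wrong prototype
    dp, dm = d[rp], d[rm]
    denom = (dp + dm) ** 2
    mu = (dp - dm) / (dp + dm)
    s = 1.0 / (1.0 + np.exp(-mu))
    scale = s * (1.0 - s)                              # sgd'(mu)
    # prototype updates, cf. (3), with the relevance factors of the weighted metric
    prototypes[rp] += eps * scale * (2 * dm / denom) * lam * (x - prototypes[rp])
    prototypes[rm] -= eps * scale * (2 * dp / denom) * lam * (x - prototypes[rm])
    # relevance update, cf. (6)
    lam -= eps_lam * scale * (2 * dm / denom * diff[rp] ** 2
                              - 2 * dp / denom * diff[rm] ** 2)
    np.clip(lam, 0.0, None, out=lam)                   # keep factors non-negative (our choice)
    lam /= lam.sum()                                   # normalization to sum one
    return mu
```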

3 Relation to previous research


The main characteristics of GRLVQ as proposed in the previous section are as follows: The method allows an adaptive metric via scaling the input dimensions. The metric is restricted to a diagonal matrix; the advantages are the efficiency of the method, the interpretability of the matrix elements as relevance factors, and the related possibility of pruning. The update proposed in GRLVQ is intuitive and efficient, and at the same time a thorough mathematical foundation is available due to the gradient dynamics. As we will see in the next section, GRLVQ provides a robust classification system which is appropriate for real-life data.

Naturally, various approaches in the literature consider the questions of an adaptive metric, input pruning, and dimensionality determination, too. The most similar approach we are aware of is distinction sensitive LVQ (DSLVQ) [26]. This method introduces weighting factors, too, and is based on LVQ3. The main advantages of our iterative update scheme compared to the DSLVQ update are threefold: our update is very intuitive and can be explained with Hebbian learning; our method is more efficient since in DSLVQ each update step requires normalizing twice; and, which we believe is the most important difference, our update constitutes a gradient descent on an error function, hence the dynamics can be mathematically analyzed and a clear objective can be identified. Recently, Kaski et al. proposed two different approaches which allow an adaptive metric for unsupervised clustering if additional information in an auxiliary space is available [17, 32]. Their focus lies on unsupervised clustering and they use the Bayesian framework in order to derive appropriate algorithms. The approach in [17] explicitly adapts the metric, however it needs a model for explaining the auxiliary data. Hence, we cannot apply the method for our purpose, explicit clustering, i.e., developing the model itself. In [32] an explicit model is no longer necessary; however, the method relies on several statistical assumptions and is derived for soft clustering instead of exact LVQ. One could borrow ideas from [32]; alternatively to the statistical scenario, GRLVQ offers another direct, efficient, and intuitive approach. Methods as proposed in [13] and variations thereof allow an adaptive metric for other clustering algorithms like fuzzy clustering. The algorithm in [13] even allows a more flexible metric with non-vanishing entries outside the diagonal; however, such algorithms are naturally less efficient and require, for example, a matrix inversion. In addition, well known methods like RBF networks can be put in the same line since they can provide a clustering with an adaptive metric as well. Commonly, their training is less intuitive and efficient than GRLVQ. Moreover, a more flexible metric which is not restricted to a diagonal matrix no longer suggests a natural pruning scheme.

Apart from the flexibility due to an adaptive metric, GRLVQ provides a simple way of determining which data dimensions are relevant: we can just drop those dimensions with the lowest weighting factors until a considerable increase of the classification error is observed (a small sketch of this procedure follows at the end of this section). This is a common feature of all methods which determine weighting factors describing the metric. Alternatively, one can use general methods for determining the dimensionality of the data which are not fitted to the LVQ classifier. The most popular approaches are probably ICA and PCA, as already mentioned [15, 25]. Alternatively, one could use the above mentioned GSOM algorithm [2]; however, because of its remaining hypercubical structure the results may be inaccurate. Another method is to apply a Grassberger-Procaccia analysis to determine the intrinsic dimension; this method is unfortunately sensitive to noise [12, 38]. A wide variety of relevance determination methods exists in statistics or in the supervised neural network literature, e.g., [9, 10, 11, 24, 28, 30, 31]. These methods mostly focus on the task of obtaining sparse classifications and they do not yield an adaptive metric which could be used in self-organizing metric-based algorithms like LVQ and SOM. Hence a comparison with our method, which primarily focuses on an adaptive metric for self-organizing algorithms, would be interesting, but is beyond the scope of this article.
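The pruning scheme just described can be written down in a few lines: repeatedly drop the dimension with the smallest weighting factor and re-evaluate the classifier until the error rises noticeably. In the following sketch, the helper evaluate(dimensions) and the tolerance value are hypothetical placeholders for whatever error estimate and threshold one prefers.

```python
import numpy as np

def estimate_intrinsic_dim(lam, evaluate, tol=0.02):
    """Drop dimensions in order of increasing relevance until the
    classification error increases by more than tol."""
    order = np.argsort(lam)                    # least relevant dimensions first
    kept = list(range(len(lam)))
    base_error = evaluate(kept)
    for j in order:
        candidate = [k for k in kept if k != j]
        if not candidate or evaluate(candidate) > base_error + tol:
            break                              # significant loss of accuracy: stop pruning
        kept = candidate
    return len(kept), kept                     # estimated intrinsic dimension and kept dimensions
```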

4 Experiments
Artificial data
We first tested GRLVQ on two artificial data sets from [3] in order to compare it to RLVQ. The data comprise three classes with two clusters each and small or large overlap of the clusters, respectively, in two dimensions, as depicted in Fig. 1. We embed the points in a higher-dimensional space as follows: assume $(x_1, x_2)$ is one data point; then we add further dimensions. The first added components equal one of the original components corrupted by Gaussian noise with several different variances, respectively; the remaining added components contain pure noise which is uniformly distributed or Gaussian, respectively. We refer to the two original sets and their noisy counterparts as data 1 to data 4 in the following. In each run, the data are randomly separated into a training and a test set of the same size.

Figure 1: Artificial data sets consisting of three classes with two clusters each and small or large overlap, respectively; only the first two dimensions are depicted.

The learning rate $\epsilon$ for the prototypes is chosen constant; the learning rate $\epsilon_\lambda$ for the weighting factors is chosen considerably smaller. Since the weighting factors are updated in each step, in contrast to the individual prototypes, the learning rate for the weighting terms should be smaller than the learning rate for the prototypes. Pretraining with simple LVQ till the prototypes nearly converge is mandatory for RLVQ; otherwise, the classification error is usually large and the results are not stable. It is advisable to train the prototypes with GLVQ for a few epochs before using GRLVQ as well, in order to avoid instabilities. We use a fixed number of prototypes for each class according to the a priori known distribution. The results on training and test set are comparable in all runs, i.e., the test set accuracy is not worse or only slightly worse compared to the accuracy on the training set. GRLVQ obtains about the same accuracy as RLVQ on all data sets (see Tab. 1) and clearly indicates which dimensions are less important via assigning small weighting factors to them; in these examples the irrelevant dimensions are known. Typical weighting vectors obtained by RLVQ and GRLVQ on the noisy data sets assign clearly dominant values to the first two data dimensions, hence separating the important first two data dimensions from the remaining dimensions, of which the first ones contain some information. This is pointed out via a comparably large third weighting term for the second data set. The remaining four dimensions contain no information at all. However, GRLVQ shows a faster convergence and larger stability compared to RLVQ, in particular if used for noisy data sets with large overlap of the classes, as for data 4; there the separation of the important dimensions is clearer in GRLVQ than in RLVQ. Concerning RLVQ, pre-training with LVQ and small learning rates were mandatory in order to ensure good results; the same situations turn out to be less critical for GRLVQ, although it is advisable to choose the learning rate for the weighting terms an order of magnitude smaller than the learning rate for the prototype update.


           LVQ        RLVQ       GRLVQ
data 1     91 - 96    91 - 96    94 - 97
data 2     81 - 89    90 - 96    93 - 97
data 3     79 - 86    80 - 86    83 - 87
data 4     56 - 70    79 - 86    83 - 86
Table 1: Percentage of correctly classified patterns for the two artificial data sets, without and with additional noisy dimensions (data 1 to data 4), obtained by LVQ, RLVQ, and GRLVQ, respectively.

These results indicate that GRLVQ is particularly well suited for noisy real life data sets. Based on the above weighting factors one can obtain a ranking of the input dimensions and drop all but the first two dimensions without increasing the classification error.
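The training protocol used above (GLVQ-style pretraining of the prototypes, followed by GRLVQ with a clearly smaller learning rate for the weighting terms) can be summarized in a short sketch reusing the grlvq_step function sketched in Section 2. The epoch counts and learning rates below are placeholders of our own choosing, not the values used in the experiments.

```python
import numpy as np

def train_grlvq(X, Y, prototypes, labels, pre_epochs=10, epochs=100,
                eps=0.05, eps_lam=0.005, rng=None):
    """Two-phase schedule: GLVQ-like pretraining with fixed uniform relevances,
    then full GRLVQ with a smaller learning rate for the relevance terms."""
    if rng is None:
        rng = np.random.default_rng(0)
    lam = np.full(X.shape[1], 1.0 / X.shape[1])        # uniform initial relevances (our choice)
    for _ in range(pre_epochs):                        # phase 1: adapt prototypes only
        for i in rng.permutation(len(X)):
            grlvq_step(X[i], Y[i], prototypes, labels, lam, eps, eps_lam=0.0)
    for _ in range(epochs):                            # phase 2: full GRLVQ
        for i in rng.permutation(len(X)):
            grlvq_step(X[i], Y[i], prototypes, labels, lam, eps, eps_lam)
    return prototypes, lam
```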

Iris data
In a second test we applied GRLVQ to the well known Iris data set provided in the UCI repository of machine learning [4]. The task is to predict three classes of plants based on four numerical attributes in 150 instances, i.e., we deal with data points in $\mathbb{R}^4$ with labels in $\{1, 2, 3\}$. Both LVQ and RLVQ obtain a good accuracy on a training and test set if trained with a small number of prototypes for each class. RLVQ shows a slightly cyclic behavior in the limit, the accuracy oscillating between two close values. The weighting factors computed by RLVQ concentrate nearly all weight on the last dimension, indicating that based on the last dimension alone a very good classification would be possible.

If more dimensions were taken into account, a better accuracy would be possible, as reported in the literature; we could not produce such a solution with LVQ or RLVQ. Moreover, a perfect recognition would correspond to overfitting since the data comprise some noise, as reported in the literature. GRLVQ yields a better accuracy on the training as well as the test set and obtains weighting factors which again emphasize the last dimension, indicating that the last dimension is most important, as already found by RLVQ, but that a further dimension contributes to a better accuracy, which has not been pointed out by RLVQ. Note that the result obtained by GRLVQ is in coincidence with results obtained, e.g., with rule extraction from feedforward networks [6].

Satellite data
Finally, we applied the algorithm to a large real world data set: a multi-spectral LANDSAT TM satellite image of the Colorado area.¹ Satellites of the LANDSAT TM type produce pictures of the earth in 7 different spectral bands.
¹ Thanks to M. Augusteijn (University of Colorado) for providing this image.


           mean (train)   variance (train)   mean (test)   variance (test)
LVQ        85.21          0.59               85.2          0.46
RLVQ       86.1           0.18               86.36         0.16
GLVQ       87.32          0.17               87.28         0.1
GRLVQ      91.08          0.11               91.04         0.13

Table 2: Mean percentage of correctly classified patterns and variance over the runs on the satellite data, obtained in a 10-fold cross-validation.

The ground resolution is 30 m for bands 1-5 and band 7; band 6 (the thermal band) has a resolution of only 120 m and, therefore, it is often dropped. The spectral bands represent useful domains of the whole spectrum for detecting and discriminating vegetation, water, rock formations and cultural features [5, 22]. Hence, the spectral information, i.e., the intensity of the bands associated with each pixel of a LANDSAT scene, is represented by a vector in $\mathbb{R}^6$. Generally, the bands are highly correlated [1, 35]. Additionally, the Colorado image is completely labeled by experts; the labels describe different vegetation types and geological formations, whereby the label probability varies in a wide range [36]. The image comprises millions of pixels. We trained RLVQ and GRLVQ with a fixed number of prototypes for each class on a fraction of the data set till convergence; the algorithms converged after a limited number of cycles with the learning rates $\epsilon$ and $\epsilon_\lambda$ chosen as before. RLVQ yields an accuracy of about 86% on the training data as well as the entire data set; however, it does not provide a ranking of the dimensions, i.e., all weighting terms remain close to their uniform initial value. GRLVQ leads to the better accuracy of about 91% on the training set as well as the entire data set and provides a clear ranking of the several data dimensions. See Table 2 for a comparison of the results obtained by the various algorithms.

In all experiments, one dimension is consistently ranked as least important, with a weighting factor close to zero, and the weighting factors approximate a similar vector in several runs. This weighting clearly separates the first two dimensions via small weighting factors. If we prune the three dimensions with the smallest weighting factors, a high accuracy can still be achieved; hence this indicates that the intrinsic data dimension is at most three. Pruning one additional data dimension still allows a good accuracy, hence indicating that the intrinsic dimension may be even lower and that the relevant directions are not parallel to the axes or even curved. These results are visualized in Fig. 2, where the misclassified pixels in the respective cases are colored in black and the other pixels are colored corresponding to their respective class. For comparison we applied a Grassberger-Procaccia analysis and the GSOM approach. The former estimates a low intrinsic dimension, whereas GSOM generates a low-dimensional lattice, both indicating an intrinsic dimension comparable to the one found by GRLVQ. These methods show a good agreement with the drastic loss of information observed if more than three dimensions are pruned with GRLVQ.

Figure 2: Colorado satellite image: the pixels are colored according to the labels; above left: original labeling; above right: GRLVQ without pruning; below left: GRLVQ with pruning of the three least relevant dimensions; below right: GRLVQ with pruning of one further dimension. Misclassified pixels in the GRLVQ-generated images are colored black. (A colored version of the image can be obtained from the authors on request.)

5 Conclusions
The presented clustering algorithm GRLVQ provides a new robust method for automatically adapting the Euclidean metric used for clustering to the data, determining the relevance of the several input dimensions for the overall classifier, and estimating the intrinsic dimension of the data. It reduces the input to the essential dimensions, which is required to obtain optimal network structures. This is an important feature if the network is used to reduce the amount of data passed to subsequent systems in complex data analysis tasks, as can be found in medical applications (image analysis) or satellite remote sensing systems, for example. Here, the reduction of the data to be transferred while preserving the essential information is one of the most important requirements. The GRLVQ algorithm was successfully tested on artificial as well as real world data, including a large and noisy multi-spectral satellite image. A comparison with other approaches validates the results even in real life applications. It should be noted that the GRLVQ algorithm can easily be adapted to other types

of neural vector quantizers, such as neural gas or SOM, to mention just a few. Furthermore, it is clear that if we assume an unknown probability distribution of the labels for a given data set, the variant of GRLVQ discussed here tries to maximize the Kullback-Leibler divergence. Hence, we can state some similarities of our approach to the work of Kaski [17, 32]. Further considerations of GRLVQ should incorporate information theoretic approaches like entropy maximization to improve the capabilities of the network.

References
[1] M. F. Augusteijn, K. A. Shaw, and R. J. Watson. A study of neural network input data for ground cover identification in satellite images. In S. Gielen and B. Kappen, editors, Proc. ICANN'93, Int. Conf. on Artificial Neural Networks, pages 1010-1013, London, UK, 1993. Springer.
[2] H.-U. Bauer and T. Villmann. Growing a Hypercubical Output Space in a Self-Organizing Feature Map. IEEE Transactions on Neural Networks, 8(2):218-226, 1997.
[3] T. Bojer, B. Hammer, D. Schunk, and K. Tluk von Toschanowitz. Relevance determination in learning vector quantization. In Proc. of European Symposium on Artificial Neural Networks (ESANN'01), pages 271-276, Brussels, Belgium, 2001. D facto publications.
[4] C. L. Blake and C. J. Merz. UCI Repository of machine learning databases. Irvine, CA: University of California, Department of Information and Computer Science.
[5] J. Campbell. Introduction to Remote Sensing. The Guilford Press, U.S.A., 1996.
[6] W. Duch, R. Adamczak, and K. Grabczewski. A new method of extraction, optimization and application of crisp and fuzzy logical rules. IEEE Transactions on Neural Networks, 12:277-306, 2001.
[7] B. Fritzke. Growing grid: a self-organizing network with constant neighborhood range and adaptation strength. Neural Processing Letters, 2(5):9-13, 1995.
[8] I. Gath and A. Geva. Unsupervised optimal fuzzy clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11:773-791, 1989.
[9] T. van Gestel, J. A. K. Suykens, B. de Moor, and J. Vandewalle. Automatic relevance determination for least squares support vector machine classifiers. In M. Verleysen, editor, European Symposium on Artificial Neural Networks, pages 13-18, 2001.
[10] Y. Grandvalet. Anisotropic noise injection for input variables relevance determination. IEEE Transactions on Neural Networks, 11(6):1201-1212, 2000.


[11] Y. Grandvalet. Least absolute shrinkage is equivalent to quadratic penalization. In L. Niklasson, M. Boden, and T. Ziemke, editors, ICANN'98, volume 1 of Perspectives in Neural Computing, pages 201-206. Springer, 1998.
[12] P. Grassberger and I. Procaccia. Measuring the strangeness of strange attractors. Physica, 9D:189-208, 1983.
[13] D. Gustafson and W. Kessel. Fuzzy clustering with a fuzzy covariance matrix. In Proceedings of IEEE CDC'79, pages 761-766, 1979.
[14] T. Hofmann. Learning the similarity of documents: An information geometric approach to document retrieval and categorization. In S. A. Solla, T. K. Leen, and K.-R. Müller, editors, Advances in Neural Information Processing Systems, volume 12, pages 914-920. MIT Press, 2000.
[15] A. Hyvärinen and E. Oja. A fast fixed-point algorithm for independent component analysis. Neural Computation, 9(7):1483-1492, 1997.
[16] S. Kaski. Dimensionality reduction by random mapping: fast similarity computation for clustering. In Proceedings of IJCNN'98, pages 413-418, 1998.
[17] S. Kaski. Bankruptcy analysis with self-organizing maps in learning metrics. To appear in IEEE Transactions on Neural Networks.
[18] T. Kohonen. Learning vector quantization. In M. Arbib, editor, The Handbook of Brain Theory and Neural Networks, pages 537-540. MIT Press, 1995.
[19] T. Kohonen. Self-Organizing Maps. Springer, 1997.
[20] T. Martinetz and K. Schulten. Topology representing networks. Neural Networks, 7(3):507-522, 1993.
[21] U. Matecki. Automatische Merkmalsauswahl für Neuronale Netze mit Anwendung in der pixelbezogenen Klassifikation von Bildern. Shaker, 1999.
[22] E. Merenyi. The challenges in spectral image analysis: An introduction and review of ANN approaches. In Proc. of European Symposium on Artificial Neural Networks (ESANN'99), pages 93-98, Brussels, Belgium, 1999. D facto publications.
[23] A. Meyering and H. Ritter. Learning 3D-shape-perception with local linear maps. In Proceedings of IJCNN'92, pages 432-436, 1992.
[24] R. Neal. Bayesian Learning for Neural Networks. Springer, 1996.
[25] E. Oja. Principal component analysis. In M. Arbib, editor, The Handbook of Brain Theory and Neural Networks, pages 753-756. MIT Press, 1995.
[26] M. Pregenzer, G. Pfurtscheller, and D. Flotzinger. Automated feature selection with distinction sensitive learning vector quantization. Neurocomputing, 11:19-29, 1996.

[27] H. Ritter. Self-organizing maps in non-Euclidean spaces. In E. Oja and S. Kaski, editors, Kohonen Maps, pages 97-108. Springer, 1999.
[28] V. Roth. Sparse kernel regressors. In G. Dorffner, H. Bischof, and K. Hornik, editors, Artificial Neural Networks, ICANN 2001, pages 339-346. Springer, 2001.
[29] A. S. Sato and K. Yamada. Generalized learning vector quantization. In G. Tesauro, D. Touretzky, and T. Leen, editors, Advances in Neural Information Processing Systems, volume 7, pages 423-429. MIT Press, 1995.
[30] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58:267-288, 1996.
[31] M. Tipping. The relevance vector machine. In S. A. Solla, T. K. Leen, and K.-R. Müller, editors, Advances in Neural Information Processing Systems, volume 12, pages 652-658. MIT Press, 2000.
[32] J. Sinkkonen and S. Kaski. Clustering based on conditional distribution in an auxiliary space. To appear in Neural Computation.
[33] P. Somervuo and T. Kohonen. Self-organizing maps and learning vector quantization for feature sequences. Neural Processing Letters, 10(2):151-159, 1999.
[34] M.-K. Tsay, K.-H. Shyu, and P.-C. Chang. Feature transformation with generalized learning vector quantization for hand-written Chinese character recognition. IEICE Transactions on Information and Systems, E82-D(3):687-692, 1999.
[35] T. Villmann and E. Merenyi. Extensions and modifications of the Kohonen-SOM and applications in remote sensing image analysis. In U. Seiffert and L. C. Jain, editors, Self-Organizing Maps: Recent Advances and Applications, pages 121-145. Springer, 2001.
[36] T. Villmann. Benefits and limits of the self-organizing map and its variants in the area of satellite remote sensoring processing. In Proc. of European Symposium on Artificial Neural Networks (ESANN'99), pages 111-116, Brussels, Belgium, 1999. D facto publications.
[37] T. Villmann, R. Der, M. Herrmann, and T. Martinetz. Topology Preservation in Self-Organizing Feature Maps: Exact Definition and Measurement. IEEE Transactions on Neural Networks, 8(2):256-266, 1997.
[38] W. Wienholt. Entwurf Neuronaler Netze. Verlag Harri Deutsch, Frankfurt/M., Germany, 1996.

Appendix
The general error function (1) has here the special form
$$E = \sum_i \mathrm{sgd}\!\left(\frac{d^{\lambda}_{r^+}(x^i) - d^{\lambda}_{r^-}(x^i)}{d^{\lambda}_{r^+}(x^i) + d^{\lambda}_{r^-}(x^i)}\right),$$
$d^{\lambda}_{r^+}(x^i)$ and $d^{\lambda}_{r^-}(x^i)$ being the quadratic weighted distance of $x^i$ to the closest correct or wrong prototype, respectively. For convenience we denote these two distances by $d^+(x)$ and $d^-(x)$. Assume data come from a distribution $P$ on the input space and a labeling function $c$. Then the continuous version of the error function reads as
$$E = \int \mathrm{sgd}\!\left(\frac{d^+(x) - d^-(x)}{d^+(x) + d^-(x)}\right)\, P(dx).$$
We assume that the receptive fields are measurable. Then we can write the error term in the following way:
$$E = \sum_{k}\sum_{l}\int \chi_k(x)\,\bar\chi_l(x)\,\mathrm{sgd}\!\left(\frac{|x-w^k|^2_\lambda - |x-w^l|^2_\lambda}{|x-w^k|^2_\lambda + |x-w^l|^2_\lambda}\right) P(dx), \tag{7}$$
where $k$ runs over the indices of prototypes labeled with $c(x)$, $l$ runs over the indices of prototypes not labeled with $c(x)$, $\chi_k$ is an indicator function for $w^k$ being the closest prototype to $x$ among those labeled with $c(x)$, and $\bar\chi_l$ is an indicator function for $w^l$ being the closest prototype to $x$ among those not labeled with $c(x)$. Denote by H the Heaviside function, and by $K^+$ or $K^-$ the number of prototypes labeled with $c(x)$ or not labeled with $c(x)$, respectively. Then we find
$$\chi_k(x) = \prod_{k' \ne k,\ c_{k'} = c(x)} \mathrm{H}\!\left(|x-w^{k'}|^2_\lambda - |x-w^k|^2_\lambda\right)
\quad\text{and}\quad
\bar\chi_l(x) = \prod_{l' \ne l,\ c_{l'} \ne c(x)} \mathrm{H}\!\left(|x-w^{l'}|^2_\lambda - |x-w^l|^2_\lambda\right).$$
The derivative of the Heaviside function is the delta function $\delta$, which is symmetric and non-vanishing only at the origin. We are interested in the derivative of (7) with respect to every prototype $w^r$ and every weighting factor $\lambda_j$, respectively.

Consider first the derivative of (7) with respect to a prototype $w^k$ labeled with the label of $x$. It consists of two groups of terms: the terms (8) and (9), which arise from differentiating the argument of sgd and correspond, up to a constant factor, to the prototype update (3); and the terms (10) and (11), which arise from differentiating the indicator functions and therefore contain the delta function $\delta = \mathrm{H}'$ evaluated at differences of weighted distances. The terms (10) and (11) vanish for the following reason: each of their integrands contains a factor $\delta\!\left(|x-w^{k'}|^2_\lambda - |x-w^k|^2_\lambda\right)$, which is non-vanishing only where the two distances coincide; at these points the remaining factor, built from the sgd terms associated with $w^k$ and $w^{k'}$, changes sign when $k$ and $k'$ are exchanged, so the contributions cancel because of the symmetry of $\delta$. In the same way, it can be seen that each integrand of (11) vanishes. The derivative with respect to a prototype not labeled with the label of $x$ is treated analogously.

The derivative of (7) with respect to $\lambda_j$ can be computed in the same manner: the terms (12) and (13), which arise from differentiating the argument of sgd, correspond to the update of the weighting factors in (6), whereas the terms (14) and (15), which again contain the delta function, vanish by the same symmetry argument since their integrands are non-vanishing only where the respective weighted distances coincide. Hence the update of GRLVQ constitutes a stochastic gradient descent method with appropriate choices of the learning rates.