
UNIVERSIDAD DE OVIEDO

MASTER IN SOFT COMPUTING AND INTELLIGENT DATA ANALYSIS

PROYECTO FIN DE MASTER
MASTER PROJECT

Bi-clustering Olfactory Images Using Non-negative Matrix Factorization

Milad Avazbeigi

TUTOR/ADVISOR:
Enrique H. Ruspini
Francisco Ortega Fernandez

JULY 2011

Contents

Preface

1 Introduction
  1.1 Historical Background
  1.2 NMF, PCA and VQ
  1.3 Non-negative Matrix Factorization (NMF)
    1.3.1 A Simple NMF and Its Solution
    1.3.2 Uniqueness

2 Non-negative Matrix Factorization Algorithms
  2.1 Comparison of Different NMFs
  2.2 NMF Based on Alternating Non-Negativity Constrained Least Squares and Active Set Method
    2.2.1 Formulation of Sparse NMFs

3 Comparison of NMF and FCM
  3.1 Fuzzy c-means (FCM)
  3.2 Cluster Validity Index
    3.2.1 Relative degree of sharing of two fuzzy clusters
    3.2.2 Validity Index
  3.3 NMF vs. FCM

4 Bi-clustering of Images and Pixels
  4.1 NMF for bi-clustering
  4.2 Feature Selection
  4.3 Choosing the Number of Clusters
    4.3.1 Visual Assessment of Cluster Tendency (VAT)
    4.3.2 VAT results
  4.4 Stability Analysis of NMF
  4.5 Bi-clustering of Images and Pixels
  4.6 Cross-Validation

5 Conclusions and Future Research

A Experiments and Specifications
  A.1 Chapter 2
    A.1.1 Comparison of Different Algorithms of NMF
  A.2 Chapter 3
    A.2.1 Comparison of FCM and NMF
  A.3 Chapter 4
    A.3.1 coVAT
    A.3.2 Bi-clustering and Feature Selection
    A.3.3 Stability Analysis
    A.3.4 Cross-validation

List of Figures

2.1 Comparison of RMSE for different numbers of clusters
2.2 Demonstration of Clusters obtained by NMFNNLS
2.3 Comparison of Three Images After Applying NMF
3.1 FCM vs. NMF
3.2 Comparison of Clusters for K = 2
3.3 Comparison of Clusters for K = 3
3.4 Demonstration of Clusters obtained by FCM
3.5 Demonstration of Clusters obtained by NMFNNLS
4.1 A simple dissimilarity image before and after ordering with VAT
4.2 Growth of Complexity of VAT
4.3 VAT results for K=20 in NMF
4.4 coVAT results for K=20 in NMF
4.5 coVAT high contrast results for K=20 in NMF
4.6 Mean and Standard Deviation of CVI for K = 2 to 14
4.7 Mean and Standard Deviation of RMSE for K = 2 to 14
4.8 Clusters and Components for K = 6
4.9 V ≈ WH
4.10 W: Five random sets of clusters from five permutations

Preface

With the rapid technological advances of the last two decades in particular, a huge mass of biological data has been gathered through new tools and new experiments enabled by new equipment and methods. Intelligent analysis of this mass of data is necessary to extract knowledge about the underlying biological mechanisms that produce it. An example is the application of statistical methods to the analysis of image datasets, for instance gene-expression images. One of the main goals in the analysis of image datasets is to identify groups of images, or groups of pixels (regions in images), that exhibit similar expression patterns.

In the literature there have been numerous works related to the clustering or bi-clustering of gene-expression images. However, to the knowledge of the writers of this report, no work has yet been reported on the bi-clustering of olfactory images. Olfaction is the sense of smell, and the olfactory bulb is a structure of the vertebrate forebrain involved in olfaction, the perception of odors. The bi-clustering of olfactory images is important because it is believed that it would help us understand olfactory coding. A code is a set of rules by which information is transposed from one form to another. In the case of olfactory coding, it would describe the ways in which information about odorant molecules is transposed into neural responses. The idea is that once we understand that code, we might be able to predict the odorant molecule from the neural response, the neural response from the odorant molecule, and the perception of the odorant molecule from the neural response.

The available dataset consists of 472 images of 80 × 44 pixels. Every image has already been segmented from a background. Every image corresponds to the 2-DG uptake in the olfactory bulb (OB) of a rat in response to a particular chemical substance. The purpose of this research is to apply Non-negative Matrix Factorization (NMF) to bi-cluster images and pixels simultaneously and finally to find pixels (regions of images) that respond in a similar manner for a selection of chemicals.

Chapter 1

Introduction

1.1 Historical Background

Matrix factorization is a unifying theme in numerical linear algebra. A wide variety of matrix
factorization algorithms have been developed over many decades, providing a numerical platform
for matrix operations such as solving linear systems, spectral decomposition, and subspace
identification. Some of these algorithms have also proven useful in statistical data analysis,
most notably the singular value decomposition (SVD), which underlies principal component
analysis (PCA) (Ding et al. (2010)).
There is psychological and physiological (Biederman (1987)) evidence for parts-based representations in the brain, and certain computational theories of object recognition rely on such representations (Ullman (2000)). But little is known about how brains or computers might learn the parts of objects. Non-negative matrix factorization (NMF) is able to learn parts of images and semantic features of text. This is in contrast to other methods, such as principal components analysis (PCA) and vector quantization (VQ), that learn holistic, not parts-based, representations. Non-negative matrix factorization is distinguished from the other methods by its use of non-negativity constraints. These constraints lead to a parts-based representation because they allow only additive, not subtractive, combinations (Lee and Seung (1999)).

Since the introduction of NMF by Paatero and Tapper (1994), the scope of research on NMF has grown rapidly in recent years. NMF has been shown to be useful in a variety of applied settings, including:

Chemometrics (Chueinta et al. (2000))
Pattern recognition (Yuan and Oja (2005))
Multimedia data analysis (Benetos and Kotropoulos (2010))
Text mining (Batcha and Zaki (2010), Xu et al. (2003))
DNA gene expression analysis (Kim and Park (2007))
Protein interaction (Greene et al. (2008))

This chapter is organized as follows. First, in Section 1.2, NMF, PCA and VQ, which are the most famous matrix factorization methods, are explained briefly and compared to each other. Then, in Section 1.3, a simple NMF model is presented. Also in this section the uniqueness of NMF solutions is discussed.


1.2 NMF, PCA and VQ

Factorizing a data matrix has been extensively studied in numerical linear algebra. There are different methods for factorizing a data matrix:

Principal Components Analysis (PCA)
Vector Quantization (VQ)
Non-negative Matrix Factorization (NMF)

From another point of view, Principal Components Analysis and Vector Quantization are also methods for unsupervised learning.

The formulation of the problem in all three methods PCA, VQ and NMF is as follows. Given a matrix V, the goal is to find matrix factors W and H such that:

V ≈ WH    (1.1)

or

V_{iμ} ≈ (WH)_{iμ} = Σ_{a=1}^{K} W_{ia} H_{aμ}    (1.2)

where
V is an n × m matrix, where m is the number of examples (in the data set) and n is the number of dimensions of each example;
W is n × K;
H is K × m.
Usually K is chosen to be smaller than n or m, so W and H are smaller than V, which is the original matrix:

[ V ]_{n×m} ≈ [ W ]_{n×K} [ H ]_{K×m}    (1.3)

V ≈ WH can be rewritten column by column as v ≈ Wh, where v and h are the corresponding columns of V and H:

[ v ]_{n×1} ≈ [ W ]_{n×K} [ h ]_{K×1}    (1.4)

In other words, each data vector v is approximated by a linear combination of the columns of W, weighted by the components of h. In an image clustering problem, the K columns of W are called basis images. Each column of H is called an encoding and is in one-to-one correspondence with an image in V. An encoding consists of the coefficients by which an image is represented as a linear combination of basis images. The rank K of the factorization is generally chosen so that (n + m)K < nm, and the product WH can be regarded as a compressed form of the data in V. Depending on the constraints used, the output results are completely different:

VQ uses a hard winner-take-all constraint that results in clustering the data into mutually exclusive prototypes. In VQ, each column of H is constrained to be a unary vector, with one element equal to unity and the other elements equal to zero. In other words, every image (column of V) is approximated by a single basis image (column of W) in the factorization V ≈ WH (Lee and Seung (1999)).


PCA enforces only a weak orthogonality constraint, resulting in a very distributed representation that uses cancellations to generate variability. PCA constrains the columns of W to be orthonormal and the rows of H to be orthogonal to each other. This relaxes the unary constraint of VQ, allowing a distributed representation in which each image is approximated by a linear combination of all the basis images, or eigenimages. Although eigenimages have a statistical interpretation as the directions of largest variance, many of them do not have an obvious visual interpretation. This is because PCA allows the entries of W and H to be of arbitrary sign. As the eigenimages are used in linear combinations that generally involve complex cancellations between positive and negative numbers, many individual eigenimages lack intuitive meaning (Lee and Seung (1999)).

NMF does not allow negative entries in the matrix factors W and H. This is a very useful property in comparison with PCA, which allows arbitrary signs and loses the interpretability of the results, especially for image data that have only positive values. Unlike the unary constraint of VQ, these non-negativity constraints permit the combination of multiple basis images to represent an image. But only additive combinations are allowed, because the non-zero elements of W and H are all positive. In contrast to PCA, no subtractions can occur. For these reasons, the non-negativity constraints are compatible with the intuitive notion of combining parts to form a whole, which is how NMF learns a parts-based representation.

Because NMF is the method used in this research, it is described in detail in the following.
Because NMF is the method used in this research, in the following NMF is described in detail.

1.3 Non-negative Matrix Factorization (NMF)

As said before, V is a known non-negative matrix, and the goal is to find two matrices W and H that are also non-negative. There are different types of non-negative matrix factorizations. The different types arise from using different cost functions for measuring the divergence between V and WH (such as Lin (2007)) and possibly from regularization of the W and/or H matrices. In order to formulate the NMF problem, we first need to define a cost function. The cost function is a measure of how well the obtained W and H predict V. After defining the cost function, the NMF problem reduces to the problem of optimizing that cost function. There are two basic cost functions, as described below (Lee and Seung (2001)):

‖A − B‖² = Σ_{ij} (A_{ij} − B_{ij})²    (1.5)

D(A‖B) = Σ_{ij} ( A_{ij} log(A_{ij}/B_{ij}) − A_{ij} + B_{ij} )    (1.6)

Both are bounded below by zero, and the bound is attained exactly when A = B.

1.3.1 A Simple NMF and Its Solution

Here a simple, classic NMF and its solution are presented. This algorithm is also used later in the experiments (Lee and Seung (2001)).

NMF based on the Euclidean cost function: minimize ‖V − WH‖² with respect to W and H, subject to the constraints W, H ≥ 0.

NMF based on the logarithmic (divergence) cost function: minimize D(V‖WH) with respect to W and H, subject to the constraints W, H ≥ 0.


Neither ‖V − WH‖² nor D(V‖WH) is convex in both variables W and H simultaneously; each is convex in W alone or in H alone. So there is no algorithm guaranteed to solve NMF globally. Instead, algorithms can be proposed that solve the problem locally (local optima). The simplest such algorithm is probably gradient descent, but its convergence can be slow; other methods have faster convergence but are more complicated. The objective function can be related to the likelihood of generating the images in V from the basis W and encodings H. An iterative approach that reaches a local optimum is explained in the following.

The Euclidean distance ‖V − WH‖ is non-increasing under these update rules:

H_{aμ} ← H_{aμ} (WᵀV)_{aμ} / (WᵀWH)_{aμ},    W_{ia} ← W_{ia} (VHᵀ)_{ia} / (WHHᵀ)_{ia}    (1.7)

The divergence D(V‖WH) is non-increasing under these update rules:

H_{aμ} ← H_{aμ} [ Σ_i W_{ia} V_{iμ}/(WH)_{iμ} ] / Σ_k W_{ka},    W_{ia} ← W_{ia} [ Σ_μ H_{aμ} V_{iμ}/(WH)_{iμ} ] / Σ_ν H_{aν}    (1.8)

The convergence of the process is ensured. The initialization is performed using positive random initial conditions for matrices W and H.
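To make the Euclidean update rules concrete, a minimal NumPy sketch is given below, assuming vectorized non-negative data; the function name, the small epsilon guard against division by zero, and the toy data are illustrative additions and not part of the thesis experiments.

```python
import numpy as np

def nmf_multiplicative(V, K, n_iter=200, seed=0):
    """Lee-Seung multiplicative updates for min ||V - WH||^2 with W, H >= 0 (Eq. 1.7)."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, K)) + 1e-4          # positive random initialization
    H = rng.random((K, m)) + 1e-4
    eps = 1e-10                            # guards against division by zero
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # H update of Eq. 1.7
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # W update of Eq. 1.7
    return W, H

# Toy usage on a random non-negative matrix (472 "images" of 80*44 pixels).
V = np.random.rand(80 * 44, 472)
W, H = nmf_multiplicative(V, K=12)
print(np.linalg.norm(V - W @ H, "fro"))
```

Because each factor is multiplied elementwise by a non-negative ratio, non-negativity is preserved automatically at every iteration.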

1.3.2 Uniqueness

The factorization is not unique: a matrix and its inverse can be used to transform the two factorization matrices, for example (Xu et al. (2003)):

WH = (WB)(B⁻¹H)    (1.9)

If the two new matrices W̃ = WB and H̃ = B⁻¹H are non-negative, they form another parametrization of the factorization. The non-negativity of W̃ and H̃ holds at least if B is a non-negative monomial matrix; in this simple case the transformation just corresponds to a scaling and a permutation. More control over the non-uniqueness of NMF is obtained with sparsity constraints.
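The following tiny NumPy example illustrates Equation 1.9; the particular monomial matrix B is hand-picked for the demonstration and is not taken from the thesis.

```python
import numpy as np

W = np.abs(np.random.rand(6, 3))
H = np.abs(np.random.rand(3, 5))
# A non-negative monomial matrix: one positive entry per row and column,
# i.e. a permutation combined with a positive scaling.
B = np.array([[0.0, 2.0, 0.0],
              [0.0, 0.0, 5.0],
              [1.0, 0.0, 0.0]])
W2 = W @ B                           # still non-negative
H2 = np.linalg.inv(B) @ H            # inverse of a monomial matrix is non-negative
print(np.allclose(W @ H, W2 @ H2))   # True: same product, different factors
```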

Chapter 2

Non-negative Matrix Factorization Algorithms

2.1 Comparison of Different NMFs

The dataset consists of 472 images in which every image has 80 × 44 pixels. Every image has already been segmented from a background, and the background pixel values are set to a large negative value. Every image corresponds to the 2-DG uptake in the OB of a rat in response to a particular chemical substance. Also, the images correspond to different animals, so there are small differences in size and possibly in alignment.

The purpose of this research is to apply Non-negative Matrix Factorization (NMF) to bi-cluster images and pixels simultaneously and finally to find pixels that respond in a similar manner for a group of chemicals. Before doing any experiments, all the background pixels are removed to decrease the dimension of all images. As said in Section 1.1, since the introduction of NMF, numerous methods have been developed for different applications and purposes. In this research, in order to determine which method is appropriate for the given problem, several methods are compared and finally the one with the minimum error is used. The methods are as follows:

convexnmf: Convex and semi-nonnegative matrix factorizations (Ding et al. (2010)).
orthnmf: Orthogonal nonnegative matrix tri-factorizations (Ding et al. (2006)).
nmfnnls: Sparse non-negative matrix factorizations via alternating non-negativity-constrained least squares (Kim and Park (2008) and Kim and Park (2007)).
nmfrule: Basic NMF (Lee and Seung (1999) and Lee and Seung (2001)).
nmfmult: NMF with multiplicative update formula (Pauca et al. (2006)).
nmfals: NMF with auxiliary constraints (Berry et al. (2007)).

In order to make a reasonable comparison among these methods, every algorithm is executed 500 times and the average of all results is reported. Since the number of clusters K in the NMF formula, as shown in Equation 1.3, is not obvious, every algorithm is executed 10 times for each number of clusters (1 to 50 clusters). As Table 2.1 shows, nmfnnls has the lowest RMSE (Root Mean Square Error). The Root Mean Square Error is computed as shown in Equation 2.1; in this equation, n is the number of samples (images) and m is the number of pixels. In order to be able to use NMF, every image is vectorized; for example, a 20 × 30 pixel image after vectorization becomes a vector of length 600. Figure 2.1 also shows the RMSE of nmfnnls for different numbers of clusters. As expected, as the number of clusters increases, the RMSE decreases.


Method      RMSE     Maximum Number of Iterations
convexnmf   0.1260   1000
orthnmf     0.1320   1000
nmfnnls     0.1159   1000
nmfrule     0.1162   1000
nmfmult     0.1657   1000
nmfals      0.1178   1000

Table 2.1: Comparison of different NMF methods

Figure 2.1: Comparison of RMSE for different numbers of clusters (RMSE on the vertical axis, number of clusters on the horizontal axis)


RMSE = sqrt( Σ_{iμ} ( V_{iμ} − (WH)_{iμ} )² / (mn) )    (2.1)

As shown in Table 2.1, nmfnnls is the best method for the given problem. Later, in Chapter 4, this method will be used to bi-cluster images and pixels simultaneously.

Just to show how NMF works on our problem, the output of NMF is shown for K = 12. Later, in Chapter 4, the number of clusters will be determined through some experiments. Figure 2.2 shows the obtained clusters (they can be interpreted as the cluster centers). It should be mentioned that each cluster corresponds to a column of W in NMF. Also, Figure 2.3 shows three sample images before and after applying NMF. In this figure, the first column on the right contains the original images obtained from the laboratory experiments. The middle column contains the normalized images. The column on the left shows the results obtained by NMF. As can be seen, the main components of the patterns are much clearer than in the original images. It should be mentioned that in order to use NMF, it is mandatory to have only non-negative values. First, the original images are vectorized to form V, in which every column represents an image. Then, V is normalized along the rows. All the experiments are done with normalized data.
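The error measure of Equation 2.1 and the preprocessing described above can be sketched as follows; the max-based row normalization shown here is only a stand-in, since the exact normalization used in the experiments is not spelled out, and the random array merely mimics the shape of the real dataset.

```python
import numpy as np

def rmse(V, W, H):
    """Root mean square error of the approximation V ~ WH (Eq. 2.1)."""
    return np.sqrt(np.mean((V - W @ H) ** 2))

# Hypothetical preprocessing: vectorize every 80x44 image into a column of V,
# then normalize V along the rows (pixels) before factorization.
images = np.random.rand(472, 80, 44)               # stand-in for the real data
V = images.reshape(images.shape[0], -1).T          # one column per image
V = V / (V.max(axis=1, keepdims=True) + 1e-12)     # placeholder row normalization
```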

2.2 NMF Based on Alternating Non-Negativity Constrained Least Squares and Active Set Method

This section is devoted to a detailed description of the method developed by Kim and Park (2008). This method showed the best performance in the experiments. The method tries to produce sparse W and H. In order to enforce sparseness on W or H, Kim and Park (2008)
introduce two formulations and the corresponding algorithms for sparse NMFs, i.e. SNMF/L for sparse W (where L denotes the sparseness imposed on the left factor) and SNMF/R for sparse H (where R denotes the sparseness imposed on the right factor). The introduced sparse NMF formulations, which impose sparsity on one factor of NMF, utilize L1-norm minimization, and the corresponding algorithms are based on alternating non-negativity-constrained least squares (ANLS). Each sub-problem is solved by a fast non-negativity-constrained least squares (NLS) algorithm (Van Benthem and Keenan (2004)) that improves upon the active-set-based NLS method.

2.2.1 Formulation of Sparse NMFs

SNMF/R

To apply sparseness constraints on H, the following SNMF/R optimization problem is used:

min_{W,H} (1/2) [ ‖A − WH‖²_F + η ‖W‖²_F + β Σ_{j=1}^{n} ‖H(:, j)‖₁² ],    s.t. W, H ≥ 0    (2.2)

where H(:, j) is the j-th column vector of H, η > 0 is a parameter to suppress ‖W‖²_F, and β > 0 is a regularization parameter that balances the trade-off between the accuracy of the approximation and the sparseness of H. The SNMF/R algorithm begins with the initialization of W with non-negative values. Then it iterates the following ANLS until convergence:

min_H ‖ [ W ; √β e_{1k} ] H − [ A ; 0_{1×n} ] ‖²_F,    s.t. H ≥ 0    (2.3)

where the semicolon denotes vertical concatenation, e_{1k} ∈ R^{1×k} is a row vector with all components equal to one and 0_{1×n} ∈ R^{1×n} is a zero vector, and

min_W ‖ [ Hᵀ ; √η I_k ] Wᵀ − [ Aᵀ ; 0_{k×m} ] ‖²_F,    s.t. W ≥ 0    (2.4)

where I_k is an identity matrix of size k × k and 0_{k×m} is a zero matrix of size k × m. Equation 2.3 minimizes the L1-norm of the columns of H ∈ R^{k×n}, which imposes sparsity on H.
SNMF/L

To impose sparseness constraints on W, the following formulation is used:

min_{W,H} (1/2) [ ‖A − WH‖²_F + η ‖H‖²_F + β Σ_{i=1}^{m} ‖W(i, :)‖₁² ],    s.t. W, H ≥ 0    (2.5)

where W(i, :) is the i-th row vector of W, η > 0 is a parameter to suppress ‖H‖²_F, and β > 0 is a regularization parameter that balances the trade-off between the accuracy of the approximation and the sparseness of W. The algorithm for SNMF/L is also based on ANLS.
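A compact sketch of the SNMF/R iteration of Equations 2.2 to 2.4 is shown below. It solves each non-negativity-constrained sub-problem column by column with SciPy's generic NNLS routine rather than the fast block active-set solver of Van Benthem and Keenan (2004), and the parameter values are illustrative only.

```python
import numpy as np
from scipy.optimize import nnls

def snmf_r(A, k, beta=0.01, eta=0.01, n_iter=50, seed=0):
    """Sparse NMF (SNMF/R) via alternating NNLS, a sketch of Eqs. 2.2-2.4."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W = rng.random((m, k))                      # non-negative initialization of W
    H = np.zeros((k, n))
    for _ in range(n_iter):
        # H-step (Eq. 2.3): stack sqrt(beta) * row of ones under W.
        Wb = np.vstack([W, np.sqrt(beta) * np.ones((1, k))])
        Ab = np.vstack([A, np.zeros((1, n))])
        for j in range(n):
            H[:, j], _ = nnls(Wb, Ab[:, j])
        # W-step (Eq. 2.4): stack sqrt(eta) * identity under H^T.
        Hb = np.vstack([H.T, np.sqrt(eta) * np.eye(k)])
        Atb = np.vstack([A.T, np.zeros((k, m))])
        Wt = np.empty((k, m))
        for i in range(m):
            Wt[:, i], _ = nnls(Hb, Atb[:, i])
        W = Wt.T
    return W, H
```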

Figure 2.2: Demonstration of Clusters obtained by NMFNNLS (basis images for Cluster #1 through Cluster #12)

Figure 2.3: Comparison of Three Images After Applying NMF (estimated, normalized and main versions of images #39, #181 and #230)

Chapter 3

Comparison of NMF and FCM

As was shown in Chapter 2, sparse non-negative matrix factorization via alternating non-negativity-constrained least squares (Kim and Park (2008) and Kim and Park (2007)) has the best performance for solving the problem. In this chapter, the best NMF algorithm (nmfnnls) is compared with FCM. The chapter is organized as follows. First, in Section 3.1, FCM is introduced briefly. In Section 3.2, the comparison measure is explained, and finally in Section 3.3, NMF and FCM are compared according to this measure.

3.1 Fuzzy c-means (FCM)

Fuzzy c-means (FCM) is a method of clustering which allows one piece of data to belong to two or more clusters. It is based on minimization of the following objective function:

J_m = Σ_{i=1}^{N} Σ_{j=1}^{C} u_{ij}^m ‖x_i − c_j‖²,    1 ≤ m < ∞    (3.1)

where m is any real number greater than 1, u_{ij} is the degree of membership of x_i in cluster j, x_i is the i-th d-dimensional measured datum, c_j is the d-dimensional center of the cluster, and ‖·‖ is any norm expressing the similarity between a measured datum and the center. Fuzzy partitioning is carried out through an iterative optimization of the objective function shown above, with the membership u_{ij} and the cluster centers c_j updated by Equations 3.2 and 3.3, respectively:

u_{ij} = 1 / Σ_{k=1}^{C} ( ‖x_i − c_j‖ / ‖x_i − c_k‖ )^{2/(m−1)}    (3.2)

c_j = Σ_{i=1}^{N} u_{ij}^m x_i / Σ_{i=1}^{N} u_{ij}^m    (3.3)

The iteration stops when max_{ij} |u_{ij}^{(k+1)} − u_{ij}^{(k)}| < ε, where ε is a termination criterion between 0 and 1 and k is the iteration step. This procedure converges to a local minimum or a saddle point of J_m. The algorithm is composed of the following steps:
1. Initialize the membership matrix U = [u_{ij}], U^{(0)}.
2. At step k: calculate the center vectors C^{(k)} = [c_j] with U^{(k)}:
   c_j = Σ_{i=1}^{N} u_{ij}^m x_i / Σ_{i=1}^{N} u_{ij}^m
3. Update U^{(k)} to U^{(k+1)}:
   u_{ij} = 1 / Σ_{k=1}^{C} ( ‖x_i − c_j‖ / ‖x_i − c_k‖ )^{2/(m−1)}
4. If max_{ij} |u_{ij}^{(k+1)} − u_{ij}^{(k)}| < ε then STOP; otherwise return to step 2.

A compact implementation of these steps is sketched below.
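The sketch assumes object data stored row-wise; the tolerance, iteration cap and random initialization are illustrative defaults rather than the settings of the thesis experiments.

```python
import numpy as np

def fcm(X, C, m=1.1, eps=1e-5, max_iter=300, seed=0):
    """Fuzzy c-means following Eqs. 3.1-3.3. X: (N, d) data, C: number of clusters."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    U = rng.random((N, C))
    U /= U.sum(axis=1, keepdims=True)                     # Step 1: initial U(0)
    for _ in range(max_iter):
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]    # Step 2: Eq. 3.3
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        U_new = 1.0 / (d ** (2.0 / (m - 1.0)) *
                       (d ** (-2.0 / (m - 1.0))).sum(axis=1, keepdims=True))  # Step 3: Eq. 3.2
        done = np.max(np.abs(U_new - U)) < eps            # Step 4: termination test
        U = U_new
        if done:
            break
    return U, centers
```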

3.2 Cluster Validity Index

In order to compare two clustering methods, it is necessary to use a measure, usually called a cluster validity index. In this research, a cluster validity index developed by Kim et al. (2004) is used. This cluster validity index is based on the relative degree of sharing of two clusters, computed as a weighted sum of the relative degrees of sharing over all data.

3.2.1 Relative degree of sharing of two fuzzy clusters

Let F_p and F_q be two fuzzy clusters belonging to a fuzzy partition (U, V), and let c be the number of clusters.

Definition 1. The relative degree of sharing of two fuzzy clusters F_p and F_q at x_j is defined as:

S_rel(x_j : F_p, F_q) = [ F_p(x_j) ∧ F_q(x_j) ] / [ (1/c) Σ_{i=1}^{c} F_i(x_j) ]    (3.4)

The numerator is the degree of sharing of F_p and F_q at x_j (their fuzzy intersection), and the denominator is the average membership value of x_j over the c fuzzy clusters. The relative degree of sharing of two fuzzy clusters is defined as the weighted summation of S_rel(x_j : F_p, F_q) over all data in X.

Definition 2 (Relative degree of sharing of two fuzzy clusters). The relative degree of sharing of fuzzy clusters F_p and F_q is defined as:

S_rel(F_p, F_q) = Σ_{j=1}^{n} S_rel(x_j : F_p, F_q) h(x_j)    (3.5)

where h(x_j) = −Σ_{i=1}^{c} F_i(x_j) log_a F_i(x_j).

Here, h(x_j) is the entropy of datum x_j and F_i(x_j) is the membership value with which x_j belongs to cluster F_i. h(x_j) measures how vaguely the datum x_j is classified over the c different clusters. h(x_j) is introduced to assign a weight to vague data: vague data are given more weight than clearly classified data. h(x_j) also reflects the dependency of F_i(x_j) with respect to different values of c. This approach makes it possible to focus more on highly overlapped data in the computation of the validity index than other indices do. Since olfactory data are highly overlapped, this index is a good measure.

3.2.2 Validity Index

Definition 3 (Validity Index). Let F_p and F_q be two fuzzy clusters belonging to a fuzzy partition (U, V), and let c be the number of clusters. Let S_rel(F_p, F_q) be the degree of sharing of the two fuzzy clusters. Then the index V is defined as:

V(U, V : X) = [ 2 / (c(c−1)) ] Σ_{p≠q} S_rel(F_p, F_q) = [ 2 / (c(c−1)) ] Σ_{p≠q} Σ_{j=1}^{n} c [ F_p(x_j) ∧ F_q(x_j) ] h(x_j)    (3.6)
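The index can be computed directly from any row-stochastic membership matrix, for example the row-normalized W of NMF or the U of FCM. The sketch below assumes the fuzzy intersection in Equations 3.4 and 3.6 is the minimum operator, which is an assumption about the operator rather than something stated explicitly in the text.

```python
import numpy as np

def validity_index(U, base=2.0):
    """Sharing-based cluster validity index, a sketch of Eqs. 3.4-3.6.
    U: (N, c) membership matrix whose rows sum to one."""
    N, c = U.shape
    logU = np.log(np.clip(U, 1e-12, None)) / np.log(base)
    h = -(U * logU).sum(axis=1)                    # entropy h(x_j) of each datum
    total = 0.0
    for p in range(c):
        for q in range(c):
            if p == q:
                continue
            share = np.minimum(U[:, p], U[:, q])   # assumed fuzzy intersection
            total += np.sum(c * share * h)         # inner sum of Eq. 3.6
    return 2.0 / (c * (c - 1)) * total
```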

3.3 NMF vs. FCM

Figure 3.1 shows the cluster validity index explained in the previous section, for different numbers of clusters, for both NMF and FCM. The values are also listed in Table 3.1. To obtain every value of the index, every algorithm is repeated 10 times and the average is reported. This is necessary since both FCM and NMF begin with random seeds and converge to local minima. For FCM, the Euclidean norm shown in Equation 3.7 is used. The fuzzification factor m in the FCM algorithm is set to 1.1; experiments show that larger values of m produce weak results.

‖x‖ := sqrt( x₁² + ... + x_n² )    (3.7)

As Figure 3.1 shows, regardless of the number of clusters K, NMF always outperforms FCM, and it even shows more stable behavior. As an example, both algorithms are applied on the main dataset with 12 clusters. The results are shown in Figures 3.5 and 3.4. As is clear, NMF is able to find a large variety of patterns, and since the output is a compressed version of the original data, in some cases the noise is removed. On the other hand, in FCM the clusters are very noisy, and there is also a strong similarity among clusters 3, 8 and 10. In contrast, the NMF images are less noisy and very heterogeneous. This is a very important, desirable property for our problem, since the clusters will be very different and probably more interpretable.

Since the cluster validity index values are very similar for K = 2 and K = 3, as shown in Table 3.1, the clusters obtained by NMF and FCM are demonstrated and compared for K = 2 and K = 3. In the case of K = 2, because there is little difference between NMF and FCM in terms of the cluster validity index shown in Table 3.1, we expect the images to be similar. Figures 3.2(b) and 3.2(a) also support this hypothesis. However, Figures 3.3(b) and 3.3(a) show that NMF is able to produce better clusters than FCM, even for a small number of clusters. Since NMF does not perform a global optimization and tries to factorize V into W and H, it is more probable that NMF grasps local patterns, such as the blue patterns that appear in the NMF picture and are not present in the FCM results.
Figure 3.1: FCM vs. NMF (cluster validity index, ×10⁴, plotted against the number of clusters for FCM and NMF)

K    NMF       FCM      |  K    NMF       FCM
2    1762.62   1782.22  |  24   1936.99   10209.10
3    1866.94   1958.48  |  25   1974.52   10635.04
4    1885.21   2304.26  |  26   1991.93   11027.00
5    1742.53   2671.88  |  27   1955.30   11367.68
6    1819.51   3060.68  |  28   2070.83   11813.05
7    1740.90   3462.59  |  29   2049.97   12243.92
8    1738.99   3867.56  |  30   2101.71   12654.37
9    1730.64   4256.01  |  31   2017.98   13015.36
10   1800.77   4651.38  |  32   2065.92   13382.79
11   1756.36   5025.08  |  33   2089.04   13836.93
12   1795.11   5428.79  |  34   2077.78   14138.64
13   1865.47   5836.79  |  35   2091.19   14614.99
14   1795.58   6255.98  |  36   2190.53   15020.55
15   1837.64   6655.65  |  37   2101.48   15462.93
16   1844.71   7050.31  |  38   2168.09   15754.74
17   1762.42   7463.09  |  39   2211.88   16224.19
18   1870.11   7865.73  |  40   2163.61   16589.04
19   1855.41   8262.39  |  41   2219.84   17003.98
20   1888.06   8649.33  |  42   2206.45   17393.46
21   1887.25   9062.16  |  43   2338.49   17708.48
22   1919.54   9427.37  |  44   2277.34   18165.53
23   1990.77   9843.57  |  45   2287.03   18535.29

Table 3.1: Cluster Validity Index Values for Different Ks

Figure 3.2: Comparison of Clusters for K = 2; panels: (a) Clusters obtained by FCM for K = 2, (b) Clusters obtained by NMF for K = 2

Figure 3.3: Comparison of Clusters for K = 3; panels: (a) Clusters obtained by FCM for K = 3, (b) Clusters obtained by NMF for K = 3

Figure 3.4: Demonstration of Clusters obtained by FCM (Cluster #1 to Cluster #12)

Figure 3.5: Demonstration of Clusters obtained by NMFNNLS (Cluster #1 to Cluster #12)

Chapter 4

Bi-clustering of Images and Pixels


One of the main goals in the analysis of image datasets is to identify groups of images, or groups of pixels (regions in images), that exhibit similar expression patterns. A type of image dataset that has been discussed extensively in the literature is the gene expression dataset. In these datasets, every column represents a gene and every row an experimental condition, so that each entry is the expression level of a gene. For the clustering of gene expression data sets, several clustering techniques, such as k-means, self-organizing maps (SOM) or hierarchical clustering, have been extensively applied to identify groups of similarly expressed genes or conditions from gene expression data. Additionally, hierarchical clustering algorithms have also been used to perform two-way clustering analysis in order to discover sets of genes similarly expressed in subsets of experimental conditions, by performing clustering on both genes and conditions separately. The identification of these block structures plays a key role in gaining insight into the biological mechanisms associated with different physiological states, as well as in defining gene expression signatures, i.e., genes that are coordinately expressed in samples related by some identifiable criterion such as cell type, differentiation state, or signaling response.
In the literature there are numerous works related to the clustering of gene-expression images. However, to the knowledge of the writers of this report, no work has yet been reported on the application of NMF to the bi-clustering of olfactory images. The bi-clustering of olfactory images is important because it is believed that it would help us understand olfactory coding. A code is a set of rules by which information is transposed from one form to another. In the case of olfactory coding, it would describe the ways in which information about odorant molecules is transposed into neural responses. The idea is that once we understand that code, we might be able to predict the odorant molecule from the neural response, the neural response from the odorant molecule, and the perception of the odorant molecule from the neural response (Leon and Johnson (2003)).

To approach the problem, standard clustering methods are the first choice that comes to mind. Although standard clustering algorithms have been successfully applied in many contexts, they suffer from two well-known limitations that are especially evident in the analysis of large and heterogeneous collections of gene expression data (Carmona-Saez et al. (2006)) and also of olfactory data:
i) They group images (or pixels) based on global similarities. However, a set of co-regulated images might only be co-expressed in a subset of pixels, and show unrelated, almost independent patterns in the rest. In the same way, related pixels may be characterized by only a small subset of coordinately expressed images.

ii) Standard clustering algorithms generally assign each image to a single cluster. Nevertheless, many images can be involved in different biological processes depending on the cellular requirements and, therefore, they might be co-expressed with different groups of images across different subsets of pixels. Clustering the images into one and only one group might mask the interrelationships between images that are assigned to different clusters but show local similarities.

In the last few years several methods have been proposed to avoid these drawbacks. Among them, bi-clustering algorithms have been presented as an alternative approach to standard clustering techniques to identify local structures from gene expression datasets. These methods perform clustering on genes and conditions simultaneously in order to identify subsets of genes that show similar expression patterns across specific subsets of experimental conditions, and vice versa.

In this research, the NMF method introduced in Chapter 2 is applied to bi-cluster images and pixels at the same time. Bi-clustering of images and pixels helps us to recognize which patterns are repeated over the data set.

The rest of the chapter is organized as follows. First, it is explained how NMF results can be used for bi-clustering. Then, in Section 4.2, a method is introduced for feature selection. With the help of this method, the most effective pixels are chosen; this finally results in bi-cluster patterns. In Section 4.3, a method called Visual Assessment of Cluster Tendency (VAT) is reviewed. This method is used for finding the number of co-clusters. The results of VAT are presented in Section 4.3.2. In Section 4.5, first, NMF is applied to factorize the vectorized images, then the feature selection method is used, and finally the co-clusters are demonstrated and discussed.

4.1 NMF for bi-clustering

The non-negative factorization of Equation 1.3 is applied to perform clustering analysis of a data matrix, following Kim and Park (2007).

As said in Section 1.2, decomposing V into the two matrices W and H is the final goal of a matrix factorization. W is usually called the basis matrix and H the coefficient matrix. In a data matrix V whose rows represent pixels and whose columns represent images, the basis matrix W can be used to divide the pixels into k pixel-clusters and the coefficient matrix H to divide the images into k image-clusters. Typically, pixel i is assigned to pixel-cluster q if W(i, q) is the largest element in W(i, :), and image j is assigned to image-cluster q if H(q, j) is the largest element in H(:, j). In the following section, the feature selection method used for filtering features is explained.
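In code, these two assignment rules are just row-wise and column-wise arg-max operations; the helper below is a hypothetical convenience function, not part of the thesis software.

```python
import numpy as np

def assign_biclusters(W, H):
    """Hard bi-cluster labels from an NMF factorization V ~ WH.
    W: (n_pixels, k) basis matrix, H: (k, n_images) coefficient matrix."""
    pixel_cluster = np.argmax(W, axis=1)   # largest entry in W(i, :)
    image_cluster = np.argmax(H, axis=0)   # largest entry in H(:, j)
    return pixel_cluster, image_cluster
```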

4.2 Feature Selection

Suppose that after applying NMF, W is obtained with m rows and K columns. Every column represents a cluster (basis image) and every row represents a pixel. The problem is to find the most effective pixels in every cluster:

Step 1. First, all the rows of W are normalized. A row contains the data related to the membership of a pixel in the different clusters (images). Normalization of row i is done by the following equation:

W(i, k) = W(i, k) / Σ_{j=1}^{K} W(i, j)    (4.1)

Step 2. Using the following equation, the data are scaled such that a pixel (feature) with a smaller membership value gets a larger negative value. The results are stored in the matrix LogW:

LogW(i, k) = log₂(W(i, k)) · W(i, k)    (4.2)


Step 3. Feature scores are computed using LogW from Step 2. Score(i) represents the score of the i-th feature (pixel), and K is the number of clusters:

Score(i) = 1 + (1 / log₂(K)) Σ_{j=1}^{K} LogW(i, j)    (4.3)

Two extreme examples are presented here to show how the score works. Assume that there are two features with the following membership values in 5 clusters:

F1 = [1, 0, 0, 0, 0]
F2 = [0.2, 0.2, 0.2, 0.2, 0.2]

The score of feature F1 is equal to 1, since:

Score(F1) = 1 + (1 / log₂ 5) (1 · log₂ 1) = 1    (4.4)

The score of feature F2 is equal to 0, since:

Score(F2) = 1 + (1 / log₂ 5) (0.2 log₂ 0.2 + ... + 0.2 log₂ 0.2) = 1 + log₂ 0.2 / log₂ 5 = 0    (4.5)

Step 4. After finding the scores of all features (pixels), the most effective features can be selected. In this research the top 65% of pixels are considered the most effective ones. A short implementation sketch follows.
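The four steps translate directly into a few NumPy lines; the function names and the clipping used to keep the logarithm finite are illustrative additions.

```python
import numpy as np

def feature_scores(W):
    """Entropy-based pixel scores of Eqs. 4.1-4.3. W: (n_pixels, K) basis matrix."""
    K = W.shape[1]
    P = W / (W.sum(axis=1, keepdims=True) + 1e-12)     # Step 1: row normalization
    logW = P * np.log2(np.clip(P, 1e-12, None))        # Step 2: W(i,k) * log2 W(i,k)
    return 1.0 + logW.sum(axis=1) / np.log2(K)         # Step 3: Eq. 4.3

def select_features(W, keep=0.65):
    """Step 4: keep the top fraction of pixels ranked by score."""
    scores = feature_scores(W)
    n_keep = int(np.ceil(keep * len(scores)))
    return np.argsort(scores)[::-1][:n_keep]           # indices of the kept pixels
```

For the two extreme examples above, the first row receives a score of 1 and the second a score of 0, as expected.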

4.3 Choosing the Number of Clusters

In Chapter 3, FCM and NMF were compared according to a cluster validity index introduced by Kim et al. (2004), which is a fuzzy index. The comparison was possible thanks to a fuzzy interpretation of the clusters obtained by NMF: each W(i, j) can be interpreted as the membership degree of pixel i in cluster j (image), in other words the degree to which pixel i is a member of cluster j. The comparison shows that NMF is able to produce more well-separated clusters with a desirable density. However, as Figure 3.1 shows, the index is monotonically increasing and does not give a clue about the number of clusters.

In order to choose the number of clusters, a visual method called Visual Assessment of Cluster Tendency (VAT) is used. The method was originally introduced in Bezdek and Hathaway (2002) and extended later by others, such as Hathaway et al. (2006), Bezdek et al. (2007), Park et al. (2009) and Havens and Bezdek (2011). The variant applied in this research is introduced in Bezdek et al. (2007). In the following, the method is explained briefly.

4.3.1 Visual Assessment of Cluster Tendency (VAT)

We have an m × n matrix D, and assume that its entries correspond to pairwise dissimilarities between m row objects O_r and n column objects O_c, which, taken together (as a union), comprise a set O of N = m + n objects. Bezdek et al. (2007) develop a new visual approach that applies to four different cluster assessment problems associated with O. The problems are the assessment of cluster tendency:

P1) amongst the row objects O_r;
P2) amongst the column objects O_c;
P3) amongst the union of the row and column objects O_r ∪ O_c; and
P4) amongst unions of row and column objects that contain at least one object of each type (co-clusters).

The basis of the method is to regard D as a subset of known values that is part of a larger, unknown N × N dissimilarity matrix, and then impute the missing values from D. This results in estimates for three square matrices (D_r, D_c, D_{r∪c}) that can be visually assessed for clustering tendency using the previous VAT or sVAT algorithms. The output from the assessment of D_{r∪c} ultimately leads to a rectangular coVAT image which exhibits clustering tendencies in D.
Introduction

We consider a type of preliminary data analysis related to the pattern recognition problem of clustering. Clustering or cluster analysis is the problem of partitioning a set of objects O = {O_1, ..., O_n} into c self-similar subsets based on available data and some well-defined measure of (cluster) similarity (Bezdek and Hathaway (2002)). All clustering algorithms will find an arbitrary number of clusters (anywhere in 1 ≤ c ≤ n), even if no actual clusters exist. Therefore, a fundamentally important question to ask before applying any particular (and potentially biasing) clustering algorithm is: are clusters present at all?

The problem of determining whether clusters are present, as a step prior to actual clustering, is called the assessment of clustering tendency. Various formal (statistically based) and informal techniques for tendency assessment are discussed in Jain (1988) and Everitt (1978). None of the existing approaches is completely satisfactory (nor will they ever be). The main purpose of this part of the research is to add a simple and intuitive visual approach to the existing repertoire of tendency assessment tools. The visual approach for assessing cluster tendency can be used in all cases involving numerical data. From now on, VAT is used as an acronym for Visual Assessment of Tendency. The VAT approach presents pairwise dissimilarity information about the set of objects O = {O_1, ..., O_n} as a square digital image with n² pixels, after the objects are suitably reordered so that the image is better able to highlight potential cluster structure (Bezdek and Hathaway (2002)).
Data Representation
There are two common data representations of O upon which clustering can be based:
Object data representation: When each object in O is represented by a (column)
vector x in <n , the set X = {x1 , ..., xn } <n is called an object data representation of
O. The k th component of the ith feature vector (xki ) is the value of the k th feature (e.g.,
height, weight, length, etc.) of the ith object. It is in this data space that practitioners
sometimes seek geometrical descriptors of the clusters.
Relational data representation: When each pair of objects in O is represented by
a relationship, then we have relational data. The most common case of relational data
is when we have (a matrix of) dissimilarity data, say R = [Rij ] , where Rij is the pair
wise dissimilarity (usually a distance) between objects oi and oj , for 1 i, j n. More
generally, R can be a matrix of similarities based on a variety of measures.
Dissimilarity Images

Let R be an n × n dissimilarity matrix corresponding to the set O = {o_1, ..., o_n}. R satisfies the following (metric) conditions for all 1 ≤ i, j ≤ n:

R_ij ≥ 0
R_ij = R_ji
R_ii = 0

R can be displayed as an intensity image I, which is called a dissimilarity image. The intensity or gray level g_ij of pixel (i, j) depends on the value of R_ij. The value R_ij = 0 corresponds to g_ij = 0 (pure black); the value R_ij = R_max, where R_max denotes the largest dissimilarity value in R, gives g_ij = R_max (pure white). Intermediate values of R_ij produce pixels with intermediate levels of gray in a set of gray levels G = {G_1, ..., G_m}. A dissimilarity image is not useful until it is ordered by a suitable procedure. We attempt to reorder the objects {o_1, o_2, ..., o_n} as {o_{k_1}, o_{k_2}, ..., o_{k_n}} so that, to whatever degree possible, if k_i is near k_j, then o_{k_i} is similar to o_{k_j}. The corresponding ordered dissimilarity image (ODI) will often indicate cluster tendency in the data by dark blocks of pixels along the main diagonal. The ordering is accomplished by processing elements of the dissimilarity matrix R (rather than using the objects or object data directly).

The images shown below use 256 equally spaced gray levels, with G_1 = 0 (black) and G_m = R_max (white). The displayed gray level of pixel (i, j) is the level g_ij ∈ G that is closest to R_ij. The example is from Bezdek and Hathaway (2002). The corresponding dissimilarity matrix is shown below:

R =
[ 0     0.73  0.19  0.71  0.16
  0.73  0     0.59  0.12  0.78
  0.19  0.59  0     0.55  0.19
  0.71  0.12  0.55  0     0.74
  0.16  0.78  0.19  0.74  0    ]

Figure 4.1: A simple dissimilarity image before and after ordering with VAT; panels: (a) a simple dissimilarity image, (b) the ordered dissimilarity image

VAT & coVAT Algorithms

The VAT algorithm is given below; introducing it is necessary since it is used inside coVAT.

Input: An M × M matrix of pairwise dissimilarities R = [r_ij] satisfying, for all 1 ≤ i, j ≤ M: r_ij = r_ji, r_ij ≥ 0, r_ii = 0.

Step 1.
Set I = ∅; J = {1, 2, ..., M}; P = (0, 0, ..., 0).
Select (i, j) ∈ arg max_{p∈J, q∈J} {r_pq}.
Set P(1) = j; replace I ← I ∪ {j} and J ← J \ {j}.

Step 2.
For t = 2, ..., M:
Select (i, j) ∈ arg min_{p∈I, q∈J} {r_pq}.
Set P(t) = j; replace I ← I ∪ {j} and J ← J \ {j}.
Next t.

Step 3. Form the ordered dissimilarity matrix R̃ = [r̃_ij] = [r_{P(i)P(j)}], for 1 ≤ i, j ≤ M.

Output: Image I(R̃), scaled so that max r̃_ij corresponds to white and min r̃_ij to black. A NumPy sketch of this ordering is given below.
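The sketch follows the three steps literally and is only an illustration, not the implementation used for the experiments.

```python
import numpy as np

def vat_order(R):
    """VAT ordering of a square dissimilarity matrix R; returns the
    permutation P and the reordered matrix of Step 3."""
    M = R.shape[0]
    J = set(range(M))
    # Step 1: start from a column of the largest dissimilarity.
    _, j = np.unravel_index(np.argmax(R), R.shape)
    P, I = [j], {j}
    J.remove(j)
    # Step 2: repeatedly pull in the unselected object closest to the selected set.
    for _ in range(1, M):
        Il, Jl = sorted(I), sorted(J)
        sub = R[np.ix_(Il, Jl)]
        _, q = np.unravel_index(np.argmin(sub), sub.shape)
        nxt = Jl[q]
        P.append(nxt)
        I.add(nxt)
        J.remove(nxt)
    P = np.array(P)
    return P, R[np.ix_(P, P)]   # Step 3: reorder rows and columns by P
```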
Before explaining the coVAT algorithm, it is necessary to introduce D_{r∪c}. D_{r∪c} is composed of D, D_r and D_c as shown below:

D_{r∪c} = [ D_r   D
            Dᵀ    D_c ]    (4.6)

If we define d(o_i, o_j) as the distance between objects o_i and o_j, then D_r collects the pairwise distances among the m row objects, D_c collects the pairwise distances among the n column objects, and D holds the rectangular distances between row objects and column objects.

Looking at D_{r∪c}: D_r has dimension m × m, where m is the number of row objects; D_c has dimension n × n, where n is the number of column objects; and D_{r∪c} has dimension (m + n) × (m + n). It should also be noted that D_r and D_c are unknown and are calculated through the following estimations:

[D_r]_ij = α_r ‖d_{i,·} − d_{j,·}‖    for 1 ≤ i, j ≤ m    (4.7)

[D_c]_ij = α_c ‖d_{·,i} − d_{·,j}‖    for 1 ≤ i, j ≤ n    (4.8)

where α_r and α_c are scale factors; any other norm can also be used. Finally, with D_r and D_c, D_{r∪c} can be assembled. The coVAT algorithm (Bezdek et al. (2007)) is as follows; as can be seen, it uses the VAT algorithm (Bezdek and Hathaway (2002)) inside:
Input: An m × n matrix of pairwise dissimilarities D = [d_ij] satisfying, for all 1 ≤ i ≤ m and 1 ≤ j ≤ n: d_ij ≥ 0.

Step 1. Build estimates of D_r and D_c using Equations 4.7 and 4.8.

Step 2. Build the estimate of D_{r∪c} using Equation 4.6.

Step 3. Run VAT on D_{r∪c}, and save the permutation array P_{r∪c} = (P(1), ..., P(m+n)). Initialize rc = cc = 0; RP = CP = ().

Step 4.
For t = 1, ..., m + n:
  If P(t) ≤ m
    rc = rc + 1        % rc = row component counter
    RP(rc) = P(t)      % RP = row indices
  Else
    cc = cc + 1        % cc = column component counter
    CP(cc) = P(t) − m  % CP = column indices
  End If
Next t

Step 5. Form the coVAT-ordered rectangular dissimilarity matrix D̃ = [d̃_ij] = [d_{RP(i) CP(j)}] for 1 ≤ i ≤ m and 1 ≤ j ≤ n.

Output: Rectangular image I(D̃), scaled so that max d̃_ij corresponds to white and min d̃_ij to black. A sketch of the assembly of D_{r∪c} is given below.
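Steps 1 and 2 of coVAT, i.e. assembling D_{r∪c} from the rectangular matrix D, can be sketched as follows; the Euclidean norm and the scale-factor defaults mirror Equations 4.6 to 4.8, and the resulting matrix can be passed to the VAT ordering sketched earlier. The function name is hypothetical.

```python
import numpy as np

def build_d_union(D, alpha_r=0.1, alpha_c=0.1):
    """Assemble the (m+n) x (m+n) matrix D_{r u c} of Eq. 4.6, estimating
    D_r and D_c from the rows and columns of D (Eqs. 4.7 and 4.8)."""
    # Eq. 4.7: scaled Euclidean distances between the rows of D.
    Dr = alpha_r * np.linalg.norm(D[:, None, :] - D[None, :, :], axis=2)
    # Eq. 4.8: scaled Euclidean distances between the columns of D.
    Dc = alpha_c * np.linalg.norm(D.T[:, None, :] - D.T[None, :, :], axis=2)
    return np.block([[Dr, D], [D.T, Dc]])
```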
Advantages and Disadvantages

Advantages:

VAT can be applied to datasets with missing data. If the original data have missing components (are incomplete), then any existing data imputation scheme can be used to fill in the missing part of the data prior to processing. The ultimate purpose of imputing data here is simply to get a very rough picture of the cluster tendency in O. Consequently, sophisticated imputation schemes, such as those based on the expectation-maximization (EM) algorithm, are unnecessarily expensive in both complexity and computation time.

The VAT tool is widely applicable because it displays an ordered form of dissimilarity data, which itself can always be obtained from the original data for O. We must consider the fact that almost all other cluster validity indices (CVIs), like Xie and Beni (1991), that are used for determining the number of clusters can only be applied after the clustering is done. So, usually, clustering is repeated with different values for the number of clusters and the obtained CVIs are compared to choose the best. This approach is time-consuming and in fact can be impractical for large data sets. Also, as mentioned in the introduction, these approaches are biased, since they assume that the data can be clustered without considering the possibility that there may be no clusters in the data at all. Some examples are also shown in Bezdek et al. (2007), Part IV, Numerical Examples.

VAT needs essentially no parameter optimization because it has practically no parameters. This is a great advantage over other methods that usually require a lot of effort for parameter optimization.

The usefulness of a dissimilarity image for visually assessing cluster tendency depends crucially on the ordering of the rows and columns of R. The VAT ordering algorithm can be implemented with O(M²) time complexity and is similar to Prim's algorithm for finding a minimal spanning tree (MST) of a weighted graph. As M grows, the complexity grows nonlinearly. Are there effective size limits for this approach? No: if D is large, then the square dissimilarity matrix processing of D_r, D_c and D_{r∪c} that is required to apply coVAT to D can be done using sVAT, the scalable version of VAT (Hathaway et al. (2006)). The bottom line is that this general approach works even for very large (unloadable) data sets (Bezdek et al. (2007)).

By examining the permutation arrays and correlating their entries with the dark blocks produced by VAT-coVAT, a crude clustering of the data is produced. At a minimum, the visually identified clusters could be used as a good initialization of an iterative dissimilarity-clustering algorithm such as the non-Euclidean relational fuzzy (NERF) c-means algorithm introduced in Hathaway and Bezdek (1994).

Disadvantages:

The coVAT image does a very good job of representing co-cluster structure, but its computation is expensive; is there a cheaper, direct route from the original D to a usefully reordered D̃?

The image derived from VAT(D_{r∪c}) for the co-clustering problem (P4) does not usually allow us to visually distinguish between pure clusters and co-clusters. We get this information for coVAT by examining the permutation array, but could it be better represented in the image?

The estimation of those parts of D_{r∪c} that are unknown is done with the Euclidean norm. The method is very simple and can sometimes be meaningless because of the Euclidean norm. Many experiments could be conducted to see the effects of other norms. Also, more advanced models, other than the costly Expectation-Maximization (EM) models, could be used to deliver a better approximation. The accuracy of the approximation is critical in the model, since the missing parts of D_{r∪c} are filled in through this approximation. The authors propose the triangle-inequality-based approximation (TIBA), discussed in Hathaway and Bezdek (2002), as an alternative.

4.3.2 VAT results

The scale factors α_r and α_c are set to 0.1. In order to decrease the computational cost of VAT, first NMF is applied to decompose the images (V) into the two matrices W and H; then VAT is applied to evaluate only W. The number of clusters in NMF is set to 20, which is a relatively large number. We suppose that the NMF results contain some bi-clusters whose number we do not know, and that by applying VAT to W the number of bi-clusters can be estimated. The growth of the size of D_{r∪c} is shown in Figure 4.2. As Figure 4.2 shows, with 472 images and 3520 pixels per image, D_{r∪c} would be as big as a 16-million-pixel image, and processing such an image in VAT is very difficult. In addition, our experiments show that finding patterns in such big images is extremely difficult and almost impossible. These are the main reasons why VAT is applied to the output of NMF.

The results are shown in Figure 4.3 and Figure 4.4. To increase the contrast of the picture and make it clearer, the coVAT image is filtered; the resulting image is shown in Figure 4.5. As Figure 4.5 shows, in the last 7 rows there is some strong evidence of co-clusters. There are also some weak patterns in rows 7 to 13. It is necessary to do some experiments to determine whether more than 7 clusters should be considered or not. In the next section, this question is answered by analyzing the stability of NMF.

Figure 4.2: Growth of Complexity of VAT

Figure 4.3: VAT results for K=20 in NMF (panels: Dr for P1, Dc for P2, Dunion for P3, coVAT Image: P4)

Figure 4.4: coVAT results for K=20 in NMF (coVAT Image: P4)

Figure 4.5: coVAT high contrast results for K=20 in NMF

K    CVI Mean   CVI Std   RMSE Mean   RMSE Std
2    1754.19    27.02     0.1392      7.88e-14
3    1800.75    71.45     0.1286      4.55e-07
4    1811.75    96.93     0.1216      1.17e-05
5    1736.70    54.34     0.1159      2.29e-05
6    1733.19    48.74     0.1127      2.56e-05
7    1747.37    61.94     0.1101      2.94e-05
8    1819.98    74.13     0.1079      2.56e-05
9    1811.27    66.34     0.1059      2.67e-05
10   1811.48    54.27     0.1043      2.41e-05
11   1793.28    53.77     0.1028      4.41e-05
12   1826.54    47.60     0.1015      4.36e-05
13   1829.09    50.05     0.1004      3.29e-05
14   1844.67    55.10     0.0992      2.87e-05

Table 4.1: Stability Analysis for K = 2 to 14: Mean and Standard Deviation

4.4 Stability Analysis of NMF

As shown in Section 4.3.2, with the help of VAT the number of bi-clusters was estimated to be
around 7. However, there was also weak evidence of some other possible bi-clusters. In this
section, the stability of NMF is evaluated in order to decide on the number of bi-clusters. To
do so, NMF is run 100 times for every number of clusters from 2 to 14. The results are
summarized in Table 4.1, Figure 4.6 and Figure 4.7.
As Figure 4.6 and Figure 4.7 show, the minimum mean CVI over the 100 runs is achieved for
K = 6. K = 6 also has the second lowest standard deviation after K = 2; K = 2 has the
minimum standard deviation because the algorithm converges very quickly to the same local
minimum, which reduces the variability between runs.
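The stability experiment can be reproduced along the following lines. This is a hedged sketch rather than the original script: V is assumed to be the non-negative data matrix, scikit-learn's NMF stands in for the NMF of Kim and Park (2007), and only the RMSE criterion is shown (the CVI of Section 3.2 would be computed analogously from the resulting partitions).

```python
import numpy as np
from sklearn.decomposition import NMF

def stability(V, k_range=range(2, 15), n_runs=100, max_iter=1000):
    """For every K, repeat NMF with random seeds and record the spread of the
    reconstruction error, as summarized in Table 4.1."""
    summary = {}
    for K in k_range:
        rmse = []
        for seed in range(n_runs):
            model = NMF(n_components=K, init="random",
                        random_state=seed, max_iter=max_iter)
            W = model.fit_transform(V)
            H = model.components_
            rmse.append(np.sqrt(np.mean((V - W @ H) ** 2)))
        summary[K] = (float(np.mean(rmse)), float(np.std(rmse)))
    return summary
```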

4.5 Bi-clustering of Images and Pixels

With the number of clusters fixed at 6, images and pixels can be clustered by NMF.
Figure 4.8(a) shows the cluster centers, corresponding to the columns of W, and Figure 4.8(b)
shows the components. The components are obtained simply by setting the maximum value in
each row of W (W(i, :)) equal to one and the remaining entries to zero. The components also
clearly illustrate the additivity of NMF: every image can be reconstructed as a combination of
the basis components in W.
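The components of Figure 4.8(b) follow directly from W by the rule just described; a short sketch (with illustrative names) is:

```python
import numpy as np

def hard_components(W):
    # Keep only the largest entry in every row of W (set it to one, zero the
    # rest), so each row of W is assigned to exactly one of the K components.
    C = np.zeros_like(W)
    C[np.arange(W.shape[0]), np.argmax(W, axis=1)] = 1.0
    return C
```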
In order to visualize the matrix factorization, the feature selection method explained in
Section 4.2 is applied to the normalized data set: 30% of the pixels (features) are kept and the
rest are removed. Figure 4.9 shows the resulting factorization of V into W and H.
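For the visualization, only a fraction of the pixels is retained. The criterion of Section 4.2 is not repeated here; the sketch below uses a simple variance ranking purely as a stand-in to show the mechanics of keeping 30% of the features.

```python
import numpy as np

def keep_fraction_of_pixels(V, fraction=0.30):
    # V: pixels x images. Rank pixels (rows) by a score -- here their variance
    # across images, as a stand-in for the criterion of Section 4.2 -- and keep
    # the top fraction of them.
    scores = V.var(axis=1)
    k = int(round(fraction * V.shape[0]))
    keep = np.sort(np.argsort(scores)[::-1][:k])   # indices of retained pixels
    return V[keep, :], keep
```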

4.6 Cross-Validation

Cross-validation, sometimes called rotation estimation, is a technique for assessing how the
results of a statistical analysis will generalize to an independent data set. It is mainly used in
settings where the goal is prediction, and one wants to estimate how accurately a predictive
model will perform in practice. One round of cross-validation involves partitioning a sample
of data into complementary subsets, performing the analysis on one subset (called the training
set), and validating the analysis on the other subset (called the validation set or testing set). To
reduce variability, multiple rounds of cross-validation are performed using different partitions,
and the validation results are averaged over the rounds.

Figure 4.6: Mean and Standard Deviation of CVI for K = 2 to 14 ((a) Mean; (b) Standard Deviation)
To perform cross-validation, the whole data set is randomly permuted 5 times and, for every
permutation, a 10-fold cross-validation is carried out. The results are summarized in Table 4.2.
In addition, five random folds are chosen, one from each permutation, and the corresponding
clusters are depicted in Figure 4.10. According to Table 4.2 and Figure 4.10, even when a part
of the data set is removed, NMF is still able to find very similar clusters.
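A sketch of this protocol is given below: five random permutations of the images, a 10-fold split of each permutation, and an NMF with K = 6 fitted on the training images of every fold, so that the resulting basis W and the fold's RMSE can be compared across folds. Variable names, the use of scikit-learn, and the assumption that images are the columns of V are ours.

```python
import numpy as np
from sklearn.decomposition import NMF

def cross_validate(V, n_perms=5, n_folds=10, K=6):
    # V: pixels x images; folds are taken over the images (columns).
    rng = np.random.default_rng(0)
    n_images = V.shape[1]
    fold_errors, fold_bases = [], []
    for _ in range(n_perms):
        perm = rng.permutation(n_images)
        for test_idx in np.array_split(perm, n_folds):
            train_idx = np.setdiff1d(perm, test_idx)
            model = NMF(n_components=K, init="random", max_iter=1000, random_state=0)
            W = model.fit_transform(V[:, train_idx])   # basis images for this fold
            H = model.components_
            rmse = np.sqrt(np.mean((V[:, train_idx] - W @ H) ** 2))
            fold_errors.append(rmse)
            fold_bases.append(W)
    return fold_errors, fold_bases
```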

Figure 4.7: Mean and Standard Deviation of RMSE for K = 2 to 14 ((a) Mean; (b) Standard Deviation)

Figure 4.8: Clusters and Components for K = 6 ((a) Clusters; (b) Components)

Figure 4.9: V ≈ WH

Permutation     RMSE           CVI
1               0.112625       1742.4
2               0.112645       1779.16
3               0.112663       1741.94
4               0.112646       1745.64
5               0.112651       1768.51
Mean            0.112646       1755.53
Variance        1.90826e-10    295.44
Std             1.38139e-05    17.19

Table 4.2: Cross validation of NMF with K = 6

Figure 4.10: W: Five random sets of clusters from five permutations

Chapter 5

Conclusions and Future Researches


The available dataset consists of 472 images of 80 × 44 pixels. Every image, showing responses
in the olfactory bulb, corresponds to the 2-DG uptake in the OB of a rat in response to a
particular chemical substance. This research takes a completely new approach toward analyzing
and understanding the nature of the olfactory bulb. Olfaction is the sense of smell, and the
olfactory bulb is a structure of the vertebrate forebrain involved in olfaction, the perception of
odors. The analysis of olfactory images is important because it is believed to help us understand
olfactory coding. A code is a set of rules by which information is transposed from one form to
another. In the case of olfactory coding, it would describe the ways in which information about
odorant molecules is transposed into neural responses. The idea is that once we understand
that code, we might be able to predict:
- the odorant molecule from the neural response,
- the neural response from the odorant molecule and,
- the perception of the odorant molecule from the neural response.
The purpose of this research is to bi-cluster olfactory images and pixels simultaneously and,
finally, to find pixels (regions of the images) that respond in a similar manner to a selection of
chemicals. In order to find these regions, Non-negative Matrix Factorization (NMF) is applied.
Since its introduction by Paatero and Tapper (1994), NMF has been applied successfully in
different areas including chemometrics, face recognition, multimedia data analysis, text mining,
DNA gene expression analysis, protein interaction analysis and many others. To the knowledge
of the authors of this report, no work has yet been reported on bi-clustering of olfactory images.
The most important results and conclusions derived from this research are as follows:
- After comparing different NMF models, it turned out that sparse models perform better. This
is probably due to the sparse nature of the data set. In the future, the sparsity of the data set
could be evaluated and additional sparse NMF methods should be tried. This sparsity can also
be seen in the very different shapes of the clusters obtained by NMF.
- FCM, a well-known method in the fuzzy clustering literature, is applied for clustering the
images and the results are compared to NMF. NMF shows a superior performance in all the
experiments, and its advantage grows with the number of clusters. With a large number of
clusters, NMF can find very heterogeneous patterns while FCM completely fails to do so. It
seems that classical methods such as FCM, which optimize a global objective function, are not
a good choice for clustering olfactory images.


- Since the problem is a blind, and therefore unsupervised, clustering problem, the clustering
tendency should be evaluated before clustering; that is, before applying any clustering method
one should ask whether there are any clusters at all. Visual Assessment of Cluster Tendency
(VAT), introduced by Bezdek and Hathaway (2002), is used to evaluate the cluster tendency.
This method has several advantages, the most important being that it works directly on the
data rather than on the clusters obtained by a clustering method, and that it is parameter-free
and needs no parameter setting. However, running VAT is very costly: every run, with W
rather than the whole data set V as the input, takes around four hours. Other variations of
VAT, such as scalable and improved VAT (Havens and Bezdek (2011)), could be considered in
the future. It should also be mentioned that, after obtaining the clusters, their interpretability
must be evaluated by an expert in the context of the problem.
- After obtaining the clusters (W) and H by NMF, V ≈ WH can be rewritten column by column
as v ≈ Wh, where v and h are the corresponding columns of V and H:

    v ≈ W h,    with v of size n × 1, W of size n × K, and h of size K × 1        (5.1)
This lets us reconstruct every olfactory image as a combination of clusters. This additivity
property of NMF allows us to identify the most influential regions in the images, which
ultimately leads to a better understanding of olfaction. The development of new additive
clustering methods such as NMF could be considered as future work.
- Since NMF, and also FCM, produce non-global solutions, different clusters may be obtained
in every run. The sensitivity of NMF should be evaluated more extensively using methods such
as cross-validation.

Appendix A

Experiments and Specifications


In this appendix, all the experiments and their specifications are listed.

A.1 Chapter 2

A.1.1 Comparison of Different Algorithms of NMF

Acronym      Model
convexnmf    Convex and semi-nonnegative matrix factorizations (Ding et al. (2010))
orthnmf      Orthogonal nonnegative matrix t-factorizations (Ding et al. (2006))
nmfnnls      Sparse non-negative matrix factorizations via alternating non-negativity
             constrained least squares (Kim and Park (2008) and Kim and Park (2007))
nmfrule      Basic NMF (Lee and Seung (1999) and Lee and Seung (2001))
nmfmult      NMF with Multiplicative Update Formula (Pauca et al. (2006))
nmfals       NMF with Auxiliary Constraints (Berry et al. (2007))

All initial seeds are created randomly and the maximum number of iterations is 1000.

A.2 Chapter 3

A.2.1 Comparison of FCM and NMF

Parameter                  Model                          Value
m (Fuzzification Factor)   FCM                            1.1
Norm                       FCM                            Euclidean
Number of Clusters (K)     FCM and NMF                    2 to 45
Initial Seeds              FCM and NMF                    Random
Maximum Iteration          NMF (Kim and Park (2007))      1000

A.3 Chapter 4

A.3.1 coVAT

Parameter                  Model                          Value
Scaling Factor r           coVAT                          0.1
Scaling Factor c           coVAT                          0.1
Norm                       coVAT (Bezdek et al. (2007))   Euclidean

A.3.2 Bi-clustering and Feature Selection

Parameter                  Model                          Value
Filter Percentage          Feature Selection              30%
Number of Clusters (K)     NMF                            6
Maximum Iteration          NMF                            1000
Initial Seeds              NMF (Kim and Park (2007))      Random Positive Seeds

A.3.3 Stability Analysis

Parameter                  Model                          Value
Number of Clusters (K)     NMF                            2 to 14
Initial Seeds              NMF                            Random Positive Seeds
Maximum Iteration          NMF                            1000
Number of Repetition       NMF (Kim and Park (2007))      100

A.3.4 Cross-validation

Parameter                  Model                          Value
Number of Clusters (K)     NMF                            6
Initial Seeds              NMF                            Random Positive Seeds
Maximum Iteration          NMF                            1000
Number of Folds            NMF                            10
Number of Permutations     NMF (Kim and Park (2007))      5

Bibliography
Batcha, N. and A. Zaki (2010). Preliminary study of a new approach to NMF based text
summarization fused with anaphora resolution. In Knowledge Discovery and Data Mining, 2010.
WKDD '10. Third International Conference on, pp. 367-370.

Benetos, E. and C. Kotropoulos (2010). Non-negative tensor factorization applied to music genre
classification. Audio, Speech, and Language Processing, IEEE Transactions on 18(8), 1955-1967.

Berry, M. W., M. Browne, A. N. Langville, V. P. Pauca, and R. J. Plemmons (2007). Algorithms
and applications for approximate nonnegative matrix factorization. Computational Statistics &
Data Analysis 52(1), 155-173.

Bezdek, J. and R. Hathaway (2002). VAT: a tool for visual assessment of (cluster) tendency. 3,
2225-2230.

Bezdek, J., R. Hathaway, and J. Huband (2007). Visual assessment of clustering tendency for
rectangular dissimilarity matrices. Fuzzy Systems, IEEE Transactions on 15(5), 890-903.

Biederman, I. (1987). Recognition-by-components: A theory of human image understanding.
Psychological Review 94(2), 115-147.

Carmona-Saez, P., R. Pascual-Marqui, F. Tirado, J. Carazo, and A. Pascual-Montano (2006).
Biclustering of gene expression data by non-smooth non-negative matrix factorization. BMC
Bioinformatics 7, 118. doi:10.1186/1471-2105-7-78.

Chueinta, W., P. K. Hopke, and P. Paatero (2000). Investigation of sources of atmospheric
aerosol at urban and suburban residential areas in Thailand by positive matrix factorization.
Atmospheric Environment 34(20), 3319-3329.

Ding, C., T. Li, and M. Jordan (2010). Convex and semi-nonnegative matrix factorizations.
Pattern Analysis and Machine Intelligence, IEEE Transactions on 32(1), 45-55.

Ding, C., T. Li, W. Peng, and H. Park (2006). Orthogonal nonnegative matrix t-factorizations
for clustering. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge
discovery and data mining, KDD '06, New York, NY, USA, pp. 126-135. ACM.

Everitt, B. (1978). Graphical techniques for multivariate data. London: Heinemann Educational.

Greene, D., G. Cagney, N. Krogan, and P. Cunningham (2008). Ensemble non-negative matrix
factorization methods for clustering protein-protein interactions. Bioinformatics 24(15),
1722-1728.

Hathaway, R. J. and J. C. Bezdek (1994). NERF c-means: Non-Euclidean relational fuzzy
clustering. Pattern Recognition 27(3), 429-437.

Hathaway, R. J. and J. C. Bezdek (2002). Clustering incomplete relational data using the
non-Euclidean relational fuzzy c-means algorithm. Pattern Recognition Letters 23(1-3), 151-160.

Hathaway, R. J., J. C. Bezdek, and J. M. Huband (2006). Scalable visual assessment of cluster
tendency for large data sets. Pattern Recognition 39(7), 1315-1324.

Havens, T. and J. Bezdek (2011). An efficient formulation of the improved visual assessment of
cluster tendency (iVAT) algorithm. Knowledge and Data Engineering, IEEE Transactions on
PP(99), 1.

Jain, A. (1988). Algorithms for clustering data. Englewood Cliffs, N.J.: Prentice Hall.

Kim, H. and H. Park (2007). Sparse non-negative matrix factorizations via alternating
non-negativity-constrained least squares for microarray data analysis. Bioinformatics 23(12),
1495-1502.

Kim, H. and H. Park (2008). Nonnegative matrix factorization based on alternating
nonnegativity constrained least squares and active set method. SIAM J. Matrix Anal. Appl. 30,
713-730.

Kim, Y.-I., D.-W. Kim, D. Lee, and K. H. Lee (2004). A cluster validation index for GK cluster
analysis based on relative degree of sharing. Inf. Sci. Inf. Comput. Sci. 168, 225-242.

Lee, D. D. and H. S. Seung (1999). Learning the parts of objects by non-negative matrix
factorization. Nature 401(6755), 788-791.

Lee, D. D. and H. S. Seung (2001). Algorithms for non-negative matrix factorization. In NIPS,
pp. 556-562. MIT Press.

Leon, M. and B. A. Johnson (2003). Olfactory coding in the mammalian olfactory bulb. Brain
Research Reviews 42(1), 23-32.

Lin, C.-J. (2007). Projected gradient methods for nonnegative matrix factorization. Neural
Computation 19(10), 2756-2779.

Paatero, P. and U. Tapper (1994). Positive matrix factorization: A non-negative factor model
with optimal utilization of error estimates of data values. Environmetrics 5(2), 111-126.

Park, L., J. Bezdek, and C. Leckie (2009). Visualization of clusters in very large rectangular
dissimilarity data. In Autonomous Robots and Agents, 2009. ICARA 2009. 4th International
Conference on, pp. 251-256.

Pauca, V. P., J. Piper, and R. J. Plemmons (2006). Nonnegative matrix factorization for spectral
data analysis. Linear Algebra and its Applications 416(1), 29-47. Special issue devoted to the
Haifa 2005 conference on matrix theory.

Ullman, S. (2000). High-Level Vision: Object Recognition and Visual Cognition (1st ed.). The
MIT Press.

Van Benthem, M. H. and M. R. Keenan (2004). Fast algorithm for the solution of large-scale
non-negativity-constrained least squares problems. Journal of Chemometrics 18(10), 441-450.

Xie, X. and G. Beni (1991). A validity measure for fuzzy clustering. Pattern Analysis and
Machine Intelligence, IEEE Transactions on 13(8), 841-847.

Xu, W., X. Liu, and Y. Gong (2003). Document clustering based on non-negative matrix
factorization. In Proceedings of the 26th annual international ACM SIGIR conference on
Research and development in information retrieval, SIGIR '03, New York, NY, USA, pp.
267-273. ACM.

Yuan, Z. and E. Oja (2005). Projective nonnegative matrix factorization for image compression
and feature extraction. In H. Kalviainen, J. Parkkinen, and A. Kaarna (Eds.), Image Analysis,
Volume 3540 of Lecture Notes in Computer Science, pp. 333-342. Springer Berlin / Heidelberg.
doi:10.1007/11499145_35.
