Milad Avazbeigi
JULY 2011
UNIVERSIDAD DE OVIEDO
TUTOR/ADVISOR:
Enrique H. Ruspini
Francisco Ortega Fernandez
Contents

Preface
1 Introduction
    1.1 Historical Background
    1.2 NMF, PCA and VQ
    1.3 Non-negative Matrix Factorization (NMF)
        1.3.1 A Simple NMF and Its Solution
        1.3.2 Uniqueness

[The remaining entries of the table of contents did not survive extraction.]
List of Figures

[The figure list (entries 2.1-2.3, 3.1-3.5 and 4.1-4.10) did not survive extraction; only caption fragments mentioning FCM, NMFNNLS and VAT remain.]
Preface
With the rapid technological advances of the last two decades, a huge mass of biological data has been gathered through new tools and through experiments enabled by new equipment and methods. Intelligent analysis of this mass of data is necessary to extract knowledge about the underlying biological mechanisms that produce it. An example is the application of statistical methods to the analysis of image datasets, such as gene-expression images. One of the main goals in the analysis of image datasets is to identify groups of images, or groups of pixels (regions in images), that exhibit similar expression patterns.

In the literature there have been numerous works on the clustering or bi-clustering of gene-expression images. However, to the knowledge of the writers of this report, no work has yet been reported on the bi-clustering of olfactory images. Olfaction is the sense of smell, and the olfactory bulb is a structure of the vertebrate forebrain involved in olfaction, the perception of odors. The bi-clustering of olfactory images is important because it is believed that it would help us to understand the olfactory code. A code is a set of rules by which information is transposed from one form to another. In the case of olfactory coding, it would describe the ways in which information about odorant molecules is transposed into neural responses. The idea is that once we understand that code, we might be able to predict the odorant molecule from the neural response, the neural response from the odorant molecule, and the perception of the odorant molecule from the neural response.

The available dataset consists of 472 images of 80 × 44 pixels. Every image has already been segmented from the background. Every image corresponds to the 2-DG uptake in the olfactory bulb (OB) of a rat in response to a particular chemical substance. The purpose of this research is to apply Non-negative Matrix Factorization (NMF) to bi-cluster images and pixels simultaneously and finally to find pixels (regions of images) that respond in a similar manner to a selection of chemicals.
Chapter 1
Introduction
1.1
Historical Background
Matrix factorization is a unifying theme in numerical linear algebra. A wide variety of matrix
factorization algorithms have been developed over many decades, providing a numerical platform
for matrix operations such as solving linear systems, spectral decomposition, and subspace
identification. Some of these algorithms have also proven useful in statistical data analysis,
most notably the singular value decomposition (SVD), which underlies principal component
analysis (PCA) (Ding et al. (2010)).
There is psychological and physiological evidence for parts-based representations in the brain (Biederman (1987)), and certain computational theories of object recognition rely on such representations (Ullman (2000)). But little is known about how brains or computers might learn the parts of objects. Non-negative matrix factorization (NMF) is a method that is able to learn parts of images and semantic features of text. This is in contrast to other methods, such as principal component analysis (PCA) and vector quantization (VQ), that learn holistic, not parts-based, representations. Non-negative matrix factorization is distinguished from the other methods by its use of non-negativity constraints. These constraints lead to a parts-based representation because they allow only additive, not subtractive, combinations (Lee and Seung (1999)).
Since the introduction of NMF by Paatero and Tapper (1994), the scope of research on NMF
has grown rapidly. NMF has been shown to be useful in a variety of applied
settings, including:
Chemometrics (Chueinta et al. (2000))
Pattern recognition (Yuan and Oja (2005))
Multimedia data analysis (Benetos and Kotropoulos (2010))
Text mining (Batcha and Zaki (2010), Xu et al. (2003))
DNA gene expression analysis (Kim and Park (2007))
Protein interaction (Greene et al. (2008))
This chapter is organized as follows. First, in Section 1.2, NMF, PCA and VQ, which are the most famous matrix factorization methods, are explained briefly and compared to each other. Then, in Section 1.3, a simple NMF model is presented. In the same section, the uniqueness of NMF's solutions is also discussed.
1.2 NMF, PCA and VQ
Factorizing a data matrix is extensively studied in Numerical Linear Algebra. There are different
methods for factorizing a data matrix:
Principal Components Analysis (PCA)
Vector Quantization (VQ)
Non-negative Matrix Factorization (NMF)
From another point of view, Principal Components Analysis and Vector Quantization can also be seen as methods for unsupervised learning.
The formulation of the problem in all three methods (PCA, VQ and NMF) is as follows. Given a matrix V, the goal is to find matrix factors W and H such that:

    V ≈ WH        (1.1)

or, entry by entry,

    V_ij ≈ (WH)_ij = Σ_{a=1}^{K} W_ia H_aj        (1.2)

where

V is an n × m matrix, where m is the number of examples in the data set and n is related to the number of dimensions of each example;

W is an n × K matrix;

H is a K × m matrix.

Usually K is chosen to be smaller than n and m, so that W and H are smaller than the original matrix:

    V (n × m) ≈ W (n × K) H (K × m)        (1.3)

V ≈ WH can be rewritten column by column as v ≈ Wh, where v and h are the corresponding columns of V and H:

    v (n × 1) ≈ W (n × K) h (K × 1)        (1.4)
In other words, each data vector v is approximated by a linear combination of the columns of W, weighted by the components of h. In an image clustering problem, the K columns of W are called basis images. Each column of H is called an encoding and is in one-to-one correspondence with an image in V. An encoding consists of the coefficients by which an image is represented as a linear combination of basis images. The rank K of the factorization is generally chosen so that (n + m)K < nm, and the product WH can then be regarded as a compressed form of the data in V. Depending on the constraints used, the output results can be completely different:

VQ uses a hard winner-take-all constraint that results in clustering the data into mutually exclusive prototypes. In VQ, each column of H is constrained to be a unary vector, with one element equal to unity and the other elements equal to zero. In other words, every image (column of V) is approximated by a single basis image (column of W) in the factorization V ≈ WH (Lee and Seung (1999)).
PCA enforces only a weak orthogonality constraint, resulting in a very distributed representation that uses cancellations to generate variability. PCA constrains the columns of W to be orthonormal and the rows of H to be orthogonal to each other. This relaxes the unary constraint of VQ, allowing a distributed representation in which each image is approximated by a linear combination of all the basis images, or eigenimages. Although eigenimages have a statistical interpretation as the directions of largest variance, many of them do not have an obvious visual interpretation. This is because PCA allows the entries of W and H to be of arbitrary sign. As the eigenimages are used in linear combinations that generally involve complex cancellations between positive and negative numbers, many individual eigenimages lack intuitive meaning (Lee and Seung (1999)).

NMF does not allow negative entries in the matrix factors W and H. This is a very useful property in comparison with PCA, which allows arbitrary signs and loses the interpretability of the results, especially for image data that have only positive values. Unlike the unary constraint of VQ, these non-negativity constraints permit the combination of multiple basis images to represent an image. But only additive combinations are allowed, because the non-zero elements of W and H are all positive. In contrast to PCA, no subtractions can occur. For these reasons, the non-negativity constraints are compatible with the intuitive notion of combining parts to form a whole, which is how NMF learns a parts-based representation.
Because NMF is the method used in this research, in the following NMF is described in detail.
1.3 Non-negative Matrix Factorization (NMF)
As said before, V is a known non-negative matrix, and the goal is to find two matrices W and H that are also non-negative. There are different types of non-negative matrix factorizations. The different types arise from using different cost functions for measuring the divergence between V and WH (such as Lin (2007)) and possibly from regularization of the W and/or H matrices. In order to formulate the NMF problem, we first need to define a cost function. A cost function is a measure of how well the obtained W and H approximate V. After defining the cost function, the NMF problem reduces to the problem of optimizing that cost function. There are two basic cost functions (Lee and Seung (2001)):

    ‖A − B‖² = Σ_ij (A_ij − B_ij)²        (1.5)

    D(A‖B) = Σ_ij ( A_ij log(A_ij / B_ij) − A_ij + B_ij )        (1.6)
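As a concrete illustration, the two cost functions can be computed with NumPy as follows (a minimal sketch; the function names and the eps guard against log(0) are ours, not part of the cited formulation):

```python
import numpy as np

def euclidean_cost(A, B):
    """Squared Euclidean distance ||A - B||^2, Equation (1.5)."""
    return np.sum((A - B) ** 2)

def divergence_cost(A, B, eps=1e-12):
    """Generalized KL divergence D(A||B), Equation (1.6).

    eps guards against log(0) and division by zero; it is an
    implementation detail, not part of the formula.
    """
    A = np.maximum(A, eps)
    B = np.maximum(B, eps)
    return np.sum(A * np.log(A / B) - A + B)
```

Both functions vanish exactly when A = B, which is what makes them usable as NMF objectives.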
1.3.1 A Simple NMF and Its Solution
Here a simple, classic NMF with its solution is presented. This algorithm is also used later in
the experiments (Lee and Seung (2001)).
NMF based on the Euclidean cost function: minimize ‖V − WH‖² with respect to W and H, subject to the constraints W, H ≥ 0.

NMF based on the logarithmic (divergence) cost function: minimize D(V‖WH) with respect to W and H, subject to the constraints W, H ≥ 0.
Neither ‖V − WH‖² nor D(V‖WH) is convex in both variables W and H; each is convex in W only or in H only. So, there is no algorithm that solves NMF globally. Instead, algorithms can be proposed that solve these problems locally (finding local optima). The simplest such algorithm is probably gradient descent, but its convergence can be slow; methods such as conjugate gradient have faster convergence but are more complicated (Lee and Seung (2001)). The objective function can be related to the likelihood of generating the images in V from the basis W and the encodings H. An iterative approach that reaches a local optimum is the following:

The Euclidean distance ‖V − WH‖ is non-increasing under these update rules:

    H_aj ← H_aj (WᵀV)_aj / (WᵀWH)_aj        (1.7)

    W_ia ← W_ia (VHᵀ)_ia / (WHHᵀ)_ia        (1.8)

The convergence of the process is ensured. The initialization is performed using positive random initial conditions for the matrices W and H.
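The update rules above can be sketched in a few lines of NumPy (a minimal illustration of the basic algorithm of Lee and Seung (2001); the function name, the fixed iteration count and the eps guard against division by zero are our choices):

```python
import numpy as np

def nmf_multiplicative(V, K, n_iter=200, seed=0, eps=1e-12):
    """Basic NMF with the multiplicative update rules (1.7)-(1.8).

    Locally minimizes ||V - WH||^2; W and H are initialized with
    positive random values, as described in the text.
    """
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, K)) + eps
    H = rng.random((K, m)) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # update rule (1.7)
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # update rule (1.8)
    return W, H
```

Because each update multiplies by a non-negative ratio, W and H stay non-negative throughout the iteration.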
1.3.2
Uniqueness
The factorization is not unique: an invertible matrix B and its inverse can be used to transform the two factorization matrices, for example (Xu et al. (2003)):

    WH = (WB)(B⁻¹H)        (1.9)

If the two new matrices W̃ = WB and H̃ = B⁻¹H are non-negative, they form another parametrization of the factorization. The non-negativity of W̃ and H̃ applies at least if B is a non-negative monomial matrix; in this simple case the transformation just corresponds to a scaling and a permutation. More control over the non-uniqueness of NMF is obtained with sparsity constraints.
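The scaling-and-permutation case can be checked numerically. In this sketch B is a hand-picked non-negative monomial matrix (a permutation combined with positive scalings), so both transformed factors stay non-negative while the product WH is unchanged:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.random((4, 2))           # non-negative factors
H = rng.random((2, 5))

# A non-negative monomial matrix: one positive entry per row/column,
# i.e. a permutation combined with scaling.
B = np.array([[0.0, 2.0],
              [0.5, 0.0]])
B_inv = np.linalg.inv(B)

W2 = W @ B                       # alternative factor pair
H2 = B_inv @ H

# The product is unchanged, and both new factors remain non-negative.
assert np.allclose(W @ H, W2 @ H2)
assert (W2 >= 0).all() and (H2 >= 0).all()
```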
Chapter 2
The dataset consists of 472 images, each of 80 × 44 pixels. Every image has already been segmented from the background, and the background pixel values are set to a large negative value. Every image corresponds to the 2-DG uptake in the OB of a rat in response to a particular chemical substance. The images also correspond to different animals, so there are small differences in size and possibly in alignment.

The purpose of this research is to apply Non-negative Matrix Factorization (NMF) to bi-cluster images and pixels simultaneously and finally to find pixels that respond in a similar manner to a group of chemicals. Before any experiments, all the background pixels are removed in order to decrease the dimension of the images. As said in Section 1.1, since the introduction of NMF numerous methods have been developed for different applications and purposes. In this research, in order to determine which method is appropriate for the given problem, several methods are compared, and the one with the minimum error is finally used. The methods are as follows:

convexnmf: Convex and semi-nonnegative matrix factorizations (Ding et al. (2010)).

orthnmf: Orthogonal nonnegative matrix t-factorizations (Ding et al. (2006)).

nmfnnls: Sparse non-negative matrix factorizations via alternating non-negativity-constrained least squares (Kim and Park (2008) and Kim and Park (2007)).

nmfrule: Basic NMF (Lee and Seung (1999) and Lee and Seung (2001)).

nmfmult: NMF with multiplicative update formula (Pauca et al. (2006)).

nmfals: NMF with auxiliary constraints (Berry et al. (2007)).

In order to make a reasonable comparison among these methods, every algorithm is executed 10 times for each number of clusters (1 to 50), for a total of 500 runs, and the average of all results is reported. This is necessary because the number of clusters, K in the NMF formula as shown in Equation 1.3, is not known in advance. As Table 2.1 shows, nmfnnls has the lowest RMSE (Root Mean Square Error). The RMSE is computed as shown in Equation 2.1, where n is the number of samples (images) and m is the number of pixels. In order to be able to use NMF, every image is vectorized; for example, a 20 × 30 pixel image after vectorization becomes a vector of length 600. Figure 2.1 also shows the RMSE of nmfnnls for different numbers of clusters. As expected, the RMSE decreases as the number of clusters increases.
Table 2.1: Average RMSE of the compared NMF methods.

    Method      RMSE
    convexnmf   0.1260
    orthnmf     0.1320
    nmfnnls     0.1159
    nmfrule     0.1162
    nmfmult     0.1657
    nmfals      0.1178

[Figure 2.1: RMSE of nmfnnls versus the number of clusters.]

    RMSE = sqrt( Σ_i (V_i − (WH)_i)² / (mn) )        (2.1)
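Equation 2.1 can be written directly as a small helper (a minimal sketch; the function name is ours):

```python
import numpy as np

def rmse(V, W, H):
    """Root mean square error between V and its factorization WH,
    as in Equation (2.1); the denominator m*n is the total number
    of entries of V."""
    return np.sqrt(np.sum((V - W @ H) ** 2) / V.size)
```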
As shown in Table 2.1, nmfnnls is the best method for the given problem. Later, in Chapter 4, this method will be used to bi-cluster images and pixels simultaneously.

Just to show how NMF works on our problem, the output of NMF is shown for K = 12. Later, in Chapter 4, the number of clusters is determined through experiments. Figure 2.2 shows the obtained clusters (they can be interpreted as the cluster centers). It should be mentioned that each cluster refers to a column of W in NMF. Also, Figure 2.3 shows three sample images, before and after applying NMF. In this figure, the first column on the right contains the original images obtained from the laboratory experiments. The middle column shows the normalized images. The column on the left shows the results obtained by NMF. As can be seen, the main components of the patterns are much clearer than in the original images. It should be mentioned that in order to use NMF, it is mandatory to have only non-negative values. First, the original images are vectorized to form V, in which every column represents an image. Then, V is normalized along the rows. All the experiments are done with normalized data.
2.2 Sparse NMFs via Alternating Non-negativity-Constrained Least Squares
This section is devoted to a detailed description of the method developed by Kim and Park (2008), which showed the best performance in the experiments. The method tries to produce sparse W and H. In order to enforce sparseness on W or H, Kim and Park (2008) introduce two formulations and the corresponding algorithms for sparse NMFs: SNMF/L for sparse W (where L denotes the sparseness imposed on the left factor) and SNMF/R for sparse H (where R denotes the sparseness imposed on the right factor). The introduced sparse NMF formulations, which impose sparsity on one factor of NMF, utilize L1-norm minimization, and the corresponding algorithms are based on alternating non-negativity-constrained least squares (ANLS). Each sub-problem is solved by a fast non-negativity-constrained least squares (NLS) algorithm (Van Benthem and Keenan (2004)) that improves upon the active-set-based NLS method.
2.2.1
SNMF/R
To apply sparseness constraints on H, the following SNMF/R optimization problem is used:

    min_{W,H} (1/2) [ ‖A − WH‖²_F + η ‖W‖²_F + β Σ_{j=1}^{n} ‖H(:, j)‖₁² ],   s.t. W, H ≥ 0        (2.2)

where H(:, j) is the j-th column vector of H, η > 0 is a parameter to suppress ‖W‖²_F, and β > 0 is a regularization parameter to balance the trade-off between the accuracy of the approximation and the sparseness of H. The SNMF/R algorithm begins with the initialization of W with non-negative values. Then, it iterates the following ANLS sub-problems until convergence:

    min_H ‖ [ W ; √β e_1k ] H − [ A ; 0_1n ] ‖²_F,   s.t. H ≥ 0        (2.3)

where e_1k ∈ R^{1×k} is a row vector with all components equal to one and 0_1n ∈ R^{1×n} is a zero vector, and

    min_W ‖ [ Hᵀ ; √η I_k ] Wᵀ − [ Aᵀ ; 0_km ] ‖²_F,   s.t. W ≥ 0        (2.4)

where I_k is an identity matrix of size k × k and 0_km is a zero matrix of size k × m. Equation 2.3 minimizes the L1-norm of the columns of H ∈ R^{k×n}, which imposes sparsity on H.
2.2.2 SNMF/L
To impose sparseness constraints on W, the following formulation is used:

    min_{W,H} (1/2) [ ‖A − WH‖²_F + η ‖H‖²_F + β Σ_{i=1}^{m} ‖W(i, :)‖₁² ],   s.t. W, H ≥ 0        (2.5)

where W(i, :) is the i-th row vector of W, η > 0 is a parameter to suppress ‖H‖²_F, and β > 0 is a regularization parameter to balance the trade-off between the accuracy of the approximation and the sparseness of W. The algorithm for SNMF/L is also based on ANLS.
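One ANLS iteration of SNMF/R (Equations 2.3 and 2.4) can be sketched as follows. This is an illustrative implementation only: each non-negativity-constrained sub-problem is solved column by column with SciPy's generic active-set NNLS routine rather than the fast block solver of Van Benthem and Keenan (2004), and the function names are ours:

```python
import numpy as np
from scipy.optimize import nnls

def anls_solve(C, D):
    """Solve min_X ||C X - D||_F^2 s.t. X >= 0, one column at a time."""
    X = np.zeros((C.shape[1], D.shape[1]))
    for j in range(D.shape[1]):
        X[:, j], _ = nnls(C, D[:, j])
    return X

def snmf_r_step(A, W, beta, eta):
    """One ANLS iteration of SNMF/R for A (m x n) and W (m x k)."""
    m, n = A.shape
    k = W.shape[1]
    # Equation (2.3): solve for H with the sqrt(beta) row of ones
    # appended to W (this is what yields the L1 penalty on H's columns).
    C = np.vstack([W, np.sqrt(beta) * np.ones((1, k))])
    D = np.vstack([A, np.zeros((1, n))])
    H = anls_solve(C, D)
    # Equation (2.4): solve for W with sqrt(eta)*I_k appended to H^T.
    C = np.vstack([H.T, np.sqrt(eta) * np.eye(k)])
    D = np.vstack([A.T, np.zeros((k, m))])
    W = anls_solve(C, D).T
    return W, H
```

Iterating `snmf_r_step` from a non-negative initial W reproduces the alternating scheme described above.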
[Figure 2.2: the twelve clusters (Cluster #1 to Cluster #12), i.e. the columns of W obtained by NMF with K = 12, shown as 80 × 44 images.]

[Figure 2.3: three sample images (#39, #181 and #230), each shown as the original (Main), the normalized image, and the NMF estimate.]
Chapter 3
3.1 Fuzzy C-Means (FCM)
Fuzzy c-means (FCM) is a method of clustering which allows one piece of data to belong to two or more clusters. It is based on the minimization of the following objective function:

    J_m = Σ_{i=1}^{N} Σ_{j=1}^{C} u_ij^m ‖x_i − c_j‖²,   1 ≤ m < ∞        (3.1)

where m is any real number greater than 1, u_ij is the degree of membership of x_i in cluster j, x_i is the i-th d-dimensional measured datum, c_j is the d-dimensional center of the cluster, and ‖·‖ is any norm expressing the similarity between a measured datum and a center. Fuzzy partitioning is carried out through an iterative optimization of the objective function shown above, with the memberships u_ij and the cluster centers c_j updated by Equations 3.2 and 3.3, respectively:

    u_ij = 1 / Σ_{k=1}^{C} ( ‖x_i − c_j‖ / ‖x_i − c_k‖ )^{2/(m−1)}        (3.2)

    c_j = Σ_{i=1}^{N} u_ij^m x_i / Σ_{i=1}^{N} u_ij^m        (3.3)

The iteration updates c_j^(k+1) and u_ij^(k+1) in turn and stops when max_ij |u_ij^(k+1) − u_ij^(k)| < ε, where ε is a small termination threshold and k is the iteration step.
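The alternating FCM updates and the stopping test can be sketched as follows (an illustrative implementation; the random initialization of the membership matrix and the tolerance values are our choices):

```python
import numpy as np

def fcm(X, C, m=1.1, n_iter=100, eps=1e-9, tol=1e-6, seed=0):
    """Fuzzy c-means via the alternating updates (3.2)-(3.3).

    X: (N, d) data, C: number of clusters, m: fuzzification factor
    (set to 1.1 in the experiments of this chapter).
    """
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    U = rng.random((N, C))
    U /= U.sum(axis=1, keepdims=True)       # fuzzy partition: rows sum to 1
    for _ in range(n_iter):
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]          # Eq. (3.3)
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + eps
        U_new = 1.0 / (d ** (2.0 / (m - 1.0)))
        U_new /= U_new.sum(axis=1, keepdims=True)               # Eq. (3.2)
        if np.max(np.abs(U_new - U)) < tol:  # stopping criterion
            U = U_new
            break
        U = U_new
    return U, centers
```

With a small fuzzification factor such as m = 1.1 the memberships become nearly crisp, which matches the behavior reported in this chapter.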
3.2 Cluster Validity Index
In order to compare two clustering methods, it is necessary to use a measure usually called a cluster validity index. In this research, a cluster validity index developed by Kim et al. (2004) is used. This index is based on the relative degree of sharing of two clusters, computed as the weighted sum of the relative degrees of sharing over all data.
3.2.1 Relative Degree of Sharing
Let Fp and Fq be two fuzzy clusters belonging to a fuzzy partition (U, V ) and c be the number
of clusters.
Definition 1. The relative degree of sharing of two fuzzy clusters F_p and F_q at x_j is defined as:

    S_rel(x_j : F_p, F_q) = [ F_p(x_j) F_q(x_j) ] / [ (1/c) Σ_{i=1}^{c} F_i(x_j) ]        (3.4)

The numerator is the degree of sharing of F_p and F_q at x_j, and the denominator is the average membership value of x_j over the c fuzzy clusters. The relative degree of sharing of two fuzzy clusters is defined as the weighted summation of S_rel(x_j : F_p, F_q) over all data in X.

Definition 2. (Relative degree of sharing of two fuzzy clusters.) The relative degree of sharing of fuzzy clusters F_p and F_q is defined as:

    S_rel(F_p, F_q) = Σ_{j=1}^{n} S_rel(x_j : F_p, F_q) h(x_j)        (3.5)

where h(x_j) = − Σ_{i=1}^{c} F_i(x_j) log_a F_i(x_j).

Here, h(x_j) is the entropy of datum x_j, and F_i(x_j) is the membership value with which x_j belongs to cluster F_i. h(x_j) measures how vaguely the datum x_j is classified over the c different clusters; it is introduced to assign a weight to vague data, so that vague data are given more weight than clearly classified data. h(x_j) also reflects the dependency of F_i(x_j) with respect to different values of c. This approach makes it possible to focus more on the highly overlapped data in the computation of the validity index than other indices do. Since olfactory data are highly overlapped, this index is a good measure.
3.2.2
Validity Index
Definition 3. (Validity Index.) Let F_p and F_q be two fuzzy clusters belonging to a fuzzy partition (U, V) and let c be the number of clusters. Let S_rel(F_p, F_q) be the degree of sharing of the two fuzzy clusters. Then the index V is defined as:

    V(U, V : X) = [2 / (c(c−1))] Σ_{p≠q} S_rel(F_p, F_q)
                = [2 / (c(c−1))] Σ_{p≠q} Σ_{j=1}^{n} c [ F_p(x_j) F_q(x_j) / Σ_{i=1}^{c} F_i(x_j) ] h(x_j)        (3.6)
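A sketch of Equations 3.4-3.6, assuming the product form F_p(x_j)F_q(x_j) in the numerator of S_rel and base-2 entropy (the base a is not fixed in the text); the function name is ours:

```python
import numpy as np

def validity_index(U, a=2.0, eps=1e-12):
    """Overlap-based cluster validity index of Equation (3.6).

    U: (c, n) membership matrix with U[i, j] = F_i(x_j).
    Lower values indicate better-separated clusters.
    """
    c, n = U.shape
    # Entropy h(x_j) of each datum; eps guards log(0).
    h = -np.sum(U * np.log(np.maximum(U, eps)) / np.log(a), axis=0)
    col_sum = U.sum(axis=0) + eps
    total = 0.0
    for p in range(c):
        for q in range(c):
            if p == q:
                continue
            # S_rel(F_p, F_q), Equations (3.4)-(3.5) combined.
            total += np.sum(c * U[p] * U[q] / col_sum * h)
    return 2.0 / (c * (c - 1)) * total
```

Crisp (0/1) memberships give an index of zero, while heavily overlapped memberships give a positive value, which matches the intended behavior.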
3.3 Comparison of NMF and FCM
Figure 3.1 shows the cluster validity index, explained in the previous section, for different numbers of clusters for both NMF and FCM. The values are also listed in Table 3.1. To obtain every value of the index, each algorithm is repeated 10 times and the average is reported. This is necessary since both FCM and NMF begin with random seeds and converge to local minima. For FCM, the Euclidean norm shown in Equation 3.7 is used, and the fuzzification factor m of the FCM algorithm is set to 1.1; experiments show that larger values of m produce weak results.

    ‖x‖ := sqrt( x₁² + … + x_n² )        (3.7)

As Figure 3.1 shows, whatever the number of clusters K, NMF always outperforms FCM, and it even shows a more stable behavior. As an example, both algorithms were applied to the main dataset with 12 clusters. The results are shown in Figures 3.5 and 3.4. Clearly, NMF is able to find a large variety of patterns, and since the output is a compressed version of the original data, in some cases the noise is removed. On the other hand, in FCM the clusters are very noisy, and there is also a strong similarity among clusters 3, 8 and 10. In contrast, the NMF images are less noisy and very heterogeneous. This is a very important and desirable property for our problem, since the clusters will be very different and probably more interpretable.

Since the cluster validity index values are very similar for K = 2 and K = 3, as shown in Table 3.1, the clusters obtained by NMF and FCM are demonstrated and compared for K = 2 and K = 3. In the case of K = 2, because there is little difference between NMF and FCM in terms of the cluster validity index shown in Table 3.1, we expect the images to be similar. Figures 3.2(b) and 3.2(a) support this hypothesis. However, Figures 3.3(b) and 3.3(a) show that NMF is able to produce better clusters than FCM, even for a small number of clusters. Since NMF does not perform a global optimization and tries to factorize V into W and H, it is more likely that NMF grasps local patterns, like the blue patterns that appear in the NMF picture but are not present in the FCM results.
[Figure 3.1: cluster validity index (×10⁴) of FCM and NMF versus the number of clusters.]
Table 3.1: Cluster validity index of NMF and FCM for K = 2, ..., 45.

    K    NMF       FCM      |  K    NMF       FCM
    2    1762.62   1782.22  |  24   1936.99   10209.10
    3    1866.94   1958.48  |  25   1974.52   10635.04
    4    1885.21   2304.26  |  26   1991.93   11027.00
    5    1742.53   2671.88  |  27   1955.30   11367.68
    6    1819.51   3060.68  |  28   2070.83   11813.05
    7    1740.90   3462.59  |  29   2049.97   12243.92
    8    1738.99   3867.56  |  30   2101.71   12654.37
    9    1730.64   4256.01  |  31   2017.98   13015.36
    10   1800.77   4651.38  |  32   2065.92   13382.79
    11   1756.36   5025.08  |  33   2089.04   13836.93
    12   1795.11   5428.79  |  34   2077.78   14138.64
    13   1865.47   5836.79  |  35   2091.19   14614.99
    14   1795.58   6255.98  |  36   2190.53   15020.55
    15   1837.64   6655.65  |  37   2101.48   15462.93
    16   1844.71   7050.31  |  38   2168.09   15754.74
    17   1762.42   7463.09  |  39   2211.88   16224.19
    18   1870.11   7865.73  |  40   2163.61   16589.04
    19   1855.41   8262.39  |  41   2219.84   17003.98
    20   1888.06   8649.33  |  42   2206.45   17393.46
    21   1887.25   9062.16  |  43   2338.49   17708.48
    22   1919.54   9427.37  |  44   2277.34   18165.53
    23   1990.77   9843.57  |  45   2287.03   18535.29
[Figure 3.2: the clusters (Cluster #1 and Cluster #2) obtained with K = 2 by the two algorithms, panels (a) and (b).]

[Figure 3.3: the clusters (Cluster #1 to Cluster #3) obtained with K = 3 by the two algorithms, panels (a) and (b).]
[Figure 3.4: the twelve clusters (Cluster #1 to Cluster #12) obtained by FCM with K = 12.]

[Figure 3.5: the twelve clusters (Cluster #1 to Cluster #12) obtained by NMF with K = 12.]
Chapter 4
In the last few years, several methods have been proposed to avoid these drawbacks. Among them, bi-clustering algorithms have been presented as an alternative to standard clustering techniques for identifying local structures in gene expression datasets. These methods perform clustering on genes and conditions simultaneously in order to identify subsets of genes that show similar expression patterns across specific subsets of experimental conditions, and vice versa.

In this research, the NMF method selected in Chapter 2 is applied to bi-cluster images and pixels at the same time. Bi-clustering of images and pixels helps us to recognize which patterns are repeated over the data set.

The rest of the chapter is organized as follows. First, it is explained how NMF results can be used for bi-clustering. Then, in Section 4.2, a method is introduced for feature selection; with its help, the most effective pixels are chosen, which finally results in bi-cluster patterns. In Section 4.3, a method called Visual Assessment of Cluster Tendency (VAT) is reviewed. This method is used for finding the number of co-clusters, and its results are presented in Section 4.3.2. In Section 4.5, first NMF is applied for the factorization of the vectorized images, then the feature selection method is used, and finally the co-clusters are demonstrated and discussed.
4.1
4.2
Feature Selection
Suppose that after applying NMF, W is obtained with m rows and K columns. Every column represents an image (cluster) and every row represents a pixel. The problem is to find the most effective pixels in every cluster:
Step 1. First, all the rows of W are normalized. A row holds the data related to the membership of one pixel in the different clusters (images). Normalization of row i is done by the following equation:

    W̄(i, k) = W(i, k) / Σ_{j=1}^{K} W(i, j)        (4.1)

Step 2. Using the following equation, the data are entropy-scaled, so that membership values spread over many clusters contribute negative values. The results are stored in the matrix LogW:

    LogW(i, k) = W̄(i, k) log₂ W̄(i, k)        (4.2)

Step 3. Feature scores are computed using LogW from Step 2. Score(i) represents the score of the i-th feature (pixel), and K is the number of clusters:

    Score(i) = 1 + (1 / log₂ K) Σ_{j=1}^{K} LogW(i, j)        (4.3)
Two extreme examples show how the score works. Assume that there are two features with the following membership values in 5 clusters:

    F1 = [1, 0, 0, 0, 0]
    F2 = [0.2, 0.2, 0.2, 0.2, 0.2]

The score of feature F1 is equal to 1, since:

    Score(F1) = 1 + (1 / log₂ 5)(1 · log₂ 1) = 1        (4.4)

while the score of F2 is 0:

    Score(F2) = 1 + (1 / log₂ 5)(0.2 log₂ 0.2 + … + 0.2 log₂ 0.2) = 1 + log₂ 0.2 / log₂ 5 = 0        (4.5)

Step 4. After finding the scores of all features (pixels), the most effective features can be filtered. In this research the top 65% of pixels are considered the most effective ones.
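Steps 1-3 can be sketched as follows (a minimal implementation; the function name and the eps guard against log(0) are ours):

```python
import numpy as np

def feature_scores(W, eps=1e-12):
    """Pixel (feature) scores of Equations (4.1)-(4.3).

    W: (m, K) NMF factor. Returns a score in [0, 1] per row: 1 means
    the pixel belongs to exactly one cluster, 0 means its membership
    is spread uniformly over all K clusters.
    """
    K = W.shape[1]
    Wn = W / (W.sum(axis=1, keepdims=True) + eps)   # Equation (4.1)
    logW = Wn * np.log2(np.maximum(Wn, eps))        # Equation (4.2)
    return 1.0 + logW.sum(axis=1) / np.log2(K)      # Equation (4.3)
```

Applied to the two extreme features of the example above, it reproduces the scores 1 and 0.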
4.3 Visual Assessment of Cluster Tendency (VAT)
In Chapter 3, FCM and NMF were compared according to a cluster validity index introduced by Kim et al. (2004), which is a fuzzy index. The comparison was possible thanks to the fuzzy interpretation of the clusters obtained by NMF: each W(i, j) can be interpreted as the degree to which pixel i is a member of cluster j (image j). The comparison shows that NMF is able to produce more well-separated clusters with a desirable density. However, as Figure 3.1 shows, the index is monotonically increasing and gives no clue about the number of clusters.

In order to choose the number of clusters, a visual method called Visual Assessment of Cluster Tendency (VAT) is used. The method was originally introduced in Bezdek and Hathaway (2002) and extended later by others, such as Hathaway et al. (2006), Bezdek et al. (2007), Park et al. (2009) and Havens and Bezdek (2011). The variant applied in this research is the one introduced in Bezdek et al. (2007). In the following, the method is explained briefly.
4.3.1
We have an m × n matrix D, and assume that its entries correspond to pairwise dissimilarities between m row objects O_r and n column objects O_c, which, taken together (as a union), comprise a set O of N = m + n objects. Bezdek et al. (2007) develop a new visual approach that applies to four different cluster assessment problems associated with O. The problems are the assessment of cluster tendency:

P1) among the row objects O_r;

P2) among the column objects O_c;

P3) among the union of the row and column objects O_r ∪ O_c; and

P4) among the union of the row and column objects that contain at least one object of each type (co-clusters).

The basis of the method is to regard D as a subset of known values that is part of a larger, unknown N × N dissimilarity matrix, and then to impute the missing values from D. This results in estimates for three square matrices (D_r, D_c, D_{r∪c}) that can be visually assessed for clustering tendency using the previous VAT or sVAT algorithms. The output from the assessment of D_{r∪c} ultimately leads to a rectangular coVAT image which exhibits the clustering tendencies in D.
Introduction
We consider a type of preliminary data analysis related to the pattern recognition problem of clustering. Clustering, or cluster analysis, is the problem of partitioning a set of objects O = {O₁, …, O_n} into c self-similar subsets based on available data and some well-defined measure of (cluster) similarity (Bezdek and Hathaway (2002)). All clustering algorithms will find an arbitrary number of clusters (any 1 ≤ c ≤ n), even if no actual clusters exist. Therefore, a fundamentally important question to ask before applying any particular (and potentially biasing) clustering algorithm is: are clusters present at all?

The problem of determining whether clusters are present, as a step prior to actual clustering, is called the assessment of clustering tendency. Various formal (statistically based) and informal techniques for tendency assessment are discussed in Jain (1988) and Everitt (1978). None of the existing approaches is completely satisfactory (nor will they ever be). The main purpose of this line of research is to add a simple and intuitive visual approach to the existing repertoire of tendency assessment tools. The visual approach for assessing cluster tendency can be used in all cases involving numerical data. From now on, VAT is used as an acronym for Visual Assessment of Tendency. The VAT approach presents the pairwise dissimilarity information about the set of objects O = {O₁, …, O_n} as a square digital image with n² pixels, after the objects are suitably reordered so that the image is better able to highlight potential cluster structure (Bezdek and Hathaway (2002)).
Data Representation
There are two common data representations of O upon which clustering can be based:
Object data representation: When each object in O is represented by a (column)
vector x in ℝ^p, the set X = {x1, ..., xn} ⊂ ℝ^p is called an object data representation of
O. The k-th component of the i-th feature vector (x_ki) is the value of the k-th feature (e.g.,
height, weight, length, etc.) of the i-th object. It is in this data space that practitioners
sometimes seek geometrical descriptors of the clusters.
Relational data representation: When each pair of objects in O is represented by
a relationship, then we have relational data. The most common case of relational data
is when we have (a matrix of) dissimilarity data, say R = [Rij], where Rij is the pairwise
dissimilarity (usually a distance) between objects oi and oj, for 1 ≤ i, j ≤ n. More
generally, R can be a matrix of similarities based on a variety of measures.
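The two representations are related: a relational (dissimilarity) matrix can always be derived from object data. A minimal numpy sketch, with illustrative data of our own:

```python
import numpy as np

# Object data: n = 4 objects, each a 2-dimensional feature vector.
X = np.array([[0.0, 0.0],
              [0.1, 0.0],
              [3.0, 4.0],
              [3.1, 4.0]])

# Relational data: R[i, j] = Euclidean distance between objects i and j.
diff = X[:, None, :] - X[None, :, :]
R = np.sqrt((diff ** 2).sum(axis=2))

# R is symmetric with a zero diagonal, as required of a dissimilarity matrix.
assert np.allclose(R, R.T) and np.allclose(np.diag(R), 0.0)
```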
Dissimilarity Images
Let R be an n × n dissimilarity matrix corresponding to the set O = {o1, . . . , on}. R satisfies the
following (metric) conditions for all 1 ≤ i, j ≤ n:
Rij ≥ 0
Rij = Rji
Rii = 0
R can be displayed as an intensity image I, which is called a dissimilarity image. The intensity
or gray level gij of pixel (i, j) depends on the value of Rij. The value Rij = 0 corresponds to gij
= 0 (pure black); the value Rij = Rmax, where Rmax denotes the largest dissimilarity value in
R, gives gij = Rmax (pure white). Intermediate values of Rij produce pixels with intermediate
levels of gray in a set of gray levels G = {G1, . . . , Gm}. A dissimilarity image is not useful
until it is ordered by a suitable procedure. We will attempt to reorder the objects {o1, o2, . . . , on} as
{ok1, ok2, . . . , okn} so that, to whatever degree possible, if ki is near kj, then oki is similar to
okj. The corresponding ordered dissimilarity image (ODI) will often indicate cluster tendency
in the data by dark blocks of pixels along the main diagonal. The ordering is accomplished by
processing elements of the dissimilarity matrix R (rather than using the objects or object data
directly).
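The ordering step can be sketched in a few lines. This is a simplified rendering of the Prim-like VAT reordering; function and variable names are our own:

```python
import numpy as np

def vat_order(R):
    """Prim-like VAT reordering of a square dissimilarity matrix R.

    Returns the permutation P; the ordered dissimilarity image (ODI)
    is then R[np.ix_(P, P)].
    """
    n = R.shape[0]
    # Start from a row that contains the largest dissimilarity.
    i = int(np.unravel_index(np.argmax(R), R.shape)[0])
    P, rest = [i], set(range(n)) - {i}
    while rest:
        # Append the unordered object closest to any already-ordered object.
        j = min(rest, key=lambda c: min(R[p, c] for p in P))
        P.append(j)
        rest.remove(j)
    return P

# Two obvious clusters, {0, 2} and {1, 3}, interleaved in the input order.
R = np.array([[0.0, 4.0, 0.1, 4.1],
              [4.0, 0.0, 4.2, 0.2],
              [0.1, 4.2, 0.0, 4.3],
              [4.1, 0.2, 4.3, 0.0]])
P = vat_order(R)            # [2, 0, 1, 3]: cluster members become adjacent
ODI = R[np.ix_(P, P)]       # dark 2x2 blocks appear along the diagonal
```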
The images shown below use 256 equally spaced gray levels, with G1 = 0 (black) and Gm =
Rmax (white). The displayed gray level of pixel (i, j) is the level gij ∈ G that is closest to Rij.
The example is from Bezdek and Hathaway (2002). The corresponding dissimilarity matrix is
shown below:
R =
[ 0.00  0.73  0.19  0.71  0.16 ]
[ 0.73  0.00  0.59  0.12  0.78 ]
[ 0.19  0.59  0.00  0.55  0.19 ]
[ 0.71  0.12  0.55  0.00  0.74 ]
[ 0.16  0.78  0.19  0.74  0.00 ]
Figure 4.1: A simple dissimilarity image before and after ordering with VAT
The m × n matrix D is embedded as a block of a larger square matrix on the union of the row and column objects, whose (i, j) entry is the dissimilarity d(oi, oj) for 1 ≤ i, j ≤ m + n:

D_{r∪c} = [ Dr   D  ]
          [ Dᵀ   Dc ]

Here the m × m block Dr = [d(oi, oj)], 1 ≤ i, j ≤ m, holds the dissimilarities among the row objects, the n × n block Dc = [d(om+i, om+j)], 1 ≤ i, j ≤ n, those among the column objects, and the off-diagonal blocks are the known rectangular data D and its transpose. Since only D is known, Dr and Dc must be estimated from it:

[Dr]ij = αr ‖di· − dj·‖   for 1 ≤ i, j ≤ m    (4.7)

[Dc]ij = αc ‖d·i − d·j‖   for 1 ≤ i, j ≤ n    (4.8)

where αr and αc are scale factors and di· and d·j denote the i-th row and the j-th column of D, respectively; any other norm can also be used. Finally, with Dr and Dc, the matrix D_{r∪c} can be assembled. The coVAT algorithm (Bezdek et al. (2007)) is as follows; as can be seen, it uses the VAT algorithm (Bezdek and Hathaway (2002)) internally:
Input: An m × n matrix of pairwise dissimilarities D = [dij] satisfying dij ≥ 0 for all 1 ≤ i ≤ m and
1 ≤ j ≤ n.
Step 1. Build estimates of Dr and Dc using Formulas 4.7 and 4.8.
Step 2. Build the estimate of D_{r∪c} from Dr, Dc and D.
Step 3. Run VAT on D_{r∪c}, and save the permutation array P_{r∪c} = (P(1), ..., P(m + n)).
Step 4. Initialize rc = cc = 0; RP = CP = 0.
For t = 1, ..., m + n:
    If P(t) ≤ m
        rc = rc + 1         % rc = row component
        RP(rc) = P(t)       % RP = row indices
    Else
        cc = cc + 1         % cc = column component
        CP(cc) = P(t) − m   % CP = column indices
    End If
Next t
Step 5. Form the co-VAT ordered rectangular dissimilarity matrix D̂ = [d̂ij] = [d_{RP(i)CP(j)}] for
1 ≤ i ≤ m and 1 ≤ j ≤ n.
Output: The rectangular image I(D̂), scaled so that max dij corresponds to white and min dij
to black.
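A minimal Python sketch of these five steps, for illustration only; the VAT reordering is included in simplified form, and the scale factor defaults match the 0.1 used later in this chapter:

```python
import numpy as np

def vat_order(R):
    """Simplified Prim-like VAT reordering of a square dissimilarity matrix."""
    n = R.shape[0]
    i = int(np.unravel_index(np.argmax(R), R.shape)[0])
    P, rest = [i], set(range(n)) - {i}
    while rest:
        j = min(rest, key=lambda c: min(R[p, c] for p in P))
        P.append(j)
        rest.remove(j)
    return P

def covat(D, alpha_r=0.1, alpha_c=0.1):
    """coVAT sketch for a rectangular m x n dissimilarity matrix D."""
    m, n = D.shape
    # Steps 1-2: estimate Dr and Dc from the rows/columns of D (Formulas
    # 4.7 and 4.8) and assemble the square union matrix D_{r u c}.
    Dr = alpha_r * np.linalg.norm(D[:, None, :] - D[None, :, :], axis=2)
    Dc = alpha_c * np.linalg.norm(D.T[:, None, :] - D.T[None, :, :], axis=2)
    Duc = np.block([[Dr, D], [D.T, Dc]])
    # Step 3: run VAT on D_{r u c} and keep the permutation.
    P = vat_order(Duc)
    # Step 4: split the permutation into row and column index arrays.
    RP = [t for t in P if t < m]
    CP = [t - m for t in P if t >= m]
    # Step 5: the co-VAT ordered rectangular dissimilarity matrix.
    return D[np.ix_(RP, CP)]

D = np.array([[0.1, 2.0, 0.2],
              [2.1, 0.1, 2.2]])
Dhat = covat(D)   # rows and columns reordered to expose co-clusters
```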
Advantages and Disadvantages
Advantages
VAT can be applied to datasets with missing data. If the original data has missing
components (is incomplete), then any existing data imputation scheme can be used
to fill in the missing part of the data prior to processing. The ultimate purpose
of imputing data here is simply to get a very rough picture of the cluster tendency
in O. Consequently, sophisticated imputation schemes, such as those based on the
expectation-maximization (EM) algorithm, are unnecessarily expensive in both complexity and computation time.
The VAT tool is widely applicable because it displays an ordered form of dissimilarity
data, which itself can always be obtained from the original data for O. We must
consider the fact that almost all other cluster validity indexes (CVIs), such as that of
Xie and Beni (1991), that are used for determining the number of clusters can be used
only after the clustering is done. So, clustering is usually repeated with different
values for the number of clusters and the obtained CVIs are compared to choose the
best. This approach is very time consuming and in fact can be impractical for large
data sets. Also, as mentioned in the introduction, these approaches are biased,
since they assume that the data can be clustered without considering the possibility
that there may be no clusters in the data at all. Some examples are shown in Bezdek
et al. (2007), part IV, Numerical Examples.
VAT needs no parameter optimization because, roughly speaking, it has no parameters.
This is a great advantage over other methods that usually require a lot of effort for
parameter optimization.
The usefulness of a dissimilarity image for visually assessing cluster tendency depends
crucially on the ordering of the rows and columns of R. The VAT ordering algorithm
can be implemented with O(M²) time complexity and is similar to Prim's algorithm
for finding a minimal spanning tree (MST) of a weighted graph. As M grows, the
complexity grows nonlinearly. Are there effective size limits for this approach?
No: if D is large, then the square dissimilarity matrix processing of Dr, Dc and D_{r∪c}
that is required to apply coVAT to D can be done using sVAT, the scalable version
of VAT (Hathaway et al. (2006)). The bottom line is that this general approach works
even for very large (unloadable) data sets (Bezdek et al. (2007)).
29
By examining the permutation arrays and correlating their entries with the dark
blocks produced by VAT-coVAT, a crude clustering of the data is produced. At a
minimum, the visually identified clusters can be used as a good initialization of
an iterative dissimilarity-clustering algorithm such as the non-Euclidean relational
fuzzy (NERF) c-means algorithm introduced in Hathaway and Bezdek (1994).
Disadvantages
The coVAT image does a very good job of representing co-cluster structure, but its
computation is expensive, since the square matrices Dr, Dc and D_{r∪c} must first be
built and processed. This raises the question of whether there is a cheaper, direct
route from D to the coVAT image.
4.3.2
VAT results
The scale factors αr and αc are set to 0.1. In order to decrease the computational cost of
VAT, NMF is first applied to decompose the images (V) into the two matrices W and H. Then
VAT is applied to evaluate only W. The number of clusters in NMF is set to 20, which is a
relatively large number. We suppose that the NMF results contain some unknown number of
bi-clusters, and that by applying VAT to W this number can be estimated. The growth of the
size of D_{r∪c} is shown in Figure 4.2. As Figure 4.2 shows, with 472 images and 3520 pixels
per image, D_{r∪c} would be as big as a 16-million-pixel image, and processing such an image in
VAT is very difficult. In addition, our experiments show that finding patterns in such big
images is extremely difficult and almost impossible. These are the main reasons that VAT is
applied to the output of NMF instead.
The results are shown in Figure 4.3 and Figure 4.4. To increase the contrast of the picture
and make it clearer, the coVAT image is filtered; the resulting image is shown in Figure 4.5. As
Figure 4.5 shows, in the last 7 rows there is some strong evidence of co-clusters. There are also
some weak patterns in rows 7 to 13. Some experiments are needed to determine whether it is
necessary to consider more than 7 clusters. In the next section, this question is answered by
analyzing the stability of NMF.
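The saving obtained by running VAT on W instead of V can be checked with a quick back-of-the-envelope computation:

```python
# coVAT must reorder the square union matrix D_{r u c}, which has
# (m + n)^2 entries for an m x n input.
m_images, n_pixels, k = 472, 3520, 20

full = (m_images + n_pixels) ** 2      # VAT on the raw data V: ~15.9 million
reduced = (m_images + k) ** 2          # VAT on the NMF factor W: ~0.24 million
print(full, reduced)
```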
[Figure 4.2]
[Figure 4.3: Dr for P1; Dc for P2; Dunion for P3; coVAT Image: P4]
[Figure 4.4: coVAT Image: P4]
[Figure 4.5: filtered coVAT Image: P4]
K    CVI Mean   CVI Std   RMSE Mean   RMSE Std
2    1754.19    27.02     0.1392      7.88e-14
3    1800.75    71.45     0.1286      4.55e-07
4    1811.75    96.93     0.1216      1.17e-05
5    1736.70    54.34     0.1159      2.29e-05
6    1733.19    48.74     0.1127      2.56e-05
7    1747.37    61.94     0.1101      2.94e-05
8    1819.98    74.13     0.1079      2.56e-05
9    1811.27    66.34     0.1059      2.67e-05
10   1811.48    54.27     0.1043      2.41e-05
11   1793.28    53.77     0.1028      4.41e-05
12   1826.54    47.60     0.1015      4.36e-05
13   1829.09    50.05     0.1004      3.29e-05
14   1844.67    55.10     0.0992      2.87e-05

Table 4.1: Stability Analysis for K = 2 to 14: Mean and Standard Deviation
4.4 Stability Analysis
As shown in Section 4.3.2, with the help of VAT the number of bi-clusters was estimated to be
near 7. However, there was some weak evidence of other possible bi-clusters. In this
section, the stability of NMF is evaluated in order to decide on the number of bi-clusters. To do so,
NMF is run 100 times with the number of clusters varying from 2 to 14. The results are
summarized in Table 4.1, Figure 4.6 and Figure 4.7.
As Figure 4.6 and Figure 4.7 show, the minimum CVI over 100 iterations is achieved for
K = 6. K = 6 also has the second best standard deviation after K = 2. K = 2 has the
minimum standard deviation because the algorithm converges very quickly to the same local
minimum, which decreases the standard deviation.
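A compact sketch of such a stability experiment. For illustration it uses a basic Lee and Seung multiplicative-update NMF, synthetic data and 10 repetitions, not the nmfnnls variant and 100 repetitions used in the thesis:

```python
import numpy as np

def nmf(V, k, iters=200, seed=None):
    """Basic multiplicative-update NMF (Lee and Seung), returning W and H."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, k)) + 0.1          # strictly positive initial seeds
    H = rng.random((k, m)) + 0.1
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + 1e-9)
        W *= (V @ H.T) / (W @ H @ H.T + 1e-9)
    return W, H

rng = np.random.default_rng(0)
V = rng.random((30, 20))                  # stand-in for the image matrix

# Stability analysis: repeat NMF with different random seeds and
# summarize the spread of the reconstruction error (RMSE).
rmse = []
for seed in range(10):
    W, H = nmf(V, k=4, seed=seed)
    rmse.append(np.sqrt(np.mean((V - W @ H) ** 2)))
print(np.mean(rmse), np.std(rmse))        # a small std indicates stability
```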
4.5
With the number of clusters determined to be 6, images and pixels can be clustered by NMF.
Figure 4.8(a) shows the cluster centers corresponding to the columns of W, and Figure 4.8(b)
shows the components. These components are obtained simply by setting the maximum value
in each row of W (W(i, :)) equal to one and the rest equal to zero. The components also
clearly illustrate the additivity feature of NMF, meaning that every image can be reconstructed
by a combination of the basis components in W.
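The binarization that produces the components can be written directly. This is a sketch with a toy W; in the thesis W has one row per pixel and six columns:

```python
import numpy as np

# Toy W: one row per pixel, one column per cluster (4 pixels, 3 clusters).
W = np.array([[0.9, 0.1, 0.0],
              [0.2, 0.7, 0.1],
              [0.0, 0.3, 0.6],
              [0.5, 0.4, 0.1]])

# In each row W(i, :), set the maximum entry to one and the rest to zero.
components = np.zeros_like(W)
components[np.arange(W.shape[0]), W.argmax(axis=1)] = 1.0
# Pixels 0 and 3 fall in cluster 0, pixel 1 in cluster 1, pixel 2 in cluster 2.
```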
In order to visualize the matrix factorization, the feature selection method explained in
Section 4.2 is applied to the normalized data set. 30% of the pixels (features) are selected and the
rest are removed. Figure 4.9 shows the factorization of V into W and H.
4.6
Cross-Validation
Cross-validation, sometimes called rotation estimation, is a technique for assessing how the
results of a statistical analysis will generalize to an independent data set. It is mainly used in
settings where the goal is prediction, and one wants to estimate how accurately a predictive
model will perform in practice. One round of cross-validation involves partitioning a sample
of data into complementary subsets, performing the analysis on one subset (called the training
set), and validating the analysis on the other subset (called the validation set or testing set). To
reduce variability, multiple rounds of cross-validation are performed using different partitions,
and the validation results are averaged over the rounds.
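One round of such a partition can be sketched as follows. This is an illustrative k-fold splitter; function and variable names are our own:

```python
import numpy as np

def kfold_indices(n, folds, seed=0):
    """Yield (train, validation) index arrays for k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)                  # random partition of the samples
    parts = np.array_split(idx, folds)
    for f in range(folds):
        val = parts[f]
        train = np.concatenate([parts[g] for g in range(folds) if g != f])
        yield train, val

# One round of 10-fold cross-validation over 50 samples:
for train, val in kfold_indices(50, folds=10):
    pass  # fit (e.g., run NMF) on `train`, evaluate RMSE / CVI on `val`
```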
[Figure 4.6: CVI vs. number of clusters; (a) Mean, (b) Standard Deviation]
[Figure 4.7: RMSE vs. number of clusters; (a) Mean, (b) Standard Deviation]
[Figure 4.8: Clusters #1 to #6; (a) Clusters, (b) Components]
Figure 4.9: V ≈ W H
Cross-validation results:

Run       RMSE          CVI
1         0.112625      1742.40
2         0.112645      1779.16
3         0.112663      1741.94
4         0.112646      1745.64
5         0.112651      1768.51
Mean      0.112646      1755.53
Variance  1.90826e-10   295.44
Std       1.38139e-05   17.19
[Figure 4.10: clusters C#1 to C#6 obtained across the cross-validation runs]

Chapter 5
Since the problem is a blind clustering problem and therefore unsupervised, the clustering
tendency should be evaluated before clustering. This means that before applying any clustering
method, one should ask whether there are any clusters at all. Visual Assessment of Cluster
Tendency (VAT), introduced by Bezdek and Hathaway (2002), is used to evaluate the cluster
tendency. This method has many advantages; the most important are that it works directly on
the data, not on the clusters obtained by a clustering method, and that it is parameter-free and
does not need any parameter setting. However, running VAT is very costly: every run, with W
rather than the whole data set V as the input, takes around four hours. Other variations of
VAT, such as the scalable version sVAT (Hathaway et al. (2006)) or the improved version iVAT
(Havens and Bezdek (2011)), could be considered in the future. It should be mentioned that
after obtaining the clusters, their interpretability must be evaluated by an expert in the context
of the problem.
After obtaining the clusters (W) and H by NMF, V ≈ W H can be rewritten column by
column as v ≈ W h, where v and h are the corresponding columns of V and H:
v_{n×1} ≈ W_{n×K} h_{K×1}    (5.1)
This lets us reconstruct every olfactory image as a combination of clusters. This
additivity property of NMF allows us to identify the most effective regions in the images, which
finally leads to a better understanding of olfaction. The development of new additive clustering
methods such as NMF could be considered as future work.
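The column-wise reconstruction can be illustrated with toy factors (the numbers below are made up):

```python
import numpy as np

# Toy factors: three basis "images" (columns of W) and one mixing vector h.
W = np.array([[1.0, 0.0, 0.2],
              [0.0, 1.0, 0.2],
              [0.5, 0.5, 1.0]])
h = np.array([2.0, 0.0, 1.0])

# v is a purely additive, non-negative combination of the columns of W:
# v = 2.0 * W[:, 0] + 0.0 * W[:, 1] + 1.0 * W[:, 2]
v = W @ h
```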
Since NMF, and also FCM, produce non-global solutions, it is possible to obtain different
clusters in every run. The sensitivity of NMF should be evaluated more extensively using
methods such as cross-validation.
Appendix A
A.1
Chapter 2
A.1.1
Acronym    Model
convexnmf  Convex and semi-nonnegative matrix factorizations (Ding et al. (2010))
orthnmf    Orthogonal nonnegative matrix t-factorizations (Ding et al. (2006))
nmfnnls    Sparse non-negative matrix factorization via alternating non-negativity-constrained least squares (Kim and Park (2008) and Kim and Park (2007))
nmfrule    Basic NMF (Lee and Seung (1999) and Lee and Seung (2001))
nmfmult    NMF with multiplicative update formula (Pauca et al. (2006))
nmfals     NMF with auxiliary constraints (Berry et al. (2007))
All initial seeds are created randomly and the maximum number of iterations is 1000.
A.2
Chapter 3
A.2.1
Comparison of FCM and NMF
Parameter                 Model                      Value
m (Fuzzification Factor)  FCM                        1.1
Norm                      FCM                        Euclidean
Number of Clusters (K)    FCM and NMF                2 to 45
Initial Seeds             FCM and NMF                Random
Maximum Iteration         NMF (Kim and Park (2007))  1000

A.3
Chapter 4
A.3.1
coVAT
Parameter          Model                         Value
Scaling Factor αr  coVAT                         0.1
Scaling Factor αc  coVAT                         0.1
Norm               coVAT (Bezdek et al. (2007))  Euclidean
A.3.2

Parameter                        Model                      Value
Percentage of Selected Features  Feature Selection          30%
Number of Clusters (K)           NMF                        6
Maximum Iteration                NMF                        1000
Initial Seeds                    NMF (Kim and Park (2007))  Random Positive Seeds

A.3.3
Stability Analysis

Parameter               Model                      Value
Number of Clusters (K)  NMF                        2 to 14
Initial Seeds           NMF                        Random Positive Seeds
Maximum Iteration       NMF                        1000
Number of Repetitions   NMF (Kim and Park (2007))  100

A.3.4
Cross-validation

Parameter               Model                      Value
Number of Clusters (K)  NMF                        6
Initial Seeds           NMF                        Random Positive Seeds
Maximum Iteration       NMF                        1000
Number of Folds         NMF                        10
Number of Permutations  NMF (Kim and Park (2007))  5
Bibliography
Batcha, N. and A. Zaki (2010, Jan.). Preliminary study of a new approach to NMF based text
summarization fused with anaphora resolution. In Knowledge Discovery and Data Mining,
2010. WKDD '10. Third International Conference on, pp. 367–370.
Benetos, E. and C. Kotropoulos (2010, Nov.). Non-negative tensor factorization applied to music
genre classification. Audio, Speech, and Language Processing, IEEE Transactions on 18(8),
1955–1967.
Berry, M. W., M. Browne, A. N. Langville, V. P. Pauca, and R. J. Plemmons (2007). Algorithms
and applications for approximate nonnegative matrix factorization. Computational Statistics
& Data Analysis 52(1), 155–173.
Bezdek, J. and R. Hathaway (2002). VAT: a tool for visual assessment of (cluster) tendency.
In Proceedings of the 2002 International Joint Conference on Neural Networks (IJCNN '02),
Volume 3, pp. 2225–2230.
Bezdek, J., R. Hathaway, and J. Huband (2007, Oct.). Visual assessment of clustering tendency
for rectangular dissimilarity matrices. Fuzzy Systems, IEEE Transactions on 15(5), 890–903.
Biederman, I. (1987). Recognition-by-components: A theory of human image understanding.
Psychological Review 94(2), 115–147.
Carmona-Saez, P., R. Pascual-Marqui, F. Tirado, J. Carazo, and A. Pascual-Montano (2006).
Biclustering of gene expression data by non-smooth non-negative matrix factorization. BMC
Bioinformatics 7, 78. doi:10.1186/1471-2105-7-78.
Chueinta, W., P. K. Hopke, and P. Paatero (2000). Investigation of sources of atmospheric
aerosol at urban and suburban residential areas in Thailand by positive matrix factorization.
Atmospheric Environment 34(20), 3319–3329.
Ding, C., T. Li, and M. Jordan (2010, Jan.). Convex and semi-nonnegative matrix factorizations.
Pattern Analysis and Machine Intelligence, IEEE Transactions on 32(1), 45–55.
Ding, C., T. Li, W. Peng, and H. Park (2006). Orthogonal nonnegative matrix t-factorizations
for clustering. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '06, New York, NY, USA, pp. 126–135. ACM.
Everitt, B. (1978). Graphical Techniques for Multivariate Data. London: Heinemann Educational.
Greene, D., G. Cagney, N. Krogan, and P. Cunningham (2008). Ensemble non-negative matrix
factorization methods for clustering protein–protein interactions. Bioinformatics 24(15),
1722–1728.
Hathaway, R. J. and J. C. Bezdek (1994). NERF c-means: Non-Euclidean relational fuzzy clustering. Pattern Recognition 27(3), 429–437.
Hathaway, R. J. and J. C. Bezdek (2002). Clustering incomplete relational data using the
non-Euclidean relational fuzzy c-means algorithm. Pattern Recognition Letters 23(1–3), 151–160.
Hathaway, R. J., J. C. Bezdek, and J. M. Huband (2006). Scalable visual assessment of cluster
tendency for large data sets. Pattern Recognition 39(7), 1315–1324.
Havens, T. and J. Bezdek (2011). An efficient formulation of the improved visual assessment
of cluster tendency (iVAT) algorithm. Knowledge and Data Engineering, IEEE Transactions
on PP(99), 1.
Jain, A. (1988). Algorithms for Clustering Data. Englewood Cliffs, N.J.: Prentice Hall.
Kim, H. and H. Park (2007). Sparse non-negative matrix factorizations via alternating non-negativity-constrained least squares for microarray data analysis. Bioinformatics 23(12),
1495–1502.
Kim, H. and H. Park (2008, July). Nonnegative matrix factorization based on alternating nonnegativity constrained least squares and active set method. SIAM J. Matrix Anal. Appl. 30,
713–730.
Kim, Y.-I., D.-W. Kim, D. Lee, and K. H. Lee (2004, December). A cluster validation index
for GK cluster analysis based on relative degree of sharing. Inf. Sci. Inf. Comput. Sci. 168,
225–242.
Lee, D. D. and H. S. Seung (1999, October). Learning the parts of objects by non-negative
matrix factorization. Nature 401(6755), 788–791.
Lee, D. D. and H. S. Seung (2001). Algorithms for non-negative matrix factorization. In
NIPS, pp. 556–562. MIT Press.
Leon, M. and B. A. Johnson (2003). Olfactory coding in the mammalian olfactory bulb. Brain
Research Reviews 42(1), 23–32.
Lin, C.-J. (2007). Projected gradient methods for nonnegative matrix factorization. Neural
Computation 19(10), 2756–2779.
Paatero, P. and U. Tapper (1994). Positive matrix factorization: A non-negative factor model
with optimal utilization of error estimates of data values. Environmetrics 5(2), 111–126.
Park, L., J. Bezdek, and C. Leckie (2009, Feb.). Visualization of clusters in very large rectangular
dissimilarity data. In Autonomous Robots and Agents, 2009. ICARA 2009. 4th International
Conference on, pp. 251–256.
Pauca, V. P., J. Piper, and R. J. Plemmons (2006). Nonnegative matrix factorization for spectral
data analysis. Linear Algebra and its Applications 416(1), 29–47. Special issue devoted to
the Haifa 2005 conference on matrix theory.
Ullman, S. (2000, July). High-Level Vision: Object Recognition and Visual Cognition (1st ed.).
The MIT Press.
Van Benthem, M. H. and M. R. Keenan (2004). Fast algorithm for the solution of large-scale
non-negativity-constrained least squares problems. Journal of Chemometrics 18(10), 441–450.
Xie, X. and G. Beni (1991, Aug.). A validity measure for fuzzy clustering. Pattern Analysis and
Machine Intelligence, IEEE Transactions on 13(8), 841–847.
Xu, W., X. Liu, and Y. Gong (2003). Document clustering based on non-negative matrix
factorization. In Proceedings of the 26th Annual International ACM SIGIR Conference on
Research and Development in Information Retrieval, SIGIR '03, New York, NY, USA, pp.
267–273. ACM.
Yuan, Z. and E. Oja (2005). Projective nonnegative matrix factorization for image compression
and feature extraction. In H. Kalviainen, J. Parkkinen, and A. Kaarna (Eds.), Image Analysis,
Volume 3540 of Lecture Notes in Computer Science, pp. 333–342. Springer Berlin / Heidelberg.
doi:10.1007/11499145_35.