
2013 IEEE 13th International Conference on Data Mining Workshops

Mining Discrete Patterns via Binary Matrix Factorization

Peng Jiang
Department of Computer Science
University of Illinois at Urbana-Champaign
Urbana, Illinois 61801
Email: pjiang2@illinois.edu

Michael T. Heath
Department of Computer Science
University of Illinois at Urbana-Champaign
Urbana, Illinois 61801
Email: heath@illinois.edu

Abstract: In general, binary matrix factorization (BMF) refers to the problem of finding two binary matrices of low rank such that the difference between their matrix product and a given binary matrix is minimal. BMF is an important tool in mining discrete patterns from high-dimensional data. One approximate matrix factor finds the dominant patterns, and the other shows how the original patterns are represented by the dominant ones. Determining the exact optimal solution is NP-hard. We show that BMF is closely related to k-means clustering and propose a clustering approach for BMF, which we prove has an approximation ratio of 2. We further propose a randomized clustering algorithm that chooses the k cluster centroids randomly, based on probabilities preassigned to each point; the randomized algorithm works well for large k. We experimentally demonstrate these theoretical properties of BMF in applications to pattern extraction and association rule mining.

Index Terms: Binary matrix factorization, k-means clustering, approximation algorithm, pattern extraction, association rule mining.

I. INTRODUCTION
Given a binary matrix $G \in \{0,1\}^{m \times n}$, the problem of binary matrix factorization (BMF) is to find two binary matrices $U \in \{0,1\}^{m \times k}$ and $W \in \{0,1\}^{k \times n}$ such that the distance between $G$ and the matrix product $UW$ is minimal. In the existing literature, the distance is measured by the square of the Frobenius norm, leading to the objective function $\|G - UW\|_F^2$. BMF arises naturally in applications involving binary data sets, such as association rule mining for the agaricus-lepiota mushroom data set [1], biclustering structure identification for gene expression data sets [2], [3], pattern discovery for gene expression pattern images [4], digit reconstruction for the USPS data set [5], mining high-dimensional discrete-attribute data [6], [7], market basket data clustering [8], and document clustering [3].
Numerous techniques have been proposed to deal with continuous data. The singular value decomposition (SVD) decomposes $G$ into $G = U \Sigma V^T$, where $U$ and $V$ are orthogonal matrices and $\Sigma$ is a diagonal matrix with the singular values in descending order. A rank-one approximation to $G$ can be obtained as the outer product of the first column of $U$ and the first column of $V$, scaled by the first singular value in $\Sigma$. Similarly, a rank-$k$ approximation can be obtained as the sum of the $k$ most significant rank-one matrices. This truncated SVD is the optimal solution for minimizing the error defined in the Frobenius norm. This result is originally due to Eckart and Young [9]; for a modern treatment in terms of the SVD, see Golub and Van Loan [10].
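For illustration, the truncated SVD is straightforward to compute with standard numerical libraries; the following minimal NumPy sketch (ours, not from the original paper) forms a rank-k approximation:

    import numpy as np

    def truncated_svd_approx(G, k):
        # Rank-k approximation of G via the truncated SVD (Eckart-Young).
        U, s, Vt = np.linalg.svd(G, full_matrices=False)
        # Keep the k largest singular values and corresponding singular vectors.
        return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]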
Nonnegative matrix factorization (NMF) [11], [12] is used to discover meaningful patterns in nonnegative matrices. It decomposes a nonnegative matrix $G$ into $G \approx UW$, where $U$ and $W$ are nonnegative matrices. The goal is to minimize $\|G - UW\|_F^2$, the same objective function as that of BMF. By definition, there are no negative values in the approximation. However, the solution is not unique: if $B$ is any invertible matrix such that $UB$ and $B^{-1}W$ remain nonnegative, then $UB$ and $B^{-1}W$ are also a solution, since $UW = (UB)(B^{-1}W)$.
The above methods, which work well for continuous data, can be applied to analyze binary data. However, the approximation they produce usually contains negative and non-binary values, from which it is hard to deduce useful information about the original binary data. Moreover, each real-valued entry requires 32 bits of storage as type float, or 64 bits as type double, whereas a binary entry needs only 1 bit.
In 2002, Koyuturk et al. [1] proposed PROXIMUS, which approximates a binary dataset using entries containing only 0s and 1s. It was the first algorithm to return binary approximate data. PROXIMUS finds a rank-one approximation to the original binary data and then performs recursive partitioning until some stopping criterion is met, terminating at a certain rank. Koyuturk et al. [6] further showed that BMF is NP-hard, even for rank-1 BMF, because it can be formulated as an integer programming problem with $2^{m+n}$ feasible solutions. They [7] also showed that there is no theoretical guarantee on the quality of the solution produced by PROXIMUS. Lin et al. [13] proposed an algorithm theoretically equivalent to PROXIMUS but with lower computational cost. Shen et al. [4] proposed a 2-approximation algorithm for rank-1 BMF by reformulating it as a 0-1 integer linear program (ILP). Gillis and Glineur [14] gave an upper bound for BMF by finding maximum edge bicliques in the bipartite graph whose adjacency matrix is $G$; they also proved that rank-1 BMF is NP-hard. Miettinen et al. [15] defined the discrete basis problem (DBP) as another way to approximate a binary matrix, using the boolean product $UW$ of two binary matrices so as to minimize $\|G - UW\|_1$.

They also considered a special variant of DBP that adds the constraint that each column of $W$ has exactly one entry of value 1. They named this problem the Discrete Basis Partitioning Problem (DBPP) and showed that there exists a 10-approximation algorithm for it. All the methods discussed above approximate the original binary data by another binary matrix. In the definition of BMF, the matrix product $UW$ is generally not required to be binary. However, since the matrix $G$ is binary, it is often desirable for the matrix product to be binary as well.
In this paper, we focus on this special variant of BMF in which the matrix product is restricted to the class of binary matrices. We note that when the matrix product $UW$ is binary, there is no difference between the squared Frobenius norm and the $l_1$ norm of the matrix $G - UW$. As we will see later, this substantially facilitates the proofs of theoretical properties. We explore the relationship between BMF and special classes of clustering problems and use this relation to develop an effective approximation algorithm, which we prove has an approximation ratio of 2. The clustering algorithm works well for small values of $k$, but its complexity is exponential in $k$ because it tries all possible choices of $k$ points out of $n$ as initial cluster centroids, so it is not suitable for large values of $k$. We therefore propose choosing the $k$ cluster centroids randomly, based on probabilities preassigned to each point, so that the resulting randomized clustering algorithm works well for large values of $k$, and we estimate the quality of its solution. Our approaches differ from other methods, such as PROXIMUS and ILP, in that they directly discover the desired number of underlying dominant patterns instead of first finding a rank-one approximation and then partitioning recursively. Since the approximate data represents the original quite well but with much smaller size, we can apply association rule mining algorithms to the approximate dataset, resulting in substantial speedup. The superior performance of our approach makes it particularly suitable for high-dimensional datasets.
A brief note about notation: for any matrix $G$, let $g_{i:}$ denote the $i$-th row, $g_i$ the $i$-th column, and $G_{ji}$ (or $g_i(j)$) the $j$-th element of $g_i$. We use $g_0$ to denote the origin in a suitable space.
II. PROBLEM DEFINITION
Though the definition of BMF does not require the approximating product matrix to be binary, it is often desirable to have a binary matrix product $UW$, as the original matrix is binary. However, a quadratic constraint on $UW$ makes the problem very hard to solve. One way to reduce the difficulty is to replace the quadratic constraint by a linear constraint that guarantees a binary matrix product. For this purpose, we introduce two specific variants of BMF. Given $G \in \{0,1\}^{m \times n}$ and an integer $k \le \min(m,n)$, our BMF problems of rank $k$ are defined as follows:

$$\min_{U,W} \; \|G - UW\|_1 \quad \text{s.t.} \quad U \in \{0,1\}^{m \times k}, \; W \in \{0,1\}^{k \times n}, \; W^T e_k \le e_n; \eqno(1)$$

$$\min_{U,W} \; \|G - UW\|_1 \quad \text{s.t.} \quad U \in \{0,1\}^{m \times k}, \; W \in \{0,1\}^{k \times n}, \; U e_k \le e_m. \eqno(2)$$
In our models we replace the squared Frobenius norm by the $l_1$ norm, defined as $\|A\|_1 = \sum_{i=1}^{m} \sum_{j=1}^{n} |a_{ij}|$ for any matrix $A$; the two measures are the same when $UW$ is binary. However, we will see later that using the $l_1$ norm substantially changes the solution process. In addition, $e_k$, $e_m$, and $e_n$ are vectors of all ones in $k$-, $m$-, and $n$-dimensional space, respectively. The constraint $W^T e_k \le e_n$ (or $U e_k \le e_m$) ensures that every column of $W$ (or every row of $U$) contains at most one nonzero element, which guarantees that $UW$ is a binary matrix.
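To make the role of the constraint concrete, the following small NumPy check (our illustration, with an arbitrary toy matrix) verifies that when every column of $W$ has at most one 1, the product $UW$ is binary:

    import numpy as np

    U = np.array([[1, 0],
                  [1, 1],
                  [0, 1]])                  # binary 3 x 2
    W = np.array([[1, 0, 0, 0],
                  [0, 1, 0, 1]])            # at most one 1 per column: W^T e_k <= e_n
    P = U @ W                               # each column of P is a column of U or zero
    assert set(np.unique(P)) <= {0, 1}      # the product is binary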
III. RELATIONSHIP BETWEEN BMF AND CLUSTERING

In this section, we reformulate BMF as a special k-means clustering and show that the two problems are equivalent. We also explore the relationship between BMF and several well-known clustering problems. We first define the special k-means clustering as follows:
Special k-means clustering. Given a set $V$ of $n$ binary $m$-dimensional points and a positive integer $k$ with $k \le n$, find a set $S = \{s_1, \ldots, s_k\}$ of binary $m$-dimensional centroids and a partition $C = \{C_0, C_1, \ldots, C_k\}$ such that $S$ and $C$ minimize
$$\sum_{i=1}^{k} \sum_{v \in C_i} \|v - s_i\|_1 + \sum_{v \in C_0} \|v - s_0\|_1,$$
where $s_0$ is the origin of the $m$-dimensional space. Note that each $s_i$ can be any binary vector, i.e., $S$ does not have to be a subset of $V$. We now show the equivalence of BMF and this special k-means clustering.
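Before doing so, we note that the objective above is a direct sum of coordinatewise absolute differences; a literal transcription (our illustration, with $V$ stored as rows of a NumPy array and label 0 denoting assignment to the origin) is:

    import numpy as np

    def special_kmeans_objective(V, S, labels):
        # V: n x m binary points; S: k x m binary centroids;
        # labels[i] in {0, ..., k}, with 0 meaning assignment to the origin s_0.
        centroids = np.vstack([np.zeros(V.shape[1], dtype=int), S])
        return int(np.abs(V - centroids[labels]).sum())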
Theorem III.1. BMF and the special k-means clustering
are equivalent in the sense that they have the same optimal
solution set and objective value.
Proof: Let us temporarily fix $U$ and consider the resulting subproblem of BMF (1):
$$\min_{W} \; \sum_{i=1}^{n} \|g_i - U w_i\|_1 \quad \text{s.t.} \quad e_k^T w_i \le 1, \; w_i \in \{0,1\}^k, \; i = 1, \ldots, n. \eqno(3)$$
It is easy to see that an optimal solution to this problem is obtained as follows:
$$w_i(j) = \begin{cases} 1 & \text{if } u_j = \arg\min_{l=1,\ldots,k} \|g_i - u_l\|_1 \text{ and } \|g_i - u_j\|_1 \le \|g_i\|_1, \\ 0 & \text{otherwise.} \end{cases}$$
If $w_i(j) = 1$, we say $g_i$ is assigned to $u_j$; otherwise $g_i$ is assigned to $u_0$, the origin of the space $\mathbb{R}^m$ (this is the case $w_i = 0$, which the constraint $e_k^T w_i \le 1$ permits). Thus problem (3) assigns each point $g_i$ to the nearest centroid in the set

$\{u_0, u_1, \ldots, u_k\}$. It follows that BMF (1) can be cast as the following clustering problem:
$$\min_{u_1,\ldots,u_k} \; \sum_{i=1}^{n} \; \min_{l=0,1,\ldots,k} \|g_i - u_l\|_1 \quad \text{s.t.} \quad u_j \in \{0,1\}^m, \; j = 1, \ldots, k. \eqno(4)$$
It is easy to see that problem (4) is just another formulation of the special k-means clustering. Thus BMF (1) is equivalent to the special k-means clustering, with $U$ giving the centroids and $W$ giving the partition. BMF (2) can similarly be reformulated as a special k-means clustering problem, with $W$ giving the centroids and $U$ giving the partition.
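The assignment step used in the proof is easy to implement. The sketch below (ours; it treats the columns of $G$ as points) computes an optimal $W$ for a fixed $U$, sending a point to the origin when no centroid is closer:

    import numpy as np

    def optimal_W(G, U):
        # Optimal W in problem (3) for fixed binary U (columns are centroids).
        m, n = G.shape
        k = U.shape[1]
        W = np.zeros((k, n), dtype=int)
        for i in range(n):
            g = G[:, i]
            dists = np.abs(U - g[:, None]).sum(axis=0)  # l1 distances to u_1, ..., u_k
            j = int(np.argmin(dists))
            if dists[j] <= g.sum():                     # ||g - u_j||_1 <= ||g - u_0||_1
                W[j, i] = 1                             # otherwise w_i stays 0 (origin)
        return W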
We now explore the relationship between BMF and the well-known k-means clustering problem [16], which is defined as follows:

Classical k-means clustering. Given a set $V$ of $n$ binary $m$-dimensional points and a positive integer $k$ with $k \le n$, find a partition $C = \{C_1, \ldots, C_k\}$ that minimizes
$$\sum_{i=1}^{k} \sum_{v \in C_i} \left\| v - \frac{1}{|C_i|} \sum_{u \in C_i} u \right\|_2^2.$$


Theorem III.1 allows us to use the special k-means clustering formulation in place of the BMF model when comparing BMF with k-means clustering. From the definitions of the two clustering problems, we conclude that BMF is very close to classical k-means clustering, with two differences. One is that in BMF an additional center, the origin, is used in the assignment process, which allows BMF to assign many sparse points to the origin and perform the clustering task only for the relatively dense points; intuitively, this helps to reduce the objective value. The other difference is that BMF allows any binary points as centroids, while in classical k-means clustering each cluster centroid is the geometric center of the data points in that cluster. However, we later prove the interesting conclusion that optimal centroids for BMF are obtained by rounding the geometric centers to binary. In addition, we use the $l_1$ norm in BMF, whereas the squared $l_2$ norm is used in k-means; the two measures are the same for binary data.
IV. TWO APPROXIMATION ALGORITHMS FOR BMF

In this section, we present two approximation algorithms for BMF. We first describe a deterministic 2-approximation algorithm, which is effective for small $k$ but takes too much time for large $k$. We then present a randomized clustering algorithm that chooses the $k$ cluster centroids based on probabilities preassigned to each point, so that it works well for large values of $k$.

The key issue in a clustering algorithm is updating the centroids, so we first discuss how to find the optimal binary centroid minimizing the sum of the $l_1$ distances within a cluster of the special k-means clustering. Given a cluster $C_1 = \{v_1, \ldots, v_p\}$ of $p$ binary data points, we seek a binary centroid $s_1$ solving the optimization problem
$$\min_{s_1} \; \sum_{v \in C_1} \|v - s_1\|_1. \eqno(5)$$
We call $s_1$ the $l_1$ center of the cluster, and the following theorem characterizes the optimal $s_1$.

Theorem IV.1. Given a cluster $C_1$ of $p$ binary data points, the optimal binary centroid of problem (5) is obtained by rounding the geometric center of cluster $C_1$ to binary.

The theorem holds because of the well-known fact that the median minimizes absolute error, which for binary vectors means in particular that the element-wise majority vote minimizes the $l_1$ norm.
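In code, the centroid update of Theorem IV.1 is an element-wise majority vote; a one-line sketch (ours):

    import numpy as np

    def l1_binary_center(C):
        # Binary l1 center of a cluster C (p x m binary matrix): round the mean.
        # Coordinatewise, the median of binary values is their majority vote.
        return (C.mean(axis=0) >= 0.5).astype(int)

When a coordinate is split exactly in half, rounding up or down gives the same objective value, so the tie-breaking direction is immaterial.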

A. A Deterministic 2-Approximation Algorithm

Many effective algorithms have been proposed for k-means clustering. Lloyd [17] proposed an iterative refinement method that is very widely used. In what follows, we modify Lloyd's algorithm for the BMF problem. We first cast every column of $G$ as a data point in $\mathbb{R}^m$ and denote the resulting data set by $V_G$, whose cardinality is $n$. We then form another set $S_V(k)$ that contains all subsets of $V_G$ of fixed size $k$; the cardinality of $S_V(k)$ is $\binom{n}{k}$. We obtain a clustering algorithm for BMF (1) in Algorithm 1, which tries every subset in $S_V(k)$ as an initial $U$ and returns the solution with minimum objective value over all runs.

Algorithm 1: Clustering for BMF (1)
 1  for $l \leftarrow 1$ to $\binom{n}{k}$ do
 2      Choose subset $s_l \in S_V(k)$ and form initial $U$ by casting every point in $s_l$ as a column vector;
 3      for $i \leftarrow 1$ to $n$ do
 4          Assign $g_i$ to the nearest centroid among $u_0, u_1, \ldots, u_k$;
 5          for $j \leftarrow 1$ to $k$ do
 6              if $g_i$ is assigned to $u_j$ then
 7                  $w_i(j) = 1$;
 8              else
 9                  $w_i(j) = 0$;
10              end
11          end
12      end
13      Compute the new $l_1$ center of each cluster $C_p$ based on the newly assigned data points; if there is no change in the $l_1$ center for any $p = 1, \ldots, k$ then
14          Output $U$ and corresponding $W$ as a candidate solution;
15      else
16          Update the $l_1$ center of each cluster and go to line 3;
17      end
18  end
19  Return $U$ and $W$ with minimum objective value over the $\binom{n}{k}$ runs.
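Steps 3-17 of Algorithm 1 alternate the assignment of problem (3) with the centroid update of Theorem IV.1 until the centers stabilize. A compact sketch of this inner loop (ours, reusing the optimal_W and l1_binary_center sketches above):

    import numpy as np

    def bmf_lloyd(G, U0, max_iter=100):
        # Alternate assignments and l1-center updates until no centroid changes.
        U = U0.copy()
        for _ in range(max_iter):
            W = optimal_W(G, U)                      # assignment step, problem (3)
            U_new = U.copy()
            for j in range(U.shape[1]):
                members = np.where(W[j, :] == 1)[0]
                if members.size > 0:                 # update only nonempty clusters
                    U_new[:, j] = l1_binary_center(G[:, members].T)
            if np.array_equal(U_new, U):
                break
            U = U_new
        return U, optimal_W(G, U)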
We next consider the approximation ratio of Algorithm 1.

Theorem IV.2. Suppose that $U^* = [u_1^*, \ldots, u_k^*]$ is the global optimal solution of problem (4), with objective value $f_{opt}$, and $U = [u_1, \ldots, u_k]$ is the solution output by Algorithm 1, with objective value $f(U)$. Then
$$f(U) \le 2 f_{opt}.$$

Proof: Let $C_p^*$ denote the $p$-th cluster, with binary centroid $u_p^*$, at the optimal solution for $1 \le p \le k$, and let $C_0^*$ be the optimal cluster with centroid $u_0$. Then we can rewrite the optimal objective value of problem (4) as
$$f_{opt} = \sum_{p=1}^{k} \sum_{g_i \in C_p^*} \|g_i - u_p^*\|_1 + \sum_{g_i \in C_0^*} \|g_i\|_1. \eqno(6)$$
Let $g_p^*$ denote the point in $C_p^*$ that is closest to the centroid, i.e.,
$$g_p^* = \arg\min_{g_i \in C_p^*} \|g_i - u_p^*\|_1. \eqno(7)$$
It follows that
$$\sum_{g_i \in C_p^*} \|g_i - g_p^*\|_1 = \sum_{g_i \in C_p^*} \|(g_i - u_p^*) + (u_p^* - g_p^*)\|_1 \le \sum_{g_i \in C_p^*} \left( \|g_i - u_p^*\|_1 + \|g_p^* - u_p^*\|_1 \right) \le 2 \sum_{g_i \in C_p^*} \|g_i - u_p^*\|_1, \eqno(8)$$
where the first inequality follows from the triangle inequality for the $l_1$ norm and the positive scalability property of the norm, i.e., $\|\alpha v\|_1 = \alpha \|v\|_1$ for any vector $v$ and scalar $\alpha \ge 0$, and the second inequality follows from (7). Therefore, we have
$$f(U) \le \sum_{p=1}^{k} \sum_{g_i \in C_p^*} \|g_i - g_p^*\|_1 + \sum_{g_i \in C_0^*} \|g_i\|_1 \le \sum_{p=1}^{k} \sum_{g_i \in C_p^*} 2 \|g_i - u_p^*\|_1 + \sum_{g_i \in C_0^*} \|g_i\|_1 \le 2 \left( \sum_{p=1}^{k} \sum_{g_i \in C_p^*} \|g_i - u_p^*\|_1 + \sum_{g_i \in C_0^*} \|g_i\|_1 \right) = 2 f_{opt},$$
where the first inequality holds because every $g_p^*$, $p = 1, \ldots, k$, is an original data point, so these $k$ points appear together as one of the initial centroid sets tried by Algorithm 1; the second inequality holds due to (8); the third inequality is straightforward to verify; and the last equality follows from (6).

We conclude from Theorem IV.2 that BMF (1) can be solved by Algorithm 1 with an approximation ratio of 2, since BMF (1) and problem (4) are equivalent problems. The approximation ratio of 2 is a desirable quality of Algorithm 1, especially considering that BMF is NP-hard. However, trying every possible set of initial centroids means that the algorithm is practical only for small $k$. In the next subsection, we discuss how a random starting strategy improves the efficiency of the algorithm and obtains a good approximation to BMF for large $k$.

B. A Randomized Approximation Algorithm

We present an $O(\log k)$ approximation algorithm for BMF based on randomized centers. Instead of the exhaustive search procedure of Algorithm 1, we modify the random seed selection process of kmeans++ [18] to obtain the starting centers. Let $D(x)$ denote the $l_1$ distance from a data point $x$ to the closest center already chosen. We use the following procedure to select the starting centers.

Algorithm 2: Randomized clustering for BMF (1)
 1  Take the origin of the space of data set $V$ to be the first center, $u_0$;
 2  for $i \leftarrow 1$ to $k$ do
 3      Choose cluster center $u_i$ by selecting $u_i = v' \in V$ with probability $D(v') / \sum_{v \in V} D(v)$;
 4  end
 5  Perform steps 3-17 of Algorithm 1.

We use a weighted probability distribution in which a point is chosen to be the next center with probability proportional to its $l_1$ distance from its closest existing center; for convenience, we call the weighting used in the above procedure $D^1$ weighting. This weighting is quite reasonable: a center that has already been chosen is assigned weight 0 and thus cannot be chosen again, and the next center is more likely to be chosen far away from the existing centers. Once $k$ starting centers are chosen, we continue with the clustering procedure of Algorithm 1.
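A sketch of the $D^1$-weighted seeding (ours; it mirrors the kmeans++ seeding of [18] but uses $l_1$ distances and fixes the origin as the first center):

    import numpy as np

    def d1_seeding(G, k, rng=None):
        # Choose k starting centers among the columns of G with D1 weighting.
        rng = np.random.default_rng() if rng is None else rng
        m, n = G.shape
        D = G.sum(axis=0).astype(float)   # l1 distance of each column to u_0 = origin
        centers = []
        for _ in range(k):
            probs = D / D.sum()           # P(v) proportional to D(v); chosen centers get 0
            idx = rng.choice(n, p=probs)
            c = G[:, idx].copy()
            centers.append(c)
            # Update D(x): distance to the closest center chosen so far.
            D = np.minimum(D, np.abs(G - c[:, None]).sum(axis=0))
        return np.column_stack(centers)   # the k chosen centers, e.g., as initial U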
Next we claim in the following theorem that Algorithm 2 has an approximation ratio of $O(\log k)$.

Theorem IV.3. Suppose that $f_{opt}$ is the optimal objective value of problem (4), and $U$ is the solution output by Algorithm 2 with randomly chosen initial centroids, with objective value $f(U)$. Then
$$E(f(U)) \le 4 (\ln k + 2) f_{opt}.$$

The theorem is similar to the corresponding result in [18], but we obtain a sharper bound on the objective value. As in [18], we need several technical results to prove it. For notational convenience, denote $C_{opt} = \{C_0, C_1, \ldots, C_k\}$, where each $C_i$ is the cluster in the optimal solution associated with cluster center $u_i$; in particular, $C_0$ is the cluster associated with centroid $u_0$.

We first prove the following lemma for a cluster whose center is selected uniformly at random from the set itself.

Lemma IV.4. Let $A$ be an arbitrary cluster in the final optimal clusters $C_{opt}$, and let $C$ be a clustering with center selected uniformly at random from $A$. Then
$$E(f(A)) \le 2 f_{opt}(A).$$

Proof: The proof follows a similar vein as the proof of Lemma 3.2 in [18], with the exception that the Euclidean distance is replaced by the $l_1$ distance. Let $c(A)$ be the $l_1$ center of the cluster in the optimal solution. It follows that
$$E(f(A)) = \sum_{a_0 \in A} \frac{1}{|A|} \sum_{a \in A} \|a - a_0\|_1 \le \frac{1}{|A|} \sum_{a_0 \in A} \left( \sum_{a \in A} \|a - c(A)\|_1 + |A| \, \|a_0 - c(A)\|_1 \right) = 2 \sum_{a \in A} \|a - c(A)\|_1. \eqno(9)$$
The inequality follows from the triangle inequality for the $l_1$ norm, and since $f_{opt}(A) = \sum_{a \in A} \|a - c(A)\|_1$, the claim follows. It should be mentioned that the above lemma also holds for the cluster $C_0$; in that case, we need only change the $l_1$ center $c(A)$ to $u_0$ in the proof.

We then extend the above result to the remaining centers, which are chosen with $D^1$ weighting.

Lemma IV.5. Let $A$ be an arbitrary cluster in the final optimal clusters $C_{opt}$, and let $C$ be an arbitrary clustering. If we add a random center to $C$ from $A$, chosen with $D^1$ weighting, then
$$E(f(A)) \le 4 f_{opt}(A).$$

Proof: Note that for any $a_0 \in A$, the probability that $a_0$ is selected as the center is $D(a_0) / \sum_{a \in A} D(a)$. The newly added center $a_0$ changes the contribution of each point $a$ to the objective value to $\min(D(a), \|a - a_0\|_1)$. It follows that
$$E(f(A)) = \sum_{a_0 \in A} \frac{D(a_0)}{\sum_{a \in A} D(a)} \sum_{a \in A} \min(D(a), \|a - a_0\|_1).$$
By the triangle inequality for the $l_1$ distance, we have $D(a_0) \le D(a) + \|a - a_0\|_1$ for any $a$. Summing over $a$, we obtain $D(a_0) \le \frac{1}{|A|} \sum_{a \in A} (D(a) + \|a - a_0\|_1)$, and hence
$$E(f(A)) \le \sum_{a_0 \in A} \frac{1}{|A|} \frac{\sum_{a \in A} (D(a) + \|a - a_0\|_1)}{\sum_{a \in A} D(a)} \sum_{a \in A} \min(D(a), \|a - a_0\|_1)$$
$$\le \frac{1}{|A|} \sum_{a_0 \in A} \frac{\sum_{a \in A} D(a)}{\sum_{a \in A} D(a)} \sum_{a \in A} \|a - a_0\|_1 + \frac{1}{|A|} \sum_{a_0 \in A} \frac{\sum_{a \in A} \|a - a_0\|_1}{\sum_{a \in A} D(a)} \sum_{a \in A} D(a) = \frac{2}{|A|} \sum_{a_0 \in A} \sum_{a \in A} \|a_0 - a\|_1 \le 4 f_{opt}(A),$$
where the second inequality holds because $\min(D(a), \|a - a_0\|_1) \le \|a - a_0\|_1$ and $\min(D(a), \|a - a_0\|_1) \le D(a)$, and the last inequality follows from Lemma IV.4.

Lemma IV.5 is similar to Lemma 3.3 in [18], but we obtain a closer bound on $f(A)$, namely a factor of 4 instead of the 8 in [18]. This is due to the use of the $l_1$ norm, and it leads to the sharper bound on the objective value in Theorem IV.3. The following lemma resembles Lemma 3.4 in [18], with a minor difference in the constant used in the estimate, so we omit its proof.

Lemma IV.6. Let $C$ be an arbitrary clustering. Choose $T > 0$ uncovered clusters from $C_{opt}$, let $V_u$ denote the set of points in these clusters, and let $V_c = V \setminus V_u$. Suppose we add $t \le T$ random centers to $C$, chosen with $D^1$ weighting, and let $C'$ denote the resulting clustering. Then
$$E(f(C')) \le (1 + H_t) \left( f(V_c) + 4 f_{opt}(V_u) \right) + \frac{T - t}{T} f(V_u),$$
where $H_t$ denotes the harmonic sum $1 + \frac{1}{2} + \cdots + \frac{1}{t}$.

Now we are ready to prove the main result of this subsection.

Proof of Theorem IV.3: Consider the clustering $C$ after all the starting centers have been selected. Let $A$ denote the cluster in $C_{opt}$ from which we choose $u_1$. Applying Lemma IV.6 with $t = T = k - 1$, and with $C_0$ and $A$ the only two possibly covered clusters, we have
$$E(f(C)) \le \left( f(C_0) + f(A) + 4 f_{opt} - 4 f_{opt}(C_0) - 4 f_{opt}(A) \right) (1 + H_{k-1}) \le 4 (2 + \ln k) f_{opt},$$
where the last inequality follows from Lemma IV.5 and the fact that $H_{k-1} \le 1 + \ln k$.

V. NUMERICAL RESULTS

In this section we compare our proposed algorithm with existing methods on real application problems. For efficiency, we implemented only the randomized algorithm, which we repeat 20 times, reporting the best solution and the average running time. All numerical tests were conducted in MATLAB R2010a on a Mac OS X machine with an Intel Core 2 Duo 3.06 GHz CPU and 4 GB RAM.

We compare our algorithm with PROXIMUS [1], which first performs a rank-one binary matrix factorization, then splits the dataset based on the entries of the resulting binary vector and recursively partitions in the direction of such vectors. We used an efficient implementation of the PROXIMUS algorithm, the CBA R package [19], in our experiments. CBA requires an input parameter called max.radius, the maximum number of entries by which a row may deviate from the dominant pattern to which it is assigned; we used trial and error to find a proper value for this parameter. PROXIMUS stops when it meets its stopping criteria and returns a certain number of patterns.

A. Pattern Extraction

In this subsection we compare the performance of BMF in extracting patterns on two synthetic binary datasets with that of PROXIMUS and k-means. We used the built-in MATLAB function for k-means. Since k-means usually returns negative and non-binary entries, which are hard to interpret for binary data, we change negative values to positive and round each entry to 0 or 1; we call this method rounded k-means.
The artificial dataset was generated by implanting uniform patterns into groups of rows on a background of uniform white noise. Each entry of the matrix is set to 1 with probability 0.01 as background noise. The $250 \times 84$ matrix has five row clusters, each of which carries two patterns randomly chosen from five uniform patterns. We generate each pattern as follows. Let $l$ denote the distance between the leading columns of successive patterns, and let $r$ denote the number of overlapping columns shared by two successive patterns, so each pattern spans $l + r$ columns. The $(i,j)$-th entry belonging to a row cluster containing the $k$-th pattern is set to 1 with probability 0.85, where $(k-1)l + 1 \le j \le kl + r$. In our experiments, we set $l = 16$ and $r = 4$.
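Our reconstruction of this generator (an illustration under the stated parameters; details such as equal cluster sizes are our assumptions) is:

    import numpy as np

    def make_synthetic(m=250, n=84, clusters=5, l=16, r=4,
                       p_noise=0.01, p_signal=0.85, seed=0):
        # Binary matrix with implanted uniform patterns on a noisy background.
        # Pattern k spans columns (k-1)*l+1 .. k*l+r (1-based), i.e., l+r columns,
        # so successive patterns overlap in r columns.
        rng = np.random.default_rng(seed)
        G = (rng.random((m, n)) < p_noise).astype(int)   # background noise
        rows = m // clusters                             # assume equal row clusters
        for c in range(clusters):
            rs = slice(c * rows, (c + 1) * rows)
            for k in rng.choice(clusters, size=2, replace=False):  # two patterns
                cs = slice(k * l, k * l + l + r)
                G[rs, cs] |= (rng.random((rows, l + r)) < p_signal).astype(int)
        return G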
We compare the performance in three respects:
• Sparsity patterns of the original and approximate matrices. Fig. 1 shows the sparsity patterns of the original matrix and the approximation obtained by each method. BMF discovers all the underlying patterns, PROXIMUS reveals most of them, while k-means clustering reveals only some patterns; rounded k-means improves on k-means but is still unable to discover all patterns.
• Whether the rank of the approximate matrix matches the number of clusters in the original matrix. The approximations of all methods except PROXIMUS are of rank 5, which matches the number of clusters in the original dataset. PROXIMUS returns a rank-10 approximation; the redundancy comes from dividing the second row cluster into several parts, which adds extra rank to the approximation.
• The Frobenius norm of the approximation error, i.e., the square root of the objective value used in our BMF model. The approximation error increases in the order PROXIMUS, BMF, k-means, and rounded k-means, with values 38.91, 39.09, 39.82, and 46.20, respectively.

B. Association Rule Mining

Since BMF extracts underlying patterns from high-dimensional data quite well, we apply it to speed up association rule mining, a popular method for discovering relations between variables in large databases. Association rules are employed in many application areas, including transaction data mining, book recommendation, and tag recommendation for videos or images. However, the cost of determining such relations is usually quite high due to the large size of these datasets. We demonstrate how binary matrix factorization can reduce this cost by replacing the original dataset with a much smaller approximate dataset.

Koyuturk et al. [1] illustrated such a factorization on a toy transaction matrix. Each of the $m$ transactions in $G$ is approximated by one of the $k$ virtual transactions in $W$, and the matrix $U$ gives the weights of the virtual transactions. We then apply association rule mining algorithms to $W$ instead of $G$, and we can expect significant speedup in discovering association rules on large real datasets.
[Fig. 1. Pattern extraction for a sample binary matrix with five row clusters, each containing a randomly chosen pair of five overlapping patterns: (a) original matrix, (b) BMF, (c) PROXIMUS, (d) k-means, (e) rounded k-means.]
TABLE I
IMAGE TAG DATASET

# rows    # columns    # nonzeros    min rowsum    max rowsum
 2000         81          11740           4             20

The real test dataset is the NUS-WIDE Flickr dataset [20], created by crawling photos from Flickr and choosing tags for the photos. Each row is an image and each column is a tag; if an image has a tag, the corresponding entry is 1, and otherwise 0. We filtered the dataset so that each row has at least two 1s, as the original dataset is very sparse. Table I describes the features of the filtered data.
We now compare the association rules mined on the original dataset and on the approximate ones obtained by BMF and PROXIMUS. We use an efficient MATLAB implementation of the well-known Apriori algorithm [21], named ARMADA [22], as the benchmark algorithm for association rule mining. We extended ARMADA so that, when the support is specified as a percentage of the data set, it accepts any value from 0 to 100 rather than only integers; for example, the extended version accepts a support value of 3.5%. We also modified ARMADA to produce a version that can mine the weighted transaction sets obtained from binary matrix factorization algorithms. We then run ARMADA on the original dataset and the modified ARMADA on the approximate dataset, and we compare the results in terms of the runtime of the algorithms and the number of rules extracted from the two datasets. Table II compares the results of running ARMADA on the original and approximate datasets, with runtimes given in seconds.


TABLE II
PERFORMANCE FOR ORIGINAL AND APPROXIMATE IMAGE TAG DATASETS
(times in seconds; approximate datasets obtained by BMF and by PROXIMUS)

min_sup   time    time    time    # rules  # rules  # rules  # matched  # matched
  (%)     orig.   BMF     PROX.   orig.    BMF      PROX.    BMF        PROX.
  5.0     62.57   5.78    2.31      75       72       42       69          39
  6.5     43.93   3.48    1.88      70       64       39       63          38
  7.5     32.43   3.04    1.56      63       60       36       59          35
  8.5     24.92   2.57    1.40      56       57       33       53          33
 10.0     20.00   1.85    1.01      53       52       31       51          31
 12.5     11.68   1.18    0.79      43       44       24       43          22
 14.5      9.03   0.95    0.72      40       40       20       40          19
 16.5      6.78   0.86    0.62      32       36       18       32          17
 18.5      5.15   0.59    0.49      24       20       16       18          14
 20.0      4.94   0.58    0.42      22       20       14       18          14
[Fig. 2. Assessment of association rules on the ImageTag dataset: time ratio, precision (%), and recall (%) versus min_sup (%), for BMF (cbmf) and PROXIMUS.]

The approximate datasets are obtained by either BMF or PROXIMUS. The support value for the ImageTag dataset ranges from 5% to 20%, selected as a meaningful range of support values, and the confidence value is set to 30%.

To assess the relative times and the quality of the resulting association rules, we define time ratio, precision, and recall as follows:

time ratio = (time of ARMADA on original dataset) / (time of ARMADA on approximate dataset),
precision = (# matched rules between original and approximate datasets) / (# rules discovered on approximate dataset),
recall = (# matched rules between original and approximate datasets) / (# rules discovered on original dataset)
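Given the two mined rule sets and runtimes, these measures are trivial to compute; a sketch (ours, with rules represented as hashable objects):

    def rule_metrics(rules_orig, rules_approx, t_orig, t_approx):
        # time ratio, precision, and recall for rules mined on the two datasets.
        matched = len(set(rules_orig) & set(rules_approx))
        return {"time_ratio": t_orig / t_approx,
                "precision": matched / len(rules_approx),
                "recall": matched / len(rules_orig)}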

Figure 2 compares the values of these measures; by their definitions, larger values mean better performance. Specifically, a high time ratio means that mining the approximate dataset costs much less than mining the original dataset, high precision means that most of the rules found on the approximate dataset match rules found on the original dataset, and high recall means that most of the rules found on the original dataset are also discovered by running the modified ARMADA on the approximate dataset.

Table II and Fig. 2 show the performance for the original ImageTag dataset and the approximations obtained by BMF and PROXIMUS. The rules displayed in the table are the rules of cardinality 2, which are the rules of interest for this dataset, as its average row sum is 5.87. We decompose the original

[Fig. 3. Number of rules of each cardinality (y-axis: no. of rules; x-axis: cardinality of rule) for support value 7.5 on the ImageTag dataset, for the original data and the approximations by BMF (CBMF) and PROXIMUS.]

matrix with 2000 patterns into an approximate matrix with 171 patterns, about one-twelfth of its original size. The runtime for decomposing the ImageTag dataset by BMF is 8.48 seconds, and we still benefit from the decomposition because it is performed only once, whereas we may mine association rules many times on the resulting approximate dataset. When the approximate dataset is obtained by BMF, we observe that for all support values the speedup is as high as an order of magnitude, the precision values are above 88%, and the recall values are above 75% (above 90% for most support values). The table and figure also show that mining the approximate dataset obtained by BMF discovers rules with better precision for most support values, and with much better recall for all support values, than mining the approximate dataset obtained by PROXIMUS. Thus BMF is especially suitable for decomposing very large datasets in order to mine rules of a given cardinality on the resulting approximate datasets. In addition, Fig. 3 shows the numbers of rules of all cardinalities for support value 7.5, as discovered by mining the original ImageTag dataset, the approximate dataset obtained by BMF, and the approximate dataset obtained by PROXIMUS. The rules mined on the BMF approximation are much closer to those mined on the original dataset than those for PROXIMUS: we discovered 186 rules on the original dataset, which we refer to as true rules; 188 rules for BMF, of which 156 match true rules; and 125 rules for PROXIMUS, of which only 91 match true rules.

We also evaluate the effect of the number of patterns in the approximate dataset on the time and quality of the association rules. Table III lists the time for decomposing the ImageTag dataset by BMF, and it compares the rules discovered on the original dataset of 2000 patterns with those on approximate datasets of 171, 263, and 394 patterns produced by BMF and PROXIMUS. The table gives the average time ratio, precision, and recall over 10 support values from 5% to 20%. As seen in the table, increasing the number of patterns increases the decomposition time and reduces the time ratio, as expected. However, the precision and recall values are not significantly affected by the number of patterns, especially for BMF, which attains over 92% average precision and recall for all datasets. Thus we conclude that it suffices to mine approximate datasets with only a small number of patterns obtained by BMF, achieving high precision, high recall, and large speedup.


TABLE III
EFFECT OF NUMBER OF PATTERNS ON ASSOCIATION RULES

# patterns   decomp. time   time ratio   time ratio   prec. (%)   prec. (%)   recall (%)   recall (%)
 approx.      BMF (s)         BMF          PROX.        BMF         PROX.       BMF          PROX.
   171          8.48          9.92         16.94       95.03       95.61       92.33        55.30
   263         13.03          6.79         11.93       94.25       99.34       92.43        59.58
   394         22.38          4.36          5.39       92.89       95.96       97.07        83.16
97.07
83.16

VI. CONCLUSIONS

We developed models for BMF, connected them with several clustering problems, proposed deterministic and randomized clustering algorithms for solving BMF, analyzed their approximation ratios, and presented empirical results on the behavior of the randomized clustering algorithm. The results indicate that our algorithm finds underlying patterns quite well, so the approximate datasets represent the original ones well but with much smaller size. Applying association rule mining algorithms to the approximate datasets then yields good performance and large speedups. The superior performance of BMF makes it particularly suitable for high-dimensional datasets.

Our future work will focus on two aspects: applying BMF to broader applications and proposing new models. We will extend the application to association rule mining by testing algorithms on transaction datasets from retail supermarkets, and we will further evaluate the effect of the number of patterns in the approximate datasets on the time and quality of the association rules. We will also apply BMF to other applications, such as metagene pattern discovery in the bioinformatics community and document clustering in information retrieval. On the modeling side, we will extend the BMF model to handle the two types of mismatched entries in the approximation, 0-to-1 and 1-to-0, differently. Our current BMF models minimize the sum of the two types of mismatches without any preference between them, but in many practical applications it may be helpful to include such a preference, so we will propose a new model that uses different weights for each type of error.

REFERENCES

[1] M. Koyuturk, A. Grama, and N. Ramakrishnan, "Algebraic techniques for analysis of large discrete-valued datasets," in PKDD, 2002, pp. 311–324.
[2] Z. Y. Zhang, T. Li, C. Ding, X. W. Ren, and X. S. Zhang, "Binary matrix factorization for analyzing gene expression data," Data Mining and Knowledge Discovery, vol. 20, no. 1, pp. 28–52, 2010.
[3] Z. Y. Zhang, T. Li, C. Ding, and X. S. Zhang, "Binary matrix factorization with applications," in IEEE International Conference on Data Mining, 2007, pp. 391–400.
[4] B. H. Shen, S. Ji, and J. Ye, "Mining discrete patterns via binary matrix factorization," in Proceedings of the 2009 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'09), 2009, pp. 757–766.
[5] E. Meeds, Z. Ghahramani, R. M. Neal, and S. T. Roweis, "Modeling dyadic data with binary latent factors," in Neural Information Processing Systems (NIPS'06), 2006, pp. 977–984.
[6] M. Koyuturk, A. Grama, and N. Ramakrishnan, "Compression, clustering, and pattern discovery in very high-dimensional discrete-attribute data sets," IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 4, pp. 447–461, 2005.
[7] M. Koyuturk, A. Grama, and N. Ramakrishnan, "Nonorthogonal decomposition of binary matrices for bounded-error data compression and analysis," ACM Transactions on Mathematical Software, vol. 32, no. 1, pp. 33–69, 2006.
[8] T. Li, "A general model for clustering binary data," in Proceedings of the 2005 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'05), 2005, pp. 188–197.
[9] C. Eckart and G. Young, "The approximation of one matrix by another of lower rank," Psychometrika, vol. 1, no. 3, pp. 211–218, 1936.
[10] G. H. Golub and C. F. Van Loan, Matrix Computations, 3rd ed. Johns Hopkins University Press, 1996.
[11] T. Li and C. Ding, "The relationships among various nonnegative matrix factorization methods for clustering," in ICDM, 2006, pp. 362–371.
[12] H. Kim and H. Park, "Nonnegative matrix factorization based on alternating nonnegativity constrained least squares and active set method," SIAM Journal on Matrix Analysis and Applications, vol. 30, pp. 713–730, 2008.
[13] M. M. Lin, B. Dong, and M. T. Chu, "Integer matrix factorization and its application," Technical Report, 2005.
[14] N. Gillis and F. Glineur, "Using underapproximations for sparse nonnegative matrix factorization," Pattern Recognition, vol. 43, no. 4, pp. 1676–1687, 2010.
[15] P. Miettinen, T. Mielikainen, A. Gionis, G. Das, and H. Mannila, "The discrete basis problem," in 17th European Conference on Machine Learning (ECML'06), 2006, pp. 335–346.
[16] J. MacQueen, "Some methods for classification and analysis of multivariate observations," in Proc. 5th Berkeley Symposium on Mathematical Statistics and Probability, 1967, pp. 281–297.
[17] S. Lloyd, "Least squares quantization in PCM," IEEE Trans. Inform. Theory, pp. 129–137, 1982.
[18] D. Arthur and S. Vassilvitskii, "k-means++: The advantages of careful seeding," in Proc. Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, 2007, pp. 1027–1035.
[19] "cba: Clustering for Business Analytics," http://cran.r-project.org/web/packages/cba/index.html.
[20] T. S. Chua, J. H. Tang, R. C. Hong, H. J. Li, Z. P. Luo, and Y. T. Zheng, "NUS-WIDE: A real-world web image database from National University of Singapore," in ACM International Conference on Image and Video Retrieval, 2009, pp. 8–10.
[21] R. Agrawal and R. Srikant, "Fast algorithms for mining association rules," in Proc. 20th Int. Conf. on Very Large Data Bases, 1994, pp. 487–499.
[22] "ARMADA," http://www.aiai.ed.ac.uk/jmalone/software.htm.
