Michael T. Heath
I. INTRODUCTION
Given a binary matrix $G \in \{0,1\}^{m \times n}$, the problem of binary matrix factorization (BMF) is to find two binary matrices $U \in \{0,1\}^{m \times k}$ and $W \in \{0,1\}^{k \times n}$ so that the distance between G and the matrix product UW is minimal. In the existing literature, the distance is measured by the square of the Frobenius norm, leading to an objective function $\|G - UW\|_F^2$.
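When the product UW happens to be binary itself, the squared Frobenius residual simply counts the entries where G and UW disagree. A minimal sketch of evaluating this objective with NumPy; the matrices here are a made-up toy instance, not data from the paper:

```python
import numpy as np

def bmf_objective(G, U, W):
    """Squared Frobenius norm of the BMF residual G - U W."""
    R = G - U @ W
    return float(np.sum(R * R))

# Tiny illustrative instance: m = 4, n = 3, k = 2.
G = np.array([[1, 0, 1],
              [1, 0, 1],
              [0, 1, 0],
              [0, 1, 1]])
U = np.array([[1, 0],
              [1, 0],
              [0, 1],
              [0, 1]])
W = np.array([[1, 0, 1],
              [0, 1, 0]])
print(bmf_objective(G, U, W))  # -> 1.0 (one mismatched entry)
```

Here UW stays binary because each row of U selects a single row of W; in general UW can have entries larger than 1.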
BMF arises naturally in applications involving binary data sets, such as association rule mining for agaricus-lepiota mushroom data sets [1], biclustering structure identification for gene expression data sets [2], [3], pattern discovery for gene expression pattern images [4], digits reconstruction for USPS data sets [5], mining high-dimensional discrete-attribute data [6], [7], market basket data clustering [8], and document clustering [3].
Numerous techniques have been proposed to deal with continuous data. Singular value decomposition (SVD) decomposes G into $G = U \Sigma V^T$, where U and V are orthogonal matrices, and $\Sigma$ is a diagonal matrix with singular values in descending order. A rank-one approximation to G can be obtained from the outer product of the first column of U and the first column of V, scaled by the first singular value in $\Sigma$. Similarly, a rank-k approximation can be obtained by the sum of the k most significant rank-one matrices. This truncated SVD gives the best rank-k approximation in the Frobenius norm [9].
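The truncated SVD described above can be sketched in a few lines of NumPy; the matrix here is a random stand-in, not data from the paper:

```python
import numpy as np

def truncated_svd(G, k):
    """Best rank-k approximation of G in the Frobenius norm via truncated SVD."""
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    # Sum of the k most significant rank-one matrices sigma_i * u_i v_i^T.
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

rng = np.random.default_rng(0)
G = rng.random((6, 4))
G2 = truncated_svd(G, 2)
print(np.linalg.matrix_rank(G2))      # 2
print(np.linalg.norm(G - G2, "fro"))  # error of the best rank-2 approximation
```

For binary factors no such closed-form solution exists, which is what makes BMF hard.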
978-0-7695-5109-8/13 $31.00 2013 IEEE
DOI 10.1109/ICDMW.2013.46
$$\min_{U,W} \|G - UW\|_1 \quad \text{s.t.} \quad U \in \{0,1\}^{m \times k},\ W \in \{0,1\}^{k \times n}. \eqno(1)$$

$$\min_{U,W} \|G - UW\|_1 \quad \text{s.t.} \quad U \in \{0,1\}^{m \times k},\ W \in \{0,1\}^{k \times n},\ U e_k \le e_m. \eqno(2)$$
$$\sum_{i=1}^{k} \sum_{v \in C_i} \|v - s_i\|_1 + \sum_{v \in C_0} \|v - s_0\|_1,$$
$$\min \sum_{i=1}^{n} \|g_i - U w_i\|_1 \eqno(3)$$
$$\text{s.t.} \quad e_k^T w_i \le 1,\ i = 1,\dots,n; \quad w_i \in \{0,1\}^k,\ i = 1,\dots,n.$$
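For fixed U, problem (3) separates over the columns of W: the constraint $e_k^T w_i \le 1$ forces each $w_i$ to be either zero or a standard basis vector, so the optimal $w_i$ simply picks the column of U (or the zero vector) closest to $g_i$ in the $\ell_1$ norm. A sketch of this closed-form update (the function name is hypothetical, and $g_i$ is taken as the i-th column of G):

```python
import numpy as np

def best_w(G, U):
    """For fixed binary U, solve problem (3): each column w_i of W is zero or
    a standard basis vector, chosen to minimize ||g_i - U w_i||_1."""
    m, k = U.shape
    n = G.shape[1]
    W = np.zeros((k, n), dtype=int)
    for i in range(n):
        g = G[:, i]
        # Candidate l = 0 means w_i = 0, with cost ||g_i||_1.
        best_cost, best_l = int(np.abs(g).sum()), 0
        for l in range(k):
            cost = int(np.abs(g - U[:, l]).sum())
            if cost < best_cost:
                best_cost, best_l = cost, l + 1
        if best_l > 0:
            W[best_l - 1, i] = 1
    return W

# Exact recovery on a toy instance where G = U W_true.
U = np.array([[1, 0], [1, 0], [0, 1]])
W_true = np.array([[1, 0, 1], [0, 1, 0]])
G = U @ W_true
print((best_w(G, U) == W_true).all())  # True
```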
$$\min_{u_1,\dots,u_k} \sum_{i=1}^{n} \min_{l=0,1,\dots,k} \|g_i - u_l\|_1 \eqno(4)$$
$$\text{s.t.} \quad u_j \in \{0,1\}^m,\ j = 1,\dots,k,$$
where $u_0 = 0$ is fixed.
$$f(U) \le 2 f_{opt}.$$
Proof: Let $C_p$ denote the p-th cluster with the binary centroid $u_p$ at the optimal solution for $1 \le p \le k$, and $C_0$ the optimal cluster with centroid $u_0 = 0$. Then we can rewrite the optimal objective value of problem (4) as
$$f_{opt} = \sum_{p=1}^{k} \sum_{g_i \in C_p} \|g_i - u_p\|_1 + \sum_{g_i \in C_0} \|g_i\|_1. \eqno(6)$$
Let $g_p \in C_p$ be a point closest to $u_p$, so that $\|g_p - u_p\|_1 \le \|g_i - u_p\|_1$ for every $g_i \in C_p$. It follows from the triangle inequality that
$$\sum_{g_i \in C_p} \|g_i - g_p\|_1 \le \sum_{g_i \in C_p} \left( \|g_i - u_p\|_1 + \|g_p - u_p\|_1 \right) \le 2 \sum_{g_i \in C_p} \|g_i - u_p\|_1. \eqno(8)$$
Therefore
$$f(U) \le \sum_{p=1}^{k} \sum_{g_i \in C_p} \|g_i - g_p\|_1 + \sum_{g_i \in C_0} \|g_i\|_1 \le 2 \left( \sum_{p=1}^{k} \sum_{g_i \in C_p} \|g_i - u_p\|_1 + \sum_{g_i \in C_0} \|g_i\|_1 \right) = 2 f_{opt}.$$
(Algorithm 2, remaining steps: perform steps 3-17 of Algorithm 1.)
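For k = 1 the bound above says that using the best data point as the centroid costs at most twice as much as the best binary centroid, which in the $\ell_1$ norm is the entrywise majority vote. A brute-force check of this factor-2 guarantee on small random binary data (sizes are arbitrary):

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
G = rng.integers(0, 2, size=(8, 6))  # 8 binary points in {0,1}^6

def cost(u):
    """Total l1 distance from all points (rows of G) to the centroid u."""
    return int(np.abs(G - u).sum())

# f_opt: best binary centroid, by exhaustive search over {0,1}^6.
f_opt = min(cost(np.array(u)) for u in itertools.product([0, 1], repeat=6))

# f(U): best centroid chosen among the data points themselves.
f_data = min(cost(g) for g in G)

print(f_data <= 2 * f_opt)  # True, as the lemma guarantees
```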
For any cluster A with centroid c(A), the triangle inequality gives
$$\frac{1}{|A|} \sum_{a_0 \in A} \sum_{a \in A} \|a - a_0\|_1 \le \frac{1}{|A|} \sum_{a_0 \in A} \sum_{a \in A} \left( \|a - c(A)\|_1 + \|a_0 - c(A)\|_1 \right) = 2 \sum_{a \in A} \|a - c(A)\|_1. \eqno(9)$$
For the seeding step, with D(a) the distance from a to the nearest centroid chosen so far,
$$\mathbb{E}[f(A)] = \sum_{a_0 \in A} \frac{D(a_0)}{\sum_{a \in A} D(a)} \sum_{a \in A} \min\left( D(a), \|a - a_0\|_1 \right)$$
$$\le \frac{1}{|A|} \sum_{a_0 \in A} \left( \frac{\sum_{a \in A} \|a - a_0\|_1}{\sum_{a \in A} D(a)} + 1 \right) \sum_{a \in A} \min\left( D(a), \|a - a_0\|_1 \right) \le \frac{2}{|A|} \sum_{a_0 \in A} \sum_{a \in A} \|a_0 - a\|_1 \le 4 f_{opt}(A),$$
where the second inequality holds because the two relationships hold: $\min(D(a), \|a - a_0\|_1) \le \|a - a_0\|_1$ and $\min(D(a), \|a - a_0\|_1) \le D(a)$. The last inequality follows from Lemma IV.4.
Lemma IV.5 is similar to Lemma 3.3 in [18], but we gain a closer bound on f(A) of 4, instead of 8 in [18]. This is because the $\ell_1$ norm obeys the triangle inequality directly, avoiding the extra factor of 2 incurred by squared distances. Summing over clusters then yields
$$\mathbb{E}(f(C)) \le 4 (2 + \ln k) f_{opt},$$
where the last inequality follows from Lemma IV.5 and the fact that $H_{k-1} \le 1 + \ln k$.

V. NUMERICAL RESULTS

A. Pattern Extraction
- Compare sparsity patterns of the original and the approximate matrix. Fig. 1 shows the sparsity patterns of the original matrix and the approximate one obtained by each method. It shows that BMF discovers all the underlying patterns, PROXIMUS reveals most of them, while k-means clustering reveals only some patterns; rounded k-means improves on plain k-means but is still unable to discover all patterns.
- Tell whether the rank of the approximate matrix matches the number of clusters in the original matrix. The approximations of all methods except PROXIMUS are of rank 5, which matches the number of clusters in the original dataset. PROXIMUS returns a rank-10 approximation. The redundancy in the approximation is the division of the second row cluster into several parts, which adds additional ranks to the approximation.
- Calculate the Frobenius norm of the approximation error, i.e., the square root of the objective value used in our BMF model. The approximation error increases in the order PROXIMUS, BMF, k-means, and rounded k-means, with values 38.91, 39.09, 39.82, and 46.20, respectively.
Fig. 1. Pattern Extraction for Sample Binary Matrix with Five Row Clusters, Each of Which Contains a Randomly Chosen Pair of Five Overlapping Patterns. (Panels include (b) BMF, (c) PROXIMUS, and (d) k-means.)
TABLE I
IMAGE TAG DATASET

# rows   # columns   # nonzeros   min rowsum   max rowsum
2000     81          11740        4            20
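The statistics in Table I can be read directly off a binary matrix. A sketch, using a small made-up matrix in place of the actual ImageTag data:

```python
import numpy as np

def dataset_stats(G):
    """Row/column counts, nonzeros, and min/max row sums of a binary matrix."""
    rowsums = G.sum(axis=1)
    return {
        "# rows": G.shape[0],
        "# columns": G.shape[1],
        "# nonzeros": int(G.sum()),
        "min rowsum": int(rowsums.min()),
        "max rowsum": int(rowsums.max()),
    }

toy = np.array([[1, 0, 1, 1],
                [0, 1, 0, 0],
                [1, 1, 1, 0]])
print(dataset_stats(toy))
```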
TABLE II
PERFORMANCE FOR ORIGINAL AND APPROXIMATE IMAGE TAG DATASETS

min_sup (%) | time orig. | time approx. (BMF) | time approx. (PROX.) | # rules orig. | # rules (BMF) | # rules (PROX.) | # rules matched (BMF) | # rules matched (PROX.)
5           | 62.57      | 5.78               | 2.31                 | 75            | 72            | 42              | 69                    | 39
6.5         | 43.93      | 3.48               | 1.88                 | 70            | 64            | 39              | 63                    | 38
7.5         | 32.43      | 3.04               | 1.56                 | 63            | 60            | 36              | 59                    | 35
8.5         | 24.92      | 2.57               | 1.40                 | 56            | 57            | 33              | 53                    | 33
10          | 20.00      | 1.85               | 1.01                 | 53            | 52            | 31              | 51                    | 31
12.5        | 11.68      | 1.18               | 0.79                 | 43            | 44            | 24              | 43                    | 22
14.5        | 9.03       | 0.95               | 0.72                 | 40            | 40            | 20              | 40                    | 19
16.5        | 6.78       | 0.86               | 0.62                 | 32            | 36            | 18              | 32                    | 17
18.5        | 5.15       | 0.59               | 0.49                 | 24            | 20            | 16              | 18                    | 14
20          | 4.94       | 0.58               | 0.42                 | 22            | 20            | 14              | 18                    | 14
Fig. 2. Time ratio (left), precision (center, %), and recall (right, %) versus min_sup (%), for CBMF and PROXIMUS.

The three measures are defined as
time_ratio = (rule-mining time on the original dataset) / (rule-mining time on the approximate dataset),
precision = (# rules matched) / (# rules found on the approximate dataset),
recall = (# rules matched) / (# rules found on the original dataset).
Figure 2 compares the values of these measures. The definitions of these measures imply that larger values mean better performance. Specifically, a high time ratio means that mining on the approximate datasets costs much less than mining on the original datasets, high precision means that most of the rules found on the approximate datasets match those found on the original datasets, and high recall means that most of the rules found on the original datasets can be discovered by running the modified ARMADA on the approximate datasets.
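Under these definitions, the three measures are straightforward to compute from the mined rule sets and timings. A sketch with a hypothetical helper name; the timings echo the first row of Table II (min_sup = 5%), while the rule sets are made up for illustration:

```python
def rule_measures(rules_orig, rules_approx, time_orig, time_approx):
    """time_ratio, precision, recall for association rules mined on the
    original vs. approximate datasets. Rules are represented as frozensets."""
    matched = rules_orig & rules_approx
    return {
        "time_ratio": time_orig / time_approx,
        "precision": 100.0 * len(matched) / len(rules_approx),
        "recall": 100.0 * len(matched) / len(rules_orig),
    }

orig = {frozenset(p) for p in [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]}
approx = {frozenset(p) for p in [("a", "b"), ("a", "c"), ("b", "c")]}
print(rule_measures(orig, approx, time_orig=62.57, time_approx=5.78))
```

In the experiments reported here, matching is between rules of the same cardinality; the set intersection above captures that for cardinality-2 rules.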
Table II and Fig. 2 show the performance for the original
ImageTag dataset and the approximate one obtained by BMF
or PROXIMUS. The rules displayed in this table are the rules
of cardinality 2, which are the rules of interest for this dataset,
as its average row sum is 5.87. We decompose the original
Fig. 3. Number of rules versus cardinality of rule (3 and 4), for min_sup values ranging from 5% to 20%.
TABLE III
EFFECT OF NUMBER OF PATTERNS ON ASSOCIATION RULES

# patterns | decomp. time approx. (BMF) | time ratio (BMF) | time ratio (PROX.) | prec. (%) (BMF) | prec. (%) (PROX.) | recall (%) (BMF) | recall (%) (PROX.)
171        | 8.48                       | 9.92             | 16.94              | 95.03           | 95.61             | 92.33            | 55.30
263        | 13.03                      | 6.79             | 11.93              | 94.25           | 99.34             | 92.43            | 59.58
394        | 22.38                      | 4.36             | 5.39               | 92.89           | 95.96             | 97.07            | 83.16
REFERENCES
[1] M. Koyuturk, A. Grama, and N. Ramakrishnan, "Algebraic techniques for analysis of large discrete-valued datasets," in PKDD, 2002, pp. 311-324.
[2] Z. Y. Zhang, T. Li, C. Ding, X. W. Ren, and X. S. Zhang, "Binary matrix factorization for analyzing gene expression data," Data Mining and Knowledge Discovery, vol. 20, no. 1, pp. 28-52, 2010.
[3] Z. Y. Zhang, T. Li, C. Ding, and X. S. Zhang, "Binary matrix factorization with applications," in IEEE International Conference on Data Mining, 2007, pp. 391-400.
[4] B. H. Shen, S. Ji, and J. Ye, "Mining discrete patterns via binary matrix factorization," in Proceedings of the 2009 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'09), 2009, pp. 757-766.
[5] E. Meeds, Z. Ghahramani, R. M. Neal, and S. T. Roweis, "Modeling dyadic data with binary latent factors," in Neural Information Processing Systems (NIPS'06), 2006, pp. 977-984.
[6] M. Koyuturk, A. Grama, and N. Ramakrishnan, "Compression, clustering, and pattern discovery in very high-dimensional discrete-attribute data sets," IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 4, pp. 447-461, 2005.
[7] M. Koyuturk, A. Grama, and N. Ramakrishnan, "Nonorthogonal decomposition of binary matrices for bounded-error data compression and analysis," ACM Transactions on Mathematical Software, vol. 32, no. 1, pp. 33-69, 2006.
[8] T. Li, "A general model for clustering binary data," in Proceedings of the 2005 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'05), 2005, pp. 188-197.
[9] C. Eckart and G. Young, "The approximation of one matrix by another of lower rank," Psychometrika, vol. 1, no. 3, pp. 211-218, 1936.
[10] G. H. Golub and C. F. Van Loan, Matrix Computations, 3rd ed. Johns Hopkins University Press, 1996.
[11] T. Li and C. Ding, "The relationships among various nonnegative matrix factorization methods for clustering," in ICDM, 2006, pp. 362-371.
[12] H. Kim and H. Park, "Nonnegative matrix factorization based on alternating nonnegativity constrained least squares and active set method," SIAM Journal on Matrix Analysis and Applications, vol. 30, pp. 713-730, 2008.
[13] M. M. Lin, B. Dong, and M. T. Chu, "Integer matrix factorization and its application," Technical Report, 2005.
[14] N. Gillis and F. Glineur, "Using underapproximations for sparse nonnegative matrix factorization," Pattern Recognition, vol. 43, no. 4, pp. 1676-1687, 2010.
[15] P. Miettinen, T. Mielikainen, A. Gionis, G. Das, and H. Mannila, "The discrete basis problem," in 17th European Conference on Machine Learning (ECML'06), 2006, pp. 335-346.
[16] J. MacQueen, "Some methods for classification and analysis of multivariate observations," in Proc. 5th Berkeley Symposium on Mathematical Statistics and Probability, 1967, pp. 281-297.
[17] S. Lloyd, "Least squares quantization in PCM," IEEE Trans. Inform. Theory, pp. 129-137, 1982.
[18] D. Arthur and S. Vassilvitskii, "k-means++: The advantages of careful seeding," in Proc. Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, 2007, pp. 1027-1035.
[19] "cba: Clustering for business analytics," http://cran.r-project.org/web/packages/cba/index.html.
[20] T. S. Chua, J. H. Tang, R. C. Hong, H. J. Li, Z. P. Luo, and Y. T. Zheng, "NUS-WIDE: A real-world web image database from National University of Singapore," in ACM International Conference on Image and Video Retrieval, 2009, pp. 810.
[21] R. Agrawal and R. Srikant, "Fast algorithms for mining association rules," in Proc. 20th Int. Conf. on Very Large Data Bases, 1994, pp. 487-499.
[22] ARMADA, http://www.aiai.ed.ac.uk/jmalone/software.htm.