
2013 IEEE Conference on Computer Vision and Pattern Recognition

Weakly-Supervised Dual Clustering for Image Semantic Segmentation

Yang Liu, Jing Liu, Zechao Li, Jinhui Tang, Hanqing Lu



NLPR, Institute of Automation, Chinese Academy of Sciences, Beijing, China, 100190.

School of Computer Science, Nanjing University of Science and Technology, China, 210044.
{liuyang6, jliu, luhq}@nlpr.ia.ac.cn, zechao.li@gmail.com, jinhuitang@mail.njust.edu.cn

Abstract

In this paper, we propose a novel Weakly-Supervised Dual Clustering (WSDC) approach for image semantic segmentation with image-level labels, i.e., collaboratively performing image segmentation and tag alignment with those regions. The proposed approach is motivated by the observation that superpixels belonging to an object class usually exist across multiple images and hence can be gathered via the idea of clustering. In WSDC, spectral clustering is adopted to cluster the superpixels obtained from a set of over-segmented images. At the same time, a linear transformation between features and labels, as a kind of discriminative clustering, is learned to select the discriminative features among different classes. The outputs of the two clusterings should be as consistent as possible. Besides, weakly-supervised constraints from image-level labels are imposed to restrict the labeling of superpixels. Finally, the non-convex and non-smooth objective function is efficiently optimized using an iterative CCCP procedure. Extensive experiments conducted on the MSRC and LabelMe datasets demonstrate the encouraging performance of our method in comparison with some state-of-the-art approaches.

1. Introduction

Image semantic segmentation is the task of automatically parsing images into semantic regions. It is a coherent task combining image segmentation and region-level label assignment. That is, the two issues are inseparable and mutually promoting. Intuitively, exact segmentations provide representative features for pixel labeling. In turn, precise labeling results will boost image segmentation, since the pixels with the same label can be deemed a whole object. From this view, semantic segmentation is a higher-level form of image understanding than either task alone. Accordingly, the problem is challenging but valuable for supporting fine-grained image analysis, retrieval and other possible applications.

Recently, image semantic segmentation has become a popular research topic, and many efforts have been devoted to the problem [3, 26]. Most works focus on the fully or partially supervised setting, in which all or some pixels are manually labeled for model training [18, 11, 6, 22]. However, producing pixel-level labels is time-consuming and may be inaccurate. Fortunately, many image sharing websites provide plentiful user-contributed images with social tags, in which the raw correspondences between images and labels are available. Thus, weakly-supervised methods [25, 26, 27], which require only image-level labels, have emerged and attracted increasing attention.

In this paper, we propose a coherent framework under the weakly-supervised setting to perform holistic image understanding, i.e., obtaining meaningful image regions and simultaneously assigning image-level labels to those regions. The problem is formulated as a Weakly-Supervised Dual Clustering (WSDC) task: cluster superpixels and assign a suitable label to each cluster. The first evidence for our method is that similar superpixels have a high probability of sharing the same label. To mine this important contextual relationship, a spectral clustering term is defined over the superpixels of all images to group the visually similar ones together. The second evidence is that there is rich discriminative information among different object classes, e.g., not all features are important and discriminative for a certain class. We define a discriminative clustering term and require its outputs to be consistent with the outputs of spectral clustering. Besides, we explicitly impose weakly-supervised constraints during the dual clustering process, which assign labels to clusters. Incorporating these three terms, the problem is formulated as a non-convex and non-smooth objective function, which is optimized via an iterative CCCP algorithm [1]. Finally, extensive experiments on the public MSRC and LabelMe datasets demonstrate the encouraging performance of our algorithm. Figure 1 illustrates the flowchart of the proposed method.

Our main contributions are summarized as follows.

- We propose a coherent framework to jointly solve image segmentation and region-level annotation under

1063-6919/13 $26.00 © 2013 IEEE
DOI 10.1109/CVPR.2013.270
Figure 1. The flowchart of our method.

the weakly-supervised setting. Furthermore, the output of the model can also be used to semantically segment any test images, with or without labels.
- The proposed method incorporates spectral clustering and discriminative clustering to group the superpixels from all images into different clusters, and imposes image-level labels as a kind of weak supervision to assign labels to clusters.
- An efficient iterative CCCP solution is designed to solve the non-convex and non-smooth objective function.

2. Related Work

In this section, we review works related to ours in several aspects.

Image semantic segmentation. From the methodology view, the methods can be roughly divided into three categories: fully-supervised, semi-supervised and weakly-supervised. In the fully-supervised setting, CRF (Conditional Random Field) models [21, 19] are typically used and have many effective extensions [5]. Their basic formulation is defined over image pixels, and various potential functions are proposed to depict the relations among multiple units. However, CRF-style models often have complicated structures and many parameters, which are hard to optimize and to perform inference over. To alleviate the dependence on finely labeled training data, Socher et al. [22] proposed a semi-supervised model to find a mapping between visual and textual words by projecting them into a latent meaning space, in which some fine-grained labeled images are still needed. Li et al. [6] proposed a partially-supervised hierarchical generative model to jointly classify, annotate, and segment various scene images, while the model estimation required a handful of clean images in which some object regions are marked with their corresponding tags. From this view, the above fully-supervised or semi-supervised solutions are limited by the high cost of acquiring fine-grained image labels. Weakly-supervised semantic segmentation [26, 27] arose to solve this problem. Vezhnevets et al. [26] proposed a graphical MIM model and introduced an objectness measure to distinguish objects from background classes. The work [27] is an extension of [26]: the authors built a multi-image model and adopted a parametric family of CRF models, and proposed a model selection criterion to evaluate the quality of each model in the family.

Label to region. Label-to-region methods reassign the labels annotated at the image level to segmented image regions rather than the whole image [13, 12, 23]. Liu et al. proposed a bi-layer sparse coding formulation for reconstructing an image region using over-segmented image patches, and further improved the work by searching on the web [15]. Yang et al. [28] proposed spatial group sparse coding by integrating the spatial correlations among training regions. However, all these works adopt a sequential pipeline that first over-segments images and then designs suitable models to describe the intra- and inter-correlations among labels and segmented regions, so the performance of region-level tagging is inevitably degraded by imperfect segmentation algorithms.

Image cosegmentation. Co-segmentation [7, 8, 16] simultaneously segments a common salient foreground object from a set of images, and can be seen as a special case of our work. Kim et al. [8] maximized the overall temperature of images associated with a heat diffusion process and the positions of sources corresponding to different classes. Joulin et al. [7] proposed a novel energy-minimization approach to cosegmentation that can handle multiple classes and images. Most existing works are only applied to a subgroup of images with the same foreground and are not intended to handle irregularly appearing multiple foregrounds. Besides, they did not explore any supervision, such as easily available image-level labels, in their learning process.

3. Weakly-Supervised Dual Clustering

To uncover the correspondence between image superpixels and semantic labels, in this work we develop a weakly-supervised dual clustering model by simultaneously maximizing the appearance consistency of superpixels within the same class and the separability of multiple classes. The former leads to a bottom-up unsupervised clustering, while the latter leads to a top-down discriminative clustering problem.

3.1. Notations

Assume we have a data collection with $I$ images $\mathcal{X} = \{X_1, \dots, X_i, \dots, X_I\}$. Let $X = [X_1, \dots, X_i, \dots, X_I]$ denote the data matrix with $X_i = [x_i^1, \dots, x_i^{n_i}]$, where $x_i^k \in \mathbb{R}^d$ is the feature descriptor of the $k$-th superpixel in the $i$-th image and $n_i$ is the number of superpixels in the $i$-th image. For brevity, we denote $X = [x_1, \dots, x_i, \dots, x_N]$ without confusion, where $N = \sum_{i=1}^{I} n_i$. Suppose these $I$ images are sampled from $C$ classes and the label information is defined as $G = [g_1, \dots, g_i, \dots, g_I] \in \{0,1\}^{C \times I}$, where $g_i \in \{0,1\}^C$ is the label vector of $X_i$: $g_i^c = 1$ if $X_i$ belongs to the $c$-th class and $0$ otherwise. The predicted superpixel-level label matrix $Y \in \mathbb{R}^{N \times C}$ is defined as

$$y_n^c = \begin{cases} 1, & \text{if the } n\text{-th superpixel belongs to the } c\text{-th class}, \\ 0, & \text{otherwise}. \end{cases} \quad (1)$$

3.2. Spectral Clustering

On the one hand, visually similar superpixels have a high probability of sharing the same label. On the other hand, spectral techniques have been demonstrated to be effective at detecting cluster structure [20] and can integrate the consistency relationships of superpixels across different images. In light of this, we employ spectral techniques to mine the aforementioned contextual information.

The interactions among superpixels are represented by an affinity matrix $S \in \mathbb{R}^{N \times N}$ defined as

$$S_{ij} = \begin{cases} \exp\!\big(-\frac{\|x_i - x_j\|^2}{\sigma^2}\big), & x_i \in N_k(x_j) \text{ or } x_j \in N_k(x_i), \\ 0, & \text{otherwise}. \end{cases}$$

Here $N_k(x)$ is the set of $k$-nearest superpixels of $x$. The $k$-nearest superpixels are selected only from the superpixels of the same image or of images sharing common labels, because the label of a superpixel is identified from the labels of the image it belongs to. $\sigma$ is a free parameter controlling the decay rate. In addition, to encourage spatially smooth labelings, spatially neighboring superpixels within the same image are also connected. Then the spectral clustering term is defined as minimizing

$$\mathcal{J}(Y) = \frac{1}{2}\sum_{i,j=1}^{N} S_{ij}\Big\|\frac{y_i}{\sqrt{A_{ii}}} - \frac{y_j}{\sqrt{A_{jj}}}\Big\|_2^2 = \mathrm{Tr}[Y^T L Y], \quad (2)$$

where $A$ is a diagonal matrix with $A_{ii} = \sum_{j=1}^{N} S_{ij}$ and $L = A^{-1/2}(A - S)A^{-1/2}$ is the normalized Laplacian matrix.

3.3. Discriminative Clustering

Since not all features are important and discriminative for a certain class, a discriminative clustering strategy with $l_{2,1}$-norm regularization is introduced. Its outputs are required to be consistent with the outputs of spectral clustering, and it is required to adaptively choose the discriminative features. To this end, we assume there is a linear transformation $W \in \mathbb{R}^{d \times C}$ between features and the predicted labels. Therefore, the objective function for discriminative clustering is formulated as

$$\min \ \mathcal{L}(Y, W) = \alpha\sum_{i=1}^{N} \mathrm{loss}(y_i, W^T x_i) + \beta\|W\|_{2,1}, \quad (3)$$

where $\mathrm{loss}$ is a loss function to be defined, and $\alpha$ and $\beta$ are two nonnegative parameters. The $l_{2,1}$-norm is defined as $\|W\|_{2,1} = \sum_{i=1}^{d}\sqrt{\sum_{j=1}^{C} W_{ij}^2}$. The $l_{2,1}$-norm regularization term is imposed to make $W$ sparse in rows. In that way, the proposed method is able to handle correlated and noisy features and to evaluate the correlation between labels and features.

For simplicity, in this work we adopt the least-squares loss function and then have

$$\mathcal{L}(Y, W) = \alpha\|X^T W - Y\|_F^2 + \beta\|W\|_{2,1}. \quad (4)$$

By learning the linear transformation, i.e., a mapping from visual features to labels, the discriminative feature representation for each class can be obtained.

3.4. Weakly-Supervised Constraint

Given an image and its associated labels, it is reasonable and natural to restrict the mapping between superpixels and labels to meet the following constraints.
- One superpixel corresponds to at most one label.
- One label has at least one superpixel mapped to it. This guarantees that if a label is assigned to an image, there is at least one superpixel supporting this label.
- Superpixels should correspond to the labels of the images they belong to. This makes sure that no image superpixel supports an invalid label.

To satisfy the first constraint, we impose an orthogonality constraint on $Y$ as in [10], i.e., $Y^T Y = I_C \in \mathbb{R}^{C \times C}$, where $I_C$ is an identity matrix. Since $Y$ is the cluster indicator, it is reasonable to constrain $Y \geq 0$. When both the nonnegativity and orthogonality constraints are satisfied, only one element in each row of $Y$ is greater than zero and all the others are zeros. Hence the learned $Y$ is more accurate and better able to provide discriminative information.

To satisfy the last two conditions, we explicitly impose a weak-supervision constraint weighted by a hyper-parameter $\gamma$:

$$\mathcal{Q}(Y) = \sum_{i=1}^{I}\sum_{c=1}^{C}\Big|\max_{x_{ij} \in X_i} y_{ij}^c - g_i^c\Big|, \quad (5)$$

where $y_{ij}^c$ is the entry of $Y$ corresponding to the $j$-th superpixel of the $i$-th image on label $c$. Since Eq. 5 is difficult to deal with directly and $y_{ij}^c \in [0,1]$, we have

$$\Big|\max_{x_{ij} \in X_i} y_{ij}^c - g_i^c\Big| = \begin{cases} 1 - \max_{x_{ij} \in X_i} y_{ij}^c, & \text{if } g_i^c = 1, \\ \max_{x_{ij} \in X_i} y_{ij}^c, & \text{else}. \end{cases} \quad (6)$$

Then the right side of Eq. 5 is rewritten as

$$\sum_{i}^{I}\sum_{c}^{C}\Big[(1 - g_i^c)\max_{x_{ij} \in X_i} y_{ij}^c + g_i^c\big(1 - \max_{x_{ij} \in X_i} y_{ij}^c\big)\Big]. \quad (7)$$

Similar to [14], the first term is further relaxed to $\sum_i\sum_c (1 - g_i^c)\sum_{x_{ij} \in X_i} y_{ij}^c$. Then $\mathcal{Q}(Y)$ is rewritten as

$$\mathcal{Q}(Y) = \sum_{i=1}^{I}\sum_{c=1}^{C}\Big[(1 - g_i^c)h_c^T Y^T q_i + g_i^c\big(1 - \max_{x_{ij} \in X_i} p_{ij}^T Y h_c\big)\Big], \quad (8)$$

where $h_c \in \mathbb{R}^C$ is an indicator vector whose elements are all zeros except for the $c$-th one, $q_i \in \mathbb{R}^N$ is a vector whose elements are all zeros except for those corresponding to the $i$-th image, and $p_{ij} \in \mathbb{R}^N$ is an indicator vector whose element corresponding to the $j$-th superpixel in the $i$-th image is one and whose other elements are zeros.

3.5. The Proposed Formulation

Jointly considering the above three aspects, we obtain a unified objective function $\mathcal{J}(Y) + \mathcal{L}(Y, W) + \gamma\mathcal{Q}(Y)$:

$$\begin{aligned} \min_{Y,W}\ & \mathrm{Tr}[Y^T L Y] + \alpha\|X^T W - Y\|_F^2 + \beta\|W\|_{2,1} \\ & + \gamma\sum_{i=1}^{I}\sum_{c=1}^{C}\Big[(1 - g_i^c)h_c^T Y^T q_i + g_i^c\big(1 - \max_{x_{ij} \in X_i} p_{ij}^T Y h_c\big)\Big] \\ \text{s.t. }\ & Y^T Y = I_C, \quad Y \geq 0. \end{aligned} \quad (9)$$

Note that the $l_{2,1}$-norm regularization is non-smooth and the max term is non-convex, so the objective function is not convex over $Y$ and $W$ simultaneously. In Section 4, we focus on how to solve this optimization problem.

4. Optimization Algorithm

4.1. CCCP Algorithm

The CCCP algorithm solves the optimization problem using an iterative process. At each round $t$, given an initial value, CCCP substitutes the concave part of the objective function with its first-order Taylor expansion. A sub-optimal solution is achieved by iteratively optimizing the resulting subproblem until convergence.

Since the last term in Eq. 9 is a sum, we consider only the term related to $g_i^c$. Let $l = [y_{i1}^c, \dots, y_{ij}^c, \dots, y_{in_i}^c]^T$. We pick the subgradient $\omega \in \mathbb{R}^{n_i}$ of $\max(l)$, whose $j$-th element is given by

$$\omega_j = \begin{cases} \frac{1}{n^*}, & \text{if } l_j^{(t)} = \max(l^{(t)}), \\ 0, & \text{otherwise}, \end{cases} \quad (10)$$

where $n^*$ is the number of superpixels attaining the largest label value $\max l^{(t)}$. At the $(t+1)$-th iteration, we estimate the current $l$ based on $l^{(t)}$ and the corresponding $\omega^{(t)}$. Since $\omega^T l^{(t)} = \sum_j \omega_j l_j^{(t)} = \max l^{(t)}$, the first-order Taylor expansion of $\max(l)$ around $l^{(t)}$ is $(\max l)\big|_{l^{(t)}} \approx \max l^{(t)} + \omega^T(l - l^{(t)}) = \max l^{(t)} + \omega^T l - \max l^{(t)} = \omega^T l$. The concave term can therefore also be written as

$$\sum_{i=1}^{I}\sum_{c=1}^{C} g_i^c\big(1 - h_c^T B U_i Y h_c\big), \quad (11)$$

where $B = [B_1, \dots, B_i, \dots, B_I]$, each $B_i = [b_{i1}^T; \dots; b_{ic}^T; \dots; b_{iC}^T] \in \mathbb{R}^{C \times n_i}$ is the matrix corresponding to image $i$ with rows $b_{ic} = \omega^T$, and $U_i \in \mathbb{R}^{N \times N}$ is a block-diagonal matrix $U_i = \mathrm{diag}(u_1, \dots, u_I)$ with $u_k = 0_{n_k \times n_k}$ for $k = 1, \dots, i-1, i+1, \dots, I$ and $u_i = I_{n_i \times n_i}$.

4.2. Iterative Optimization

Now we adopt an iterative optimization process. First, we relax the orthogonality constraint, and the optimization problem (9) becomes

$$\begin{aligned} \min_{Y,W}\ \mathcal{L}'(Y, W) =\ & \mathrm{Tr}(Y^T L Y) + \alpha\|X^T W - Y\|_F^2 + \beta\|W\|_{2,1} \\ & + \gamma\sum_{i=1}^{I}\sum_{c=1}^{C}\big[(1 - g_i^c)h_c^T Y^T q_i + g_i^c(1 - h_c^T B U_i Y h_c)\big] \\ & + \frac{\mu}{2}\|Y^T Y - I_C\|_F^2 \quad \text{s.t. } Y \geq 0, \end{aligned} \quad (12)$$

where $\mu \geq 0$ is a parameter controlling the orthogonality constraint; in our experiments it is set large enough to ensure the constraint is satisfied. We have

$$\frac{\partial \mathcal{L}'(Y, W)}{\partial W} = 2\big(\alpha X(X^T W - Y) + \beta D W\big) = 0 \ \Rightarrow\ W = \alpha(\alpha X X^T + \beta D)^{-1} X Y. \quad (13)$$

Here $D$ is a diagonal matrix with $D_{ii} = \frac{1}{2\|w^i\|_2}$, where $w^i$ is the $i$-th row of $W$. Substituting $W$ by Eq. 13, Eq. 12 can be rewritten as

$$\min_{Y}\ \mathcal{L}' = \mathrm{Tr}[Y^T M Y] + \gamma\sum_{i}^{I}\sum_{c}^{C}\big[(1 - g_i^c)h_c^T Y^T q_i + g_i^c(1 - h_c^T B U_i Y h_c)\big] + \frac{\mu}{2}\|Y^T Y - I_C\|_F^2 \quad \text{s.t. } Y \geq 0, \quad (14)$$

where $M = L + \alpha\big(I_N - \alpha X^T(\alpha X X^T + \beta D)^{-1} X\big)$ and $I_N \in \mathbb{R}^{N \times N}$ is an identity matrix. To optimize the above problem, we introduce multiplicative updating rules. Letting $\psi_{ij}$ be the Lagrange multiplier for the constraint $Y_{ij} \geq 0$

and $\Psi = [\psi_{ij}]$, the Lagrangian function is $\mathcal{L}' + \mathrm{Tr}(\Psi Y^T)$. Setting its derivative with respect to $Y$ to $0$, we obtain

$$2MY + \gamma P + 2\mu Y Y^T Y - 2\mu Y + \Psi = 0, \quad (15)$$

where

$$P = \sum_{i=1}^{I}\sum_{c=1}^{C}\big[(1 - g_i^c)q_i h_c^T - g_i^c U_i^T B^T h_c h_c^T\big]. \quad (16)$$

Using the Karush-Kuhn-Tucker (KKT) condition [9] $\psi_{ij} Y_{ij} = 0$, we obtain the updating rule

$$Y_{ij} \leftarrow Y_{ij}\,\frac{(2\mu Y)_{ij}}{(2MY + \gamma P + 2\mu Y Y^T Y)_{ij}}. \quad (17)$$

Then we normalize $Y$ such that $(Y^T Y)_{ii} = 1$, $i = 1, \dots, C$. The optimization algorithm is summarized in Algorithm 1.

Algorithm 1 Weakly-Supervised Dual Clustering.
Input:
  Data matrix $X \in \mathbb{R}^{d \times N}$;
  Label matrix $G \in \mathbb{R}^{C \times I}$;
  Parameters $\alpha$, $\beta$, $\gamma$, $\mu$.
1: Construct the $k$-nearest neighbor graph and calculate $L$;
2: Set the iteration step $t = 1$; initialize $Y \in \mathbb{R}^{N \times C}$; set $D^t \in \mathbb{R}^{d \times d}$ as an identity matrix.
3: repeat
4:   $W^t = \alpha(\alpha X X^T + \beta D^t)^{-1} X Y^t$;
5:   $M^t = L + \alpha(I_N - \alpha X^T(\alpha X X^T + \beta D^t)^{-1} X)$;
6:   Calculate $B^t$;
7:   Calculate $P^t$ according to Eq. 16;
8:   $Y_{ij}^{t+1} \leftarrow Y_{ij}^t\,\frac{(2\mu Y^t)_{ij}}{(2M^t Y^t + \gamma P^t + 2\mu Y^t (Y^t)^T Y^t)_{ij}}$;
9:   Update the diagonal matrix $D^{t+1}$ as $D_{ii}^{t+1} = \frac{1}{2\|(w^t)^i\|_2}$;
10:  $t = t + 1$;
11: until the convergence criterion is satisfied
Output:
  Label matrix $Y$;
  Multi-class classifier $W$.

5. Experiments

In this section, we conduct extensive experiments to validate the performance of the proposed method and discuss the experimental analysis.

5.1. Datasets

To verify the effectiveness of our method, we conduct experiments on two public and challenging datasets, MSRC [21] and LabelMe [11].

MSRC: a widely used dataset for the semantic segmentation task. It contains 591 images from 21 different classes, with 3 labels per image on average. The dataset is split into 276 training images and 256 test images.

LabelMe [11]: a more challenging dataset than MSRC. It contains 2688 images from 33 classes, with 2488 training images and 200 test images.

Both datasets are provided with pixel-level groundtruth. We adopt the SLIC algorithm [2] to obtain the superpixels for each image, and describe each superpixel by the typical bag-of-words representation, using SIFT [17] as the local descriptor. To present fair comparisons with other methods, we use the training images to learn our model and the test images to evaluate performance.

We evaluate semantic segmentation from two views: labeling performance and segmentation performance. Labeling performance is usually evaluated via two quantitative measures: total accuracy (T Acc), which measures the percentage of correctly classified pixels, and average per-class accuracy (Aver Acc), which measures the percentage of correctly classified pixels for each class, averaged over all classes. Because the various baselines on the two datasets adopt different evaluation standards, we report different measures to accord with the corresponding baselines. As the segmentation evaluation metric, we adopt the intersection-over-union score (IOU score) [7], a standard measure in the PASCAL challenges. It is defined as $\frac{1}{|I|}\sum_{i \in I}\max_k\frac{GT_i \cap R_i^k}{GT_i \cup R_i^k}$, where $GT_i$ is the groundtruth and $R_i^k$ the region associated with the $k$-th class in image $i$.

5.2. Parameter Analysis

Five parameters need to be set in WSDC: $k$ in the $k$-nearest graph construction, $\alpha$ and $\beta$ in Eq. 4, $\gamma$ in Eq. 5, and $\mu$ in Eq. 12. We set $k = 50$ to construct the $k$-nearest graph. In the experiments we find that $\alpha$ is insensitive, so we fix $\alpha = 1000$ empirically. $\mu$ is set to $10^8$, which is large enough to guarantee that the orthogonality constraint is satisfied. Specifically, we focus on the effects of $\beta$ and $\gamma$, because these two parameters are crucial to our results. The ranges of $\beta$ and $\gamma$ are $\{10, 10^2, 10^3, 10^4, 10^5\}$ and $\{10^2, 10^3, 10^4, 10^5, 10^6, 10^7\}$, respectively. The semantic segmentation performance is used to tune the parameters, and the results on both datasets are shown in Fig. 2. We can observe the following. First, as $\beta$ and $\gamma$ increase from small to large, the performance varies noticeably, which shows that the $l_{2,1}$-norm term and the weak-supervision constraint have great impact on the performance. Second, the accuracies reach their peaks at $\beta = 10^3$, $\gamma = 10^4$ and at $\beta = 10^4$, $\gamma = 10^6$ on the two datasets respectively, which all lie in the middle of the ranges; the accuracies do not increase monotonically with $\beta$ and $\gamma$, because an extremely large $\beta$ makes the row sparsity overwhelming, an extremely small $\beta$ fails to select the discriminative features, and an extremely large $\gamma$ neglects the effects of the other terms, which is also inadvisable. In the following experiments, we adopt the best

parameter settings on both datasets.

Figure 2. Parameter tuning results of parameters $\beta$ and $\gamma$ for MSRC and LabelMe.

5.3. Experiments on MSRC dataset

We compare the proposed algorithm with LAS [15], MTL-RF [25], MIM [26] and RLSIM [4] to evaluate semantic segmentation performance. We summarize these methods from three sides in Table 1: Supervision, ILP (Image Label Prior) and MOF (Multiple of Features). Full supervision means each pixel is labeled with a tag, and weak supervision means only image-level labels are available. "With ILP" means that during the prediction period the image's labels are available and we only predict the labels of superpixels from the image's labels; "Without ILP" indicates the labels of images are completely unknown. MOF = yes stands for a method using multiple features.

Table 2 shows the overall semantic segmentation performance. Two facts can be observed. First, our method achieves the best result, which validates its effectiveness: even using a single feature and without ILP, our method is comparable to or better than the other methods. Second, unlike RLSIM 1, which gets much higher accuracy with ILP than RLSIM 2 without ILP, the results of WSDC 1 and WSDC 2 are very close and both achieve high accuracies, which shows that during the prediction period the image-level labels have a negligible effect on our algorithm's performance. In addition, requiring image-level priors to boost performance is a weakness of many semantic segmentation methods. Figure 3 illustrates the per-class accuracies on MSRC. Our method gets the best results on 10 out of 21 classes and works especially well on some very hard classes such as bird, cat and dog.

For segmentation performance, we compare our method with [7, 8, 16]. All three methods divide the images into subgroups, where images with a same label are deemed a subgroup, and they process the images of one subgroup at a time. In our experiments, we report the segmentation performance under two settings: WSDC 3, which segments the images of one subgroup at a time, and WSDC 2, whose segmentation performance we report directly. Segmenting the images of the whole dataset with multiple foregrounds and backgrounds at one time is a big challenge for most segmentation methods.

Table 3. Segmentation performances of our method compared with other baselines on the MSRC dataset.
class    WSDC 2  WSDC 3  [7]   [8]   [16]
bike     27.6    39.9    43.3  29.9  42.8
bird     48.2    48.3    47.7  29.9  -
car      48.0    52.3    59.7  37.1  52.5
cat      56.0    52.3    31.9  24.4  5.6
chair    72.1    54.3    39.6  28.7  39.4
cow      30.5    43.2    52.7  33.5  26.1
dog      42.8    50.8    41.8  33.0  -
face     25.3    45.8    70.0  33.2  40.8
flower   71.0    84.9    51.9  40.2  -
house    28.2    48.6    51.0  32.3  66.4
plane    15.6    35.9    21.6  25.1  33.4
sheep    56.3    66.3    66.3  60.8  45.7
sign     51.2    59.5    58.9  43.2  -
tree     71.3    58.1    67.0  61.2  55.9
average  46.1    52.9    50.2  36.6  40.9

Table 3 shows the segmentation performance on the MSRC dataset. First, under the same setting, WSDC 3 achieves the highest average IOU score compared with [7, 8, 16], which proves that the weak-supervision information can promote segmentation performance. Second, WSDC 2 obtains results comparable with the other methods. It is worth noting that segmenting a subgroup of images which share the same foreground is itself a strong form of supervision; even though WSDC 2 segments the whole image set with multiple foregrounds and backgrounds, our method still outperforms [8, 16]. The results of WSDC 2 are certainly affected by imbalanced labels and irregularly appearing foregrounds and backgrounds. Third, our method obtains the best results on 6 out of 14 classes, especially on the cat and dog categories, which are easily confused object classes. This reflects that the guidance of weak supervision can boost segmentation performance and is especially helpful for disambiguating easily confused categories, which is also a secondary target of our method.

5.4. Experiments on LabelMe dataset

Fully-supervised methods [21, 24, 11] and weakly-supervised methods [27, 26] are used for comparison, and their condition settings are displayed in Table 4.

Semantic segmentation comparisons on LabelMe are presented in Table 5. Our method outperforms the weakly-supervised methods substantially and is comparable with the fully-supervised approaches. The segmentation average IOU score is 20.1%. To the best of our knowledge, no prior work has reported segmentation performance on the LabelMe dataset.
Table 1. The experimental settings of our method and baselines on MSRC dataset.
method MTL-RF [25] LAS [15] MIM [26] RLSIM 1 [4] RLSIM 2 [4] WSDC 1 WSDC 2
Supervision Weak Weak Weak Weak Weak Weak Weak
ILP Without Without With With Without Without With
MOF No No Yes No No No No

Table 2. Total accuracy (T Acc) of our method comparing with baselines on MSRC dataset.
method MTL-RF [25] LAS [15] MIM [26] RLSIM 1 [4] RLSIM 2 [4] WSDC 1 WSDC 2
T Acc 51 63 67 69 47 69 71
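As an illustration of how Algorithm 1 iterates in practice, the following is a minimal NumPy sketch of the $W$-step (Eq. 13) and the multiplicative $Y$-update (Eq. 17) on random toy data. The sizes, the dense toy affinity matrix, the parameter values, and the omission of the weak-supervision term ($P$ set to zero) are all simplifying assumptions, not the experimental configuration of the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
d, N, C = 20, 60, 4                      # toy sizes: feature dim, superpixels, classes
alpha, beta, mu = 1.0, 1.0, 100.0        # toy parameters (not the tuned values)

X = rng.standard_normal((d, N))          # column-wise superpixel features

# dense toy affinity and normalized Laplacian L = A^{-1/2}(A - S)A^{-1/2} (Sec. 3.2)
dsq = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)
S = np.exp(-dsq / (2.0 * d))             # sigma^2 = 2d chosen arbitrarily for the toy
np.fill_diagonal(S, 0.0)
a = S.sum(axis=1)
L = np.diag(a ** -0.5) @ (np.diag(a) - S) @ np.diag(a ** -0.5)

Y = np.abs(rng.standard_normal((N, C))) + 1e-3   # nonnegative initialization
D = np.eye(d)                                    # step 2 of Algorithm 1
P = np.zeros((N, C))                             # weak-supervision gradient, omitted here

for t in range(30):                              # real runs use a convergence criterion
    K = np.linalg.inv(alpha * X @ X.T + beta * D)
    W = alpha * K @ X @ Y                                        # Eq. (13)
    M = L + alpha * (np.eye(N) - alpha * X.T @ K @ X)
    den = 2.0 * M @ Y + P + 2.0 * mu * Y @ (Y.T @ Y)
    Y = Y * (2.0 * mu * Y) / np.maximum(den, 1e-12)              # Eq. (17)
    D = np.diag(1.0 / (2.0 * np.linalg.norm(W, axis=1) + 1e-12)) # step 9

Y = Y / np.sqrt((Y ** 2).sum(axis=0, keepdims=True))  # normalize so (Y^T Y)_ii = 1
labels = Y.argmax(axis=1)                             # one label per superpixel
```

Because the numerator of the update is nonnegative and the denominator is clamped to be positive, the sketch preserves the nonnegativity of $Y$ across iterations, mirroring the KKT-based rule in Sec. 4.2.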

Table 6. Total accuracy (T Acc) of our method under different data settings on the MSRC dataset.
order  Learning  Predicting  ILP  T Acc
1      1 + 2     1 + 2       yes  68.5
2      1 + 2     2           yes  70.7
3      1         2           yes  71.0
4      1         2           no   69.0
5      2         2           yes  71.4

Table 7. Average per-class accuracy (Aver Acc) of our method under different data settings on the LabelMe dataset.
order  Learning  Predicting  ILP  Aver Acc
1      1 + 2     1 + 2       yes  25.0
2      1 + 2     2           yes  26.3
3      1         2           yes  26.0
4      1         2           no   25.0
5      2         2           yes  23.2

5.5. Out-of-Sample and Label Prior Discussion

To further investigate the ability of our method to handle the out-of-sample problem, we use different data settings during the learning and prediction periods. We denote the standard training set and test set by 1 and 2, respectively. The results of our method under different data settings on both datasets are reported in Table 6 and Table 7. Several facts can be observed. First, the highest and lowest accuracies on both datasets under the different settings differ little, which shows our method is relatively stable and robust. Second, comparing settings 2 and 3, the obtained accuracies are comparable whether or not the test set 2 is used in the model learning process; perhaps due to the simplicity of MSRC, the out-of-sample setting (setting 3) on that dataset even achieves better performance than the in-sample setting (setting 2). Third, the results with ILP are only a little higher than those without ILP, which demonstrates that our method can semantically parse images effectively even when no labels are provided. Finally, the proposed algorithm achieves the best performance with setting 5 on MSRC and setting 2 on LabelMe; the reason may be that set 2 of MSRC has more images and fewer class labels than set 2 of LabelMe.

6. Conclusion

In this paper, we propose a Weakly-Supervised Dual Clustering (WSDC) method to automatically segment images into localized semantic regions. We combine spectral clustering and discriminative clustering into a unified framework to integrate the contextual relationships between superpixels and the discriminative features of multiple classes. To fully exploit discriminative features, we impose a nonnegativity constraint on the label matrix $Y$ and $l_{2,1}$-norm regularization on the linear transformation. Image-level labels are imposed as weakly-supervised constraints to assign each cluster a semantic label. Extensive experiments on public challenging datasets have shown the effectiveness of our method.

Acknowledgements. This work was supported by the 973 Program (2010CB327905) and the National Natural Science Foundation of China (61272329, 61070104, 61202325).

References
[1] A. L. Yuille and A. Rangarajan. The concave-convex procedure. Neural Computation, 15(4):915-936, 2003.
[2] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Süsstrunk. SLIC superpixels compared to state-of-the-art superpixel methods. IEEE TPAMI, 34(11):2274-2282, 2012.
[3] P. Arbelaez, B. Hariharan, C. Gu, S. Gupta, L. Bourdev, and J. Malik. Semantic segmentation using regions and parts. In CVPR, 2012.
[4] F. Briggs, X. Z. Fern, and R. Raich. Rank-loss support instance machines for MIML instance annotation. In KDD, 2012.
[5] B. Fulkerson, A. Vedaldi, and S. Soatto. Class segmentation and object localization with superpixel neighborhoods. In ICCV, 2009.
[6] L.-J. Li, R. Socher, and L. Fei-Fei. Towards total scene understanding: Classification, annotation and segmentation in an automatic framework. In CVPR, 2009.
[7] A. Joulin, F. Bach, and J. Ponce. Multi-class cosegmentation. In CVPR, 2012.
[8] G. Kim, E. P. Xing, L. Fei-Fei, and T. Kanade. Distributed cosegmentation via submodular optimization on anisotropic diffusion. In ICCV, 2011.

Figure 3. Detailed performance of our method on MSRC dataset.
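For reference, the two labeling measures behind Figure 3 and Tables 5-7 (T Acc and Aver Acc, as defined in Sec. 5.1) can be sketched as follows; the flattened integer label arrays below are hypothetical toy inputs:

```python
import numpy as np

def labeling_accuracies(gt, pred, num_classes):
    """T Acc: fraction of correctly classified pixels overall.
    Aver Acc: per-class accuracy averaged over the classes that
    actually appear in the groundtruth (Sec. 5.1)."""
    gt = np.asarray(gt).ravel()
    pred = np.asarray(pred).ravel()
    t_acc = float((gt == pred).mean())
    per_class = [float((pred[gt == c] == c).mean())
                 for c in range(num_classes) if (gt == c).any()]
    return t_acc, float(np.mean(per_class))

# toy example: 4 pixels, classes {0, 1}
t_acc, aver_acc = labeling_accuracies([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
print(t_acc, aver_acc)  # 0.75 0.75
```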

Table 4. The experimental settings of our method and baselines on LabelMe dataset.
method Texboost [21] LT [11] Supix [24] MIM [26] GMIM [27] WSDC 1 WSDC 2
Supervision Full Full Full Weak Weak Weak Weak
ILP Without Without Without With With Without With
MOF Yes No Yes Yes Yes No No

Table 5. Average per-class accuracy (Aver Acc) of our method comparing with other baselines on LabelMe dataset.
method Texboost [21] LT [11] Supix [24] MIM [26] GMIM [27] WSDC 1 WSDC 2
Aver Acc 13 24 29 14 21 25 26

[9] H. W. Kuhn and A. W. Tucker. Nonlinear programming. In Berkeley Symposium on Mathematical Statistics and Probability, 1951.
[10] Z. Li, Y. Yang, J. Liu, X. Zhou, and H. Lu. Unsupervised feature selection using nonnegative spectral analysis. In AAAI, 2012.
[11] C. Liu, J. Yuen, and A. Torralba. Nonparametric scene parsing: label transfer via dense scene alignment. In CVPR, 2009.
[12] J. Liu, M. Li, Q. Liu, H. Lu, and S. Ma. Image annotation via graph learning. Pattern Recognition, 42(2):218-228, 2009.
[13] J. Liu, B. Wang, M. Li, Z. Li, W. Ma, H. Lu, and S. Ma. Dual cross-media relevance model for image annotation. In ACM MM, 2007.
[14] S. Liu, S. Yan, T. Zhang, C. Xu, J. Liu, and H. Lu. Weakly supervised graph propagation towards collective image parsing. IEEE Transactions on Multimedia, 14(2):361-373, 2012.
[15] X. Liu, S. Yan, J. Luo, J. Tang, Z. Huang, and H. Jin. Nonparametric label-to-region by search. In CVPR, 2010.
[16] L. Mukherjee, V. Singh, and J. Peng. Scale invariant cosegmentation for image groups. In CVPR, 2011.
[17] D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91-110, 2004.
[18] A. Rabinovich, A. Vedaldi, C. Galleguillos, E. Wiewiora, and S. Belongie. Objects in context. In ICCV, 2007.
[19] C. Russell, P. H. S. Torr, and P. Kohli. Associative hierarchical CRFs for object class image segmentation. In ICCV, 2009.
[20] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE TPAMI, 22(8):888-905, 2000.
[21] J. Shotton, J. Winn, C. Rother, and A. Criminisi. TextonBoost for image understanding: Multi-class object recognition and segmentation by jointly modeling texture, layout, and context. IJCV, 81(1):2-23, 2009.
[22] R. Socher and L. Fei-Fei. Connecting modalities: Semi-supervised segmentation and annotation of images using unaligned text corpora. In CVPR, 2010.
[23] J. Tang, S. Yan, R. Hong, G.-J. Qi, and T.-S. Chua. Inferring semantic concepts from community-contributed images and noisy tags. In ACM MM, 2009.
[24] J. Tighe and S. Lazebnik. SuperParsing: Scalable nonparametric image parsing with superpixels. In ECCV, 2010.
[25] A. Vezhnevets and J. M. Buhmann. Towards weakly supervised semantic segmentation by means of multiple instance and multitask learning. In CVPR, 2010.
[26] A. Vezhnevets, V. Ferrari, and J. M. Buhmann. Weakly supervised semantic segmentation with a multi-image model. In ICCV, 2011.
[27] A. Vezhnevets, V. Ferrari, and J. M. Buhmann. Weakly supervised structured output learning for semantic segmentation. In CVPR, 2012.
[28] Y. Yang, Y. Yang, Z. Huang, H. T. Shen, and F. Nie. Tag localization with spatial correlations and joint group sparsity. In CVPR, 2011.
