You are on page 1of 17

4070 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 23, NO.

9, SEPTEMBER 2014

Joint Segmentation and Recognition of Categorized


Objects From Noisy Web Image Collection
Le Wang, Gang Hua, Senior Member, IEEE, Jianru Xue, Member, IEEE, Zhanning Gao,
and Nanning Zheng, Fellow, IEEE

Abstract The segmentation of categorized objects addresses level information across the whole image collection to simul-
the problem of joint segmentation of a single category of object taneously extract a foreground object from all images, instead
across a collection of images, where categorized objects are of segmenting the images independently through modeling
referred to objects in the same category. Most existing methods
of segmentation of categorized objects made the assumption just one single image. Such a problem is referred to as the
that all images in the given image collection contain the target segmentation of categorized objects, and has been actively
object. In other words, the given image collection is noise free. studied in recent papers [1][9].
Therefore, they may not work well when there are some noisy Nevertheless, most methods of segmentation of categorized
images, which are not in the same category, such as those image objects were built on the assumption that all images
collections gathered by a text query from modern image search
engines. To overcome this limitation, we propose a method for of the given collection contain the target object, which
automatic segmentation and recognition of categorized objects renders them unable to handle situations where the given
from noisy Web image collections. This is achieved by cotrain- collection of images contain noisy images which do not
ing an automatic object segmentation algorithm that operates contain the target category of object. Such noisy image
directly on a collection of images, and an object category collection may be gathered, e.g., by performing a text
recognition algorithm that identifies which images contain the
target object. The object segmentation algorithm is trained on query using one of the main stream image search engines
a subset of images from the given image collection, which are such as Google and BING image search. This motivated
recognized to contain the target object with high confidence, us to build an automated program to jointly cleanse and
whereas training the object category recognition model is guided extract the categorized objects from noisy Web image
by the intermediate segmentation results obtained from the object collections.
segmentation algorithm. This way, our cotraining algorithm
automatically identifies the set of true positives in the noisy Web On the other hand, most previous works on segmentation
image collection, and simultaneously extracts the target objects of categorized objects only utilized appearance [3][5], [8] or
from all the identified images. Extensive experiments validated shape [1], [7] cues as consistency constraints on foreground
the efficacy of our proposed approach on four data sets: 1) the objects across the image set. Some of these methods attempted
Weizmann horse data set; 2) the MSRC object category data to automatically learn a template for object [2] from the
set; 3) the iCoseg data set; and 4) a new 30-categories data set,
including 15 634 Web images with both hand-annotated category image collection. While some others resorted to labeled pixels
labels and ground truth segmentation labels. It is shown that our for foregound/background modeling [5], [9]. Most of these
method compares favorably with the state-of-the-art, and has the previous works neglected beneficial high level information,
ability to deal with noisy image collections. such as spatial context, across the image set. Context comes
Index Terms Segmentation of categorized objects , cosegmen- with a variety of forms, e.g., different parts of an object
tation, object recognition, auto-context model. can be context to each other, and it can be referred to
as Gestalt laws in middle level knowledge regarding intra-
I. I NTRODUCTION object configurations and inter-object relationships. Intuitively,
they should provide valuable information for object segmen-
W HEN fed with a collection of images in the same
object category, it is beneficial to leverage the high
tation and recognition [6], [10][15], especially when one
tries to jointly extract the categorized objects from a set of
Manuscript received November 20, 2013; revised May 1, 2014; accepted images.
June 27, 2014. Date of publication July 14, 2014; date of current version We extend the auto-context model originally proposed by
August 11, 2014. This work was supported in part by the China 973 Program Tu [12] to simultaneously extract the categorized objects from
under Grant 2012CB316400, and in part by the Natural Science Foundation
of China under Grant 61228303. The work of G. Hua was supported in part a set of images. Tu [12] learned the auto-context model from
by the U.S. National Science Foundation under Grant IIS 1350763, in part by a large number of images with pixel-wise labels. Previously,
the Google Research Faculty Award, and in part by the GHs Start-Up Funds Wang et al. [13] incorporated it in an energy minimization
Stevens Institute of Technology. The associate editor coordinating the review
of this manuscript and approving it for publication was Dr. Olivier Bernard. framework for automatic object of interest extraction from a
L. Wang, J. Xue, Z. Gao, and N. Zheng are with the Institute of Arti- single image, where the auto-context model was iteratively
ficial Intelligence and Robotics, Xian Jiaotong University, Xian 710049, estimated from a single image. In contrast, in our new
China (e-mail: wangleabc@gmail.com; jrxue@mail.xjtu.edu.cn; zhanning-
gao@gmail.com; nnzheng@mail.xjtu.edu.cn). approach, the auto-context model is trained on all images
G. Hua is with the Department of Computer Science, Stevens Institute of of the image collection without using any pixel-wise labels.
Technology, Hoboken, NJ 07030 USA (e-mail: ghua@stevens.edu). It is able to exploit a large amount of contextual informa-
Color versions of one or more of the figures in this paper are available
online at http://ieeexplore.ieee.org. tion from the image collection, which facilitates more robust
Digital Object Identifier 10.1109/TIP.2014.2339196 object/background segmentation.
1057-7149 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
WANG et al.: JOINT SEGMENTATION AND RECOGNITION OF CATEGORIZED OBJECTS 4071

The recently proposed OPTIMOL [16] system is capable


of automatically collecting a large number of categorized
images and learning the object category model from noisy
Web images. It can provide a category label for each image,
and a bounding box indicating the location and size of the
object for each categorized image. Our proposed work goes
one step further to directly segment a categorized object from
a noisy Web image collection.
Our objective is to achieve an automated system to jointly
extract and recognize categorized objects from noisy Web
image collections. Our proposed method concurrently learns
an object segmentation model, which can automatically extract
the target categorized object from a collection of images, and
an object category recognition model, which operates on either
the segmented object regions, or the whole image. These two
models are learned under a co-training framework, where they
Fig. 1. The flowchart of our co-training framework for automated joint
mutually enhance each other in an iterative fashion. Intuitively, segmentation and recognition of categorized objects from noisy Web image
a good object category model helps remove outlier images collection.
from the noisy Web image collection and hence facilitates
building better object segmentation model. On the other hand, Fig. 1 presents the overall computational flow of our
a good object segmentation model largely removes the clut- co-training framework.
tered background and in turn can help learn better object Our motivation is to build the enabling technologies to
category model. automatically extract and recognize categorized objects from
The input to our co-training framework is a collection of a noisy Web image collection. We highlight our key contribu-
noisy Web images (e.g., lotus flower obtained by Google tions as follows:
image search), the top ranked first dozens of images are We proposed a co-training framework to jointly learn an
first selected to be put in the categorized image collection, object segmentation model and a category recognition
which will grow along the co-training process. We found the model, which simultaneously filters out outlier images
top ranked first dozens of images from the text query rarely from the noisy Web image collection and extracts the
contain any outlier/noisy images, hence they provide a good target categorized objects from the filtered image collec-
initialization of a clean categorized image collection. Starting tion. To the best of our knowledge, such a co-training
from this initial categorized image collection, we learn an framework has not be presented in previous work before.
object segmentation model by interleaving the learning of Based on the co-training framework and combining it
an auto-context model [12] inside an iterative energy based with several useful heuristics, we proposed an automatic
segmentation algorithm using graph cut (i.e., min-cut/max- approach to address the problem of co-segmentation of
flow) [17], [18]. Similar algorithms have been proposed categorized objects from noisy Web image collections.
by Wang et al. [13], [19] for automatic object of interest To fully evaluate the proposed method, we collected
extraction from both single image and a clean collection of a thirty category noisy Web image collection through
categorized images. The learned object segmentation model modern image search engine with ground-truth category
will automatically segment the target categorized objects labels and pixel level labels of the foreground mask of
from all images in the categorized image collection. A the target object inside each of the images. We will make
bag-of-words [16], [20] based object category recognition this dataset publicly available to the community.
model is then learned from the extracted foreground masks The remainder of this paper is organized as follows.
of the target object to help identify true positives from the In Section II we discuss related work. In Section III we
rest of the noisy Web image collection. present an overview of our co-training framework at the system
Those images in the noisy Web image collection but not level. In Section IV we present the detailed formulation of the
in the categorized collection that are predicted with high proposed co-training framework, including the pre-filtering of
confidence from the learned category model to be true pos- the noisy Web images, the learning of the object segmentation
itives of the target category will be added into the cate- model with an embedded auto-context model, and the learning
gorized image collection. The object segmentation model is of the category recognition model. In Section V we present
then incrementally updated with the augmented categorized the 30 categories of labeled noisy Web image collections we
collection. This co-training process alternates between the gathered, along with the detailed experiments and discussions.
learning of these two models until they stabilize, at which Finally, we conclude in Section VI.
point no more images will be added to the categorized collec-
tion. The final output of this co-training framework includes II. R ELATED W ORK
an object segmentation model, an object category recogni- In this section, we review related work in cleansing Web
tion model, and a segmented, denoised collection of catego- images, segmentation of categorized objects, object category
rized images from the original noisy Web image collection. recognition, and joint segmentation and recognition.
4072 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 23, NO. 9, SEPTEMBER 2014

A. Cleansing Web Images Our object segmentation model is cast into an energy
There is considerable previous work on cleansing image minimization framework with an embedded auto-context
sets from the raw output of image search engines, either in model [12]. Energy minimization on Markov Random Fields
an interactive fashion [21], [22] or in an automatic way [16], provides a standard framework to extract an object from a
[23][25]. Their main goal is to gather a large number of high single image [13], [17], [26][32], and they may be further
quality images of a specified object category from the Web for extended to extract the target objects from a collection of
visual concept learning by removing irrelevant images from images one by one. However, because of the single image
image search results. The interactive methods [21], [22] are modeling, they only model various visual cues such as appear-
capable of building large collections of images with ground ance [17], [26], [27], shape [31], and context [13] confined in a
truth labels, but they depend heavily on human efforts. Most of single image. Some of these previous works also made strong
the automatic methods [16], [23], [24] leverage an object cate- assumption on where the target objects are located [27][30],
gory model trained on text and/or visual features to distinguish such as the center of the image. Important contextual informa-
images with high confidence from outliers. In [25], a scheme tion across the image collection is neglected. Therefore, in our
named ARTEMIS was proposed to enhance automatic selec- formulation, we embed an auto-context model into an energy
tion of training images from noisy user-tagged Web images minimization framework, which is automatically trained on
using an instance-weighted mixture modeling framework. all images to effectively exploit the rich spatial contextual
Compared to the above methods, our approach is fully information presented across the image collection.
automatic without any user intervention, and its benefits are Our research is also related to object cosegmentation
two-fold: 1) we first employ both text-based and visual- [33][40], where the appearance consistency of the foreground
based image filtering to remove the illustration images, which objects across the image collection is exploited to benefit
have obvious differences with the images of the target object object segmentation. The goal of cosegmentation is to simul-
category in terms of text and visual features, and 2) we taneously segment a specific object from two or more images,
then remove the remaining difficult outliers by an object it is assumed that all images contain that object. Among
category model, which is trained on the categorized image these, there are a number of recent works [35], [38][40]
collection and its segmentations. Moreover, the object category that consider interleaving cosegmentation and discriminative
model is updated and strengthened with the expansion of the learning in an unsupervised fashion, while considering diverse
categorized image collection. object instances from the same category. There are also several
cosegmentation methods [41], [42] that further conduct the
cosegmentation of multiple objects of multiple categories, in
B. Segmentation of Categorized Objects which they assumed that each image should contain at least
A number of approaches recently have focused on simul- one object among the multiple categories. In contrast, we try
taneous segmentation of categorized objects from a set of to cosegment a collection of categorized images with different
images, through either supervised learning [5], [8], [9] or object instances of an unknown category, and some of the
unsupervised learning [1][4], [6], [7]. Most of them model the outlier images may not contain the categorized object at all.
appearance cues [3][5], [8], and/or the object shape [1], [7]
or subspace structure [8] across the image set. The super- C. Category Recognition
vised methods [5], [8], [9] estimated appearance models for The goal of image category recognition is to predict whether
foreground object through labeled pixels obtained from user an image belongs to a certain category. There are a number
interactions. In unsupervised methods, the aim is to automat- of recent works on category recognition using various models,
ically segment the different instances of an object from a set such as part-based models [43], [44] and bag-of-words mod-
of images. els [16], [20], [45][47]. It is out of the scope of our paper to
Among the unsupervised methods, the style of alternat- discuss all of them.
ing between learning a categorized object model and jointly A lot of methods based on bag-of-words model have shown
extracting the target categorized objects in all images is closest impressive results on image recognition in many settings [16],
to our work [1][3], [7]. In [7], the categorized object model [47][49], and provide several advantages over traditional
was trained on appearance and shape of the target object approaches of matching local features [50]. Such models are
category. Arora et al. [2] learned a consistent template based efficient due to the structure free representation of images and
on location and appearance across all images, and the seg- objects with dense patches [45]. Hence, due to its simplicity
mentation of the images was individually estimated. Winn and and efficacy, in our case of joint segmentation and recognition
Joijc [1] used a generative probabilistic model by incorporating of categorized objects from noisy Web image collections, the
shape, edge and color cues, and made an assumption on object bag-of-words model is selected as our object category model
shape consistency. In addition to these visual cues leveraged to recognize categorized images from noisy Web images. It
in the previous methods, our proposed method also models the is trained using histogram intersection kernel SVM due to its
contextual cue to facilitate more robust segmentation. Besides, success in recognition [51], [52].
we explicitly address the issue when there are outlier images
presented in the categorized image collection, while previous D. Joint Segmentation and Recognition
works all assumed that images from the categorized image Joint segmentation and recognition of a categorized object
collection all contain the target object. from a single image or an image collection has been
WANG et al.: JOINT SEGMENTATION AND RECOGNITION OF CATEGORIZED OBJECTS 4073

TABLE I an embedded auto-context model from all these images (see


T HE W ORKFLOW OF O UR C O -T RAINING F RAMEWORK FOR AUTOMATED Section IV-C), and learn an object category model (detailed in
J OINT S EGMENTATION AND R ECOGNITION OF C ATEGORIZED Section IV-D) based on all segmented images, by progressively
O BJECTS F ROM N OISY W EB I MAGE C OLLECTION expanding the categorized image collection with images from
the candidate Web image collection.
To proceed, we pick several new images from the rest
of the candidate Web image collection, and use our learned
object category classifier to recognize them. We reject the
outlier images which are predicted to be negative with high
confidence and add them to the rejected image set; accept the
images predicted to be positive with high confidence from the
object category model and add them to the categorized image
collection. We then jointly extract the objects from all images
in the augmented image collection, and incrementally update
the object segmentation model with the embedded auto-context
model, and the object category model.
This process alternates between segmenting the collected
categorized images and recognizing new categorized images
from the rest of the candidate Web image collection, until we
cannot find any more positive images from the candidate Web
image collection. Finally, we can obtain a categorized image
collection and its segmentations, an object segmentation model
with an embedded auto-context model which can help extract
extensively studied in recent years [4], [10], [53], [54]. the categorized objects, and an object category model of the
They either require a large number of labeled examples to train target object concept which can be used to identify new images
a generative model by jointly modeling shape and texture [10] in the target category and reject those are not.
for automatic object recognition and segmentation, or resort to
integrating multiple segmentations [53] or regions of similar IV. P ROBLEM F ORMULATION
appearances [4], [54] to achieve robust object recognition and In this section, we present the details of our co-training
segmentation. framework, including the pre-filtering of noisy Web images,
In contrast, our approach works in a weakly supervised the object segmentation model, and the object category recog-
fashion without the need of using manually labeled data, nition model.
while using the labeled data produced from the execution
of our approach. Specifically, the two tasks of segmentation
A. Pre-Filtering the Noisy Web Image Collection
and recognition are cast into a unified co-training framework,
and are synergistically learned. In particular, segmentation can The noisy Web images obtained through an Internet search
provide recognition with accurate labels to train an effec- often contain some illustration images, such as pencil, draw-
tive object category model, while recognition can provide ing, sketch, tattoo, symbol images, etc., and most of them
the segmentation with accurate categorized images to ensure are characterized by a keyword in the accompanying captions
both the segmentation accuracies of the categorized objects or a distinctive intensity distribution. Thus, we leverage both
and the efficacy of the auto-context model embedded in the text-based and visual-based image filtering to remove the
segmentation model. corresponding illustration images.
For the text-based filtering, we process the captions to
remove stop words and stem the remaining words using
III. OVERVIEW OF THE F RAMEWORK Porter stemmer [55]. Then, we directly reject the images
The overall framework of our approach is illustrated in whose accompanying captions contain the following stemmed
Fig. 1 and Table I. Starting from a seed image or a few key- keywords, i.e., draw, sketch, tattoo, graph, plot,
words, we first pre-filter the noisy Web images obtained from symbol, map, chart, paint, abstract, origami and
the Internet through image search engines to build a candidate watercolor. The rejected images by text-based image fil-
Web image collection of the target visual concept, and also tering for category of lotus flower are presented in Fig. 12.
obtain a rejected image set with outlier images which are not For the visual-based filtering, we simply use the intensity
associated with the target concept (detailed in Section IV-A). histogram to reject the drawing and symbolic images, as
Due to the fact that the top ranked first several dozens of they are characterized by a distinctive intensity distribution.
images in the candidate Web image collection are almost In our approach, the histogram is computed on all intensity
always in the same category, we initialize the categorized values from 0 to 255. The image whose number of bins with
image collection with these images. We then jointly extract a value larger than 6% of the maximum bin value of the
the categorized objects from the categorized image collection histogram is less than 60 is rejected as illustration image.
(see Section IV-B), learn an object segmentation model with Fig. 2 gives the comparison of intensity histograms between
4074 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 23, NO. 9, SEPTEMBER 2014

0 weights the relative importance between the data term


and the spatial prior term.
Due to the unknown auto-context model and appearance
model, an iterative optimization process is needed to achieve
an automated program to extract categorized objects from the
image collection, and learn the appearance and auto-context
Fig. 2. Comparison of intensity histograms between a natural image and a models on the fly. Hence, we first initialize the segmentations
drawing image. The red line corresponds to 6% of the maximum bin value
of the histogram. Note that only the bin values at intensities of 0, 254 and
of the images by using a generic visual saliency model [57],
255 are above the red line for the drawing image. and they together are used to train the initial Boosting classifier
of the auto-context model. Meanwhile, we train an appearance
model and compute the spatial prior term for each image
a natural image and a drawing image. The rejected images by according to its initial segmentation. We then concurrently
visual-based image filtering for category of lotus flower are minimize the energy function in Eq(1) for each image using
presented in Fig. 13. Finally, a candidate Web image collection graph-cut (i.e., min-cut/maxflow) [17], [18].
Na
A = {Ai }i=1 including Na images, and a rejected image set All the new segmentations along with the discriminative
Nr
R = {Ri }i=1 including Nr noisy images are obtained. probability maps estimated by the auto-context model are then
utilized to update the image independent auto-context model
(detailed below). Naturally, the contextual information existed
B. Segmentation of Categorized Objects in all images of the image collection can be employed for the
Our segmentation of categorized objects from an image segmentation of each image. The new segmentation of each
collection is cast into an energy minimization framework. image serves to update the corresponding image dependent
Suppose we have collected a categorized image collection appearance model and spatial prior term. This process iterates
I = {Ik }k=1
K including K categorized images, where Ik is the until convergence, which returns not only all the objects
kth image. Our objective is to simultaneously find an image extracted from all images in the image collection, but also
labeling set L = {Lk }k=1 K for all images in the collection I, an object segmentation model with an embedded auto-context
where Lk = {L kp | p Ik } is the labeling for image Ik , and model automatically learned from all images.
L kp {0, 1} denotes a binary label for each pixel p in image
Ik , 0/1 correspond to background/foreground, respectively. C. Auto-Context Model
Then, our energy function for segmentation of categorized
We leverage an auto-context model originally proposed by
objects becomes
Tu [12] and later extended by Wang et al. [13] for automatic
  object extraction from images. The auto-context model builds
E(L) = C p (L kp ) + D p (L kp )
a multi-layer Boosting classifier on image features and context
pI pIk
 features surrounding a pixel to predict if this pixel is associated
+ pq (L kp , L kq ), k = 1, . . . , K , (1) with the target concept, where subsequent layer is working
( p,q)Nk on the probability maps from the previous layer. Please refer
 
where pI C p (L kp ) and pIk D p (L kp ) compose the data to [12] and [13] to check the details of the auto-context
term, denoting the cost for assigning label L kp to pixel p model. Here, the auto-context model is embedded in the
of image Ik from an auto-context model and an appearance energy minimization formulation, and trained on all images
of the categorized image collection in the automatic process
model, respectively. The subscript p I in pI C p (L kp )
denotes that the auto-context model is learned from all of segmentation of categorized objects.
images of the categorized image collection I, thus it is In the first round of our iterative learning process, the
image independent. This is the major difference compared training set for the auto-context model is built on all images
to the image dependent auto-context model in [13], in which of the image collection I as
the auto-context model is  learned on a single target image. S1 = {(L 0kp , Okp
0
)| p Ik , k = 1, . . . , K },
The subscript p Ik in pIk D p (L kp ) denotes that the
appearance model is updated only based on the single target where L kp Lk denotes the initial binary label for pixel p
0 0

image Ik , thus it is image dependent, which is the same as of image Ik , and L0k = {L 0kp | p Ik } is the initial segmen-
the image dependent one built  in [13] by fusing the color tation map for image Ik obtained by jointly using a visual
and intensity cues. Moreover, ( p,q)Nk pq (L kp , L kq ) is saliency model [57] and an adaptive selection mechanism
0 O0 denotes the structured patches of
the spatial prior term to encourage the labels (i.e., L kp and designed in [13]. Okp k
L kq ) of neighboring pixels (i.e., p and q) to be consistent, the auto-context model centered at pixel p of image Ik , and
where Nk is the set of neighboring pixels in image Ik . pq O0k = {Okp 0 | p I } are the structured patches of all sample
k
is computed based on the edge probability map from Martin pixels for image Ik , which are sampled from the discriminative
et al. [56], and (L kp , L kq ) is a Dirac delta function. The probability map P0k of image Ik .
subscript ( p, q) Nk denotes that the spatial prior term is Here, we directly use the saliency map generated by a
computed from the target single image Ik , thus it is image visual saliency model [57] as the initial probability map
dependent and the same as the one in [13]. The parameter P0k = { pkp 0 | p I } for each image I . The saliency values are
k k
WANG et al.: JOINT SEGMENTATION AND RECOGNITION OF CATEGORIZED OBJECTS 4075

Fig. 3. From left to right: the categorized image collection, its segmentation
maps and discriminative probability maps. The sampling structure of the auto-
context model is illustrated on the probability maps.

used as the discriminative probabilities, and the probabilities of


the structured patches centered at pixel p denote the contextual
cue for pixel p. The segmentation maps, the discriminative
probability maps and the sampling structure of the auto-
context model are illustrated in Fig. 3.
After the first classifier is learned on the probabili-
ties of structured patches sampled on the probability maps
P 0 = {P0k }k=1
K across the image collection I, the learned
classifier outputs the new discriminative probability map
P1k = { pkp1 | p I } for each image I . The auto-context model
k k
C p (L kp ) in Eq(1) is then updated as
C 1p (L 1kp ) = pkp
1
= p(L 1kp |O0 ), Fig. 4. The iterative learning process of the auto-context model.

C 1p (L 1kp ) = 1, p Ik , k = 1, . . . , K . (2)
L 1kp

where O0 = {O0k }k=1


K denotes the structured patches set across

the image collection I.


From the second round of the iterative learning process, we
construct the training set as
S2 = {(L 1kp , Okp
1
)| p Ik , k = 1, . . . , K },
where Okp 1 O1 denotes the structured patches of the auto-
k
context model centered at pixel p of image Ik , which are
sampled from the new discriminative probability map P1k
produced from the previous round.
Then a new classifier is trained on the probabilities of
structured patches of auto-context model sampled on the dis-
criminative probability maps P 1 = {P1k }k=1
K across the image
collection I. The auto-context model is then updated as
follows
C 2p (L 2kp ) = pkp
2
= p(L 2kp |O1 ),

C 2p (L 2kp ) = 1, p Ik , k = 1, . . . , K . (3) Fig. 5. The training process of the object category model.

L 2kp

2 denotes the probability on the new discriminative rejecting the outlier/noisy images. Suppose we have collected a
where pkp
categorized image collection I = {Ik }k=1
K including K images,
probability map P2k of image Ik . and have extracted the objects from them. A rejected image
This process iterates until convergence, where the discrimi- Ni
set R = {Ri }i=1 including Ni noisy images is also collected.
native probability maps are not changing anymore. Actually, in The training process of the object category model is sum-
our formulation, the auto-context model is seamlessly updated marized in Fig. 5. We first compute the SIFT descriptors [50]
with the iterative energy minimization of Eq(1). We outline the on a regular grid [45] across each categorized image Ik and
iterative process of the auto-context model in Fig. 4. each noisy image Ri . The SIFT descriptors computed on the
image collection I and the rejected image set R are then
D. Categorized Images Recognition clustered into visual words by using k-means, and the visual
We leverage a bag-of-words model [20], [45] as the words to form a visual word vocabulary. We then compute
object category model to recognize categorized images while the histogram of visual words from each image Ik , each
4076 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 23, NO. 9, SEPTEMBER 2014

foreground of the image given the visual word vocabulary. The


histograms are classified by the learned object category clas-
sifier, and the category decisions of the images are obtained.
We simply reject the noisy images and add them to the
rejected image set R; and accept the categorized images to
the categorized image collection I.
1) Histogram Intersection Kernel SVM: The fast HIKSVM
proposed in [51] is leveraged to train our object category
classifier. It is proven to be an efficient learning algorithm
for such a recognition task. We use the histograms of visual
words and their labels as training data {(xi , yi )}i=1 N , where

xi = (x i (1), . . . , x i (n)), yi {1, +1}. With SVM, we maxi-


mize the following quadratic problem
Fig. 6. The testing process of the object category model.

N
1 
N
W () = i i j yi y j k(xi , x j )
2
object extracted from image Ik , each background extracted i i, j =1

from image Ik , and each noisy image Ri , respectively. Each 


N

histogram can be regarded as a representation for the cor- subject to i yi = 0 (4)


responding image/object/background. The histograms of all i
images in the collection I and all objects extracted from the where k(xi , x j ) is the histogram intersection kernel.
collection I are treated as positive training examples; while the Let histogram xl : l {1, 2, . . . , m} denote the support vec-
histograms of all backgrounds extracted from the collection I tor, and the decision function is sign(h(x)), where
and all noisy images of the rejected image set R are treated
as negative training examples. Finally, an object category 
m
classifier is trained on the histograms for the target category h(x) = l yl k(x, xl ) + b. (5)
using a histogram intersection kernel SVM (HIKSVM) [51]. l=1
Note that we use the histograms of the objects segmented The histogram intersection kernel k(x, xl ) is defined as
from the categorized image collection I and all images in 
k(x, xl ) = ni=1 min(x(i ), xl (i )), and h(x) becomes
the categorized image collection I, instead of only the his-
tograms of all images in the categorized image collection I, 
m 
n 
as positive training examples; and we use the histograms of the h(x) = l yl min(x(i ), xl (i )) + b. (6)
backgrounds segmented from the categorized image collection l=1 i=1
I and all noisy images of the rejected image set R, instead of
Then, a trick can be used for intersection kernel to reduce
only the histograms of all noisy images of the rejected image
the cost of kernel computation. Specifically, the summations
set R, as negative training examples. This way, the positive
in Eq(6) can be exchanged to obtain
and negative training examples are both more representative,
and the differences between the proportions of the positive and n 
 m 
negative training examples to all the training examples can be h(x) = l yl min(x(i ), xl (i )) + b (7)
smaller. The reasons are that 1) the objects extracted from the i=1 l=1
categorized image collection are almost always on the ground n
truth segmentation foregrounds, 2) the backgrounds extracted = h i (x(i )) + b (8)
from the categorized image collection are almost always not on i=1
the ground truth foregrounds, 3) the images of the categorized
Rewriting the function h(x) as the sum of the individual
image collection indeed contain the target objects and thus can
functions, h i , one for each dimension, where
provide obvious distinction from the outliers, although they are
surrounded by the backgrounds, and 4) the vast majority of 
m
the noisy images of the rejected image set are outlier images. h i (x(i )) = l yl min(x(i ), xl (i )) (9)
Therefore, the object category recognition model trained by l=1
 
 
using the object segmentation results can have strong capa- = l yl xl (i ) + l yl x(i ) (10)
bility to differentiate the categorized images from the outlier xl (i)<x(i) xl (i)x(i)
images, and this will be validated in Section V-E. 
The testing process of the object category model is sum- To be noted that the terms xl (i)<x(i) l yl x l (i ) and
marized in Fig. 6. As we pick several new images from y
xl (i)x(i) l l are independent of the input data and depend
the candidate Web image collection A, we first segment only on the support vectors and . Thus, we can sort the
them respectively using the energy minimization based object support vector values in each coordinate, and precompute
segmentation model. We then compute the SIFT descriptors, them. Finally, the runtime complexity of computing h(x) is
and compute the histogram of visual words for each extracted O(n log m) as opposed to O(nm).
WANG et al.: JOINT SEGMENTATION AND RECOGNITION OF CATEGORIZED OBJECTS 4077

TABLE II TABLE III


F-M EASURE S CORES OF O UR M ETHODS , T U S M ETHOD [12] S EGMENTATION A CCURACIES OF O UR M ETHODS AND 8 O THER
AND R EN et al. S M ETHOD [59] T ESTED ON M ETHODS T ESTED ON W EIZMANN H ORSE DATASET [58]
THE W EIZMANN H ORSE D ATASET [58].

V. E XPERIMENTS AND D ISCUSSIONS


In this section, we present the experimental results and dis-
cussions. We start by the evaluation of our object segmentation
algorithm on three segmentation benchmarks in Section V-A.
Our 30-categories image dataset is introduced in Section V-B.
We proceed to discuss the implementation details of our
approach in Section V-C, the evaluation of pre-filtering the
Fig. 7. Comparison results of our method with [37] on Weizmann horse
noisy Web images in Section V-D, and the evaluation of our dataset [58]. 1st and 4th columns: original images. 2nd and 5th columns:
approach on the 30-categories of noisy Web image collections segmentation results from [37]. 3rd and 6th columns: segmentation results by
in Section V-E. our object segmentation method.

A. Evaluation of Our Object Segmentation Algorithm


Before testing our full approach on the task of joint
extraction and recognition of categorized objects from noisy
Web image collections, we first evaluate our object segmen-
tation algorithm on the Weizmann horse dataset [58], MSRC
dataset [10] and iCoseg dataset [36].
1) Evaluation on Weizmann Horse Dataset: The Weizmann
horse dataset [58] consists of 328 complex horse images
Fig. 8. 1st and 4th columns: images from Weizmann horse dataset [58]. 2nd
along with manually annotated label maps. We implement and 5th columns: segmentation results by using the auto-context model learned
two versions of our method, one segments all the images from the single target image. 3rd and 6th columns: segmentation results by
simultaneously using the auto-context model learned from all using the single auto-context model learned from all images of the dataset.
images of the categorized image collection, and the other
segments the images one by one using the auto-context model
learned from the single target image, as presented in [13]. where the numerator denotes the number of pixels of the
We compute the average F-measure on the entire dataset, intersection of the segmentation result and the ground truth
and compare it with various methods, including Tu [12] and segmentation, and the denominator denotes the total number
Ren et al. [59]. The F-measure is the harmonic mean of recall of pixels of the target image. In Table III, the average
and precision calculated on the foreground pixels, i.e., segmentation accuracy of our object segmentation method
outperforms [4], [7], [13], [27], [37], [39], [40], and compares
2 Recall Pr eci si on
F measur e = . competitively with LOCUS [1]. Note that the accuracy of our
Recall + Pr eci si on method is lower than [60], and this may be explained by the
As shown in Table II, the average F-measure of our object fact that [60] has strong ability of encouraging segmentation
segmentation method is 0.872, which outperforms Tus of images along boundaries of homogeneous color/texture. We
method [12], Ren et.als method [59], and our single object plan to incorporate such constraints in the proposed framework
extraction method [13]. Our method performed better as we in the future. We also give some sample comparison results
combined appearance based segmentation with an auto-context of our method with [37] in Fig. 7. The results show that both
model learned from all images in the target image collection, our method and [37] suffer from cluttered background, but our
which can capture the contextual information across the whole method works better than [37] when the object and background
dataset. have distinctive difference in color.
We also compute the average segmentation accuracy, and We also present some comparison results obtained by two
compare it with GrabCut [27], 4 different object segmentation versions of our method in Fig. 8, which clearly demonstrate
methods [1], [4], [7], [60] and 3 cosegmentation methods [37], that the single auto-context model learned from all images of
[39], [40]. The segmentation accuracy is measured by the the dataset is more effective than the one learned from the
ratio of the number of pixels classified correctly as either single target image.
foreground or background in agreement with the ground truth In addition, some object segmentation results are presented
segmentation to the total number of pixels. It is calculated as, in Fig. 9. It is shown that our method is able to extract the
Nseggt intact horses, even in the case of large changes in shape, color
Accur acy = , and backgrounds.
Nt ot al
4078 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 23, NO. 9, SEPTEMBER 2014

TABLE IV
S EGMENTATION A CCURACIES OF O UR M ETHOD AND 4 O THER
C OSEGMENTATION M ETHODS ON MSRC DATASET [10]

Fig. 9. Some object segmentation results on Weizmann horse dataset [58].

TABLE V
S EGMENTATION A CCURACIES OF O UR M ETHOD AND 5 O THER
C OSEGMENTATION M ETHODS ON iC OSEG D ATASET [36]

Fig. 10. Comparison results of our method with [8] on MSRC dataset [10].
The 1, 3 and 5 columns are segmentation results from [8]; the 2, 4 and 6
columns are our results.

2) Evaluation on MSRC Dataset: The MSRC dataset [10]


contains 20 categories of single object with about 30 images in
each category. We simultaneously segment the images of each
category using our object segmentation method, and compute
the average segmentation accuracy for each category.
We also compare our method with 4 other cosegmentation
methods [8], [35], [37], [38]. Table IV summarizes the average
segmentation accuracies of our method and 4 other methods.
The results show that our method is superior on 4 categories performs better in 7 out of 14 categories, and is on par with
(car, cow, flower and plane), but does not outperform the other the best for 2 other categories. For 5 categories, other methods
methods on other 4 categories (bird, cat, dog and sheep). The are clearly better than ours. We discuss the reasons as follows.
bird category has large intra-class variability, especially in The balloons are difficult to segment well due to the dra-
color and shape, thus it is difficult to segment the birds. The matic variations in size. The accuracy on the elephant category
low accuracy on cat is because the cats contain strong edges is low because the foreground objects have similar appearance
inside their bodies, and some of them have very similar color distributions with the backgrounds, and the backgrounds are
with the background. The low accuracy on the dog category is cluttered. For the panda category, the images are complex due
because the dogs have large shape deformations and viewpoint to the strong edges inside the objects. The skating category
changes. The accuracy on sheep category is slightly lower is difficult to segment since all the skaters are considered as
because our method sometimes fails to segment the legs and objects, and they have dramatic deformations and appearance
feet of the sheep. changes. The stone 2 category contains images with difficult
In addition, some comparison results of our method with [8] lighting conditions. These results show that our method may
are shown Fig. 10. The results showed that our method can not work well in cases that the objects undergo dramatic
segment the intact sheep out, and can work well on flowers and variations in shape, size and lighting, or have similar color
planes which have complex shape and strong color contrast with the backgrounds. We will further address these issues in
with the backgrounds. our future research.
3) Evaluation on iCoseg Dataset: The iCoseg dataset intro- In summary, as the results on the Weizmann horse dataset,
duced in [36] consists of 38 categories of images with hand MSRC dataset and iCoseg dataset have shown, our object
labeled pixel-wise segmentation ground truth for each image. segmentation method has the ability to segment the objects
For each category, we simultaneously segment all the images with moderate variations in color, pose, viewpoint and
using our object segmentation method, and compute the aver- object shape on most of the datasets, but has encountered some
age segmentation accuracy. We also compare our method with difficulties when dealing with objects which have very similar
5 other cosegmentation methods [8], [35], [37], [38], [40]. We appearance with the backgrounds, or objects which are subject
present the average segmentation accuracies of our method and to dramatic deformations in shape and size, or objects which
5 other methods in Table V. Our object segmentation method contain strong edges inside them.
WANG et al.: JOINT SEGMENTATION AND RECOGNITION OF CATEGORIZED OBJECTS 4079

B. Our 30-Categories Image Dataset


To evaluate the efficacy of our proposed approach and to
establish a benchmark for future research, we have down-
loaded 30 categories of 15, 634 Web images from Google
image search, which include a large number of outlier/noisy
images.
We first provide a seed image as the query into an image
search engine (e.g., Google image search) to conduct an
image-based search, it returns a large set of similar images and
a best guess of the object category of the seed image based
on the search results. Our goal is to harvest images from the
Internet with maximum diversity, while the returned similar
images have limited diversity. Thus, we treat the best guess
as the category of the image and subsequently conduct a text-
based image search through Google image search. Finally, a
large set of images with high diversity and the accompanying
captions can be obtained. If the inputs are a few keywords
instead of a seed image, we directly conduct a text-based
image search to collet a set of Web images. In support of the
final evaluation of our end-to-end system, we manually assign
each image a category label (e.g., accordion) denoting whether
the image is in the target category, and also manually give the
pixel-wise ground truth foreground labels for each categorized
image.
In total, we collect an image dataset of 30 categories, and
Fig. 11. Examples of ground truth labels for images of all 30 categories.
give the names of the 30 categories in Table VI. The number
of all Web images, the number of positive (categorized) Web
images and the precision for each category are also shown We then pick 30 new images from the rest of the candidate
in Table VI. The precision is the percentage of the number Web image collection A, and use the trained object category
of positive images to the number of all images. Examples classifier to classify them, as detailed in Section IV-D. Noisy
of hand-annotated ground truth labels for images of all images identified by the object category recognition model are
30 categories are shown in Fig. 11. This dataset will be made directly rejected and added to the rejected image set R; the
publicly available to facilitate future research. images identified to be in the target category are accepted and
added to the categorized image collection I. The visual words
C. Implementation Details of Our Approach vocabulary is then updated with the updated categorized image
Provided with a seed image or a few keywords, we collection I and rejected image set R.
first download a set of Web images from the Internet and We proceed to jointly extract the objects from the updated
pre-filter them by jointly using text-based and visual-based image collection I, and simultaneously update the object seg-
image filtering. When finished, we obtain a candidate Web mentation model. After we obtained the new segmentations,
image collection A = {Ai }i=1Na
including Na images, and a the object category classifier is updated. This process iterates
Nr until we finished all images of the candidate Web image
rejected image set R = {Ri }i=1 including Nr noisy images,
collection A, upon which no more images can be added to
as described in Section IV-A.
the categorized image collection. This forms a closed-loop co-
Due to the fact that the top ranked 40 images of the
training framework to automatically learn the object segmen-
candidate Web image collection across different categories
tation model and the object category recognition model.
are almost always in the target category, the categorized
To be noted that, both the auto-context model embedded
image collection I = {Ik }k=1
K is initialized with the top ranked
in the object segmentation and the object recognition
40 images of the candidate Web image collection, i.e., K
model are gradually enhanced with the iteration of our
is initialized to be 40, although other numbers can also be
approach, by adding several new images from the rest of the
used.
candidate Web image collection. According to our empirical
With the initial categorized image collection I, we extract
observations, we pick 30 new images during each iteration
the objects from the categorized image collection, and also
by considering both the efficacy and computational efficiency
train an object segmentation model with an embedded auto-
of our approach.
context model on all images which can help extract the catego-
rized objects, as described in Section IV-B and Section IV-C.
An object category model is then trained on the categorized D. Evaluation of Pre-Filtering the Noisy Web Images
image collection I and its segmentations, and the rejected By exploiting text-based and visual-based image filtering
image set R, as detailed in Section IV-D. described in Section IV-A, a number of noisy images are
4080 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 23, NO. 9, SEPTEMBER 2014

TABLE VI
S TATISTICS ON THE N OISY W EB I MAGE C OLLECTION , THE C ANDIDATE W EB I MAGE
C OLLECTION AND THE C ATEGORIZED I MAGE C OLLECTION

rejected from the downloaded Web images, and a candidate


Web image collection is built for each category. The number
of all images, the number of positive images, the precision
and the recall of the pre-filtering procedure for each candidate
Web image collection of the 30 categories and for the whole
dataset are shown in Table VI. The recall is the percentage
of the number of positive images in the candidate Web image
collection to the number of positive images in the original
noisy Web image collection. The precision of the candidate Fig. 12. Examples of rejected images from the noisy Web image collection
by text-based image filtering for category of lotus flower.
Web image collection is significantly higher than the precision
of the original noisy Web image collection, and the recall
for each category is almost around 90%. Fig. 12 and Fig. 13
give some examples of the rejected images by text-based and
visual-based image filtering for the category of lotus flower,
respectively. They show that our text-based and visual-based
image filtering provides an effective method for an initial
cleansing of the noisy Web image collection, while has the
ability to retain most of the positive images.
Fig. 13. Examples of rejected images from the noisy Web image collection
E. Evaluation of Our Approach by visual-based image filtering for category of lotus flower.

The number of all images, the number of positive


images, the precision and recall of our object recognition recall of our object category recognition algorithm is almost
algorithm for each categorized image collection of the above 90%.
30 categories finally collected by our approach and for the Fig. 14 gives the numbers of categories for which the
whole dataset are presented in Table VI. It shows that the pre- precision are greater than 60%, 70%, 80% and 90% for the
cision of the collected categorized image collection is greatly original noisy Web image collection, the candidate Web image
improved after being processed by our approach, and the collection and the categorized image collection, respectively.
WANG et al.: JOINT SEGMENTATION AND RECOGNITION OF CATEGORIZED OBJECTS 4081

TABLE VII
R ECOGNITION P RECISION AND P RECISION OF THE C ATEGORIZED I MAGE
C OLLECTION FOR E ACH OF THE 30 C ATEGORIES BY U SING THE O BJECT
C ATEGORY M ODEL T RAINED T HROUGH T WO V ERSIONS . V-1 D ENOTES
T HAT THE O BJECT C ATEGORY M ODEL I S T RAINED J UST ON THE
C ATEGORIZED I MAGES AND THE N OISY I MAGES ; V-2 D ENOTES T HAT
THE O BJECT C ATEGORY M ODEL I S T RAINED ON THE C ATEGORIZED
I MAGES AND T HEIR S EGMENTATIONS , AND THE N OISY I MAGES
Fig. 14. The numbers of categories for which the precision and recall are
greater than 60%, 70%, 80% and 90%, respectively.

It also gives the numbers of categories for which the recall


are greater than 60%, 70%, 80% and 90% for the candidate
Web image collection and the categorized image collection,
respectively. As the experimental results shown, the numbers
of categories for which the precision are greater than 90% are
5, 12 and 25, respectively; the numbers for which the recall are
greater than 90% are 25 and 29, respectively. This manifests
that our approach is capable of cleansing the noisy Web
images while retaining most of the positive images to collect
a categorized image collection with foreground segmentation
masks.
It is worth noting that, the number of images of our collected
categorized image collection for each category is around 100
to 500, this is due to the fact that the number of Web images
we downloaded for each category is around 300 to 600.
Actually, we can collect more categorized images, as long as
we have downloaded more Web images.
1) Recognition Performance: To evaluate the recognition
performance of our method, we compute the average recogni-
tion precisions on all 30 categories in Table VII. The object
category model is trained with two variants of our proposed
method. In the first version, the object category model is
trained just on the categorized images and the noisy images.
In the second version, the object category model is trained
based on the categorized images and their segmentations is exactly the one we described above (named as our standard
(i.e., foregrounds and backgrounds), and the noisy images, method). The differences are that, 1) in the incremental
which is exactly the one described in Section IV-D. The version, we pick one new image identified by the object
major difference is whether the segmentations are employed category model each time, and just segment the one image
to train the object category model. As shown in Table VII, based on an object segmentation model incrementally updated
the recognition precision is further improved by using the with the target image; 2) in the standard version, we pick
segmentation results. This clearly manifests that having a 30 images to the object category model each time, and
good foreground segmentation can have a considerable and simultaneously segment all images of the categorized image
beneficial impact on recognition. collection (consistently augmented with the accepted images)
Table VII also presents the precision of the categorized based on an object segmentation model learned from the image
image collection for each of the 30 categories by using the collection. The average F-measure scores of our standard
object category model trained through the aforementioned two method significantly outperform our incremental method. This
versions. It strongly demonstrates that our object category shows the advantage of our object segmentation method and
model is able to identify the images in the target category the object segmentation model learned from the whole image
while excluding the outlier/noisy images, which benefited from collection.
the object segmentation results. Moreover, in our approach, as the categorized image col-
2) Segmentation Performance: To quantitatively evaluate lection was augmented, we update the object segmentation
the segmentation performance of our method, we compute model with the embedded auto-context model from the whole
the average F-measure scores on all 30 categories, as shown augmented image collection and proceed to simultaneously
in Table VIII. We implement two versions of our method, segment the categorized objects from them, including the
the first version is an incremental version of our method images we have segmented before. This is because more
(named as our incremental method), and the second version discriminative contextual information can be exploited from
4082 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 23, NO. 9, SEPTEMBER 2014

TABLE VIII
AVERAGE F-M EASURES FOR 30 C ATEGORIES BY U SING
T WO V ERSIONS OF O UR S EGMENTATION M ETHOD

Fig. 15. Average F-measures of the first 100 images for 5 categories varying
with the numbers of images of the categorized image collection.

Fig. 16. The 1st and 3rd rows: the segmentation results obtained by our
object segmentation method, varying with the augmentation of the categorized
image collection, where the above numbers are the numbers of images of
the corresponding categorized image collection. The 2nd and 4th rows: the
the augmented categorized image collection, which can further probability maps output by the last iteration of the auto-context model, varying
improve the segmentation on those images that we have with the augmentation of the categorized image collection. The example
images are from our 30-categories image dataset.
processed before. We present the average F-measures of the
first 100 images for 5 categories of images varying with
objects from general Web images with moderate variations
the numbers of images in the categorized image collection
in color, size, pose, viewpoint, and shape on most of the
segmented at each time in Fig. 15, and the segmentation results
categories, but encounters difficulties when the objects have
obtained by our object segmentation method and the probabil-
very similar color with the backgrounds (e.g., the cruise and
ity maps output by the last iteration of the auto-context model
tree frog), or exhibit dramatic variations in shape (e.g., the
of some example images varying with the augmentation of
eagle, elephant, and starfish) and size (e.g., the hummingbird),
the categorized image collection in Fig. 16. As demonstrated
or have very complex shape (e.g., the helicopter), or when
by the average F-measures and the segmentation results, the
the background is very cluttered (e.g., the clownfish and
object segmentation model indeed can be strengthened while
gecko).
learning the auto-context model from the augmented image
3) Convergence Analysis: It may not be feasible to derive
collection, and thus results in better segmentations; and as
a strict theoretic guarantee of the convergence of our object
shown in the probability maps, the auto-context model indeed
segmentation method which is cast into an energy mini-
can be enhanced while exploiting the contextual information
mization framework, but empirically it always converges. In
from the augmented image collection, and thus results in better
our experiments, if the energy values of consecutive three
estimations of the probability maps.
iterations satisfy both
Fig. 17 gives some examples of segmented categorized
images of the 30 categories. The 1st and 2nd examples of E T 2 (L) E T 1 (L)
< 0.01
each category are the segmentation results on the first 2 images E T 2 (L)
of the categorized image collection; the 3rd example of each and
category is the sample segmentation result on image with small
E T 1 (L) E T (L)
foreground object; and the 4th example of each category is the < 0.01,
sample failure segmentation result of the categorized image E T 1 (L)
collection. As the results in Table VIII and Fig. 17 shown, the iteration will terminate on the T th iteration. We present
our object segmentation method is capable of segmenting the the trend of the energy function on the first 40 images on
WANG et al.: JOINT SEGMENTATION AND RECOGNITION OF CATEGORIZED OBJECTS 4083

Fig. 17. Some examples of the segmentation results on the 30 categories.

6 categories of our 30-categories image dataset in Fig. 18. 4) Initialization Impact: To illustrate the impact of the
According to the experimental results, each step of the initialization on the performance of the auto-context model and
energy minimization ensures that the energy in Eq(1) is non- also our object segmentation method, we present the interme-
increasing, and the energy function always converges within diate probability maps output by the auto-context model and
10 iterations. the intermediate segmentation results obtained by our object
4084 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 23, NO. 9, SEPTEMBER 2014

mainly from the segmented categorized images. Empiri-


cal results on four datasets demonstrated the advantage of our
proposed method.

R EFERENCES
[1] J. Winn and N. Jojic, LOCUS: Learning object classes with unsuper-
vised segmentation, in Proc. 10th IEEE ICCV, Oct. 2005, pp. 756763.
[2] H. Arora, N. Loeff, D. A. Forsyth, and N. Ahuja, Unsupervised
segmentation of objects using efficient learning, in Proc. IEEE CVPR,
Jun. 2007, pp. 17.
[3] E. Borenstein and S. Ullman, Learning to segment, in Proc. 8th ECCV,
May 2004, pp. 315328.
[4] L. Cao and L. Fei-Fei, Spatially coherent latent topic model for
concurrent segmentation and classification of objects and scenes, in
Proc. IEEE 11th ICCV, Oct. 2007, pp. 18.
Fig. 18. Energy values in the iterative process of energy minimization on [5] J. Cui et al., Transductive object cutout, in Proc. IEEE CVPR,
the first 40 images for 6 categories of our 30-categories image dataset. Jun. 2008, pp. 18.
[6] Y. J. Lee and K. Grauman, Collect-cut: Segmentation with top-
down cues discovered in multi-object images, in Proc. IEEE CVPR,
Jun. 2010, pp. 31853192.
[7] B. Alexe, T. Deselaers, and V. Ferrari, Classcut for unsupervised class
segmentation, in Proc. 11th ECCV, 2010, pp. 380393.
[8] L. Mukherjee, V. Singh, J. Xu, and M. D. Collins, Analyzing the
subspace structure of related images: Concurrent segmentation of image
sets, in Proc. 12th ECCV, 2012, pp. 128142.
[9] Y. N. Law, H. K. Lee, M. K. Ng, and A. M. Yip, A semisupervised
segmentation model for collections of images, IEEE Trans. Image
Process., vol. 21, no. 6, pp. 29552968, Jun. 2012.
[10] J. Shotton, J. Winn, C. Rother, and A. Criminisi, Textonboost: Joint
appearance, shape and context modeling for multi-class object recogni-
tion and segmentation, in Proc. 9th ECCV, 2006, pp. 115.
[11] A. Rabinovich, A. Vedaldi, C. Galleguillos, E. Wiewiora, and
S. Belongie, Objects in context, in Proc. IEEE 11th ICCV, Oct. 2007,
pp. 18.
[12] Z. Tu, Auto-context and its application to high-level vision tasks, in
Proc. IEEE CVPR, Jun. 2008, pp. 18.
[13] L. Wang, J. Xue, N. Zheng, and G. Hua, Automatic salient object
extraction with contextual cue, in Proc. IEEE ICCV, Nov. 2011,
pp. 105112.
[14] J. Xue, L. Wang, N. Zheng, and G. Hua, Automatic salient object
extraction with contextual cue and its applications to recognition and
Fig. 19. The 1st, 3rd and 5th rows: the intermediate segmentation results alpha matting, Pattern Recognit., vol. 46, no. 11, pp. 28742889, 2013.
obtained by our object segmentation method at each iteration. The 2nd, 4th
[15] L. Wang, G. Hua, R. Sukthankar, J. Xue, and N. Zheng, Video object
and 6th rows: the intermediate probability maps output by the auto-context
discovery and co-segmentation with extremely weak supervision, in
model at each iteration. The example images are from our 30-categories image
Proc. ECCV, 2014.
dataset.
[16] L.-J. Li and L. Fei-Fei, OPTIMOL: Automatic online picture collection
via incremental model learning, Int. J. Comput. Vis., vol. 88, no. 2,
pp. 147168, 2010.
segmentation method of each iteration for some example [17] Y. Y. Boykov and M.-P. Jolly, Interactive graph cuts for optimal
boundary & region segmentation of objects in N-D images, in Proc.
images from our 30-categories image dataset in Fig. 19. As 8th IEEE ICCV, 2001, pp. 105112.
the example results shown, the auto-context model and also [18] Y. Boykov and G. Funka-Lea, Graph cuts and efficient N-D image
our fully automatic object segmentation method are robust to segmentation, Int. J. Comput. Vis., vol. 70, no. 2, pp. 109131, 2006.
[19] L. Wang, J. Xue, N. Zheng, and G. Hua, Concurrent segmentation
the initial salient region, as long as it is not totally off the of categorized objects from an image collection, in Proc. 21st ICPR,
target. This conclusion can also be obtained by observing all Nov. 2012, pp. 33093312.
the results in our experiments. [20] J. Sivic, B. C. Russell, A. A. Efros, A. Zisserman, and W. T. Freeman,
To summarize, as shown above, our approach has the capa- Discovering object categories in image collections, in Proc. ICCV,
2005.
bility of automatically collecting a large number of categorized [21] B. C. Russell, A. Torralba, K. P. Murphy, and W. T. Freeman, LabelMe:
images from noisy Web images, and clearly segmenting the A database and web-based tool for image annotation, Int. J. Comput.
categorized objects from them. Vis., vol. 77, nos. 13, pp. 157173, 2008.
[22] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, ImageNet:
A large-scale hierarchical image database, in Proc. IEEE CVPR,
Jun. 2009, pp. 248255.
VI. C ONCLUSIONS [23] R. Fergus, L. Fei-Fei, P. Perona, and A. Zisserman, Learning object
categories from Googles image search, in Proc. 10th IEEE ICCV,
We propose a method for automatically extracting and Oct. 2005, pp. 18161823.
recognizing categorized objects from noisy Web image collec- [24] F. Schroff, A. Criminisi, and A. Zisserman, Harvesting image databases
tions. This is achieved by co-training of an object segmentation from the web, in Proc. IEEE 11th ICCV, Oct. 2007, pp. 18.
[25] N. Sawant, J. Z. Wang, and J. Li, Enhancing training collections for
model with an embedded auto-context model learned from image annotation: An instance-weighted mixture modeling approach,
all categorized images, and an object category model learned IEEE Trans. Image Process., vol. 22, no. 9, pp. 35623577, Sep. 2013.
WANG et al.: JOINT SEGMENTATION AND RECOGNITION OF CATEGORIZED OBJECTS 4085

[26] X. Chen, J. K. Udupa, U. Bagci, Y. Zhuge, and J. Yao, Medical image [54] S. Yu, R. Gross, and J. Shi, Concurrent object recognition and seg-
segmentation by combining graph cuts and oriented active appearance mentation by graph partitioning, in Advances in Neural Information
models, IEEE Trans. Image Process., vol. 21, no. 4, pp. 20352046, Processing Systems. Cambridge, MA, USA: MIT Press, 2002.
Apr. 2012. [55] M. F. Porter, An algorithm for suffix stripping, Program, vol. 14, no. 3,
[27] C. Rother, V. Kolmogorov, and A. Blake, GrabCut: Interactive pp. 130137, 1980.
foreground extraction using iterated graph cuts, ACM Trans. Graph., [56] D. Martin, C. Fowlkes, and J. Malik, Learning to detect natural image
vol. 23, no. 3, pp. 309314, 2004. boundaries using local brightness, color, and texture cues, IEEE Trans.
[28] G. Hua, Z. Liu, Z. Zhang, and Y. Wu, Iterative local-global energy Pattern Anal. Mach. Intell., vol. 26, no. 5, pp. 530549, May 2004.
minimization for automatic extraction of objects of interest, IEEE [57] J. Harel, C. Koch, and P. Perona, Graph-based visual saliency,
Trans. Pattern Anal. Mach. Intell., vol. 28, no. 10, pp. 17011706, Advances in Neural Information Processing Systems. Cambridge, MA,
Oct. 2006. USA: MIT Press, 2007.
[29] V. Lempitsky, P. Kohli, C. Rother, and T. Sharp, Image segmentation [58] E. Borenstein, E. Sharon, and S. Ullman, Combining top-down and
with a bounding box prior, in Proc. IEEE 12th ICCV, Oct. 2009, bottom-up segmentation, in Proc. CVPR, Jun. 2004, p. 46.
pp. 277284. [59] X. Ren, C. Fowlkes, and J. Malik, Cue integration for figure/ground
[30] W. Tao, Iterative narrowband-based graph cuts optimization for geo- labeling, in Advances in Neural Information Processing Systems.
desic active contours with region forces (GACWRF), IEEE Trans. Cambridge, MA, USA: MIT Press, 2005.
Image Process., vol. 21, no. 1, pp. 284296, Jan. 2012. [60] G. Liu, Z. Lin, X. Tang, and Y. Yu, A hybrid graph model for
[31] O. Veksler, Star shape prior for graph-cut image segmentation, in Proc. unsupervised object segmentation, in Proc. IEEE 11th ICCV, Oct. 2007,
10th ECCV, Oct. 2008, pp. 454467. pp. 18.
[32] C. Jung and C. Kim, A unified spectral-domain approach for saliency
detection and its application to automatic object segmentation, IEEE
Trans. Image Process., vol. 21, no. 3, pp. 12721283, Mar. 2012.
[33] C. Rother, T. Minka, A. Blake, and V. Kolmogorov, Cosegmentation of
Le Wang received the B.S. degree in automatic
image pairs by histogram matchingIncorporating a global constraint control engineering from Xian Jiaotong University,
into MRFs, in Proc. IEEE CVPR, Jun. 2006, pp. 9931000.
Xian, China, in 2008, where he is currently pursuing
[34] D. S. Hochbaum and V. Singh, An efficient algorithm for co-
the Ph.D. degree with the Institute of Artificial Intel-
segmentation, in Proc. IEEE 12th ICCV, Oct. 2009, pp. 269276.
ligence and Robotics. From 2013 to 2014, he was a
[35] S. Vicente, V. Kolmogorov, and C. Rother, Cosegmentation revisited: Visiting Ph.D. Student with the Stevens Institute of
Models and optimization, in Proc. 11th ECCV, 2010, pp. 465479.
Technology, Hoboken, NJ, USA.
[36] D. Batra, A. Kowdle, D. Parikh, J. Luo, and T. Chen, iCoseg: Interactive
His research interests include computer vision,
co-segmentation with intelligent scribble guidance, in Proc. IEEE
machine learning, and their application in object
CVPR, Jun. 2010, pp. 31693176. discovery and segmentation from images and videos.
[37] A. Joulin, F. Bach, and J. Ponce, Discriminative clustering for image
co-segmentation, in Proc. IEEE CVPR, Jun. 2010, pp. 19431950.
[38] S. Vicente, C. Rother, and V. Kolmogorov, Object cosegmentation, in
Proc. IEEE CVPR, Jun. 2011, pp. 22172224.
[39] Y. Chai, V. S. Lempitsky, and A. Zisserman, BiCoS: A bi-level co-
segmentation method for image classification, in Proc. IEEE ICCV,
Gang Hua (M03SM11) received the B.S. degree
Nov. 2011, pp. 25792586.
in automatic control engineering and the M.S. degree
[40] J. C. Rubio, J. Serrat, A. Lpez, and N. Paragios, Unsupervised co-
in pattern recognition and intelligence system from
segmentation through region matching, in Proc. IEEE CVPR, Jun. 2012,
Xian Jiaotong University (XJTU), Xian, China, in
pp. 749756.
1999 and 2002, respectively, and the Ph.D. degree
[41] A. Joulin, F. Bach, and J. Ponce, Multi-class cosegmentation, in Proc.
in electrical and computer engineering from North-
IEEE CVPR, Jun. 2012, pp. 542549.
western University, Evanston, IL, USA, in 2006.
[42] G. Kim and E. P. Xing, On multiple foreground cosegmentation, in
He is an Associate Professor of Computer Science
Proc. IEEE CVPR, Jun. 2012, pp. 837844.
with the Stevens Institute of Technology, Hoboken,
[43] R. Fergus, P. Perona, and A. Zisserman, Weakly supervised scale-
NJ, USA. He also holds an Academic Visiting
invariant learning of models for visual recognition, Int. J. Comput. Vis.,
Researcher position with the IBM T. J. Watson
vol. 71, no. 3, pp. 273303, 2007.
Research Center, Ossining, NY, USA. Prior to that, he was a Research Staff
[44] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan,
Member with the IBM Research T. J. Watson Center from 2010 to 2011, a
Object detection with discriminatively trained part-based models,
Senior Researcher with the Nokia Research Center Hollywood, Santa Monica,
IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 9, pp. 16271645,
CA, USA, from 2009 to 2010, and a Scientist with Microsoft Live Labs
Sep. 2010.
Research, Bellevue, WA, USA, from 2006 to 2009. He was enrolled in the
[45] L. Fei-Fei and P. Perona, A Bayesian hierarchical model for learning
Special Class for the Gifted Young of XJTU in 1994. He holds nine U.S.
natural scene categories, in Proc. IEEE CVPR, Jun. 2005, pp. 524531.
patents and 13 more U.S. patents pending.
[46] K. Kesorn and S. Poslad, An enhanced bag-of-visual word vector space
Dr. Hua is a Life Member of the Association for Computing Machinery. He
model to represent visual content in athletics images, IEEE Trans.
was a recipient of the Richter Fellowship and the Walter P. Murphy Fellowship
Multimedia, vol. 14, no. 1, pp. 211222, Feb. 2012.
from Northwestern University in 2005 and 2002, respectively.
[47] S. Lazebnik, C. Schmid, and J. Ponce, Beyond bags of features: Spatial
pyramid matching for recognizing natural scene categories, in Proc.
IEEE CVPR, 2006, pp. 21692178.
[48] J. Zhang, M. Marszaek, S. Lazebnik, and C. Schmid, Local features
and kernels for classification of texture and object categories: A com- Jianru Xue (M06) received the masters and Ph.D.
prehensive study, Int. J. Comput. Vis., vol. 73, no. 2, pp. 213238, degrees from Xian Jiaotong University (XJTU),
2007. Xian, China, in 1999 and 2003, respectively. He
[49] L. Fei-Fei, R. Fergus, and A. Torralba, Recognizing and learning object was with FujiXerox, Tokyo, Japan, from 2002 to
categories, in Proc. ICCV, 2009. 2003, and visited the University of California at
[50] D. Lowe, Distinctive image features from scale-invariant keypoints, Los Angeles, Los Angeles, CA, USA, from 2008
Int. J. Comput. Vis., vol. 60, no. 2, pp. 91110, 2004. to 2009.
[51] S. Maji, A. Berg, and J. Malik, Classification using intersection kernel He is currently a Professor with the Institute of
support vector machines is efficient, in Proc. IEEE CVPR, Jun. 2008, Artificial Intelligence and Robotics at XJTU. His
pp. 18. research field includes computer vision, visual nav-
[52] J. Wu, Efficient HIK SVM learning for image classification, IEEE igation, and video coding based on analysis.
Trans. Image Process., vol. 21, no. 10, pp. 44424453, Oct. 2012. Prof. Xue served as a Co-Organization Chair of the 2009 Asian Conference
[53] C. Pantofaru, C. Schmid, and M. Hebert, Object recognition by on Computer Vision and 2006 Virtual System and Multimedia conference. He
integrating multiple image segmentations, in Proc. 10th ECCV, 2008, also served as a PC Member of the 2012 Pattern Recognition conference, and
pp. 481494. the 2010 and 2012 Asian conference on Computer Vision.
4086 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 23, NO. 9, SEPTEMBER 2014

Zhanning Gao received the B.S. degree in auto- Nanning Zheng (SM94F06) received the degree
matic control engineering from Xian Jiaotong Uni- from the Department of Electrical Engineering,
versity, Xian, China, in 2012, where he is currently Xian Jiaotong University (XJTU), Xian, China, in
pursuing the Ph.D. degree with the Institute of Artifi- 1975, the M.E. degree in information and control
cial Intelligence and Robotics. His research interests engineering from XJTU in 1981, and the Ph.D.
include image/text processing and image collection. degree in electrical engineering from Keio Univer-
sity, Tokyo, Japan, in 1985.
He is currently a Professor and the Director of
the Institute of Artificial Intelligence and Robotics
at XJTU. His research interests include computer
vision, pattern recognition, computational intelli-
gence, image processing, and hardware implementation of intelligent systems.
Dr. Zheng has been the Chinese Representative on the Governing Board of
the International Association for Pattern Recognition since 2000. He currently
serves as an Executive Editor of the Chinese Science Bulletin. He became a
member of the Chinese Academy Engineering in 1999.

You might also like