
Deformable part models

Ross Girshick
UC Berkeley

CS231B Stanford University


Guest Lecture April 16, 2013
Image understanding

Snack time in the lab

photo by “thomas pix” http://www.flickr.com/photos/thomaspix/2591427106


What objects are where?

I see twinkies!

...

robot: "I see a table with twinkies, pretzels, fruit, and some mysterious chocolate things..."
DPM lecture overview

Part 1: modeling

Part 2: learning
[Slide background: the first page of "A Discriminatively Trained, Multiscale, Deformable Part Model" by Pedro Felzenszwalb, David McAllester, and Deva Ramanan (CVPR 2008), overlaid with the PASCAL person detection AP timeline: 12% (2005), 27% (2008), 36% (2009), 45% (2010), 49% (2011)]
Formalizing the object detection task

Many possible ways, this one is popular:

Input: an image

Desired output: a bounding box and class label for each object instance
(person, cat, dog, chair, cow, motorbike, car, ...)

Performance summary:

Average Precision (AP)


0 is worst, 1 is perfect
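Since AP is the yardstick for everything that follows, here is a minimal sketch of computing it from ranked detections. This is the generic area-under-the-PR-curve computation, not the exact PASCAL protocol (early years used 11-point interpolation), and all names are illustrative:

```python
import numpy as np

def average_precision(scores, is_tp, num_gt):
    """AP = area under the precision-recall curve of ranked detections.

    scores: detection confidences; is_tp[i] = 1 if detection i matches
    an unclaimed ground-truth box, else 0; num_gt: # ground-truth objects.
    """
    order = np.argsort(-np.asarray(scores, dtype=float))
    tp = np.asarray(is_tp, dtype=float)[order]
    tp_cum = np.cumsum(tp)
    fp_cum = np.cumsum(1.0 - tp)
    recall = tp_cum / num_gt
    precision = tp_cum / (tp_cum + fp_cum)
    # Riemann sum of precision over recall increments
    ap, prev_r = 0.0, 0.0
    for p, r in zip(precision, recall):
        ap += p * (r - prev_r)
        prev_r = r
    return ap

# toy example: 4 detections ranked by score, 3 ground-truth objects
print(average_precision([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 1], num_gt=3))
# -> 1*(1/3) + (2/3)*(1/3) + (3/4)*(1/3) ~= 0.806
```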
Benchmark datasets

PASCAL VOC 2005 – 2012


- 54k objects in 22k images
- 20 object classes
- annual competition
Reduction to binary classification

pos = { ... ... }

neg = { ... background patches ... }

"Sliding window" detector

Dalal & Triggs (CVPR'05)

HOG descriptor + linear SVM

Descriptor cues: the most important cues are head, shoulder, and leg silhouettes
(input image; HOG weighted by the positive SVM weights; HOG weighted by the negative SVM weights)

Vertical gradients inside the person count as negative
Sliding window detection

score(x, p) = w · φ(x, p)

Image pyramid → HOG feature pyramid

• Compute HOG of the whole image at multiple resolutions

• Score every subwindow of the feature pyramid

• Apply non-maxima suppression
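A brute-force sketch of this pipeline, assuming the HOG pyramid has already been computed (each level an (H, W, D) array). The filter shape, feature dimension, and threshold are illustrative, and real systems vectorize the inner loops as correlations:

```python
import numpy as np

def score_pyramid(pyramid, w, thresh=0.0):
    """Evaluate score(x, p) = w . phi(x, p) at every pyramid position p.

    pyramid: list of (H, W, D) HOG maps at multiple resolutions;
    w: (fh, fw, D) linear filter. Returns (score, level, y, x) tuples
    above thresh, sorted by score (NMS would be applied afterwards).
    """
    fh, fw, _ = w.shape
    hits = []
    for level, feat in enumerate(pyramid):
        H, W, _ = feat.shape
        for y in range(H - fh + 1):
            for x in range(W - fw + 1):
                s = float(np.sum(w * feat[y:y + fh, x:x + fw]))
                if s >= thresh:
                    hits.append((s, level, y, x))
    return sorted(hits, reverse=True)

# toy usage: random "HOG" maps at three scales, a 16x8-cell person filter
rng = np.random.default_rng(0)
pyramid = [rng.standard_normal((n, n, 31)) for n in (40, 28, 20)]
w = rng.standard_normal((16, 8, 31))
print(score_pyramid(pyramid, w, thresh=100.0)[:3])
```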
Detection

number of locations p ~ 250,000 per image

test set has ~ 5,000 images

>> 1.3×10^9 windows to classify (250,000 × 5,000 ≈ 1.3 billion)

typically only ~ 1,000 true positive locations

Extremely unbalanced binary classification


Dalal & Triggs detector on INRIA

[Recall-precision curves on the INRIA static person database for different descriptors: Lin. R-HOG, Ker. R-HOG, Lin. R2-HOG, Wavelet, PCA-SIFT, Lin. E-ShapeC. HOG features clearly outperform the other state-of-the-art descriptors.]

• AP = 75% (79% in my implementation)

• Very good

• Declare victory and go home?
Dalal & Triggs on PASCAL VOC 2007

AP = 12% (using my implementation)
How can we do better?

Revisit an old idea: part-based models (“pictorial structures”)


- Fischler & Elschlager ‘73, Felzenszwalb & Huttenlocher ’00

Combine with modern features and machine learning


Part-based models

• Parts — local appearance templates

• “Springs” — spatial connections between parts (geom. prior)

Image: [Felzenszwalb and Huttenlocher 05]


Part-based models

• Local appearance is easier to model than global appearance

- Training data shared across deformations

- “part” can be local or global depending on resolution

• Generalizes to previously unseen configurations


General formulation

G = (V, E)    graph of parts v_1, ..., v_n with spring connections E

p = (p_1, ..., p_n)    part locations in the image (or feature pyramid)
Part configuration score function

                       Part match scores       spring costs

score(p_1, ..., p_n) = Σ_i m_i(p_i) - Σ_{(i,j) ∈ E} d_ij(p_i, p_j)

Highest scoring configurations

• Objective: maximize score over p_1, ..., p_n

• h^n configurations! (h = |P|, about 250,000)

• Dynamic programming (sketched below)

- If G = (V, E) is a tree, O(nh^2) general algorithm

‣ O(nh) with some restrictions on d_ij
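For the star-structured case used in DPMs (every part tied to the root), the maximization decomposes: each part's best placement can be found independently per root location. Below is a brute-force sketch of that O(n·h^2) computation; the deformation cost and array sizes are toy stand-ins, and the O(nh) version replaces the inner max with a generalized distance transform (sketched later):

```python
import numpy as np

def star_score(root_scores, part_scores, def_cost):
    """score(p0) = m0(p0) + sum_i max_{pi} [ m_i(pi) - d_i(p0, pi) ].

    root_scores: (H, W) root filter responses; part_scores: list of
    (H, W) part filter responses; def_cost((y0,x0), (y,x)): spring cost.
    """
    H, W = root_scores.shape
    locs = [(y, x) for y in range(H) for x in range(W)]
    total = root_scores.astype(float).copy()
    for m in part_scores:                        # each part independently
        for y0 in range(H):
            for x0 in range(W):
                total[y0, x0] += max(m[y, x] - def_cost((y0, x0), (y, x))
                                     for y, x in locs)
    return total                                 # best detection: argmax over p0

# toy example: quadratic spring pulling the part toward anchor offset (2, 0)
quad = lambda p0, p: (p[0] - p0[0] - 2) ** 2 + (p[1] - p0[1]) ** 2
rng = np.random.default_rng(1)
scores = star_score(rng.standard_normal((6, 6)),
                    [rng.standard_normal((6, 6)) for _ in range(2)], quad)
print(np.unravel_index(int(scores.argmax()), scores.shape))
```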


Star-structured deformable part models

root part

test image “star” model detection


Recall the Dalal & Triggs detector

score(x, p) = w · φ(x, p)

Image pyramid → HOG feature pyramid

• HOG feature pyramid

• Linear filter / sliding-window detector

• SVM training to learn parameters w
D&T + parts

[Figure: root location p_0 and part placements z in a HOG feature pyramid built from an image pyramid]

[FMR CVPR'08] [FGMR PAMI'10]

• Add parts to the Dalal & Triggs detector

- HOG features

- Linear filters / sliding-window detector

- Discriminative training
Sliding window DPM score function

z = (p_1, ..., p_n)

score(x, p_0) = max_{p_1,...,p_n} [ Σ_{i=0}^{n} m_i(x, p_i) - Σ_{i=1}^{n} d_i(p_0, p_i) ]

                 Filter scores           Spring costs
Detection in a slide

test image → feature map (and feature map at 2x resolution)

model: root filter; 1-st part filter, ..., n-th part filter

response of root filter; responses of part filters

transformed responses: max_{p_i} [ m_i(x, p_i) - d_i(p_0, p_i) ]

sum (+) → detection scores for each root location

(color encoding of filter response values: low value to high value)
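The "transformed responses" box is where the speed comes from: with quadratic spring costs, the max over part placements is a generalized distance transform (Felzenszwalb & Huttenlocher), computable in O(h) rather than O(h^2). A 1-D sketch under that quadratic-cost assumption (2-D maps are handled by transforming rows, then columns):

```python
import numpy as np

def dt1d(f, a=1.0):
    """Lower envelope: d[p] = min_q f[q] + a*(p - q)^2, in O(n)."""
    n = len(f)
    v = np.zeros(n, dtype=int)           # parabola roots in the envelope
    z = np.empty(n + 1); z[0], z[1] = -np.inf, np.inf
    k = 0
    for q in range(1, n):
        while True:                      # intersect with rightmost parabola
            s = ((f[q] + a*q*q) - (f[v[k]] + a*v[k]*v[k])) / (2*a*(q - v[k]))
            if s <= z[k]:
                k -= 1                   # rightmost parabola is dominated
            else:
                break
        k += 1
        v[k], z[k], z[k+1] = q, s, np.inf
    d = np.empty(n)
    k = 0
    for p in range(n):
        while z[k+1] < p:
            k += 1
        d[p] = a*(p - v[k])**2 + f[v[k]]
    return d

def transform_response(m, a=1.0):
    """max_q m[q] - a*(p - q)^2, via the min transform of -m."""
    return -dt1d(-np.asarray(m, dtype=float), a)

m = np.array([0., 3., 0., 0., 5., 0.])   # part filter responses
print(transform_response(m))              # -> [2. 3. 2. 4. 5. 4.]
```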
What are the parts?

[Figures: learned sofa, car, bicycle, and bottle models, showing root filters, part filters, and deformation costs]

PASCAL VOC 2007 results (Table 1 of the paper): average precision per class for our system and the other competition entries. Our system ranks first in 10 out of 20 classes (a preliminary version ranked first in 6 classes in the official competition); e.g., bike .411, car .396, person .267.
General philosophy: enrich models to better represent the data

[Figure 4 of the paper: models learned from the PASCAL VOC 2007 dataset, showing the total energy in each orientation of the root and part filters, with part filters placed at the center of their allowable displacements; spatial models show "cheap" placements as bright and "expensive" placements as dark]

Mixture models

Data driven: aspect, occlusion modes, subclasses

[Figures: car components 1-3, with parts initialized by interpolating the root filter to twice its resolution, and parts after training with LSVM or WL-SSVM]

FMR CVPR '08: AP = 0.27 (person)

FGMR PAMI '10: AP = 0.36 (person)
Pushmi–pullyu?

Good generalization properties on Doctor Dolittle's farm

( [left-facing horse template] + [right-facing horse template] ) / 2 = [two-headed "pushmi-pullyu" template]

This was supposed to detect horses
Latent orientation

Unsupervised left/right orientation discovery


horse AP
0.42

0.47

0.57

FGMR PAMI ’10: AP = 0.36 (person)


voc-release5: AP = 0.45 (person)
Publicly available code for the whole system: voc-release5 (current release)
Summary of results

PASCAL person detection AP:

[DT'05]  AP 0.12

[FMR'08]  AP 0.27

[FGMR'10]  AP 0.36

[GFM voc-release5]  AP 0.45

[GFM'11]  AP 0.49
Part 2: DPM parameter learning

fixed model structure: component 1 and component 2, with all filters, deformation costs, and biases unknown ("?")

training images with labels y = +1 (object) and y = -1 (background)

Parameters to learn:
– biases (per component)
– deformation costs (per part)
– filter weights
Linear parameterization

z = (p_1, ..., p_n)

score(x, p_0) = max_{p_1,...,p_n} [ Σ_{i=0}^{n} m_i(x, p_i) - Σ_{i=1}^{n} d_i(p_0, p_i) ]

                 Filter scores           Spring costs

Filter scores:  m_i(x, p_i) = w_i · φ(x, p_i)

Spring costs:  d_i(p_0, p_i) = d_i · (Δx², Δy², Δx, Δy)

score(x, p_0) = max_z w · Φ(x, (p_0, z))
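What makes this "linear" is that, once a placement is fixed, the score collapses to one dot product: stack the per-filter appearance features and the negated deformation features (Δx², Δy², Δx, Δy) into Φ, and stack the filters and deformation weights into w. A toy illustration (dimensions and values are made up):

```python
import numpy as np

def joint_feature(phis, offsets):
    """Phi(x, (p0, z)) for a star model: appearance features for the
    root and each part, then negated deformation features per part,
    so that score = w . Phi with w = (w_0, ..., w_n, d_1, ..., d_n)."""
    appearance = [np.ravel(f) for f in phis]
    deformation = [-np.array([dx*dx, dy*dy, dx, dy], dtype=float)
                   for dx, dy in offsets]
    return np.concatenate(appearance + deformation)

rng = np.random.default_rng(0)
phis = [rng.standard_normal(5) for _ in range(3)]    # phi(x, p_i): root + 2 parts
offsets = [(1, 0), (-2, 1)]                          # part displacements from anchors
w = np.concatenate([rng.standard_normal(15),         # filters w_0, w_1, w_2
                    np.abs(rng.standard_normal(8))]) # deformation weights d_1, d_2
print(w @ joint_feature(phis, offsets))              # filter scores minus spring costs
```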
Positive examples (y = +1)

x specifies an image and a bounding box (e.g., a person)

We want

f_w(x) = max_{z ∈ Z(x)} w · Φ(x, z)

to score >= +1

Z(x) includes all placements z with more than 70% overlap with the ground truth
Negative examples (y = -1)

x specifies an image and a HOG pyramid location p_0

We want

f_w(x) = max_{z ∈ Z(x)} w · Φ(x, z)

to score <= -1

Z(x) restricts the root to p_0 and allows any placement of the other filters
Typical dataset

300 – 8,000 positive examples

500 million to 1 billion negative examples


(not including latent configurations!)

Large-scale*

*unless someone from Google is here


How we learn parameters: latent SVM

E(w) = (1/2)||w||² + C Σ_i max{0, 1 - y_i f_w(x_i)}

E(w) = (1/2)||w||² + C Σ_{i ∈ pos} max{0, 1 - max_{z ∈ Z(x_i)} w · Φ(x_i, z)}
                   + C Σ_{i ∈ neg} max{0, 1 + max_{z ∈ Z(x_i)} w · Φ(x_i, z)}

[Plots, as a function of w: the "+ score" term is a max over linear functions of w (one per z_1, ..., z_4), hence convex; the "- score" term is a min over linear functions, hence concave :( ]
Observations

[Plots repeated: "- score" is concave :( ; "+ score" is convex]

Latent SVM objective is convex in the negatives, but not in the positives

>> "semi-convex"
Convex upper bound on loss

- score term:  max{0, 1 - max_{z ∈ Z(x_i)} w · Φ(x_i, z)}   [concave in w]

Fix ẑ_i = argmax_{z ∈ Z(x_i)} w · Φ(x_i, z) at the current w:

max{0, 1 - w · Φ(x_i, ẑ_i)}   convex

[Plot: the fixed-ẑ loss upper-bounds the latent loss and touches it at the current w]
Auxiliary objective

Let Z_P = {ẑ_1, ẑ_2, ...} (one fixed latent value per positive example)

E(w, Z_P) = (1/2)||w||² + C Σ_{i ∈ pos} max{0, 1 - w · Φ(x_i, ẑ_i)}
                        + C Σ_{i ∈ neg} max{0, 1 + max_{z ∈ Z(x_i)} w · Φ(x_i, z)}

Note that min_{Z_P} E(w, Z_P) = E(w)

and min_{w, Z_P} E(w, Z_P) = min_w E(w)
Auxiliary objective

min_{w, Z_P} E(w, Z_P) = min_w E(w)

This isn't any easier to optimize

Find a stationary point by coordinate descent on E(w, Z_P)

Initialization: either by picking a w^(0) (or a Z_P)

Step 1:
ẑ_i = argmax_{z ∈ Z(x_i)} w^(i) · Φ(x_i, z)   for all i ∈ pos

Step 2:
w^(i+1) = argmin_w E(w, Z_P)
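A schematic of this coordinate descent loop; `detect_best_z` (Step 1, detection restricted to Z(x)) and `train_convex` (Step 2, the convex solve) are placeholder callables, not part of any real API:

```python
def lsvm_coordinate_descent(w0, positives, negatives,
                            detect_best_z, train_convex, num_iters=10):
    """Alternate the two steps; neither step can increase E(w, Z_P),
    so the auxiliary objective decreases monotonically."""
    w = w0
    for _ in range(num_iters):
        # Step 1: z_hat_i = argmax_{z in Z(x_i)} w . Phi(x_i, z)
        z_hat = [detect_best_z(w, x) for x in positives]
        # Step 2: w = argmin_w E(w, Z_P) with the positives' z fixed
        w = train_convex(positives, z_hat, negatives)
    return w
```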
Step 1

ẑ_i = argmax_{z ∈ Z(x_i)} w^(i) · Φ(x_i, z)   for all i ∈ pos

This is just detection: run the current model on each positive example, restricted to Z(x_i), and keep the highest-scoring placement.

[Same matching pipeline figure as "Detection in a slide"]
Step 2

min_w (1/2)||w||² + C Σ_{i ∈ pos} max{0, 1 - w · Φ(x_i, ẑ_i)}
                  + C Σ_{i ∈ neg} max{0, 1 + max_{z ∈ Z(x_i)} w · Φ(x_i, z)}

Convex

Similar to a structural SVM

But, recall 500 million to 1 billion negative examples!

Can be solved by a working set method (sketched below)

– "bootstrapping"
– "data mining"
– "constraint generation"
– requires a bit of engineering to make this fast
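A sketch of the working set idea for Step 2, again with placeholder callables: mine margin violators from negative images, retrain on the cache, shrink the cache, repeat. The real implementation adds feature caching, memory limits, and convergence checks:

```python
import numpy as np

def train_with_mining(pos_feats, pos_z, neg_images, mine_hard, train_svm,
                      rounds=5):
    """mine_hard(w, img) -> Phi vectors of windows with w . Phi >= -1
    (hard negatives); train_svm(pos, neg) -> w minimizing the convex
    objective on the current working set."""
    cache = []                                    # working set of negatives
    w = train_svm((pos_feats, pos_z), cache)      # e.g., seeded with random windows
    for _ in range(rounds):
        for img in neg_images:
            cache.extend(mine_hard(w, img))       # grow: add margin violators
        w = train_svm((pos_feats, pos_z), cache)  # retrain on the cache
        cache = [f for f in cache if w @ f >= -1] # shrink: drop easy negatives
    return w
```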
Comments

Latent SVM is mathematically equivalent to MI-SVM (Andrews et al. NIPS 2003)

[Figure: latent labels z_1, z_2, z_3 for x_i correspond to a bag of instances x_i1, x_i2, x_i3]

Latent SVM can be written as a latent structural SVM (Yu and Joachims ICML 2009)
– natural optimization algorithm is the concave-convex procedure (CCCP)
– similar to, but not exactly the same as, coordinate descent
What about the model structure?

fixed model structure (components 1 and 2, filters unknown); training images with labels y = +1 / -1

Model structure
– # components
– # parts per component
– root and part filter shapes
– part anchor locations
Learning model structure

Split positives by aspect ratio

Warp to common size

Train Dalal & Triggs model for each aspect ratio on its own
Learning model structure

Use D&T filters as initial w for LSVM training

Merge components

Root filter placement and component choice are latent


Learning model structure

Add parts to cover high-energy areas of root filters

Continue training model with LSVM


Learning model structure

without orientation clustering

with orientation clustering


Learning model structure

In summary
– repeated application of LSVM training to models of increasing complexity
– structure learning involves many heuristics (and vision insight!)
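Putting Part 2 together, a schematic outline of the whole training recipe from the preceding slides; every helper here (split_by_aspect, warp, train_root, merge, add_parts, lsvm_train) is a named placeholder for a step described above, not actual voc-release5 code:

```python
def train_dpm(positives, negative_images, helpers, n_components=2):
    """Structure initialization + repeated LSVM training, schematically."""
    h = helpers
    # 1. Split positives by bounding-box aspect ratio; warp each group
    #    to a common size; train a Dalal & Triggs root filter per group.
    groups = h.split_by_aspect(positives, n_components)
    roots = [h.train_root(h.warp(g)) for g in groups]
    # 2. Merge into one mixture model; root placement and component
    #    choice become latent variables; train with LSVM.
    model = h.merge(roots)
    model = h.lsvm_train(model, positives, negative_images)
    # 3. Add parts over high-energy regions of each root filter and
    #    continue LSVM training on the richer model.
    model = h.add_parts(model)
    return h.lsvm_train(model, positives, negative_images)
```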