
Deformable part models

Ross Girshick
UC Berkeley
CS231B Stanford University
Guest Lecture, April 16, 2013
Image understanding
photo by thomas pix, http://www.flickr.com/photos/thomaspix/2591427106
Snack time in the lab
What objects are where?
I see twinkies!
robot: I see a table with twinkies,
pretzels, fruit, and some mysterious
chocolate things...
DPM lecture overview
Figure 6. Our HOG detectors cue mainly on silhouette contours (especially the head, shoulders and feet). The most active blocks are centred on the image background just outside the contour. (a) The average gradient image over the training examples. (b) Each pixel shows the maximum positive SVM weight in the block centred on the pixel. (c) Likewise for the negative SVM weights. (d) A test image. (e) Its computed R-HOG descriptor. (f, g) The R-HOG descriptor weighted by respectively the positive and the negative SVM weights.
...would help to improve the detection results in more general situations.
Acknowledgments. This work was supported by the European Union research projects ACEMEDIA and PASCAL. We thank Cordelia Schmid for many useful comments. SVMLight [10] provided reliable training of large-scale SVMs.
References
[1] S. Belongie, J. Malik, and J. Puzicha. Matching shapes. The 8th ICCV, Vancouver, Canada, pages 454-461, 2001.
[2] V. de Poortere, J. Cant, B. Van den Bosch, J. de Prins, F. Fransens, and L. Van Gool. Efficient pedestrian detection: a test case for SVM based categorization. Workshop on Cognitive Vision, 2002. Available online: http://www.vision.ethz.ch/cogvis02/.
[3] P. Felzenszwalb and D. Huttenlocher. Efficient matching of pictorial structures. CVPR, Hilton Head Island, South Carolina, USA, pages 66-75, 2000.
[4] W. T. Freeman and M. Roth. Orientation histograms for hand gesture recognition. Intl. Workshop on Automatic Face- and Gesture-Recognition, IEEE Computer Society, Zurich, Switzerland, pages 296-301, June 1995.
[5] W. T. Freeman, K. Tanaka, J. Ohta, and K. Kyuma. Computer vision for computer games. 2nd International Conference on Automatic Face and Gesture Recognition, Killington, VT, USA, pages 100-105, October 1996.
[6] D. M. Gavrila. The visual analysis of human movement: A survey. CVIU, 73(1):82-98, 1999.
[7] D. M. Gavrila, J. Giebel, and S. Munder. Vision-based pedestrian detection: the PROTECTOR+ system. Proc. of the IEEE Intelligent Vehicles Symposium, Parma, Italy, 2004.
[8] D. M. Gavrila and V. Philomin. Real-time object detection for smart vehicles. CVPR, Fort Collins, Colorado, USA, pages 87-93, 1999.
[9] S. Ioffe and D. A. Forsyth. Probabilistic methods for finding people. IJCV, 43(1):45-68, 2001.
[10] T. Joachims. Making large-scale SVM learning practical. In B. Schölkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods - Support Vector Learning. The MIT Press, Cambridge, MA, USA, 1999.
[11] Y. Ke and R. Sukthankar. PCA-SIFT: A more distinctive representation for local image descriptors. CVPR, Washington, DC, USA, pages 66-75, 2004.
[12] D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91-110, 2004.
[13] R. K. McConnell. Method of and apparatus for pattern recognition, January 1986. U.S. Patent No. 4,567,610.
[14] K. Mikolajczyk and C. Schmid. A performance evaluation of local descriptors. PAMI, 2004. Accepted.
[15] K. Mikolajczyk and C. Schmid. Scale and affine invariant interest point detectors. IJCV, 60(1):63-86, 2004.
[16] K. Mikolajczyk, C. Schmid, and A. Zisserman. Human detection based on a probabilistic assembly of robust part detectors. The 8th ECCV, Prague, Czech Republic, volume I, pages 69-81, 2004.
[17] A. Mohan, C. Papageorgiou, and T. Poggio. Example-based object detection in images by components. PAMI, 23(4):349-361, April 2001.
[18] C. Papageorgiou and T. Poggio. A trainable system for object detection. IJCV, 38(1):15-33, 2000.
[19] R. Ronfard, C. Schmid, and B. Triggs. Learning to parse pictures of people. The 7th ECCV, Copenhagen, Denmark, volume IV, pages 700-714, 2002.
[20] H. Schneiderman and T. Kanade. Object detection using the statistics of parts. IJCV, 56(3):151-177, 2004.
[21] E. L. Schwartz. Spatial mapping in the primate sensory projection: analytic structure and relevance to perception. Biological Cybernetics, 25(4):181-194, 1977.
[22] P. Viola, M. J. Jones, and D. Snow. Detecting pedestrians using patterns of motion and appearance. The 9th ICCV, Nice, France, volume 1, pages 734-741, 2003.
A Discriminatively Trained, Multiscale, Deformable Part Model
Pedro Felzenszwalb
University of Chicago
pff@cs.uchicago.edu
David McAllester
Toyota Technological Institute at Chicago
mcallester@tti-c.org
Deva Ramanan
UC Irvine
dramanan@ics.uci.edu
Abstract
This paper describes a discriminatively trained, multiscale, deformable part model for object detection. Our system achieves a two-fold improvement in average precision over the best performance in the 2006 PASCAL person detection challenge. It also outperforms the best results in the 2007 challenge in ten out of twenty categories. The system relies heavily on deformable parts. While deformable part models have become quite popular, their value had not been demonstrated on difficult benchmarks such as the PASCAL challenge. Our system also relies heavily on new methods for discriminative training. We combine a margin-sensitive approach for data mining hard negative examples with a formalism we call latent SVM. A latent SVM, like a hidden CRF, leads to a non-convex training problem. However, a latent SVM is semi-convex and the training problem becomes convex once latent information is specified for the positive examples. We believe that our training methods will eventually make possible the effective use of more latent information such as hierarchical (grammar) models and models involving latent three dimensional pose.
1. Introduction
We consider the problem of detecting and localizing objects of a generic category, such as people or cars, in static images. We have developed a new multiscale deformable part model for solving this problem. The models are trained using a discriminative procedure that only requires bounding box labels for the positive examples. Using these models we implemented a detection system that is both highly efficient and accurate, processing an image in about 2 seconds and achieving recognition rates that are significantly better than previous systems.
Our system achieves a two-fold improvement in average precision over the winning system [5] in the 2006 PASCAL person detection challenge. The system also outperforms the best results in the 2007 challenge in ten out of twenty object categories. Figure 1 shows an example detection obtained with our person model.
(This material is based upon work supported by the National Science Foundation under Grants No. 0534820 and 0535174.)
Figure 1. Example detection obtained with the person model. The model is defined by a coarse template, several higher resolution part templates and a spatial model for the location of each part.
The notion that objects can be modeled by parts in a deformable configuration provides an elegant framework for representing object categories [1, 3, 6, 10, 12, 13, 15, 16, 22]. While these models are appealing from a conceptual point of view, it has been difficult to establish their value in practice. On difficult datasets, deformable models are often outperformed by conceptually weaker models such as rigid templates [5] or bag-of-features [23]. One of our main goals is to address this performance gap.
Our models include both a coarse global template covering an entire object and higher resolution part templates. The templates represent histogram of gradient features [5]. As in [14, 19, 21], we train models discriminatively. However, our system is semi-supervised, trained with a max-margin framework, and does not rely on feature detection. We also describe a simple and effective strategy for learning parts from weakly-labeled data. In contrast to computationally demanding approaches such as [4], we can learn a model in 3 hours on a single CPU.
Another contribution of our work is a new methodology for discriminative training. We generalize SVMs for handling latent variables such as part positions, and introduce a new method for data mining hard negative examples during training. We believe that handling partially labeled data is a significant issue in machine learning for computer vision. For example, the PASCAL dataset only specifies a bounding box for each positive example of an object.
Detection AP over time: 12% (2005), 27% (2008), 36% (2009), 45% (2010), 49% (2011)
Part 1: modeling
Part 2: learning
Formalizing the object detection task
Many possible ways; this one is popular:
Input: an image, plus the set of classes of interest (cat, dog, chair, cow, person, motorbike, car, ...)
Desired output: a bounding box around each object, labeled with its class (person, motorbike, ...)
Performance summary: Average Precision (AP); 0 is worst, 1 is perfect
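To make the AP metric concrete: rank all detections by score, walk down the list, and average the precision at each correct detection. A minimal sketch of this computation (the function name and toy labels are illustrative, not from the lecture):

```python
def average_precision(ranked_labels, num_positives):
    """AP for a ranked detection list: average the precision measured
    at each true positive (label 1 = correct detection, 0 = false positive)."""
    tp = 0
    ap = 0.0
    for rank, label in enumerate(ranked_labels, start=1):
        if label == 1:
            tp += 1
            ap += tp / rank  # precision at this recall point
    return ap / num_positives

# A perfect ranking scores 1.0; interleaved false positives lower it.
print(average_precision([1, 1, 0, 1, 0], num_positives=3))
```

The PASCAL evaluation additionally interpolates the precision-recall curve, but the averaging idea is the same.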
Benchmark datasets
PASCAL VOC 2005-2012
- 54k objects in 22k images
- 20 object classes
- annual competition
Reduction to binary classification
Figure 2. Some sample images from our new human detection database. The subjects are always upright, but with some partial occlusions
and a wide range of variations in pose, appearance, clothing, illumination and background.
probabilities to be distinguished more easily. We will often use miss rate at 10^-4 FPPW as a reference point for results. This is arbitrary but no more so than, e.g., Area Under ROC. In a multiscale detector it corresponds to a raw error rate of about 0.8 false positives per 640x480 image tested. (The full detector has an even lower false positive rate owing to non-maximum suppression.) Our DET curves are usually quite shallow, so even very small improvements in miss rate are equivalent to large gains in FPPW at constant miss rate. For example, for our default detector at 1e-4 FPPW, every 1% absolute (9% relative) reduction in miss rate is equivalent to reducing the FPPW at constant miss rate by a factor of 1.57.
5 Overview of Results
Before presenting our detailed implementation and performance analysis, we compare the overall performance of our final HOG detectors with that of some other existing methods. Detectors based on rectangular (R-HOG) or circular log-polar (C-HOG) blocks and linear or kernel SVM are compared with our implementations of the Haar wavelet, PCA-SIFT, and shape context approaches. Briefly, these approaches are as follows:
Generalized Haar Wavelets. This is an extended set of oriented Haar-like wavelets similar to (but better than) that used in [17]. The features are rectified responses from 9x9 and 12x12 oriented 1st- and 2nd-derivative box filters at 45 degree intervals and the corresponding 2nd-derivative xy filter.
PCA-SIFT. These descriptors are based on projecting gradient images onto a basis learned from training images using PCA [11]. Ke & Sukthankar found that they outperformed SIFT for keypoint based matching, but this is controversial [14]. Our implementation uses 16x16 blocks with the same derivative scale, overlap, etc. settings as our HOG descriptors. The PCA basis is calculated using positive training images.
Shape Contexts. The original Shape Contexts [1] used binary edge-presence voting into log-polar spaced bins, irrespective of edge orientation. We simulate this using our C-HOG descriptor (see below) with just 1 orientation bin. 16 angular and 3 radial intervals with inner radius 2 pixels and outer radius 8 pixels gave the best results. Both gradient-strength and edge-presence based voting were tested, with the edge threshold chosen automatically to maximize detection performance (the values selected were somewhat variable, in the region of 20-50 graylevels).
Results. Fig. 3 shows the performance of the various detectors on the MIT and INRIA data sets. The HOG-based detectors greatly outperform the wavelet, PCA-SIFT and Shape Context ones, giving near-perfect separation on the MIT test set and at least an order of magnitude reduction in FPPW on the INRIA one. Our Haar-like wavelets outperform MIT wavelets because we also use 2nd-order derivatives and contrast normalize the output vector. Fig. 3(a) also shows MIT's best parts-based and monolithic detectors (the points are interpolated from [17]); however, beware that an exact comparison is not possible as we do not know how the database in [17] was divided into training and test parts and the negative images used are not available. The performances of the final rectangular (R-HOG) and circular (C-HOG) detectors are very similar, with C-HOG having the slight edge. Augmenting R-HOG with primitive bar detectors (oriented 2nd derivatives, R2-HOG) doubles the feature dimension but further improves the performance (by 2% at 10^-4 FPPW). Replacing the linear SVM with a Gaussian kernel one improves performance by about 3% at 10^-4 FPPW, at the cost of much higher run times. Using binary edge voting (EC-HOG) instead of gradient magnitude weighted voting (C-HOG) decreases performance by 5% at 10^-4 FPPW, while omitting orientation information decreases it by much more, even if additional spatial or radial bins are added (by 33% at 10^-4 FPPW, for both edges (E-ShapeC) and gradients (G-ShapeC)). PCA-SIFT also performs poorly. One reason is that, in comparison to [11], many more (80 of 512) principal vectors have to be retained to capture the same proportion of the variance. This may be because the spatial registration is weaker when there is no keypoint detector.
6 Implementation and Performance Study
We now give details of our HOG implementations and systematically study the effects of the various choices on detector performance. (Footnote: We use the hard examples generated by linear R-HOG to train the kernel R-HOG detector, as kernel R-HOG generates so few false positives that its hard example set is too sparse to improve the generalization significantly.)
pos = { ... ... }
neg = { ... background patches ... }
Descriptor Cues
[figure: input image; average gradient; per-block positive and negative SVM weights; descriptors weighted by those weights]
- The most important cues are head, shoulder, and leg silhouettes
- Vertical gradients inside the person count as negative
- Of the overlapping blocks, those just outside the contour are the most important
Histograms of Oriented Gradients for Human Detection, p. 11/13
Sliding window detector: HOG features + linear SVM
Dalal & Triggs (CVPR 2005)
Sliding window detection
- Compute HOG of the whole image at multiple resolutions
- Score every subwindow of the feature pyramid
- Apply non-maxima suppression
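The three steps above can be sketched in a few lines; a hedged illustration rather than the authors' code (the brute-force correlation loop, array shapes, and IoU threshold are simplifications; a real detector loops score_map over every pyramid level):

```python
import numpy as np

def score_map(feat, w):
    """Dense sliding-window scores: correlate a HOG feature map (H x W x D)
    with a linear template w (h x w x D); one score per window position."""
    th, tw, _ = w.shape
    H, W, _ = feat.shape
    out = np.empty((H - th + 1, W - tw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(feat[y:y+th, x:x+tw] * w)  # dot product of w with the subwindow
    return out

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maxima suppression: repeatedly keep the highest-scoring
    box and discard the others that overlap it by more than iou_thresh."""
    boxes = np.asarray(boxes, dtype=float)
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i, rest = order[0], order[1:]
        keep.append(int(i))
        # intersection-over-union of box i with the remaining boxes
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.maximum(0.0, x2 - x1) * np.maximum(0.0, y2 - y1)
        iou = inter / (areas[i] + areas[rest] - inter)
        order = rest[iou <= iou_thresh]
    return keep
```

In practice the correlation is done with FFTs or vectorized convolution; the loop version only shows the semantics.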
Image pyramid -> HOG feature pyramid
score(p) = w . phi(p)   (p: a window position in the feature pyramid; phi(p): its HOG features; w: the learned template)
Detection
- number of locations p ~ 250,000 per image
- test set has ~ 5,000 images
- >> 1.3x10^9 windows to classify
- typically only ~ 1,000 true positive locations
Extremely unbalanced binary classification
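To make the imbalance concrete, and to hint at the standard remedy (the hard-negative mining the DPM paper's training uses), here is a small sketch; the scoring-function interface is a hypothetical stand-in:

```python
# Roughly 250,000 window locations per image over ~5,000 test images:
total_windows = 250_000 * 5_000
print(total_windows)          # 1250000000 candidate windows
print(1_000 / total_windows)  # positives are ~1 in 1.25 million windows

def mine_hard_negatives(score, negative_windows, margin=-1.0):
    """Keep only negatives the current model scores above the SVM margin;
    retraining on these 'hard' examples approximates training on all windows."""
    return [x for x in negative_windows if score(x) > margin]

# Toy usage: with identity scores, only negatives above the margin survive.
hard = mine_hard_negatives(lambda x: x, [-2.0, 0.5, -0.5])
```

Easy negatives contribute nothing to the SVM objective, so dropping them keeps training tractable without changing the solution much.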
Dalal & Triggs detector on INRIA
3.5 Overview of Results
[Figure 3.6: recall-precision curves for different descriptors on the INRIA person databases. (a) static set: Ker. R-HOG, Lin. R-HOG, Lin. R2-HOG, Wavelet, PCA-SIFT, Lin. E-ShapeC. (b) static+moving set: R-HOG + IMHmd, R-HOG, Wavelet.]
Fig. 3.6. The performance of selected detectors on the INRIA static (left) and static+moving (right) person data sets. For both of the data sets, the plots show the substantial overall gains obtained by using HOG features rather than other state-of-the-art descriptors. (a) Compares static HOG descriptors with other state-of-the-art descriptors on the INRIA static person data set. (b) Compares the combined static and motion HOG, the static HOG and the wavelet detectors on the combined INRIA static and moving person data set.
[2001] but also includes both 1st- and 2nd-order derivative filters at 45 degree intervals and the corresponding 2nd-derivative xy filter. It yields an AP of 0.53. Shape contexts based on edges (E-ShapeC) perform considerably worse with an AP of 0.25. However, Chapter 4 will show that generalised shape contexts [Mori and Malik 2003], which like standard shape contexts compute circular blocks with cells shaped over a log-polar grid, but which use both image gradients and orientation histograms as in R-HOG, give similar performance. This highlights the fact that orientation histograms are very effective at capturing the information needed for object recognition.
For the video sequences we compare our combined static and motion HOG, static HOG, and Haar wavelet detectors. The detectors were trained and tested on training and test portions of the combined INRIA static and moving person data set. Details on how the descriptors and the data sets were combined are presented in Chapter 6. Figure 3.6(b) summarises the results. The HOG-based detectors again significantly outperform the wavelet-based one, but surprisingly the combined static and motion HOG detector does not seem to offer a significant advantage over the static HOG one: the static detector gives an AP of 0.553 compared to 0.527 for the motion detector. These results are surprising and disappointing because Sect. 6.5.2, where we used DET curves (cf. Sect. B.1) for evaluations, shows that for exactly the same data set, the individual window classifier for the motion detector gives significantly better performance than the static HOG window classifier, with false positive rates about one order of magnitude lower than those for the static HOG classifier. We are not sure what is causing this anomaly and are currently investigating it. It seems to be linked to the threshold used for truncating the scores in the mean shift fusion stage (during non-maximum suppression) of the combined detector.
- AP = 75% (79% in my implementation)
- Very good
- Declare victory and go home?

Dalal & Triggs on PASCAL VOC 2007
AP = 12% (using my implementation)
How can we do better?
Revisit an old idea: part-based models (pictorial structures)
- Fischler & Elschlager '73, Felzenszwalb & Huttenlocher '00
Combine with modern features and machine learning

Part-based models
- Parts: local appearance templates
- Springs: spatial connections between parts (geom. prior)
Image: [Felzenszwalb and Huttenlocher 05]
Part-based models
- Local appearance is easier to model than the global appearance
  - Training data is shared across deformations
  - A part can be local or global depending on resolution
- Generalizes to previously unseen configurations
General formulation
z = (p_1, . . . , p_n): part locations in the image (or feature pyramid)
Part configuration score function
score(p_1, . . . , p_n) = Sum_i m_i(p_i) - Sum_{(i,j) in E} d_ij(p_i, p_j)
where the m_i(p_i) are part match scores and the d_ij(p_i, p_j) are spring costs between connected parts.
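Given concrete match and spring functions, the score of a single configuration is direct to evaluate; a toy 1-D sketch (the quadratic spring and the function signatures are illustrative, not the lecture's notation):

```python
def config_score(locs, match_score, spring_cost, edges):
    """Pictorial-structures score of one configuration (p_1, ..., p_n):
    sum of part match scores minus sum of spring costs over the edges."""
    total = sum(match_score(i, p) for i, p in enumerate(locs))
    total -= sum(spring_cost(i, j, locs[i], locs[j]) for (i, j) in edges)
    return total

# Toy 1-D example: two parts with preferred locations 3 and 5,
# connected by one spring with a quadratic cost around rest length 2.
match = lambda i, p: -abs(p - (3, 5)[i])
spring = lambda i, j, pi, pj: 0.1 * (pj - pi - 2) ** 2
```

With locs = (3, 5) both match terms and the spring cost are zero, the ideal configuration; any displacement lowers the score.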
Highest scoring configurations
Part configuration score function
- Objective: maximize the score over p_1, ..., p_n
- h^n configurations! (h = |P|, about 250,000)
- Dynamic programming:
  - If G = (V, E) is a tree: O(n h^2) general algorithm
  - O(n h) with some restrictions on the d_ij
score(p_1, . . . , p_n) = Sum_i m_i(p_i) - Sum_{(i,j) in E} d_ij(p_i, p_j)
(part match scores minus spring costs)
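For a star-shaped tree (the DPM case), the maximization decomposes: once the root location is fixed, each part is optimized independently, which is the O(n h^2) algorithm above. A brute-force sketch over a tiny 1-D location set (all functions are toy stand-ins, not real match scores):

```python
def best_star_config(locations, m, d, n_parts):
    """O(n * h^2) search for a star model rooted at part 0:
    score(p0) = m(0, p0) + sum over parts i of max_pi [ m(i, pi) - d(p0, pi) ]."""
    best_score, best_root = float("-inf"), None
    for p0 in locations:                 # h choices of root location
        total = m(0, p0)
        for i in range(1, n_parts):      # each part optimized independently
            total += max(m(i, pi) - d(p0, pi) for pi in locations)
        if total > best_score:
            best_score, best_root = total, p0
    return best_score, best_root

# Toy 1-D example: part i prefers location i; springs penalize distance to the root.
m = lambda i, p: 2.0 if (i == 0 and p == 0) else (1.0 if p == i else 0.0)
d = lambda p0, pi: 0.6 * abs(p0 - pi)
```

With quadratic d_ij, the inner max over pi can be computed for all p0 at once with a generalized distance transform, giving the O(n h) variant the slide mentions.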
Star-structured deformable part models
[figure: test image, star model, detection; the model has a root and parts]
Recall the Dalal & Triggs detector
- HOG feature pyramid
- Linear filter / sliding-window detector
- SVM training to learn parameters w
Image pyramid -> HOG feature pyramid
score(p) = w . phi(p)
D&T + parts
Add parts to the Dalal & Triggs detector:
- HOG features
- Linear filters / sliding-window detector
- Discriminative training
[FMR CVPR08]
[FGMR PAMI10]
Sliding window DPM score function
[Figure: image pyramid and its HOG feature pyramid; root filter at location p_0, latent placement z]
A Discriminatively Trained, Multiscale, Deformable Part Model
Pedro Felzenszwalb
University of Chicago
pff@cs.uchicago.edu
David McAllester
Toyota Technological Institute at Chicago
mcallester@tti-c.org
Deva Ramanan
UC Irvine
dramanan@ics.uci.edu
Abstract
This paper describes a discriminatively trained, multi-
scale, deformable part model for object detection. Our sys-
tem achieves a two-fold improvement in average precision
over the best performance in the 2006 PASCAL person de-
tection challenge. It also outperforms the best results in the
2007 challenge in ten out of twenty categories. The system
relies heavily on deformable parts. While deformable part
models have become quite popular, their value had not been
demonstrated on difficult benchmarks such as the PASCAL
challenge. Our system also relies heavily on new methods
for discriminative training. We combine a margin-sensitive
approach for data mining hard negative examples with a
formalism we call latent SVM. A latent SVM, like a hid-
den CRF, leads to a non-convex training problem. How-
ever, a latent SVM is semi-convex and the training prob-
lem becomes convex once latent information is specified for
the positive examples. We believe that our training meth-
ods will eventually make possible the effective use of more
latent information such as hierarchical (grammar) models
and models involving latent three dimensional pose.
1. Introduction
We consider the problem of detecting and localizing ob-
jects of a generic category, such as people or cars, in static
images. We have developed a new multiscale deformable
part model for solving this problem. The models are trained
using a discriminative procedure that only requires bound-
ing box labels for the positive examples. Using these mod-
els we implemented a detection system that is both highly
efficient and accurate, processing an image in about 2 sec-
onds and achieving recognition rates that are significantly
better than previous systems.
Our system achieves a two-fold improvement in average
precision over the winning system [5] in the 2006 PASCAL
person detection challenge. The system also outperforms
the best results in the 2007 challenge in ten out of twenty
This material is based upon work supported by the National Science
Foundation under Grant No. 0534820 and 0535174.
Figure 1. Example detection obtained with the person model. The
model is defined by a coarse template, several higher resolution
part templates and a spatial model for the location of each part.
object categories. Figure 1 shows an example detection ob-
tained with our person model.
The notion that objects can be modeled by parts in a de-
formable configuration provides an elegant framework for
representing object categories [1, 3, 6, 10, 12, 13, 15, 16, 22].
While these models are appealing from a conceptual point
of view, it has been difficult to establish their value in prac-
tice. On difficult datasets, deformable models are often out-
performed by conceptually weaker models such as rigid
templates [5] or bag-of-features [23]. One of our main goals
is to address this performance gap.
Our models include both a coarse global template cov-
ering an entire object and higher resolution part templates.
The templates represent histogram of gradient features [5].
As in [14, 19, 21], we train models discriminatively. How-
ever, our system is semi-supervised, trained with a max-
margin framework, and does not rely on feature detection.
We also describe a simple and effective strategy for learn-
ing parts from weakly-labeled data. In contrast to computa-
tionally demanding approaches such as [4], we can learn a
model in 3 hours on a single CPU.
Another contribution of our work is a new methodology
for discriminative training. We generalize SVMs for han-
dling latent variables such as part positions, and introduce a
new method for data mining hard negative examples dur-
ing training. We believe that handling partially labeled data
is a significant issue in machine learning for computer vi-
sion. For example, the PASCAL dataset only specifies a
Sliding window DPM score function

    z = (p_0, ..., p_n)
    score(x, p_0) = max_{p_1,...,p_n} score(x, z)

(filter scores minus spring costs)
[Figure: image pyramid and its HOG feature pyramid; root filter at p_0]
Detection in a slide
[Figure: the test image's feature map (and the feature map at 2x resolution) is cross-correlated with the root filter and the part filters (1-st ... n-th); the part filter responses are transformed (spread by the deformation cost) and summed with the root filter response, giving detection scores for each root location; color encodes low to high filter response values]
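The "transformed responses" step above is a generalized distance transform: a part detected near a location can still contribute to a root placement at that location, minus a quadratic displacement cost. The sketch below is illustrative (naive O(n^2) transform, no part anchors or 2x-resolution offset; `dt_1d`, `spread_2d`, `root_scores`, and the cost `d` are made-up names), not the released DPM code:

```python
import numpy as np

def dt_1d(resp, d):
    # Generalized distance transform (naive O(n^2) version):
    # out[p] = max_q resp[q] - d * (p - q)^2
    # A strong part response at q still helps nearby p, at a quadratic cost.
    n = len(resp)
    q = np.arange(n)
    out = np.empty(n)
    for p in range(n):
        out[p] = np.max(resp - d * (p - q) ** 2)
    return out

def spread_2d(resp, d):
    # The quadratic cost is separable: transform rows, then columns.
    rows = np.stack([dt_1d(r, d) for r in resp])
    return np.stack([dt_1d(c, d) for c in rows.T]).T

def root_scores(root_resp, part_resps, d=0.5):
    # Detection score at each root location: root filter response plus the
    # distance-transformed part responses (anchor shifts omitted for brevity).
    total = root_resp.copy()
    for resp in part_resps:
        total += spread_2d(resp, d)
    return total
```

With the linear-time distance transform of Felzenszwalb and Huttenlocher this step costs O(n) per part, which is what makes dense sliding-window evaluation of all part placements tractable.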
What are the parts?
Aspect soup
General philosophy: enrich models to better represent the data
aero bike bird boat bottle bus car cat chair cow table dog horse mbike person plant sheep sofa train tv
Our rank 3 1 2 1 1 2 2 4 1 1 1 4 2 2 1 1 2 1 4 1
Our score .180 .411 .092 .098 .249 .349 .396 .110 .155 .165 .110 .062 .301 .337 .267 .140 .141 .156 .206 .336
Darmstadt .301
INRIA Normal .092 .246 .012 .002 .068 .197 .265 .018 .097 .039 .017 .016 .225 .153 .121 .093 .002 .102 .157 .242
INRIA Plus .136 .287 .041 .025 .077 .279 .294 .132 .106 .127 .067 .071 .335 .249 .092 .072 .011 .092 .242 .275
IRISA .281 .318 .026 .097 .119 .289 .227 .221 .175 .253
MPI Center .060 .110 .028 .031 .000 .164 .172 .208 .002 .044 .049 .141 .198 .170 .091 .004 .091 .034 .237 .051
MPI ESSOL .152 .157 .098 .016 .001 .186 .120 .240 .007 .061 .098 .162 .034 .208 .117 .002 .046 .147 .110 .054
Oxford .262 .409 .393 .432 .375 .334
TKK .186 .078 .043 .072 .002 .116 .184 .050 .028 .100 .086 .126 .186 .135 .061 .019 .036 .058 .067 .090
Table 1. PASCAL VOC 2007 results. Average precision scores of our system and other systems that entered the competition [7]. Empty
boxes indicate that a method was not tested in the corresponding class. The best score in each class is shown in bold. Our current system
ranks first in 10 out of 20 classes. A preliminary version of our system ranked first in 6 classes in the official competition.
Bottle
Car
Bicycle
Sofa
Figure 4. Some models learned from the PASCAL VOC 2007 dataset. We show the total energy in each orientation of the HOG cells in
the root and part filters, with the part filters placed at the center of the allowable displacements. We also show the spatial model for each
part, where bright values represent cheap placements, and dark values represent expensive placements.
in the PASCAL competition was .16, obtained using a rigid template model of HOG features [5]. The best previous result of .19 adds a segmentation-based verification step [20]. Figure 6 summarizes the performance of several models we trained. Our root-only model is equivalent to the model from [5] and it scores slightly higher at .18. Performance jumps to .24 when the model is trained with an LSVM that selects a latent position and scale for each positive example. This suggests LSVMs are useful even for rigid templates because they allow for self-adjustment of the detection window in the training examples. Adding deformable parts increases performance to .34 AP, a factor of two above the best previous score. Finally, we trained a model with parts but no root filter and obtained .29 AP. This illustrates the advantage of using a multiscale representation.
We also investigated the effect of the spatial model and allowable deformations on the 2006 person dataset. Recall that s_i is the allowable displacement of a part, measured in HOG cells. We trained a rigid model with high-resolution parts by setting s_i to 0. This model outperforms the root-only system by .27 to .24. If we increase the amount of allowable displacements without using a deformation cost, we start to approach a bag-of-features. Performance peaks at s_i = 1, suggesting it is useful to constrain the part displacements. The optimal strategy allows for larger displacements while using an explicit deformation cost. The follow-
Mixture models
Data driven: aspect, occlusion modes, subclasses
FMR CVPR 08: AP = 0.27 (person)
FGMR PAMI 10: AP = 0.36 (person)
(a) Car component 1 (initial parts)
(b) Car component 1 (trained parts)
(c) Car component 2 (initial parts)
(d) Car component 2 (trained parts)
(e) Car component 3 (initial parts)
(f) Car component 3 (trained parts)
Figure 4.3: Car components with parts initialized by interpolating the root filter to twice its
resolution (a,c,e), and parts after training with LSVM or WL-SSVM (b,d,f).
Pushmi-pullyu?
Good generalization properties on Doctor Dolittle's farm
This was supposed to detect horses
( + ) / 2 =
Latent orientation
Unsupervised left/right orientation discovery
FGMR PAMI 10: AP = 0.36 (person)
voc-release5: AP = 0.45 (person)
Publicly available code for the whole system: current voc-release5
[Bar chart: horse AP 0.42, 0.47, 0.57]
Summary of results
[DT05] AP 0.12
[FMR08] AP 0.27
[FGMR10] AP 0.36
[GFM voc-release5] AP 0.45
[GFM11] AP 0.49
Part 2: DPM parameter learning
[Figure: fixed model structure: two components with unknown filter weights (?); training images labeled y = +1 and y = -1]
Parameters to learn:
biases (per component)
deformation costs (per part)
filter weights
Linear parameterization

    z = (p_0, ..., p_n)
    score(x, p_0) = max_{p_1,...,p_n} score(x, z)

    score(x, z) = Σ_i w_i · φ(x, p_i)                   [filter scores]
                - Σ_i d_i · (dx_i, dy_i, dx_i², dy_i²)   [spring costs]

so for a fixed placement z the score is linear in the parameters w = (w_0, ..., w_n, d_1, ..., d_n):

    score(x, z) = w · Φ(x, z)
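The point of the linear parameterization is that, once z is fixed, the whole score collapses to a single dot product between concatenated parameters and concatenated features. A small numerical check (dimensions, features, and displacements are made up for illustration; the deformation-feature order (dx, dy, dx², dy²) follows the convention above):

```python
import numpy as np

rng = np.random.default_rng(0)
n_parts, fdim = 2, 5

# Model parameters: one weight vector per filter (root + parts),
# one deformation-cost vector per part.
w_filters = [rng.normal(size=fdim) for _ in range(n_parts + 1)]
d_costs = [rng.uniform(0.1, 1.0, size=4) for _ in range(n_parts)]

# A placement z: appearance features under each placed filter, plus each
# part's displacement (dx, dy) from its anchor.
feats = [rng.normal(size=fdim) for _ in range(n_parts + 1)]
disp = [(1, -2), (0, 3)]

def def_feats(dx, dy):
    return np.array([dx, dy, dx * dx, dy * dy])

def score_direct():
    # filter scores minus spring costs
    s = sum(wf @ f for wf, f in zip(w_filters, feats))
    s -= sum(di @ def_feats(dx, dy) for di, (dx, dy) in zip(d_costs, disp))
    return s

def score_linear():
    # the same score as one dot product: score(x, z) = w . Phi(x, z)
    w = np.concatenate(w_filters + d_costs)
    phi = np.concatenate(feats + [-def_feats(dx, dy) for dx, dy in disp])
    return w @ phi

assert np.isclose(score_direct(), score_linear())
```

Note that the deformation features enter Φ negated, so that positive deformation costs d_i subtract from the score.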
Positive examples (y = +1)
We want

    f_w(x) = max_{z ∈ Z(x)} w · Φ(x, z)

to score >= +1
Z(x) includes all z with more than 70% overlap
with ground truth
x specifies an image and bounding box
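The "more than 70% overlap" membership test for Z(x) can be made concrete. The sketch below assumes intersection-over-union as the overlap measure (the release code's exact definition may differ), with boxes given as (x1, y1, x2, y2):

```python
def iou(a, b):
    # intersection-over-union of two boxes (x1, y1, x2, y2)
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def valid_latent(z_box, gt_box, thresh=0.7):
    # Z(x) membership: the root placement implied by z must overlap the
    # ground-truth box by more than the threshold
    return iou(z_box, gt_box) > thresh
```

This is what keeps the latent search on positives honest: the detector may re-localize the window, but only within placements consistent with the annotated box.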
Negative examples (y = -1)
x specifies an image and a HOG pyramid location p_0
We want

    f_w(x) = max_{z ∈ Z(x)} w · Φ(x, z)

to score <= -1
Z(x) restricts the root to p_0 and allows any
placement of the other filters
Typical dataset
300 to 8,000 positive examples
500 million to 1 billion negative examples
(not including latent configurations!)
Large-scale*
*unless someone from Google is here
How we learn parameters: latent SVM

    L(w) = (1/2)||w||² + C Σ_i max{0, 1 - y_i f_w(x_i)}

splitting the positive and negative examples:

    L(w) = (1/2)||w||²
         + C Σ_{i ∈ pos} max{0, 1 - max_{z ∈ Z(x_i)} w · Φ(x_i, z)}
         + C Σ_{i ∈ neg} max{0, 1 + max_{z ∈ Z(x_i)} w · Φ(x_i, z)}
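As a sanity check, the objective as written can be evaluated directly on toy data. Here `f_w` takes the max over a list of candidate Φ(x, z) vectors standing in for Z(x); all names and numbers are illustrative:

```python
import numpy as np

def f_w(w, Phis):
    # f_w(x) = max over latent placements z in Z(x) of w . Phi(x, z)
    return max(w @ phi for phi in Phis)

def lsvm_objective(w, pos, neg, C=1.0):
    # pos, neg: one list of candidate Phi(x, z) vectors per example
    reg = 0.5 * w @ w
    hinge_pos = sum(max(0.0, 1.0 - f_w(w, Phis)) for Phis in pos)
    hinge_neg = sum(max(0.0, 1.0 + f_w(w, Phis)) for Phis in neg)
    return reg + C * (hinge_pos + hinge_neg)

w = np.array([1.0, 0.0])
pos = [[np.array([2.0, 0.0]), np.array([0.5, 1.0])]]    # best z scores 2.0
neg = [[np.array([-3.0, 0.0]), np.array([-0.5, 0.0])]]  # best z scores -0.5
print(lsvm_objective(w, pos, neg))  # 0.5 (reg) + 0 (pos) + 0.5 (neg) = 1.0
```

The inner max is what makes the positive hinge non-convex: raising the score of any single placement can lower the loss.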
How we learn parameters: latent SVM
Observations
[Plots: over candidate placements z_1, ..., z_4, the negative term max_z w · Φ(x, z) is a max of linear functions of w, hence convex; the positive term -max_z w · Φ(x, z) is concave :(]
Latent SVM objective is convex in the negatives
but not in the positives
>> semi-convex
Convex upper bound on loss
[Plots: at the current w, fixing the positive example's latent value to its best placement, Z_Pi = {z_2}, replaces the max over z_1, ..., z_4 by a single linear function]

    max{0, 1 - max_{z ∈ Z(x_i)} w · Φ(x_i, z)} <= max{0, 1 - w · Φ(x_i, z)}  for any fixed z ∈ Z(x_i)

and the right-hand side is convex in w
Auxiliary objective
Let Z_P = {Z_P1, Z_P2, ...} (a fixed latent value for each positive example)

    L(w, Z_P) = (1/2)||w||²
              + C Σ_{i ∈ pos} max{0, 1 - w · Φ(x_i, Z_Pi)}
              + C Σ_{i ∈ neg} max{0, 1 + max_{z ∈ Z(x_i)} w · Φ(x_i, z)}

Note that L(w, Z_P) >= min_{Z_P} L(w, Z_P) = L(w)
so min_{w, Z_P} L(w, Z_P) = min_w L(w), and a joint minimizer (w*, Z_P*) gives w* = argmin_w L(w)
Auxiliary objective
This isn't any easier to optimize
Find a stationary point by coordinate descent on L(w, Z_P)
Initialization: either by picking a w^(0) (or a Z_P)
Step 1: Z_Pi = argmax_{z ∈ Z(x_i)} w^(t) · Φ(x_i, z)
Step 2: w^(t+1) = argmin_w L(w, Z_P)
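The two steps can be sketched end to end on a toy problem. Step 2 below uses plain subgradient descent as a stand-in for a real convex solver, and all data, step sizes, and iteration counts are illustrative:

```python
import numpy as np

def relabel(w, positives):
    # Step 1: fix each positive's latent value to its highest-scoring placement
    return [max(Phis, key=lambda phi: float(w @ phi)) for Phis in positives]

def fit_fixed_latent(w, zp, negatives, C=1.0, iters=300, lr=0.01):
    # Step 2: minimize the convex fixed-latent objective by subgradient descent
    for _ in range(iters):
        g = w.copy()                        # gradient of (1/2)||w||^2
        for phi in zp:                      # positive hinge terms (latent fixed)
            if 1.0 - w @ phi > 0:
                g -= C * phi
        for Phis in negatives:              # negative hinge terms (max over z)
            phi = max(Phis, key=lambda p: float(w @ p))
            if 1.0 + w @ phi > 0:
                g += C * phi
        w = w - lr * g
    return w

positives = [[np.array([1.0, 0.0]), np.array([0.5, 0.0])]]
negatives = [[np.array([-1.0, 0.0])]]
w = np.zeros(2)
for _ in range(3):                          # coordinate descent rounds
    zp = relabel(w, positives)              # Step 1
    w = fit_fixed_latent(w, zp, negatives)  # Step 2
```

With an exact convex solver in Step 2, neither step can increase L(w, Z_P), so the procedure converges to a stationary point of the latent objective.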
Step 1
This is just detection:
[Figure: the detection pipeline from before (root and part filter responses, transformed part responses, summed detection scores) run on each positive example]

    Z_Pi = argmax_{z ∈ Z(x_i)} w^(t) · Φ(x_i, z)
Step 2

    min_w (1/2)||w||²
        + C Σ_{i ∈ pos} max{0, 1 - w · Φ(x_i, Z_Pi)}
        + C Σ_{i ∈ neg} max{0, 1 + max_{z ∈ Z(x_i)} w · Φ(x_i, z)}

Convex
Similar to a structural SVM
But, recall 500 million to 1 billion negative examples!
Can be solved by a working set method
bootstrapping
data mining
constraint generation
requires a bit of engineering to make this fast
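A minimal sketch of the bootstrapping loop: optimize on a small cache of negatives, then rebuild the cache from margin violators found in the full pool. The trainer, cache sizes, data, and the `score >= -1` retention rule are illustrative of the approach, not the release code:

```python
import numpy as np

def fit(w, pos, neg, C=1.0, iters=300, lr=0.01):
    # toy hinge-loss trainer standing in for the convex solver
    for _ in range(iters):
        g = w.copy()
        for x in pos:
            if 1.0 - w @ x > 0:
                g -= C * x
        for x in neg:
            if 1.0 + w @ x > 0:
                g += C * x
        w = w - lr * g
    return w

rng = np.random.default_rng(0)
pos = [np.array([1.0, 0.3]), np.array([0.8, -0.2])]
pool = [rng.normal(-1.5, 0.5, size=2) for _ in range(1000)]  # "all" negatives

w = np.zeros(2)
cache = pool[:10]                # small working set of negatives
for _ in range(5):
    w = fit(w, pos, cache)       # optimize on the cache only
    # data mining: keep only hard negatives (margin violators, score >= -1);
    # easy negatives are dropped, new violators are pulled from the full pool
    hard = [x for x in pool if w @ x >= -1.0]
    cache = hard[:200] if hard else pool[:10]
```

The key property (which the real system relies on) is that a solution optimal on the cache plus all margin violators is optimal on the full set, so mining converges without ever holding a billion examples in memory.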
Comments
Latent SVM is mathematically equivalent to MI-SVM (Andrews et al.
NIPS 2003)
Latent SVM can be written as a latent structural SVM (Yu and
Joachims ICML 2009)
natural optimization algorithm is concave-convex procedure
similar to, but not exactly the same as, coordinate descent
[Figure: multiple-instance view: a bag of instances x_i1, x_i2, x_i3 for example x_i, corresponding to latent labels z_1, z_2, z_3 for x_i]
What about the model structure?
[Figure: fixed model structure: two components with unknown filter weights (?); training images labeled y = +1 and y = -1]
Model structure
# components
# parts per component
root and part filter shapes
part anchor locations
Learning model structure
Split positives by aspect ratio
Warp to common size
Train Dalal & Triggs model for each aspect ratio on its own
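The aspect-ratio split can be sketched as follows. This is a simplification (the released code chooses component count and split points from the data; equal-sized groups over sorted aspect ratios are assumed here), with boxes as (x1, y1, x2, y2):

```python
def aspect(box):
    # width / height of a bounding box (x1, y1, x2, y2)
    x1, y1, x2, y2 = box
    return (x2 - x1) / (y2 - y1)

def split_by_aspect(boxes, n_components=2):
    # sort positives by aspect ratio and cut into equal-sized groups;
    # each group seeds one mixture component
    keyed = sorted(boxes, key=aspect)
    k, n = len(keyed), n_components
    return [keyed[i * k // n:(i + 1) * k // n] for i in range(n)]
```

Each group is then warped to a common size and used to train one Dalal and Triggs style root filter, which later becomes that component's initialization for LSVM training.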
Learning model structure
Use D&T filters as initial w for LSVM training
Merge components
Root filter placement and component choice are latent
Learning model structure
Add parts to cover high-energy areas of root filters
Continue training model with LSVM
Learning model structure
without orientation clustering
with orientation clustering
Learning model structure
In summary
repeated application of LSVM training to models of increasing complexity
structure learning involves many heuristics (and vision insight!)