
Multimodal Templates for Real-Time Detection of Texture-less Objects in Heavily Cluttered Scenes

Stefan Hinterstoisser, Stefan Holzer, Cedric Cagniart, Slobodan Ilic, Kurt Konolige, Nassir Navab, Vincent Lepetit

Department of Computer Science, CAMP, Technische Universität München (TUM), Germany
Willow Garage, Menlo Park, CA, USA

IEEE International Conference on Computer Vision (ICCV) 2011


Outline
Goal & Challenges
Related Work
Modality Extraction
Image Cue
Depth Cue
Similarity Measure
Efficient Computation
Experiments
Goal & Challenges
Objects under different poses over a heavily cluttered background
Online learning
Real-time object learning and detection
Related Work
Approaches to multi-view 3D object detection fall into two main categories:
Learning-based methods
Template matching
Learning-based methods:
Require a large amount of training data
Require a long offline training phase
Learning a new object is expensive
Related Work
Template matching:
Better adapted to low-textured objects than feature-point approaches
Templates for new objects are easy to add
Direct matching is too slow for real-time use
Others:
Matching in range data:
Requires a full 3D CAD model of the object
Outline
Goal & Challenges
Related Work
Modality Extraction
Image Cue
Depth Cue
Similarity Measure
Efficient Computation
Experiments
Modality Extraction - Image Cue
Image Cue:
Image gradients have proven to be discriminant and robust to illumination changes and noise.
Using the normalized gradients, and not their magnitudes, makes the measure robust to contrast changes.
We compute the normalized gradients on each color channel of the input RGB image.
For an input image $\mathcal{I}$, the gradient map at location $x$ takes the orientation of the channel with the largest gradient norm:
$\mathcal{I}_g(x) = \operatorname{ori}\big(\hat{C}(x)\big), \quad \hat{C}(x) = \arg\max_{C \in \{R,G,B\}} \left\| \frac{\partial C}{\partial x} \right\|$
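As a rough NumPy sketch of this step (our own code and names, not the authors' implementation): at every pixel, keep the orientation of the color channel with the largest gradient magnitude.

import numpy as np

def color_gradient_orientation(rgb):
    # rgb: H x W x 3 image
    gy, gx = np.gradient(rgb.astype(float), axis=(0, 1))  # per-channel derivatives
    mag = np.hypot(gx, gy)                       # gradient norms, H x W x 3
    best = np.argmax(mag, axis=2)                # channel with the largest norm
    i, j = np.indices(best.shape)
    ori = np.arctan2(gy[i, j, best], gx[i, j, best])
    return ori % np.pi, mag[i, j, best]          # orientation without direction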
Modality Extraction - Image Cue
Keep only the gradients whose norms are larger than a threshold.
Assign to each location the gradient whose quantized orientation occurs most often in a 3×3 neighborhood.
The similarity measure $f_g$ compares the quantized orientations:
$f_g\big(\mathcal{O}_g(r), \mathcal{I}_g(t)\big) = \big|\cos\big(\mathcal{O}_g(r) - \mathcal{I}_g(t)\big)\big|$
$\mathcal{O}_g(r)$: the normalized gradient map of the reference image at location $r$
$\mathcal{I}_g(t)$: the normalized gradient map of the input image at location $t$
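A hedged sketch of the quantization and of $f_g$ (the bin count n0 = 8, the magnitude threshold, and the helper names are our assumptions):

import numpy as np

def quantize_orientations(ori, mag, n0=8, tau=10.0):
    # Bin orientations from [0, pi) into n0 bins; drop weak gradients
    q = np.floor(ori / np.pi * n0).astype(int) % n0
    q[mag < tau] = -1
    out = np.full_like(q, -1)
    H, W = q.shape
    for y in range(1, H - 1):                    # dominant bin in each 3x3 patch
        for x in range(1, W - 1):
            patch = q[y - 1:y + 2, x - 1:x + 2]
            vals, counts = np.unique(patch[patch >= 0], return_counts=True)
            if vals.size:
                out[y, x] = vals[np.argmax(counts)]
    return out

def f_g(o_r, i_t, n0=8):
    # |cos| of the angle between two quantized orientations (bin width pi/n0)
    return abs(np.cos((o_r - i_t) * np.pi / n0))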
Modality Extraction - Image Cue

[Figure: quantizing the gradient orientations; input color image; gradient image computed on the gray image vs. gradient image computed with our approach.]
Modality Extraction - Depth Cue
Depth Cue:
We use a standard camera and an aligned depth sensor to obtain the depth map.
We use quantized surface normals, computed on a dense depth field, for our template representation.
Consider the first-order Taylor expansion of the depth function $D(x)$:
$D(x + dx) - D(x) \approx dx^\top \nabla D$
Within a patch defined around $x$, each pixel offset $dx$ yields one such equation.
Modality Extraction - Depth Cue

We estimate the optimal gradient $\hat{\nabla}D$ in the least-squares sense.
This depth gradient corresponds to a tangent plane going through three points $X$, $X_1$ and $X_2$:
$X = \vec{v}(x)\,D(x), \quad X_1 = \vec{v}\big(x + [1,0]^\top\big)\big(D(x) + [1,0]\,\hat{\nabla}D\big), \quad X_2 = \vec{v}\big(x + [0,1]^\top\big)\big(D(x) + [0,1]\,\hat{\nabla}D\big)$
$\vec{v}(x)$: the vector along the line of sight that goes through pixel $x$ (obtained from the parameters of the depth sensor)
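A minimal least-squares sketch of this gradient estimate (our own code; the patch size is an assumption):

import numpy as np

def depth_gradient(D, y, x, half=2):
    # Stack one equation D(x+dx) - D(x) ~= dx . grad per offset in the patch
    A, b = [], []
    for dy in range(-half, half + 1):
        for dx in range(-half, half + 1):
            if dy == 0 and dx == 0:
                continue
            A.append([dx, dy])
            b.append(D[y + dy, x + dx] - D[y, x])
    g, *_ = np.linalg.lstsq(np.asarray(A, float), np.asarray(b, float), rcond=None)
    return g  # least-squares estimate (dD/dx, dD/dy)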
Modality Extraction - Depth Cue
The normal to the surface can be estimated as the normalized cross-product of $X_1 - X$ and $X_2 - X$.
Within a patch defined around $x$, this would not be robust around occluding contours.
[Figure: depth-sensor geometry showing $D(x)$, the point $X$, the tangent plane, and the normal at $X$ along $+Z$.]
Inspired by bilateral filtering, we ignore the pixels whose depth difference with the central pixel $X$ is above a threshold.
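Continuing the sketch above (the pinhole back-projection and intrinsics f, cy, cx are our assumptions; per the bilateral-filtering remark, offsets whose depth difference exceeds the threshold would simply be skipped when stacking the least-squares equations):

import numpy as np

def line_of_sight(y, x, f, cy, cx):
    # v(x) under an assumed pinhole model (f, cy, cx: sensor intrinsics)
    return np.array([(x - cx) / f, (y - cy) / f, 1.0])

def surface_normal(D, y, x, f, cy, cx, gx, gy):
    # gx, gy: least-squares depth gradient from the previous sketch
    X  = line_of_sight(y, x, f, cy, cx) * D[y, x]
    X1 = line_of_sight(y, x + 1, f, cy, cx) * (D[y, x] + gx)
    X2 = line_of_sight(y + 1, x, f, cy, cx) * (D[y, x] + gy)
    n = np.cross(X1 - X, X2 - X)  # normal of the tangent plane
    return n / np.linalg.norm(n)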
Modality Extraction - Depth Cue
Quantize the normal directions into $n_0$ bins.
Assign to each location the quantized value that occurs most often in a 5×5 neighborhood.
The similarity measure $f_D$ compares the surface normals:
$f_D\big(\mathcal{O}_D(r), \mathcal{I}_D(t)\big) = \mathcal{O}_D(r)^\top \mathcal{I}_D(t)$
$\mathcal{O}_D(r)$: the normalized surface normal of the reference image at location $r$
$\mathcal{I}_D(t)$: the normalized surface normal of the input image at location $t$
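In code, the quantization and the dot-product similarity could look as follows (the bin directions are our own choice; the 5×5 mode filter is the same idea as in the gradient case and is omitted here):

import numpy as np

def quantize_normal(n, bin_dirs):
    # bin_dirs: n0 x 3 array of unit direction vectors (assumed bin set)
    return int(np.argmax(bin_dirs @ n))  # index of the closest direction

def f_D(n_ref, n_input):
    return float(np.dot(n_ref, n_input))  # both assumed unit length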
Modality Extraction - Depth Cue
[Figure: quantizing the surface normals. Input image, the corresponding depth image, and the surface normals computed with our approach.]
Details are clearly visible and depth discontinuities are well handled.
Outline
Goal & Challenges
Related Work
Modality Extraction
Image Cue
Depth Cue
Similarity Measure
Efficient Computation
Experiments
Similarity Measure
We define a template as $T = (\{\mathcal{O}_m\}_{m \in M}, \mathcal{P})$.
$\mathcal{P}$: a list of pairs $(r, m)$ made of the locations $r$ of the discriminant features in modality $m$.
Each template is created by extracting, for each modality $m$, a set of its most discriminant features into $\mathcal{P}$.
[Figure: a template with pairs such as $(r_k, \text{surface normals})$ and $(r_i, \text{gradients})$; each feature location $r$ is recorded with respect to the object center $C$.]
Similarity Measure
The object similarity (energy) function:
$\mathcal{E}(\mathcal{I}, T, c) = \sum_{(r,m) \in \mathcal{P}} \max_{t \in R(c+r)} f_m\big(\mathcal{O}_m(r), \mathcal{I}_m(t)\big)$
$T = (\{\mathcal{O}_m\}_{m \in M}, \mathcal{P})$
$c$: the detected location (e.g. the object center)
$R(c+r) = \big[c+r-\tfrac{N}{2},\, c+r+\tfrac{N}{2}\big] \times \big[c+r-\tfrac{N}{2},\, c+r+\tfrac{N}{2}\big]$, with $N$ constant
(the neighborhood of size $N$ centered on $c+r$ in $\mathcal{I}_m$)
$f_m\big(\mathcal{O}_m(r), \mathcal{I}_m(t)\big)$: computes the similarity score for modality $m$
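A direct, unoptimized sketch of this measure (the template layout of (offset, modality, reference value) triples is our assumption, and locations are assumed far enough from the image border):

def energy(template, modal_maps, f, c, N=5):
    # template: list of ((ry, rx), m, o) triples; modal_maps[m]: 2D map of
    # quantized values; f[m]: similarity between two quantized values
    cy, cx = c
    h = N // 2
    score = 0.0
    for (ry, rx), m, o in template:
        y, x = cy + ry, cx + rx
        patch = modal_maps[m][y - h:y + h + 1, x - h:x + h + 1]
        score += max(f[m](o, t) for t in patch.ravel())
    return score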
Efficient Computation
We first quantize the input data of each modality into a small number $n_0$ of values.
We use a lookup table $\tau_{i,m}$ for the energy response:
$\tau_{i,m}[\mathcal{L}_m] = \max_{l \in \mathcal{L}_m} f_m(i, l)$
$i$: the index of the quantized value of modality $m$ (we also use $i$ to denote the corresponding value)
$\mathcal{L}_m$: the list of values of a specific modality $m$ appearing in a local neighborhood of a location in the input $\mathcal{I}$
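Since $f_m$ depends only on two quantized values, the tables can be precomputed once per modality (a sketch with our own names):

import numpy as np

def build_tau(f_m, n0):
    # tau[i][j] = f_m(i, j) for every pair of quantized values
    return np.array([[f_m(i, j) for j in range(n0)] for i in range(n0)])

def tau_response(tau_m, i, L):
    # L: quantized values appearing in the local neighborhood
    return max((tau_m[i][l] for l in L), default=0.0)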
Efficient Computation
Following [11], we spread the data over its neighborhood to obtain a robust representation $\mathcal{J}_m$ instead of $\mathcal{L}_m$.
For each quantized value of a modality $m$ with index $i$, we can now precompute the response at each location $c$:
$S_{i,m}(c) = \tau_{i,m}\big[\mathcal{J}_m(c)\big]$
$\tau_{i,m}$: the precomputed lookup table, indexed by $\mathcal{J}_m$

[11] S. Hinterstoisser, C. Cagniart, S. Ilic, P. Sturm, P. Fua, N. Navab, and V. Lepetit. Gradient response maps for real-time detection of texture-less objects. PAMI, under revision.
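One way to realize the spreading and the response maps (the per-pixel bitmask encoding is our assumption and requires $n_0 \le 16$; [11] uses a similar binary representation):

import numpy as np

def spread(q, T=3):
    # J[y, x]: bitmask of all quantized values (>= 0) within a T x T window
    H, W = q.shape
    J = np.zeros((H, W), np.uint16)
    h = T // 2
    for y in range(H):
        for x in range(W):
            win = q[max(0, y - h):y + h + 1, max(0, x - h):x + h + 1]
            for v in np.unique(win[win >= 0]):
                J[y, x] |= 1 << int(v)
    return J

def response_map(J, tau_m, i, n0):
    # S_{i,m}: best table entry for each distinct bitmask
    S = np.zeros(J.shape, np.float32)
    for mask in np.unique(J):
        vals = [tau_m[i][l] for l in range(n0) if (int(mask) >> l) & 1]
        S[J == mask] = max(vals) if vals else 0.0
    return S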
Efficient Computation
Finally, the similarity measure becomes:
$\mathcal{E}(\mathcal{I}, T, c) = \sum_{(r,m) \in \mathcal{P}} S_{\mathcal{O}_m(r),\, m}(c + r)$
Since the maps $S_{i,m}$ are shared between the templates, matching several templates against the input image can be done very fast once they are computed.
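With the response maps in hand, scoring one template is just a sum of array reads (same assumed layout as the sketches above, with S_maps[m][o] the precomputed map $S_{o,m}$):

def fast_energy(template, S_maps, c):
    cy, cx = c
    return sum(S_maps[m][o][cy + ry, cx + rx]
               for (ry, rx), m, o in template)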
Experiments
LINE-MOD: our approach (intensity & depth)
LINE-2D: introduced in [11] (uses only intensity)
LINE-3D: uses only the depth map
Hardware:
One processor of a standard notebook with an Intel Centrino Core2Duo at 2.4 GHz and 3 GB of RAM.
Test data:
Six object sequences of 2000 real images each.
Each sequence presents illumination and large viewpoint changes over a heavily cluttered background.
Experiments
Robustness:
A threshold of about 80 separates almost all true positives from the false positives for LINE-MOD.
Experiments
Speed:
Learning new templates only requires extracting and storing features, which is almost instantaneous.
Templates cover 360 degrees of tilt rotation, 90 degrees of inclination rotation, in-plane rotations of 80 degrees, and scale changes from 1.0 to 2.0.
We parse a 640×480 image with over 3000 templates of 126 features each at about 10 fps (real-time).
The runtime of LINE-MOD depends only on the number of features and is independent of the object/template size.
Experiments
Occlusion:

[Figure: true positive rates under occlusion. Right: average recognition score for the six objects with respect to occlusion.]
Even with over 30% occlusion, our method is still able to recognize objects.
Experiments

[Detection results: Cup, Toy-Car, Hole punch]
Experiments

[Detection results: Toy-Monkey, Toy-Duck, Camera]
Experiments

$\text{True positive rate} = \frac{\#\text{true positives}}{\#\text{true positives} + \#\text{false negatives}} \times 100\%$
$\text{False positive rate} = \frac{\#\text{false positives}}{\#\text{false positives} + \#\text{true negatives}} \times 100\%$
