Computer Vision - ECCV '92
Second European Conference on Computer Vision
Santa Margherita Ligure, Italy, May 19-22, 1992
Proceedings
Springer-Verlag
Berlin Heidelberg New York
London Paris Tokyo
Hong Kong Barcelona
Budapest
Series Editors
Gerhard Goos
Universität Karlsruhe
Postfach 69 80
Vincenz-Priessnitz-Straße 1
W-7500 Karlsruhe, FRG

Juris Hartmanis
Cornell University
Department of Computer Science
5149 Upson Hall
Ithaca, NY 14853, USA
Volume Editor
Giulio Sandini
Dept. of Communication, Computer, and Systems Science, University of Genova
Via Opera Pia, 11A, I-16145 Genova, Italy
This work is subject to copyright. All rights are reserved, whether the whole or part of
the material is concerned, specifically the rights of translation, reprinting, re-use of
illustrations, recitation, broadcasting, reproduction on microfilms or in any other way,
and storage in data banks. Duplication of this publication or parts thereof is permitted
only under the provisions of the German Copyright Law of September 9, 1965, in its
current version, and permission for use must always be obtained from Springer-Verlag.
Violations are liable for prosecution under the German Copyright Law.
© Springer-Verlag Berlin Heidelberg 1992
Printed in Germany
Typesetting: Camera ready by author
Printing and binding: Druckhaus Beltz, Hemsbach/Bergstr.
45/3140-543210 - Printed on acid-free paper
Foreword
This volume collects the papers accepted for presentation at the Second European Con-
ference on Computer Vision held in Santa Margherita Ligure, Italy, May 19 - 22, 1992.
The selection of the papers has been extremely difficult because of the high number
and excellent quality of the papers submitted. I wish to thank my friends on the Pro-
gramme Committee who, with the help of other qualified referees, have done a tremendous
job in reviewing the papers at short notice.
In order to maintain a single-track conference, and to keep this book within a rea-
sonable size, 16 long papers, 41 short papers and 48 posters have been selected from the
308 submissions, and structured into 14 sections reflecting the major research topics in
computer vision currently investigated worldwide.
I personally would like to thank all the authors for the quality of their work and for
their collaboration in keeping the length of the papers within the limits requested (we
all know how painful it is to delete lines after they have been printed). Particular thanks
to those who submitted papers from outside Europe for the credit given to the Euro-
pean computer vision community. Their contribution has been fundamental in outlining
the current state of the art in computer vision and in producing further evidence of the
maturity reached by this important research field. Thanks to ESPRIT and other collab-
orative projects the number of "transnational" (and even "transcontinental") papers is
increasing, with political implications which outlast, to some extent, the scientific ones.
It will be a major challenge for the computer vision community to take advantage of
the recent political changes worldwide in order to bring new ideas into this challenging
research field.
I wish to thank all those who contributed to making ECCV-92 a reality, in
particular Piera Ponta of Genova Ricerche, Therese Bricheteau and Cristine Juncker of
INRIA and Lorenza Luceti of DIST who helped in keeping things under control during
the hot phases of the preparation.
I give special thanks to Olivier Faugeras who, as the chairman of ECCV-90, estab-
lished the high standard of this conference, thus contributing significantly to attracting
so many good papers to ECCV-92.
Finally let me thank Anna, Pietro and Corrado for their extra patience during these
last months.
Chairperson
Giulio Sandini DIST, University of Genova
Board
Bernard Buxton GEC Marconi, Hirst Research Center
Olivier Faugeras INRIA - Sophia Antipolis
Goesta Granlund Linköping University
John Mayhew Sheffield University
Hans H. Nagel Karlsruhe University Fraunhofer Inst.
Programme Committee
Nicholas Ayache INRIA Rocquencourt
Andrew Blake Oxford University
Mike Brady Oxford University
Hans Burkhardt University Hamburg-Harburg
Hilary Buxton Queen Mary and Westfield College
James Crowley LIFIA - INPG, Grenoble
Rachid Deriche INRIA Sophia Antipolis
Ernest Dickmanns University München
Jan Olof Eklundh Royal Institute of Technology, Stockholm
David Hogg Leeds University
Jan Koenderink Utrecht State University
Hans Knutsson Linköping University
Roger Mohr LIFIA - INPG, Grenoble
Bernd Neumann Hamburg University
Carme Torras Institute of Cybernetics, Barcelona
Vincent Torre University of Genova
Video Proceedings:
Giovanni Garibotto Elsag Bailey S.p.a.
Experimental Sessions:
Massimo Tistarelli DIST, University of Genova
ESPRIT Day Organization:
Patrick Van Hove CEC, DG XIII
ESPRIT Workshops Coordination:
James L. Crowley LIFIA - INPG, Grenoble
Coordination:
Piera Ponta Consorzio Genova Ricerche
Cristine Juncker INRIA, Sophia-Antipolis
Therese Bricheteau INRIA
Lorenza Luceti DIST, University of Genova
Nicoletta Piccardo Eurojob, Genova
Referees
Amat J. Spain; Andersson M.T. Sweden; Aubert D. France; Ayache N. France;
Bårman H. Sweden; Bascle B. France; Basañez L. Spain; Bellissant C. France;
Benayoun S. France; Berger M.O. France; Bergholm F. Sweden; Berroir J.P. France;
Berthod M. France; Betsis D. Sweden; Beyer H. France; Blake A. U.K.;
Boissier O. France; Bouthemy P. France; Boyle R. U.K.; Brady M. U.K.;
Burkhardt H. Germany; Buxton B. U.K.; Buxton H. U.K.; Calean D. France;
Carlsson S. Sweden; Casals A. Spain; Castan S. France; Celaya E. Spain;
Chamley S. France; Chassery J.M. France; Chehikian A. France; Christensen H. France;
Cinquin Ph. France; Cohen I. France; Cohen L. France; Crowley J.L. France;
Curwen R. U.K.; Dagless E. France; Daniilidis K. Germany; De Micheli E. Italy;
Demazeau Y. France; Deriche R. France; Devillers O. France; Dhome M. France;
Dickmanns E. Germany; Dinten J.M. France; Dreschler-Fischer L. Germany;
Drewniok C. Germany; Eklundh J.O. Sweden; Faugeras O.D. France; Ferrari F. Italy;
Fossa M. Italy; Fua P. France; Gårding J. Sweden; Garibotto G. Italy;
Giraudon G. France; Gong S. U.K.; Granlund G. Sweden; Gros P. France;
Grosso E. Italy; Gueziec A. France; Haglund L. Sweden; Heitz F. France;
Hérault L. France; Herlin I.L. France; Hoehne H.H. Germany; Hogg D. U.K.;
Horaud R. France; Howarth R. U.K.; Hugog D. U.K.; Hummel R. France;
Inglebert C. France; Izuel M.J. Spain; Juvin D. France; Kittler J. U.K.;
Knutsson H. Sweden; Koenderink J. The Netherlands; Koller D. Germany;
Lange S. Germany; Lapreste J.T. France; Levy-Vehel J. France; Li M. Sweden;
Lindeberg T. Sweden; Lindsey P. U.K.; Ludwig K.-O. Germany; Luong T. France;
Lux A. France; Magrassi M. Italy; Malandain G. France; Martinez A. Spain;
Maybank S.J. France; Mayhew J. U.K.; Mazer E. France; McLauchlan P. U.K.;
Mesrabi M. France; Milford D. France; Moeller R. Germany; Mohr R. France;
Monga O. France; Montseny E. Spain; Morgan A. France; Morin L. France;
Nagel H.H. Germany; Nastar C. France; Navab N. France; Neumann B. Germany;
Neumann H. Germany; Nordberg K. Sweden; Nordström N. Sweden; Olofsson G. Sweden;
Pahlavan K. Sweden; Pampagnin L.H. France; Papadopoulo T. France;
Paternak B. Germany; Petrou M. France; Puget P. France; Quan L. France;
Radig B. Germany; Reid I. U.K.; Richetin M. France; Rives G. France;
Robert L. France; Sagerer G. Germany; Sandini G. Italy; Sanfeliu A. Spain;
Schroeder C. Germany; Seals B. France; Simmeth H. Germany; Sinclair D. U.K.;
Skordas Th. France; Sommer G. Germany; Sparr G. Sweden; Sprengel R. Germany;
Stein Th. von Germany; Stiehl H.S. Germany; Thirion J.P. France; Thomas B. France;
Thomas F. Spain; Thonnat M. France; Tistarelli M. Italy; Toal A.F. U.K.;
Torras C. Spain; Torre V. Italy; Travén H. Sweden; Uhlin T. Sweden; Usoh M. U.K.;
Veillon F. France; Verri A. Italy; Vieville T. France; Villanueva J.J. Spain;
Wahl F. Germany; Westelius C.J. Sweden; Westin C.F. Sweden; Wieske L. Germany;
Wiklund J. Sweden; Winroth H. Sweden; Wysocki J. U.K.; Zerubia J. France;
Zhang Z. France
Organization and Support
Organized by:
DIST, University of Genova
In Cooperation with:
Consorzio Genova Ricerche
INRIA - Sophia Antipolis
Commission of the European Communities, DGXIII - ESPRIT
Supported by:
C.N.R. Special Project on Robotics
European Vision Society
Corporate Sponsors
Digital Equipment Corporation - Italy
Sincon - Fase S.p.A. - Italy
Sun Microsystems - Italy
Contents
Features
Steerable-Scalable Kernels for Edge Detection and Junction Analysis . . . . . . . . . . . . . . . . 3
P.Perona
Families of Tuned Scale-Space Kernels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
L.J.Florack, B.M.ter Haar Romeny, J.J.Koenderink, M.A.Viergever
Contour Extraction by Mixture Density Description Obtained from Region
Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
M.Etoh, Y.Shirai, M.Asada
The Möbius Strip Parameterization for Line Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
C.-F. Westin, H.Knutsson
Edge Tracing in a priori Known Direction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
A.Nowak, A.Fiorek, T.Piascik
Features Extraction and Analysis Methods for Sequences of Ultrasound Images . . . . 43
I.L.Herlin, N.Ayache
Figure-Ground Discrimination by Mean Field Annealing . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
L.Hérault, R.Horaud
Deterministic Pseudo-Annealing: Optimization in Markov-Random-Fields
An Application to Pixel Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
M.Berthod, G. Giraudon, J.P.Stromboni
A Bayesian Multiple Hypothesis Approach to Contour Grouping . . . . . . . . . . . . . . . . . . . 72
I.J.Cox, J.M.Rehg, S.Hingorani
Detection of General Edges and Keypoints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
L.Rosenthaler, F.Heitger, O.Kübler, R.von der Heydt
Distributed Belief Revision for Adaptive Image Processing Regulation . . . . . . . . . . . . . . 87
V.Murino, M.F.Peri, C.S.Regazzoni
Finding Face Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
I.Craw, D.Tock, A.Bennett
Color
Detection of Specularity Using Color and Multiple Views . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
S. W.Lee, R.Bajcsy
Data and Model-Driven Selection Using Color Regions . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
T.F.Syeda-Mahmood
Recovering Shading from Color Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
B.V.Funt, M.S.Drew, M.Brockington
Texture and Shading
Shading Flows and Scenel Bundles: A New Approach to Shape from Shading . . . . . 135
P.Breton, L.A.Iverson, M.S.Langer, S.W.Zucker
Texture: Plus ça Change, . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
M. M. Fleck
Texture Parametrization Method for Image Segmentation . . . . . . . . . . . . . . . . . . . . . . . . 160
A.Casals, J.Amat, A.Grau
Texture Segmentation by Minimizing Vector-Valued Energy Functionals:
The Coupled-Membrane Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
T.S.Lee, D.Mumford, A.Yuille
Boundary Detection in Piecewise Homogeneous Textured Images . . . . . . . . . . . . . . . . . 174
S.Casadei, S.Mitter, P.Perona
Motion Estimation
Surface Orientation and Time to Contact from Image Divergence and Deformation .. 187
R. Cipolla, A.Blake
Robust and Fast Computation of Unbiased Intensity Derivatives in Images . . . . . . . . 203
T. Vieville, O.D.Faugeras
Testing Computational Theories of Motion Discontinuities:
A Psychophysical Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212
L.M. Vaina, N.M.Grzywacz
Motion and Structure Factorization and Segmentation of Long Multiple
Motion Image Sequences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217
C.Debrunner, N.Ahuja
Motion and Surface Recovery Using Curvature and Motion Consistency . . . . . . . . . . 222
G.Soucy, F.P.Ferrie
Finding Clusters and Planes from 3D Line Segments with Application to 3D
Motion Determination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
Z.Zhang, O.D.Faugeras
Hierarchical Model-Based Motion Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
J.R.Bergen, P.Anandan, K.J.Hanna, R.Hingorani
A Fast Method to Estimate Sensor Translation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253
V.Sundareswaran
Identifying Multiple Motions from Optical Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258
A.Rognone, M.Campani, A. Verri
A Fast Obstacle Detection Method Based on Optical Flow . . . . . . . . . . . . . . . . . . . . . . . 267
N.Ancona
A Parallel Implementation of a Structure-from-Motion Algorithm . . . . . . . . . . . . . . . . 272
H.Wang, C.Bowman, M.Brady, C.Harris
Structure from Motion Using the Ground Plane Constraint . . . . . . . . . . . . . . . . . . . . . . . 277
T.N.Tan, G.D.Sullivan, K.D.Baker
Detecting and Tracking Multiple Moving Objects Using Temporal Integration . . . . 282
M.Irani, B.Rousso, S.Peleg
Depth
Image Blurring Effects due to Depth Discontinuities:
Blurring that Creates Emergent Image Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 347
T.C.Nguyen, T.S.Huang
Ellipse Based Stereo Vision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 363
J.Buurman
Applying Two-Dimensional Delaunay Triangulation to Stereo Data Interpretation.. 368
E.Bruzzone, M.Cazzanti, L.De Floriani, F.Mangili
Local Stereoscopic Depth Estimation Using Ocular Stripe Maps . . . . . . . . . . . . . . . . . . 373
K.-O.Ludwig, H.Neumann, B.Neumann
Depth Computations from Polyhedral Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 378
G.Sparr
Parallel Algorithms for the Distance Transformation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 387
H.Embrechts, D.Roose
Stereo-motion
A Computational Framework for Determining Stereo Correspondence from
a Set of Linear Spatial Filters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 395
D. G. Jones, J.Malik
On Visual Ambiguities due to Transparency in Motion and Stereo . . . . . . . . . . . . . . . . 411
M.Shizawa
A Deterministic Approach for Stereo Disparity Calculation . . . . . . . . . . . . . . . . . . . . . . . 420
C.Chang, S.Chatterjee
Occlusions and Binocular Stereo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 425
D.Geiger, B.Ladendorf, A. Yuille
Tracking
Model-Based Object Tracking in Traffic Scenes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 437
D.Koller, K.Daniilidis, T.Thórhallsson, H.-H.Nagel
Tracking Moving Contours Using Energy-Minimizing Elastic Contour Models . . . . . 453
N.Ueda, K.Mase
Tracking Points on Deformable Objects Using Curvature Information . . . . . . . . . . . . . 458
I.Cohen, N.Ayache, P.Sulger
An Egomotion Algorithm Based on the Tracking of Arbitrary Curves . . . . . . . . . . . . . 467
E.Arbogast, R.Mohr
Region-Based Tracking in an Image Sequence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 476
F.Meyer, P.Bouthemy
Combining Intensity and Motion for Incremental Segmentation and Tracking
over Long Image Sequences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 485
M. J. Black
Active Vision
Active Egomotion: A Qualitative Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 497
Y.Aloimonos, Z.Duric
Active Perception Using DAM and Estimation Techniques . . . . . . . . . . . . . . . . . . . . . . . 511
W.Pölzleitner, H.Wechsler
Active-Dynamic Stereo for Navigation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 516
E. Grosso, M. Tistarelli, G.Sandini
Integrating Primary Ocular Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 526
K.Pahlavan, T. Uhlin, J.-O.Eklundh
Where to Look Next Using a Bayes Net: Incorporating Geometric Relations . . . . . . 542
R.D.Rimey, C.M.Brown
An Attentional Prototype for Early Vision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 551
S.M.Culhane, J.K.Tsotsos
Binocular Heads
What Can Be Seen in Three Dimensions with an Uncalibrated Stereo Rig? . . . . . . . 563
O.D.Faugeras
Estimation of Relative Camera Positions for Uncalibrated Cameras . . . . . . . . . . . . . . . 579
R.I.Hartley
Gaze Control for a Binocular Camera Head . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 588
J.L. Crowley, P.Bobet, M.Mesrabi
Recognition
Canonical Frames for Planar Object Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 757
C.A.Rothwell, A.Zisserman, D.A.Forsyth, J.L.Mundy
Measuring the Quality of Hypotheses in Model-Based Recognition . . . . . . . . . . . . . . . . 773
D.P.Huttenlocher, T.A.Cass
Using Automatically Constructed View-Independent Relational Model in
3D Object Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 778
S.Zhang, G.D.Sullivan, K.D.Baker
Learning to Recognize Faces from Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 787
S.Edelman, D.Reisfeld, Y. Yeshurun
Face Recognition Through Geometrical Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 792
R.Brunelli, T.Poggio
Fusion Through Interpretation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 801
M.J.L.Orr, J.Hallam, R.B.Fisher
3D Object Recognition Using Passively Sensed Range Data . . . . . . . . . . . . . . . . . . . . . . 806
K.M.Dawson, D. Vernon
Interpretation of Remotely Sensed Images in a Context of Multisensor Fusion . . . . 815
V.Clement, G.Giraudon, S.Houzelle
Limitations of Non Model-Based Recognition Schemes . . . . . . . . . . . . . . . . . . . . . . . . . . . 820
Y.Moses, S. Ullman
Constraints for Recognizing and Locating Curved 3D Objects from
Monocular Image Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 829
D.J.Kriegman, B. Vijayakumar, J.Ponce
Polynomial-Time Object Recognition in the Presence of Clutter, Occlusion,
and Uncertainty . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 834
T.A. Cass
Hierarchical Shape Recognition Based on 3D Multiresolution Analysis . . . . . . . . . . . . 843
S.Morita, T.Kawashima, Y.Aoki
Object Recognition by Flexible Template Matching Using Genetic Algorithms . . . . 852
A.Hill, C.J. Taylor, T.Cootes
Matching and Recognition of Road Networks from Aerial Images . . . . . . . . . . . . . . . . . 857
S.Z.Li, J.Kittler, M.Petrou
Applications
Intensity and Edge-Based Symmetry Detection Applied to Car-Following . . . . . . . . . 865
T.Zielke, M.Brauckmann, W.von Seelen
Indexicality and Dynamic Attention Control in Qualitative Recognition
of Assembly Actions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 874
Y.Kuniyoshi, H.Inoue
Real-Time Visual Tracking for Surveillance and Path Planning . . . . . . . . . . . . . . . . . . . 879
R.Curwen, A.Blake, A.Zisserman
Spatio-Temporal Reasoning Within a Traffic Surveillance System . . . . . . . . . . . . . . . . . 884
A.F. Toal, H.Buxton
Template Guided Visual Inspection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 893
A.Noble, V.D.Nguyen, C.Marinos, A.T.Tran, J.Farley, K.Hedengren, J.L.Mundy
Hardware Support for Fast Edge-Based Stereo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 902
P.Courtney, N.A. Thacker, C.R.Brown
1 Introduction
Points, lines, edges, textures and motions are present in almost all images of the
everyday world. These elementary visual structures often encode a large proportion of
the information contained in the image; moreover, they can be characterized using a
small set of locally defined parameters: position, orientation, characteristic size or scale,
phase, curvature, velocity. It is therefore reasonable to start visual computations with
measurements of these parameters. The earliest stage of visual processing, common to
all the classical early vision modules, could consist of a collection of operators that
calculate one or more dominant orientations, curvatures, scales and velocities at each
point of the image or, alternatively, assign an 'energy', or 'probability', value to points
of a position-orientation-phase-scale-etc. space. Ridges and local maxima of this energy
would mark loci of special interest such as edges and junctions. The idea that biological
visual systems might analyze images along dimensions such as orientation and scale
dates back to the work of Hubel and Wiesel [19, 18] in the 1960s. In the computational
vision literature the idea of analyzing images along multiple orientations appears at the
beginning of the seventies with the Binford-Horn line finder [17, 3] and later work by
Granlund [14].
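As a concrete illustration of such an operator bank (a minimal NumPy sketch, not code from the paper; the kernel family, sizes and σ are arbitrary choices), one can convolve an image with directional derivative-of-Gaussian filters at a few sampled orientations and read the per-pixel dominant orientation off the squared responses:

```python
import numpy as np

def oriented_kernel(size, sigma, theta):
    """First derivative of a Gaussian, taken along direction theta (radians)."""
    r = np.arange(size) - size // 2
    X, Y = np.meshgrid(r, r)
    u = X * np.cos(theta) + Y * np.sin(theta)   # coordinate along theta
    g = np.exp(-(X**2 + Y**2) / (2 * sigma**2))
    return -u / sigma**2 * g

def conv_same(img, k):
    """Circular 'same'-size convolution via the FFT (adequate for a sketch)."""
    H, W = img.shape
    kpad = np.zeros((H, W))
    kpad[:k.shape[0], :k.shape[1]] = k
    kpad = np.roll(kpad, (-(k.shape[0] // 2), -(k.shape[1] // 2)), axis=(0, 1))
    return np.real(np.fft.ifft2(np.fft.fft2(img) * np.fft.fft2(kpad)))

def orientation_energy(img, n_orient=8, size=9, sigma=1.5):
    """One 'energy' map per sampled orientation: the squared filter response."""
    thetas = np.linspace(0.0, np.pi, n_orient, endpoint=False)
    E = np.stack([conv_same(img, oriented_kernel(size, sigma, t)) ** 2
                  for t in thetas])
    return E, thetas

# Toy image: a vertical step edge. Its dominant orientation is theta = 0,
# i.e. the derivative taken along the x axis responds most strongly.
img = np.zeros((32, 32))
img[:, 16:] = 1.0
E, thetas = orientation_energy(img)
print(thetas[np.argmax(E[:, 16, 16])])  # 0.0
```

Local maxima of these energy maps over position and orientation would then mark candidate edges and junctions, in the spirit of the text above.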
A computational framework that may be used to perform this proto-visual analysis
is the convolution of the image with kernels of various shapes, orientations, phases,
elongations and scales. This approach is attractive because it is simple to describe, imple-
ment and analyze. It has been proposed and demonstrated for a variety of early vision
tasks [23, 22, 5, 1, 6, 15, 40, 30, 28, 31, 10, 26, 4, 41, 20, 21, 11, 36, 2]. Various 'general'
computational justifications have been proposed for basing visual processing on the out-
put of a rich set of linear filters: (a) Koenderink has argued that a structure of this type is
an adequate substrate for local geometrical computations [24] on the image brightness;
(b) Adelson and Bergen [2] have derived it from the 'first principle' that the visual system
computes derivatives of the image along the dimensions of wavelength, parallax, position
and time; (c) a third point of view is that of 'matched filtering', where the kernels are
synthesized to match the visual events that one looks for.
* This work was partially conducted while at MIT-LIDS with the Center for Intelligent Control
Systems sponsored by ARO grant DAAL 03-86-K-0171.
The kernels that have been proposed in the computational literature have typically
been chosen according to one or more of three classes of criteria: (a) 'generic optimality'
(e.g. optimal sampling of space-frequency space); (b) 'task optimality' (e.g. signal-to-
noise ratio, localization of edges); (c) emulation of biological mechanisms. While there is
no general consensus in the literature on precise kernel shapes, there is convergence on
kernels roughly shaped like either Gabor functions, or derivatives or differences of either
round or elongated Gaussian functions; all these functions have the advantage that they
can be specified and computed easily. A good rule of thumb in the choice of kernels
for early vision tasks is that they should have good localization in space and frequency,
and should be roughly tuned to the visual events that one wants to analyze.
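A small sketch of the first kernel family mentioned above (my illustration, with arbitrary parameter values): a complex Gabor kernel, whose Gaussian envelope localizes it in space while its spectrum is a bump centred on the tuning frequency rather than spread over all frequencies.

```python
import numpy as np

def gabor(size, sigma, freq, theta, phase=0.0):
    """Complex Gabor kernel: a Gaussian envelope times a complex sinusoid
    of spatial frequency `freq` (cycles/pixel) along direction `theta`."""
    r = np.arange(size) - size // 2
    X, Y = np.meshgrid(r, r)
    u = X * np.cos(theta) + Y * np.sin(theta)
    v = -X * np.sin(theta) + Y * np.cos(theta)
    envelope = np.exp(-(u**2 + v**2) / (2 * sigma**2))
    return envelope * np.exp(1j * (2 * np.pi * freq * u + phase))

N = 33
k = gabor(N, sigma=4.0, freq=0.125, theta=0.0)

# Localization in frequency: the spectrum of the complex kernel peaks near
# the tuning frequency (0.125, 0), up to the frequency-grid spacing 1/N.
spec = np.abs(np.fft.fftshift(np.fft.fft2(k)))
iy, ix = np.unravel_index(np.argmax(spec), spec.shape)
peak = ((ix - N // 2) / N, (iy - N // 2) / N)
print(peak)
```

The real and imaginary parts give the two phases (even/odd) per orientation that several of the filter banks cited below use.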
Since points, edges, lines, textures and motions can exist at all possible positions, orien-
tations, scales of resolution and curvatures, one would like to be able to use families of
filters that are tuned to all orientations, scales and positions. Therefore, once a particular
convolution kernel has been chosen, one would like to convolve the image with deformations
(rotations, scalings, stretchings, bendings etc.) of this 'template'. In reality one can afford
only a finite (and small) number of filtering operations, hence the common practice of
'sampling' the set of orientations, scales, positions, curvatures and phases 3. This operation
has the strong drawback of introducing anisotropies and algorithmic difficulties into the
computational implementations. It would be preferable to keep thinking in terms of a
continuum, of angles for example, and to be able to localize the orientation of an edge with
the maximum accuracy allowed by the filter one has chosen.
This aim may sometimes be achieved by means of interpolation: one convolves the
image with a small set of kernels, say at a number of discrete orientations, and obtains the
result of the convolution at any orientation by taking linear combinations of the results.
Since convolution is a linear operation, the interpolation problem may be formulated
in terms of the kernels (for the sake of simplicity the case of rotations in the plane is
discussed here): given a kernel F : R² → C, define the family of 'rotated' copies of F as
F_θ = F ∘ R_θ, θ ∈ S¹, where S¹ is the circle and R_θ is a rotation. Sometimes it is possible
to express F_θ as

    F_θ = Σ_{i=1}^{n} α_i(θ) G_i
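For the first derivative of an isotropic Gaussian this decomposition is exact with only n = 2 terms (the classic steerability identity: the derivative along direction θ equals cos θ times the x-derivative plus sin θ times the y-derivative). A quick numerical check of this, as a sketch with arbitrary size and σ:

```python
import numpy as np

def gauss_deriv(size, sigma, theta):
    """Derivative of an isotropic Gaussian along direction theta."""
    r = np.arange(size) - size // 2
    X, Y = np.meshgrid(r, r)
    g = np.exp(-(X**2 + Y**2) / (2 * sigma**2))
    return -(X * np.cos(theta) + Y * np.sin(theta)) / sigma**2 * g

theta = 0.7  # an arbitrary angle, not one of the basis orientations
rotated = gauss_deriv(15, 2.0, theta)
steered = (np.cos(theta) * gauss_deriv(15, 2.0, 0.0)
           + np.sin(theta) * gauss_deriv(15, 2.0, np.pi / 2))
# The linear combination of the two basis kernels reproduces the rotated
# kernel up to floating-point rounding: the family is exactly 2-steerable.
print(np.max(np.abs(rotated - steered)))
```

For more selective (elongated) kernels an exact finite sum generally does not exist, which is precisely the approximation problem the paper addresses.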
3 Motion flow computation using spatiotemporal filters has been proposed by Adelson and
Bergen [1] as a model of human vision and has been demonstrated by Heeger [15] (his
implementation had 12 discrete spatio-temporal orientations and 3 scales of resolution).
Work on texture with multiple-resolution multiple-orientation kernels is due to Knutsson
and Granlund [23] (4 scales, 4 orientations, 2 phases), Turner [40] (4 scales, 4 orientations,
2 phases), Fogel and Sagi [10] (4 scales, 4 orientations, 2 phases), Malik and Perona [26] (11
scales, 6 orientations, 1 phase) and Bovik et al. [4] (n scales, m orientations, 1 phase). Work
on stereo by Kass [22] (12 filters; scales, orientations and phases unspecified) and Jones and
Malik [20, 21] (see also the two articles in this book) (6 scales, 2-6 orientations, 2 phases).
Work on curved line grouping by Parent and Zucker [31] (1 scale, 8 orientations, 1 phase) and
Malik and Gigus [25] (9 curvatures, 1 scale, 18 orientations, 2 phases). Work on brightness
edge detection by Binford and Horn [17, 3] (24 orientations), Canny [6] (1-2 scales, ∞-6 orien-
tations, 1 phase), Morrone, Owens and Burr [30, 28] (1-3 scales, 2-4 orientations, ∞ phases),
and unpublished work on edge and illusory contour detection by Heitger, Rosenthaler, Kübler and
von der Heydt (6 orientations, 1 scale, 2 phases). Image compression by Zhong and Mallat [41]
(4 scales, 2 orientations, 1 phase).
Fig. 1. [Figure: image panels; recoverable labels: 'energies 2-sided 8x8', 'orientations 8x8',
'Perona-Malik σx = 3, σy = 1', 'Canny σ = 1'. See the caption of Fig. 2.]
Fig. 2. Example of the use of orientation-selective filtering on a continuum of orientations (see
Perona and Malik [35, 36]). Fig. 1 (Left) Original image. Fig. 1 (Right) A T-junction (64x64
pixel detail from a region roughly at the centre of the original image). The kernel of the filter
for the edge-detector is elongated to have high orientation selectivity; it is depicted in Fig. 3.
(Top-left) Modulus R(x, y, θ) of the output of the complex-valued filter (polar plot shown for 8x8
pixels in the region of the T-junction). (Top-right) The local maxima of |R(x, y, θ)| with respect
to θ. Notice that in the region of the junction one finds two local maxima in θ, corresponding
to the orientations of the edges. Searching for local maxima in (x, y) in a direction orthogonal to
the maximizing θ's one can find the edges (Bottom left) with high accuracy (error around 1
degree in orientation and 0.1 pixels in position). (Bottom right) Comparison with the output of
a Canny detector using the same kernel width (σ in pixel units).
have proposed non-optimal extensions to the case of joint rotation and scaling.
In this paper the general case of compact deformations is reviewed in section 2. Some
results of functional analysis are recalled to formulate the decomposition technique in full
generality. The case of rotations is briefly recalled in section 3 to introduce some notation
which is used later in the paper. In section 4 it is shown how to generate a steerable and
scalable family. Experimental results and implementation issues are presented and dis-
cussed. Finally, in section 5 some basic symmetries of edge-detection kernels are studied
and their use described in (a) reducing calculations and storage, and (b) implementing
filters useful for junction analysis at no extra cost.
2 Deformable functions
In order to solve the approximation problem one needs of course to define the 'qual-
ity' of the approximation G~n] ~ F0. There are two reasonable choices: (a) a distance
D(Fo,G~hI) in the space 112 S 1 where F0 is defined; (b) if F0 is the kernel of some
filter one is interested in the worst-case error in the 'output' space: the maximum dis-
tance d((Fo, f}, (G~nl, f)) over all unit-norm f defined on It 2. The symbols An and din
will indicate the 'optimal' distances, i.e. the minimum possible approximation errors us-
ing n components. These quantities may be defined using the distances indflced by the
L2-norm:
Definition.
8n(Fo)=inf sup
a M II/11=I
II(Fo-@l,y),,lls,
The existence of the optimal finite-sum approximation of the kernel F_θ(x) as described
in the introduction is not peculiar to the case of rotations. It holds in more general
circumstances: this section collects a few facts of functional analysis showing that one
can compute finite optimal approximations to continuous families of kernels whenever
certain 'compactness' conditions are met.
Consider a parametrized family of kernels F(x; θ) where x ∈ X now indicates a generic
vector of variables in a set X and θ ∈ T a vector of parameters in a set T. (The notation
is changed slightly from the previous section.) Consider the sets A and B of continuous
functions from X and T to the complex numbers, and call a(x) and b(θ) the generic elements
of these two sets. Consider the operator L : A → B defined by F as:

    (La)(θ) = ∫_X F(x; θ) a(x) dx

A first theorem says that if the kernel F has bounded norm then the associated
operator L is compact (see [7], p. 316):
Theorem 1. Let X and T be locally compact Hausdorff spaces and F ∈ L²(X × T). Then
L is well defined and is a compact operator.
A third result says that if L is continuous and operates on Hilbert spaces then the
compactness property transfers to the adjoint of L (see [8], p. 329):
As a result we know that when our original template kernel F(x) and the chosen family
of deformations R(θ) define a Hilbert-Schmidt kernel F(x; θ) = (F ∘ R(θ))(x), then it is
possible to compute a finite discrete approximation as in the case of 2D rotations.
Are the families of kernels F(x; θ) of interest in vision Hilbert-Schmidt kernels? In the
cases of interest for vision applications the 'template' kernel F(x) typically has a finite
norm, i.e. it belongs to L²(X) (all kernels used in vision are bounded compact-support
kernels such as Gaussian derivatives, Gabors etc.). However, this is not a sufficient con-
dition for the family F(x; θ) = F ∘ R(θ)(x), obtained by composing F(x) with deformations
R(θ) (rotations, scalings), to be a Hilbert-Schmidt kernel: the norm of F(x; θ) could be
unbounded (e.g. if the deformation is a scaling in the unbounded interval (0, ∞)). A suf-
ficient condition for the associated family F(x; θ) to be a Hilbert-Schmidt kernel is that
the inverse of the Jacobian of the transformation R, |J_R|⁻¹, belongs to L²(T) (see [34]).
A typical condition in which this arises is when the transformation R is unitary, e.g.
a rotation, a translation, or an appropriately normalized scaling, and the set T is bounded.
In that case the norm of |J_R|⁻¹ is equal to the measure of T. The following sections of
this paper will illustrate the power of these results by applying them to the decomposition
of rotated and scaled kernels.
A useful subclass of kernels F for which the finite orthonormal approximation can
be in part explicitly computed is obtained by composing a template function with transformations
T_θ belonging to a compact group. This situation arises in the case of n-dimensional
rotations and is useful for edge detection in tomographic data and spatiotemporal
filtering. It is discussed in [32, 33, 34].
3 Rotation
To make the paper self-contained, the formula for generating a steerable approximation
is recalled here. The F_θ^[n] which is the best n-dimensional approximation of F_θ is defined
as follows:
Definition. Call F_θ^[n] the n-term sum:
F_θ^[n](x) = Σ_{i=1}^{n} σ_i a_i(x) b_i(θ)   (3)
with σ_i, a_i and b_i defined in the following way: let ĥ(ν) be the (discrete) Fourier transform
of the function h(θ) defined by:
(gaus-3) (sfnc.0) (sfnc.1) (sfnc.2) (sfnc.3) (sfnc.4) (sfnc.5) (sfnc.6) (sfnc.7) (sfnc.8)
Fig. 3. The decomposition (σ_i, a_i, b_i) of a complex kernel used for brightness-edge detection [36].
(Left) The template function (gaus-3) is shown rotated counterclockwise by 120°. Its real part
(above) is the second derivative along the vertical (Y) axis of a Gaussian with σx : σy ratio of
1:3. The imaginary part (below) is the Hilbert transform of the real part along the Y axis. The
singular values σ_i (not shown here; see [34]) decay exponentially: σ_{i+1} ≈ 0.75 σ_i. (Right) The
functions a_i (sfnc.i) are shown for i = 0 ... 8. The real part is above; the imaginary part below.
The functions b_i(θ) are complex exponentials (see text) with associated frequencies ν_i = i.
h(θ) = ∫_{R²} F_θ(x) F̄_{θ=0}(x) dx   (4)
and let ν_i be the frequencies on which ĥ(ν) is defined, ordered in such a way that ĥ(ν_i) ≥
ĥ(ν_j) if i ≤ j. Call N ≤ ∞ the number of nonzero terms ĥ(ν_i). Finally, define the
quantities:
σ_i = ĥ(ν_i)^{1/2} ,   b_i(θ) = e^{j2πν_i θ} ,   a_i(x) = σ_i⁻¹ ∫ F_θ(x) b̄_i(θ) dθ   (5-7)
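The construction of eqs. (3)-(7) can be checked numerically: sample the template at a discrete set of orientations, compute h(θ), and read the weights off its Fourier transform. A minimal sketch; the anisotropic Gaussian-derivative template, grid size and number of orientation samples are illustrative assumptions, not the paper's exact kernel:

```python
import numpy as np

def rotated_template(theta, n=33, ratio=3.0):
    # hypothetical stand-in for F_theta: second y-derivative of an anisotropic
    # Gaussian, rotated by theta (cf. the real part of the kernel in Fig. 3)
    ax = np.linspace(-4.0, 4.0, n)
    X, Y = np.meshgrid(ax, ax)
    Xr = np.cos(theta) * X + np.sin(theta) * Y
    Yr = -np.sin(theta) * X + np.cos(theta) * Y
    sx, sy = 1.0 / ratio, 1.0
    g = np.exp(-0.5 * ((Xr / sx) ** 2 + (Yr / sy) ** 2))
    return (Yr ** 2 / sy ** 4 - 1.0 / sy ** 2) * g

M = 64                                    # orientation samples over [0, 2*pi)
thetas = 2 * np.pi * np.arange(M) / M
F = np.stack([rotated_template(t).ravel() for t in thetas])

h = F @ F[0]                              # h(theta) = <F_theta, F_0>, cf. eq. (4)
h_hat = np.abs(np.fft.fft(h)) / M         # discrete Fourier transform of h
sigma2 = np.sort(h_hat)[::-1]             # the weights sigma_i^2, decreasing

energy = np.cumsum(sigma2) / np.sum(sigma2)
n_terms = int(np.searchsorted(energy, 0.99) + 1)   # terms for 99% of the energy
```

For a smooth template, h(θ) is effectively band-limited, so `n_terms` comes out much smaller than M: the finite sum (3) with a few terms already captures almost all of the energy of the kernel family.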
4 Scale
A number of filter-based early vision and signal processing algorithms analyze the image
at multiple scales of resolution. Although most of the algorithms are defined on, and
would take advantage of, the availability of a continuum of scales, only a discrete and
small set of scales is usually employed, due to the computational costs involved with
filtering and storing images. The problem of multi-scale filtering is somewhat analogous
to the multi-orientation filtering problem: given a template function F(x), and defining
F_σ as F_σ(x) = F(x/σ), σ ∈ (0, ∞), one would like to be able to write F_σ as a
(small) linear combination:
F_σ(x) ≈ Σ_i b_i(σ) a_i(x)
One must therefore abandon the idea of generating a continuum of scales spanning
the whole positive line. This is not a great loss: the range of scales of interest is never the
entire real line. An interval of scales (σ₁, σ₂), with 0 < σ₁ < σ₂ < ∞, is a very realistic
scenario; if one takes the human visual system as an example, the range of frequencies
to which it is most sensitive goes from approximately 2 to 16 cycles per degree of visual
angle, i.e. a range of 3 octaves. In this case the interval of scales is compact and one can
apply the results of Section 2 and calculate the SVD and therefore an L²-optimal finite
approximation.
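On a compact interval the optimal finite decomposition can be computed directly as a numerical SVD of the sampled kernel family. A sketch; the first-derivative-of-Gaussian template and the interval (0.5, 2.0) are arbitrary illustrative choices:

```python
import numpy as np

x = np.linspace(-8.0, 8.0, 257)
sigmas = np.linspace(0.5, 2.0, 64)        # a compact interval (sigma_1, sigma_2)

def template(x, s):
    # first derivative of a Gaussian at scale s (an illustrative template)
    return -x / s ** 3 * np.exp(-x ** 2 / (2 * s ** 2))

A = np.stack([template(x, s) for s in sigmas])     # one scaled kernel per row
U, gamma, Vt = np.linalg.svd(A, full_matrices=False)

def rel_error(n):
    # relative L2 error of the best rank-n approximation: it is governed
    # entirely by the discarded singular values
    return np.sqrt(np.sum(gamma[n:] ** 2) / np.sum(gamma ** 2))

n99 = next(n for n in range(len(gamma)) if rel_error(n) < 0.01)
```

The singular values decay quickly, so a handful of components reproduces every scale in the interval to within 1%; the rows of `Vt` play the role of the radial components and the columns of `U` that of the scale components.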
In this section the optimal scheme for doing so is proposed. The problem of simultaneously
steering and scaling a given kernel F(x), generating a family F_{(σ,θ)}(x) which
has a finite approximation, will be tackled. Previous non-optimal schemes are due to
Perona [32, 33] and Simoncelli et al. [9, 12].
4.1 Polar-separable decomposition
Observe first that the functions a_i defined in eq. (7) are polar-separable. In fact x may
be written in polar coordinates as x = ‖x‖ R_{φ(x)} u, where u is some fixed unit vector (e.g.
the versor of the first coordinate axis), φ(x) is the angle between x and u, and R_{φ(x)} is a
rotation by φ(x). Substituting the definition of F_θ in (7) we get:
a_i(x) = σ_i⁻¹ e^{−j2πν_i φ(x)} ∫_{S¹} F(‖x‖ R_φ(u)) e^{j2πν_i φ} dφ = c_i(‖x‖) e^{−j2πν_i φ(x)}   (9)
4.2 Scaling is a 1D problem
The scaling operation only affects the radial components c_i and does not affect the
angular components. The problem of scaling the kernels a_i, and therefore F_σ through its
decomposition, is then the problem of finding a finite (approximate) decomposition of
continuously scaled versions of the functions c_i(ρ):
F_{(σ,θ)}(x) = Σ_{i=0}^{N} σ_i b_i(θ) e^{−j2πν_i φ(x)} c_i^σ(‖x‖) ,   where c_i^σ(ρ) ≝ c_i(ρ/σ)   (10)
Fig. 4. (Right) The plots of c_i(ρ), the radial part of the singular functions a_i (cf. eq. 9). The
θ part is always a complex exponential. The original kernel is the same as in Fig. 3. (Left) The
0th, 4th and 8th components c₀, c₄ and c₈ represented in two dimensions.
c_i^σ(ρ) = Σ_k γ_k^i s_k^i(σ) r_k^i(ρ)   (12)
As discussed before (Theorem 4) one can calculate the approximation error from the
sequence of the singular values γ_k^i. Finally, substituting (12) into (10), the scale-orientation
expansion takes the form (see Fig. 6):
F_{(σ,θ)}(x) = Σ_{i=0}^{N} σ_i b_i(θ) Σ_{k=1}^{n_i} γ_k^i s_k^i(σ) r_k^i(‖x‖) e^{−j2πν_i φ(x)}   (13)
Filtering an image I with a deformable kernel built this way proceeds as follows:
first the image is filtered with the kernels a_k^i(x) = e^{−j2πν_i φ(x)} r_k^i(‖x‖), i = 0, ..., N,
k = 0, ..., n_i; then the outputs I_k^i of this operation are combined as
I_{(σ,θ)}(x) = Σ_{i=0}^{N} σ_i b_i(θ) Σ_{k=1}^{n_i} γ_k^i s_k^i(σ) I_k^i(x) to yield the result.
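The filtering procedure above rests on the linearity of convolution: the image is convolved once with each basis kernel, and any deformed response is then a weighted sum of the stored outputs. A sketch with stand-in random basis kernels and weights (the actual a_k^i and the weights σ_i b_i(θ) γ_k^i s_k^i(σ) would come from the decomposition):

```python
import numpy as np

rng = np.random.default_rng(0)
I = rng.standard_normal((32, 32))            # toy image
basis = rng.standard_normal((4, 32, 32))     # stand-ins for the kernels a_k^i
weights = rng.standard_normal(4)             # stand-ins for sigma_i b_i gamma_k^i s_k^i

def conv(img, ker):
    # circular convolution via the FFT
    return np.real(np.fft.ifft2(np.fft.fft2(img) * np.fft.fft2(ker)))

# filter once per basis kernel ...
outputs = np.stack([conv(I, b) for b in basis])
# ... then combine the stored outputs for a given (sigma, theta) ...
combined = np.tensordot(weights, outputs, axes=1)
# ... which equals filtering with the deformed kernel assembled from the same weights
direct = conv(I, np.tensordot(weights, basis, axes=1))
```

The basis responses are computed once; evaluating the filter at a new (σ, θ) afterwards costs only a handful of multiply-accumulates per pixel.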
Fig. 5. Scale-decomposition of the radial component of the functions a_i. The interval of scales
is σ ∈ (0.125, 1.00). See also Fig. 6. (Top-left) The weights γ_k^i of each polar function's decomposition
(i = 0, ..., 8; k along the x axis). The decay of the weights is exponential in k; 5 to 8
components are needed to achieve 1% error (e.g. 5 for the 0th, 7 for the 4th and 8 for the 8th
shown in Fig. 4). (Bottom) The first four radial (left) and scale (right) components of the 5th
singular function: r_k(ρ) and s_k(σ), k = 0, ..., 3 (see Eq. (12)). (Top-right) The real parts of the
first four scale-components of the 5th singular function: cos(2πν₄θ) s_k(ρ), with k = 0, ..., 3
(see Eq. (13)).
Fig. 6. The kernel at different scales and orientations: the scales are (left to right) 0.125, 0.33,
0.77, 1.00. The orientations are (left to right) 30°, 66°, 122°, 155°. The kernels shown here were
obtained from the scale-angle decomposition shown in the previous figures.
Fig. 7. Three complex-valued kernels used in edge and junction analysis (the real parts are
shown above and the imaginary parts below). The first one (2-sided) is 'tuned' to edges that are
combinations of steps and lines (see [36]); it is the same as in Fig. 3 top left, shown at an
orientation of 0°. The second kernel (endstopped) is tuned to edge endings and 'crisscross'
junctions [13, 16, 39]: it is equivalent to a 1st derivative of the 2-sided kernel along its axis
direction. The third one (1-sided) may be used to analyze arbitrary junctions. All three kernels
may be obtained at any orientation by suitably combining the 'basis' kernels a_i shown in Fig. 3.
center in the origin. The circular path begins and ends at the positive side of the X axis.
Consider now such a path for the 2-sided kernel of Fig. 7: observe that for every ρ we
have at least two symmetries.
For the real part:
Fig. 8. Demonstration of the use of the kernels shown in Fig. 7 for the analysis of orientation
and position of edges and junctions. For each pixel in a 16×16 neighbourhood of the T-junction
in Fig. 7 (right), the local maxima in orientation of the modulus of the corresponding filter responses
are shown. A - 2-sided: (equivalent to Fig. 2 top-right) Within a distance of approximately
2-2.5 σx from an isolated edge this kernel gives an accurate estimate of edge orientation. Near
the junction there is a distortion in the estimate of orientation; notice that the needles indicating
the orientation of the horizontal edge bend clockwise by approximately 15° within a distance of
approx. 1.5 σx from the junction. The periodicity of the maxima is 180°, making it difficult to
take a local decision about the identity of the junction (L, T, X). B - 1-sided: Notice the good
estimate of orientation near the junction; from the disposition of the local maxima it is possible
to identify the junction as a T-junction. The estimate of edge orientation near an isolated edge is
worse than with the 2-sided kernel since the 1-sided kernel has a 360° symmetry. C - endstop:
The response along an 'isolated' edge (far from the junction) is null along the orientation of
the edge, while the response in the region of the junction has maxima along the directions of
the intervening edges. D - endstop along 1-sided maxima: Response of the endstop kernel
along the orientations of maximal response of the 1-sided kernel. Notice that there is significant
response only in the region of the junction. The junction may be localized at the position with
maximal total endstop response.
F_θ^{(2s)} = Σ_{i=1}^{n} σ_i a_i b_i(θ) = Σ_{ν=−b}^{b} σ_ν a_ν e^{j2πνθ} = Σ_{ν=0}^{b} 2 σ_ν a_ν cos(2πνθ)   (14)
where the indexing is now by frequency: a_ν and σ_ν denote the a_i and σ_i associated to
the frequency ν = ν_i, and a_{ν=0} = ½ a_{i : ν_i = 0}.
Consider now the endstopped kernel (Fig. 7, middle): the same symmetries are found
in a different combination: the real part has symmetries (E) and (Π−) while the imaginary
part has symmetries (O) and (Π+). A kernel of this form may clearly be obtained from
the coefficients of the 2-sided kernel by exchanging the basis functions: sinusoids for the even
frequencies and cosinusoids for the odd frequencies (equivalent to taking the Hilbert
transform of the 2-sided kernel along the circular concentric paths):
F_θ^{(es)} = Σ_{ν even} 2 σ_ν a_ν sin(2πνθ) + Σ_{ν odd} 2 σ_ν a_ν cos(2πνθ)   (15)
The endstopped kernel shown in Fig. 7 has been obtained following this procedure from
the decomposition (σ_i, a_i, b_i) of the 2-sided kernel in the same figure.
A kernel of the form 1-sided can now be obtained by summing the 2-sided and endstopped
kernels previously constructed. It is the one shown in Fig. 7, right side. The
corresponding reconstruction equation is:
F_θ^{(1s)} = F_θ^{(2s)} + F_θ^{(es)}   (16)
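The cos/sin exchange of eqs. (14)-(16) can be sketched on a 1-D angular profile; the retained frequencies and weights below are made-up values, not the coefficients of the kernel in Fig. 7:

```python
import numpy as np

theta = 2 * np.pi * np.arange(256) / 256
b = 4                                         # highest retained frequency
a = {0: 0.5, 1: 1.0, 2: 0.7, 3: 0.3, 4: 0.1}  # illustrative weights sigma_nu a_nu

# 2-sided profile: cosinusoids at every retained frequency (cf. eq. 14)
two_sided = sum(a[nu] * np.cos(nu * theta) for nu in range(b + 1))

# endstopped profile: exchange the basis functions, sinusoids for the even
# frequencies and cosinusoids for the odd ones (cf. eq. 15)
endstop = sum(a[nu] * (np.sin(nu * theta) if nu % 2 == 0 else np.cos(nu * theta))
              for nu in range(b + 1))

one_sided = two_sided + endstop               # cf. eq. (16)
```

At θ = 0 the even-frequency (sine) terms of the endstopped profile vanish, which is why the endstopped kernel responds at edge endings rather than along the edge itself.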
6 Conclusions
A technique has been presented for implementing families of deformable kernels for early
vision applications. A given family of kernels, obtained by deforming a template kernel
continuously, is approximated by interpolating a finite discrete set of kernels. The technique
may be applied if and only if the family of kernels involved satisfies a compactness condition.
This improves upon previous work by Freeman and Adelson on steerable filters
and by Perona and by Simoncelli et al. on scalable filters in that (a) it is formulated with
maximum generality for the case of any compact deformation, or, equivalently, any compact
family of kernels, and (b) it provides a design technique which is guaranteed to find the
most parsimonious discrete approximation. It has also been shown how to build edge-,
terminator- and junction-tuned kernels out of the same family of 'basis' functions.
Unlike common techniques used in early vision, where the set of orientations is discretized,
here the kernel and the response of the corresponding filter may be computed
on a continuum, for any value of the deformation parameters, with no anisotropies. The
approximation error is computable a priori and is constant with respect to the deformation
parameter. This allows one, for example, to recover edges with great spatial and
angular accuracy.
7 Acknowledgements
I have had useful conversations concerning this work with Ted Adelson, Stefano Casadei, Charles
Desoer, David Donoho, Peter Falb, Bill Freeman, Federico Girosi, Takis Konstantopoulos, Paul
Kube, Olaf Kübler, Jitendra Malik, Stéphane Mallat, Sanjoy Mitter, Richard Murray, Massimo
Porrati. Federico Girosi and Peter Falb helped with references to the functional analysis
textbooks. The simulations have been carried out using Paul Kube's "viz" image-manipulation
package. The images have been printed with software provided by Eero Simoncelli. Some of the
simulations have been run on a workstation generously made available by Prof. Canali of the
Università di Padova. Part of this work was conducted while at M.I.T. I am very grateful
to Sanjoy Mitter and the staff of LIDS for their warm year-long hospitality.
References
1. ADELSON, E., AND BERGEN, J. Spatiotemporal energy models for the perception of motion.
J. Opt. Soc. Am. A 2, 2 (1985), 284-299.
2. ADELSON, E., AND BERGEN, J. The plenoptic function and the elements of early vision.
In Computational Models of Visual Processing, M. Landy and J. Movshon, Eds. MIT Press,
1991. Also appeared as MIT Media Lab TR-148, September 1990.
3. BINFORD, T. Inferring surfaces from images. Artificial Intelligence 17 (1981), 205-244.
4. BOVIK, A., CLARK, M., AND GEISLER, W. Multichannel texture analysis using localized
spatial filters. IEEE Trans. Pattern Anal. Mach. Intell. 12, 1 (1990), 55-73.
5. BURT, P., AND ADELSON, E. The Laplacian pyramid as a compact image code. IEEE
Transactions on Communications 31 (1983), 532-540.
6. CANNY, J. A computational approach to edge detection. IEEE Trans. Pattern Anal. Mach.
Intell. 8 (1986), 679-698.
7. CHOQUET, G. Lectures on analysis, vol. I. W. A. Benjamin Inc., New York, 1969.
8. DIEUDONNE, J. Foundations of modern analysis. Academic Press, New York, 1969.
9. SIMONCELLI, E., FREEMAN, W., ADELSON, E., AND HEEGER, D. Shiftable multi-scale transforms.
Tech. Rep. 161, MIT Media Lab, 1991.
10. FOGEL, I., AND SAGI, D. Gabor filters as texture discriminators. Biol. Cybern. 61 (1989),
103-113.
11. FREEMAN, W., AND ADELSON, E. Steerable filters for early vision, image analysis and
wavelet decomposition. In Third International Conference on Computer Vision (1990),
IEEE Computer Society, pp. 406-415.
12. FREEMAN, W., AND ADELSON, E. The design and use of steerable filters for image analysis,
enhancement and multi-scale representation. IEEE Trans. Pattern Anal. Mach. Intell.
(1991).
13. FREEMAN, W., AND ADELSON, E. Junction detection and classification. Invest. Ophthalmol.
Vis. Sci. (Supplement) 32, 4 (1991), 1279.
14. GRANLUND, G. H. In search of a general picture processing operator. Computer Graphics
and Image Processing 8 (1978), 155-173.
15. HEEGER, D. Optical flow from spatiotemporal filters. In Proceedings of the First International
Conference on Computer Vision (1987), pp. 181-190.
16. HEITGER, F., ROSENTHALER, L., VON DER HEYDT, R., PETERHANS, E., AND KUBLER, O.
Simulation of neural contour mechanisms: From simple to end-stopped cells. Tech. Rep.
126, IKT/Image Science Lab, ETH Zürich, 1991.
17. HORN, B. The Binford-Horn line finder. Tech. Rep., MIT AI Lab Memo 285, 1971.
18. HUBEL, D., AND WIESEL, T. Receptive fields of single neurones in the cat's striate cortex.
J. Physiol. (Lond.) 148 (1959), 574-591.
19. HUBEL, D., AND WIESEL, T. Receptive fields, binocular interaction and functional architecture
in the cat's visual cortex. J. Physiol. (Lond.) 160 (1962), 106-154.
20. JONES, D., AND MALIK, J. Computational stereopsis: beyond zero-crossings. Invest.
Ophthalmol. Vis. Sci. (Supplement) 31, 4 (1990), 529.
21. JONES, D., AND MALIK, J. Using orientation and spatial frequency disparities to recover
3D surface shape: a computational model. Invest. Ophthalmol. Vis. Sci. (Supplement) 32,
4 (1991), 710.
This article was processed using the LaTeX macro package with ECCV92 style
Families of Tuned Scale-Space Kernels *
L.M.J. Florack 1, B.M. ter Haar Romeny 1, J.J. Koenderink 2, M.A. Viergever 1
1 3D Computer Vision Research Group, University Hospital, Room E.02.222,
Heidelberglaan 100, 3584 CX Utrecht, The Netherlands
2 Dept. of Medical and Physiological Physics, University of Utrecht,
Princetonplein 5, 3584 CC Utrecht, The Netherlands
Abstract.
We propose a formalism for deriving parametrised ensembles of local
neighbourhood operators on the basis of a complete family of scale-space
kernels, which are apt for the measurement of a specific physical observable.
The parameters are introduced in order to associate a continuum of a priori
equivalent kernels with each scale-space kernel, each of which is tuned to a
particular parameter value.
Ensemble averages, or other functional operations in parameter space,
may provide robust information about the physical observable of interest.
The approach gives a possible handle on incorporating multi-valuedness
(transparency) and visual coherence into a single model.
We consider the case of velocity tuning to illustrate the method. The
emphasis, however, is on the formalism, which is more generally applicable.
1 Introduction
The problem of finding a robust operational scheme for determining an image's differential
structure is intimately related to the concept of resolution or scale. The concept
of resolution has been given a well-defined meaning by the introduction of a scale-space.
This is a 1-parameter family of images, derived from a given image by convolution with
a gaussian kernel, which defines a spatial aperture for measurements carried out on the
image and thus sets the "inner scale" (i.e. inverse resolution).
The gaussian emerges as the unique smooth solution from the requirement of absence
of spurious detail, as well as some additional constraints [1, 2, 3]. Alternatively, it is
uniquely fixed by the requirement of linearity and a set of basic symmetry assumptions,
i.c. translation, rotation and scale invariance [4, 5]. These symmetries express the absence
of a priori knowledge concerning the spatial location, orientation and scale of image
features that might be of interest.
Although several fundamental problems are yet to be solved, the crucial role of resolution
in any front-end vision system cannot be ignored. Indeed, scale-space theory is
gaining more and more appreciation in computer vision and image analysis. Neurophysiological
evidence obtained from the mammalian striate cortex also bears witness to its vital
importance [6]. There is also psychophysical support for the gaussian model [7].
Once the role of scale in a physical observable has been appreciated and a smooth
scale-space kernel has been established, the problem of finding derivatives that depend
* This work was performed as part of the 3D Computer Vision Research Program, supported
by the Dutch Ministry of Economic Affairs through a SPIN grant, and by
the companies Agfa-Gevaert, Philips Medical Systems and KEMA. We thank J. Blom,
M. van Eert, R. van Maarseveen and A. Salden for their stimulating discussions and software
implementation.
continuously on the image (i.e. are well-posed in the sense of Hadamard), has a trivial
solution [4, 5, 8, 9]: just note that if D is any linear differential operator, f is a given image
and g_σ is the scale-space kernel on a scale σ (fairly within the available scale-range), then
the convolution f ∗ Dg_σ precisely yields the derivative of f on scale σ, i.e. D(f ∗ g_σ). The
1-parameter family containing the scaled gaussian and its linear derivatives constitutes
a complete family of scaled differential operators or local neighbourhood operators [10].
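The commutation D(f ∗ g_σ) = f ∗ (Dg_σ), which makes scale-space differentiation well-posed, holds exactly in the discrete periodic setting and is easy to verify; the signal, the scale and the central-difference operator below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)
f = rng.standard_normal(128)              # a toy 1-D "image"
x = np.arange(128) - 64
g = np.exp(-x ** 2 / (2 * 4.0 ** 2))
g /= g.sum()                              # gaussian aperture, sigma = 4

def conv(u, v):
    # circular convolution via the FFT
    return np.real(np.fft.ifft(np.fft.fft(u) * np.fft.fft(v)))

def D(u):
    # central-difference derivative operator (any linear shift-invariant D works)
    return (np.roll(u, -1) - np.roll(u, 1)) / 2.0

# differentiating the smoothed image equals smoothing with the
# differentiated kernel: both yield the scaled derivative of f
lhs = D(conv(f, g))
rhs = conv(f, D(g))
```

The point of the identity is practical: `D(g)` can be tabulated once, so a well-posed image derivative costs a single convolution with a fixed kernel.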
Despite its completeness, however, the gaussian family is not always the most conve-
nient one. For example, local optic flow in a time varying image can be obtained directly
from the output of a gaussian family of space-time filters, at least in principle [11], but it
may be more convenient to first tune these filters to the physical parameter of interest,
i.c. a velocity vector field. This way, the filters have a more direct relation to the quantity
one wishes to extract.
To illustrate the formalism we will present an example of filter tuning, i.c. velocity
tuning [12, 13, 14, 15]. The emphasis, however, is on the formalism based on (Lie-group)
symmetries, expressing the a priori equivalence of parameter values, i.e. velocities. The
formalism is readily applicable to a more general class of physical tuning parameters, e.g.
frequency, stereo disparity, etc.
2 Filter Tuning
We have the following relationship between the kernels Φ_{μ₁...μₙ}(X), given in dimensionless,
scale-invariant coordinates, and the scale-parametrised kernels G_{μ₁...μₙ}(x, t; σ, τ):
(√(2τ²))^m (√(2σ²))^{n−m} G_{μ₁...μₙ}(x, t; σ, τ) dx dt ≝ Φ_{μ₁...μₙ}(X) dX   (2)
in which m is the number of zero-valued indices among μ₁ ... μₙ. Although the temporal
part of (1) obviously violates temporal causality, it can be considered as a limiting case
of a causal family in some precise sense [17]. So far for the basic, non-tuned gaussian
spacetime family.
The tuning parameter of interest will be a spacetime vector ξ = (ξ₀; ξ). Apart from
this, the following variables are relevant: the scale parameters σ and τ, the frequency
variables ω and ω₀ for addressing the fourier domain, and the variables L̂ and L̂₀ representing
the scaled and original input image values in fourier space. According to the
Pi theorem, we may replace these by: Λ ≝ L̂/L̂₀, Ω = (Ω₀; Ω) ≝ (ω₀ √(2τ²); ω √(2σ²)),
Ξ = (Ξ₀; Ξ) ≝ (ξ₀/√(2τ²); ξ/√(2σ²)). Moreover we will use the conjugate, dimensionless
variables X = (X₀; X) ≝ (t/√(2τ²); x/√(2σ²)). Their dependency is expressed by
Λ = g(Ω, Ξ), in which g is some unknown, scalar function.
In the Ξ → 0 limit, the Ξ-tuned kernels should converge to (1). Conversely, by applying
a spacetime tuning operation to this underlying family, we may obtain a complete family
of spacetime-tuned kernels and, more specifically, of velocity-tuned kernels:
(3)
The construction of velocity-tuned kernels from the basic gaussian family is a special
case of spacetime tuning, viz. one in which the tuning point is the result of a galilean
boost applied to the origin of a cartesian coordinate frame. This is a transvection in
spacetime, i.e. a linear, hypervolume-preserving, unit-eigenvalue transformation of the
following type:
T_γ transforms static stimuli (straight curves of "events" parallel to the T-axis) into
dynamic stimuli, moving with a constant velocity γ. Note that T₀ is the identity and
T_γ⁻¹ = T_{−γ} is the inverse boost. Since a galilean boost is just a special type of spacetime
tuning, we immediately arrive at the following result:
Example 1. Consider a point stimulus L₀(x, t) = A δ(x − ct), moving at fixed velocity c.
According to (4), the lowest-order velocity-tuned filter is given by:
G(x, t; σ, τ, v) = (2πσ²)^{−D/2} (2πτ²)^{−1/2} exp( −(x − vt)·(x − vt)/(2σ²) − t²/(2τ²) )   (5)
in which the velocity v is related to the parameter vector γ in (4) by v ≝ σγ/τ.
Convolving the above input with this kernel yields the following result:
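The velocity-tuned kernel of eq. (5) is straightforward to sample; the sketch below uses one spatial dimension (D = 1) and arbitrary grid extents:

```python
import numpy as np

def tuned_gaussian(v, sigma=1.0, tau=1.0, n=65):
    # velocity-tuned spatiotemporal gaussian of eq. (5), sampled with D = 1
    x = np.linspace(-6.0, 6.0, n)
    t = np.linspace(-6.0, 6.0, n)
    X, T = np.meshgrid(x, t, indexing="ij")
    norm = 1.0 / (np.sqrt(2 * np.pi * sigma ** 2) * np.sqrt(2 * np.pi * tau ** 2))
    return norm * np.exp(-(X - v * T) ** 2 / (2 * sigma ** 2)
                         - T ** 2 / (2 * tau ** 2))

G_static = tuned_gaussian(0.0)   # tuned to static stimuli (T_0 is the identity)
G_moving = tuned_gaussian(1.0)   # sheared along x = v t by the boost
```

Tuning is a pure shear: at every time slice the spatial maximum of `G_moving` sits on the line x = v t, exactly where the point stimulus L₀ of Example 1 deposits its energy.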
In this paper we have shown how the basic family of local scale-space operators may
give rise to a gamut of other families, each of which is characterised, apart from scale,
by some physical tuning parameter. We have presented a formalism for generating such
families from the underlying gaussian scale-space family in a way that makes the a priori
equivalence of all tuning parameter values manifest. We have illustrated the formalism
References
i Central Research Laboratories, Matsushita Electric Ind., Moriguchi, Osaka 570, Japan
2 Mech. Eng. for Computer-Controlled Machinery, Osaka University, Suita, Osaka 565, Japan
1 Introduction
The objective of this work is to extract an object contour from a given initial rough
estimate. Contour extraction is a basic operation in computer vision, and has many
applications such as cut/paste image processing in authoring systems, medical image
processing, aerial image analysis and so on.
There have been several works which take an energy-minimization approach to
contour extraction (e.g. [1][2]). An active contour model (ACM) proposed by Kass [3] has
demonstrated interactivity with a higher visual process for shape corrections. It results
in a smooth and closed contour through energy minimization. The ACM, however, has the
following problems:
the discontinuities are just set at the high-curvature points [4]. Generally, however,
it is difficult to interpret the high-curvature points as corners or occluding points
automatically without knowledge about the object.
Taking into account the interactivity for correction, we adopt an ACM for object
contour extraction. In light of the above problems, we will focus not on the ACM itself but
on an underlying image structure which guides an active contour precisely to the object
boundary against the obstacles. In this paper we propose a mixture density description
as the underlying image structure. Mixture densities have been noted to give a mathematical
basis for clustering in terms of distribution characteristics [8][9][10]. The term refers to a
probability density for patterns which are drawn from a population composed of different
classes. We introduce this notion into a low-level image representation. The features of our
approach are:
First, we present a basic idea of the mixture density description, its definition with
assumptions and a region clustering. Second, we describe the active contour model based
on the mixture density descriptions. The experimental results are presented thereafter.
2.1 Basic Idea
Our ACM seeks a boundary between an object and its background regions according
to their mixture density descriptions. The mixture density descriptions describe the positions
and measurements of the sub-regions on both sides of the object boundary. In
our approach, the mixture density description can be obtained from a region clustering
in which pixel positions are considered to be part of the features. Owing to the combination
of the positions and the measurements, the regions on both sides can be decomposed into
locally distributed sub-regions. Similarly, Izumi et al. [11] and Crisman et al. [12] decomposed
a region into sub-regions and integrated the sub-regions' boundaries into an object
boundary. We do not use such boundaries directly because they may be imprecise and
jagged for our purpose.
For a pixel to be examined, by selecting a significant sub-region, which is nearest to the
pixel with respect to position and measurement, we can evaluate position-sensitive
likelihoods of the inside and outside regions. Fig. 1 illustrates an example of mixture density
descriptions. In Fig. 1, suppose that the region descriptions were not based on decomposed
sub-regions. The boundary between the inside "black" and the outside "blue" might not be correctly
obtained, because both regions include blue components and the likelihoods
of the two regions would not indicate a significant difference. On the other hand,
using the mixture density description, the likelihoods of the two regions can indicate a
significant difference, given the sub-regions' positions. Moreover, false edges can
be canceled by the equal likelihoods on both sides of the false edges.
2.2 Definitions and Assumptions
We introduce the probability density function for mixture densities [8]. In the mixture
densities, it is known that the patterns are drawn from a population of c classes. The
underlying probability density function for class ωi is given as a conditional density
p(x|ωi, θi), where θi is a vector of unknown parameters. If the a priori probability of class ωi
is known as p(ωi), with Σ_{i=1}^c p(ωi) = 1, a density function for the mixture densities can be
written as:
p(x|θ) = Σ_{i=1}^c p(x|ωi, θi) p(ωi)   (1)
where θ = (θ1, θ2, ..., θc). The conditional densities p(x|ωi, θi) are called component
densities; the a priori probabilities p(ωi) are called mixing parameters.
We assume that a region of interest R consists of c classes such that R = {ω1, ω2, ..., ωc}.
In order to evaluate the likelihood that a pixel belongs to the region R, we take a model
in which the overlapping of the component densities is negligible and the most significant
(highest) component density can dominate the pixel's mixture densities. Thus, the mixing
parameters of the region R should be conditional with respect to the position p (e.g. row, column
values) and the measurements x (e.g. RGB values) of the pixel.
The mixing parameters are given by:
Thus, we can rewrite the mixture density function of the region R for (x, p) as:
For convenience of notation, we introduce the notation y to represent the joint vector of
x and p: y = (x; p). We assume that the component densities are general multivariate
normal densities. According to this assumption, the component densities can be written
as:
p(y|ωi, Σi) = 1/((2π)^{d/2} |Σi|^{1/2}) exp[ −(1/2) χ²(y; ui, Σi) ]   (5)
and
χ²(y; ui, Σi) = (y − ui)^t Σi⁻¹ (y − ui) ,   (6)
where θi = (ui, Σi); d, ui and Σi are the dimension of y, the mean vector and the
covariance matrix of y belonging to the class ωi; (·)^t denotes the transpose of a matrix;
and χ²(·) is called the Mahalanobis distance function. For the multivariate normal densities,
the mixture density description is a set of means and covariance matrices for the c classes:
Mixture Density Description: θ = ((u1, Σ1), (u2, Σ2), ..., (uc, Σc)) .   (7)
The log-likelihood function for the multivariate normal densities is given by:
ln p(y|ωi, θi) = −(1/2) [ d ln 2π + ln|Σi| + χ²(y; ui, Σi) ]   (8)
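Eqs. (5)-(8) translate directly into code; the 2-D example values below are arbitrary:

```python
import numpy as np

def mahalanobis2(y, u, S):
    # chi^2 of eq. (6)
    d = y - u
    return float(d @ np.linalg.inv(S) @ d)

def component_density(y, u, S):
    # multivariate normal component density of eq. (5)
    d = len(u)
    return float(np.exp(-0.5 * mahalanobis2(y, u, S))
                 / ((2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(S))))

def log_likelihood(y, u, S):
    # log of eq. (5), i.e. eq. (8)
    d = len(u)
    return -0.5 * (d * np.log(2 * np.pi) + np.log(np.linalg.det(S))
                   + mahalanobis2(y, u, S))

u = np.zeros(2)          # mean vector u_i of one class
S = np.eye(2)            # covariance matrix Sigma_i
y = np.array([1.0, 1.0])  # a joint measurement/position sample
```

In the clustering below only the relative values per class matter, which is why the constant d ln 2π term can be dropped from the distance metric.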
2.3 Region Clustering
A mixture density description defined by (7) is obtained from decomposed sub-regions.
If the number of classes and the parameters are all unknown, as noted in [8], there is no
analytical decomposition scheme. Our algorithm is similar to the Supervising
ISODATA of Carman et al. [14], in which Akaike's information criterion (AIC) [15] is
adopted to reconcile the description error with the description compactness, and to evaluate
the goodness of the decomposition. The difference is that we assume the general multivariate
normal distribution for the component densities and use a distance metric based
on (5), while Carman et al. assume a multivariate normal distribution with diagonal
covariance matrices and use a Euclidean distance metric.
By eliminating the constant terms from the negated logarithm of (5), our distance
metric between a sample y and a class ωi is given by:
d(y; ui, Σi) = ln|Σi| + χ²(y; ui, Σi) .   (9)
For general multivariate normal distributions (with no constraints on the covariance
matrices), the AIC is defined as:
AIC = Σ_{i=1}^c n_i ( ln|Σi| + d ) + 2‖θ‖   (10)
where n_i and ‖θ‖ represent the number of samples y in the ith class and the number
of free parameters to be estimated, respectively (note that E(χ²(·)) = d). In (10), the
first term is the description error, while the second term is a compactness measure which
increases with the number of classes.
The algorithm starts with a small number of clusters, and iterates split-and-merge processes until the AIC reaches a minimum or other stopping rules, such as limits on the number of iterations, the number of class samples and the intra-class distributions, are satisfied.
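The AIC of (10) for a candidate hard clustering can be sketched as follows (a sketch only, assuming numpy; the per-class free-parameter count d + d(d+1)/2 for a mean and a full covariance is an assumption, since the paper does not spell out ‖θ‖):

```python
import numpy as np

def clustering_aic(clusters):
    """AIC for a hard clustering into general multivariate normal classes.
    `clusters` is a list of (n_samples, d) arrays, one per class.
    Description error per class uses E[chi^2] = d; the penalty is
    2 * number of free parameters (mean + full covariance per class)."""
    aic = 0.0
    for samples in clusters:
        n_i, d = samples.shape
        cov = np.atleast_2d(np.cov(samples, rowvar=False))
        sign, logdet = np.linalg.slogdet(cov)
        aic += n_i * (logdet + d)                # description error term
        aic += 2 * (d + d * (d + 1) // 2)        # compactness penalty
    return aic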
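The evaluation of a candidate decomposition by the AIC of (10) can be sketched as follows (a sketch only, assuming numpy; the per-class free-parameter count d + d(d+1)/2 for a mean and a full covariance is an assumption, since the paper does not spell out ‖θ‖):

```python
import numpy as np

def clustering_aic(clusters):
    """AIC of eq. (10) for a hard clustering into general multivariate
    normal classes. `clusters` is a list of (n_samples, d) arrays, one
    per class. The description error term uses E[chi^2] = d; the
    compactness term is 2 * number of free parameters per class."""
    aic = 0.0
    for samples in clusters:
        n_i, d = samples.shape
        cov = np.atleast_2d(np.cov(samples, rowvar=False))
        sign, logdet = np.linalg.slogdet(cov)
        aic += n_i * (logdet + d)                # description error term
        aic += 2 * (d + d * (d + 1) // 2)        # compactness penalty
    return aic
```

A split or merge is accepted when it lowers this score, so splitting two well-separated populations wins while splitting a homogeneous one loses to the penalty term.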
[Figure: log-likelihood of the inside and log-likelihood of the outside descriptions, plotted against the position on the normal to the contour (from inside to outside); the boundary is located by MLE.]
3.1 Energy Minimization Process
In (11), we assume that N control points v_i on an active contour are spaced at equal intervals. The energy minimization process proceeds by iterating the DP for the N control points until the number of position changes of the control points converges to a small number or the number of iterations exceeds a predefined limit. At each control point v_i, we define two coordinates t_i and n_i, which are an approximated tangent and an approximated normal to the contour, respectively. In the current implementation, each control point v_i is allowed to move to itself and its two neighbours {v_i, v_i + εn_i, v_i − εn_i} at each iteration of the DP. For convenience of notation, we express them as {middle, outward, inward}. E_stretch(·) and E_bend(·) in (11) are the first and the second order continuity constraints, and they are called internal energy. In the ACM notation, E_region(·) and E_edge(·) in (11) are called external forces (image forces).
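The per-iteration candidate move set {middle, outward, inward} can be sketched as follows; the step size ε and the approximation of the normal from the two neighbouring control points are assumptions for illustration (numpy assumed, `candidate_moves` a hypothetical name):

```python
import numpy as np

def candidate_moves(contour, eps=1.0):
    """For each control point v_i of a closed contour (array of shape (N, 2)),
    return the three DP candidates {middle, outward, inward} along an
    approximated normal n_i. The normal is the 90-degree rotation of the
    chord between the two neighbouring control points (an assumption)."""
    nxt = np.roll(contour, -1, axis=0)
    prv = np.roll(contour, 1, axis=0)
    tangent = nxt - prv                               # approximated tangent t_i
    normal = np.stack([-tangent[:, 1], tangent[:, 0]], axis=1)
    normal /= np.linalg.norm(normal, axis=1, keepdims=True)
    # shape (N, 3, 2): [v_i, v_i + eps*n_i, v_i - eps*n_i]
    return np.stack([contour, contour + eps * normal, contour - eps * normal],
                    axis=1)
```

The DP then picks, per control point, the candidate minimizing the sum of internal energies and external forces.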
In the following subsections, we will describe the external forces.
3.2 External Forces
The two external forces of our ACM are briefly illustrated in Fig. 3. The region force has the capability to globally guide an active contour to the true boundary. Once the contour has been guided near the true boundary, the edge force has the capability to locally force the active contour to outline the precise boundary.
[Fig. 3: The two external forces at a control point: region force (squash), region force (protrude), and region force + edge force; l(y|R_out, θ_out) < l(y|R_in, θ_in) in the equilibrium zone at the control point.]
and
p_out(v_i) = −p_in(v_i) = l(y|R_out, θ_out)_{v_i} − l(y|R_in, θ_in)_{v_i} ,   (13)
where l(·)_v denotes the log-likelihood taking feature vectors at or near the position v.
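The region force of (13) can be sketched as the difference of the two mixture log-likelihoods at a control point; the mixture layout `(prior, mean, cov)` and the function names are illustrative assumptions (numpy assumed):

```python
import numpy as np

def log_likelihood(y, mixture):
    """Log of the mixture density value at feature vector y.
    `mixture` is a list of (prior, mean, cov) triples (an assumed layout)."""
    total = 0.0
    for prior, mean, cov in mixture:
        d = len(mean)
        diff = y - mean
        mahal = diff @ np.linalg.solve(cov, diff)
        norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(cov))
        total += prior * np.exp(-0.5 * mahal) / norm
    return np.log(total)

def region_force(y, inside_mixture, outside_mixture):
    """Region force of eq. (13): this sketch returns p_in = l_in - l_out,
    positive (protrude) when y is better explained by the inside
    description, negative (squash) otherwise."""
    return log_likelihood(y, inside_mixture) - log_likelihood(y, outside_mixture)
```

At equilibrium the two log-likelihoods balance and the region force vanishes, which is where the edge force takes over.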
In order to introduce the edge force, two auxiliary points v_i^+ and v_i^− are provided for the control point v_i, where v_i^+ = v_i + ηn_i, v_i^− = v_i − ηn_i. The inside parameter θ_{v_i}^− is selected from θ^in so that θ_{v_i}^− is the parameter of the highest component density at v_i^−. The outside parameter θ_{v_i}^+ is selected at v_i^+ from θ^out in the same way. The edge force is given by:
4 Experiments
Throughout the experiments, the conditions are: 1) input image size: 512×480 pixels, RGB; 2) feature vectors: five-dimensional vectors (r, g, b, row, column).
Fig. 4 shows an input image with an initial contour drawn by a mouse. The initial contour is used as a "band" which specifies the inside and outside regions. According to the band description, the region clustering result is modified by splitting classes crossing the band. Given the initial contour in Fig. 4, we have obtained 24 classes for the inside and 61 classes for the outside through the region clustering. Mixture density descriptions are obtained from the inside and outside classes. Using these mixture density descriptions, the ACM performs the energy minimization to outline the object boundary. Fig. 5 shows that the discontinuous parts are precisely outlined with the discontinuity control.
Fig. 6 (a) shows a typical example of a trapped edge-based active contour. In contrast with (a), (b) shows a fair result despite the stronger background edges.
We can apply our ACM to object tracking by iterating the following steps: 1) extract a contour using the ACM, 2) refine the mixture density descriptions using the extracted contour, 3) apply the region clustering to a successive picture, taking the mixture density descriptions as the initial class data. In Fig. 6 (c), the active contour tracked the boundary according to the descriptions newly obtained from the previous result of (b).
5 Conclusion
We have proposed a precise contour extraction scheme using mixture density descriptions. In this scheme, region- and edge-based contour extraction processes are integrated. The ACM is guided through complex backgrounds by the region force and the edge force, both based on log-likelihood functions. Owing to the statistical measurement, our model is robust to parameter settings: throughout the experiments, including other pictures, the smoothing parameter has not been changed. In addition, the mixture density descriptions have made it possible to represent the C¹ discontinuity. Their efficiency is also demonstrated in the experiments.
Regarding the assumptions for the mixture density description, we have assumed that the component densities take general multivariate normal densities. Strictly speaking, the position vector p does not follow a normal distribution. So far, however, this assumption has not exhibited any crucial problems.
Further work is needed on obtaining initial contours in a more general manner. Issues for future research include the initial contour estimation and extending our scheme to describe a picture sequence.
References
1. Blake, A., Zisserman, A.: Visual Reconstruction. The MIT Press (1987)
2. Mumford, D., Shah, J.: Boundary Detection by Minimizing Functionals. Proc. CVPR'85 (1985) 22-26
3. Kass, M., Witkin, A., Terzopoulos, D.: SNAKES: Active Contour Models. Proc. 1st ICCV (1987) 259-268
4. Menet, S., Saint-Marc, P., Medioni, G.: B-snakes: Implementation and Application to Stereo. Proc. DARPA Image Understanding Workshop '90 (1990) 720-726
5. Cohen, L., Cohen, I.: A Finite Element Method Applied to New Active Contour Models and 3D Reconstruction from Cross Sections. Proc. 3rd ICCV (1990) 587-591
6. Berger, M., Mohr, R.: Towards Autonomy in Active Contour Models. Proc. 11th ICPR (1990) 847-851
7. Dennis, J., Schnabel, R.: Numerical Methods for Unconstrained Optimization and Nonlinear Equations. Prentice-Hall (1983)
8. Duda, R., Hart, P.: Pattern Classification and Scene Analysis. John Wiley and Sons (1973)
9. Sclove, S.: Application of the Conditional Population-Mixture Model to Image Segmentation. IEEE Trans. on Patt. Anal. & Mach. Intell. 5 (1983) 429-433
10. Yarman-Vural, F.: Noise, Histogram and Cluster Validity for Gaussian-Mixtured Data. Pattern Recognition 20 (1987) 385-501
11. Izumi, N., Morikawa, H., Harashima, H.: Combining Color and Spatial Information for Segmentation. IEICE 1991 Spring Nat. Convention Record, Part 7 (1991) 392 (in Japanese)
12. Crisman, J., Thorpe, C.: UNSCARF, A Color Vision System for the Detection of Unstructured Roads. Proc. Int. Conf. on Robotics & Auto. (1991) 2496-2501
13. Jain, A., Dubes, R.: Algorithms for Clustering Data. Prentice-Hall (1988)
14. Carman, C., Merickel, M.: Supervising ISODATA with an Information Theoretic Stopping Rule. Pattern Recognition 23 (1990) 185-197
15. Akaike, H.: A New Look at the Statistical Model Identification. IEEE Trans. on Automat. Contr. 19 (1974) 716-723
16. Etoh, M., Shirai, Y., Asada, M.: Active Contour Extraction by Mixture Density Description Obtained from Region Clustering, SIGPRU Tech. Rep. 91-81. IEICE of Japan (1991)
17. Amini, A., Weymouth, T., Jain, R.: Using Dynamic Programming for Solving Variational Problems in Vision. IEEE Trans. on Patt. Anal. & Mach. Intell. 12 (1990) 855-867
This article was processed using the LaTeX macro package with ECCV92 style
(a) without the discontinuity control (b) with the discontinuity control
1 Introduction
The reason for using a parameter mapping is often to convert a difficult global detection problem in image space into a local one. Spatially extended patterns are transformed so that they produce spatially compact features in a space of parameter values. In the case of line segmentation, the idea is to transform the original image into a new domain so that collinear subsets, i.e. global lines, fall into clusters. The topology of the mapping must reflect closeness between wanted features, in this case features describing properties of a line. The metric describing closeness should also be uniform throughout the space with respect to the features. If the metric and topology do not meet these requirements, significant bias and ambiguities will be introduced into any subsequent classification process.
2 Parameter Mappings
In this section, some problems with standard mappings for line segmentation will be illuminated.
The Hough transform, HT, was introduced by P. V. C. Hough in 1962 as a method for detecting complex patterns [Hough, 1962]. It has found considerable application due to its robustness when using noisy or incomplete data. A comprehensive review of the Hough transform covering the years 1962-1988 can be found in [Illingworth and Kittler, 1988].
Severe problems with the standard Hough parameterization are that the space is unbounded and will contain singularities for large slopes. The difficulties of the unbounded space are avoided by parameterizing a line by the angle φ and the length p of the normal vector to the line from the origin, the normal parameterization, see figure 1. This mapping has the advantage of having no singularities.
* This work has been supported by the Swedish National Board for Techn. Development, STU.
Measuring local orientation provides additional information about the slope of the line, or the angle φ when using the normal parameterization. This reduces the standard HT to a one-to-one mapping. By one-to-one we do not mean that the mapping is invertible, but that for each image point there is only one point in the parameter space defining the parameters that could have produced it.
Duda and Hart discussed this briefly in [Duda and Hart, 1973]. They suggested that this mapping could be useful when fitting lines to a collection of short line segments. Dudani and Luk [Dudani and Luk, 1978] use this technique for grouping measured edge elements. Princen, Illingworth and Kittler do line extraction using a pyramid structure [Princen et al., 1990]. At the lowest level they use the ordinary ρφ-HT on subimages for estimating small line segments. In the succeeding levels they use the additional local orientation information for grouping the segments.
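The one-to-one mapping from an edge point with measured local orientation to a single point (p, φ) of the normal parameterization can be sketched as follows (an illustrative sketch; the conventions for the orientation angle and for keeping p non-negative are assumptions):

```python
import numpy as np

def normal_parameters(x, y, orientation):
    """Map an edge point (x, y) with measured local orientation (radians,
    direction along the line) to the single normal-parameterization point
    (p, phi): phi is the argument of the normal vector, p its length.
    p is kept non-negative by flipping phi by pi when needed."""
    phi = orientation + np.pi / 2          # the normal is perpendicular to the line
    p = x * np.cos(phi) + y * np.sin(phi)  # signed distance from the origin
    if p < 0:
        p, phi = -p, phi + np.pi
    return p, phi % (2 * np.pi)
```

All points of one line (with correct orientation estimates) land on a single (p, φ) cluster, except near p = 0, where the sign flip splits a line through the origin into two clusters separated by π in φ, exactly the topological problem discussed next.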
Unfortunately, however, the normal parameterization has problems when p is small. The topology here is very strange: clusters can be divided into parts very far away from each other. Consider for example a line going through the origin in an xy-coordinate system. When mapping the coordinates according to the normal parameterization, two clusters will be produced, separated in the φ-dimension by π, see figure 2. Note that this will happen even if the orientation estimates are perfectly correct. A line will always have at least an infinitesimal thickness and will therefore be projected on both sides of the origin. A final point to note is that a translation of the origin outside the image plane will not remove this topological problem; it will only be transferred to other lines.
Granlund introduced a double angle notation [Granlund, 1978] in order to achieve a suitable continuous representation for local orientation. However, using this double angle notation for global lines removes the ability to distinguish between lines with the same orientation and distance, p, on opposite sides of the origin. The problem near p = 0 is removed, but unfortunately we have introduced another one: the two horizontal lines (marked a and c), located at the same distance p from the origin, are in the double angle normal parameterization mixed into one cluster, see figure 2.
It seems that we need a "double angle" representation around the origin and a "single
angle" representation elsewhere. This raises a fundamental dilemma: is it possible to
achieve a mapping that fulfills both the single angle and the double angle requirements
simultaneously?
We have been concerned with the problem that the normal parameterization spreads coordinates around the origin unsatisfactorily although they are located very close in the cartesian representation. Why do we not express the displacement vector, i.e. the normal vector to the line from the origin, in cartesian coordinates, (X, Y), since the topology there is satisfactory? This parameterization is defined by
X = p cos(φ), Y = p sin(φ) ,
where φ is as before the argument of the normal vector (the same as the displacement vector of the line).
Davis uses the ρφ-parameterization in this way by storing the information in a cartesian array [Davis, 1986]. This gives the (X, Y) parameterization. There are two reasons for not using this parameterization. First, the spatial resolution is very poor near the origin. Secondly, and worse, all lines having p equal to 0 will be mapped to the same cluster.
Fig. 2. A test image containing three lines and its transformation to the ρφ-domain, the normal parameterization of a line. The cluster from the line at 45° is divided into two parts. This mapping has topological problems near p = 0. The ρ2φ-domain, however, folds the space so the topology is good near p = 0, but unfortunately it is now bad elsewhere. The two horizontal lines, marked a and c, have in this parameter space been mixed into the same cluster.
The first problem, the poor resolution near the origin, can at least be solved by
mapping the XY-plane onto a logarithmic cone. That would stretch the XY-plane so
the points close to the origin get more space. However, the second problem still remains.
In this section we shall present a new parameter space and discuss its advantages with respect to the arguments of the previous section. The Möbius strip mapping is based on a transformation to a 4D space by taking the normal parameterization in figure 1, expressed in cartesian coordinates (X, Y), and adding a "double angle" dimension (consider the Z-axis in an XYZ-coordinate system). The problem with the cartesian normal parameterization is, as mentioned, that all clusters from lines going through the origin mix into one cluster. The additional dimension, ψ = 2φ, separates the clusters on the origin and close to the origin if the clusters originate from lines with different orientations. Moreover, the wrap-around requirement for ψ is ensured by introducing a fourth dimension, R.
The 4D-mapping:
X = x cos²(φ) + y cos(φ) sin(φ)
Y = y sin²(φ) + x cos(φ) sin(φ)
ψ = 2φ
R = R₀ ∈ ℝ⁺
The first two parameters, X and Y, form the normal vector in fig. 1, expressed in cartesian coordinates. The two following parameters, ψ and R, define a circle with radius R₀ in the Rψ-plane. Any R₀ > 0 is suitable. This gives an XYψ-system with wrap-around in the ψ-dimension.
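The 4D mapping can be sketched directly from the formulas above (numpy assumed; note that X and Y factor into p cos(φ) and p sin(φ) with p = x cos(φ) + y sin(φ)):

```python
import numpy as np

def mobius_map(x, y, phi, R0=1.0):
    """The 4D Mobius-strip mapping of an edge point (x, y) whose normal
    has argument phi. Returns (X, Y, psi, R); psi is the double-angle
    dimension with wrap-around, R is the constant radius R0."""
    X = x * np.cos(phi) ** 2 + y * np.cos(phi) * np.sin(phi)
    Y = y * np.sin(phi) ** 2 + x * np.cos(phi) * np.sin(phi)
    psi = (2 * phi) % (2 * np.pi)   # double-angle dimension, wraps around
    return X, Y, psi, R0
```

Because X = p cos(φ) and Y = p sin(φ), the mapping reduces to the cartesian normal parameterization in the XY-plane, with ψ supplying the extra dimension that separates lines of different orientation near the origin.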
In the mapping above, the parameters are dependent. As the argument of the vector in the XY-plane is φ and the fourth dimension is constant, it follows that for a specific (X, Y) all the parameters are given. Hence, the degree of freedom is limited to two, the dimension of the XY-plane. Thus, all the mapped image points lie in a 2D subspace of the 4D parameter space, see figure 3.
Fig. 3. The XY2φ parameter mapping. The wrap-around in the double angle dimension gives the interpretation of a Möbius strip
The 2D-surface
The regular form of the 2D-surface makes it possible to find a two-parameter form for the wanted mapping. Let us consider an ηψ-plane corresponding to the flattened surface in figure 3. Let
p² = X² + Y² = (x cos²(φ) + y cos(φ) sin(φ))² + (y sin²(φ) + x cos(φ) sin(φ))²
p = x cos(φ) + y sin(φ)
Then the (η, ψ) mapping can be expressed as
η = p for 0 ≤ φ < π, η = −p for π ≤ φ < 2π
ψ = 2φ
η is the variable "across" the strip, with value 0 meaning the position in the middle of the strip, i.e. on the 2φ axis. The wrap-around in the ψ dimension makes the interpretation that the surface is a Möbius strip easy, see figure 3.
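The two-parameter (η, ψ) form can be sketched as follows (numpy assumed; the handling of φ outside [0, 2π) is an assumption):

```python
import numpy as np

def eta_psi(x, y, phi):
    """Two-parameter form of the Mobius-strip surface for an edge point
    (x, y) whose normal has argument phi: eta = +/-p across the strip,
    psi = 2*phi with wrap-around."""
    phi = phi % (2 * np.pi)
    p = x * np.cos(phi) + y * np.sin(phi)   # signed distance from the origin
    eta = p if phi < np.pi else -p
    psi = (2 * phi) % (2 * np.pi)
    return eta, psi
```

For a line through the origin, the two sides (φ and φ + π) both give η ≈ 0 and the same ψ, so the cluster is no longer divided, while two parallel lines on opposite sides of the origin keep opposite signs of η.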
Finally, using the same test image as before, we can see that we can distinguish between the two lines on opposite sides of the origin, while at the same time the cluster corresponding to the line going through the origin is not divided, see figure 4.
4 Conclusion
The main contribution of this paper is the novel Möbius strip parameter mapping. The name, as mentioned above, reflects the topology of the ηψ parameter surface, its twisted wrap-around in the ψ-dimension. The proposed mapping has the following properties:
[Fig. 4: The test image of figure 2 mapped into the new parameter space.]
References
[Davis, 1986] Davis, E. R. (1986). Image space transforms for detecting straight edges in industrial parts. Pattern Recognition Letters, Vol 4:447-456.
[Duda and Hart, 1972] Duda, R. O. and Hart, P. E. (1972). Use of the Hough transform to detect lines and curves in pictures. Communications of the Association for Computing Machinery, 15.
[Duda and Hart, 1973] Duda, R. O. and Hart, P. E. (1973). Pattern classification and scene analysis. Wiley-Interscience, New York.
[Dudani and Luk, 1978] Dudani, S. A. and Luk, A. L. (1978). Locating straight-line edge segments on outdoor scenes. Pattern Recognition, 10:145-157.
[Granlund, 1978] Granlund, G. H. (1978). In search of a general picture processing operator. Computer Graphics and Image Processing, 8(2):155-178.
[Hough, 1962] Hough, P. V. C. (1962). A method and means for recognizing complex patterns. U.S. Patent 3,069,654.
[Illingworth and Kittler, 1988] Illingworth, J. and Kittler, J. (1988). A survey of the Hough transform. Computer Vision, Graphics and Image Processing, 44.
[Princen et al., 1990] Princen, J., Illingworth, J., and Kittler, J. (1990). A hierarchical approach to line extraction based on the Hough transform. Computer Vision, Graphics, and Image Processing, 52.
[Westin, 1991] Westin, C.-F. (1991). Feature extraction based on a tensor image description. LiU-Tek-Lic-1991:28, ISY, Linköping University, S-581 83 Linköping, Sweden. Thesis No. 288, ISBN 91-7870-815-X.
Edge tracing in a priori known direction
1 Introduction
Edge detection and tracing is a crucial problem in the area of digital image processing. An edge can be defined as a boundary between two homogeneous areas of different luminance. Local luminance changes and the edges corresponding to them are among the characteristic image features providing information necessary in the process of scene analysis and object classification [BB1], [Prl]. Most contour extraction algorithms consist of two basic steps: edge detection (sometimes with thresholding), and thinning and linking. They are efficient when applied to images of nearly homogeneous objects differing significantly from the background (e.g. tools, industrial parts, writing etc.) if the image is not contaminated by noise. When the level of noise increases, the obtained contours are often broken and deformed. That makes the process of interpretation and recognition more difficult. More sophisticated methods must then be implemented, e.g. incorporating a feedback path or a local edge enhancement [CS1]. However, all the universal edge tracing algorithms may still fail when applied to noisy images. In these cases, the use of a priori knowledge about the edge to be traced significantly facilitates construction of an appropriate tracing algorithm.
2 Edge Tracing
The first step of the algorithm is edge detection. It is performed by convolving an image (or a fragment of it) with the one mask which is the most sensitive to luminance changes in a chosen direction. The choice of this mask is based on our expectations about the searched edge direction (called here: the assumed edge direction). The obtained edges are then thinned. The edge tracing begins with finding a starting chain (edge fragment). Then the edge is traced, point after point. When a gap is encountered, a procedure for seeking and evaluating all the chains passing close to the current boundary point is activated. The chains are evaluated according to a criterion examining their usefulness for further tracing. The best chain, fulfilling also some threshold conditions, is accepted as a continuation of the broken boundary. This chain is then connected with the previously found boundary fragment and the tracing procedure continues. In the case when no chains are found (or when none of them fulfils the threshold conditions), searching for a new starting chain begins (it is led in the assumed edge direction).
2.1 Edge Detection and Thinning
In the present algorithm, the edge direction is quantized into one of eight directions. This is a common approach assuring good detection of edges in any direction. For detecting edges, a set of eight masks of size 3×3 pixels is used, as proposed in [Rol]. The edge direction (from 1 to 8) to which a given mask is the most sensitive is called here a mask direction. For a chosen image part, the one mask whose direction is the closest to the assumed edge direction is applied. The edge detection is carried out by convolving the considered image fragment with the appropriate mask. The result of this convolution is the edge magnitude image, having "lines" where edges previously existed. Since edges in real images are usually blurred over some area, and after convolution with a 3×3 mask this blurring increases further, the resulting lines are at least a few pixels wide. The point of the maximum edge magnitude on an edge cross-section is taken as a boundary point. Because of blurring and the presence of noise, a few local maxima can appear on this cross-section. It is then difficult to estimate whether these maxima originate from noise or from a few edges passing close to one another. A simple solution to this problem is to assume a minimum distance between two edges. If the distance between neighbouring maxima is shorter than this minimum, then the bigger maximum is considered as an edge point. This minimum distance should be kept as short as possible, to prevent attenuating the "weaker" edge by a "stronger" neighbouring one. It was fixed in the implementation that the minimum distance between edges cannot be shorter than three pixels. This also ensures that the obtained chains will not be branched.
Thinning of the previously detected edges is achieved in two passes. In the first one, all the points of the image fragment are analyzed in rows, from left to right. For each analyzed point, the following points (in a direction perpendicular to the assumed edge direction) are checked (see Fig. 1). If the gradient magnitude of the analyzed point is smaller than the gradient magnitude of one of the two following points, then it is set to zero. This procedure is repeated in the second pass, while moving from right to left.
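The two-pass thinning can be sketched for the case of a vertical assumed edge direction, where the perpendicular (thinning) direction is horizontal; the loop structure is an illustrative assumption (numpy assumed):

```python
import numpy as np

def thin_edges(magnitude):
    """Two-pass thinning for a vertical assumed edge direction.
    A point is zeroed when its gradient magnitude is smaller than that of
    one of its two following neighbours in the scan direction:
    pass 1 scans left-to-right, pass 2 right-to-left."""
    out = magnitude.astype(float).copy()
    rows, cols = out.shape
    for direction in (1, -1):               # pass 1 then pass 2
        for r in range(rows):
            cs = range(cols) if direction == 1 else range(cols - 1, -1, -1)
            for c in cs:
                for step in (1, 2):         # the two following points
                    nc = c + direction * step
                    if 0 <= nc < cols and out[r, c] < out[r, nc]:
                        out[r, c] = 0.0
                        break
    return out
```

A ridge a few pixels wide is reduced to its single strongest pixel per row, so the resulting chains are one pixel wide in the thinning direction.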
Fig. 1. Neighbourhoods of the analyzed point (x) checked during thinning, for the first (1) and the second (2) pass, for different mask directions
As a result of the thinning procedure, fragments of edges (called chains) are obtained. Their direction is close to the direction of the used edge detector mask and they are one pixel wide (measured in the thinning direction, perpendicular to the mask direction).
2.2 Edge Tracing
The first step of the edge tracing is searching for a starting chain. This procedure is analogous to the one for finding the best chain in a window, described below. The difference is that the starting chain must fulfil much stronger conditions concerning its length and average gradient magnitude. Also, the size of the search window is usually bigger in this case. Additional conditions, such as the position of the edge in relation to some characteristic points of the image, derive from the possessed a priori knowledge and must be defined separately for each implementation.
After finding the starting chain, the edge tracing begins from its first point (the starting point). For each already found edge point, the next point is searched for in a strictly defined neighbourhood. The choice of the analyzed neighbours depends on the direction of the edge detector mask and on the tracing direction (in accordance or in opposition to the mask direction - see Fig. 2).
Fig. 2. Neighbours analyzed in searching for the next edge point according to the mask direction
If the gradient of any of these neighbours is different from zero, it is accepted as the next edge point. The tracing procedure is continued until a margin of the analyzed area or a break in the traced edge is reached. In the latter case, searching for a new chain (which could be assumed as a continuation of the broken edge) is activated. The search is led in the area limited by a window whose center is the last found edge point. The window has a square shape and its sides are parallel to the image margins. Its size depends on possible local changes in the edge direction and on the distance from a possible "strong" edge.
Fig. 3. The way of calculating parameters l and α when the mask of direction 1 or 5 was used
In the window, all the chains originating in it are searched for (except for the already found boundary). All the window points are checked along lines perpendicular to the mask direction, starting from one of the corners. For each point whose gradient magnitude is higher than zero, all its neighbours from the previously checked line are analyzed. If
their gradient equals zero, then the next (in this window) chain number is assigned to this point and its coordinates are stored. Thanks to the use of one edge detector mask and the described thinning procedure, it is not possible that chains beginning in two separate points will join into one chain. Thus, after the whole window has been analyzed, the number of chains beginning in it and their starting coordinates are known. Afterwards, their properties are analyzed from the point of view of their usefulness for tracing the continuation of the broken edge. In order to attain this, each chain is traced from its (already known) starting point to the end (but not further than a given maximum distance). The total gradient magnitude of all its points and its length (the number of pixels) are counted, as well as the average gradient magnitude and the deviation from the assumed direction. For counting this deviation, only the coordinates of the first and the last point of the chain are used. The best chain is chosen from among the chains satisfying specified threshold conditions (concerning their length and average gradient magnitude). The best chain is the one maximizing a given criterion. An exemplary criterion can be as follows:
Q = a₁ (n / n_max) + a₂ (m / m_max) + a₃ (G / G_max) + a₄ (1 − |tg α|)   (1)
where n is the chain length (the number of pixels), m the average gradient magnitude, G the total gradient magnitude, α the deviation from the assumed edge direction, a₁, ..., a₄ weight coefficients, and the subscript max denotes the maximum over the chains in the window.
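A chain-scoring function of this form can be sketched as follows; the function name and the default weights are hypothetical, and the quantities follow the description in the text (length, average and total gradient magnitude, deviation angle):

```python
import math

def chain_quality(n, m, g, alpha, n_max, m_max, g_max, a=(1.0, 2.0, 1.0, 1.0)):
    """Criterion of the form of eq. (1) scoring a candidate chain:
    n - chain length in pixels, m - average gradient magnitude,
    g - total gradient magnitude, alpha - deviation from the assumed
    edge direction (radians). Each ratio is normalized by its maximum
    over the window; the weights a are hypothetical defaults."""
    a1, a2, a3, a4 = a
    return (a1 * n / n_max + a2 * m / m_max + a3 * g / g_max
            + a4 * (1.0 - abs(math.tan(alpha))))
```

Among chains passing the threshold conditions, the one with the highest Q is accepted as the continuation of the broken boundary.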
3 Experimental Results
The presented edge tracing algorithm was applied to find the outlines of the outer fat layer of halved pig carcasses. It enables finding the maximum width of this layer as well as other parameters necessary for meat classification. The resolution of the analyzed images was 512×512 pixels with 256 grey levels. The images were additionally low-pass filtered (with a 3×3 mask). For edge detection, the masks detecting vertical edges were used. The summed-up result of applying both masks is shown in Fig. 4.b. After about 100 images had been analyzed, the following parameter values were set:
- the minimum length of the best chain in a window was 10 pixels,
- the window size was 21 pixels (41 by 21 for a starting chain),
- the weight coefficients a₁, a₃, a₄ were established so that the maximum of each product in (1) was 1 (only a₂ was set bigger).
Exemplary results are shown in Fig. 4. Edges obtained after detection and thinning (Fig. 4.b) are broken in many points. This causes frequent searching for a chain which could be accepted as a continuation of a broken boundary. In the places where chains join, local changes in the outline direction can appear (Fig. 4.c). These outlines, after smoothing, were put on the real image (see Fig. 4.a). On the basis of the obtained outlines, characteristic parameters of the fat layer were calculated. (Areas of interest are marked with horizontal lines.)
Fig. 4. The result of applying the edge tracing algorithm to find the outlines of the outer fat layer of a halved pig carcass: a) smoothed outlines shown on the original image, b) result of applying masks detecting vertical edges, c) obtained outlines of the fat layer
4 Summary
A simple edge tracing algorithm was presented here. Applying universal edge tracing algorithms is unjustified when analyzing images about which some a priori knowledge is accessible: the computational complexity of these algorithms is usually high and their efficiency is low, especially for noisy images. The presented algorithm (based on a priori knowledge about the analyzed scene) is efficient even when applied to noisy images and distorted edges. However, for each implementation it requires setting the values of all parameters as well as defining additional conditions (characteristic for a particular implementation) simplifying the tracing procedure.
References
[BB1] Ballard, D., Brown, C.: Computer Vision. Englewood Cliffs, NJ: Prentice-Hall, 1982
[Prl] Pratt, W.: Digital Image Processing. New York: Wiley-Interscience, 1978
[CS1] Chen, B., Siy, P.: Forward/backward contour tracing with feedback. IEEE Trans. Pattern Anal. Machine Intell., vol. PAMI-9, no. 3, pp. 438-446, May 1987
[Rol] Robinson, G.: Edge detection by compass gradient masks. Comput. Graphics Image Processing, vol. 6, pp. 482-492, 1977
Features Extraction and Analysis Methods for Sequences of Ultrasound Images
1 Introduction
1.1 Motivation and Objectives
There is a continuously increasing demand for the automated analysis of 2D and 3D medical images at the hospital [1]. Among these images, ultrasound images play a crucial role, because they can be produced at video-rate and therefore allow a dynamic analysis of moving structures. Moreover, the acquisition of these images is non-invasive and the cost of acquisition is relatively low compared to other medical imaging techniques.
On the other hand, the automated analysis of ultrasound images is a real challenge for active vision, because it combines most of the difficult problems encountered in computer vision in addition to some specific ones related to the acquisition mode:
- images are usually provided in polar geometry instead of cartesian geometry,
- images are degraded by a very high level of corrupting noise,
- observed objects usually correspond to non-static, non-polyhedric and non-rigid structures.
The geometric transformation (called scan correction) which transforms the data from
a polar representation to the correct cartesian representation is usually applied through
a bilinear interpolation.
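The baseline scan correction can be sketched as follows; this is only an illustration of the usual bilinear scheme (numpy assumed), with the sector geometry conventions (beam angles spanning [0, π], depth axis downwards) chosen for the example rather than taken from the paper:

```python
import numpy as np

def scan_convert(polar, r_max, out_size):
    """Scan correction by bilinear interpolation: resample a polar image
    polar[m, l] (M beam angles spanning [0, pi], L range samples per beam)
    onto an out_size x out_size cartesian grid."""
    M, L = polar.shape
    ys, xs = np.mgrid[0:out_size, 0:out_size]
    x = (xs - out_size / 2) / (out_size / 2) * r_max   # lateral position
    y = ys / out_size * r_max                          # depth from the probe
    r = np.hypot(x, y)
    theta = np.arctan2(y, x)                           # beam angle in [0, pi]
    mf = theta / np.pi * (M - 1)                       # fractional beam index
    lf = r / r_max * (L - 1)                           # fractional range index
    m0 = np.clip(mf.astype(int), 0, M - 2)
    l0 = np.clip(lf.astype(int), 0, L - 2)
    dm, dl = mf - m0, lf - l0
    out = ((1 - dm) * (1 - dl) * polar[m0, l0]
           + dm * (1 - dl) * polar[m0 + 1, l0]
           + (1 - dm) * dl * polar[m0, l0 + 1]
           + dm * dl * polar[m0 + 1, l0 + 1])
    return np.where(lf <= L - 1, out, 0.0)             # zero outside the sector
```

Note how a single interpolation kernel is applied everywhere: near the probe many cartesian pixels draw on few polar samples, and far from it the opposite holds, which is precisely the varying-resolution limitation discussed next.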
We show in this paper the limitations of this scheme, which does not account for the varying resolution of the data, and we propose a new method, called sonar-space filtering, which consists of computing the scan conversion with a low-pass filtering of the cartesian image applied directly to the available polar data, and which can be used to optimally reconstruct the data with a chosen level of spatial linear filtering.
Furthermore, we develop a methodology to automatically track a physiological structure in an echocardiographic sequence. Interactivity is used to initiate the process on the first image of the sequence. Then edges are computed and an approximate segmentation of the structure is obtained by using deterministic algorithms. This information is finally combined with a deformable model to obtain the temporal tracking of the pre-selected structure.
1.2 Previous work
Our approach differs from previous ones in that we study the ultrasonic data
directly. More commonly, feature extraction is applied to the
cartesian video data. To our knowledge, there is only one study where all processing
is performed on sector scans in polar coordinate form. This was published by Taxt [13]
and reports noise reduction and segmentation in time-varying ultrasound images. But a
comparative study of scan correction methods to obtain cartesian images has apparently
not been pursued yet. For cartesian images, the most commonly used approach to obtain
the contour of the left ventricle (in echocardiography) is radial search [5] [2] [7]: the procedure
starts from a point inside the heart chamber and searches along different radial lines for
edge points. The best-known dynamic approach is the one by Zhang and Geiser [15], who
compute temporal cooccurrences to obtain both stationary points and moving points. The
temporal information has also been used to filter images obtained at the same instant of
the cardiac cycle [14].
[Figure: block diagram of the echograph (transmit/receive stages, position signal, signal processor, display).]
1 Other types of echographs using pseudo-random code correlation are studied in the litera-
ture [11].
The process of converting from the polar coordinate representation to the cartesian
coordinate representation is necessary for the convenience of the users. Physicians are
accustomed to viewing images in cartesian coordinates and it would be difficult for them to
interpret polar data. Moreover, visualization hardware and image processing algorithms
are designed for data in cartesian coordinates.
Let us suppose that M different orientations are used to obtain an echocardiographic
image, and that each return signal is digitized to L points. Fig. 2 shows an echographic
image, with M rows and L columns, obtained with a commercial echographic machine,
providing an image represented in polar coordinates. Fig. 3 shows the cartesian image
corresponding to the same data.
Scan conversion requires the knowledge of the following set of parameters (see Fig. 4):
- the angular extent α of the data acquisition wedge,
- the minimal distance d for data acquisition,
- the total distance D for data acquisition (these distances being calculated from the skin),
and
- the number of rows, N, desired in the output cartesian image (the number of columns
will be derived from α, assuming square pixels).
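As an illustration, the conventional conversion driven by these parameters can be sketched as follows. The geometry (transducer apex centred above the image, depth measured from the skin, zero outside the wedge) and the variable names are assumptions made for the sketch, not the machine's exact conventions:

```python
import numpy as np

def scan_convert_bilinear(polar, alpha, d, D, N):
    """Classical scan conversion: every cartesian pixel takes the bilinear
    interpolation of its four nearest polar samples. polar has M rows (scan
    lines) and L columns (samples along a line), as in the text."""
    M, L = polar.shape
    pix = D / (N - 1)                                       # square pixels
    W = int(np.ceil(2 * D * np.sin(alpha / 2) / pix)) + 1   # width implied by alpha
    x0 = (W - 1) / 2.0                                      # apex (transducer) column
    ys, xs = np.mgrid[0:N, 0:W]
    r = np.hypot((xs - x0) * pix, ys * pix)                 # distance from the apex
    th = np.arctan2((xs - x0) * pix, ys * pix)              # angle from the central axis
    rho = (r - d) * (L - 1) / (D - d)                       # fractional sample index
    ti = (th + alpha / 2) * (M - 1) / alpha                 # fractional scan-line index
    valid = (rho >= 0) & (rho <= L - 1) & (ti >= 0) & (ti <= M - 1)
    r0 = np.clip(np.floor(rho).astype(int), 0, L - 2)
    t0 = np.clip(np.floor(ti).astype(int), 0, M - 2)
    fr, ft = rho - r0, ti - t0
    interp = ((1 - ft) * (1 - fr) * polar[t0, r0] + (1 - ft) * fr * polar[t0, r0 + 1]
              + ft * (1 - fr) * polar[t0 + 1, r0] + ft * fr * polar[t0 + 1, r0 + 1])
    return np.where(valid, interp, 0.0)
```

For each cartesian pixel the sketch computes its polar coordinates, checks that they fall inside the acquired wedge, and bilinearly interpolates the four surrounding polar samples.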
[Figure: geometry of the data acquisition wedge (transducer at the apex).]
Several methods may be used for the conversion process. Usually, the video image on
the echographic machine is obtained by assigning to a cartesian point the grey level of
the nearest available point in polar coordinates, or the value of the bilinear interpolation
of its four nearest points. In fact, we found that these methods do not make an optimal
use of the available original data, and we introduced a new method, called sonar-space
filtering, which can be used to optimally reconstruct the data with a chosen level of
spatial linear filtering.
However, the input is only available in the polar coordinate space. We thus apply the
following change of variables:

ρ = √((x - x0)² + y²) · e - Δρ,   θ = (1/Δθ) arctan((x - x0)/y)

where:
- Δρ = d (L - 1)/(D - d) represents the distance from the surface of the skin where the
acquisition process begins, measured in pixel units along a scan line of the raw data,
- Δθ = α/(M - 1) is the angular difference between two successive angular positions of
the probe,
- e = (D/(D - d)) · ((L - 1)/(N - 1)) performs the change of pixel sampling rates along the axial
direction of the beam, according to the desired height N of the cartesian image.
We obtain:

Î(x, y) = ∫∫ f(x - x(ρ, θ), y - y(ρ, θ)) I(ρ, θ) |J(ρ, θ)| dρ dθ

Here |J(ρ, θ)| is the determinant of the Jacobian matrix corresponding to the inverse
transformation of variables:

|J(ρ, θ)| = ((ρ + Δρ)/e²) · Δθ
The filter is also sampled in this domain in order to approximate the integral by
a discrete summation. Filtered numerical outputs are evaluated at original data point
locations within the continuous domain.
We thus obtain the following equation:

Î(x, y) = (1/C) Σk f(x - xk, y - yk) I(ρk, θk) |J(ρk, θk)|

where the summation is over the discrete collection of points (ρk, θk) in polar coordinates,
where C is used to normalize the data and:

C = Σk f(x - xk, y - yk) |J(ρk, θk)|
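A brute-force sketch of this reconstruction, under the same illustrative geometry assumptions as the bilinear scheme: each polar sample contributes to the cartesian image with a Gaussian weight (standing in for the chosen low-pass filter f) multiplied by the Jacobian term, and the result is normalised by C. A practical implementation would restrict each sample to a small window instead of looping over the whole image:

```python
import numpy as np

def sonar_space_filter(polar, alpha, d, D, N, sigma):
    """Sonar-space filtering sketch: the cartesian image is a Gaussian- and
    Jacobian-weighted average of the raw polar samples, normalised by C."""
    M, L = polar.shape
    pix = D / (N - 1)
    W = int(np.ceil(2 * D * np.sin(alpha / 2) / pix)) + 1
    x0 = (W - 1) / 2.0
    # cartesian position (in output pixels) of every polar sample (rho_k, theta_k)
    rho, th = np.meshgrid(np.arange(L), np.linspace(-alpha / 2, alpha / 2, M))
    r = (d + rho * (D - d) / (L - 1)) / pix
    xk, yk = x0 + r * np.sin(th), r * np.cos(th)
    jac = r * alpha / (M - 1)                 # polar cell area, up to a constant
    ys, xs = np.mgrid[0:N, 0:W]
    num = np.zeros((N, W))
    den = np.zeros((N, W))
    for k in range(M * L):                    # brute force over all samples
        w = jac.flat[k] * np.exp(-((xs - xk.flat[k]) ** 2 + (ys - yk.flat[k]) ** 2)
                                 / (2 * sigma ** 2))
        num += w * polar.flat[k]
        den += w
    return np.where(den > 1e-12, num / den, 0.0)
```

The filter width sigma (in output pixels) plays the role of the chosen level of spatial linear filtering: it can be set according to the locally available resolution of the polar data.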
3.2 Visualization application
3.3 Edge detection application
For further automatic boundary tracking, our goal is to use spatio-temporal ap-
proaches [10]. A time-varying edge may be represented as a surface in 3-D space, in
which x and y are two spatial dimensions (in the cartesian coordinate space) and t is
the temporal dimension. We modify Deriche's edge detector for this goal. Another ap-
proach could be to generalize Deriche's detector with spatio-temporal functions as in [6].
We denote Gx and Gy the two spatial components of the gradient vector and I(x, y, t)
the 3-dimensional grey level function. Let D be the Deriche differentiation filter and L
the associated smoothing filter:

Gx = (Dx Ly Lt) I,   Gy = (Lx Dy Lt) I

where the subscripts are used to indicate along which axis the corresponding filter is
applied. Each component is obtained by differentiation in the associated direction and
filtering in the other spatial direction and in the temporal direction.
The norm of the gradient is defined by:

|G|(x, y, t) = √(Gx² + Gy²).
The edges are obtained as local maxima of the gradient norm in the direction of the 2D
gradient vector. The temporal dimension is only used to smooth the result. This produces
a significant image enhancement in regions that are not moving too fast.
We denote αx, αy and αt the filtering parameters of the Deriche filters (cf. Equ. 5
and 6) for the respective axes x, y and t. Since the 2D space is homogeneous, we can
choose αx = αy. The value of αt is independent and must be chosen according to the
temporal resolution.
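The separable scheme above can be sketched as follows; Gaussian kernels and central differences stand in for the recursive Deriche filters D and L (an assumption of the sketch, not the paper's filters):

```python
import numpy as np

def smooth(a, axis, sigma):
    """1-D Gaussian smoothing along one axis (zero-padded)."""
    u = np.arange(-int(3 * sigma) - 1, int(3 * sigma) + 2)
    k = np.exp(-u ** 2 / (2.0 * sigma ** 2))
    k /= k.sum()
    return np.apply_along_axis(lambda v: np.convolve(v, k, mode='same'), axis, a)

def spatio_temporal_edges(seq, a_s=1.0, a_t=1.0):
    """For a sequence seq of shape (T, Y, X): each spatial gradient component
    differentiates along its own axis and is smoothed along the other spatial
    axis and along time; the temporal axis is only smoothed, as in the text."""
    s = smooth(seq, 0, a_t)                       # temporal smoothing  (L_t)
    gx = np.gradient(smooth(s, 1, a_s), axis=2)   # D_x applied after L_y L_t
    gy = np.gradient(smooth(s, 2, a_s), axis=1)   # D_y applied after L_x L_t
    return gx, gy, np.hypot(gx, gy)               # |G|(x, y, t) = sqrt(Gx^2 + Gy^2)
```

Edges are then found as local maxima of the returned norm in the direction of the 2D gradient, frame by frame.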
4 Temporal tracking
At this stage, we assume that we will work on ultrasound cartesian images and
on edges represented in a cartesian space, whatever the methods used to obtain this
information.
Our objective is to perform temporal tracking of a pre-selected anatomical structure
by combining different kinds of information. First, we want to obtain an approximate
segmentation of the structure by using simple deterministic processing. Secondly we
want to use the edges computed directly on the raw data. These will be combined by a
regularisation process that takes an initial segmentation and deforms it from its initial
position to make it better conform to the pre-detected edges. This approach is the idea
behind the use of deformable models.
- A first order opening eliminates the small bright structures on dark background.
- The dual operation (first order closing) suppresses the small dark structures.
After these operations, a simple thresholding gives an image C where all the cardiac
cavities are represented in white. This detection can be refined by the use of higher level
information. The specialist points out, using a computer mouse, the chosen cavity on the
first image of the sequence. The whole cavity is then obtained by a conditional dilation
which begins at this point.
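This initialisation step can be sketched with scipy.ndimage; the structuring-element size, the threshold, and the assumption that cavities appear dark in the raw image are illustrative choices:

```python
import numpy as np
from scipy import ndimage as ndi

def select_cavity(img, seed, size=3, thresh=128):
    """Morphological initialisation sketch: a grey-level opening removes small
    bright structures, the dual closing removes small dark ones, thresholding
    yields a binary image C of candidate cavities, and a conditional dilation
    (binary propagation) started at the mouse-selected seed keeps only the
    chosen cavity."""
    f = ndi.grey_opening(img, size=(size, size))   # suppress small bright structures
    f = ndi.grey_closing(f, size=(size, size))     # suppress small dark structures
    C = f < thresh                                 # cavities assumed dark in the raw data
    marker = np.zeros_like(C)
    marker[seed] = C[seed]                         # seed given by the specialist's mouse click
    return ndi.binary_propagation(marker, mask=C)  # conditional dilation from the seed
```

Only the connected component of C containing the seed survives, which is exactly the behaviour of a conditional dilation iterated to convergence.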
4.2 Use of a deformable model
The previous operations usually provide an approximately-correct but locally-
inaccurate positioning of the structure boundaries. In order to improve this crude seg-
mentation to an accurate determination of the boundaries, we use the deformable models
of [3], in the spirit of [8].
The deformable model is initialized in the first image by the crude approximation of
the structure boundary. It evolves under the action of image forces, which are counter-
balanced by its own internal forces to preserve its regularity. Image forces are computed
as the derivative of an attraction potential related to the previously computed spatio-
temporal edges. Typically, the potential is inversely proportional to the distance to the
nearest edge point.
Deformable models may be used independently on each frame or iteratively on the
sequence: once the model has converged in the first frame, its final position is used as
the initial one in the next frame, and the process is repeated.
For slowly moving structures, the result obtained in one frame is used to initialize
the next. The parameters are the same and the
process is repeated sequentially through all frames, as can be seen in Fig. 12.
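The sequential scheme can be sketched as follows; the distance-based potential (the text uses an inverse-distance potential, which has the same attracting minima on the edges), the midpoint-based internal force and all coefficients are simplifying assumptions rather than the finite-element model of [3]:

```python
import numpy as np
from scipy import ndimage as ndi

def track_sequence(edge_maps, contour, alpha=0.2, tau=0.5, steps=200):
    """Sequential tracking sketch: the contour that converged in one frame
    initialises the next. contour is an array of (row, col) snake points;
    each edge map is boolean."""
    results = []
    for edges in edge_maps:
        pot = ndi.distance_transform_edt(~edges)      # potential: 0 on the edges
        gy, gx = np.gradient(pot)
        for _ in range(steps):
            idx = tuple(np.round(contour).astype(int).T)
            force = -np.stack([gy[idx], gx[idx]], axis=1)        # image force
            mid = 0.5 * (np.roll(contour, 1, 0) + np.roll(contour, -1, 0))
            contour = contour + tau * (alpha * (mid - contour) + force)
            contour = np.clip(contour, 0, np.array(edges.shape) - 1)
        results.append(contour.copy())
    return results
```

Passing a single-frame list reproduces the per-frame mode used for fast structures; passing the full sequence reproduces the sequential mode.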
For structures moving fast (the mitral valve), deformable models are applied indepen-
dently on each frame, and the results of these applications may be seen in Fig. 11.
We summarize the advantages of using deformable models to analyze echocardio-
graphic data:
The methods presented in this paper were applied to four different sequences obtained
from two different echographs. The data presented here were obtained in polar coor-
dinate form on a VIGMED echograph at the Henri Mondor hospital in Créteil, France. A
sequence contains 38 images from a cardiac cycle. Fig. 9 shows a cartesian representation
of the original data. (Only one image in four is displayed.) The left heart cavities (auricle
and ventricle) and the mitral valve are visible in a typical image. Our aim is to track
them. Tracking of the mitral valve is successfully achieved in this example because the
edges were obtained from sonar-space filtering. Other methods (bilinear
interpolation followed by edge detection) generally do not give accurate edges for the
deep structures and therefore cannot be used for further temporal tracking. Edges are
shown in Fig. 10, and temporal tracking is presented in Figs. 11 and 12.
7 Conclusions
We have presented a new scan conversion method, sonar-space filtering, with which the quality of the images from which edges
are extracted can be enhanced. This approach is more flexible because it allows a variable
level of smoothing to be chosen according to the actual resolution of the original data.
This is not the case when an additional smoothing is required after a conversion by other
algorithms. We showed the enhancement produced on edge detection by our approach.
Finally, we demonstrated the effectiveness of this approach by solving a complete
application. We used morphological operators to initialize a deformable model in the first
image of a time sequence. Then we applied our edge detector and we let the deformable
model converge toward the detected edges. Using the solution as an initialization in the
following image, we tracked the left auricle boundary in a sequence of 38 images.
Our future research will concentrate on the generalization of these methods to be
applied to 3-D ultrasound images produced in spherical coordinates.
8 Acknowledgements
We gratefully acknowledge Gabriel Pelle (INSERM, CHU Henri Mondor, France)
for providing the data and for helpful discussions, and Robert Hummel for a significant
improvement of the final manuscript.
This work was partially supported by MATRA Espace and Digital Equipment Cor-
poration.
References
1. N. Ayache, J.D. Boissonnat, L. Cohen, B. Geiger, J. Levy-Vehel, O. Monga, and P. Sander.
Steps toward the automatic interpretation of 3-D images. In H. Fuchs, K. Höhne, and
S. Pizer, editors, 3D Imaging in Medicine, pages 107-120. NATO ASI Series, Springer-
Verlag, 1990.
2. A.J. Buda, E.J. Delp, J.M. Meyer, J.M. Jenkins, D.N. Smith, F.L. Bookstein, and B. Pitt.
Automatic computer processing of digital 2-dimensional echocardiograms. In Amer. J.
Cardiol., volume 51, pages 383-389, 1983.
3. L.D. Cohen and I. Cohen. A finite element method applied to new active contour models
and 3D reconstruction from cross sections. In Proceedings of the International Conference
on Computer Vision, Osaka, Japan, December 1990.
4. R. Deriche. Using Canny's criteria to derive a recursively implemented optimal edge de-
tector. International Journal of Computer Vision, 1(2):167-187, May 1987.
5. F. Faure, J.P. Gambotto, G. Montserrat, and F. Patat. Space medical facility study. Tech-
nical report, ESA, 1988. final report, 6961/86/NL/PB.
6. T. Hwang and J.J. Clark. A spatio-temporal generalization of Canny's edge detector. In
10th International Conference on Pattern Recognition, Atlantic City, New Jersey, USA,
June 1990.
7. J.M. Jenkins, O. Qian, M. Besozzi, E.J. Delp, and A.J. Buda. Computer processing of
echocardiographic images for automated edge detection of left ventricular boundaries. In
Computers in Cardiology, volume 8, 1981.
8. Michael Kass, Andrew Witkin, and Demetri Tersopoulos. Snakes: Active contour models.
In Proceedings of the First International Conference on Computer Vision, pages 259-268,
London, June 1987.
9. A. Macovski. Medical Imaging Systems. Prentice Hall, 1983.
10. O. Monga and R. Deriche. 3D edge detection using recursive filtering: application to scan-
ner images. Technical Report 930, INRIA, November 1988.
11. V.L. Newhouse. Progress in Medical Imaging. Springer Verlag, 1988.
12. J. Serra. Image analysis and mathematical morphology. Academic Press, London, 1982.
13. T. Taxt, A. Lundervold, and B. Angelsen. Noise reduction and segmentation in time-
varying ultrasound images. In 10th International Conference on Pattern Recognition, At-
lantic City, New Jersey, USA, June 1990.
14. M. Unser, L. Dong, G. Pelle, P. Brun, and M. Eden. Restoration of echocardiograms using
time warping and periodic averaging on a normalized time scale. In Medical Imaging,
number III, 1989. January 29 - February 3, Newport Beach.
15. L.F. Zhang and E.A. Geiser. An approach to optimal threshold selection on a sequence
of two-dimensional echocardiographic images. In IEEE Transactions on Biomedical Engi-
neering, volume BME-29, August 1982.
The problem of separating figure from ground is a central one in computer vision. One
aspect of this problem is the problem of separating shape from noise. Two-dimensional
shapes are the input data of high-level visual processes such as recognition. In order to
keep the complexity of recognition as low as possible it is important to determine at
an early level what is shape and what is noise. Therefore one needs a definition of shape,
a definition of noise, and a process that takes image elements as input and separates
them into shape and noise.
In this paper we suggest an approach whose goal is twofold: (i) it groups image
elements that are likely to belong to the same (locally circular) shape while (ii) noisy
image elements are eliminated. More precisely, the method that we devised builds a cost
function over the entire image. This cost function sums up image element interactions and
has two terms: the first enforces the grouping of image elements into shapes and
the second enforces noise elimination. Therefore the shape/noise discrimination problem
becomes a combinatorial optimization problem, namely the problem of finding the global
minimum for the cost function just described. In theory, the problem can be solved by
any combinatorial optimization algorithm that is guaranteed to converge towards the
global minimum of the cost function.
In practice, we implemented three combinatorial optimization methods: simulated
annealing (SA), mean field annealing (MFA), and microcanonical annealing (MCA) [3].
Here we concentrate on mean field annealing.
The figure-ground or shape/noise separation is best illustrated by an example. Fig. 1
shows a synthetic image. Fig. 2 shows the image elements that were labelled "shape" by
the mean field annealing algorithm.
Interest in shape/noise separation stems from the Gestalt psychologists' figure-
ground demonstrations [6]: certain image elements are organized to produce an emergent
figure. Ever since, the figure-ground discrimination problem has been seen as a side effect
of feature grouping. Edge detection is in general performed first. Edge grouping is done
* This research has been sponsored in part by "Commissariat à l'Energie Atomique," in part by
the ORASIS project, and in part by CEC through the ESPRIT-BRA 3274 (FIRST) project.
Fig. 1. A synthetic image with 1250 elements. Circles, a straight line, and a sinusoid are immersed in randomly generated elements.
Fig. 2. The result of applying mean field annealing to the synthetic image. The elements shown in this figure were labelled "shape".
We consider a particular class of combinatorial optimization problems for which the cost
function has a mathematical structure analogous to the global energy of a complex
physical system, that is, an interacting spin system. First, we briefly describe the state of
such a physical system and give the mathematical expression of its energy. We also show
the analogy with the energy of a recursive neural network. Second, we suggest that the
figure-ground discrimination problem can be cast into a global optimization problem of
the type mentioned above.
The state of an interacting spin system is defined by: (i) A spin state-vector of N
elements σ = [σ1, ..., σN] whose components are described by discrete labels which cor-
respond to up or down Ising spins: σi ∈ {-1, +1}. The components σi may well be viewed
as the outputs of binary neurons. (ii) A symmetric matrix J describing the interactions
between the spins. These interactions may well be viewed as the synaptic weights be-
tween neurons in a network. (iii) A vector δ = [δ1, ..., δN] describing an external field in
which the spins are placed.
Therefore, the interacting spin system has a "natural" neural network encoding asso-
ciated with it which describes the microscopic behaviour of the system. A macroscopic
description is given by the energy function which evaluates each spin configuration. This
energy is given by:

E(σ1, ..., σN) = -(1/2) Σi Σj Jij σi σj - Σi δi σi,   i, j = 1, ..., N   (1)
The main property of interacting spin systems is that at low temperatures the number
of local minima of the energy function grows exponentially with the number of spins.
Hence the correspondence between the mathematical model of interacting spin systems and
combinatorial optimization problems with many local minima is natural.
We consider now N image elements. Each such element has a label associated with
it, pi, which can take two values: 0 or 1. The set of N labels forms the state vector
p = [p1, ..., pN]. We seek a state vector such that the "shape" elements have a label
equal to 1 and the "noise" elements have a label equal to 0. If cij designates an interaction
between elements i and j, one may write by analogy with physics an interaction energy:

Esaliency(p) = - Σi Σj cij pi pj

Obviously, the expression above is minimized when all the labels are equal to 1. In
order to avoid this trivial solution we introduce the constraint that some of the elements
in the image are not significant and therefore should be labelled "noise":

Econstraint(p) = (Σi pi)²
The function to be minimized could then be the sum of these energies:

E(p) = Esaliency(p) + Econstraint(p)
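For concreteness, the cost of a candidate labelling can be evaluated directly; the weight lam balancing the two terms is an assumption of the sketch (lam = 1 corresponds to the plain sum):

```python
import numpy as np

def cost(p, c, lam=1.0):
    """Cost of a candidate labelling p (0/1 per element) given the interaction
    matrix c: the saliency term rewards strongly interacting elements that are
    both labelled 1; the constraint term penalises labelling everything 1."""
    saliency = -p @ c @ p            # -sum_ij c_ij p_i p_j
    constraint = p.sum() ** 2        # (sum_i p_i)^2
    return saliency + lam * constraint
```

A labelling that keeps only the strongly interacting ("shape") elements scores lower than the trivial all-ones labelling, which is exactly what the combinatorial search exploits.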
An image array contains two types of information: changes in intensity and local geome-
try. Therefore the choice of the image elements mentioned so far is crucial. Edge elements,
or edgels, are the natural candidates for making explicit the two pieces of information
just mentioned.
An edgel can be obtained by one of the many edge detectors now available. An edgel
is characterized by its position in the image (xi, yi) and by its gradient, computed once
the image has been low-pass filtered. The x and y components of the gradient vector are:
Let i and j be two edgels. We want the interaction between these two edgels to en-
capsulate the concept of shape. That is, if i and j belong to the same shape then their
interaction is high. Otherwise their interaction is low. Notice that a weak interaction
between two edgels has several interpretations: (i) i belongs to one shape and j belongs
to another one, (ii) i belongs to a shape and j is noise, or (iii) both i and j are noise.
The interaction coefficient must therefore be a co-shapeness measure. In our approach,
co-shapeness is defined by a combination of cocircularity, proximity, and contrast.
The definition of cocircularity is derived from [8] and it constrains the shapes to be
as circular as possible, or as a special case, as linear as possible. Proximity restricts
the interaction to occur between nearby edgels. As a consequence, cocircularity is
constrained to be a local shape property. The combination of cocircularity and proximity
will therefore allow a large variety of shapes that are circular (or linear) only locally.
Contrast enforces edgels with a high gradient modulus to have a higher interaction coefficient
than edgels with a low gradient modulus.
Following [8] and from Fig. 3 it is clear that two edgels belong to the same circle if
and only if: λi + λj = π. In this formula, λi is the angle made by one edgel with the line
joining the two edgels. Notice that a circle is uniquely defined if the relative positions
and orientations of the two edgels verify the equation above. This equation is also a local
symmetry condition consistent with the definition of local symmetry of Brady & Asada.
Moreover, linearity appears as a special case of cocircularity, namely when λi = 0 and
λj = π.
From this cocircularity constraint we may derive a weaker constraint which will mea-
sure the closeness of a two-edgel configuration to a circular shape: Λij = |λi + λj - π|.
Λij will vary between 0 (a perfect shape) and π (no shape). Finally the cocircularity
coefficient is allowed to vary between 1 for a circle and 0 for noise and is defined by the
formula: cij^cocir = (1 - Λij²/π²) exp(-Λij²/k). The parameter k is chosen such that the
cocircularity coefficient vanishes rapidly for non-circular shapes.
The surrounding world does not consist only of circular shapes. Cocircularity must
therefore be a local property. That is, the class of shapes we are interested in detecting at a
given scale of resolution are shapes that can be approximated by a sequence of smoothly
connected circular arcs and straight lines. The proximity constraint is best described by
multiplying the cocircularity coefficient by a coefficient that vanishes smoothly as the
two edgels get farther away from each other: cij^prox = exp(-dij²/(2σd²)), where dij is the
distance between the two edgels and σd is the standard deviation of these distances over
the image. Hence, the edgel interaction will adjust itself to the image distribution of the
edgel population.
A classical approach to figure-ground discrimination is to compare the gradient value
at an edgel against a threshold and to eliminate those edgels that fall under this threshold.
An improvement of this simple-minded nonlinear filtering is to consider two thresholds
such that edgel connectivity is better preserved [1]. Following the same idea, selection of
shapes with high contrast can be enforced by multiplying the interaction coefficient by
a term whose value depends on contrast: cij^contrast = gi gj / gmax², where gmax is the highest
gradient value over the edgel population. Finally the interaction coefficient between two
edgels becomes:

cij = cij^cocir · cij^prox · cij^contrast
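The three coefficients can be combined as follows; the angles λi and λj are assumed to be given with respect to the joining line, and the value of the free parameter k is an assumption:

```python
import numpy as np

def interaction(lam_i, lam_j, d_ij, g_i, g_j, sigma_d, g_max, k=0.5):
    """Co-shapeness of two edgels as the product of the cocircularity,
    proximity and contrast coefficients. lam_i, lam_j: angles with the joining
    line; d_ij: distance; g_i, g_j: gradient moduli; sigma_d: std of pairwise
    distances over the image; g_max: largest gradient modulus."""
    L = abs(lam_i + lam_j - np.pi)                          # cocircularity defect
    c_cocir = (1 - L ** 2 / np.pi ** 2) * np.exp(-L ** 2 / k)   # 1 circle, 0 noise
    c_prox = np.exp(-d_ij ** 2 / (2 * sigma_d ** 2))        # vanishes for distant edgels
    c_contrast = (g_i * g_j) / g_max ** 2                   # favours high contrast
    return c_cocir * c_prox * c_contrast
```

A perfectly cocircular, nearby, high-contrast pair reaches the maximal coefficient of 1, while any violated constraint drives the product towards 0.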
The states reachable by the system described by eq. (1) correspond to the vertices of an N-
dimensional hypercube. We are looking for the state which corresponds to the absolute
minimum of the energy function. Typically, when N = 1000, the number of possible
configurations is 2^N ≈ 10^301. The problem of finding the absolute minimum is complex
because of the large number of local minima of the energy function and hence this
problem cannot be tackled with local minimization methods (unless a good initialization
is available).
We already mentioned that the functional to be minimized has the same structure as
the global energy of an interacting spin system. To find a near ground state of such a
physical system we will use statistical methods. Two analyses are possible depending on
the interaction of the system with its environment: either the system can exchange heat
with its environment (case of the canonical analysis) or the system is isolated (case of
the microcanonical analysis) [3]. We will consider here the canonical analysis.
This analysis makes the hypothesis that the physical system can exchange heat with
its environment. At the equilibrium, statistical thermodynamics shows that the free en-
ergy F is minimized. The free energy is given by: F = E - TS, where E is the internal
energy (the energy associated with the optimization problem) and S is the entropy (which
measures the internal disorder). Hence, there is a competition between E and S. At low
temperatures and at equilibrium, F is minimal and T S is close to zero. Therefore, the
internal energy E is minimized. However the minimum of E depends on how the tem-
perature parameter decreases towards the absolute zero. It was shown that annealing is
a very good way to decrease the temperature.
We are interested in physical systems for which the internal energy is given by eq. (1).
The remarks above are expressed in the most fundamental result of statistical physics,
the Boltzmann (or Gibbs) distribution:
Pr(E(σ) = Ei) = exp(-Ei/(kT)) / Z(T)   (4)

which gives the probability of finding the system in a state i with the energy Ei, assuming
that the system is at equilibrium with a large heat bath at temperature T (k is
Boltzmann's constant). Z(T) is called the partition function and is a normalization factor:
Z(T) = Σi exp(-Ei/(kT)). This sum runs over all possible spin configurations. Using
eq. (4) one can compute at a given temperature T the mean value over all possible
configurations of some macroscopic physical parameter A:

⟨A⟩ = Σi Ai Pr(E(σ) = Ei)   (5)
We introduce now the following approximation [12]: the system composed of N in-
teracting spins is viewed as the union of N systems, each composed of a single spin.
Such a single-spin system {σi} is subject to the mean field Φi = Σj Jij⟨σj⟩ + δi created
by all the other single-spin systems. Let us study such a single-spin system. It has two
possible states: {-1} or {+1}. The probability for the system to be in one of these states
is given by the Boltzmann distribution law, eq. (4):

Pr(Xi = σi) = exp(Φi σi / T) / (exp(Φi/T) + exp(-Φi/T)),   σi ∈ {-1, +1}   (7)

Xi is the random variable associated with the value of the spin state. Notice that in the
case of a single-spin system the partition function (the denominator of the expression
above) has a very simple analytical expression. By combining eq. (5), (6), and (7), the
mean state of σi can now be easily derived:

⟨σi⟩ = [(+1) exp(Φi/T) + (-1) exp(-Φi/T)] / [exp(Φi/T) + exp(-Φi/T)] = tanh((Σj Jij⟨σj⟩ + δi) / T)   (8)
We consider now the whole set of single-spin systems. We therefore have N equations
of the form:

μi = tanh((Σj Jij μj + δi) / T)   (9)

where μi = ⟨σi⟩. The problem of finding the mean state of an interacting spin system at
thermal equilibrium is now mapped into the problem of solving a system of N coupled
non-linear equations, i.e., eq. (9). In the general case, an analytic solution is rather dif-
ficult to obtain. Instead, the solution for the vector μ = [μ1, ..., μN] may well be the
stationary solution of the following system of N differential equations:

τ dμi/dt + μi = tanh((Σj Jij μj + δi) / T)   (10)

where τ is a time constant introduced for homogeneity. In the discrete case the temporal
derivative term can be written as:

dμi/dt ≈ (μi^n - μi^(n-1)) / Δt

where μi^n is the value of μi at time tn. By substituting in eq. (10) and by choosing
τ = Δt, we obtain an iterative solution for the system of differential equations described
by eq. (10):

μi^n = tanh((Σj Jij μj^(n-1) + δi) / T)
- Synchronous mode. At each step of the iterative process all the μi^n's are updated
using the μj^(n-1)'s previously calculated.
- Asynchronous mode. At each step of the iterative process a spin μi is randomly
selected and updated using the μj^(n-1)'s.
In practice, the asynchronous mode produces better results because the convergence
process is less subject to the oscillations frequently encountered in synchronous mode. In
order to obtain a solution for the vector σ from the vector μ, one simply looks at the
signs of the μi's. A positive sign implies that the probability that the corresponding spin
has a value of +1 is greater than 0.5: if (1/2)(1 + μi) > 0.5 then σi = +1 else σi = -1.
A practical difficulty with mean field approximation is the choice of the temperature
T at which the iterative process must occur. To avoid such a choice one of us [5] and
other authors [13] have proposed to combine the mean field approximation process with
an annealing process giving rise to mean field annealing. Hence, rather than fixing the
temperature, the temperature is decreased during the convergence process according to
two possible annealing schedules:
- Initially the temperature has a high value and as soon as every spin has been updated
at least once, the temperature is decreased to a smaller value. Then the temperature
continues to slightly decrease at each step of the convergence process. This does not
guarantee that a near equilibrium state is reached at each temperature value but
when the temperature is small enough then the system is frozen in a good stable
state. Consequently, the convergence time is reduced since at low temperatures the
convergence to a stationary solution is accelerated. This strategy was successfully
used to solve hard NP-complete graph combinatorial problems [4], [5];
- Van den Bout & Miller [13] tried to estimate the critical temperature Tc. At this
temperature, some of the mean field variables μi begin to move significantly towards
either -1 or +1. Hence their strategy consists of performing two sets of iterations: one
iteration process at this critical temperature until a near-equilibrium state is reached
and another iteration process at a temperature value that is close to 0. However, the
critical temperature is quite difficult to estimate.
We currently use the first of the annealing schedules described above.
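A sketch of this first schedule; the initial and final temperatures, the decay factor and the random initialisation are illustrative assumptions:

```python
import numpy as np

def mean_field_annealing(J, delta, T0=10.0, Tmin=0.01, decay=0.999, seed=0):
    """Mean-field annealing sketch: asynchronous updates of the mean-field
    variables mu_i (eq. (9)) while the temperature is slightly decreased at
    every step of the convergence process."""
    rng = np.random.default_rng(seed)
    mu = rng.uniform(-0.1, 0.1, len(delta))       # near-disordered start
    T = T0
    while T > Tmin:
        i = rng.integers(len(delta))              # asynchronous: one random spin
        mu[i] = np.tanh((J[i] @ mu + delta[i]) / T)
        T *= decay                                # slight decrease at each step
    return np.where(mu >= 0, 1, -1)               # sign decision: P(sigma_i = +1) > 0.5
```

At the final low temperature the system is frozen, and the sign decision of the previous section recovers a discrete spin configuration.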
Fig. 4. Set of edgels obtained with no noise elimination.
Fig. 5. The result of applying MFA to the image on the left.
The analogy of the interacting spin system with a recursive neural network allows one to assert that the MFA algorithm proposed here is
implementable on a fine-grained parallel machine. In such an implementation, a processor
is associated with an edgel (or a spin, or a neuron) and each processor communicates
with all the other processors.
References
1. J. Canny. A Computational Approach to Edge Detection. IEEE Transactions on Pattern
Analysis and Machine Intelligence, PAMI-8(6):679-698, November 1986.
2. R. Deriche. Using Canny's criteria to derive a recursively implemented optimal edge de-
tector. International Journal of Computer Vision, 1(2):167-187, 1987.
3. L. Hérault and R. Horaud. Figure-ground discrimination: a combinatorial optimization
approach. Technical Report RT 73, LIFIA-IMAG, October 1991.
4. L. H6rault and J.J. Niez. Neural Networks and Graph K-Partitioning. Complex Systems,
3(6):531-576, December 1989.
5. L. H6rault and J.J. Niez. Neural Networks and Combinatorial Optimisation: A Study of
NP-Complete Graph Problems. In E. Gelenbe, editor, Neural Networks: Advances and
Applications, pages 165-213. North Holland, 1991.
6. W. Köhler. Gestalt Psychology. Meridian, New York, 1980.
7. H. Orland. Mean field theory for optimization problems. Journal de Physique Lettres,
46:L-763-L-770, 1985.
8. P. Parent and S.W. Zucker. Trace Inference, Curvature Consistency, and Curve Detection.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(8):823-839, August
1989.
9. C. Peterson. A new method for mapping optimization problems onto neural networks.
International Journal of Neural Systems, 1(1):3-22, 1989.
10. T. J. Sejnowski and G. E. Hinton. Separating Figure from Ground with a Boltzmann
Machine. In Michael Arbib and Allen Hanson, editors, Vision, Brain, and Cooperative
Computation, pages 703-724. MIT Press, 1988.
11. A. Sha'ashua and S. Ullman. Structural Saliency: The Detection of Globally Salient Struc-
tures Using a Locally Connected Network. In Proc. IEEE International Conference on
Computer Vision, pages 321-327, Tampa, Florida, USA, December 1988.
12. H.E. Stanley. Introduction to Phase Transitions and Critical Phenomena. Oxford Univer-
sity Press, 1971.
13. D.E. Van den Bout and T. K. Miller. Graph Partitioning Using Annealed Neural Networks.
In Int. Joint Conf. on Neural Networks, pages 521-528, Washington D.C., June 1989.
Deterministic Pseudo-Annealing:
Optimization in Markov Random Fields
An Application to Pixel Classification
1 Introduction
Since the seminal paper of Geman [6], which popularized the Hammersley-Clifford the-
orem, Markov Random Fields (M.R.F.) have been used increasingly over the last few years
for many low-level tasks in image processing and interpretation, and many heuristic al-
gorithms have been proposed to solve them: iterated conditional modes [2], simulated
annealing [6], dynamic programming [4], etc.
Starting from Relaxation Labeling [5], we propose here Deterministic Pseudo-Anneal-
ing (D.P.A.), a variation on annealing which shares some common flavors with the mean-field
approximation [8], as well as with Graduated Non-Convexity [3]. The basic idea is to extend
the probability of a labeling (a function defined on a discrete set) to a merit function
defined on continuous labelings (a subset of R^NM): a polynomial with non-negative coef-
ficients. The only extrema of this function, under suitable constraints, occur for discrete
labelings.
D.P.A. consists of changing the constraints so as to convexify this function, find its
unique global maximum, and then track down the solution, by a continuation method,
until the original constraints are restored, and a discrete labeling can be obtained.
We describe the optimization scheme in Section 2. In Section 3, we report an applica-
tion to image quantization, or segmentation, with comparisons with other methods.
2 The Optimization Scheme

Let S = {S_i}, 1 ≤ i ≤ N, be a set of sites (pixels in this paper), each of which may take
any label from 1 to M. A global discrete labeling L assigns one label L_i to each site S_i
(from now on, ∝ denotes equality up to a constant scale factor). Thus P(L/Y) also
derives from an M.R.F., obtained by incorporating cliques of order 1 corresponding to the
P(y_i/L_i)'s, and the problem at hand is strictly equivalent to maximizing:

    f(L) = Σ_{c∈C} W_{c,L_c}    (3)

It is important to notice that the W's can always be made positive by shifting, without
changing the solution.
We propose to cast this combinatorial optimization problem into a more comfortable
maximization problem in a compact subset of R^NM. Let f : R^NM → R be defined by:

    f(X) = Σ_{c∈C} Σ_l W_{c,l} Π_{j=1}^{deg(c)} x_{c_j,l_j}    (4)

where the inner sum runs over all label assignments l of clique c, c_j denotes the j-th site
of clique c, and l_j the label assigned to this site by l. f is a polynomial in the x_{i,k}'s,
linear in any x_{i,k}; its degree is the maximum degree of the cliques.
Let us now restrict X to P_NM, defined by:

    ∀i,k : x_{i,k} ≥ 0   &   ∀i : Σ_{k=1}^M x_{i,k} = 1
Thus, any maximum of f on P_NM directly yields a discrete labeling, and the absolute
maximum of f (which has many local maxima) yields the solution to our problem.
The basic idea in D.P.A. is to maximize f on a subset on which it is concave, and
to track the maximum while slowly restoring the original subset. Let Q_NM,d be the
compact subset of R^NM defined by:

    ∀i,k : x_{i,k} ≥ 0   &   ∀i : Σ_{k=1}^M x_{i,k}^d = 1
This simply means that, at each iteration, we select on the pseudo-sphere of degree
d the point where the normal is parallel to the gradient of f. Obviously, the only stable
point is singular, and thus is the maximum we are looking for. We have only proved
experimentally that the algorithm does converge very fast to this maximum.
This procedure, already suggested in [1], yields a maximum which, as in the case d = 2,
is inside Q_NM,d (degeneracies apart), and thus does not yield a discrete labeling. So we
actually track down the solution, maximizing f on successive Q_NM,β's, with β decreasing
from d to 1, starting from the last maximum.
This iterative decrease of β can be compared, up to a point, to a cooling schedule, or
better to a Graduated Non-Convexity strategy.
It is important to notice that, though shifting the coefficients changes neither the
discrete problem nor the maximization problem on P_NM, it does change the problem on
Q_NM,d, and thus there is no guarantee that the same solution is reached. Besides, it is
not guaranteed that the process converges toward the global optimum; actually, it is not
difficult to build simple counterexamples on toy problems. Experiments show nevertheless
that, on real problems, a very good solution is reached.
Finally, experiments have shown that the speed with which β is decreased is not
crucial: typically, 5 to 10 steps are enough to go from 2 to 1.
3 Application to Image Quantization

We want to quantize an image into nicely connected areas, so that isolated pixels, or
small isolated areas with a grey level different from their background, are eliminated. The
sites are the pixels, and the labels are the quantized grey levels (typically 2 to 5). P(y_i/L_i)
is modeled by N(m_{L_i}, σ). The clique potentials (corresponding to the observations) are
the logs of these quantities, suitably shifted to become positive.
The World Model (cliques of order 2) favours similar classes for neighbouring pixels,
and penalizes different labels. For example, if two neighbouring sites have the same
label, the energy is 0, else it is -1. This actually means that the ratio of the a priori
probability of having the same labels to the a priori probability of having different labels
is exp(1), i.e. approx. 2.7.
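These quantities are easy to check numerically. In the sketch below, the observed grey value y and the exp(W) reading of the potentials are illustrative assumptions:

```python
import math

# World-model potentials on order-2 cliques: W = 0 for equal neighbouring
# labels, -1 otherwise, so the a priori odds of equal labels are exp(1).
w_same, w_diff = 0.0, -1.0
ratio = math.exp(w_same - w_diff)          # ~2.7, as stated in the text

# Observation potentials: log of N(m_l, sigma), shifted to become positive.
a = 63.75
sigma = a / 2.0
means = [0.0, a, 2 * a, 3 * a, 4 * a]      # the five class means

def log_gaussian(y, m, s):
    return -0.5 * ((y - m) / s) ** 2 - math.log(s * math.sqrt(2.0 * math.pi))

y = 70.0                                    # hypothetical observed grey value
logs = [log_gaussian(y, m, sigma) for m in means]
unary = [v - min(logs) for v in logs]       # shift: all potentials >= 0
best = max(range(len(means)), key=lambda k: unary[k])
```

Since σ is the same for every class, the best class is simply the one whose mean is nearest to y; the shift makes the potentials positive without changing this choice.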
Figure 1 shows the results on an indoor scene, with five classes, of mean values:
0, a, 2a, 3a, 4a, where a = 63.75 and σ = a/2.
4 Conclusion
The method presented here is a new deterministic alternative to recent stochastic meth-
ods for combinatorial optimization problems. It certainly is heavier than standard image
processing techniques, but compares favorably with these optimization methods. More
precisely, the experiments so far show that this method leads to results as good as G.N.C.,
another deterministic method, and is faster. It thus offers a better cost/performance
trade-off than simulated annealing. Actually, we ran experiments on small graph label-
ing problems, and found that the results were much better than realistic runs of
simulated annealing (i.e. with fast cooling schedules), and were not far from the ideal
solution (found by exhaustive search based on dynamic programming).
Determinism may well prove to be an important advantage, when a massive par-
allelization is realized. This has still to be investigated from a theoretical as well as
practical point of view. Other application domains (stereo, graph matching) are also to
be considered.
References
1. M. Berthod. Definition of a consistent labeling as a global extremum. In Proceedings ICPR6,
Munich, pages 339-341, 1982.
2. J. Besag. On the statistical analysis of dirty pictures. J. Roy. Statist. Soc. B, 48(3):259-302,
1986.
3. A. Blake. Comparison of the efficiency of deterministic and stochastic algorithms for visual
reconstruction. IEEE Trans. on Pattern Analysis and Machine Intelligence, 11(1):2-12, 1989.
4. H. Derin, H. Elliott, R. Cristi, and D. Geman. Bayes smoothing algorithms for segmentation
of binary images modeled by Markov random fields. IEEE Trans. on Pattern Analysis and
Machine Intelligence, 6, 1984.
5. O. Faugeras and M. Berthod. Improving consistency and reducing ambiguity in stochastic
labeling: an optimization approach. IEEE Trans. on Pattern Analysis and Machine Intelli-
gence, 3(4):412-423, 1981.
6. S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions and the Bayesian
restoration of images. IEEE Trans. on Pattern Analysis and Machine Intelligence, 6:721-
741, 1984.
7. P. Saint-Marc, J. Chen, and G. Medioni. Adaptive smoothing: A general tool for early vision.
IEEE Trans. on Pattern Analysis and Machine Intelligence, 13(6):514-529, 1991.
8. J. Zerubia and R. Chellappa. Mean field approximation using compound Gauss-Markov ran-
dom field for edge detection and image restoration. Proc. ICASSP, Albuquerque, USA, 1990.
Abstract.
We present an approach to contour grouping based on classical tracking
techniques. Edge points are segmented into smooth curves so as to min-
imize a recursively updated Bayesian probability measure. The resulting
algorithm employs local smoothness constraints and a statistical descrip-
tion of edge detection, and can accurately handle corners, bifurcations, and
curve intersections. Experimental results demonstrate good performance.
1 Introduction
The image contours produced by objects in a scene encode important information
about the objects' shape, position, and orientation. Image contours arise from disconti-
nuities in the underlying intensity pattern, due to the interaction of surface geometry
and illumination. A large body of work, from such areas as model-based object recogni-
tion [8] and contour motion flow [7], depends critically on the reliable extraction of image
contours.
The contour grouping problem [1, 10, 6, 12] involves assigning edge pixels produced by
an edge detector [4, 3] to a set of continuous curves. Associating edge points with contours
is difficult because the input data (from edge detectors) are noisy: there is uncertainty in
the position of the edge, there may be false and/or missing points, and contours may intersect
and interfere with one another. There are four basic requirements for a successful contour
segmentation algorithm. First, there must be a mechanism for integrating information in
the neighborhood of an edgel to avoid making irrevocable grouping decisions based on
insufficient data. Second, there must be a prior model for the smoothness of the curve
to base grouping decisions on. This model must have an intuitive parameterization and
sufficient generality to describe arbitrary curves of interest. Third, it must incorporate
noise models for the edge detector, in order to optimally weight noisy measurements and
to detect and remove spurious edges. And finally, since intersecting curves are common, the
algorithm must be able to handle these as well. We believe our algorithm is the first
unified framework to incorporate these four requirements.
We formulate contour grouping as a Bayesian multiple hypothesis "tracking" problem.
Our work is based on an algorithm originally developed by Reid [11] in the context
of military and surveillance tracking of multiple targets (aircraft) in the presence of
noise. The algorithm has three main components: A "dynamic" contour model that
encodes the smoothness prior, a measurement model that incorporates edge detector
noise characteristics, and a Bayesian hypothesis tree that encodes the likelihood of each
possible edge assignment and permits multiple hypotheses to develop in parallel until
sufficient information is available to make a decision.
Section (2) develops the Bayesian hypothesis tree, and associated probability calcu-
lations. Section (3) describes some important implementation details, and is followed by
a presentation of experimental results in Section (4). We conclude in Section (5) with a
discussion of directions for future work.
of false alarms and new targets is not restricted. In this manner we impose the dual
constraints that (1) a measurement can originate from only one contour (disjointness)
and (2) that a contour produces only one measurement per iteration (this is guaranteed
by our adaptive search strategy).
Following [2, 9], we derive a recursive expression for the likelihood of a given seg-
mentation hypothesis conditioned on a set of edge measurements. A given hypothesis
at iteration k, Θ_m^k, is composed of a current event θ_m(k) and a previous hypothesis
resulting from measurements up to and including iteration k-1: Θ_m^k = {θ_m(k), Θ_{m'}^{k-1}},
where m' indexes the parent hypothesis. We wish to calculate the probability
P{Θ_m^k | Z^k} = P{θ_m(k), Θ_{m'}^{k-1} | Z(k), Z^{k-1}}. Our deriva-
tion, presented in detail in [5], is based on two assumptions. First, the numbers of false
alarms and new edgels are Poisson distributed with densities λ_F and λ_N, respectively,
while contour initiations are distributed uniformly. Second, measurements assigned to
a given contour are Gaussian distributed, while false edge measurements are uniformly
distributed over the surveillance volume. Under these conditions, we obtain [2]:

    P{Θ_m^k | Z^k} = (1/c) λ_F^φ λ_N^ν Π_{i=1}^{m_k} [N_{t_i}(z_i(k))]^{τ_i} P{Θ_{m'}^{k-1} | Z^{k-1}}

where c is a normalization constant, φ and ν denote the numbers of false alarms and
new contours hypothesized in the current event, τ_i = 1 if measurement z_i(k) is associated
with an existing contour t_i (0 otherwise), and N_{t_i}(z_i(k)) is the Gaussian likelihood of
that measurement under contour t_i.
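The flavor of this recursion can be sketched numerically. The sketch below uses made-up densities and a 1-D Gaussian measurement model, and absorbs all constant factors into the final normalization; it is schematic, not the full bookkeeping of [2]:

```python
import math

def gaussian(z, z_pred, sigma):
    """1-D Gaussian likelihood N(z; z_pred, sigma^2) of an edge measurement."""
    return math.exp(-0.5 * ((z - z_pred) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def hypothesis_score(parent_prob, events, lam_F, lam_N, sigma):
    """Unnormalized recursive score of a child hypothesis.

    `events` lists, for each measurement of the current iteration, either
    ('contour', z, z_pred) for association with an existing contour,
    ('false',) for a false alarm, or ('new',) for a new contour."""
    p = parent_prob
    for ev in events:
        if ev[0] == 'contour':
            _, z, z_pred = ev
            p *= gaussian(z, z_pred, sigma)   # Gaussian term [N_t(z(k))]^tau
        elif ev[0] == 'false':
            p *= lam_F                        # Poisson false-alarm density
        else:
            p *= lam_N                        # Poisson new-contour density
    return p

# One measurement z = 10.2 near a contour predicted at 10.0: compare the
# three competing explanations for it (hypothetical densities).
scores = {
    'associate': hypothesis_score(1.0, [('contour', 10.2, 10.0)], 0.01, 0.005, 0.5),
    'false':     hypothesis_score(1.0, [('false',)], 0.01, 0.005, 0.5),
    'new':       hypothesis_score(1.0, [('new',)], 0.01, 0.005, 0.5),
}
total = sum(scores.values())
probs = {k: v / total for k, v in scores.items()}   # the 1/c normalization
```

A measurement close to a contour's prediction makes the association hypothesis dominate, which is exactly what lets the tree defer hard decisions until the evidence accumulates.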
3 Implementation
In this section, we briefly describe five important components of the segmentation al-
gorithm: the structure and pruning of the hypothesis tree, contour and measurement
models, a termination probability model, the surveillance volume, and a post-processing
stage called contour fusion.
The hypothesis tree organizes segmentation alternatives and their associated like-
lihoods. Efficient implementation of this tree is critical to the practical success of the
algorithm. A single hypothesized contour may be present in many global hypotheses.
Rather than replicate this contour for each hypothesis with the associated memory and
computational overheads, Kurien [9] proposed the construction of a contour (or track)
tree, in which the root denotes the creation of a new contour and each branch denotes an
alternative measurement assignment. Each node in the global hypothesis tree contains a
set of pointers to track trees. Each set represents a different permutation of contour leaf
nodes from different contour trees, i.e. the global hypotheses enforce the assumptions of
disjoint partitions. The contour tree provides considerable savings, and is discussed in
detail in [9].
Pruning is an essential part of any practical contour segmentation algorithm. In our
implementation, pruning is based on a combination of the "N-scan-back" algorithm [9]
and a simple lower limit probability threshold. The "N-scan-back" algorithm assumes
that any ambiguity at iteration k is resolved by iteration k + N. Then, if hypothesis
Θ_m^k at iteration k has m children, the sum of the probabilities of the leaf nodes is
calculated for each of the m branches. The branch with the highest probability is retained
and all others are pruned. This gives the tree a particular structure: below the decision
node it has a depth of N, while above the node it has degenerated into a simple list of
assignments. In our experiments, N is set to 3. While this may seem quite small, previous
tracking results [9] suggest that even N = 2 can provide near optimum solutions. After
this procedure, the number of leaf nodes can still be very high. A second phase of pruning
removes all nodes whose probability is less than a lower limit, which is currently set to
0.01.
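The two pruning phases can be sketched on a plain tree of nested dictionaries. This illustrative data structure ignores the contour-tree sharing of [9]; the probabilities are made up:

```python
# Each node: {'p': probability, 'children': [...]}; leaves carry the
# probabilities of the current global hypotheses.

def leaf_sum(node):
    """Sum of leaf probabilities below a node."""
    if not node['children']:
        return node['p']
    return sum(leaf_sum(c) for c in node['children'])

def prune_weak(node, p_min):
    """Second phase: drop subtrees whose probability mass is below p_min."""
    node['children'] = [c for c in node['children'] if leaf_sum(c) >= p_min]
    for c in node['children']:
        prune_weak(c, p_min)

def n_scan_back(decision_node, p_min=0.01):
    """First phase: keep only the child branch with the largest leaf mass
    (resolving the ambiguity at the decision node), then prune weak leaves."""
    best = max(decision_node['children'], key=leaf_sum)
    decision_node['children'] = [best]
    prune_weak(best, p_min)

root = {'p': 1.0, 'children': [
    {'p': 0.0, 'children': [{'p': 0.50, 'children': []},
                            {'p': 0.005, 'children': []}]},
    {'p': 0.0, 'children': [{'p': 0.30, 'children': []}]},
]}
n_scan_back(root)
```

Here the first branch wins (mass 0.505 vs. 0.30), and its 0.005 leaf is then removed by the probability threshold.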
decisions must be based on a prior model for curve smoothness. Second, the difficulty
of the segmentation problem can be expressed by sensor/environmental statistics such
as mean rates of false edgels and new contours. The resulting algorithm is physically
grounded, i.e. all free parameters are physical quantities of the sensor and/or scene that
can be physically measured. We believe the algorithm is the first unified framework to
incorporate a measurement noise model, scene statistics, optimal state estimation for
the contour model, statistical distance measures to quantify "closeness" and Bayesian
decision trees in a recursive formulation that is independent of the image sampling grid.
There are several avenues for future work. First, it would be useful to investigate other
contour models, such as a piecewise constant curvature model. It would also be interesting
to tightly couple the edge detection and contour grouping stages, with the latter providing
the former with expectations of where to look for edges. This unification may improve
upon standard techniques for curve enhancement, such as hysteresis [4]. Finally, we would
like to extend the Bayesian formulation to incorporate application specific segmentation
requirements. In a recognition scenario, for example, partial identification of the scene
could modify the prior probabilities associated with the contour grouping algorithm.
The authors would like to thank Y. Bar-Shalom, H. Durrant-Whyte, T. Kurien and
J. J. Leonard for valuable discussion on issues related to target tracking.
References
1. D. Ballard. Generalizing the Hough transform to detect arbitrary shapes. Pattern Recog-
nition, 13(2):111-122, 1981.
2. Y. Bar-Shalom and T. E. Fortmann. Tracking and Data Association. Academic Press,
1988.
3. R. A. Boie and I. J. Cox. Two dimensional optimum edge detection using matched and
Wiener filters for machine vision. In IEEE First International Conference on Computer
Vision, pages 450-456. IEEE, June 1987.
4. J. Canny. A computational approach to edge detection. IEEE Trans. Pattern Analysis and
Machine Intelligence, 8(6):679-698, 1986.
5. I. J. Cox, J. M. Rehg, and S. Hingorani. A Bayesian multiple hypothesis approach to con-
tour grouping. Technical report, NEC Research Institute, Princeton, USA, 1991.
6. C. David and S. Zucker. Potentials, valleys, and dynamic global coverings. Int. J. Com-
puter Vision, 5(3):219-238, 1990.
7. E. Hildreth. Computation underlying the measurement of visual motion. Artificial Intelli-
gence, 27(3):309-355, 1984.
8. D. Kriegman and J. Ponce. On recognizing and positioning 3-d objects from image con-
tours. IEEE Trans. Pattern Analysis and Machine Intelligence, 12(12):1127-1137, Decem-
ber 1990.
9. T. Kurien. Issues in the design of practical multitarget tracking algorithms. In Y. Bar-
Shalom, editor, Multitarget-Multisensor Tracking: Advanced Applications, pages 43-83.
Artech House, 1990.
10. A. Martelli. An application of heuristic search methods to edge and contour detection.
Communications of the ACM, 19(2):73-83, 1976.
11. D. B. Reid. An algorithm for tracking multiple targets. IEEE Transactions on Automatic
Control, AC-24(6):843-854, December 1979.
12. A. Shashua and S. Ullman. Grouping contours by iterated pairing network. In Richard P.
Lippmann, John Moody, and David S. Touretzky, editors, Advances in Neural Information
Processing Systems 3. Morgan Kaufmann, 1991. Proc. NIPS'90, Denver CO.
This article was processed using the LaTeX macro package with ECCV92 style
(Figure: predicted edges, hypothesis matrix, and observed edges.)
Fig. 3. Contour groupings for (a) fork, (b) mouse and (c) cutter. Circles denote contour
end points.
Detection of General Edges and Keypoints *
1 Introduction
The interpretation of static, monocular grey-valued images is usually based on the hy-
pothesis that the loci of strong intensity variation are tightly coupled to physical events
such as 3D discontinuities (foreground/background) or changes of surface orientation.
However, the corresponding intensity variations normally differ from ideal edges. They
often have complex profiles and may not be perfectly straight. Therefore, edge detection
has to be a truly 2D process and linear operators based on idealized edge-models (e.g.
Canny [4]) seem to be inadequate for detecting complex intensity distributions.
Perona & Malik [16] have pointed out that, in general, linear operators cannot correctly
detect and localize edges and lines simultaneously. This problem becomes relevant in
real images as many "natural" edges are neither pure edges nor lines, but rather have
complex intensity profiles. Second order non-linearities in the form of local energy may
provide a solution to the problem [6, 1, 13, 12, 15, 16].
Yet, energy models and their "linear" precursors are intrinsically one-dimensional.
They cannot account for another important class of image features: corners, vertices,
terminations, junctions, etc. These two-dimensional intensity variations indicate, for ex-
ample, strong variations in contour orientation, terminations occurring in occlusion sit-
uations and many other relevant 2D features.
In this paper we propose a dual processing scheme which emphasizes the detection
of 1D signal variations on the one hand, and of points of strong 2D variations on the
other. We present a method which allows us (1) to derive a valid indicator for the presence
of 1D edges with arbitrary profiles and (2) to detect and localize complex 2D intensity
variations. We use the term general edge (GE) for regions of 1D intensity variation and the
term keypoint for points of strong 2D intensity variation. The concept of a general edge
allows us to develop a fully 2D filter model based on linear filters which are polar separable
in the Fourier domain. Even and odd filter outputs are then combined to oriented energy.
Two aspects of our approach are new: (a) The use of a contrast independent measure
for deviations from a general edge; this enables us to limit the application of the edge
model to points that qualify as general edge points. The local maxima of oriented energy
* The research described in this paper has been supported by the Swiss National Science Foun-
dation, Grant no. 32-8968.86.
can then be used to localize the edges [16],[12]. (b) The use of differential geometry ap-
plied to oriented energy maps, yielding a representation of strong 2D intensity variations
(keypoints).
The work presented here was partially motivated by our interest in biological mech-
anisms of contour processing [20] [19] [17] [7].
Fig. 1. Edge quality maps. The top row shows a sample vertex, corrupted by increasing
levels of noise (from left to right: no noise, 20dB, 10dB and 5dB SNR). The center row
shows a plot of local orientation and edge quality, and the bottom row shows the result
of edge detection. The computing was done on 128x128 pixel images to avoid border
effects. Shown are 32x32 pixel cuts from the central part of the images.
4 Keypoints
In the previous section we have described a method of finding GEs using a measure of
quality. On the other hand, there is a class of important image features with pronounced
2D variation of intensity such as line endings, corners, junctions etc. (keypoints). In this
section we present a method for detecting these keypoints. It is based on the oriented
energy maps and does not rely on an explicit model of any particular 2D intensity dis-
tribution.
The basic idea is to exploit the fact that deviations from a GE result in changes
of local energy magnitude along the edge. Deviations may be induced by all sorts of
image features such as (a) a loss of contrast (e.g. line ending), (b) two or more edges of
different orientation meeting at one point (e.g. corner, vertex) or (c) continuous changes
of orientation (curvature). Directional derivatives in the orientation of a contour seem to
be a straightforward way of detecting keypoints. Local extrema of the first directional
derivatives would indicate features like line-endings, corners and junctions. For strong
curvature and blobs the second directional derivatives would be more appropriate.
On general edges, the derivatives along the edge orientation are zero. At keypoints
a unique orientation cannot be assigned and thus derivatives in a single orientation are
inadequate for representing such features. Using the property that oriented energy sepa-
rates orientational components, we propose to take for each energy channel the directional
derivative parallel to its orientation. We expect keypoints to have local extrema in deriva-
tive magnitudes. However, the directional derivatives are non-zero also on GEs, for all
orientations that differ from the edge orientation. We show that these "false responses"
can be selectively eliminated by a compensation scheme that makes use of the systematic
nature of these errors. This compensation scheme is based on derivatives orthogonal to
each oriented energy channel. We use the terms p-derivative and o-derivative for direc-
tional derivatives parallel and orthogonal to the orientation of a channel, respectively.
Assuming N filters with orientations given by θ_n = nπ/N and a GE with orientation θ,
we define ê_n as the unit vector parallel to the filter orientation and ê_{n⊥} as the unit
vector orthogonal to the filter orientation. At location r we define
    P_n^(1)(r) = | dE_n(r)/dê_n |   and   P_n^(2)(r) = [ -d²E_n(r)/dê_n² ]^+ ,
    with [x]^+ = max(0, x)    (2)

as the gradient magnitude parallel to filter orientation (1st p-derivative) and as an esti-
mate of negative curvature of the magnitude of local energy along filter orientation (2nd
p-derivative). The latter corresponds to 'bumps' of local energy along filter orientation.
Since we are not interested in local minima along oriented energy, P_n^(2)(r) is defined
to be zero for positive values of the second directional derivative. Fig. 2 shows oriented
energy E_n and its directional p-derivatives for a sample corner. Each column represents
one orientation channel. With the above definitions and in analogy to local energy we
may define a scalar keypoint map K̃(r):

    K̃(r) = max_{n=0,...,N-1} √( P_n^(1)(r)² + P_n^(2)(r)² )
In Fig. 3 the raw keypoint map K̃ is depicted for a sample corner, a line ending and a
T-junction. As mentioned above, the 1st and 2nd p-derivatives will be zero on a GE only if
(θ - θ_n) = 0. For (θ - θ_n) ≠ 0, 1st p-derivatives will be zero only at the exact location of
a GE and 2nd derivatives will always be > 0 with a local maximum on the GE. Because
of these "false responses" to GEs, K̃ is insufficient for detecting keypoints selectively.
The systematic nature of these false responses allows us to construct a compensation map
C(r).
Using the properties of a separable filter (orientation selectivity given by cos^{2p}(θ - θ_n))
and the properties of GEs (the local energy of a GE is a GE too), it is easy to show the
following relations:

    P_n^(1)(r) = s(r) cos^{2p}(θ - θ_n) sin(θ - θ_n) ,
    P_n^(2)(r) = s(r) cos^{2p}(θ - θ_n) sin²(θ - θ_n)    (3)

where s(r) depends only on the profile of the GE and on the distance from its center.
Eqn. (3) suggests using the directional derivatives orthogonal to the filter orientation as
the compensation signal for the systematic error of K̃. In analogy to the p-derivatives
we define the 1st and 2nd o-derivatives as
    O_n^(1)(r) = | dE_n(r)/dê_{n⊥} |   and   O_n^(2)(r) = [ -d²E_n(r)/dê_{n⊥}² ]^+
On a GE, the following relations hold:

    O_n^(1)(r) = s(r) cos^{2p+1}(θ - θ_n)   and   O_n^(2)(r) = s(r) cos^{2p+2}(θ - θ_n)    (4)
For all θ of a GE, the maxima of the 1st and 2nd o-derivatives are greater than the
maxima of the 1st and 2nd p-derivatives. Using the sum over all orientations as compen-
sation has the advantage that it is robust in discrete implementations, and we can define
the compensation map as

    C̃(r) = Σ_{n=0}^{N-1} ( O_n^(1)(r) + O_n^(2)(r) )   and   K(r) = [ K̃(r) - C̃(r) ]^+
However, C̃ is not zero at keypoints (e.g. at a line ending), as can be seen in Fig. 3.
Keypoints are characterized by the fact that the orientation distribution of local energy
differs significantly from the distribution on a GE. This fact can be used to implement a
correction mechanism by combining orthogonal pairs of O_n^(2)(r) to form a new map R̃:

    R̃(r) = Σ_{k=0}^{N/2-1} √( O_k^(2)(r) O_{k+N/2}^(2)(r) )    (5)

Fig. 3 shows R̃ for the three sample keypoints. Obviously R̃ > 0 on GEs, but it remains
almost constant with varying θ (substitute (4) into (5)).
Fig. 2. Responses to a 90° corner (top image) of oriented energy E_n, first p-derivative
P^(1), second p-derivative P^(2), first o-derivative O^(1) and second o-derivative O^(2).
Image dimensions are 32x32 pixels; filter parameters are p = 2, σ = 3.
Fig. 3. Keypoint detection using a corner, a line-end and a T-junction (32x32 pixels). Top
rows: original image, final keypoint map (K), corrected compensation map (C), corrected
combination of o-derivatives (R). Bottom rows: uncompensated keypoint map (K̃), raw
compensation map (C̃) and uncorrected R map (R̃).
The extrema R̃_max and R̃_min differ by less than 10 percent of R̃_max (using filters with
N = 6 and p = 2). It is interesting to note that, supposing that N is even and N > (p+1),
the sum over all 2nd o-derivatives is constant and does not depend on θ. Therefore we
estimate the error of R̃(r) on GEs with the sum over all 2nd o-derivatives and define

    R(r) = [ R̃(r) - γ Σ_{n=0}^{N-1} O_n^(2)(r) ]^+    (6)

An estimate of γ can easily be derived by setting θ = 0, substituting (4) into (6) and
solving the resulting equation for R(r) = 0. With this definition of R(r) we finally
define the following compensation map C:
This compensation map fulfills both requirements: (1) it successfully cancels all systematic
errors of the raw keypoint map at general edges, and (2) it is zero at the location of
keypoints. Fig. 4 shows the different steps of keypoint detection on a simple gray-valued
image. The keypoint detection scheme has been tested on a wide variety of complex
natural scenes. Results will be shown in the next section.
5 Experimental results
Implementation: Convolutions with the twelve filter kernels were carried out in the
Fourier domain. Six maps of oriented energy were generated by quadrature-pair sum-
mation of even and odd filter convolution outputs (we took the square root of oriented
energy to reduce signal dynamics to those of the original filter outputs). Binary edge
maps were generated by finding local maxima orthogonal to the orientation of the best
responding energy channel at each location.
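In one dimension, the quadrature-pair summation can be sketched with a zero-mean Gabor pair. The kernel parameters below are illustrative; the paper's actual filters are polar-separable 2-D kernels applied in the Fourier domain:

```python
import numpy as np

def gabor_pair(sigma=2.0, freq=0.25, radius=8):
    """Even (cosine) and odd (sine) Gabor kernels forming a quadrature pair."""
    x = np.arange(-radius, radius + 1, dtype=float)
    env = np.exp(-x ** 2 / (2.0 * sigma ** 2))
    even = env * np.cos(2.0 * np.pi * freq * x)
    even -= even.mean()                      # zero DC: flat regions give 0
    odd = env * np.sin(2.0 * np.pi * freq * x)
    return even, odd

def oriented_energy(signal, even, odd):
    """Square root of the quadrature-pair energy, as in the implementation."""
    e = np.convolve(signal, even, mode='same')
    o = np.convolve(signal, odd, mode='same')
    return np.sqrt(e ** 2 + o ** 2)

even, odd = gabor_pair()
step = np.where(np.arange(64) < 32, 0.0, 1.0)   # a step edge at index 32
line = np.zeros(64); line[32] = 1.0             # a line (bar) at index 32
E_step = oriented_energy(step, even, odd)
E_line = oriented_energy(line, even, odd)
```

Both edge types produce an energy response localized near index 32, illustrating the polarity/type invariance of local energy that motivates the general-edge model.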
To compute the derivatives in the various directions, oriented energy maps were sam-
pled at discrete offset positions around center pixels. First derivatives corresponded to
the difference in value at two offset pixels, and second derivatives to their average minus
the value of the center pixel.
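This discrete scheme can be sketched for a single energy channel along its orientation. The 1-D profile and offset below are made up; `np.roll` wraps at the array ends, so only interior samples are meaningful:

```python
import numpy as np

def p_derivatives(E, offset=1):
    """Discrete 1st/2nd directional derivatives of an oriented-energy map E,
    sampled as described in the text: the 1st derivative is the difference of
    the two offset samples, the 2nd is their average minus the center value."""
    plus = np.roll(E, -offset)
    minus = np.roll(E, offset)
    p1 = np.abs(plus - minus)                           # |1st p-derivative|
    p2 = np.maximum(0.0, -((plus + minus) / 2.0 - E))   # [-2nd derivative]^+
    return p1, p2

# A hypothetical energy profile along a contour that ends at index 8:
E = np.array([1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0])
p1, p2 = p_derivatives(E)
K = np.sqrt(p1 ** 2 + p2 ** 2)   # keypoint strength for this single channel
```

On the flat interior of the contour both derivatives vanish, while the line ending produces a strong response, which is exactly the behaviour the keypoint map exploits.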
Fig. 5. Application of the keypoint scheme to an outdoor scene: (A) input image. (B)
keypoint map superimposed on a contrast-reduced version of the original image. Image size
is 512 x 512 pixels and filter parameters are p = 2 and σ = 2.
Fig. 6. Edge map (A) and keypoint localization (B) for a subsection of the image in Fig. 5.
Keypoint positions (pixel accuracy) are indicated by crosses.
The dark blobs correspond to the keypoint map superimposed on a contrast-reduced
version of the original image. Note that the keypoint strength is a function of local image
contrast. Thus weaker markings do not necessarily indicate weaker evidence for the
presence of a corner, vertex etc. It can be seen that no markings whatsoever occur on
straight contour segments. This proves our compensation scheme to be effective also with
more complex 2D intensity configurations. Fig. 6A shows, for a part of the image, the
contour map extracted with the non-maximum suppression scheme described above. A
threshold of 8% of the maximal oriented energy response was applied. One can see that
while straight parts of contour are well represented we often find gaps or distortions in
the neighbourhood of keypoints. Fig. 6B shows the location of keypoints indicated by
crosses (threshold again 8% of the global keypoint maximum). As both general edge and
keypoint location have been derived from oriented energy, the maps are commensurable
and complementary.
6 Discussion
We have presented a computational framework for extracting (1) intensity discontinuities
which can be described as 1D variations (general edges) and (2) keypoints with a true
2D intensity distribution (corners, vertices, terminations etc.).
Using oriented filters with even and odd symmetry we combine their convolution
outputs to oriented energy. This quadrature pair summation has the advantage that edges
and lines and combinations thereof are treated in a unified way and can be unambiguously
localized [12], [15], [16]. The edge quality measure we derived is used to select for the
edge map only those pixels that exceed a predefined quality. This way we can be sure
edge detection is valid at these locations.
The detection scheme for keypoints represents a novel approach to the problem of
detecting and localizing image features like corners, junctions or terminations. This is
more difficult than detecting edges. The abundant richness of two-dimensional intensity
variations seems to prohibit approaches based on simplified model prototypes such as,
for example, the Heaviside function used in edge models. What seems to be important
is to reduce the dimensionality of the problem by generating invariant representations.
Oriented energy is invariant with respect to the polarity and the type of edge [15].
We propose to detect keypoints by taking first and second derivatives on the energy
maps in the filter direction (p-derivatives). The idea is that true 2D features produce
strong variations in the energy signal parallel to its orientation. However, markings also
occur for general edges if their orientation and the direction of the derivative differ. We
have introduced a scalar compensation map that selectively suppresses these unwanted
derivative signals.
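A minimal sketch of such p-derivatives, assuming simple finite differences on a precomputed energy map (the scalar compensation map itself is omitted here):

```python
import numpy as np

def p_derivatives(energy, theta, spacing=1.0):
    """First and second derivatives of an energy map along direction theta."""
    gy, gx = np.gradient(energy, spacing)          # per-axis finite differences
    d1 = gx * np.cos(theta) + gy * np.sin(theta)   # dE/dp
    gyy, gyx = np.gradient(gy, spacing)
    gxy, gxx = np.gradient(gx, spacing)
    d2 = (gxx * np.cos(theta)**2
          + 2.0 * gxy * np.cos(theta) * np.sin(theta)
          + gyy * np.sin(theta)**2)                # d2E/dp2
    return d1, d2
```

For an energy ridge produced by a straight edge, the derivatives taken along the ridge vanish while those taken across it do not; this is the unwanted edge response that the compensation map must suppress.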
It seems that differential geometry is an adequate way to attack two-dimensional
intensity variations. However, compared to other approaches that also use methods of
differential geometry (e.g. [3], [10], [5]), our model is not gray-level based (smoothed
version of the original image) but uses oriented energy maps which have the advantage
of representing different edge types in a unified way. Furthermore, our approach does
not contain any specific model of keypoints, as for example a corner [9], [14], [18], vertex
[5], T- or L-junctions. In this respect we cannot expect selectivities for these specific
2-D intensity variations. Our scheme detects and accurately localizes corners (di-, tri-,
tetra-hedral junctions of different angles and contrasts) as well as line-terminations, T-
junctions, strong curvature and blobs. However, the information given by the first and
second p-derivatives in different orientations may be used to classify the keypoints. We
are currently working on a processing scheme to classify keypoints paying attention to
occlusion situations and to the distinction between foreground/background structures.
References
1. Adelson, E. H. & Bergen, J. R.: Spatio-temporal energy models for the perception of motion.
Journal of the Optical Society of America A 2 (1985) 284-299
2. Barrett, H. B., & Swindell, W.: Analog Reconstruction for Transaxial Tomography. Pro-
ceeding of the IEEE 65 (1977) 89-107
3. Beaudet, P. R.: Rotationally invariant image operators. 4th International Joint Conference
on Pattern Recognition, Kyoto, Japan (1978) 578-583
4. Canny, J.: A computational approach to edge detection. IEEE Transactions on Pattern
Analysis and Machine Intelligence 8 (1986) 679-698
5. Giraudon, G. & Deriche, R.: On corner and vertex detection. IEEE Proc. CVPR'91, Maui,
Hawaii (1991) 650-655
6. Granlund, G. H.: In search of a general picture processing operator. Computer Graphics
and Image Processing 8 (1978) 155-173
7. Heitger, F., Rosenthaler, L., von der Heydt, R., Peterhans, E. and Kübler, O.: Simulation of
neural contour mechanisms: From simple to end-stopped cells. Vision Research 32 (1992)
in press
8. Hubel, D. H. & Wiesel, T. N.: Receptive fields and functional architecture of monkey striate
cortex. Journal of Physiology, London 195 (1968) 215-243
9. Kitchen, L. & Rosenfeld, A.: Gray level corner detection. Pattern Recognition Letters 1
(1982) 95-102
10. Koenderink, J. J. & van Doorn, A. J.: Representation of local geometry in the visual system.
Biological Cybernetics 55 (1987) 367-376
11. Marr, D. & Hildreth, E.: Theory of edge detection. Proceedings of the Royal Society, London
Series B 207 (1980) 181-217
12. Morrone, M. C. & Burr, D. C.: Feature detection in human vision: A phase-dependent
energy model. Proceedings of the Royal Society, London Series B 235 (1988) 221-245
13. Morrone, M. C. & Owens, R. A.: Feature detection from local energy. Pattern Recognition
Letters 6 (1987) 303-313
14. Noble, J. A.: Finding corners. Image Vision and Computing 6 (1988) 121-128
15. Owens, R., Venkatesh, S. & Ross, J.: Edge detection is a projection. Pattern Recognition
Letters 9 (1989) 233-244
16. Perona, P. & Malik, J.: Detecting and localizing edges composed of steps, peaks and roofs.
UCB Technical Report, UCB/CSD90/590 (1990)
17. Peterhans, E. & von der Heydt, R.: Mechanisms of contour perception in monkey visual
cortex. II. Contours bridging gaps. Journal of Neuroscience 9 (1989) 1749-1763
18. Rangarajan, M., Shah & Brackle, D. V.: Optimal corner detector. Computer Vision Graph-
ics and Image Processing 48 (1989) 230-245
19. von der Heydt, R. & Peterhans, E.: Mechanisms of contour perception in monkey visual
cortex. I. Lines of pattern discontinuity. Journal of Neuroscience 9 (1989) 1731-1748
20. von der Heydt, R., Peterhans, E. & Baumgartner, G.: Illusory contours and cortical neuron
responses. Science 224 (1984) 1260-1262
Distributed Belief Revision
for Adaptive Image Processing Regulation*
Vittorio Murino, Massimiliano F. Peri, and Carlo S. Regazzoni
1 Introduction
Image understanding can be represented according to three principal phases: low-level processing,
feature extraction and descriptive-primitives grouping, and high-level reasoning [HR1]. Each of
these phases may affect the goodness of the final results. Little has been proposed in the literature for the
adaptive regulation of processing parameters. The nearest and main effort has been that of Hanson and
Riseman, who designed VISIONS [HR2], a Knowledge-Based (KB) system able to recognize 3D
objects by supervising low-level processing phases. However, this approach is neither dynamic (i.e., it may
only be operated after the end of an interpretation cycle), nor distributed (i.e., it is based on the interpretation
phase, and not on the knowledge available to each local low-level processing unit). In this paper, we
present a distributed algorithm for adaptively regulating image-processing parameters at multiple levels:
at each level, the loss of information occurring while mapping input data into the local data representation is
minimized. The same regulation strategy is applied at every level, so allowing the design of a general
prototype for the regulation module. This prototypical inference engine is specialized at each level, in
accordance with the different peculiarities of the processing units. Processing-parameter regulation is based on
evaluations of local data quality, and is performed by mapping quality features into the actual values of
the parameters to be regulated. Results on a set of indoor images are provided in order to assess the
validity of the proposed approach.
2 Problem Statement
In some applications, the architecture of an image understanding system can be represented as a singly
connected network (Fig. 1) to which distributed problem-solving techniques, using probabilistic reasoning
as an inference paradigm, can be efficiently applied [Pe1]. According to this model, each node of the
network is associated with a variable to be estimated, and is connected with parent and son nodes by
bidirectional communication channels. An intermediate (not terminal) node can receive two types of
messages: the evidence λ, that is, the information coming from lower-level nodes, and the expectation π,
coming from parent nodes. If xi is the variable to be estimated by the i-th node, a Belief function [Pe1]
can be defined as: BEL(xi) = α·λ(xi)·π(xi), where α is a normalizing constant. By maximizing BEL over
all received messages, the locally optimum xi value can be obtained. Moreover, messages based on the
new variable settings can be computed and propagated to the neighbouring nodes in a distributed fashion.
When the estimation of the optimal variables is achieved, a stable status is reached (i.e., no message is
* This work was carried out and supported within the framework of the MOBIUS project (no. MAST-0028-C), which
is included in the CEC MArine Science and Technology (MAST) programme.
present in the network). Different kinds of messages can be used, depending on the type of problem
considered. In the case of Belief Revision (BR) [Pe1] used here, the optimization criterion is the Most Probable
Explanation (MPE), which is a generalization of the Maximum A-Posteriori (MAP) probability criterion.
By using this criterion, the joint probability of xi is made explicit by using the Bayes rule:

Pr(xi|e) = max_{xi-,xi+} Pr(xi-|xi)·Pr(xi)·Pr(xi+|xi) = β·λ(xi)·ψ(xi)·π(xi)   (1)

where λ(xi) = max_{xi-} Pr(xi-|xi), π(xi) = max_{xi+} Pr(xi+|xi), ψ(xi) = Pr(xi), and xi-, xi+ are lower- and higher-level
variables, respectively. In our application, each node is considered as a virtual sensor whose "acquisition"
phase depends on specific sets of parameters, and which produces an output representation to be sent to
higher-level modules. The status of each module can be identified by the set of parameters Pi regulating
the local transformation process. This means that, once the real scene (d0) to be considered has been
fixed, the datum considered as input at each level i, di, with i≠0, can be univocally determined if the
parameters Pj, j<i, are known. As a consequence, by estimating the optimal set of processing parameters,
an optimal set of image representations di can be obtained. If we pose x = P, we can write equation (1) as:
max_{Pi} { Pr(Pi)·[max_{Pi-1} Pr(Pi-1|Pi)]·[max_{Pi+1} Pr(Pi+1|Pi)] } = max_{Pi} { ψ(Pi)·λ(Pi)·π(Pi) }
The BU message λ is computed by the lower-level module, and indicates the optimal conditional
probability distribution for each Pi-1 when the conditioning factor is Pi. The TD π message represents an
analogous distribution: it is computed by the higher-level module, and indicates the optimal conditional
probability distribution for each Pi+1 when the conditioning factor is Pi. The term ψ represents local
regularizing knowledge for the optimization of the local status variable Pi. This amounts to defining default
parameter values (specific for each level) that are usually expected to give good local results. Depending on
higher- and lower-level status, further local tunings may become necessary to improve the quality of the
global data flow in the processing chain. The same reasoning mechanism can be extended directly to the
space-varying case by performing regulation-parameter tunings to improve particular subimages,
disregarding possible effects on the rest of the image. By applying the MPE criterion to all the subwindows
separately, the best data quality at all levels of abstraction may be obtained for each subwindow.
The local term is defined as ψ(Pi) = Pr{Pi} = (Zi)^-1·exp(-Qi(Pi)), where Zi is the partition function at level i. In this way, in absence of
input messages, the parameter which gives the highest quality is regarded as being the most probable on
the basis of the local regularizing knowledge. Every module propagates a BU evaluated datum (a datum
associated with its "energy"): on the basis of this message, the receiving module can compute a proper λ
as: λ(Pi) = (Zi)^-1·exp(-U_mi(Pi)), for all Pi, where mi = Di(Qi-1(Pi-1)) is a function mapping, by means of the
discretization process Di, the local energy values Qi-1 into the number mi. Such a number is selected in a discrete
set (mi ∈ {1,..,Mi}) and adapts the probability distribution at level i, according to the quality assessment
at level i-1, U_mi(Pi) (in our case, mi ∈ {1,..,3}, corresponding to the set {Low, Medium, High} quality
assessments). As the propagated datum is the one corresponding to the best quality assessment, the
maximization over all possible Pi-1 is implicitly performed at level i-1, and λ(Pi) may be considered as one of a
set of possible fixed distributions at level i. At the moment, as TD messages, we propagate requests for
focusing the attention on particular subareas. In the bootstrap phase the π messages may be viewed as:
π(Pi) = (dim{Pi})^-1, dim{Pi} being the number of all possible different parameter settings at level i. This
means that local regulation parameters are equally probable (capable of obtaining the same quality
judgement) in absence of a-priori expectations. If focus-of-attention suggestions are propagated, all
previously used regulation parameters, say {P̃i}, were not suited with respect to the higher level's quality
assessments: thus, their associated probabilities may be reassigned among all not yet used regulation parameters
at level i. This way, the maximization over all possible Pi-1 no longer has effect, that is:
π(Pi) = (dim{Pi} - dim{P̃i})^-1 if Pi ∉ {P̃i}, and π(Pi) = 0 if Pi ∈ {P̃i}. In the proposed system, every module
in this way receives fixed λ and π messages: the maximization of ψ(Pi)·λ(Pi)·π(Pi) over all possible Pi is the only one to
be performed.
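The per-level maximization can be sketched as follows. This is an illustrative reading of the scheme, not the system's implementation: the quality energies and the message vectors passed in are placeholders.

```python
import numpy as np

def local_belief_revision(q_local, lam=None, pi=None):
    """Pick the locally optimal parameter setting at one level.

    q_local : quality 'energies' Q_i(P_i) for each discrete setting (lower = better).
    lam     : bottom-up message lambda(P_i) (defaults to uniform).
    pi      : top-down message pi(P_i)      (defaults to uniform).
    Returns (best_index, belief) with BEL = alpha * psi * lambda * pi.
    """
    q = np.asarray(q_local, dtype=float)
    psi = np.exp(-q)
    psi /= psi.sum()                  # psi(P_i) = exp(-Q_i(P_i)) / Z_i
    lam = np.ones_like(psi) if lam is None else np.asarray(lam, float)
    pi = np.ones_like(psi) if pi is None else np.asarray(pi, float)
    bel = psi * lam * pi
    bel /= bel.sum()                  # alpha normalization
    return int(np.argmax(bel)), bel
```

In the bootstrap phase both messages default to uniform, so the default (lowest-energy) setting wins; a focus-of-attention request can then be mimicked by passing a π vector that zeroes the previously used settings, which shifts the maximum to an unused one.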
from reaching a saturation state. Preprocessing results are not shown, as the application of the Perona-Malik
filter to nearly noiseless images (like the analyzed ones) yields negligible results, according to a purely
visual criterion.
The third module consists of an edge-detection filtering performed by means of the Canny algorithm
[Ca1]. For this algorithm, we regulate the two thresholds for the hysteresis process that has to be applied
to the detected edge points. Heuristic criteria have been chosen for quality assessment, such as the number
of edges, the number of edge points, the number of long edges, and the number of connections among
edges. These criteria give a quantitative judgement on the informative content of the image. Fig. 5 gives
the results of the edge-extraction process performed on one of the images (aerial_2); they were obtained
by using two different settings of the Canny algorithm thresholds. In the right image, the parameters have
been relaxed by the regulation process in order to take into account also less "informative" edges.
The main goal of the Line-Detection module is the production of the "best" scene representation by
means of straight lines extracted through the Hough transform [IK1], on the basis of the differently
processed edge images obtained by the edge-extractor module. Actually, this module can neither propagate
evaluated data nor receive TD suggestions from higher modules, and cannot yet adaptively regulate
its local processing parameters. The goal is reached by performing a data fusion on all the binary edge
images based on their related quality assessments (i.e., accumulators in the Hough space are incremented by
a value proportional to the quality judgement on the window containing the related edge). Fig. 6 shows the
results of the Line-Detection module after having fused in the Hough space and backprojected in the
Cartesian space one (right) and three of the edge images (the edges of the aerial_4 image and those of the
aerial_2 images, respectively obtained by using default and relaxed settings). An improvement in the
detection of rectilinear segments can be noted; this is particularly evident after the third fusion (in the right
image), where meaningful segments appear, though with no substantial modifications to the full scene
structure.
6 Conclusions
A framework for the regulation of image-processing phases has been presented. The actual system
includes four regulation modules, each coupled to a different processing step. This subdivision allows one
to define a set of intermediate abstraction levels at which to perform the matchings necessary to gradually
obtain a robust image interpretation. Each module is provided with a-priori knowledge represented by
regulation parameters for controlling local data transformations, and by quality parameters for a quantitative
assessment of transformed data. Estimation of the best transformation parameters to be progressively
applied to input data is based on belief revision theory, according to the MPE criterion. When a solution
to the BR problem is found, a suboptimal estimation of regulation parameters at all system levels is
reached. As a single prototype has been developed for each module, the system can be implemented on a
parallel architecture.
Acknowledgment
The authors would like to thank the Thomson CSF - LER (Rennes, France) and the Thomson Sintra
ASM (Brest, France) for providing all the images processed for this work.
References
[HR1] A.R. Hanson, E.M. Riseman: A Summary of Image Understanding Research at the University of Massachusetts. COINS Technical Report 83-85, Dept. of Computer and Information Science, University of Massachusetts (1985)
[HR2] A.R. Hanson, E.M. Riseman: The VISIONS Image Understanding System-86. Advances in Computer Vision, Erlbaum (1987)
[Pe1] J. Pearl: Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Publ. Inc., San Mateo, CA (1988)
[Kr1] E. Krotkov: Active Computer Vision by Cooperative Focus and Stereo. Springer-Verlag, New York, NY (1989)
[Ho1] B.K.P. Horn: Focusing. Technical Report, Massachusetts Institute of Technology (1968)
[PM1] P. Perona, J. Malik: Scale-Space and Edge Detection Using Anisotropic Diffusion. IEEE Trans. on PAMI 12, No. 7 (1990) 629-639
[Ca1] J. Canny: A Computational Approach to Edge Detection. IEEE Trans. on PAMI 8, No. 6 (1986) 679-698
[IK1] J. Illingworth, J. Kittler: A survey of the Hough transform. Computer Vision, Graphics and Image Processing 44 (1988) 87-116
[MR1] V. Murino, C.S. Regazzoni: A Distributed Algorithm for Adaptive Regulation of Image Processing Parameters. IEEE Int. Conf. on Systems, Man, and Cybernetics, Charlottesville, VA, USA (1991)
[Architecture diagram: each level i produces di = Ti(Pi, d(i-1)); level 0 is the real scene d0, level 3 the edge-extraction module, and level 4 the rectilinear-segment detection module.]
Fig.2 - π message generation on all windows of the image of Fig.3 (aerial_4)
Fig.4 - Profile of the cost function at the current level
Fig.5 - Outcome of the edge-extraction module on the aerial_2 image with different settings of the S1 and S2 thresholds (right, after regulation)
Fig.6 - Outcome of the line-detection module after the backprojection of one (left) and three (right) images after edge-extraction regulation
Finding Face Features *
1 Introduction
We describe a program to recognise and measure human facial features. Our original
motivation was to provide a way of indexing police mugshots for retrieval purposes.
Another use comes when identifying points on a face with a view to cartooning the face
[2] or to obtain a more useful Principal Component Analysis of face images [3], leading
to a compact representation suitable for matching. The work on blink rate is part of a
PROMETHEUS (CED2 - Proper Vehicle Operation) study by the Ford Motor Company
on driver awareness, for which blink rate is a useful indicator. The aim is to correlate
the blink rate, obtained during normal driving conditions, with the output from other
sensors. Ideally this would enable blink rate to be predicted from measurements which
can be made more cheaply as part of the normal sensor input available from a modern
car.
The system aims to locate a total of 40 feature points within a grey-scale digitized full
face image of an adult male. We initially chose to ignore glasses and facial hair, although
such images are occasionally used to test the robustness of the system. The points chosen
are those described in Shepherd [8]; thus allowing us to utilise data originally recorded by
hand. A total of 1000 faces were measured, and the locations of the 40 points on each face
were recorded. This data has been normalized and forms the basis of the model described
below. Identification is confirmed by overlaying the points on the image; the points are
usually linked with straight lines to form a wire frame face as shown in figure 1.
Our system is in two distinct parts:
- a control structure, driven by a high level hypothesis about the location of the face,
which invokes the feature finding modules in order to support or refute its current
hypothesis;
- a set of feature finding modules ("feature experts"), which obtain information directly
from the image.
The program endeavours to confirm that a face is located within the image in a two
part process. Possible coarse locations for the face are sought ab initio providing contexts
for further search, using modules capable of providing reliable, although not necessarily
accurate locations even when given a wide search area. Contexts are then refined and
assessed. In this phase, feature finding modules are usually called with very restricted
search areas, determined by statistical knowledge of the relative positions of features
within the face. When all the required feature points are located in a single context, this
identifies the face itself; and the features' locations provide mutual support.
2 Feature Experts
Feature experts obtain information directly from the image. Our current set ranges from
simple template matchers to much more elaborate deformable template models. Imple-
menting a template matcher is trivial, and execution is rapid, but problems arise from
their heavy dependence on scale and orientation, and multiple responses from many
parts of the image: however FindFace is normally confirming a good working hypothesis;
our templates are generated dynamically, tuned to the expected size of the feature, and
applied to a small search area.
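A dynamically generated template matcher of this kind reduces, at its core, to normalized cross-correlation over a restricted search window. A brute-force sketch (an illustration, not the FindFace implementation):

```python
import numpy as np

def match_template_ncc(image, template):
    """Brute-force normalized cross-correlation; returns best (row, col) and score."""
    th, tw = template.shape
    t = template - template.mean()
    tnorm = np.sqrt((t**2).sum())
    best, best_pos = -np.inf, (0, 0)
    for r in range(image.shape[0] - th + 1):
        for c in range(image.shape[1] - tw + 1):
            w = image[r:r+th, c:c+tw]
            w = w - w.mean()
            denom = np.sqrt((w**2).sum()) * tnorm
            score = (w * t).sum() / denom if denom > 0 else 0.0
            if score > best:
                best, best_pos = score, (r, c)
    return best_pos, best
```

Restricting `image` to the small search area supplied by the current context keeps the quadratic scan cheap and suppresses the spurious responses that plague whole-image template matching.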
We describe methods designed primarily for initial location as "global methods" as
opposed to the "local methods" used to assess or verify a location proposed by the existing
context. So, for example the global methods for locating a single eye look either for dark,
compact blobs completely surrounded by lighter regions or for areas of the image with
a substantial high frequency component. Locally the eyes use a probabilistic eye locater
rather like the outline finder we describe below, or even the blob detector confined to a
small area, if the uncertainty of the location is still high. A full list of feature experts can
be found in [4].
A number of algorithms have been proposed for finding the outline of a head, dating
at least from Kelly [6]. Our approach is inspired by the work of Grenander et al. [5] and [7],
in which a polygonal template outline is transformed at random to fit the data, governed
by our detailed statistics [1]. The advantage of this approach is that the background
can be cluttered (cf figure 3), and the initial placing of the outline is not required to be
outside the head.
The approximate location, scale and orientation of the head is found by repeatedly
deforming the whole template at random by scaling, rotation and translation, until it
matches best with the image whilst remaining a feasible head shape. This feasibility is
determined by imposing statistical constraints on the range of allowable transformations.
The optimisation, in both stages, is achieved using simulated annealing; although this
means the method is not rapid, it appears to be particularly reliable.
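The annealed search over whole-template similarity transforms can be sketched as follows; the cooling schedule, proposal spreads and the (scale, rotation, translation) pose parameterization are illustrative assumptions, and the statistical feasibility constraints are folded into the cost as penalties:

```python
import math, random

def anneal_pose(cost, pose0, steps=2000, t0=1.0, t1=0.01, seed=0):
    """Simulated annealing over (scale, rotation, tx, ty) pose parameters.

    `cost` maps a pose tuple to a mismatch score (lower = better fit).
    """
    rng = random.Random(seed)
    pose = list(pose0)
    best = cur = cost(tuple(pose))
    best_pose = tuple(pose)
    sigmas = (0.05, 0.05, 2.0, 2.0)          # proposal spread per parameter
    for k in range(steps):
        temp = t0 * (t1 / t0) ** (k / max(1, steps - 1))   # geometric cooling
        cand = [p + rng.gauss(0.0, s) for p, s in zip(pose, sigmas)]
        c = cost(tuple(cand))
        # Metropolis rule: always accept improvements, sometimes accept worse.
        if c < cur or rng.random() < math.exp(-(c - cur) / temp):
            pose, cur = cand, c
            if c < best:
                best, best_pose = c, tuple(pose)
    return best_pose, best
```

The occasional acceptance of worse poses at high temperature is what lets the search escape the local optima that make this problem hard for greedy fitting, at the price of speed noted in the text.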
A further refinement is then achieved by transforming the individual vectors within
the polygon under certain statistical constraints. Consider the outline as a list of vectors
[v1, ..., vn] with vi ∈ IR² for each i, where the representation is obtained by regarding
each vector as based at the head of the previous one in the list. Since the outline is closed,
v1 + ... + vn = 0; a new outline may be generated by applying (n - 1) transformations
from the group O(2) × US(2), where O is the orthogonal group and US is the group of
uniform scale changes, to the first n - 1 vectors in the list. We represent elements of this
Fig. 1. Wire frame model. Fig. 2. The template for a head outline. Fig. 3. Identification in a cluttered image.

group by matrices of the form

( u  -v )
( v   u )
When u = 1 and v = 0, each transformation is the identity map, and the resulting
outline is the (initial) average one. Our first approximation to generating a variable
outline is to choose ui ∈ N(1, σu) and vi ∈ N(0, σv) for 1 ≤ i ≤ n - 1, where the
corresponding variances are calculated from our detailed measurements. This then gives
a shape (behaviour) score of the form
B = exp( - ( (1/bu) Σi (ui - 1)² + (1/bv) Σi vi² ) )

where the constants bu and bv are associated with independent behaviour at each site.
This alone allows too much uncoordinated variation between neighbouring vectors.
To ensure more coordinated, head-like, polygons, we also place a measure on the change
in transformations generating adjacent vectors, making the assumption that the u's and
v's are realisations of Markov random chains of order 1. This gives a component of the
shape (acceptance) score in the form
A = exp( - ( au Σi (ui - ui-1)² + av Σi (vi - vi-1)² ) )
where the constants au and av scale the bonding relations.
By varying the parameters au, av, bu and bv, we control the variability and coordi-
nation of the transformations. High values of au and av force coordination, whilst high
values of bu and bv allow greater variability. The a's represent the variability of particular
vector transformations and allow variability in the distributions from vector to vector.
Our description means that the first point of the outline will remain fixed; in fact we
apply the above procedure only to a sublist of the original outline list (a sweep site in
Grenander's terminology). Since we regard the list as cyclic, the base point can move,
and the outline slowly change location.
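The vector-chain representation and its per-vector transformations can be sketched directly; u = 1, v = 0 everywhere reproduces the identity, and the final vector is recomputed so the polygon stays closed:

```python
import numpy as np

def perturb_outline(vectors, u, v):
    """Apply [[u_i, -v_i], [v_i, u_i]] to the first n-1 chained edge vectors
    and re-close the polygon."""
    vec = np.asarray(vectors, dtype=float)      # shape (n, 2)
    out = vec.copy()
    for i in range(len(vec) - 1):
        m = np.array([[u[i], -v[i]],
                      [v[i],  u[i]]])
        out[i] = m @ vec[i]
    out[-1] = -out[:-1].sum(axis=0)             # closure: vectors sum to zero
    return out
```

Sampling each (ui, vi) around (1, 0), with neighbouring values coupled as in the acceptance score above, then yields candidate outlines that stay head-like rather than varying independently at every site.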
3 Control
Fig. 4. Initial context destined for rejection in favour of the one in figure 5. Fig. 5. Initial context for which refinement succeeds. Fig. 6. Final result obtained for the context shown in figure 5.
The model expert is responsible for creating an initial set of contexts. Two such
contexts are shown diagrammatically in figures 4 and 5. We deliberately generate a number
of contexts at this stage; ultimate success requires a context to be relatively close to the
correct position; in practice we rarely need to refine more than three contexts.
The location of the remaining features and subsequent refinement of all features forms
the second phase of the operation of FindFace. The feature's location in the model is
transformed into image coordinates; a search area is also defined based on the model
feature's variance and the current context residual. Each feature expert in the list is
consulted in turn until one returns a positive result. Since the residual decreases mono-
tonically apart from when each of a finite number of feature points is added to the
context, convergence necessarily occurs; in fact it happens quite quickly. A completed
context is shown in figure 6.
4 Results
The FindFace system has successfully been demonstrated on many interested visitors
to the department, and their faces often include attributes FindFace is not designed to
work with -- glasses, beards or females. More rigorous testing has been performed on
random batches of 50 images from our library of faces, including subjects with beards
and glasses. Another test used a sequence of 64 images of a moving subject, at about 8
frames per second. In a typical batch,
- the head position is correctly located in all images, with the outline completely de-
tected in 43 cases -- the region normally missing from the remainder is the chin;
- the absence of feature experts for the eyebrows reduces the number of possible feature
point locations to 1462, of which the system claims to identify 1292;
- of the located points, 6% were inaccurately or incorrectly identified -- again the
mouth and chin region were usually in error.
The problem with the mouth and chin region is partially attributable to the inclusion
of subjects with beards and moustaches; somewhat surprisingly, glasses do not interfere
as much as originally anticipated. On the sequence of 64 images, the overall success rate
increased to greater than 95%. This result was achieved by processing each image ab
initio; better results would have been obtained by using a priori knowledge obtained
from the previous image(s).
5 Working Faster
A second implementation based on the same design philosophy aims to make detailed
measurements of the eye region from a real time video sequence, and hence detect eye-
lid separation and so blink rate. The other points in the model need only be located
for corroboration purposes; valuable since the image sequence originates from a camera
mounted on the dashboard of a car, pointing towards the driver. The resulting images
suffer from poor contrast, and a variety of different noise elements caused by vehicle mo-
tion, changing lighting conditions etc. We now incorporate facilities for initialising the
system with a new subject, and tracking movement between frames. A system has been
successfully demonstrated that tracks the eye movement at approximately 5 frames per
second, although the detailed eye measurements were not being produced.
References
1. A. D. Bennett and I. Craw. Finding image features using deformable templates and detailed
prior statistical knowledge. In P. Mowforth, editor, British Machine Vision Conference 1991,
pages 233-239, London, 1991. Springer Verlag.
2. P. J. Benson and D. I. Perrett. Perception and recognition of photographic quality facial car-
icatures: Implications for the recognition of natural images. European Journal of Cognitive
Psychology, 3(1):105-135, 1991.
3. I. Craw and P. Cameron. Parameterising images for recognition and reconstruction. In
P. Mowforth, editor, British Machine Vision Conference 1991, pages 367-370, London and
Berlin, 1991. British Machine Vision Association, Springer Verlag.
4. I. Craw, D. Tock, and A. Bennett. Finding face features. Technical Report 92-15, Depart-
ment of Mathematical Sciences, University of Aberdeen, Scotland, 1991.
5. U. Grenander, Y. Chow, and D. Keenan. Hands: A Pattern Theoretic Study of Biological
Shapes. Research Notes in Neural Computing. Springer-Verlag, New York, 1991.
6. M. Kelly. Edge detection in pictures by computer using planning. In B. Meltzer and
D. Michie, editors, Machine Intelligence 6, pages 397-409. Edinburgh Uni-
versity Press, Edinburgh, 1971.
7. A. Knoerr. Global models of natural boundaries: Theory and applications. Pattern Analysis
Technical Report 148, Brown University, Providence, RI, 1988.
8. J. W. Shepherd. An interactive computer system for retrieving faces. In H. D. Ellis, M. A.
Jeeves, F. Newcombe, and A. Young, editors, Aspects of Face Processing, chapter 10, pages
398-409. Martinus Nijhoff, Dordrecht, 1986. NATO ASI Series D: Behavioural and Social
Sciences - No. 28.
This article was processed using the LaTeX macro package with ECCV92 style
Detection of Specularity Using Color and Multiple Views *
Abstract. This paper presents a model and an algorithm for the detection
of specularities from Lambertian reflections using multiple color images from
different viewing directions. The algorithm, called spectral differencing, is
based on the Lambertian consistency that color image irradiance from Lam-
bertian reflection at an object surface does not change depending on view-
ing directions, but color image irradiance from specular reflection or from
a mixture of Lambertian and specular reflections does change. The spectral
differencing is a pixelwise parallel algorithm, and it detects specularities
by color differences between a small number of images without using any
feature correspondence or image segmentation. Applicable objects include
uniformly or nonuniformly colored dielectrics and metals, under extended
and multiply colored scene illumination. Experimental results agree with the
model, and the algorithm performs well within the limitations discussed.
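A heavily simplified sketch of the pixelwise test follows. The neighbourhood search radius and colour threshold below are assumptions for illustration; the paper's actual algorithm and its viewing-geometry constraints are developed in later sections.

```python
import numpy as np

def spectral_difference(img_a, img_b, radius=3, thresh=0.1):
    """Flag pixels of img_a whose color has no close match anywhere in a
    (2*radius+1)^2 neighbourhood of img_b (both H x W x 3 float arrays).

    Idea: a Lambertian surface colour persists across views, so it should
    reappear somewhere nearby in the other image; a specular (or mixed)
    colour need not, and is flagged -- no pixel correspondence required.
    """
    h, w, _ = img_a.shape
    mask = np.zeros((h, w), dtype=bool)
    for y in range(h):
        for x in range(w):
            y0, y1 = max(0, y - radius), min(h, y + radius + 1)
            x0, x1 = max(0, x - radius), min(w, x + radius + 1)
            patch = img_b[y0:y1, x0:x1].reshape(-1, 3)
            d = np.linalg.norm(patch - img_a[y, x], axis=1)
            mask[y, x] = d.min() > thresh   # no Lambertian-consistent match
    return mask
```

Because the test only asks whether a colour reappears within a neighbourhood, it needs neither feature correspondence nor segmentation, which is the property the abstract emphasizes.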
1 Introduction
Recently there has been a growing interest in the visual measurement of surface re-
flectance properties in both basic and applied computer vision research. Most vision
algorithms are based on the assumption that visually observable surfaces consist only of
Lambertian reflection. Specularity is one of the major hindrances to vision tasks such
as image segmentation, object recognition and shape or structure determination. With-
out any means of correctly identifying reflectance types, image segmentation algorithms
can be easily misled into interpreting specular highlights as separate regions or as dif-
ferent objects with high albedo. Algorithms such as shape from shading and structure
from stereo or motion can also produce false surface orientation or depth from the non-
Lambertian nature of specularity. Therefore it is desirable to have algorithms for esti-
mating reflectance properties at a very early stage or as an integral part of many visual
processes. In many industrial applications, there is a great demand for visual inspection
of surface reflectance which is directly related to the quality of surface finish and paint.
Although the measurement of surface reflectance properties in applied physics has
been the topic of many research efforts, only a few attempts in computer vision have
been made until recently. There has been an approach to the detection of specularity
with a single gray-level image using the Lambertian constraints by Brelstaff and Blake
[BB88]. They attempted to extract maximal information from a single gray-scale image.
* This work was partly supported by E. I. du Pont de Nemours and Company, Inc. and partly by
the following grants: Navy Grant N0014-88-K-0630, AFOSR Grants 88-0244 and 88-0296,
Army/DAAL 03-89-C-0031PRI, NSF Grants CISE/CDA 88-22719, IRI 89-06770, and ASC
91-0813. We thank Ales Leonardis of the University of Ljubljana, Slovenia, for his collaboration
on our color research during his stay at the GRASP Lab. Special thanks to Steve Shafer at
Carnegie Mellon University for helpful discussions and comments.
information can be obtained. For low-level vision problems of shape or structure, it has
been demonstrated that many ill-posed problems become well-posed if more information
is collected by active sensors [AB87]. Although the paradigms for shape or structure
based on feature correspondence cannot be directly applied to the study of reflectance
properties, the idea of an active observer motivates the investigation of new principles by
physical modeling in obtaining more information. A question to be answered is what kind
of extra spectral information can be obtained by a moving camera without considering
object geometry. If there is any, it may alleviate the limiting assumptions required for
color segmentation approaches and provide higher confidence in detecting specularities.
In this paper, a model is presented for explaining extra spectral information from
two or more views, and a specularity detection algorithm, called spectral differencing, is
proposed. The algorithm does not require any assistance from image segmentation since
it does not rely on the dichromatic model. The algorithm exploits only the variation in the
spectral composition of reflection with the viewing direction; therefore it
does not require any geometric manipulation using feature correspondence. An important
principle used is the Lambertian consistency: the Lambertian reflection does not
change its brightness and spectral content depending on viewing directions, but the
specular reflection or the mixture of Lambertian and specular reflections can change.
Basic spectral models for reflection mechanisms are introduced in Sect. 2, and Sect.
3 explains how the measured color appears in a three-dimensional color space. A model
is also established in Sect. 3 for explaining the spectral difference between different
views for uniform dielectrics under singly colored illumination. The detection algorithm
of spectral differencing is described in Sect. 4, and Sect. 5 discusses spectral
differencing for various objects, including nonuniformly colored dielectrics and metals,
under multiply colored illumination. Experimental results are presented in Sect. 6.
2 Reflection Model
Physical models for light-surface interaction and for sensing are crucial in developing the
algorithms for detection of specularity. Several computer vision researchers have intro-
duced useful models based on the physical process of image-forming [TS67] [BS63] [Sha85]
[LBS90] [HB89]. Although there are certain approximations, the models introduced in
this section are generally well accepted in computer vision for their good approximation
of the physical phenomena.
2.1 Reflection Type
There are two physically different types of reflections for dielectric materials according
to the dichromatic model proposed by Shafer [Sha85], interface or surface reflection and
body or sub-surface reflection. Reflection types are summarized in Fig. 1. The surface or
interface reflection occurs at the interface of air and object surface. When light reaches
an interface between two different media, some portion of the light is reflected at the
boundary, resulting in the interface reflection, and some refracted into the material. The
ratio of the reflected to the refracted light is determined by the angle of incidence and
the refractive indices of the media. Since the refractive indices of dielectric materials are
nearly independent of wavelength (λ) over the visible range of light (400 nm to 700 nm),
the interface reflectance of dielectrics is well approximated as a flat spectrum,
as shown in Fig. 1 [LBS90].
The refracted light going into a sub-surface is scattered from the internal pigments and
some of the scattered light is re-emitted randomly resulting in the body or sub-surface
reflection. Thus the reflected light has the Lambertian property due to the randomness
of the re-emitted light direction. The Lambertian reflection means that the amount of
reflected light does not depend on the viewing direction, but only on the incident light.
Depending on the pigment material and distribution, the reflected light undergoes a spec-
tral change, i.e., the spectral power distribution (SPD) of the reflected light is the product
of the SPD of the illumination and the body reflectance. The fact that the interface and
the body reflections are often spectrally different is the key concept of the dichromatic
model, and central to many detection algorithms by color image segmentation.
For metals, electromagnetic waves cannot penetrate the material beyond the skin
depth because of the large conductance, which results in a large refractive index. Therefore
all the reflections occur at the interface, and due to the lack of the body reflection, metals
are unichromatic [HB89]. Interface reflections from most metals are white or grey, e.g.,
from silver, iron, aluminum. However, there are reddish metals such as gold and copper.
Fig. 1. Reflection types: interface (specular) and body (Lambertian) reflection, with spectral reflectance over wavelengths 400 to 700 nm and the diffusion of the reflected light.
Reflections can also be categorized as glossy specular and nonglossy diffuse reflections
depending on the appearance. This categorization depends on the degree of diffusion in
the reflected light direction. Body reflection can be modeled as a perfectly diffuse re-
flection, i.e., Lambertian reflection. Specularity results from interface reflection, and the
reflected direction of the specularity depends both on the illumination direction and sur-
face orientation. The specularity is diffused depending on the surface roughness. There
have been some models that describe the scattering of light by rough surfaces. The
physical modeling of Beckman and Spizzichino [BS63] is based on the electromagnetic
scattering of light waves at rough surfaces. The simpler geometric modeling by Torrance
and Sparrow is widely accepted in computer vision and graphics as a good approximation
of the physical phenomenon [TS67]. The Torrance-Sparrow model assumes that a sur-
face is composed of small, randomly oriented, mirror-like microfacets, and the Gaussian
function is used for the distribution of the microfacets. A rougher surface has a wider
distribution of the microfacets, and the direction of the reflected light is more diffuse, as
is illustrated in Fig. 1.
Diffusion in the direction of the interface reflection may result from extended and
diffuse illumination as shown in Fig. 1. When illumination is extended and diffuse, the
incident angle of light to a surface patch is extended, and interface reflections even at a
smooth surface appear diffuse. When the surface is rough, the reflection is more diffuse.
In this paper, "specular reflection" is used to denote interface reflection, while "Lam-
bertian reflection" is used to denote body reflection. Use of "surface reflection" for denot-
ing the interface reflection is avoided, since, in a wider sense, it means all the reflections
from surface and sub-surface. In this paper, "surface reflection" is used in the wider sense.
When "diffuse reflection" is used, it can be either diffuse interface or body reflection.
2.2 Representation and Sensing
For singly colored illumination e(λ), whether geometrically collimated or extended, scene
radiance is given as the product of illumination and reflection, i.e.,

L_r(λ) = e(λ) s(λ) , (1)

where s(λ) is the reflection and λ is the wavelength of light. The surface reflection is the
linear combination of specular and Lambertian reflections with different geometric
weighting factors, i.e.,

s(λ) = ρ_S(λ) G_S(θ_r, φ_r) + ρ_B(λ) G_B , (2)

where ρ_S(λ) and ρ_B(λ) are the specular and Lambertian reflectances, i.e., the Fresnel
reflectance and the albedo, respectively, (θ_r, φ_r) denotes the reflection direction, and G_S(θ_r, φ_r)
and G_B are purely geometric factors which are independent of spectral information.
The geometric factors are determined by illumination and viewing directions with respect
to surface orientation. Observation of the specular reflection is highly dependent both
on the viewer and on the illumination directions, while observation of body reflection
depends only on the illumination direction.
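As a small numerical sketch of (1) and (2), the view dependence enters only through the specular term. All sampled spectra and geometric-factor values below are illustrative assumptions, not values from the paper:

```python
import numpy as np

# Wavelength samples over the visible range (400-700 nm).
wl = np.linspace(400.0, 700.0, 31)

e = np.ones_like(wl)                          # illumination SPD (neutral light)
rho_S = np.full_like(wl, 0.04)                # Fresnel reflectance: flat for dielectrics
rho_B = np.exp(-((wl - 460.0) / 60.0) ** 2)   # body reflectance (albedo), bluish

def scene_radiance(G_S, G_B):
    """Eqs. (1)-(2): L_r(lambda) = e(lambda) * [rho_S*G_S + rho_B*G_B]."""
    return e * (rho_S * G_S + rho_B * G_B)

# G_S depends on the viewing direction; G_B does not.
L_near_specular = scene_radiance(G_S=8.0, G_B=0.7)   # viewer near mirror direction
L_off_specular = scene_radiance(G_S=0.0, G_B=0.7)    # viewer away from it
```

The second spectrum is the pure Lambertian term and stays fixed as the viewer moves, which is the Lambertian consistency exploited later.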
Note that, for metals, ρ_B(λ)G_B is 0, and for dielectrics, G_B is independent of the
viewing angle (θ_v, φ_v). It has been reported that the spectral composition of Lambertian
reflection changes slightly when the incident light direction approaches 90° with respect
to the surface normal (glancing incidence) [HB89]. However, this effect is small even near
glancing incidence, and thus is neglected in the model.
When there is more than one illumination source, with different colors from different
directions, the addition of the reflections under the different illumination sources,

L_r(λ) = e_1(λ) s_1(λ) + e_2(λ) s_2(λ) + ⋯ , (3)

is used for establishing the models presented in this paper.
Color image sensing is usually performed with a CCD camera using filters of
different spectral responses. With 3 filters (usually R, G and B), the quantum catch, or
the measured signal from the camera, is given by

q_k = ∫_{λ_1}^{λ_2} L_r(λ) Q_k(λ) dλ , k = 0, 1, 2 , (4)

where Q_k(λ) and q_k for k = 0, 1, 2 are the spectral response of the k-th filter and the
camera output through the k-th filter, respectively. The wavelengths λ_1 = 400 nm and
λ_2 = 700 nm cover the range of the visible spectrum.
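A sketch of the quantum catch in (4), using hypothetical Gaussian curves in place of the real filter responses Q_k(λ) (the centers and widths below are illustrative assumptions):

```python
import numpy as np

wl = np.linspace(400.0, 700.0, 301)      # visible range, 1 nm steps
dwl = wl[1] - wl[0]

def gaussian_filter(center, width=40.0):
    # Stand-in for a real spectral response Q_k(lambda); purely illustrative.
    return np.exp(-((wl - center) / width) ** 2)

Q = [gaussian_filter(610.0), gaussian_filter(540.0), gaussian_filter(460.0)]

def quantum_catch(L_r):
    """Eq. (4): q_k = integral over 400-700 nm of L_r(lambda) Q_k(lambda) d(lambda)."""
    return np.array([np.sum(L_r * Qk) * dwl for Qk in Q])

q = quantum_catch(np.ones_like(wl))      # sensor triple for a flat scene radiance
```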
where the γ_i's are scalar weighting factors. The relationship between the sensor responses
q_k's and the γ_i's is a linear transformation given as

q = A γ , i.e., γ = V q , (6)

where q = [q_0, q_1, q_2]^T, γ = [γ_0, γ_1, γ_2]^T, V = A^{-1}, and A_ki is the element of A in the
k-th row and i-th column.
The vector q, or the linear transform γ, represents the measured scene radiance that
results from the illumination and reflectance colors and from the geometric weighting. In
this section, it is explained how the measured q's or γ's from a color image appear in
a general three-dimensional color space, and a model is established for a specularity
detection algorithm using color information from different views. The three-dimensional
spectral space constructed from the RGB values, or from the basis functions S_0(λ), S_1(λ)
and S_2(λ), is generally called the S space in this paper.
In this section, the spectral scene radiance is considered only for dielectric objects
with uniform reflectance under singly colored illumination. Dielectric materials with re-
flectance variation and metals under multiply colored illumination will be discussed in
Sect. 5.
3.1 Lambertian Reflection
For Lambertian surfaces, shading results from the variation in surface orientations
relative to the illumination directions. In the S space, the scene radiance generated by
shaded Lambertian reflections forms linear clusters.
Scene radiance from Lambertian reflection is given from (1), (2) and (5) by

L_r(λ) = e(λ) ρ_B(λ) G_B = Σ_{i=0}^{2} γ_i S_i(λ) . (7)
Fig. 2. Shading; (a) linear cluster (b) coordinates for simulated images (c) Lambertian shading
(d) linear cluster in S space
This linear-cluster property has been noted and used in segmentation [Sha85] [KSK88]. The orientation of the vector γ is determined
by the Lambertian reflectance and illumination, and is independent of geometry.
Examples are shown by simulation with a spherical object for the geometry shown in
Fig. 2 (b). Figure 2 (c) and (d) show a simulated image of a sphere and its color cluster
in the S space with (θ_v, φ_v) = (35°, 0°) and (θ_i, φ_i) = (0°, 0°), respectively. For the
simulation, a spectrum measured from a real blue color plate is used for the reflectance,
and a linear cluster is shown in the S space for spectrally flat neutral light. The Fourier
basis functions S_0(λ) = 1, S_1(λ) = sin λ and S_2(λ) = cos λ are used for the S space in
Fig. 2 (d).
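The linear-cluster property of Lambertian shading in (7) is easy to check numerically. In this sketch the base color is an arbitrary stand-in for the sensor-integrated product of illumination and body reflectance, not a value from the paper:

```python
import numpy as np

# Fixed Lambertian color direction in a 3-D color space (illustrative values).
base_color = np.array([0.2, 0.3, 0.9])

# Shading only scales this vector by the geometric factor G_B in eq. (7),
# so every shaded pixel lands on the same line through the origin.
G_B = np.random.default_rng(0).uniform(0.0, 1.0, size=100)
cluster = G_B[:, None] * base_color      # (100, 3) color points

rank = np.linalg.matrix_rank(cluster, tol=1e-10)  # 1 => a linear cluster
```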
Fig. 3. Lambertian surface from multiple views (a) geometric illustration (b) color clusters in S
space
Figure 3 (a) illustrates Lambertian surfaces at two different views. When illumination
is the same between different views, Lambertian reflections from a surface appear in the
same locations in the S space regardless of the viewing angle. However occlusion of
surfaces by other object surfaces can affect the distribution of color points in the linear
cluster. For the view 1, not all the Lambertian surfaces are visible due to occlusion, and
color clusters from only a part of the object are observable in the S space. The visible color
clusters are shown as a dark solid line in Fig. 3 (b). On the other hand, those invisible
surfaces are disoccluded in the view 0. Disocclusion is the emergence of object points or
patches into visibility from behind occlusion. Depending on the shading of the disoccluded
part, emerging color clusters of the object from occlusions can be included in the linear
cluster, or can appear outside the cluster yet in the extended line, since the disoccluded
part is a part of the same object. An example of spectral difference between the two views
is shown as the gray lines in Fig. 3 (b). Note that orthographic projections are illustrated
in Fig. 3 (a), but the above explanation also applies to perspective projections.
3.2 Specular Reflection
Highlights are due to specular reflections from dielectrics or metals. Scene radiance
from specular reflection is given by

L_r(λ) = e(λ) ρ_S(λ) G_S(θ_r, φ_r) = Σ_{i=0}^{2} γ_i S_i(λ) . (8)
Specular reflections alone, e.g., from metals or from black dielectrics, form linear clusters
in the S space like the Lambertian reflections. Because of the neutral reflectance, the
direction of the linear cluster from the dielectrics is the same as the illumination direction
in the S space. On the other hand, the direction of a linear cluster from a metal is
determined by the spectral reflectances and illumination.
For dielectrics, specular reflections are added to Lambertian reflections as shown in
Fig. 4 (a). With extended illumination or with roughened surfaces, the distribution of
specular reflections can spatially vary over a wide area of the shaded surface as shown
in Fig. 4 (a). Therefore specular reflections form planar clusters which include the linear
clusters formed by shading on the S space. The orientation of the plane is dependent on
the illumination color. When illumination is well collimated and the surface is smooth, the
color clusters generally form skewed T or L shapes, as suggested by Shafer [Sha85], since
the specular reflection is distributed over a small range of shading and forms a linear cluster
connected to a linear cluster of the Lambertian reflections. However, when illumination
is spatially extended or the surface is rough, the color clusters generally form skewed P
shapes, and the color cluster of specular reflections is planar and coplanar with a linear
cluster of Lambertian reflections.
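The planar-cluster claim can be verified numerically. In this sketch the body and illumination color directions are illustrative stand-ins, not measured values:

```python
import numpy as np

rng = np.random.default_rng(1)
body = np.array([0.2, 0.3, 0.9])    # Lambertian color direction (illustrative)
illum = np.array([1.0, 1.0, 1.0])   # neutral illumination color direction

G_B = rng.uniform(0.0, 1.0, 200)    # shading factors across the surface
G_S = rng.uniform(0.0, 2.0, 200)    # view-dependent specular factors

# Each color is a nonnegative combination of the two directions, so the
# whole cluster lies in the plane spanned by body and illum: rank 2, not 3.
colors = G_B[:, None] * body + G_S[:, None] * illum
rank = np.linalg.matrix_rank(colors, tol=1e-10)
```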
Fig. 4. Specularity; (a) color clusters in S space (b) geometric illustration for multiple views
(c) color clusters in S space for smooth surface (d) for rough surface
Fig. 5. Reflection for smooth surface and collimated illumination for (θ_i, φ_i) = (0°, 0°),
(θ_v, φ_v) = (0°, 0°), (35°, 0°), (70°, 0°), and relative surface roughness = 0.1
Fig. 6. Reflections for smooth surface and extended illumination for 0° ≤ θ_i ≤ 30°,
0° ≤ φ_i ≤ 360°, (θ_v, φ_v) = (0°, 0°), (35°, 0°), (70°, 0°), and relative surface roughness =
0.1
Except for Lambertian disocclusions, the spectral difference between the views results
from specularities, although it does not account for all the specularities due to the overlap
of specularities in the S space. In Fig. 4 (c), all the specularities appear as spectral
difference, but in Fig. 4 (d), only part of the specularities appears as spectral difference since
there is an overlap between the specular clusters from the two views. Since the amount
of spectral displacement of the specularities is determined by the difference in the view-
ing angles, object shape, variations in object shape and illumination distribution, it is
difficult to predict it in a simple manner for general objects and illumination. However
the general rule is that as the difference in the viewing directions increases, the spectral
overlaps between the specularities decrease. If the object shape varies more geometri-
cally, specularities are likely to change more. Specularities often completely disappear
depending on the views.
A point to note is occlusion by specularity. In some views, specularities can be dis-
tributed such that some Lambertian shading may not be visible at all. In other views,
the Lambertian shading may appear as new clusters in the S space, therefore can be
detected as spectral difference.
Fig. 7. Reflections for rough surface and collimated illumination for (θ_i, φ_i) = (0°, 0°),
(θ_v, φ_v) = (0°, 0°), (35°, 0°), (70°, 0°), and relative surface roughness = 0.3
4 Spectral Differencing
Fig. 8. Spectral differencing (a) images from different views (b) color clusters in S space
For two color images with different viewpoints, the spectral differencing is an algorithm
for finding the color points of one image which do not overlap with any color points
of the other image in a three-dimensional spectral space (e.g., the S space or a sensor
space with RGB values). In order to detect the view-inconsistent color points, the spec-
tral differencing algorithm computes the minimum spectral distance (MSD) images. The
computation of an MSD image is explained as follows with an example shown in Fig. 8.
Let α and β be two color images obtained from two different views. The notation

MSD(α ← β)

represents the MSD image of α from β. A pixel value of the MSD image MSD(α ← β)
is the minimum value of all the spectral distances between the pixel in the image α and
all the pixels in the image β. The spectral distance is defined as the Euclidean distance
between two color points in a three-dimensional spectral space. Any MSDs above a
threshold indicate the presence of specular reflections or Lambertian disocclusions. The
threshold for the MSD image is determined only by sensor noise, and no adjustment is
required for different environments.
Figure 8 (a) illustrates two images of an object with specularity from two different
viewpoints, and the corresponding color clusters in the S space are shown in Fig. 8 (b).
The pixel P in the image α is distantly located from the specular and Lambertian color
points of the image β, which indicates specular reflection at P. On the other hand, the
Lambertian reflections from the views α and β have the same linear cluster. Since the
pixel R in the region of Lambertian reflection in the image α is close to the Lambertian
points of the image β in the S space, it should not be detected by spectral differencing.
Spectral differencing does not detect all the specularities in a view. In Fig. 8 (b),
the color point from the pixel Q is located in the overlapped region between the planar
clusters of the views α and β. Since Q is located within the planar cluster formed by
specular reflection in the view β, it is hard to detect Q as a specular reflection when the
color points in the planar cluster of the view β are densely populated. The specular reflection
at Q can be detected by this algorithm only when the color points in the planar cluster
of the view β are sparsely distributed around Q.
In this paper, no study for finding faster algorithms for spectral differencing is pre-
sented. However, an important point to note is that the algorithm is pixelwise parallel.
Therefore with a parallel machine, the computation time depends only on the degree of
achievable parallelism of the machine.
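A brute-force NumPy sketch of the MSD computation follows. The function names, chunk size, and image shapes are my own choices; only the MSD definition and the sensor-noise threshold (about 2 RGB levels, as reported in the experiments) come from the text:

```python
import numpy as np

def msd_image(img_a, img_b):
    """MSD(a <- b): for each pixel color of img_a, the minimum Euclidean
    distance to any pixel color of img_b (no correspondence needed)."""
    a = img_a.reshape(-1, 3).astype(float)       # color points of image a
    b = img_b.reshape(-1, 3).astype(float)       # color points of image b
    msd = np.empty(len(a))
    for i in range(0, len(a), 1024):             # chunked to bound memory use
        chunk = a[i:i + 1024]
        d = np.linalg.norm(chunk[:, None, :] - b[None, :, :], axis=2)
        msd[i:i + 1024] = d.min(axis=1)
    return msd.reshape(img_a.shape[:2])

def detect_specularities(img_a, img_b, threshold=2.0):
    # Threshold set by sensor noise only; pixels above it are candidate
    # specularities or Lambertian disocclusions.
    return msd_image(img_a, img_b) > threshold
```

Each pixel's distance computation is independent, which is the pixelwise parallelism noted above; on a serial machine, the inner nearest-neighbor search could instead use a k-d tree over the colors of the second image.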
Spectral differencing is performed for the three simulated images shown in Fig. 7, and
the three images and the MSD images are shown in Fig. 9. The table of image arrange-
ment is also shown in Fig. 9. All the MSD images in Fig. 9 show detected specularities.
The detection is always an underestimation of the specular region, except for disocclusions.
The disoccluded Lambertian reflections are shown in MSD(0 ← 2), and there is a
region detected due to specular disocclusion in MSD(1 ← 2). In the view 2, the brightest
shading is occluded by specularities.
In the previous sections, spectral differencing was explained only for dielectrics with
uniform reflectance under singly colored illumination. In this section, it is discussed that
spectral differencing is effective as well for various objects under multiply colored
illumination.
When the reflectance ρ_B(λ) is not uniform in color over a surface, but has gradual variation,
the measured colors from a shaded Lambertian surface do not form a linear cluster. The
color cluster is dispersed depending on the degree of variation in the reflectance, as
illustrated in Fig. 10 (a). Some natural surfaces such as wood grains, leaves and human
faces have variation in reflectance.
Figure 10 illustrates the color clusters of a dielectric object with varying Lambertian
reflectance. The Lambertian cluster is not linear due to the variation in ρ_B(λ) in (1)
and (2), which are combined and written again below:

L_r(λ) = e(λ) ρ_S(λ) G_S(θ_r, φ_r) + e(λ) ρ_B(λ) G_B . (9)
Even with volume or planar clusters from Lambertian reflection, the Lambertian
consistency applies (except for disocclusion), since the geometric factor G_B is independent
of the viewing angle. On the other hand, specularities are mixed with differently
shaded and colored Lambertian reflections depending on the viewing directions, since the
geometric factor G_S(θ_r, φ_r) for specular reflections varies depending on (θ_r, φ_r) in (9).
Therefore spectral differencing can detect specularities that have different spectral
values over different views.
Fig. 10. Color clusters of a dielectric object with varying Lambertian reflectance in the S space
5.2 Dielectrics under Multiply Colored Illumination
Fig. 11. (a) (b) (c) Dielectric under multiply colored illumination from varying viewpoints (d)
inter-reflection
The specular reflection is a linear combination of the two components from the two
illumination colors e_1 and e_2. Each component, as well as the combination, varies depending
on the viewing angle (θ_r, φ_r). When the specularities appear on different Lambertian
surfaces without overlap, they represent the illumination colors in the two directions separately, as
shown in Fig. 11 (a). When the viewing geometry changes, the specularities can be mixed
on a surface and produce new specular points in the S space as shown in Fig. 11 (b) and
(c). Therefore spectral differences result from specular reflections except for Lambertian
disocclusions.
Inter-reflection. When there are many objects, the object surface of interest receives
not only the light from the illumination sources, but also the light reflected from the other
objects. The latter causes a local change of illumination, as illustrated in Fig. 11 (d). The
reflection from more than one surface is called inter-reflection. Object surfaces for the
first reflection act as secondary light sources, which are generally extended depending on the
object size. Together with the direct global illumination, the first reflection provides
multiply colored illumination for other surfaces as shown in Fig. 11 (d), and influences
the distribution of the color clusters of the other surfaces. In indoor environments, reflections
from walls and ceiling are the major sources of ambient light.
5.3 Metals
Since metals have only specular reflectance, there are no Lambertian reflections. When
there is only a single illumination source for a uniform metallic object without any ambi-
ent light, only a linear cluster appears in the S space. In most cases, however, metals are
observed with reflections from many light sources that include scene illumination sources
and many surrounding objects. Especially shiny metals reflect all the incoming light
from surrounding objects. The mixture of reflected light from direct illumination sources
and inter-reflections changes depending on viewing directions, with different geometric
weighting of the light coming from different directions. Therefore the color changes due
to the different mixture of light can be detected by spectral differencing.
An example is shown in Fig. 12 for the scene radiance under two different illumination
sources. Without any Lambertian components in (10), the scene radiance is a combination
of the two specular components, as

L_r(λ) = e_1(λ) ρ_S(λ) G_S1(θ_r, φ_r) + e_2(λ) ρ_S(λ) G_S2(θ_r, φ_r) . (11)
When the two components appear separately in a measured image without being mixed,
the color points in the S space form two different linear clusters, as shown in Fig. 12 (a).
Depending on the viewing directions, the two colors can be combined differently, as shown
in Fig. 12 (b) and (c), and spectral differencing can detect the reflections that differ in color.
Fig. 12. Metal under multiply colored illumination from varying viewpoints
6 Experimental Results
In order to test the algorithm, some experiments were carried out on various objects
all under multiply colored illumination. Illumination was provided by fluorescent light
in two directions on the ceiling of the room and by tungsten light in another direction
located closer to the objects. Four large fluorescent light tubes were used, two in each
direction, and half of a tungsten light bulb was screened with white paper for diffusing the
light and the remaining half was exposed. White walls and ceiling provide some ambient
illumination. The illumination environment is a normal indoor one, unlike a dark room
with collimated light.
Figure 13 shows images of dielectric objects with smooth reflectance variation. The
arrangement of measured images and MSD images is the same as that in Fig. 9. The
porcelain horse has variation in its Lambertian reflectance, especially near its shoulder
and the saddle. The MSD images show nonzero values where most of the sharp and diffuse
specularities are. The threshold for the MSD images was experimentally determined as
2 in terms of the RGB input values (0-255).
Figure 14 shows the results from a metallic object. The MSD images clearly show
most of the sharp specularities, indicating that the spectral movement of the sharp
specularities is large. Some of the diffuse specularities are also detected. For a given shape
of object, the diffuse specularities are better detected with wider angles between the
views. In fact, all the reflections from metals are specular reflections. However the very
diffuse reflections are not detectable when they form densely populated color clusters
like Lambertian reflections in a three-dimensional color space, and the different viewing
geometry does not generate enough spectral differences.
The experimental results with real objects demonstrate that spectral differencing is a
remarkably simple and effective way of detecting specularities without any geometric rea-
soning. The algorithm does not require any geometric information or image segmentation.
Therefore it can provide independent information to other algorithms such as structure
from stereo, structure from motion, or image segmentation algorithms. Since the spectral
differencing does not depend on any image segmentation, there are no assumptions of
uniformly colored dielectric objects and singly colored illumination.
A limitation of the spectral differencing algorithm is that disocclusions are detected
together with specularities and they are indistinguishable. Separation between the spec-
ularity and disocclusion may be achieved with other algorithms such as color image seg-
mentation algorithms [BLL90]. As mentioned above, the spectral differencing algorithm
can be easily integrated with a color segmentation algorithm, and we are currently
developing some integrated methods.
7 Conclusion
In this paper, an algorithm is proposed for the detection of specularities based on physical
models of reflection mechanisms. The algorithm, called spectral differencing, is pixelwise
parallel, and it detects specularities based on color differences between a small number of
multiple color images without any geometric correspondence or image segmentation. The
key contribution of the spectral differencing algorithm is to suggest the use of multiple
views in understanding reflection properties: although multiple views have been one
of the major cues in computer vision for obtaining object shape or structure, they have
not been used for obtaining reflection properties. The spectral differencing algorithm is
based on the Lambertian consistency, and the object and illumination domains include
nonuniformly colored dielectrics and metals, under multiply colored scene illumination.
The experimental results conform well to our model based on the Lambertian consistency.
References
[AB87] J. Aloimonos and A. Bandyopadhyay. Active vision. In Proc. 1st Int. Conf. on Computer Vision, pages 35-54, 1987.
[Baj88] R. Bajcsy. Active perception. Proceedings of the IEEE, 76:996-1005, 1988.
[BB88] G. Brelstaff and A. Blake. Detecting specular reflections using Lambertian constraints. In Proc. of IEEE Int. Conf. on Computer Vision, pages 297-302, Tarpon Springs, FL, 1988.
[BLL90] R. Bajcsy, S.W. Lee, and A. Leonardis. Color image segmentation with detection of highlights and local illumination induced by inter-reflections. In Proc. 10th International Conf. on Pattern Recognition, Atlantic City, NJ, June 1990.
[BS63] P. Beckmann and A. Spizzichino. Scattering of Electromagnetic Waves from Rough Surfaces. Pergamon Press, London, UK, 1963.
[Coh64] J. Cohen. Dependency of the spectral reflectance curves of the Munsell color chips. Psychon. Sci., 1:369-370, 1964.
[Ger87] R. Gershon. The Use of Color in Computational Vision. PhD thesis, Department of Computer Science, University of Toronto, 1987.
[HB89] G.H. Healey and T.O. Binford. Using color for geometry-insensitive segmentation. Journal of the Optical Society of America, 6, 1989.
[KSK88] G.J. Klinker, S.A. Shafer, and T. Kanade. Image segmentation and reflection analysis through color. In Proceedings of the DARPA Image Understanding Workshop, pages 838-853, Pittsburgh, PA, 1988.
[LBS90] H.-C. Lee, E.J. Breneman, and C.P. Schulte. Modeling light reflection for computer vision. IEEE Trans. PAMI, 12:402-409, 1990.
[MW86] L.T. Maloney and B.A. Wandell. A computational model of color constancy. Journal of the Optical Society of America, 1:29-33, 1986.
[NIK90] S.K. Nayar, K. Ikeuchi, and T. Kanade. Determining shape and reflectance of hybrid surfaces by photometric sampling. IEEE Trans. Robo. Autom., 6:418-431, 1990.
[Sha85] S.A. Shafer. Using color to separate reflection components. COLOR Research and Application, 10:210-218, 1985.
[Td91] H.D. Tagare and R.J. deFigueiredo. Photometric stereo for diffuse non-Lambertian surfaces. IEEE Trans. PAMI, 13, 1991.
[TS67] K.E. Torrance and E.M. Sparrow. Theory for off-specular reflection from roughened surfaces. Journal of the Optical Society of America, 57:1105-1114, 1967.
[Wol89] L.B. Wolff. Using polarization to separate reflection components. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition, pages 363-369, San Diego, CA, 1989.
This article was processed using the LaTeX macro package with ECCV92 style
Data and Model-Driven Selection using Color
Regions *
I Introduction
A key problem in object recognition is selection, namely, the problem of isolating
regions in an image that are likely to come from a single object. This isolation can
be either based solely on image data (data-driven) or can incorporate the knowledge
of the model (task-driven or model-driven). It has been shown that the search in the
matching stage of recognition can be considerably reduced if recognition systems were
equipped with a selection mechanism thus allowing the search to be focused on those
matches that are more likely to lead to a correct solution [3]. Even though selection can
be of help in recognition, it has largely remained unsolved. The lack of knowledge of
illumination conditions and surface geometries of objects in the scene, and the problems
of occlusion, shadowing, specularities, and interreflections in the image make it difficult to
interpret groups of data features as belonging to a single object. Previous approaches to
selection have focused on the problem of data-driven selection by grouping data features
such as edges and lines, based on constraints such as parallelism, collinearity, distance,
and orientation [4][3]. But ensuring the reliability of such groupings has been found
to be difficult, which restricts their effectiveness in reducing the search complexity in
recognition.
In this paper we present a way of performing data and model-driven selection by
extracting color regions from an image. A color region almost always comes entirely from
a single object, giving, therefore, more reliable groups than existing grouping methods
and this can be useful for data-driven selection. Because objects tend to show color
constancy under most illumination conditions, color when specified appropriately, can be
a stable cue for most appearances of objects in scenes, thus making it also suitable for
model-driven selection.
* This paper describes research done at the AI Lab., M.I.T. Support for the lab's research is
provided in part by Office of Naval Research and in part by the Advanced Research Projects
Agency of the Dept. of Defense. The author is supported by an IBM Fellowship.
by (s(R), v(R)), where s(R) = saturation or purity of the color of region R, v(R) =
brightness, and 0 ≤ s(R), v(R) ≤ 1.0. The size is simply the normalized size given by
r(R) = Size(R)/Image-size. Similarly, the color and size contrast were chosen as features
for determining relative saliency. The color contrast measure chosen enhances a region
R's contrast if it is surrounded by a region T of different hue and is given by c(R,T)
below:
    c(R,T) = k1·d(C_R, C_T)         if R and T are of the same hue
    c(R,T) = k2 + k1·d(C_R, C_T)    otherwise                           (1)
where k1 = 1/(2√2) and k2 = 0.5, so that 0 ≤ c(R,T) ≤ 1.0, and d(C_R, C_T) is the
CIE distance between the chromaticities of the two regions R and T with specific colors
C_R = (r0, g0, b0)^T and C_T = (r, g, b)^T, given by
    d(C_R, C_T) = sqrt((r0/(r0+g0+b0) - r/(r+g+b))^2 + (g0/(r0+g0+b0) - g/(r+g+b))^2). The
size contrast is simply the relative size and is given by t(R,T) = min(size(R)/size(T), size(T)/size(R)).
In both cases the neighboring region T is the rival neighbor that ranks highest when all
neighbors are sorted first by size, then by extent of surround, and finally by contrast (size
or color contrast as the case may be), and will be left implicit here.
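The two contrast measures can be sketched as follows. This is an illustrative reading of the text, not the authors' code; the choice of rg-chromaticity for the distance and the value of k1 (chosen so that c(R,T) stays in [0, 1]) are assumptions.

```python
import math

def chroma_distance(c_r, c_t):
    """CIE-style chromaticity distance between two RGB triples (assumed
    rg-chromaticity form, following the paper's d(C_R, C_T))."""
    r0, g0, b0 = c_r
    r, g, b = c_t
    s0, s1 = r0 + g0 + b0, r + g + b
    return math.sqrt((r0 / s0 - r / s1) ** 2 + (g0 / s0 - g / s1) ** 2)

def color_contrast(c_r, c_t, same_hue, k1=0.5 / math.sqrt(2), k2=0.5):
    """Eq. (1): contrast is boosted by k2 when the rival neighbor differs in hue."""
    d = chroma_distance(c_r, c_t)
    return k1 * d if same_hue else k2 + k1 * d

def size_contrast(size_r, size_t):
    """t(R, T) = min(size(R)/size(T), size(T)/size(R))."""
    return min(size_r / size_t, size_t / size_r)
```

Note that k2 = 0.5 guarantees a differently hued neighbor always yields at least half the maximum contrast, which matches the stated goal of enhancing regions surrounded by a different hue.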
The weighting functions for these features were chosen both from the point of data-
driven selection and the extent to which they reflect our sensory judgments. Thus for
example, the functions for weighting intrinsic color and color contrast, f1(s(R)),
f2(v(R)) and f4(c(R)), were chosen to be linear (f1(s(R)) = 0.5 s(R), f2(v(R)) =
0.5 v(R), and f4(c(R)) = c(R), respectively) to emphasize brighter and purer colors and
higher contrast respectively. The size of a region is given a non-linear weight to deem-
phasize both very small and very large regions. Very small regions are usually spurious
while very large regions tend to span more than one object, making both unsuitable for
selection. The corresponding weighting function f3(r(R)) was found by performing some
informal psychophysical experiments and is given by
            c1·ln(1 - n)               0 < n ≤ t1
            1 - e^(-c2·n)              t1 < n ≤ t2
    f3(n) = s2 - c3·ln(1 - n + t2)     t2 < n ≤ t3      (2)
            s3·e^(-c4·(n - t3))        t3 < n ≤ t4
            0                          t4 < n ≤ 1.0
Figures 1d-1f and 2c-2f show the most distinctive regions found by applying the
color-saliency measure to all the color regions extracted from the scenes shown in Figures
1a and 2a respectively. In the experiments done so far, the color-saliency measure was
found to select fairly large bright-colored regions that showed good contrast with their
neighbors, and appeared perceptually significant.
4.2 Use of Salient Color-based Selection in Recognition
Data-driven selection based on salient color regions is primarily useful when the object
of interest has at least one of its regions appearing salient in the given scene, since the
search for data features that match model features can be restricted to these regions.
Selecting salient regions gives a small number of large-sized groups which were shown to
be very useful for indexing into the library of models [1]. But to recognize a single object,
it is desirable to have small-sized groups. For this, existing grouping techniques can be
applied to the data features found within the color regions to obtain reliable small-sized
groups.
To estimate the search reduction that can be achieved with such a selection mecha-
nism, let (M, N) = total number of features (such as edges, lines, etc.) in the model and
image respectively. Let (M_R, N_R) = total number of color regions in the model and image
respectively. Let N_S = number of salient regions that are retained in an image. Let g
= average size of a group of data features within a model or image. Let (G_M, G_N) =
number of groups formed (using any existing grouping scheme) in the model and image
respectively. Finally, let G_Ni be the number of groups in the salient image region i. Using
the alignment method of recognition [3], at least three corresponding data features are
needed to solve for the pose (appearance) of the model in the image. If no selection of
the data features is done, then the brute-force search required to try all possible triples
is O(M^3 N^3). If selection is done by grouping methods only (i.e., without color region
selection), then the number of matches that need to be tried is O(G_M G_N g^3 g^3), since only
triples within groups need to be tried. When grouping is done within color regions, the
groups obtained are even smaller in number and are more reliable, so that the overall
effect is to reduce search (by as much as a factor of 10^7). When grouping is restricted to
salient color regions, the number of matches further reduces to O(Σ_{j=1}^{N_S} G_Nj G_M g^3 g^3).
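The three search estimates above can be computed directly; a minimal sketch, where the function names and any sample numbers are illustrative rather than taken from the paper's tables:

```python
def brute_force_matches(M, N):
    """All model-feature triples against all image-feature triples: O(M^3 N^3)."""
    return M**3 * N**3

def grouped_matches(G_M, G_N, g):
    """Only triples within groups: O(G_M * G_N * g^3 * g^3),
    with g^3 triples per model group and per image group."""
    return G_M * G_N * g**3 * g**3

def salient_region_matches(groups_per_salient_region, G_M, g):
    """Grouping restricted to the N_S retained salient regions:
    sum over j of G_Nj * G_M * g^3 * g^3."""
    return sum(G_Nj * G_M * g**3 * g**3 for G_Nj in groups_per_salient_region)
```

For instance, with M = 10 model features and N = 100 image features, brute force tries 10^3 × 100^3 = 10^9 triples, while restricting matching to a handful of groups of size g = 7 cuts this by several orders of magnitude.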
To get an estimate of the number of matches and time taken for matching in real
scenes when color-based selection is used, we recorded the number of color regions,
and the number of data features within regions in some selected models and scenes
(Figure 2 and 3 show typical examples of models and scenes tried). The regions were
ordered using the color saliency measure and the four most salient regions were re-
tained. Then search estimates were obtained using the above formulas, and assuming
a grouping scheme that gives a number of groups within regions that is bounded by
the number of features in a region (which is a good bound using simple grouping
average size of the groups in a region
schemes such as grouping 'g' closely-spaced parallel lines in a region). The result of
such studies is shown in Table I. As can be seen from this table, the number of matches
is always smaller when salient color regions are used for selection.
5.1 Model Description
The color region information in the model (an image or view of the model, that is) is
represented as a region adjacency graph (RAG) M_G = <V_m, E_m, C_m, R_m, S_m, B_rm, B_sm>,
where V_m = color regions in the model, E_m = adjacencies between color regions, C_m(u)
= color of region u ∈ V_m, R_m(u,v) = relative size of region v w.r.t. region u, S_m(u) =
size of region u, B_rm = a bound on the relative size of regions given by R_m, and
B_sm = a bound on the absolute size of regions given by S_m.
This description exploits features of regions such as color and adjacency information
that tend to remain more or less invariant in most scenes where the model appears.
Also, the bounds B_rm and B_sm indicate the extent of pose changes and occlusions that
a selection mechanism is expected to tolerate. The description, therefore, is fairly rich
and has some structural information about color regions that can be used to restrict
the number of false positives, and some constraints on the relative and absolute size
changes that can be used to restrict the number of false negatives made by the selection
mechanism.
Finally, the color region information in the image is similarly organized as an image
region adjacency graph I_G = <V_I, E_I, C_I, R_I, S_I>, where each term has a meaning
analogous to <V_m, E_m, C_m, R_m, S_m> respectively.
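The model description above maps naturally onto a small data structure. This is a sketch of one possible representation; the field names are illustrative stand-ins for the tuple components, not the paper's implementation.

```python
from dataclasses import dataclass

@dataclass
class RegionAdjacencyGraph:
    """Minimal sketch of the model RAG M_G = <V, E, C, R, S, B_rm, B_sm>."""
    colors: dict            # C: region id -> perceptual color label
    sizes: dict             # S: region id -> absolute size (e.g., in pixels)
    adjacency: set          # E: frozenset pairs {u, v} of adjacent regions
    rel_size_bound: float   # B_rm: bound on relative size changes
    abs_size_bound: float   # B_sm: bound on absolute size changes

    def regions(self):
        """V: the set of color regions."""
        return set(self.colors)

    def relative_size(self, u, v):
        """R(u, v): size of region v relative to region u."""
        return self.sizes[v] / self.sizes[u]
```

An image RAG would use the same structure without the two bounds, mirroring I_G = <V_I, E_I, C_I, R_I, S_I>.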
5.2 Location Strategy
Given the image region adjacency graph IG, the model object if present in the scene
will form a subgraph in IG. The location strategy, therefore, regards the problem of selec-
tion as the problem of searching for suitable subgraphs that satisfy the model description.
Although the number of subgraphs is exponential, a set of unary and binary constraints
supplied in the model description restricts the subgraphs to a small number of feasible
subgraphs. The perceptual color of a region and its absolute size bound (B_sm) were used
as the unary constraints, while region adjacency and relative size were used as the binary
constraints. Specifically, the lack of adjacency between two model regions was used to
prune false matches to two adjacent image regions. The bound B_rm in the model was
used to discard matches when the relative size exceeded this bound.
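The unary and binary pruning tests can be sketched as two predicates. The use of a maximum size ratio against each bound is an interpretation of how the bounds are applied, and all names here are hypothetical:

```python
def unary_ok(model_color, model_size, img_color, img_size, abs_size_bound):
    """Unary constraints: perceptual color must match, and the absolute size
    change must stay within B_sm (modeled here as a max ratio; an assumption)."""
    if model_color != img_color:
        return False
    ratio = max(img_size / model_size, model_size / img_size)
    return ratio <= abs_size_bound

def binary_ok(model_adjacent, img_adjacent, model_rel, img_rel, rel_size_bound):
    """Binary constraints: non-adjacent model regions cannot map to adjacent
    image regions, and the relative size must stay within B_rm."""
    if not model_adjacent and img_adjacent:
        return False
    ratio = max(img_rel / model_rel, model_rel / img_rel)
    return ratio <= rel_size_bound
```

Candidate subgraphs are kept only if every mapped region passes the unary test and every mapped pair passes the binary test.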
The location strategy searched among the feasible subgraphs for a subgraph (or
subgraphs) that in some sense best matches the given model description. Such a subgraph
I_g = <V_g, E_g, C_g, R_g, S_g>, with ||V_g|| ≤ ||V_m|| and ||E_g|| ≤ ||E_m||, has
associated with it a node correspondence vector T = {(u_m, u_g) | ∀u_m ∈ V_m, u_g ∈ V_g ∪
{⊥}}, where ⊥ denotes a null match, and is chosen to be the one that minimizes the following mea-
sure:
object occurs. The scene shown has several other objects with one or more of the model
colors. Also, the model appears in a different pose, being rotated to the left about the
vertical axis. Figure 3d shows the result of applying the unary color constraints, and
Figure 3e, the subsequent use of the absolute size constraint. Finally, the subgraph with
the lowest value of SCORE is shown in Figure 3f. As can be seen from this figure, a
region containing most of the model object has been identified even with an imperfect
color image segmentation.
5.3 Search Reduction using Color-based Model-driven Selection
The color-based model-driven selection mechanism provides a correspondence of model
regions to some image regions. The matching of model features to image features can
be restricted to within corresponding regions, and when this is combined with grouping
within regions as described in Section 4.2, the number of matches to be tried for recogni-
tion reduces further. To estimate the search reduction in this case, let N_z be the number
of solution subgraphs given by the selection mechanism, and let I_k represent one such
subgraph with the number of nodes = N_k. Let (G_uj, G_vj) = the number of groups in
region u_j of the solution subgraph I_k, and region v_j of the model RAG that corresponds
to u_j as implied by the correspondence vector T associated with I_k. Then, assuming as
before that the average size of a group = g, the number of matches that need to be tried is
O(Σ_{k=1}^{N_z} Σ_{j=1}^{N_k} G_uj G_vj g^3 g^3). By trying several models and images of scenes where they
occurred, we recorded the average number of subgraphs generated by the model-driven
selection mechanism. The search estimates were obtained using the above formula for
model-driven selection with grouping, and the formulas for other methods mentioned in
Section 4.2. The results are shown in Table II. The bound on the number of groups in a
region was the same as used in Section 4.2. As can be seen from the table, the number of
matches using correspondence between model and image color regions is always lower.
6. Conclusions
In this paper we have shown how color can be used as a cue to perform both data and
model-driven selection. Unlike other approaches to color, we have used the intended task
to constrain the kind of color information to be extracted from images. This led to a fast
color image segmentation algorithm based on perceptual categorization of colors, which
later formed the basis of data and model-driven selection. Future work will be directed
towards integrating the selection mechanism with a 3D-from-2D recognition system to
obtain statistics of false positives and negatives and the actual search reduction due to
selection.
References
1. D.T. Clemens and D.W. Jacobs, "Space and time bounds on indexing 3D models from 2D images," IEEE Trans. Pattern Anal. and Machine Intelligence, vol. 13, Oct. 1991.
2. T.F. Syeda-Mahmood, "Data and model-driven selection using color regions," AI Memo 1270, Artificial Intelligence Lab., M.I.T., 1992.
3. W.E.L. Grimson, Object Recognition by Computer: The Role of Geometric Constraints, MIT Press: Cambridge, 1990.
4. D.G. Lowe, Perceptual Organization and Visual Recognition, Kluwer Academic: Boston, 1985.
5. G.J. Klinker, S.A. Shafer, and T. Kanade, "A physical approach to color image understanding," Intl. Jl. Computer Vision, vol. 4, no. 1, pp. 7-38, Jan. 1990.
6. E. Land, "Recent advances in retinex theory," in Central and Peripheral Mechanisms of Colour Vision, T. Ottoson and S. Zeki, Eds., pp. 5-17, London: Macmillan, 1985.
7. L.T. Maloney and B. Wandell, "Color constancy: A method for recovering surface spectral reflectance," Jl. Optical Society of America, vol. 3, pp. 29-33, 1986.
8. E. Sternheim and R. Boynton, "Uniqueness of perceived hues investigated with a continuous judgmental technique," Jl. of Experimental Psychology, vol. 72, pp. 770-776, 1966.
9. M.J. Swain and D. Ballard, "Indexing via color histograms," Third Int. Conf. Computer Vision, 1990.
Fig. 1. Illustration of color region segmentation and color-saliency. (a) Input image consisting
of regions of 3 different colors: red, green and blue against an almost white background. (b)
Result of Step 2 of the algorithm with regions colored differently from the original image. (c) Final
segmentation of the image of Fig. 1a. (d)-(f) The three most distinctive regions found using
the color saliency measure.
Table 1. Search reduction using color-based data-driven selection. The last column shows the
match time when color-based data-driven selection is combined with grouping. The color-based
selection is done by choosing the four most salient regions. Here g = 7, time per match = 1
microsecond, and the grouping method is as described in the text.
Table 2. Search reduction using color-based model-driven selection. The last column shows the
match time when model-color-based selection is combined with grouping. Here g = 7, time per
match = 1 microsecond, and the grouping method is as described in the text.
Fig. 2. Illustration of color region segmentation and color-saliency. (a) Input image depicting a
scene of objects of different materials and having occlusions and inter-reflections. (b) Segmented
image using the color region segmentation algorithm. (c)-(f) The four most distinctive regions
detected using the color-saliency measure. The white portion in the red book appears so because
of the white background.
Fig. 3. Illustration of color-based model-driven selection. (a) The object serving as the model.
(b) Its color description produced by the segmentation algorithm of Section 3. (c) A cluttered
scene in which the object appears. (d) Regions selected based on unary color constraint. (e)
Regions of (d) pruned after using the unary size constraint. (f) Regions corresponding to the
best subgraph that matched the model specifications.
This article was processed using the LaTeX macro package with ECCV92 style
Recovering Shading from Color Images *
Abstract.
Existing shape-from-shading algorithms assume constant reflectance
across the shaded surface. Multi-colored surfaces are excluded because both
shading and reflectance affect the measured image intensity. Given a stan-
dard RGB color image, we describe a method of eliminating the reflectance
effects in order to calculate a shading field that depends only on the rel-
ative positions of the illuminant and surface. Of course, shading recovery
is closely tied to lightness recovery and our method follows from the work
of Land [10, 9], Horn [7] and Blake [1]. In the luminance image, R + G + B ,
shading and reflectance are confounded. Reflectance changes are located and
removed from the luminance image by thresholding the gradient of its loga-
rithm at locations of abrupt chromaticity change. Thresholding can lead to
gradient fields which are not conservative (do not have zero curl everywhere
and are not integrable) and therefore do not represent realizable shading
fields. By applying a new curl-correction technique at the thresholded lo-
cations, the thresholding is improved and the gradient fields are forced to
be conservative. The resulting Poisson equation is solved directly by the
Fourier transform method. Experiments with real images are presented.
1 Introduction
Color presents a problem for shape-from-shading methods because it affects the apparent
"shading" and hence the apparent shape as well. Color variation violates one of the
main assumptions of existing shape-from-shading work, namely, that of constant albedo.
Pentland [11] and Zheng [16] give examples of the errors that arise in violating this
assumption.
We address the problem of recovering (up to an overall multiplicative constant) the
intrinsic shading field underlying a color image of a multi-colored scene. In the ideal
case, the recovered shading field would be a graylevel image of the scene as it would have
appeared had all the objects in the scene been gray. We take as our definition of shading
that it is the sum of all the processes affecting the image intensity other than changes in
surface color (hue or brightness). Shading arises from changes in surface orientation and
illumination intensity.
It is quite surprising how well some shape-from-shading (SFS) algorithms work when
they are applied directly to graylevel images of multi-colored scenes [16]. This is encour-
aging since it means that shading recovery may not need to be perfect for successful
shape recovery. Nonetheless, the more accurate the shading the more accurate we can
expect the shape to be.
Consider the image of Fig. 1(a), which is a black and white photograph of a color
image of a cereal box. The lettering is yellow on a deep blue background. Applying
Pentland's [11] remarkably simple linear SFS method to the graylevel luminance version
(i.e. R+G+B) of this color image generates the depth map in Fig. 1(h). Although the
image violates Pentland's assumptions somewhat in that the light source was not very
* M.S. Drew is indebted to the Centre for Systems Science at Simon Fraser University for partial
support; B.V. Funt thanks both the CSS and the Natural Sciences and Engineering Research
Council of Canada for their support.
distant and the algorithm is known to do poorly on flat surfaces, it is clear that the
yellow lettering creates serious flaws in the recovered shape. Note also that the errors are
not confined to the immediate area of the lettering.
The goal of our algorithm is to create a shaded intensity image in which the effects of
varying color have been removed in order to improve the performance of SFS algorithms
such as Pentland's. Similar to previous work on lightness, the idea is to separate intensity
changes caused by change in color from those caused by change in shape on the basis
that color-based intensity changes tend to be very abrupt. Most lightness work, however,
has considered only planar "Mondrian" scenes and has processed the color channels sepa-
rately. In lightness computations, the slowly varying intensity changes are removed from
each color channel by thresholding on the derivative of the logarithm of the intensity in
that channel. We instead remove intensity gradients from the logarithm of the luminance
image by thresholding whenever the chromaticity changes abruptly. Both the luminance
and the chromaticity combine information from all three color channels.
Many examples of lightness computation in the literature [13, 14, 1, 8, 4] use only
synthetic images. A notable exception is in Horn [7] in which he discusses the problem of
thresholding and the need for appropriate sensor spacing. He conducts experiments on
a few very simple real images. Choosing an appropriate threshold is notoriously difficult
and the current problem is no exception. By placing the emphasis on shading rather
than lightness, however, fewer locations are thresholded because it is the large gradients
that are set to zero, not the small ones. When a portion of a large gradient change
remains after thresholding due to the threshold being too high, the curl of the remaining
luminance gradient becomes non-zero. Locations of non-zero curl are easily identified and
the threshold modified by a technique called "curl-correction."
In what follows, we first analyze the case of one-dimensional images before proceeding
to the two-dimensional case. Then we elaborate on curl-correction and present results of
tests with real images.
2 One-dimensional Case
2.1 Color Images with Shading
Let us consider as a starting point the surface described by the one-dimensional depth
map shown in Fig. 2(a). If this surface has Lambertian reflectance and is illuminated by
a point source from an angle of 135° (i.e., from the upper left), the resulting intensity
distribution will be as shown in Fig. 2(b). So far there is no color variation, so all the
intensity variation is due to shading.
If instead, the surface has regions of different color, each described by its own color
triple (R,G,B) in the absence of shading, (see Fig. 2(c)), then in a color image of the
surface the RGB values will be these original color triples modulated by the shading
field, as shown in Fig. 2(d). The combined effect of color edges and shading edges leads
to discontinuities in the observed RGB values at image locations corresponding to both
types. Fig. 2(d) has both kinds of edges--there are color discontinuities where there are
no shape discontinuities, and there are also shape discontinuities without accompanying
color ones.
To differentiate between the two kinds of edges, we note that if we form the analog
of the chromaticity [15] in RGB space, i.e.,
r = R/(R+G+B), g = G/(R+G+B)
then r and g are independent of the shading (cf. [12, 6, 2]) as can be seen in Fig. 2(e).
This, of course, must be the case because rg-chromaticity is simply a way of normalizing
the magnitude of the RGB triple. Both r and g are fixed throughout a region of constant
color. So long as we can assume that color edges never coincide with shape edges,
rg-chromaticity will distinguish between them.
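The normalization above is straightforward to verify numerically. A minimal sketch (the zero-sum guard is an added practical detail, not from the text):

```python
import numpy as np

def rg_chromaticity(rgb):
    """Map RGB values (shape (..., 3)) to shading-invariant chromaticity:
    r = R/(R+G+B), g = G/(R+G+B)."""
    s = rgb.sum(axis=-1)
    s = np.where(s == 0, 1.0, s)   # guard against black pixels (assumption)
    return rgb[..., 0] / s, rgb[..., 1] / s
```

Because shading multiplies all three channels by the same factor, it cancels in the ratio, which is exactly the invariance the section relies on.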
2.2 Shading Recovery
In a sense, the recovery of shading from color images is the obverse of the recovery
of lightness from graylevel images [7]. In the case of lightness, it is the sharp intensity
changes that are taken to represent reflectance changes and are retained, while the small
intensity changes are factored out on the basis that the illumination varies smoothly;
whereas, in the case of shading it is the large reflectance changes that are factored out
and the small ones retained. A significant difference, however, is that the reflectance
changes are identified, not by sharp intensity changes, but by sharp chromaticity changes.
Small chromaticity changes are assumed to be caused by changes in the spectrum of the
illuminant and are retained as part of the shading.
We begin by following the usual lightness recovery strategy [7, 1], but to do so we
first need to transform the color image into a graylevel luminance image by forming
I' = R + G + B.
Under the assumption that the luminance is well described as the product of a shading
component S' and a color component C', the two components are separated by taking
logarithms: I(x) = log I'(x) = log S'(x) + log C'(x)
Differentiating, thresholding away all components of the derivative coincident with large
chromaticity changes, and integrating yields the logarithm of the shading.
Chromaticity changes (dr, dg) are determined from the derivative of the chromaticity
(r, g), where the threshold function locates pixels with high |dr| or |dg|. The threshold
function is defined as T(x) = 1 at pixels where (dr, dg) is small and T(x) = 0 where it is
large. T will be mostly 1; whereas, for lightness, T will be mostly 0.
Applying T to the derivative of the luminance image eliminates C = log C', so
S = log S' can be recovered by integrating the thresholded intensity derivative. In other
words, dS = T(dI),  S = ∫ dS,  S' = exp S
It is easy to see from Figs. 2(a)-(e) that this algorithm will recover the shading
properly for the case of perfect step edges and the correct result is in fact obtained by
the algorithm as shown in Fig. 2(f).
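The one-dimensional pipeline can be sketched end to end. The threshold value and the use of simple forward differences are illustrative assumptions; the paper does not specify either:

```python
import numpy as np

def recover_shading_1d(rgb, chroma_thresh=0.05):
    """1-D sketch: zero the log-luminance derivative at abrupt chromaticity
    changes, then integrate and exponentiate. rgb has shape (n, 3)."""
    lum = rgb.sum(axis=1)                       # I' = R + G + B
    r, g = rgb[:, 0] / lum, rgb[:, 1] / lum     # rg-chromaticity
    dI = np.diff(np.log(lum))                   # derivative of log luminance
    dr, dg = np.diff(r), np.diff(g)
    T = (np.abs(dr) < chroma_thresh) & (np.abs(dg) < chroma_thresh)
    dS = np.where(T, dI, 0.0)                   # dS = T(dI)
    S = np.concatenate([[0.0], np.cumsum(dS)])  # S = integral of dS
    return np.exp(S)                            # S' = exp S, up to a constant
```

On a signal containing only a perfect color step (no shape change), the chromaticity threshold removes the luminance jump entirely and the recovered shading is flat, mirroring the result in Fig. 2(f).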
2.3 Integration by Fourier Expansion
A fast, direct method of integrating the thresholded derivative of the shading field in
the two-dimensional case is to apply Fourier transforms. While not efficient in the one-
dimensional case it is easy to understand how the method works. Firstly, if the discrete
Fourier transform of dS is F(dS), the effect of differentiation is given by
F(dS) = 2πiu F(S)
where the frequency variable is u. This expression no longer holds exactly, however, when
the derivative is calculated by convolution with a finite-differences mask. For the case of
convolution by a derivative mask, after both the mask and the image are extended with
zeroes to avoid wraparound error [5], Frankot and Chellappa [3] show how to write the
Fourier transform of the derivative operation in terms of u. Call this transform H. H
is effectively the Fourier transform of the derivative mask, and integration of dS simply
involves dividing by H and taking the inverse transform:
F(S) = F(dS)/H
This division will not be carried out at u = 0, of course, so that integration by this
method does not recover the DC term representing the unknown constant of integration.
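The division F(S) = F(dS)/H can be demonstrated in one dimension. This is a sketch under stated assumptions: the difference mask [1, -1] stands in for whichever mask Frankot and Chellappa's construction would use, and the input is taken to be the circular backward difference of a zero-padded signal:

```python
import numpy as np

def integrate_fourier_1d(dS):
    """Integrate a derivative signal by dividing by H, the Fourier transform
    of the finite-difference mask. dS is assumed to be the circular backward
    difference of a zero-padded signal (padding avoids wraparound error)."""
    N = len(dS)
    F_dS = np.fft.fft(dS)
    mask = np.zeros(N)
    mask[0], mask[1] = 1.0, -1.0      # backward-difference mask [1, -1]
    H = np.fft.fft(mask)
    H[0] = 1.0                        # do not divide at u = 0 ...
    F_S = F_dS / H
    F_S[0] = 0.0                      # ... the DC term cannot be recovered
    return np.real(np.fft.ifft(F_S))
```

The result is the original signal up to the unknown constant of integration (the DC term), exactly as the text describes.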
To generalize the method to real, two-dimensional images two main problems need to be
addressed: how to properly deal with non-step edges and, since in two-dimensions the
gradient replaces the derivative, how to integrate easily the gradient image?
As in the one-dimensional case, the procedure is to determine a threshold image T
from the chromaticity image, apply the threshold to the derivative of the logarithm of the
luminance image I and integrate the result to obtain the shading field S. The threshold
function T comes from the chromaticity itself, not its log, and for two-dimensional images
is based on the gradient of the chromaticity vector field
∇L = T∇I  ⟹  ∇²L = ∇·(T∇I),  n·∇L = n·(T∇I) on the boundary.
However, the converse holds only if the field T∇I has zero curl:
∇²L = ∇·(T∇I),  n·∇L = n·(T∇I) on the boundary,  and  ∇×(T∇I) = 0  ⟹  ∇L = T∇I
Blake argues that in theory T V I will have zero curl and thus forms a conservative
field. Furthermore, he points out that if this condition is violated in practice, the best
solution is robust in the sense that it minimizes a least-squares energy integral for L.
The demand that the curl be approximately zero is important because it amounts to the
condition that the recovered lightness (or shading field in our case) be integrable from
its gradient.
The fact that we use a threshold T that is not derived from I itself, but is instead
derived from (r, g), does not make any difference in the proof of Blake's theorem. In fact,
any T will do--the formal statement of the theorem follows through no matter how T
is chosen. The crucial point is that while the theorem holds for continuous images and
step edges, in practice the curl may not be zero because of edges that are not perfect
steps. With a non-step edge, thresholding may zero out only half the effective edge, say,
in which case T V I will not be conservative.
Another situation that can affect integrability is when some chromaticity edges are
slightly stronger than others so that some of the weaker edges are missed by the threshold.
For example, consider the case of a square of a different color in the middle of an otherwise
uniform image. If for some reason the horizontal edges are slightly stronger than the
vertical ones, so that only the horizontal ones are thresholded away, then the curl,
necessarily zero everywhere in the input gradient image, will become non-zero at the
corners of the square. The non-zero curl indicates that the resulting integrated image
will not make sense and we cannot hope to recover the correct, flat shading image from
this thresholded gradient.
What should be done to enforce integrability? Blake further differentiates the thresh-
olded gradient calculating its divergence, which results in the Laplacian of the lightness
field. In essence, Blake's method enforces integrability by mapping the two components
of the gradient image back to a single image L. Differentiating L will clearly result in an
irrotational gradient field.
In the context of shape-from-shading, Frankot and Chellappa [3] enforce integrability
of the gradient image (p, q) by projecting in the Fourier domain onto an integrable set.
This turns out to be equivalent to taking another derivative of (p, q) and assuming the
resulting sum equals the Laplacian of z [13]. For the lightness problem, then, integrating
by forming the Laplacian and inverting is a method of enforcing integrability of T V I . The
most efficient method for inverting the Laplacian is integration in the Fourier domain,
as set forward in [13].
While these methods of projecting T V I onto an integrable vector field generate the
optimal result in the sense that it is closest to the non-integrable original, in the case of
the thresholded shading gradient "closest" is not necessarily best. For example, consider
the luminance edge associated with the color change shown in Fig. 3(a) and its gradient
image Fig. 3(b) (using one dimension for illustrative purposes). Since the chromaticity
edge is not a perfect step, we can expect thresholding to eliminate only part of the edge
as shown in Fig. 3(c). The projection method of integration by forming a Laplacian and
inverting uses the integrable gradient that is best in the sense that it is closest to Fig. 3(c).
Fig. 3(d) shows the result after integration. The problem is that while the gradient of
Fig. 3(d) may be curl-free, a lot of the edge that should have been screened out remains.
We would prefer a method that enforces integrability while also removing more of the
unwanted edge.
3.1 Curl-Correction
If the thresholding step had succeeded in zeroing the entire gradient at the edge, then
the resulting image would have had zero curl. To accomplish the dual goals of creating
an integrable field and of eliminating the edge, we propose thresholding out the gradient
wherever the curl is non-zero. This must be done iteratively, since further thresholding
may itself generate more pixels with non-zero curl. Iteration continues until the maximum
curl has become acceptably small. Since the portion of the edge that was missed by the
initial thresholding created the curl problem, the thresholded region will expand until
the whole edge has been removed.
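The iterative scheme just described can be sketched in a few lines. This is an illustrative implementation, not the paper's code: we zero the gradient in a small neighbourhood of every pixel whose curl exceeds the target (so that the curl there is guaranteed to vanish on the next pass), and repeat until the maximum curl is acceptably small.

```python
import numpy as np
from scipy.ndimage import binary_dilation

def curl(p, q):
    """Discrete curl dq/dx - dp/dy via central differences."""
    return np.gradient(q, axis=1) - np.gradient(p, axis=0)

def curl_correct(p, q, frac=0.6, max_iter=100):
    """Iteratively zero the gradient around high-curl pixels until the maximum
    curl falls below `frac` of its initial maximum."""
    p, q = p.copy(), q.copy()
    target = frac * np.abs(curl(p, q)).max()
    for _ in range(max_iter):
        c = np.abs(curl(p, q))
        if c.max() <= target:
            break
        # Dilate the bad set: zeroing a pixel's neighbours kills its curl, but
        # may create new non-zero curl nearby, so the region expands.
        bad = binary_dilation(c > target)
        p[bad] = 0.0
        q[bad] = 0.0
    return p, q
```

As in the text, the thresholded region grows outward until the offending remnant of the edge has been removed entirely.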
An alternative curl-correction scheme is to distribute the contributions of the x and
y partial derivatives of the gradient that make the curl non-zero among the pixel and its
neighboring pixels so that the result has zero curl. As an example of this type of scheme
one can determine which part of the curl, the x derivative of the y-gradient or the y
derivative of the x-gradient, is larger in absolute value. Then the larger part is made
equal to the other by adjusting the larger gradient value contributing to the curl. Tests
with this method did not show that it worked any better than simply zeroing the
gradient. Although it might work better in some other context, all results reported in
this paper simply zero the gradient.
3.2 Boundary Conditions
Blake imposes Neumann boundary conditions in which the derivative at the boundary
is specified. The process of integration by Fourier expansion is simplified slightly if,
instead, Dirichlet boundary conditions are used, in which the values at the border are
fixed. Surrounding the image with a black border accomplishes this.
In the case of lightness, the Dirichlet conditions will not work because the intensity
variation removed via thresholding does not balance out from one edge of the image to
another. For shading recovery, however, as long as the color changes are contained within
the image, what is thresholded does balance across the image. Color changes are generally
completely contained within the image as, for example, with colored letters on a colored
background. For convenience, our current implementation uses Dirichlet conditions, but
could straightforwardly be changed to Neumann conditions if necessary.
3.3 Algorithm
To summarize the above discussion, the shading-recovery algorithm is as follows:
1. Find color edges.
4 Results
The luminance image derived from the cereal-box color image of Fig. 1 is shown in
Fig. 5 (a). The corresponding chromaticity images r, g (scaled) are Figs. 5 (b,c). Applying the gradient operator to these chromaticity images and thresholding at 40% of the
gradient yields the initial threshold image in Fig. 5 (d). Reducing the maximum curl to
60% of its original maximum via curl correction generates the extended threshold image,
Fig. 5 (e). The number of curl-correction iterations required was 5. The recovered shading
image is Fig. 5 (f) with the difference between figures (a) and (f) shown in Fig. 5 (g).
In order to compare the algorithm's performance with "ground truth," we also con-
sidered the image of Fig. 6 (a), which was created by Lambertian shading of a laser
range-finder depth map of a plaster bust of Mozart. 2 Fig. 6 (b) overlays this shading
field with color by multiplication with the colors measured in the cereal box image. Thus,
both the shading and the color edges come from natural objects, but in a controlled fash-
ion, so the result is a synthetic image constructed from real shapes and colors including
noise. The image shown is actually the luminance image derived from the color image.
To take into account the color and not the shape of the box, the colors were extracted from
the chromaticity images of Figs. 5 (b,c), with the b image formed as 1 - r - g, rather than
2 The laser range data for the bust of Mozart is due to Fridtjof Stein of the USC Institute for
Robotics and Intelligent Systems.
using the RGB directly. So that the intensity image would not simply equal the original
shading image, the r, g, b components were multiplied by unequal amounts.
The chromaticity images of the input color image are thus precisely those of Figs. 5
(b,c) (because Fig. 6 (a) contains no pixels that are exactly zero). The initial chromaticity-
derived threshold function, with a threshold level of 30%, is shown in Fig. 6 (c). Requiring
the maximum curl value to be reduced to 60% of its original value took a single iteration--
lowering the initial threshold further, from 40% in Fig. 5 to 30% substantially speeds up
the curl-correction step.
Curl-correction extends the threshold function as in Fig. 6 (d). Applying the algorithm to the luminance image Fig. 6 (b) results in Fig. 6 (e). Compared to Fig. 6 (a),
the shading is recovered well, in that the difference between figures (a) and (e) is negligible.
5 Assumptions and Limitations
Stated explicitly, the assumptions and limitations of the algorithm are:
- Color edges must not coincide with shading edges.
- All color edges must involve a change of hue/chromaticity, not just brightness (e.g.
not orange to dark orange, or perfect gray to another shade of perfect gray).
- Surfaces are Lambertian reflectors. Strong specularities will be mistaken for re-
flectance changes, while weak specular components will be attributed to shading.
- The spectral power distribution of the illumination should be constant, but of course
its intensity can vary. Gradual changes will be attributed to the shading to the
extent that they affect the luminance image. Abrupt changes in intensity are allowed
and will be correctly attributed to shading because they will not cause an abrupt
chromaticity change. This is unlike retinex algorithms, which will be fooled by sharp
intensity changes because they treat each color channel separately.
- The shading is recovered up to an overall multiplicative scaling constant.
6 Conclusions
Color creates problems for shape-from-shading algorithms which assume that surfaces
are of constant albedo. We have implemented and tested on real images an algorithm
that recovers shading fields from color images which are equivalent to what they would
have been had the surfaces been all one color. It uses chromaticity to separate the surface
reflectance from surface shading and involves thresholding the gradient of the logarithm
of the image luminance. The resulting Poisson equation is inverted by the direct, Fourier
transform method.
References
1. A. Blake. Boundary conditions for lightness computation in Mondrian world. Computer
Vision, Graphics, and Image Processing, 32:314-327, 1985.
2. P. T. Eliason, L. A. Soderblom, and P. S. Chavez Jr. Extraction of topographic and spectral
albedo information from multispectral images. Photogrammetric Engineering and Remote
Sensing, 48:1571-1579, 1981.
3. R. T. Frankot and R. Chellappa. A method for enforcing integrability in shape from shading
algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence, 10:439--451,
1988.
4. B.V. Funt and M.S. Drew. Color constancy computation in near-Mondrian scenes using a
finite dimensional linear model. In Computer Vision and Pattern Recognition Proceedings,
pages 544-549. IEEE Computer Society, June 1988.
5. R. C. Gonzalez and P. Wintz. Digital Image Processing. Addison/Wesley, 2nd edition,
1987.
6. G. Healey. Using color for geometry-insensitive segmentation. J. Opt. Soc. Am. A, 6:920-
937, 1989.
Figure 1. (a) Black and white photograph of a color image of a cereal box--blue background
with yellow lettering. (b) Recovered depth image using shape from shading algorithm of [11].
Figure 2. One-dimensional case: (a) Initial depth map. (b) Shading field for (gray) Lambertian
surface illuminated by a point source from upper left. (c) Same surface with colored stripes
(red--solid, green--dotted, blue--dashed). (d) Colors in image (c) multiplied by shading field
of image (b). (e) Chromaticities formed from observed camera values. (f) Shading field recovered
by algorithm.
Figure 3. Thresholding: (a) Smooth step. (b) Derivative. (c) Thresholded derivative. (d) Integra-
tion of (c)--in 2 dimensions would be integration of gradient images under integrable projection.
1 Introduction
The shape from shading problem is classical in vision; E. Mach (1866) was perhaps the
first to formulate a formal relationship between image [1] and scene domains, and to
capture their inter-relationships in a partial differential equation. Horn set the modern
approach by focusing on the solution of such equations by classical and numerical tech-
niques [2, 3, 4], and others have built upon it [5, 6]. Nevertheless, problems remain which
are not naturally treated in the classical sense, especially those related to discontinuities
and shadows. We present a new approach to the shape from shading problem motivated
by modern notions of fibre bundles in differential geometry [9]. The global shape from
shading problem is posed as a coupled collection of "local" problems, each of which attempts to find that local scene element (or scenel³) that captures the local photometry,
and which are then coupled together to form global piecewise smooth solutions.
The paper is written in a discursive style to convey a sense of the "picture" behind our approach rather than the formal treatment. We begin with an overview of the
classical formulation, then proceed to define the structure of a "scenel" and our new
conceptualization.
1.1 The "Classical" Shape from Shading Problem
We take the classical setting in computer vision for shape from shading to be the following:
a point light source at infinity uniformly illuminates a smooth matte surface of constant
albedo whose image is formed by orthographic projection.
The matte surface is traditionally modeled with Lambert's reflectance function, so the
image irradiance equation is

I(x, y) = ρλ L · N(x, y)     (1)
³ cf. pixel, voxel, ..., scenel.
where I(x, y) is the intensity of an image point (x, y); ρ, the albedo of the surface, i.e.
the fraction of the incident light which is reflected; λ, the illumination, i.e. the amount
of incident light; L, the light source direction; and N(x, y), the normal at the surface point
corresponding to an image point (x, y).
The literature cited above describes the various attempts to solve this (or a closely
related) problem⁴ from first principles. We emphasize, however, that, to make these
approaches tractable, certain parameters are assumed known (e.g. typically ρ, λ and L).
Operationally this decouples problems; e.g., it decouples the shape from shading problem
from light source estimation problems [10].
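The irradiance equation above can be exercised numerically. The sketch below renders a Lambertian sphere patch under a single distant source; the sphere, the source direction, and the constants ρ and λ are example choices of ours, not values from the paper.

```python
import numpy as np

# Render I(x, y) = rho * lam * max(L . N(x, y), 0) for a Lambertian hemisphere.
rho, lam = 0.8, 1.0                       # albedo and illumination (example values)
L = np.array([0.0, 0.0, 1.0])             # frontal distant point source

y, x = np.mgrid[-1:1:64j, -1:1:64j]
r2 = x**2 + y**2
z = np.sqrt(np.clip(1.0 - r2, 0.0, None)) # visible hemisphere of a unit sphere
N = np.stack([x, y, z], axis=-1)          # unit surface normals on the sphere
I = rho * lam * np.clip(N @ L, 0.0, None) # irradiance; the clip is the attached shadow
I[r2 > 1] = 0.0                           # background
```

Note that the brightest pixel attains ρλ, which is why the paper can only recover shading up to an overall multiplicative constant.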
1.2 Piecewise Smooth Shape from Shading Problems
We submit that such decoupling, while appropriate for certain highly engineered situa-
tions, is not always necessary; moreover, it can make shading analysis impotent precisely
when it should be useful. For example, a human observer confronted with a static, monoc-
ular view of a scene will succeed in obtaining some estimate of the shapes of the surfaces
within it even when some of the classical setting's constraints are relaxed. The presence
of a shadow, a diffuse light source, or even a patterned surface does not necessarily in-
terfere with our ability to recover shape from shading. Thus the classical constraints can
be relaxed in principle; but how far, and once relaxed by what mechanism can solutions
be found? These are precisely the questions with which we shall be concerned.
We retain the basic assumption that smooth variation in intensity is entirely due to
smooth variation in surface orientation. Under this assumption alone one cannot distinguish between a smoothly painted surface
(where the albedo varies continuously), a projected slide (where the illumination on the
screen varies continuously) and the scene itself.
The key idea underlying our approach is to consider the shape from shading problem as
a coupled family of local problems. Each of these is a "micro"-version of the shape from
shading problem in which a lighting and surface model interact to produce the locally
observed shading structure. We call each of these different models a scene element, or
scenel, and, since many different scenels may be consistent with the local image structure,
utilize fibre bundles to provide a framework to couple them together.
Fig. 1. Depiction of an abstract scene element, or scenel, corresponding to an image patch (A).
The scenel (B) consists of a surface patch, described by its image coordinates, surface normal,
and curvature. Its material properties (albedo) are also represented. Finally, a virtual light source
completes the photometry.
Fig. 2. Depiction of a scenel bundle over an image. At each point in the image there are many
possible scene elements, or scenels. Each of these scenels is depicted along a fibre, or vertical space
above each image coordinate. The union of scenel fibres over the entire image is called a scenel
bundle. The shape from shading problem is formulated as determining sections through the
scenel bundle. Such a section is depicted by the shaded scenels, and represents a horizontal slice
across the bundle. Scenel participation in a horizontal section is governed by surface smoothness
and material and light source constancy constraints.
We define what is meant by consistency shortly; first, we introduce the shading flow
field as our initial data.
Observe that a sensitivity issue arises in the scenel framework; spatial quantization of
the image induces a quantization of the scene domain. Analogously to the manner in
which integer solutions are not always possible for algebraic equations, we begin with
"quantized" initial data as well. In particular, we derive our initial estimates from the
shading/low field instead of directly from the intensity image. This field is the first order
differential structure of the intensity image expressed as the isoluminance direction and
gradient magnitude (Fig. 3(a)); we supplement it with the intensity "edge" image (Fig.
3(b)). We suggest that dealing with uncertainties at the level of the shading flow field
will expose more of the natural spatial consistency of the intensity variation, and will
thus lead to more robust processing than the raw intensities. The shading flow field ideas
are related to Koenderink's isophotes [14].
Traditionally, the gradient of an image is computed by estimating directional derivatives by

∂I/∂x = I * G′σ(x) Gσ(y),    ∂I/∂y = I * Gσ(x) G′σ(y),

where Gσ is a Gaussian of standard deviation σ and * denotes convolution.
The gradient estimate follows immediately, and the isoluminance direction is simply
perpendicular to the gradient. The one limitation of this approach is that (depending on
the magnitude of σ) it always infers a smooth gradient field, even when the underlying
image is not continuous. We have investigated methods of obtaining stable, discontinuous
shading flow fields using logical/linear operators [15, 16].
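The derivative-of-Gaussian estimate and the resulting shading flow field can be sketched with standard tools; this uses scipy's `gaussian_filter`, whose `order` argument selects Gaussian smoothing (0) or a Gaussian derivative (1) along each axis. The function name and the angle convention are ours.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def shading_flow(I, sigma=2.0):
    """Shading flow field: isoluminance direction and gradient magnitude.

    dI/dx and dI/dy are estimated by derivative-of-Gaussian filtering
    (order=1 along one axis, plain Gaussian smoothing along the other)."""
    Ix = gaussian_filter(I, sigma, order=(0, 1))   # d/dx (axis 1)
    Iy = gaussian_filter(I, sigma, order=(1, 0))   # d/dy (axis 0)
    mag = np.hypot(Ix, Iy)                         # gradient magnitude
    theta = np.arctan2(Iy, Ix) + np.pi / 2         # isoluminance direction,
    return theta, mag                              # perpendicular to the gradient
```

As the text notes, any single σ smooths across genuine discontinuities; this sketch does not implement the logical/linear operators of [15, 16].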
Our motivation for starting from the shading flow field is also biological. We take
shading analysis to be an inherently geometric process, and hence handled within the
same cortical systems that provide orientation selection and texture flow analysis. Shading flow is simply a natural extension.
The coupling between the local scenel problems dictates a consistency relationship over
them, which derives from three principal considerations:
1. A SURFACE SMOOTHNESS CONSTRAINT, which states that the surface normal and
curvatures must vary according to a Lipschitz condition between pairs of scenels
which project to neighbouring points in the image domain. This notion is subtle
to implement, because it involves comparison of normal vectors following parallel
transport to the proper position (see [8]).
2. A SURFACE MATERIAL CONSTRAINT, which states that the surface material (albedo
and reflectance) is constant between pairs of scenels which project to neighbouring
points in the image domain.
3. A LIGHT SOURCE CONSTRAINT, which states that the virtual light source is constant
for pairs of scenels which project to neighbouring points in the image domain.
Fig. 3. Typical shading flow (a) and edge (b) fields. The left shading flow field depicts the first
order differential structure of the image intensities, and the right is an edge map. Our shading
analysis is based on these data, and not on the raw image intensities. This is to focus on the
geometry of shape from shading analysis, and perhaps to capture something implicit in the
biological approach.
Viewed globally, the solution we seek consists of sections in which a single (equivalent)
light source illuminates a collection of surface patches with constant material properties
but whose shape properties vary smoothly. The above constraints are embedded into a
functional, and consistent sections through the scenel bundle are stationary points of
this functional. More specifically, the constraints are expressed as compatibility relation-
ships between pairs of neighbouring estimates within a relaxation labelling process. As
background, we next sketch the framework of relaxation labelling.
2.3 Introduction to Relaxation Labelling
In relaxation labelling, each node i carries a confidence p_i(λ) over its candidate labels λ, and the support for label λ at node i is

s_i(λ) = Σ_{j∈N(i)} Σ_{λ'∈Λ_j} r_ij(λ, λ') p_j(λ'),

where N(i) are the neighbours of i and r_ij are the pairwise compatibilities.
The final labelling is selected such that it maximizes the average local support.
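Since the section only sketches relaxation labelling, here is a generic update in the classic Rosenfeld-Hummel-Zucker style for concreteness; it is not necessarily the exact rule used in the paper, and all names are ours.

```python
import numpy as np

def relax(p, r, n_iter=50):
    """Generic relaxation-labelling sketch.

    p: (n, L) confidence of label l at node i (rows sum to 1)
    r: (n, n, L, L) pairwise compatibilities r_ij(l, l') in [-1, 1]
    Support: s_i(l) = sum_j sum_l' r_ij(l, l') p_j(l')."""
    for _ in range(n_iter):
        s = np.einsum('ijkl,jl->ik', r, p)   # support for each (node, label)
        p = p * (1.0 + s)                    # reinforce well-supported labels
        p = np.clip(p, 1e-12, None)
        p /= p.sum(axis=1, keepdims=True)    # renormalize over labels
    return p
```

With compatibilities that reward agreeing labels, the confidences sharpen toward a consistent labelling, which is the behaviour the scenel computation relies on.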
2.4 Overview of the Paper
The paper is organized as follows from this point on. We first formally define a scene
element. We then show how to find those scene elements that are consistent with the
shading flow field, and solve the forward problem of calculating the shading flow field
expected for each scene element. The remainder of the paper is concerned with the
inverse problems of inferring those scenels that are consistent with the shading flow
field. The subtle interaction between expressing the scenel variables and the relaxation
compatibilities is described, and scene element ambiguity addressed. An advantage of
our technique is that, since both surface and lighting geometry are estimated, different
types of shadow and illumination discontinuities can be handled; these are discussed in
Sect. 5.
We adopt a coarse coding of surface and light source attributes by quantizing the range
of values of each attribute. Each scene element is defined by an assignment of a value
to each of the scene attributes. The set of scene elements is viewed as a set of existence
hypotheses of a surface patch of a fixed shape and orientation at a fixed image position,
illuminated from a fixed direction with a fixed product of albedo and illumination.
3.1 The Scene Element
The attributes we consider are
1. IMAGE POSITION: The image pixels are themselves a set of discrete values of x and
y position.
2. VIRTUAL ILLUMINANT DIRECTION: We take the light sources in the scene as a set of
M distant point sources⁶, {λ^(i) L^(i) : 1 ≤ i ≤ M}, where λ^(i) is the intensity and L^(i)
is a unit vector. Let V_i(x, y) be a binary "view" function such that V_i(x, y) = 1 if
and only if light source L^(i) directly illuminates the surface element corresponding
to pixel (x, y).
We define the virtual point source at (x, y) by the two attributes λ and L such that

λL(x, y) = Σ_{i=1}^{M} V_i(x, y) λ^(i) L^(i) .     (2)

The possible virtual light source directions map onto a unit sphere. We sample this
sphere as uniformly as possible to get a discrete set of virtual illuminant directions.
In viewer-centered coordinates, this unit vector is given as
L = (L_x, L_y, L_z) .
3. MATERIAL PROPERTIES, or the product ρλ: We need only consider the product of
the albedo and the illuminance (see (1)). Usually the imaging process normalizes to
some maximum value, so we can assume a range between zero and one, and discretely
sample this range.
4. SURFACE SHAPE DESCRIPTORS: The two principal curvatures (κ1, κ2) describe the
shape up to rotation. Two angles, slant σ and tilt τ, are needed to describe the surface
tangent plane orientation with respect to the viewer's coordinate frame. An additional
angle φ is needed to describe the principal direction of the Darboux frame in the
surface tangent plane.
(a) The two angles needed to orient the surface tangent plane in space describe the
surface normal

N = (N_x, N_y, N_z) = (cos τ sin σ, sin τ sin σ, cos σ) .
The set of all such normals forms a unit sphere. The surface of this sphere is
sampled as uniformly as possible to derive a discrete set of normals. Of these, only
the ones in the hemisphere facing the viewer are used; the others are not visible;
see Fig. 4.
Fig. 4. The shape of surface patches is represented as a function of the principal curvatures
mapped onto an abstract sphere through "curveness" and "shape index" measures (see text).
Nearby positions on the sphere indicate smooth changes in either the shape or orientation of a
scenel. This representation on the sphere facilitates the definition of scenel compatibilities later
in the text.
(b) The two principal curvatures (κ1, κ2) are mapped into a curveness measure

c = max(|κ1|, |κ2|)

and a shape index

s = cos⁻¹((κ1 + κ2) / (2c)) .
These are analogous to Koenderink's curveness and shape index [21], with the
choice of norm for the curveness and the spreading function for the shape index
modified slightly.
The angles 2φ and s are, respectively, the longitude and latitude of a spherical
coordinate system covering shape variation in the tangent plane. As we stated
previously, the angle φ represents the principal direction, while s values 0 and π
represent umbilic surfaces where principal directions are not defined. Any smooth
curve on the (2φ, s) sphere represents a smooth deformation or rotation of the
surface. We define a unit vector K as follows:

K = (cos 2φ sin s, sin 2φ sin s, cos s) .
Thus by sampling the surface of the sphere uniformly we derive a discrete set of
parameters which cover all variations of smooth, oriented shape in the tangent
plane. Augmenting this with the curveness index provides a complete, discretely
sampled shape descriptor.
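The curveness/shape-sphere mapping above can be sketched as follows. The formulas follow our reading of the (partly garbled) equations in the text, so treat the exact normalization as an assumption; the function name is ours.

```python
import numpy as np

def shape_descriptor(k1, k2, phi):
    """Map principal curvatures (k1, k2) and principal direction phi to the
    curveness/shape-sphere representation sketched in the text:

    c = max(|k1|, |k2|),  s = arccos((k1 + k2) / (2c)),
    K = unit vector with longitude 2*phi and latitude s."""
    c = max(abs(k1), abs(k2))
    s = np.arccos(np.clip((k1 + k2) / (2.0 * c), -1.0, 1.0))
    K = np.array([np.cos(2 * phi) * np.sin(s),
                  np.sin(2 * phi) * np.sin(s),
                  np.cos(s)])
    return c, s, K
```

Note that convex umbilics (k1 = k2 > 0) land at s = 0 and concave umbilics at s = π, the two poles where, as the text says, principal directions are undefined.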
Given the discrete sampling of the scene attributes as defined above, we derive a set
of scenel labels

Z = {x, y, ρλ, L, N, c, K}

which represent all potential assignments of these scene attributes. Thus each i represents
the hypothesis that the scene can be locally described by the scenel (x_i, y_i, (ρλ)_i, L_i, N_i,
c_i, K_i). To relate this to the relaxation labelling paradigm [17], we distribute a measure
p_i over each scenel i representing confirmation of the hypothesis. The first step is to
obtain an initial estimate for the confidence measure p_i from the shading flow field. The
second is to extract locally consistent sets of hypotheses by relaxation labelling. The
third step is to prune these sets by imposing appropriate boundary conditions. This is
a large scale parallel computation, and we are currently implementing it on a massively
parallel machine (MasPar MP-1).
3.2 The Expected Shading Flow Field
For each scenel i, we need an initial estimate for the associated weight p_i. We take this
weight to reflect the match between the local properties of the shading flow field and the
EXPECTED VALUES for the scene element i.
The EXPECTED SHADING FLOW FIELD is obtained by computing the light intensity
gradient for a surface locally described by the scenel i. The paraboloid is an arbitrarily
curved surface with the local parametric form

(u, v, w)   where   w = (κ1 u² + κ2 v²) / 2 .
The u and v axes correspond to the two principal directions. The normal to the surface
at a point (u, v) is given by

N(u, v) = (N_u(u, v), N_v(u, v), N_w(u, v)) = (−κ1 u, −κ2 v, 1) / (κ1² u² + κ2² v² + 1)^(1/2) .
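The paraboloid normal can be checked numerically; a short sketch (names ours), which verifies that the normal is orthogonal to both coordinate tangent vectors of the patch:

```python
import numpy as np

def paraboloid_normal(u, v, k1, k2):
    """Unit normal of w = (k1*u**2 + k2*v**2) / 2 at (u, v):
    N = (-k1*u, -k2*v, 1) / sqrt(k1^2 u^2 + k2^2 v^2 + 1)."""
    n = np.array([-k1 * u, -k2 * v, 1.0])
    return n / np.linalg.norm(n)
```

For example, at (u, v) = (0, 0) the normal is (0, 0, 1), as expected for the osculating point, and elsewhere N is perpendicular to the tangents (1, 0, κ1 u) and (0, 1, κ2 v).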
Two rotations relate the viewer's coordinate frame to the paraboloid local frame:
the first rotation takes care of the surface orientation and the second, of the principal
directions. In matrix form, M = M_1 M_2.
3.3 Scene Element Ambiguity
It should be clear that a local shading flow field will not select a unique scenel. In general,
for an arbitrary shading flow field, several scene elements will be assigned a significant
weight for each image position. Identical flow fields can be generated by surfaces of
different shapes because of
- Intrinsic ambiguity. For example, the cases of convex (κ1, κ2), concave (−κ1, −κ2),
hyperbolic (−κ1, κ2) or (κ1, −κ2) surfaces facing the viewer and the light source all
have the same intensity profile.
- Accidental correspondence between light source and surface orientations. For example, a surface with κ1 = κ2 will generate concentric circular isoluminance lines in
the plane facing the light source. If the curvature is small, the isoluminance lines
projected in the image plane form concentric ellipses. Such isoluminance lines can
also be seen when an elliptic surface faces the viewer and the light source.
The example of elliptic isoluminance lines is particularly revealing. Such lines could be
due to the shading of a spherical patch directly facing the light source but slanted away
from the viewer; or it could be due to the shading of an elliptical (convex or concave)
patch directly facing the light source and the viewer. Here, two phenomena are closely
coupled: the formation of the isoluminance lines on the surface (related to the slant of the
surface with respect to the light source) and the projection on the image plane (related
to the slant of the surface with respect to the viewer).
However, these local ambiguities are not always accompanied by global ambiguities.
Although, as Mach observed in 1866, "many curved surfaces may correspond to one light
surface even if they are illuminated in the same manner" [1], these global ambiguities are
often finite [18]. In Sect. 4, we propose a relaxation labelling process to disambiguate
the local surface geometry.
Recall that we are imposing three constraints on our surfaces: locally constant albedo,
locally constant lighting conditions, and locally smooth geometry. For the relaxation
labelling process, these translate into the following:
- A scenel j is compatible with the scenel i if they have the same constant ρλ and
the same virtual illuminance direction L, and if scenel j's surface descriptors fall on
scenel i's extrapolated surface at the corresponding relative position.
- A scenel j is incompatible with the scenel i if they have the same constant ρλ and the
same virtual illuminance direction, and if there exists another scenel j′, neighbouring
scenel j along the fibre, that better fits the extrapolated surface from scenel i than
scenel j does. Observe that this incompatibility serves to localize information along each
fibre.
- Otherwise, a scenel j is unrelated to the scenel i.
Using these guiding principles, we assign a value to the compatibility r_ij between two
scenels i and j. This compatibility will be positive for compatible hypotheses, negative
for incompatible hypotheses, and zero otherwise. In general, r_ij is assumed to vary
smoothly between nearby points in the parameter space Z. The process is
illustrated in Fig. 5.
Fig. 5. Illustration of the compatibility relationship for scenel consistency. Two scenels are
shown on the fibre at image location (x′, y′), and are evaluated against the scenel (i) at (x, y).
The surface represented in scenel_{x,y} is modeled by the osculating paraboloid, and extended to
(x′, y′). It is now clear that one scenel (j′) at (x′, y′) is consistent, because its surface patch
lies on this paraboloid and its light source and albedo agree. The other scenel (j) is inconsistent,
because its surface does not match the extended paraboloid. Such osculating paraboloids are
used to simulate the parallel transport of scenel_{x′,y′} onto scenel_{x,y}.
Consider the unique paraboloid S_i(u, v) such that the neighbourhood of S_i(0, 0) is
described by the surface parameters of scenel i. The compatibility of a scenel j with i (r_ij)
is then defined in terms of the relationship between scenel j and the point S_i(u, v) ∈ S_i
which is "closest to" scenel j. This operation is equivalent to the minimization of the
distance measure
(x_j − x*(u, v))² + (y_j − y*(u, v))² + (cos⁻¹(N_j · N*(u, v)) / Δ_N)²
+ (cos⁻¹(K_j · K*(u, v)) / Δ_K)² + ((c_j − c*(u, v)) / Δ_c)²

over (u, v). Here, Δ_N, Δ_K, Δ_c are the distances between neighbouring scenels for the
given attribute and (x*, y*, N*, K*, c*) are the surface descriptors for the point S_i(u, v).
So, by physical consideration, the compatibility between scenel i and scenel j can
be expressed in terms of this minimal distance.
5 Discontinuities
We have described a framework for computing sections of the scenel fibre bundle that
are constrained to have a constant virtual light source, constant albedo, and smooth
surface geometry. Any discontinuity in these attributes will demarcate a section boundary.
Discontinuities can arise in many ways: the albedo can change suddenly along a smooth
surface; a shadow can be cast across a smooth surface; the surface normal can change
abruptly along a contour. In all three of these cases there will be a discontinuity in the
image intensity. In this section, we address the question of how scenel discontinuities are
manifested as discontinuities in the image. We restrict our attention to two types of image
discontinuities: intensity discontinuities (edges) and shading flow discontinuities.
5.1 Illumination Discontinuities (Shadows)
Shadows are produced by variations in the virtual point source along a single surface
patch (recall Sect. 3.1 for the definition of λL). These variations are the direct result of
a discontinuity in some V_i(x, y). We say that there is a shadow boundary along this
discontinuity. The image intensity cannot satisfy our model (1) in the neighbourhood of
a shadow boundary since the image intensity on each side of the shadow boundary is
due to a different virtual point source. We now briefly examine the two types of shadow
boundary.
In general, an attached shadow boundary lies between two nearby pixels (x0, y0) and
(x1, y1) when for some point source L^(j) it is the case that N(x0, y0) · L^(j) > 0 and
N(x1, y1) · L^(j) < 0.
The image intensity is continuous across the attached shadow boundary even though
V_j is discontinuous, since N · L^(j) = 0 at the boundary. However, the shading flow is
typically not continuous at the attached shadow boundary. The shading flow due to the
source L^(j) will be parallel to the shadow boundary on the side where V_j = 1, but it will
be zero on the side where V_j = 0. The shading flow due to the rest of the light sources
will be smooth across the boundary but this shading flow will not typically be parallel
to the boundary. Hence the sum of the two shading flows will typically be discontinuous
at the boundary.
A cast shadow boundary is produced between two nearby points (x0, y0) and (x1, y1)
when for some point source L^(j) it is the case that both N(x0, y0) · L^(j) > 0 and
N(x1, y1) · L^(j) > 0, and either V_j(x0, y0) = 0 or V_j(x1, y1) = 0 (but not both).
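The two boundary definitions above can be collected into a small predicate; a sketch (function name and the handling of the degenerate N · L = 0 case are ours):

```python
import numpy as np

def shadow_boundary(N0, N1, L, V0, V1):
    """Classify the boundary between two neighbouring pixels for source L,
    per the definitions in the text: 'attached', 'cast', or None.

    N0, N1: unit surface normals; V0, V1: binary view-function values."""
    a, b = np.dot(N0, L), np.dot(N1, L)
    if a > 0 > b or b > 0 > a:
        return 'attached'          # N . L changes sign across the boundary
    if a > 0 and b > 0 and (V0 == 0) != (V1 == 0):
        return 'cast'              # both lit geometrically, exactly one occluded
    return None
```

The `(V0 == 0) != (V1 == 0)` test encodes the "but not both" clause of the cast-shadow definition.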
Examining (2), we note that for cast shadows, the discontinuity in Vj results in a
discontinuity in image intensity, since the N 9 L(j) > 0. Furthermore, there is typically a
148
Fig. 6. An illustration of shadow boundaries and how they interact with flow fields. In (a) a
shadow is cast across the hood of a car. In (b) we show a subimage of the cast shadow on the
left fender, and in (c) we show the shading flow field (represented as a direction field with no
arrowheads) and the intensity edges (represented as short arrows; the dark side of the edge is
to the left of the arrow). Observe how the shading field remains continuous (in fact, virtually
constant) across the intensity edge. This holds because the surface is cylindrical, and the shading
flow field is parallel to the axis of the cylinder.
discontinuity in the shading flow since the virtual light source defined on the side of the
boundary where Vj = 0 will usually produce a different shading flow across the boundary
than is produced by L(j) on the side of the boundary where Vj = 1.
In the special case of parabolic (e.g. cylindrical) surfaces, the shading flow remains
continuous across both cast and attached shadow boundaries because the flow is parallel
to the axis of the cylinder. Note, however, that the attached shadow is necessarily parallel
to the shading flow field. This case is illustrated in Fig. 6.
To summarize, the image intensity in the neighbourhood of a pixel (x0, y0) can be
modelled using a single virtual point source as long as there are neither attached nor cast
shadow boundaries in that neighbourhood. Attached shadow boundaries produce con-
tinuous image intensities, but discontinuities in the shading flow. Cast shadows produce
both intensity discontinuities and shading flow discontinuities.
5.2 Geometric Discontinuities
There are two different ways that the geometry of the scene can produce discontinuities
in the image. There can be a discontinuity in N along a continuous surface, or there can
be a discontinuity in the surface itself when one surface occludes another. In the latter
case, even if there is no discontinuity in the virtual light source direction there will still
typically be a discontinuity in N which will usually result in both discontinuities in the
image intensity and in the shading flow.
5.3 Material Discontinuities
If there is a discontinuity in the albedo along a smooth surface, then there will be a
discontinuity in luminance across this material boundary. However, the shading flow will
not vary across the boundary: the magnitude of the luminance gradient will change, but
its direction will not.
5.4 Summary of Discontinuities
In summary, shading flow discontinuities which are not accompanied by intensity discon-
tinuities usually indicate attached shadows on a smooth surface. Intensity discontinuities
which are not accompanied by shading flow discontinuities usually indicate material
changes on a smooth surface. The presence of both types of image discontinuities indi-
cates that either there is a cast shadow on a smooth surface, or that there is a geometric
discontinuity.
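The classification rule in this summary can be written as a small decision procedure. The following sketch is purely illustrative (the function and label names are mine, not from the paper); it assumes the two cues have already been measured at a location.

```python
def classify_discontinuity(intensity_jump: bool, flow_jump: bool) -> str:
    """Classify an image location from the two discontinuity cues above."""
    if flow_jump and not intensity_jump:
        return "attached shadow"          # continuous intensity, flow breaks
    if intensity_jump and not flow_jump:
        return "material change"          # albedo step on a smooth surface
    if intensity_jump and flow_jump:
        return "cast shadow or geometric discontinuity"
    return "smooth"                       # single virtual source model holds
```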
6 Conclusions
We have proposed a new solution to the shape from shading problem based on notions
from modern differential geometry. It differs from the classical approach in that light
source and surface material consistency are solved for concurrently with shape proper-
ties, rather than independently. This has important implications for understanding light
source and surface interactions, e.g., shadows, both cast and attached, and an example
illustrating a cast shadow is included.
The approach is based on the notion of a scenel, or unit scene element. This is defined
to abstract the local photometry of a scene configuration, in which a single (virtual) light
source illuminates a patch of surface. Since the image irradiance equation typically admits
many solutions, each patch of the image gives rise to a collection of scenels. These are
organized into a fibre space at that point, and the collection of scenel fibres is called the
scenel bundle. Algebraic and topological properties of the scenel bundle will be developed
in a subsequent paper.
The solution of the shape from shading problem thus reduces to finding sections
through the scenel bundle, and these sections are defined by material, light source, and
surface shape consistency relationships. The framework thus provides a unification of
these different aspects of photometry, and should be sufficiently powerful to indicate the
limitations of unification as well.
Texture: Plus ça change, ...*
Margaret M. Fleck
Department of Computer Science, University of Iowa, Iowa City, IA 52242, USA
1 Introduction
The input to an edge finding algorithm consists of a 2D array of values for one or more
properties, e.g. raw intensities, color, texture features (e.g. striping orientation), or stereo
disparities. Its goal is to model these property values as a set of underlying property
values, plus a pattern of fast variation in these values (e.g. camera noise, fine texture)
(Fig. 1). The underlying property values are reconstructed as varying "smoothly," i.e.
obeying bounds on their higher derivatives, except at a sparse set of locations. These
locations are the boundaries in the input.
Fig. 1. A sequence of input property values (left) is modelled as a sequence of underlying values
plus a pattern of fine variation (right).
Currently-available edge finders work robustly only when the fast variation has a
known distribution that is constant across the image. This assumption is roughly correct
for "blocks world" type images, in which all fast variation is due to camera noise, but it
* The research described in this paper was done at the Department of Engineering Science,
Oxford University. The author was supported by a Junior Research Fellowship funded by
British Petroleum.
fails when there is non-trivial surface texture within each region, because the amount of
fast variation depends on the particular surface being viewed. The amount of variation in
texture feature values (i.e. the amount of mismatch between the texture and the feature
model) also varies from surface to surface, as does the amount of error in stereo disparity
estimates.
There have been many previous attempts to extend edge finders to these more general
conditions, but none can produce output of the quality needed by later processing on the
wide range of inputs encountered in typical vision applications. Some make implausible
assumptions about their inputs: [2] assumes that each image contains only two textures,
[8] and [9] require that the number of textures in the image be known and small, [13]
and [15] provide their algorithms with training samples for all images present. Others
produce poor quality or blurred boundaries [2, 20, 31, 33] or seem difficult to extend to
2D [16, 17].
This paper presents a new algorithm which estimates the scale of variation, i.e. the
amplitude of the fine variation, within image regions. It depends on two key ideas:
1. Minimize the scale estimate over all neighborhoods of the target location, to prevent
corruption of scale estimates near boundaries, and
2. Use a robust estimate for each neighborhood, to prevent scale estimates from being
corrupted by outliers or boundary blur.
Given reliable scale estimates, underlying values can be reconstructed using a standard
iterative edge-preserving smoother. Boundaries are trivial to extract from its output. The
method extends easily to multi-dimensional inputs, such as color images (Fig. 2) and sets
of texture features (Fig. 3), producing good-quality preliminary output. 2 The iterative
smoother is also used to detect outliers, values which differ from those in all nearby
regions. Previous texture boundary finders have looked only for differences in average
value between adjacent regions. The phase change and contrast reversal images in Fig. 4
have traditionally proved difficult to segment [2, 20, 19] because the two regions have the
same average values for most proposed texture features. However, as Fig. 4 illustrates,
these boundaries show up clearly as lines of outliers.
The basic ideas behind the scale estimator are best presented in 1D. Consider estimating
the scale of variation for the slice shown in Fig. 1. Let Nw(x) be the neighborhood of
2w + 1 pixels centered about the location x. The most obvious estimate for the scale
at x is the standard deviation of the (2w + 1) values in Nw(x). The spatial scale of the
edge finder output is then determined by the choice of w: output boundaries can be no
closer than about 2w + 1. Unfortunately, if x lies near a boundary, Nw (x) will contain
values from both sides of the boundary, so the standard deviation computed for Nw(x)
will be far higher than the true scale of the fine variation. This will cause later processing
(iterative smoothing and boundary detection) to conclude that there is no significant
boundary near x.
Therefore, the scale estimate at x should be computed from some neighborhood Nw (y)
containing x that does not cross a boundary. Such a neighborhood must exist because, by
definition, output boundaries are not spaced closer than about 2w+1. Neighborhoods that
do not cross boundaries generate much lower scale estimates than neighborhoods which
2 See appendix for details of texture features.
153
Fig. 2. A 300 by 300 color image and boundaries extracted from it (w = 8). Log intensity is at
the top left, red vs. green at the top right, and blue vs. yellow at the bottom left.
cross boundaries. Therefore, we can obtain a scale estimate from a neighborhood entirely
within one region by taking the minimum scale estimate from all neighborhoods Nw(y)
which contain x (where w is held fixed and y is varied).3 Several authors [16, 17, 31]
use this minimization idea, but embedded in complex statistical tests. Choosing the
average value from the neighborhood with minimum scale [13, 24, 30] is not equivalent:
the minimum scale is well-defined, but the neighborhood with minimum scale is not.
Even the best neighborhood containing x may, however, be corrupted: it may overlap
the blur region of the boundary or it may contain extreme outlier values (e.g. spots,
highlights, stereo mismatches; see Fig. 1). Since these outliers can significantly inflate
the scale estimates, the standard deviation should be replaced by a method from robust
statistics [11, 10, 12, 26] which can ignore small numbers of outliers. Simple robust filters
(e.g. the median) have been used extensively in computer vision, and more sophisticated
methods have recently been introduced [14, 25, 27]. Because I expect only a small number
of outliers per neighborhood, the new scale estimator uses a simple α-trimmed standard
deviation: remove the 3 lowest and 3 highest values and then compute the standard
deviation. The combination of this estimator with choosing the minimum estimate over
all neighborhoods seems to work well and is, I believe, entirely new.
3 This also biases the estimates downwards: calculating the amount of bias is a topic of on-going
research.
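The 1D estimator just described (an α-trimmed deviation, minimized over every neighborhood that contains the target location) might be sketched as follows. This is an illustrative reconstruction: the function names are mine, and the population standard deviation is one plausible choice of deviation measure.

```python
import statistics

def trimmed_std(values, k=3):
    """Alpha-trimmed standard deviation: drop the k lowest and k highest
    values, then take the (population) standard deviation of the rest."""
    v = sorted(values)
    return statistics.pstdev(v[k:len(v) - k])

def min_scale(signal, x, w, k=3):
    """Scale at x: the minimum trimmed deviation over all width-(2w+1)
    neighborhoods N_w(y) that contain x and fit inside the signal."""
    n = len(signal)
    best = float("inf")
    for y in range(max(w, x - w), min(n - w, x + w + 1)):
        best = min(best, trimmed_std(signal[y - w:y + w + 1], k))
    return best
```

Near a step, some neighborhood containing x lies entirely on one side of the step, so the minimum stays close to the true fine-variation scale instead of being inflated by the boundary.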
Fig. 3. Boundaries from texture features: a natural textured image (256 by 256, w = 8), a pair
of textures from Brodatz's volume [3] normalized to the same mean log intensity (200 by 200,
w = 12), and a synthetic test image containing sine waves and step edges (250 by 150, w = 8).
Fig. 4. The thin bar generates outliers in intensities. The change in phase and the contrast
reversal generate outliers in various texture features. White and black occupy equal percentages
of the contrast-reversal image. The images are 200 by 100 and were analyzed with w = 8.
There are robust estimators which can tolerate neighborhoods containing up to 50%
outlier values [10, 11]. However, despite some recent suggestions [22, 29], it is not pos-
sible to eliminate the minimization step by using such an estimator. The neighborhood
centered about a location very close to a boundary typically has more than 50% "bad"
values: values from the wrong region, values from the blur area, and random wild outliers.
This effect becomes worse in 2D: the neighborhood of a point inside a sharp corner can
contain over 75% "bad" values. Furthermore, real patterns of variation have bimodal or
even binary distributions (e.g. a sine wave of period 4 can digitize as binary). Robust es-
timators tolerating high percentages of outliers are all based on medians, which perform
very poorly on such distributions [1, 32].
I am currently exploring three possible ways of extending this scale estimator to 2D. In
2D, it is not practical to enumerate all neighborhoods containing the target location x, so
the estimator must consider only a selection. Which neighborhoods are considered deter-
mines which region shapes the edge detector can represent accurately. At each location
x, the current implementation computes 1D estimates along lines passing through x in 8
directions. The estimate at x is then the median of these 8 estimates. Although its results
(see Figs. 2-4) are promising, it cannot match human ability to segment narrow regions
containing coarse-ish texture. This suggests it is not making full use of the information
contained in the locations near x. Furthermore, it rounds corners sharper than 90 degrees
and makes some mistakes inside other corners.
Another option would be to compute scale estimates for a large range of neighborhood
shapes, e.g. the pie-wedge neighborhoods proposed in [17]. Such an algorithm would be
reliable but very slow, unless tricks can be found to speed up computation. Finally, one
might compute scale only for a small number of neighborhoods, e.g. the round neigh-
borhood centered about each location x, and then propagate good scale estimates to
nearby locations in the spirit of [18]. The difficulty here is to avoid growing very jagged
neighborhoods and, thus, hypothesizing jagged region boundaries.
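The first option, as described for the current implementation, can be sketched as follows. This is my own reconstruction: the particular set of eight line orientations is an assumption (the paper does not list them), and a plain standard deviation stands in for the full trimmed, minimized 1D estimator.

```python
import statistics

# One plausible set of eight line orientations through a pixel, as (dx, dy).
DIRECTIONS = [(1, 0), (0, 1), (1, 1), (1, -1), (2, 1), (1, 2), (2, -1), (1, -2)]

def line_values(img, x, y, d, w):
    """Sample up to 2w+1 values along direction d through (x, y),
    clipped to the image bounds."""
    rows, cols = len(img), len(img[0])
    dx, dy = d
    return [img[y + t * dy][x + t * dx]
            for t in range(-w, w + 1)
            if 0 <= y + t * dy < rows and 0 <= x + t * dx < cols]

def scale_2d(img, x, y, w):
    """Median of the eight 1D scale estimates along lines through (x, y)."""
    return statistics.median(
        statistics.pstdev(line_values(img, x, y, d, w)) for d in DIRECTIONS)
```

Taking the median of the eight directional estimates makes the result insensitive to the few directions whose lines cross a nearby boundary.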
w_i = 1,                        if |v_i - V| <= 3S;
w_i = 2(1 - |v_i - V| / (6S)),  if 3S < |v_i - V| < 6S;
w_i = 0,                        otherwise.
This is a one-step W-estimate, a convenient way of approximating an M-estimate (the
multi-step versions are asymptotically equivalent) [10].5 A wide variety of weighting
4 This is the smallest smoothing neighborhood that still allows smoothing to "jump over" a
thin (one cell wide) streak of outliers.
5 Note that in this method of repeatedly applying the estimator and smoothing, the scale
estimates converge to zero, because information is diffused across the image. A traditional
multi-step estimator [10, 11, 14] is very different: scale estimates converge to a non-zero value.
156
functions are possible (e.g. see [10, 11]). This one has a shape similar to the better-
behaved ones (e.g. a smooth cutoff) but is easy to compute.
In order to eventually identify outliers, the smoother also computes a second field
which measures how much each value resembles the neighboring ones. Initially, these
strengths are set to a constant value (currently 200). In each iteration, the strength at
location x is replaced by a weighted sum of the strengths in a ±3 by ±3 neighborhood of x:

s(x) = ( Σ s_i w_i ) / 49

where s_i is the input strength at location i in the neighborhood and the w_i are the same
weights used in smoothing.
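One iteration of this smoother might look like the sketch below, written in 1D for brevity (a ±3 strip, normalized by 7, in place of the ±3 by ±3 window normalized by 49). The piecewise-linear cutoff weight follows the form described above; all names and the zero-scale guard are my own choices.

```python
def weight(v_i, V, S):
    """Piecewise-linear cutoff weight: full weight within 3 scales,
    tapering linearly to zero at 6 scales."""
    d = abs(v_i - V)
    if S == 0:
        return 1.0 if d == 0 else 0.0
    if d <= 3 * S:
        return 1.0
    if d < 6 * S:
        return 2.0 * (1.0 - d / (6.0 * S))
    return 0.0

def smooth_once(values, strengths, scales):
    """One iteration on a 1D strip: each interior value becomes a weighted
    average of its +/-3 neighborhood; each strength becomes a weighted sum
    of neighbor strengths divided by the window size (7 here)."""
    n = len(values)
    new_v, new_s = values[:], strengths[:]
    for x in range(3, n - 3):
        idx = range(x - 3, x + 4)
        ws = [weight(values[i], values[x], scales[x]) for i in idx]
        total = sum(ws)
        if total > 0:
            new_v[x] = sum(w * values[i] for w, i in zip(ws, idx)) / total
        new_s[x] = sum(w * strengths[i] for w, i in zip(ws, idx)) / 7.0
    return new_v, new_s
```

Cells whose neighbors are all far away in value receive near-zero weights, so their strength decays toward zero over iterations; that is what later flags them as outliers.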
In the current implementation, smoothing is repeated 3 times.6 Boundaries7 are then
detected by locating sharp changes and outlier values. First, scale is re-estimated at all
locations. Two values are then considered significantly different if they differ by more
than 6 times the smaller of their associated scales. If the values at two adjacent cells are
significantly different, a boundary is marked between the cells. If there is a significant
difference between two opposite neighbors of some cell A, but not between A and either
one of them, the whole cell A is marked as part of the boundaries. Any cell with strength
less than 20 is marked as an outlier. This outlier map is then pruned to remove all outliers
that are not either (a) in the middle of a sharp change or (b) in a band of outliers at
least 2 cells wide. Any cell still marked as an outlier after pruning is marked as part of
the boundaries.
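The significance test and the adjacent-cell boundary marking can be sketched directly. The factor of 6 and the use of the smaller scale follow the text; the function names and 1D framing are mine.

```python
def significantly_different(v1, v2, s1, s2, factor=6.0):
    """Two values differ significantly if they differ by more than
    `factor` times the smaller of their two scale estimates."""
    return abs(v1 - v2) > factor * min(s1, s2)

def mark_boundaries(values, scales):
    """Indices i where a boundary is marked between cells i and i+1."""
    return [i for i in range(len(values) - 1)
            if significantly_different(values[i], values[i + 1],
                                       scales[i], scales[i + 1])]
```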
The new edge finder extends easily to vector-valued images, e.g. color images or sets of
texture features. I assume that the pattern of variation in the vectors can be accurately
represented by the scale of variation in each individual dimension. This assumption seems
plausible for most computer vision applications, and removing it seems to be very difficult
(cf. [26]). The current implementation uses an L∞ metric (maximum distance in any
component), because it simplifies coding.
Specifically, in the vector algorithm, scale is estimated separately for each feature.8 In
each smoothing iteration, the weighted average is computed separately for each feature,
but using a set of weights common to all features. Specifically, the weights are first
computed for each feature map individually and the minimum of these values is used
as the common weight. The common weights are also used to compute a single strength
map, so only one common set of outliers is detected. Sharp changes, however, are detected
in each map individually and AND-ed into the common boundary map.
It is essential to use a common set of weights for outlier detection if boundaries in
different features may not be exactly aligned. Suppose that a change from A to A′ in
one map occurs, rapidly followed by a change from B to B′ in another. Cells in the two
regions have value AB or A′B′, but cells in a tiny strip between the regions have value
6 This seems empirically to be sufficient, but more detailed theoretical and practical study of
convergence is needed.
7 See the theoretical model in [4, 5, 6]. The boundaries are a closed set of vertices, edges between
cells, and entire cells. An arbitrary set of boundary markings can be made closed by ensuring
that all edges of a boundary cell are in the boundaries and all vertices of a boundary edge are
in the boundaries.
8 Note that the minimum scale for different features can be different.
A′B. These cells will not appear to be outliers in either map individually, but they stand
out clearly when the maps are considered jointly.
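The common-weight computation for vector data might be sketched like this, again on a 1D ±3 strip for brevity; the weight function is passed in as a parameter and all names are mine.

```python
def common_weights(feature_maps, x, scale_maps, weight_fn):
    """Weights for the +/-3 strip around x on a multi-feature strip:
    per-feature weights are computed independently, then the minimum
    across features becomes the single common weight per cell."""
    idx = list(range(x - 3, x + 4))
    return idx, [min(weight_fn(fm[i], fm[x], sm[x])
                     for fm, sm in zip(feature_maps, scale_maps))
                 for i in idx]
```

Because the minimum is taken across features, a cell that is anomalous in any one feature map gets a low common weight, so the joint A′B strip described above is suppressed even though neither map flags it alone.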
Interesting issues arise when one feature is available at higher resolution than another.
For example, people see intensities at much higher resolution than hue or saturation of
color. As it stands, the algorithm may not localize a boundary to the full resolution
available from the highest resolution feature, but may report that other locations near
the boundary also have outlier values (i.e. due to the blurring in the low-resolution
features). In a sense, this is correct, because values at these locations are genuinely
inaccurate and should not be averaged into estimates of region properties. However, it
might be useful to add further algorithms that would refine boundary locations using
the more reliable feature values, making appropriate corrections to the corrupted values
from other features as cells are removed from the boundaries.
6 Conclusions
This paper has presented a new method for estimating the scale of variation in values
within image regions. These estimates were used to extract boundaries at high resolution
from both color images and textured images. Compared to previous edge finders for tex-
tured data, these results are very promising. In particular, because it can detect outliers
as well as step changes in values, the new algorithm can segment a new class of examples
(as in Fig. 4). This poses an interesting problem for studies of human preattentive texture
discrimination: if the human segmentation algorithm also detects outliers then being able
to preattentively segment a pair of textures does not automatically imply that some fea-
ture assigns different values to them. This implies that traditional texture discrimination
experiments may require additional controls, but it also opens up possibilities for new
types of experiments that would examine which sorts of texture mis-matches do, and do
not, generate visible outliers.
My ultimate goal is to bring the new edge finder's performance up to the standards
of conventional edge finders. By this yardstick its performance is far from perfect and
barely approaching the point where it would be suitable for later applications, such as
shape analysis and object recognition. It is quite slow. Many of the algorithm details,
particularly parameter settings and the 2D extension of the scale estimator, need further
tuning. Many theoretical issues (e.g. bias in the scale estimator, convergence) still need
to be examined. I believe that there is much scope for further work in this area.
Acknowledgements
Mike Brady and Max Mintz supplied useful comments and/or pointers.
Appendix: Details of Texture Features
The features used for the texture examples are a new set currently under development.
The new features were chosen because they have small support and reasonable noise
resistance, and they return a constant output field on their ideal input patterns. The
closely related features proposed in [2, 20, 23] have either much larger support or large
fluctuations in value even on ideal input patterns. Comparative testing of texture features
is well beyond the scope of this paper and I make no claims that these features are the
best available.
This method models the texture as a sine wave, and the five features measure its mean
intensity, orientation, frequency, and amplitude. The first feature is log intensity L:
E0 = sqrt(D2·D2 + D1·D3),  F0 = sqrt(D3·D3 + D2·D4)
X = E0 - E90,  Y = E45 - E135
References
1. Jaakko Astola, Pekka Heinonen, and Yrjö Neuvo (1987) "On Root Structures of Median and
Median-Type Filters," IEEE Trans. Acoust., Speech, Signal Proc. ASSP-35/8, pp. 1199-
1201.
2. Alan C. Bovik, Marianna Clark, and Wilson S. Geisler (1990) "Multichannel Texture Anal-
ysis using Localized Spatial Filtering," IEEE Trans. Patt. Analy. and Mach. Intell. 12/1,
pp. 55-73.
3. P. Brodatz (1966) Textures, Dover.
4. Margaret M. Fleck (1988) "Boundaries and Topological Algorithms," Ph.D. thesis, MIT,
Dept. of Elec. Eng. and Comp. Sci., available as MIT Artif. Intell. Lab TR-1065.
5. Margaret M. Fleck (1991) "A Topological Stereo Matcher," Inter. Jour. Comp. Vis. 6/3,
pp. 197-226.
6. Margaret M. Fleck (1990) "Some Defects in Finite-Difference Edge Finders," OUEL Report
No. 1826/90, Oxford Univ., Dept. of Eng. Science, to appear in IEEE Trans. Patt. Analy.
and Mach. Intell.
7. Davi Geiger and Federico Girosi (1991) "Parallel and Deterministic Algorithms from
MRF's: Surface Reconstruction," IEEE Trans. Patt. Analy. and Mach. Intell. 13/5, pp.
401-412.
8. S. Geman and D. Geman (1984) "Stochastic Relaxation, Gibbs distributions, and the
Bayesian Restoration of Images," IEEE Trans. Patt. Analy. and Mach. Intell. 6/6, pp.
721-741.
9. Donald Geman, Stuart Geman, Christine Graffigne, and Ping Dong (1990) "Boundary
Detection by Constrained Optimization," IEEE Trans. Patt. Analy. and Mach. Intell. 12/7,
pp. 609-628.
10. Frank R. Hampel, Elvezio M. Ronchetti, Peter J. Rousseeuw, and Werner A. Stahel (1986)
Robust Statistics: The Approach Based on Influence Functions, John Wiley, New York.
11. David C. Hoaglin, Frederick Mosteller, and John W. Tukey, eds.(1983) Understanding Ro-
bust and Exploratory Data Analysis, John Wiley, New York.
12. Peter J. Huber (1981) Robust Statistics, Wiley, NY.
159
13. John Y. Hsiao and Alexander A. Sawchuk (1989) "Supervised Textured Image Segmentation
using Feature Smoothing and Probabilistic Relaxation Techniques," IEEE Trans. Patt.
Analy. and Mach. Intell. 11/12, pp. 1279-1292.
14. Rangasami L. Kashyap and Kie-Bum Eom (1988) "Robust Image Modelling Techniques
with an Image Restoration Application," IEEE Trans. Acoust., Speech, Signal Proc. ASSP-
36/8, pp. 1313-1325.
15. Kenneth I. Laws (1979) "Texture Energy Measures," Proc. DARPA Im. Underst. Work.
1979, pp. 47-51.
16. Yvan Leclerc and Steven W. Zucker (1987) "The Local Structure of Image Discontinuities
in One Dimension," IEEE Trans. Patt. Analy. and Mach. Intell. 9/3, pp. 341-355.
17. Yvan Leclerc (1985) "Capturing the Local Structure of Image Discontinuities in Two Di-
mensions," Proc. IEEE Conf. on Comp. Vis. and Patt. Recogn. 1985, pp. 34-38.
18. Aleš Leonardis, Alok Gupta, and Ruzena Bajcsy (1990) "Segmentation as the Search of the
Best Description of an Image in Terms of Primitives," Proc. Inter. Conf. on Comp. Vis.
1990, pp. 121-125.
19. Joseph T. Maleson, Christopher M. Brown, and Jerome A. Feldman, "Understanding Nat-
ural Texture," Proc. DARPA Im. Underst. Work. 1977, Palo Alto, CA, pp. 19-27.
20. Jitendra Malik and Pietro Perona (1990) "Preattentive Texture Discrimination with Early
Vision Mechanisms," Journ. Opt. Soc. Amer. A 7/5, pp. 923-932.
21. Pietro Perona and Jitendra Malik (1990) "Scale-Space and Edge Detection using
Anisotropic Diffusion," IEEE Trans. Patt. Analy. and Mach. Intell. 12/7, pp. 629-639.
22. Peter Meer, Doron Mintz, Dong Yoon Kim, and Azriel Rosenfeld (1991) "Robust Regression
Methods for Computer Vision: A Review," Inter. Jour. Comp. Vis. 6/1, pp. 59-70.
23. M. Concetta Morrone and Robyn Owens (1987) "Feature Detection from Local Energy,"
Pattern Recognition Letters 6/5, pp. 303-313.
24. Makoto Nagao and Takashi Matsuyama (1979) "Edge Preserving Smoothing," Comp. Graph.
and Im. Proc. 9/4, pp. 394-407.
25. Carlos A. Pomalaza-Raez and Clare D. McGillem (1984) "An Adaptive, Nonlinear Edge-
Preserving Filter," IEEE Trans. Acoust., Speech, Signal Proc. ASSP-32/3, pp. 571-576.
26. Peter J. Rousseeuw and Annick M. Leroy (1987) Robust Regression and Outlier Detection,
John Wiley, New York.
27. Schunck, Brian G. (1989) "Image Flow Segmentation and Estimation by Constraint Line
Clustering," IEEE Trans. Patt. Analy. and Mach. Intell. 11/10, pp. 1010-1027.
28. Philippe Saint-Marc, Jer-Sen Chen, and Gérard Medioni (1991) "Adaptive Smoothing:
A General Tool for Early Vision," IEEE Trans. Patt. Analy. and Mach. Intell. 13/6, pp.
514-529.
29. Sarvajit S. Sinha and Brian G. Schunck (1992) "A Two-Stage Algorithm for Discontinuity-
Preserving Surface Reconstruction," IEEE Trans. Patt. Analy. and Mach. Intell. 14/1, pp.
36-55.
30. Fumiaki Tomita and Saburo Tsuji (1977) "Extraction of Multiple Regions by Smoothing
in Selected Neighborhoods," IEEE Trans. Syst., Man, Cybern. 7/2, 107-109.
31. Richard Vistnes (1989) "Texture Models and Image Measures for Texture Discrimination,"
Inter. Jour. Comp. Vis. 3/4, pp. 313-336.
32. S. G. Tyan (1981) "Median Filtering: Deterministic Properties," in T.S. Huang, ed., Two-
Dimensional Digital Signal Processing II: Transforms and Median Filters, Springer-Verlag,
Berlin, pp. 197-217.
33. Harry Voorhees and Tomaso Poggio (1987) "Detecting Textons and Texture Boundaries in
Natural Images," Proc. Inter. Conf. on Comp. Vis. 1987, London, pp. 250-258.
This article was processed using the LaTeX macro package with ECCV92 style
TEXTURE PARAMETRIZATION METHOD
FOR IMAGE SEGMENTATION
The research described in this paper has been supported partially by SKIDS (ESPRIT - 1560)
Fig. 1 shows the affinities between different, clearly defined textures and the six
texture parameters considered.
With these six parameters, which are not mutually exclusive, a texture vector
T[ξ1, ..., ξ6] is defined, which quantifies the degree of affinity of every region of an image to
each one of these parameters.
These selected parameters are not essentially different from some of the classic
ones utilized, such as coarseness, directionality, and regularity. On the other hand, the
natural texture patterns defined by Brodatz [Bro] are also defined from these parameters;
some of them can be defined in terms of exclusively one parameter, while others need
up to three different ones.
The use of these parameters, as they are defined, does not provide information such as
regularity/irregularity or symmetry/asymmetry, which corresponds to a more global
scope that surpasses the region evaluated and has to be obtained at a higher
processing level.
Once the six texture parameters have been obtained for every pixel of the image,
taking into account its 8x8 pixel environment, a new lower-resolution image is
generated. This new texture image, of 1/8 resolution with respect to the original one, is
oriented to contribute to image segmentation, mainly complementing the information
provided by the contours and solving the conflicts produced by the contours' lack of
continuity.
2 Quantification Algorithms
The defined parameters are quantified both by their modulus and their argument,
which corresponds to the main direction of the pattern considered, except the parameters
blurriness and granularity, which are considered isotropic.
The six parameters ξi are quantified by applying to each 8x8 pixel region, organized
in 4 subregions of 4x4 pixels, the following algorithms:
Blurriness: ξB = KB · [(A - D) + (C - B)] / (1 + Σ Pij)
where A, B, C and D are the average gray levels corresponding to the four subregions
that constitute a texel, and Pij are the pixels belonging to a contour within this
region.
Granularity: ξG = KG · Σ Qij
where Qij are the pixels contained in the 8x8 region which are detected as contours,
but in chains of length L ≤ 2.
Curliness: ξC = Σ (Qij - Lij)
where Qij are the pixels concatenated in chains of length > 3 and Lij are the aligned pixels.
Straightness: ξL = KL L
The value of ξL is restricted to 0 < ξL < 8, where L is the number of aligned pixels within
a texel (5 < L < 8).
Abruptness: ξA = KA α
where α is the angle of the smallest vertex contained in a texel.
3 Hardware Implementation
parts, each one separately performing a function of complexity 2^16, which can then be
implemented with a single memory module (Fig. 2).
The decomposition of the texel into four quadrants imposes some restrictions, since
the quadrants cannot interchange information, but it still allows some global operations
over an 8x8 neighbourhood, as in the straight-line detection process, where
both the slope of the line and its continuity across adjacent subregions can be
analyzed.
The processor operation time is the time required to obtain the six texture
parameters plus that required to process these data. The six parameters are obtained at
video rate, but with a delay of 3+4+8 pixels. The algorithms used to process the texture
parameters depend on the complexity of the scene, with a range from 20 to 60 ms.
Fig. 2. Architecture of the processor designed to measure
in real time the parameters ξi
4 Results
The texture images obtained from images of 256x256 pixels, each texel having a size of
8x8 pixels, result in a much lower resolution, 32x32 texels. In spite of the low resolution
of the texture function generated, the results obtained are considered satisfactory for
image segmentation of scenes in structured environments. The segmentation of the
image is based on the circumscribed contour of the elements of the scene, and the
texture information is used to differentiate the regions. Fig. 3 shows the results obtained
by applying successively the straightness (3b), granularity (3c) and blurriness (3d)
operators. These results can still be improved by using color as complementary
information for scene segmentation.
References
[Bou] Bouman, C., Liu, B.: Multiple Resolution Segmentation of Textured Images.
IEEE Trans. PAMI Vol. 13, No. 2 (1991)
[Bro] Brodatz, P.: Textures. New York, Dover (1966)
[Buf] du Buf, J.M.H. et al.: Texture Feature Performance for Image Segmentation.
Pattern Recognition, Vol. 23, No. 4, 291-309 (1990)
[Cas] Casals, A. and Pages, J.: A Vision System for Agricultural Machines
Guidance. IARP Workshop on Robotics in Agriculture and the Food Industry (1990)
[Har] Haralick, R.M.: Statistical and Structural Approaches to Texture. Proc. Int.
Joint Conference on Pattern Recognition, Vol. 4, 45-69, Kyoto (1978)
[Man] Manjunath, B.S. and Chellappa, R.: Unsupervised Texture Segmentation
Using Markov Random Field Models. PAMI Vol. 13, No. 5 (1991)
[Mat] Matsuyama et al.: A Structural Analyzer for Regularly Arranged Textures.
Computer Graphics and Image Processing 18, 259-279 (1982)
[Shi] Shipley, T. and Shore, T.: The Human Texture Visual Field: Fovea-to-
Periphery Pattern Recognition. Pattern Recognition, Vol. 23, No. 11 (1990)
[Tsu] Tsuji, S. and Tomita, F.: A Structural Analyzer for a Class of Textures.
Computer Graphics and Image Processing 2, 216-231 (1973)
[Zuc] Zucker, S.W.: Toward a Model of Texture. Computer Graphics and Image
Processing 5, 190-202 (1976)
Texture Segmentation by Minimizing Vector-Valued
Energy Functionals: The Coupled-Membrane Model
1 Introduction
This paper presents a computational model that segments images based on the textural
properties of object surfaces. The proposed model distinguishes itself from the previous
models in texture segmentation [Turner 1986, Voorhees and Poggio 1988, Malik and
Perona 1989, Fogel and Sagi 1989, Bovik, Clark and Geisler 1990, Reed and Wechsler
1990, Geman et al 1990] in the following way.
Previous models have started with the extraction from the image I(x, y) of some set of
texture features which can be viewed as forming auxiliary texture images Ii(x, y). Then,
by applying either region growing, boundary detection, or (in the single paper [Geman et al
1990]) a membrane-like method combining these two, a segmentation is derived. In our
model, the texture features are the power responses of quadrature Gabor filters. These
filters form a continuous family depending on two variables σ, θ, and can be derived like
wavelets from dilation and rotation of a single filter. Thus we think of the texture features
as combining into a single image WI(σ, θ, x, y) depending on 4 continuous variables.
We apply the weak membrane approach to segmenting this signal, in which coupling is
introduced in all 4 dimensions, but breaks are allowed in x and y only. We call this the
Coupled-Membrane model.
Why is this model useful? Previous methods generally deal only with textures that are
statistically stationary (i.e. approximately translationally invariant) and not too granular
(e.g. without widely spaced textons or large local variations). But natural textures satisfy
neither requirement. Firstly, they show considerable texture 'gradients', in which the power dis-
tribution of the texture among various channels changes slowly but systematically over
* This research is supported in part by Harvard-MIT Division of Health Sciences & Technology
and Harvard's Division of Applied Sciences' fellowship to T.S. Lee, Army Research Office
grant DAAL03-86-0171 to D. Mumford, and NSF grant IRI-9003306 to A. Yuille. Interesting
discussion and technical help from John Daugman, Mark Nitzberg, Peter Hallinan, Peter
Belhumeur, Michael Weisman, and Petros Maragos are greatly appreciated.
a region, due for instance to the perspective affine distortion imposed on surface features
of solid objects in a 3-dimensional world, and to the deformation caused by the non-
planarity of the objects' body shapes. J. J. Gibson [Gibson 1979] has emphasized how
these texture gradients are ubiquitous clues to the 3D structure of the world. Secondly,
they show random local fluctuations, due to the stochasticity in their generation processes,
which are often quite large (compare the four subimages in 'Mosaic' below, taken
from [Brodatz 1966]). The inter-membrane couplings unique to our Coupled-Membrane
model allow interaction between neighboring components in the spectral vector and pre-
vent minor local variations from producing segmentation boundaries. At the same time,
they introduce explicitly the appropriate metric between texture channels so that a shift
in the peak of the power spectrum to a nearby frequency or orientation is treated differ-
ently from a shift to a distant frequency or orientation. As we shall see, this allows us to
begin to solve these problems for natural textures.
This paper is organized as follows: first, we discuss how texture is represented in
our model and how texture disparity can be computed from this representation. Then,
we discuss the Coupled-Membrane model for texture segmentation in its continuous
formulation and discrete approximation. Finally, we present our experimental results.
G_{σ,θ}(x, y) = (σ²/4π) e^{−(σ²/8)[4(x cos θ + y sin θ)² + (−x sin θ + y cos θ)²]} e^{i σ(x cos θ + y sin θ)}    (1)
where σ is the radial frequency, and θ is the angular orientation of the filter.
In a manner completely analogous to the generation of wavelet bases from a single
basic wavelet, this whole family of Gabor filters can be generated by rotation and dilation
from the following single Gabor filter (as shown in figure 1):
G(x, y) = (1/4π) e^{−(1/8)(4x² + y²)} e^{ix}    (2)
Self-similar Gabor filters from this family serve both as band-pass filters and multi-
scale matched filters, producing a representational scheme that unifies power-spectrum
analysis and feature detection.
The convolution of this family of filters with the image produces a single image
WI(σ, θ, x, y), the normalized power modulus of the filter ensemble, as follows:
WI(σ, θ, x, y) = (1/Z) |(G_{σ,θ} ∗ I)(x, y)|²    (3)
where
(G_{σ,θ} ∗ I)(x, y) = ∫∫ G_{σ,θ}(x − x′, y − y′) I(x′, y′) dx′ dy′    (4)
and
Z = ∫∫ |(G_{σ,θ} ∗ I)(x, y)|² dx dy    (5)
Since each Gabor filter has a Gaussian spread in the frequency plane, the local power
spectrum of an image can be sampled in a parsimonious and discrete manner. In our
implementation, we use Gabor wavelets with a sampling interval of one octave in frequency
and 22.5° in orientation to pave the spatial frequency plane, as shown in figure 2.
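Such a filter ensemble can be sketched as follows. The envelope and carrier follow the form of eq. (1) with the constant prefactor dropped; the particular octave ladder of frequencies, the FFT-based circular convolution, and the global normalization are implementation assumptions, not taken from the paper.

```python
import numpy as np

def gabor_grid(sigma, theta, H, W):
    """One member of the self-similar family: a complex carrier at radial
    frequency sigma and orientation theta under the elongated Gaussian
    envelope of eq. (1) (constant prefactor dropped)."""
    y, x = np.mgrid[-(H // 2):H - H // 2, -(W // 2):W - W // 2]
    u = x * np.cos(theta) + y * np.sin(theta)    # along the wave vector
    v = -x * np.sin(theta) + y * np.cos(theta)
    return np.exp(-(sigma ** 2 / 8.0) * (4.0 * u ** 2 + v ** 2)) \
        * np.exp(1j * sigma * u)

def power_responses(img, scales=(np.pi / 2, np.pi / 4, np.pi / 8), n_orients=8):
    """WI(sigma, theta, x, y): power moduli of the filter ensemble, sampled
    one octave apart in frequency and 180/8 = 22.5 degrees apart in
    orientation, normalized by the total power."""
    H, W = img.shape
    resp = np.empty((len(scales), n_orients, H, W))
    F = np.fft.fft2(img)
    for k, sigma in enumerate(scales):
        for l in range(n_orients):
            g = gabor_grid(sigma, l * np.pi / n_orients, H, W)
            # circular convolution via the FFT (an implementation choice)
            conv = np.fft.ifft2(F * np.fft.fft2(np.fft.ifftshift(g)))
            resp[k, l] = np.abs(conv) ** 2
    return resp / resp.sum()
```

The result is the 4-D array indexed by (scale, orientation, x, y) on which the coupled membranes of the following sections operate.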
where R and S are the finite 2-dimensional spatial and spectral domains respectively; the
boundary B ⊂ R is a finite set of piecewise C¹ contours which meet ∂R and meet
each other only at their endpoints. The contours of B cut R into a finite set of disjoint
regions R1, ..., Rm, the connected components of R−B. The integration over S is done with
(d log σ) dθ = (dσ/σ) dθ because the power spectrum is represented in log-polar form.
The first term of the energy functional forces the smoothed spectral response f(σ, θ, x, y)
to be as close as possible to the measured spectral response WI(σ, θ, x, y). The second
term asks the spectral response to be as smooth as possible in both the spatial and spectral
domains. These two potentially antagonistic demands arrive at a compromise that
is determined by the parameters λ, γσ and γθ.
Since f(σ, θ, x, y) is required to be smooth only within each Ri but not across B, the
third integral term is needed to prevent breaks from appearing everywhere. This term
imposes a penalty α against each break and provides the binding force within a region.
5 Computer Implementation
To solve a functional minimization problem computationally, the energy functional is
discretized as follows,
E = Σ_{i,j,k,l} [f(i,j,k,l) − WI(i,j,k,l)]²
+ λ² Σ_{i,j,k,l} [f(i,j,k,l) − f(i+1,j,k,l)]² (1 − v(i + 1/2, j))
+ λ² Σ_{i,j,k,l} [f(i,j,k,l) − f(i,j+1,k,l)]² (1 − h(i, j + 1/2))
+ γθ² Σ_{i,j,k,l} [f(i,j,k,l) − f(i,j,k+1,l)]²
+ γσ² Σ_{i,j,k,l} [f(i,j,k,l) − f(i,j,k,l+1)]²
+ α Σ_{i,j} [v(i + 1/2, j) + h(i, j + 1/2)]
where i, j, k, l are indexes for x, y, θ, log σ respectively in the 4-dimensional spatial-spectral
sampling lattice, and v and h are the vertical and horizontal breaks between lattice points in
the spatial domain.
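The discretized energy can be evaluated directly on sample arrays. The sketch below assumes arrays indexed (i, j, k, l) = (x, y, θ, log σ) and 0/1 break fields v, h, one cell smaller along their broken axis; these conventions are assumptions, not the authors' code.

```python
import numpy as np

def coupled_membrane_energy(f, WI, v, h, lam, g_theta, g_sigma, alpha):
    """Evaluate the discretized Coupled-Membrane energy (a sketch).

    f, WI : arrays of shape (I, J, K, L) indexed (x, y, theta, log sigma).
    v, h  : 0/1 vertical/horizontal break fields, shapes (I-1, J) and (I, J-1).
    """
    E = ((f - WI) ** 2).sum()                                    # data term
    dx = (f[1:, :] - f[:-1, :]) ** 2                             # spatial differences
    dy = (f[:, 1:] - f[:, :-1]) ** 2
    E += lam ** 2 * (dx * (1 - v[..., None, None])).sum()        # broken by v
    E += lam ** 2 * (dy * (1 - h[..., None, None])).sum()        # broken by h
    E += g_theta ** 2 * ((f[:, :, 1:] - f[:, :, :-1]) ** 2).sum()  # spectral coupling
    E += g_sigma ** 2 * ((f[:, :, :, 1:] - f[:, :, :, :-1]) ** 2).sum()
    E += alpha * (v.sum() + h.sum())                             # break penalty
    return E
```

Note that breaks gate only the two spatial difference terms; the spectral couplings are never broken, which is the defining feature of the model.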
Figure 4 illustrates the couplings among the nodes in the 4-D sampling lattice: each
membrane corresponds to WI(k, l) for a frequency l and an orientation k. Within each
membrane, each node is coupled to its 4 nearest neighboring nodes. At each spatial
location, a membrane is coupled with the 4 other membranes which are its nearest spectral
neighbors.
As the segmentation-diffusion process unfolds, the spectral response is allowed to diffuse
from one node to its 4 spatial and 4 spectral nearest neighbors. Breaks, however, can only
occur in the spatial domain. When the L² norm of the difference between the evolving membranes
exceeds the texture disparity threshold √α/λ at a spatial location, a break occurs at that location
and cuts across all the membranes.
Given a set of values for the parameters λ, γσ, γθ, and α, an optimal compromise among
the three terms in the energy functional produces a set of segmentation boundaries and
smoothed spectral responses. Because the energy functional has many local minima due
to its nonconvexity, the global optimal compromise can only be approached using special math-
ematical programming methods. This paper presents results obtained using a stochastic
method called Simulated Annealing [Kirkpatrick 1983] and a deterministic method called
Graduated Non-Convexity [Blake and Zisserman 1987]. We implemented both meth-
ods on a DEC 5000 workstation and on a massively parallel computer called MASPAR.
6 Experimental Results
A class of texture images, 256 x 256 pixels in size, is used to test the model. Percep-
tual boundaries in these images are defined primarily by differences in texture, and not
by luminance contrast. The segmentation-diffusion is performed on a 64 x 64 spatial
sampling grid. We use a simple annealing schedule for Simulated Annealing:
Tn = 0.985 Tn−1 at each temperature step, with a starting temperature of 25. It takes
about 24 hours on the DEC 5000 or 6 hours on the MASPAR to process each image. For GNC,
the error resolution ε needs to be 2^{−12} to ensure the solution is close to the optimal one.
It takes 140 hours on the DEC 5000 or 7 hours on the MASPAR. Despite the fast annealing
schedule, Simulated Annealing performs reasonably well. The answer provided by
GNC, however, is closer to the global minimum. These algorithms have also been imple-
mented in 1-D so that their solutions can be compared with the exact optimal solution
yielded by dynamic programming.
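The geometric cooling schedule above drops into a generic Metropolis loop as follows. This is a sketch: the `energy` and `propose` callbacks are problem-specific placeholders, and cooling after every single step (rather than once per temperature stage) is a simplification of the schedule described in the text.

```python
import math
import random

def anneal(energy, propose, state, t0=25.0, decay=0.985, steps=2000, seed=0):
    """Generic Metropolis loop with the paper's geometric schedule
    T_n = 0.985 T_{n-1}, starting at T_0 = 25 (a sketch)."""
    rng = random.Random(seed)
    T = t0
    E = energy(state)
    for _ in range(steps):
        cand = propose(state, rng)
        dE = energy(cand) - E
        # accept downhill moves always, uphill with Boltzmann probability
        if dE <= 0 or rng.random() < math.exp(-dE / T):
            state, E = cand, E + dE
        T *= decay  # cool after every step (a simplification)
    return state, E
```

For instance, minimizing a toy 1-D energy with ±1 proposals converges to the global minimum under this schedule.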
Three images are presented here as illustrations: 'Vase' (figure 5), 'Mondrian' (figure
7a), and 'Mosaic' (figure 7b). 'Vase' is used to demonstrate the model's tolerance to
texture 'gradient' due to inter-membrane coupling. When this coupling is disabled, the
segmentation is not perceptually valid (figure 5d). The initial and final responses
of the filters to 'Vase' (figure 6) demonstrate the diffusion effect in both the
spatial and spectral domains.
'Mondrian' and 'Mosaic' both demonstrate the model's ability to segment
synthetic and natural textures while withstanding significant texture noise and local
Fig. 5. (a) 'Vase' and its segmentations: (b) Simulated Annealing result
with α = 0.02, λ = 6, γθ = 2, γσ = 4; (c) GNC result with α = 0.02, λ = 6, γθ = 2, γσ = 4;
(d) GNC with α = 1.25, λ = 6, γθ = γσ = 0, i.e. without inter-membrane coupling.
Fig. 6. (a) Initial filter response map for 'Vase'. (b) Final filter response map at the end of the
segmentation-diffusion process (figure 5c). Each small square is the response map of a particular
filter to the image. The maps are arranged in frequency rows (three frequencies) and orientation
columns (eight orientations).
variation in scale and orientation. The initial and final response maps of 'Mondrian'
(figure 8) underscore the cooperative effect of the diffusion and segmentation processes
in producing a sharp texture boundary from fuzzy input.
The parameter values used for the segmentation-diffusion process are shown in the
figure caption. For the series of images we tested, the values needed to produce a seg-
mentation similar to our perception are fairly close together.
7 Discussion
Fig. 8. (a) Initial filter response map for 'Mondrian'. (b) Final filter response map at the end
of the segmentation-diffusion process.
luminance edge information derived from the same Gabor-wavelet representation, and
by modifying the domain of integration in the energy functional. This effort will be
reported in another paper.
A similar approach can be taken to the problem of speech segmentation, which is
presently done with either hidden Markov models or time-warping. We
propose that segmentation of time by a Coupled-String model applied to the power spec-
trum of speech, with couplings between adjacent values of time and frequency, provides
a third approach. The Coupled-String model is amenable to dynamic programming and
hence fast, and should be effective for all phonemes without the need to model each phoneme
in detail.
The model uses neurophysiological components as its processing elements, and can
be implemented in a locally connected parallel network. There is a strong possibility that
it can be linked to the computational processes in the visual cortex. For instance, the
segmentation process is related to boundary perception, while the diffusion process can
be linked to texture grouping or diffusion phenomena in psychology. Our work suggests
that when cortical complex cells are coupled in a particular fashion, a successive gradient
descent type of algorithm can solve a class of image segmentation problems that are
essential to visual perception.
References
1. Blake, A. & Zisserman, A. (1987) Visual Reconstruction. The MIT Press, Cambridge,
Massachusetts.
2. Bovik, A.C., Clark, M. & Geisler, W.S. (1990) Multichannel Texture Analysis Using Lo-
calized Spatial Filters. IEEE Transactions on Pattern Analysis and Machine Intelligence,
Vol. 12, No. 1, January.
3. Brodatz, P. (1966) Texture - A Photographic Album for Artists and Designers. New York:
Dover.
4. Daugman, J.G. (1985) Uncertainty relation for resolution in space, spatial frequency, and
orientation optimized by two-dimensional visual cortical filters. J. Opt. Soc. Amer., Vol.
2, No. 7, pp 1160-1169.
5. Fogel, I. & Sagi, D. (1989) Gabor Filters as Texture Discriminator. Biological Cybernetics
61, 103-113.
6. Geiger, D. & Yuille, A. (1991) A Common Framework for Image Segmentation. Intl.
Journal of Computer Vision, 6:3, 227-243.
7. Geman, D., Geman, S., Graffigne, C., & Dong, P. (1990) Boundary Detection by Con-
strained Optimization. IEEE Transactions on Pattern Analysis and Machine Intelligence,
Vol. 12, No. 7, July.
8. Gibson, J.J. (1979) The Ecological Approach to Visual Perception, Houghton-Mifflin.
9. Heeger, D.J. (1989) Computational Model of Cat Striate Physiology. MIT Media Labora-
tory Technical Report 125. October, 1989.
10. Hubel, D.H., & Wiesel, T.N. (1977) Functional Architecture of macaque monkey visual
cortex. Proc. R. Soc. Lond. B. 198.
11. Kirkpatrick, S., Gelatt, C.D. & Vecchi, M.P. (1983) Optimization by simulated annealing.
Science, 220, 671-680.
12. Lee, T.S., Mumford, D. & Yuille, A. (1991) Texture Segmentation by Minimizing Vector-
Valued Energy Functionals: The Coupled-Membrane Model. Harvard Robotics Laboratory
Technical Report no. 91-22.
13. Malik, J. & Perona, P. (1989) A computational model for texture segmentation. IEEE
CVPR Conference Proceedings.
14. Marroquin, J.L. (1984). Surface Reconstruction Preserving Discontinuities (Artificial In-
telligence Lab. Memo 792). MIT, Cambridge, MA. A more refined version " A Probabilistic
Approach to Computational Vision" appeared in Image Understanding 1989. Ed. by Ull-
man, S. & Whitman Richards. Ablex Publishing Corporation, New Jersey 1990.
15. Mumford, D. & Shah, J. (1985) Boundary Detection by Minimizing Functionals, I. IEEE
CVPR Conference Proceedings, June 19-23. A more detailed and refined version appeared
in Image Understanding 1989. Ed. by Ullman, S. & Whitman Richards. Ablex Publishing
Corporation, New Jersey 1990.
16. Pollen, D.A., Gaska, J.P. & Jacobson, L.D. (1989) Physiological Constraints on Models
of Visual Cortical Function. In Models of Brain Function, Ed. Rodney M.J. Cotterill.
Cambridge University Press, England.
17. Reed, T.R. & Wechsler, H. (1990) Segmentation of Textured Images and Gestalt Orga-
nization Using Spatial/Spatial-Frequency Representation. IEEE Transactions on Pattern
Analysis and Machine Intelligence, Vol. 12, No. 1, January.
18. Silverman, M.S., Grosof, D.H., De Valois, R.L., & Elfar, S.D. (1989) Spatial-frequency
Organization in Primate Striate Cortex. Proc. Natl. Acad. Sci. U.S.A., Vol. 86, January.
19. Voorhees, H. & T. Poggio. (1988) Computing texture boundaries in images. Nature,
333, 364-367. A detailed version exists as Voorhees' master thesis, "Finding Texture Bound-
aries in Images". MIT AI Lab Technical Report No. 968, June 1987.
20. Webster, M.A. & De Valois, R.L. (1985) Relationship between Spatial-frequency and Ori-
entation tuning of Striate-Cortex Cells. J. Opt. Soc. Am. A Vol 2, No. 7 July 1985.
This article was processed using the LaTeX macro package with ECCV92 style
Boundary Detection in Piecewise Homogeneous
Textured Images*
Stefano Casadei 1,2, Sanjoy Mitter 1,2 and Pietro Perona 3,4
1 Massachusetts Institute of Technology 35-308, Cambridge MA 02139, USA
e-mail: casadei@lids.mit.edu
2 Scuola Normale Superiore, Pisa, Italy
3 California Institute of Technology 116-81, Pasadena CA 91125, USA
4 Università di Padova, Italy
Abstract. We address the problem of scale selection in texture analysis.
Two different scale parameters, feature scale and statistical scale, are de-
fined. Statistical scale is the size of the regions used to compute averages.
We define the class of homogeneous random functions as a model of tex-
ture. A dishomogeneity function is defined and we prove that it has useful
asymptotic properties in the limit of infinite statistical scale. We describe
an algorithm for image partitioning which has performed well on piecewise
homogeneous synthetic images. This algorithm is embedded in a redundant
pyramid and does not require any ad-hoc information. It selects the optimal
statistical scale at each location in the image.
1 Introduction
The problem of texture analysis (recognition and segmentation) has traditionally been
approached by computing locally defined vector-valued "descriptors" of each region in
the image. The texture recognition problem is thus reduced to a conventional classifi-
cation problem, and boundary detection may be performed by locating areas of rapid
change in the descriptor vectors, or, dually, by clustering regions with similar descriptors.
Constraints on texture analysis algorithms come from the physics and the geometry of
image formation: often one would like to ensure invariance with respect to changes in
illumination, scaling and rotation, sometimes also to tilt and slant.
The search for good local descriptors has been the focus of much work with two main
classes of descriptors being favored in the more recent literature: (a) linear filters followed
by elementary nonlinearities and smoothing (e.g. [1, 2, 3, 4]), and (b) different statistics
of brightness computed on image patches (e.g. [5, 6, 7]). The filtering framework has
natural characteristics for addressing the scale- and rotation-invariance issues: if each
filter category is present at multiple scales and orientations, and if the discretization is
fine enough, the representation of image properties given by the filter outputs is roughly
scale and rotation-invariant.
A number of important issues having to do with scale selection and response normal-
ization have remained virtually unanswered: What are the basic regularity hypotheses
that a texture has to satisfy for the local descriptor approach to work? How does one
identify automatically the proper scale for analyzing a texture, and how does one choose
the thresholds for declaring a boundary?
In this paper we make explicit and formalize a general assumption about texture
regularity. We show how this assumption can be used to find texture boundaries and we
* Research supported by Air Force grant AFOSR-89-0276-C and ARO grant DAAL03-86-K-0171,
Center for Intelligent Control Systems
discuss its relation with the scale selection problem. We observe that two independent
scale parameters exist and must be dealt with. Finally, we show some results of an
efficient redundant-pyramid algorithm for computing homogeneous texture regions. This
algorithm is (approximately) translation-, scale- and illumination-invariant.
1.1 Overview of the contents
A general assumption that forms the basis of all approaches to texture analysis is that
a texture at some level of description is homogeneous. Although this is not a new idea,
a precise definition of what this means is still missing. The difficulty in defining texture
homogeneity is rooted in the two conflicting natures of texture, namely, randomness
and regularity. A specific texture may be modeled as a realization of a random function.
We propose to define a random function as homogeneous if, with probability one, spatial
averages of any local operator are constant with respect to space in the limit of infinite
size of the averaging region. To quote Wilson [8, 9], when the scale of averaging is infinite
the class-localization of the texture pattern must become infinitely accurate. Of course,
in practical circumstances we cannot take averages over infinite regions: for practical
purposes the averaged texture features (i.e. the averaged output of the local operators)
must be close to space-independence for finite averaging regions. When this happens we
say that we are near the thermodynamic limit of a given texture. Due to the trade-off
which exists between reliably estimating texture properties (which requires large averag-
ing windows) and localizing the textured regions (which can be done more accurately if
the data are not too blurry) it is important to be able to estimate the smallest averaging
scale that approaches the thermodynamic limit.
The scale of averaging will also be called statistical scale or external scale. It must
be distinguished from a different scale parameter: the feature scale or internal scale. The
latter corresponds to the size of the support of a given local operator. These two scale
parameters are unrelated (apart from the obvious fact that the external scale must be
greater than the internal scale). This is illustrated in figure 1. Note that even if the scale of
the texture elements is the same for all images, the behavior along the statistical scale is
quite different. A multiscale representation of a textured image should therefore contain
two scale parameters (this observation has been independently made by Eric Saund,
personal communication).
In order to select the most appropriate external scale(s) for analyzing a given texture
we need to quantify the level of variability of the descriptors at any given statistical scale.
To this purpose we define a real positive function, the dishomogeneity function, of two
arguments, spatial position and statistical scale. Low dishomogeneity values will indicate
that the thermodynamical limit has been reached. In order to be useful in texture analysis
such a dishomogeneity function must satisfy two elementary properties:
1. The dishomogeneity of a region which contains only one type of texture is zero.
2. The dishomogeneity of a region which contains a texture boundary is strictly positive
if the set of local descriptors is rich enough.
In section 2 we state two results (Theorems 1,2) which ensure that the dishomogeneity
function satisfies these two properties in the thermodynamic limit for piecewise homoge-
neous textures.
Figure 2 demonstrates these concepts on a finite image. Note how the dishomogeneity
of a given texture approaches 0 when the statistical scale is sufficiently large (left).
However when the analyzed region goes over a texture boundary the dishomogeneity
176
Fig. 1. Averages over larger regions are required to detect the regularity of the pattern in tex-
tures which are sparser or contain more "randomness", since the dishomogeneity of these textures
approaches zero at larger statistical scales than for dense periodic textures. At right the dishomo-
geneity function, maximized over the image, is plotted vs. statistical scale k (i.e. max_x g'(2^k, x)
vs. k, see section 2.5) for the three different textures. Note that the dishomogeneity value of
e.g. 0.3 is attained approximately at k=4, k=5 and k=6 for the left, center and right textures
respectively. This suggests that a multiscale representation of the image should treat feature
scale and statistical scale as independent dimensions of analysis.
Fig. 2. The dishomogeneity at different scales and positions. The dishomogeneity inside each
square region shown in the upper figure is plotted below. Left: the dishomogeneity of a given
type of texture as a function of statistical scale is shown for a constant location in the image.
The x-coordinate, k, is the logarithm of the statistical scale. The linear size (length of the side in
pixels) of each square is 2^{k+1} (k = 2, 3, 4, 5). The dishomogeneity at fine scales (k = 1, 2) is zero
because at the considered location only the constant background colour is present in the smaller
squares. Right: the dishomogeneity at several positions in the image. The x-coordinate now
represents position in the image; the scale is held constant at k = 5. Note that the dishomogeneity
is high when the region contains a texture boundary.
becomes suddenly larger (right). Figure 3 shows how the dishomogeneity function is
constructed.
Fig. 3. Computation of the dishomogeneity function g'(L, x).
Our segmentation algorithm selects the optimal statistical scale at each location in the
image by minimizing the dishomogeneity function (plus a penalty against small scales)
among all possible scales.
The definitions and results below are valid for any image descriptor "regular" enough
(see [10] for a precise definition). Most of the texture features used in the literature can
be adapted to fit into this formalism. However, to give an example (and because we use
it in our implementation) we discuss a particular case in more detail. Namely, given a
set of functions h_i, i = 1, ..., n, such that h_i(x) = 0 for x ∉ B_1(0), consider the following
descriptor:
q_i · g = f_i [ ∫ h_i(x′) g(−x′) dx′ ].    (1)
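Evaluated at every location, this convolution-plus-nonlinearity construction yields descriptor maps ξ_i(x) = f_i[(h_i ∗ g)(x)], which can be sketched as follows. The FFT-based circular convolution and the kernel-centering convention are implementation choices, not taken from the paper.

```python
import numpy as np

def descriptor_maps(img, filters, nonlins):
    """Convolution-plus-nonlinearity descriptors xi_i(x) = f_i[(h_i * g)(x)]
    in the spirit of eq. (1).  `filters` are small kernels h_i supported on
    a bounded region, `nonlins` the pointwise nonlinearities f_i."""
    H, W = img.shape
    F = np.fft.fft2(img)
    out = []
    for h, f in zip(filters, nonlins):
        k = np.zeros((H, W))
        kh, kw = h.shape
        k[:kh, :kw] = h
        # move the kernel's central sample to the origin for circular convolution
        k = np.roll(k, (-(kh // 2), -(kw // 2)), axis=(0, 1))
        out.append(f(np.real(np.fft.ifft2(F * np.fft.fft2(k)))))
    return np.stack(out)
```

With step filters at several orientations and a rectifying nonlinearity this reproduces the kind of descriptor bank used in the authors' implementation (section on results).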
2.2 Homogeneous Random Images.
Let (Ω, F, P) be a probability space. A random image φ is a function defined on Ω into
the set of images: φ : Ω → G. That is, if ω ∈ Ω, then φ(ω) ∈ G is an image.
lim_{l→∞} (1/l²) ∫_{B_l(x₀)} q · T_{−x} φ(ω) dx = ξ̄_q    (2)
where ξ̄_q ∈ R^n is the asymptotic description of the random homogeneous image φ given
by the descriptor q.
To clarify the meaning of (2), let us rewrite it for the convolution-plus-nonlinearity case de-
fined in (1):
lim_{l→∞} (1/l²) ∫_{B_l(x₀)} q_i · T_{−x} φ(ω) dx = lim_{l→∞} (1/l²) ∫_{B_l(x₀)} f_i[(h_i ∗ φ(ω))(x)] dx
= lim_{l→∞} (1/l²) ∫_{B_l(x₀)} ξ_i(x) dx = ξ̄_{q,i},
where ξ̄_q = (ξ̄_{q,i} : i = 1, ..., n) and ξ_i(x) is given by ξ_i(x) = f_i[(h_i ∗ φ(ω))(x)].
That is, homogeneity requires that averages of image descriptions exist in the thermodynamic
limit and be spatially independent.
This definition is quite general. It includes all ergodic random functions, in which
case ξ̄_q is the ensemble average of the image description. It is also valid for periodic
deterministic functions such as those generated by regular repetitive texture.
However, if illumination inhomogeneities or distortions such as those created by
perspective are present, then our model of texture is no longer valid at very large scales.
A more complex model is needed in which our homogeneous random function is cou-
pled with (perturbed by) a smooth, slowly varying function. If this perturbation is small
enough, it is still possible to approach the thermodynamic limit before the long-range
variations become important. The greatest degree of homogeneity is then attained at a
finite statistical scale.
² Rotation is also a natural symmetry. However, since this paper focuses on scale issues we do
not deal with orientation explicitly.
For the convolution-plus-nonlinearity descriptor defined in (1) we have, after some simple
calculation: ξ_i(λ, x) = K_i(λ) f_i[((D_λ h_i) ∗ g)(x)]. In this case, the i-th component of the
multiscale description is obtained by convolving the image with a bank of filters generated
by dilating the template filter h_i. Wavelet representations can be expressed in this way
by letting λ = 2^k and x = (l 2^k, m 2^k), l, m, k ∈ Z, and choosing h_i, f_i in the appropriate
way.
Note that ξ(λ, x) depends only on the image inside B_λ(x).
2.4 Averaged Multiscale Representations.
ξ̄(L, λ, x) = (1/(L − λ)²) ∫_{B_{L−λ}(x)} ξ(λ, x′) dx′.    (4)
Note that the average is taken in such a way that ξ̄(L, λ, x) depends only on the
image inside B_L(x).
If g = φ(ω) is a homogeneous random image then, by definition of homogeneity (see
(2)), we have with probability 1 and for all x ∈ R² (assuming K_i(λ) = K(λ) for clarity):
lim_{L→∞} ξ̄(L, λ, x) = lim_{L→∞} (1/(L − λ)²) K(λ) ∫_{B_{L−λ}(x)} q · D_λ^{−1} T_{−x′} φ(ω) dx′ = ξ̄′(λ)    (5)
We are going to define a dishomogeneity function g'(L, x) ≥ 0 such that g'(L, x) depends
only on the image inside B_{2L}(x) and on averages of size L. We start by defining the
maximum and minimum values of ξ̄_i(L, λ, x′) inside B_{2L}(x):
Now, let s_i(M, m) be a positive real function defined for M ≥ m ≥ 0 such that
s_i(M, m) = 0 if M = m and, for each m, s_i(M, m) is strictly increasing in M (the
simplest example would be s_i(M, m) = M − m).
Definition 8. The dishomogeneity function is then defined by taking the most dishomo-
geneous "channel": setting g′_i(L, λ, x) = s_i(y_{M,i}(L, λ, x), y_{m,i}(L, λ, x)),

g′(L, x) = max_{i,λ} g′_i(L, λ, x) .   (7)
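A minimal sketch of the averaged representation and the dishomogeneity of Definition 8, with square windows standing in for the balls B_{L-λ}(x) and B_{2L}(x) and the simplest choice s(M, m) = M - m; the window radii and channel descriptors are left to the caller and are illustrative assumptions.

```python
import numpy as np

def box_mean(y, r):
    # Windowed mean: a square stand-in for averaging over B_{L-lam}(x)
    H, W = y.shape
    out = np.empty((H, W))
    for i in range(H):
        for j in range(W):
            out[i, j] = y[max(i - r, 0):i + r + 1, max(j - r, 0):j + r + 1].mean()
    return out

def local_extrema(y, r):
    # Max and min of y over a square window standing in for B_{2L}(x)
    H, W = y.shape
    M = np.empty((H, W)); m = np.empty((H, W))
    for i in range(H):
        for j in range(W):
            w = y[max(i - r, 0):i + r + 1, max(j - r, 0):j + r + 1]
            M[i, j], m[i, j] = w.max(), w.min()
    return M, m

def dishomogeneity(channels, r_avg, r_cmp):
    """g'(L, x): most dishomogeneous channel, with s(M, m) = M - m."""
    g = None
    for y in channels:
        M, m = local_extrema(box_mean(y, r_avg), r_cmp)
        d = M - m
        g = d if g is None else np.maximum(g, d)
    return g
```

On a perfectly homogeneous channel the averaged values are constant and g′ vanishes; a texture boundary inside the comparison window makes g′ strictly positive, as the definition intends.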
Definition 10. A piecewise homogeneous random image over the partition P is the ran-
dom function: φ(x) = Σ_{i=1}^{n} I_{R_i}(x) φ_i(x).
We then have to define what the thermodynamic limit means for this class of images.
Indeed, it is no longer possible to let L go to infinity without mixing together different types
of texture. A solution to this problem is to make the random functions φ_i "shrink" while
leaving the boundaries unchanged. Then, for φ defined as above and γ > 0 we define φ_γ
as: φ_γ(x) = Σ_{i=1}^{n} I_{R_i}(x) [φ_i]_γ(x), where [φ_i]_γ denotes φ_i rescaled by γ.
This theorem suggests that in finite images the dishomogeneity function can provide
useful information wherever the thermodynamic limit is a reasonable approximation.
A cost c_{kij} = g′_{kij} - α t_k is associated with each node, where t_k is increasing in k. The
negative term -α t_k introduces a bias towards large statistical scales. This makes it possible
to select a unique scale and to generate a consistent segmentation in those cases where the
underlying texture is near the thermodynamic limit at more than one statistical scale (for
instance, think of a checkerboard). The dishomogeneity is computed as described in Section 2
by using a filter-plus-nonlinearity descriptor of the type shown in (1). Step filters at 4 different
orientations have been used.
Fig. 4. One dimensional oversampled pyramid. The vertical displacement between intervals of
the same level has been introduced for clarity and has no other meaning. Each node of the
pyramid has a dishomogeneity value and a cost associated with it.
The algorithm is region based, i.e. its output is a partition of the image. Therefore it
assumes that boundaries form closed curves and are sharp enough everywhere. It works
in two steps: first it selects nodes in the pyramid which locally minimize the cost function
c_{kij}, and then merges neighbouring selected nodes into larger regions. In the selection phase
each pixel of the image selects a unique node (the one having the lowest cost) among all
those in which it is contained.
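The two-step algorithm (cost-based node selection followed by merging) can be sketched in one dimension as follows; the cost c_k = g′_k - α t_k and the per-pixel lowest-cost selection rule follow the text, while the flat per-pixel indexing of pyramid nodes is a simplifying assumption.

```python
import numpy as np

def select_scale(dishom, alpha, t):
    """dishom[k][i]: dishomogeneity of the (oversampled) pyramid node at
    level k covering pixel i; t[k] is increasing in k.  Each pixel selects
    the node with the lowest cost c_k = g'_k - alpha * t[k]; runs of pixels
    with equal selections are then merged into regions."""
    K, N = len(dishom), len(dishom[0])
    cost = np.array([np.asarray(dishom[k], float) - alpha * t[k] for k in range(K)])
    labels = cost.argmin(axis=0)            # per-pixel selected level
    regions, start = [], 0
    for i in range(1, N):
        if labels[i] != labels[i - 1]:      # merge neighbouring equal selections
            regions.append((start, i, int(labels[i - 1])))
            start = i
    regions.append((start, N, int(labels[-1])))
    return labels, regions
```

With two textures that become homogeneous at different levels, the pixels on each side select different scales and the merge step produces two regions with a boundary between them.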
4 Experiments
We now describe some of the experiments we have done with synthetic images. All
three images shown in this section are 256 × 256. The CPU time required to run one
image is approximately 9 minutes on a Sun SparcStation II. Most of the time goes into
the computation of y_M(L, λ, x) and y_m(L, λ, x) from ȳ_i(L, λ, x′) (see Section 2.5).
Figure 5 shows the segmentation of a collage of textures which reach the thermody-
namic limit at several statistical scales.
Figure 6-left illustrates the segmentation of an "order versus disorder" image. This
example shows that looking for the optimal statistical scale can significantly enhance
discriminative capabilities, making possible the detection of very subtle differences.
Finally, figure 6-right shows that this scheme can also be valid for textures whose
properties change smoothly across the image (as occurs when tilt or slant are present).
5 Conclusions
In this paper we have addressed the problem of scale selection in texture analysis. We
have proposed to make a clear distinction between two different scale parameters: sta-
tistical scale and feature scale. Both scale parameters should be taken into account in
Fig. 5. Top-right: a 256 × 256 textured image. The black lines are the boundaries found by
the algorithm. Left and bottom: the dishomogeneity g′(L, x) for L = 2^k, k = 1, …, 5; k grows
anti-clockwise. Homogeneous regions are black. Note that the thermodynamic limit is attained
at different statistical scales by different textures.
Fig. 6. Two 256 × 256 textured images: "Order versus disorder" and "tilted texture".
constructing image representations but they should be dealt with in very different ways.
In particular, we claim that it is necessary to find the optimal statistical scale(s) at
each location in the image. In doing this there is a natural trade-off between the reliable
estimation of image properties and the localization of texture regions. It is possible to
extract texture boundaries reliably only if a good enough trade-off can be found.
We have formalized the notion of homogeneity by the definition of homogeneous
random functions. When local operators are applied to these functions and the result is
averaged over regions of increasing size, we obtain a description of the image which is
asymptotically deterministic and space independent. In practical circumstances, we say
that the thermodynamic limit has been reached when this holds to a sufficient degree.
We have defined a dishomogeneity function and proved that in the thermodynamic limit
it is zero if and only if the analyzed region does not contain a texture boundary.
Our algorithm has performed well on images which satisfy the piecewise-homogeneous
assumption. However, it did not perform well on images which violate the piecewise
homogeneous property, mainly because in such images boundaries are not sharp enough
everywhere and are not well defined closed curves. Our node-merging phase is not robust
with respect to this problem. We are currently designing an algorithm which is more
edge-based and should be able to deal with boundaries which are not closed curves. Also,
we need to use a better set of filters.
References
1. H. Knutsson and G. H. Granlund. Texture analysis using two-dimensional quadrature fil-
ters. In Workshop on Computer Architecture for Pattern Analysis and Image Database
Management, pages 206-213. IEEE Computer Society, 1983.
2. M.R. Turner. Texture discrimination by Gabor functions. Biol. Cybern., 55:71-82, 1986.
3. J. Malik and P. Perona. Preattentive texture discrimination with early vision mechanisms.
Journal of the Optical Society of America - A, 7(5):923-932, 1990.
4. A.C. Bovik, M. Clark, and W.S. Geisler. Multichannel texture analysis using localized
spatial filters. IEEE Trans. Pattern Anal. Machine Intell., 12(1):55-73, 1990.
5. B. Julesz. Visual pattern discrimination. IRE Transactions on Information Theory, IT-8,
pages 84-92, 1962.
6. R. L. Kashyap and K. Eom. Texture boundary detection based on the long correlation
model. IEEE Trans. Pattern Anal. Machine Intell., 11:58-67, 1989.
7. D. Geman, S. Geman, C. Graffigne, and P. Dong. Boundary detection by constraint opti-
mization. IEEE Trans. Pattern Anal. Machine Intell., 12(7):609, 1990.
8. R. Wilson and G.H. Granlund. The uncertainty principle in image processing. IEEE Trans.
Pattern Anal. Machine Intell., 6(6):758-767, Nov. 1984.
9. M. Spann and R. Wilson. A quad-tree approach to image segmentation which combines
statistical and spatial information. Pattern Recogn., 18:257-269, 1985.
10. S. Casadei. Multiscale image segmentation by dishomogeneity evaluation and local opti-
mization. Master's thesis, MIT, Cambridge, MA, May 1991.
11. S. Casadei, S. Mitter, and P. Perona. Boundary detection in piecewise homogeneous tex-
tured images. Technical Report (to appear), MIT, Cambridge, MA.
This article was processed using the LaTeX macro package with ECCV92 style.
Surface Orientation and Time to Contact from
Image Divergence and Deformation
1 Introduction
Relative motion between an observer and a scene induces deformation in image detail
and shape. If these changes are smooth they can be economically described locally by
the first order differential invariants of the image velocity field [16] - the curl (vorticity),
divergence (dilatation), and shear (deformation) components. These invariants have sim-
ple geometrical meanings which do not depend on the particular choice of co-ordinate
system. Moreover they are related to the three dimensional structure of the scene and
the viewer's motion - in particular the surface orientation and the time to contact² - in
a simple geometrically intuitive way. Better still, the divergence and deformation com-
ponents of the image velocity field are unaffected by arbitrary viewer rotations about
the viewer centre. They therefore provide an efficient, reliable way of recovering these
parameters.
Although the analysis of the differential invariants of the image velocity field has
attracted considerable attention [16, 14] their application to real tasks requiring visual
inferences has been disappointingly limited [23, 9]. This is because existing methods have
failed to deliver reliable estimates of the differential invariants when applied to real im-
ages. They have attempted the recovery of dense image velocity fields [4] or the accurate
extraction of points or corner features [14]. Both methods have attendant problems con-
cerning accuracy and numerical stability. An additional problem concerns the domain of
* Toshiba Fellow, Toshiba Research and Development Center, Kawasaki 210, Japan.
2 The time duration before the observer and object collide if they continue with the same
relative translational motion [10, 20]
applications to which estimates of differential invariants can be usefully applied. First or-
der invariants of the image velocity field at a single point in the image cannot be used to
provide a complete description of shape and motion as attempted in numerous structure
from motion algorithms [27]. This in fact requires second order spatial derivatives of the
image velocity field [21, 29]. Their power lies in their ability to efficiently recover reliable
but incomplete solutions to the structure from motion problem which can be augmented
with other information to accomplish useful visual tasks.
The reliable, real-time extraction of these invariants from image data and their ap-
plication to visual tasks will be addressed in this paper. First we present a novel method
to measure the differential invariants of the image velocity field robustly by computing
average values from the integral of simple functions of the normal image velocities around
image contours. This is equivalent to measuring the temporal changes in the area of a
closed contour and avoids having to recover a dense image velocity field and taking partial
derivatives. It also does not require point or line correspondences. Moreover integration
provides some immunity to image measurement noise.
Second we show that the 3D interpretation of the differential invariants of the image
velocity field is especially suited to the domain of active vision in which the viewer makes
deliberate (although sometimes imprecise) motions, or in stereo vision, where the relative
positions of the two cameras (eyes) are constrained while the cameras (eyes) are free to
make arbitrary rotations (eye movements). Estimates of the divergence and deformation
of the image velocity field, augmented with constraints on the direction of translation, are
then sufficient to efficiently determine the object surface orientation and time to contact.
The results of preliminary real-time experiments in which arbitrary image shapes are
tracked using B-spline snakes [6] are presented. The invariants are computed as closed-
form functions of the B-spline snake control points. This information is used to guide a
robot manipulator in obstacle collision avoidance, object manipulation and navigation.
2.1 Review
For a sufficiently small field of view (defined precisely in [26, 5]) and smooth change
in viewpoint the image velocity field and the change in apparent image shape is well
approximated by a linear (affine) transformation [16]. The latter can be decomposed
into independent components which have simple geometric interpretations. These are an
image translation (specifying the change in image position of the centroid of the shape);
a 2D rigid rotation (vorticity), specifying the change in orientation, curl v; an isotropic
expansion (divergence), specifying a change in scale, div v; and a pure shear or deformation,
which describes the distortion of the image shape (expansion in a specified direction with
contraction in a perpendicular direction in such a way that area is unchanged), described
by a magnitude, def v, and the orientation of the axis of expansion (maximum extension),
μ. These quantities can be defined as combinations of the partial derivatives of the image
velocity field, v = (u, v), at an image point (x, y):

div v = u_x + v_y ,   (1)
curl v = v_x - u_y ,   (2)
(def v) cos 2μ = u_x - v_y ,   (3)
(def v) sin 2μ = u_y + v_x ,   (4)
where subscripts denote differentiation with respect to the subscript parameter. The curl,
divergence and the magnitude of the deformation are scalar invariants and do not depend
on the particular choice of image co-ordinate system [16, 14]. The axes of maximum
extension and contraction change with rotations of the image plane axes.
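The invariants are simple combinations of the four first-order partial derivatives of the velocity field; a minimal sketch using the standard definitions:

```python
import numpy as np

def invariants(ux, uy, vx, vy):
    """First-order differential invariants of v = (u, v) from its
    partial derivatives (standard definitions, cf. [16, 14])."""
    div = ux + vy                            # isotropic expansion
    curl = vx - uy                           # vorticity
    defm = np.hypot(ux - vy, uy + vx)        # shear magnitude |def v|
    mu = 0.5 * np.arctan2(uy + vx, ux - vy)  # axis of maximum extension
    return div, curl, defm, mu
```

A pure expansion (u_x = v_y, u_y = v_x = 0) gives zero curl and deformation, while a pure shear (u_x = -v_y) contributes only to def v, matching the decomposition above.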
2.2 Relation to 3D Shape and Viewer Motion
The differential invariants depend on the viewer motion (translational velocity, U, and
rotational velocity, Ω), the depth, λ, and the relation between the viewing direction (ray
direction Q) and the surface orientation in a simple and geometrically intuitive way. Before
summarising these relationships let us define two 2D vector quantities: the component of
translational velocity parallel to the image plane scaled by depth, A, where:

A = (U - (U·Q)Q) / λ   (5)

and the depth gradient scaled by depth, F, to represent the surface orientation, which
we define in terms of the 2D vector gradient:

F = (grad λ) / λ   (6)
The magnitude of the depth gradient, |F|, determines the tangent of the slant of the
surface (the angle between the surface normal and the visual direction). It vanishes for a
frontal view and is infinite when the viewer is in the tangent plane of the surface. Its
direction, ∠F, specifies the direction in the image of increasing distance. This is equal to
the tilt of the surface tangent plane. The exact relationship between the magnitude and
direction of F and the slant and tilt of the surface (σ, τ) is given by:

|F| = tan σ ,   (7)
∠F = τ .   (8)
With this new notation the relations between the differential invariants, the motion
parameters and the surface position and orientation are given by [15]:

curl v = -2 Ω·Q + |F| |A| sin(τ - ∠A)   (9)
div v = 2 U·Q / λ + |F| |A| cos(τ - ∠A)   (10)
|def v| = |F| |A|   (11)
μ = (∠A + ∠F) / 2   (12)
The geometric significance of these equations is easily seen with a few examples. For
example, a translation towards the surface patch leads to a uniform expansion in the
image, i.e. positive divergence. This encodes the distance to the object which, due to the
speed-scale ambiguity⁴, is more conveniently expressed as a time to contact, t_c:

t_c = λ / (U·Q) .
The analysis above treated the differential invariants as observables of the image. There
are a number of ways of extracting the differential invariants from the image. These are
summarised below before presenting a novel method based on the temporal derivatives
of the moments of the area enclosed by a closed curve.
3.1 Summary of Existing Methods
1. Partial derivatives of image velocity field
4 Translational velocities appear scaled by depth making it impossible to determine whether
the effects are due to a nearby object moving slowly or a far-away object moving quickly.
5 This is somewhat related to the reliable estimation of relative depth from the relative image
velocities of two nearby points - motion parallax [21, 24, 6]. Both motion parallax and the
deformation of the image velocity field relate local measurements of relative image velocities
to scene structure in a simple way which is uncorrupted by the rotational image velocity
component. In the case of parallax, the depths are discontinuous and differences of discrete
velocities are related to the difference of inverse depths. Equation (11), on the other hand,
assumes a smooth and continuous surface, and derivatives of image velocities are related to
derivatives of inverse depth.
3.2 Recovery of Invariants from Area Moments of Closed Contours
It has been shown that the differential invariants of the image velocity field conveniently
characterise the changes in apparent shape due to relative motion between the viewer and
scene. Contours in the image sample this image velocity field. It is usually only possible,
however, to recover the normal image velocity component from local measurements at a
curve [27, 12]. It is now shown that this information is often sufficient to estimate the
differential invariants within closed curves.
Our approach is based on relating the temporal derivative of the area of a closed
contour and its moments to the invariants of the image velocity field. This is a general-
isation of the result derived by Maybank [22] in which the rate of change of area scaled
by area is used to estimate the divergence of the image velocity field. The advantage is
that point or line correspondences are not used. Only the correspondence between shapes
is required. The computationally difficult, ill-conditioned and poorly defined process of
making explicit the full image velocity field [12] is avoided. Moreover, since taking tem-
poral derivatives of area (and its moments) is equivalent to the integration of normal
image velocities (scaled by simple functions) around closed contours our approach is ef-
fectively computing average values of the differential invariants (not point properties)
and has better immunity to image noise leading to reliable estimates. Areas can also be
estimated accurately, even when the full set of first order derivatives can not be obtained.
The moments of area of a contour, I_f, are defined in terms of an area integral with
boundaries defined by the contour in the image plane:

I_f = ∫∫_{a(t)} f(x, y) dx dy   (14)
where a(t) is the area of a contour of interest at time t and f is a scalar function of image
position (x, y) that defines the moment of interest. For instance, setting f = 1 gives us
area; setting f = x or f = y gives the first-order moments about the image x and y axes
respectively.
The moments of area can be measured directly from the image (see below for a novel
method involving the control points of a B-spline snake). Better still, their temporal
derivatives can also be measured. Differentiating (14) with respect to time and using a
result from calculus 6 we can relate the temporal derivative of the moment of area to an
integral of the normal component of image velocities at an image contour, v.n, weighted
by a scalar f ( x , y). By Green's theorem, this integral around the contour e(t), can be
re-expressed as an integral over the area enclosed by the contour, a(t).
d/dt (I_f) = ∮_{c(t)} [f v·n] ds   (15)

          = ∫∫_{a(t)} [div(f v)] dx dy   (16)

          = ∫∫_{a(t)} [f div v + (v·grad f)] dx dy .   (17)
If the image velocity field, v, can be represented by constant partial derivatives in the area
of interest, substituting the coefficients of the affine transformation for the velocity field
into (17) leads to a linear equation in which the left hand side is the temporal derivative of
the moment of area described by f (which can be measured, see below) while the integrals
on the right-hand side are moments of area (also directly measurable). The coefficients
of each term are the required parameters of the affine transformation. In summary, the
6 This equation can be derived by considering the flux linking the area of the contour. This
changes with time since the contour is carried by the velocity field. The flux field, f, in our
example does not change with time. Similar integrals appear in fluid mechanics, e.g. the flux
transport theorem [8].
image velocity field deforms the shape of contours in the image. Shape can be described
by moments of area. Hence measuring the change in the moments of area is an alternative
way of characterising the transformation. In this way the change in the moments of area
has been expressed in terms of the parameters of the affine transformation.
If we initially set up the x - y co-ordinate system at the centroid of the image contour
of interest so that the first moments are zero, (17) with f = x and f = y shows that
the centroid of the deformed shape specifies the mean translation [u0, v0]. Setting f = 1
leads to the simple and useful result that the divergence of the image velocity field can
be estimated as the derivative of area scaled by area:

div v = (da/dt) / a .   (18)
Increasing the order of the moments, i.e. different values of f ( x , y), generates new equa-
tions and additional constraints. In principle, if it is possible to find six linearly indepen-
dent equations, we can solve for the affine transformation parameters and combine the
coefficients to recover the differential invariants. The validity of the affine approximation
can be checked by looking at the error between the transformed and observed image
contours. The choice of which moments to use is a subject for further work. Listed below
are some of the simplest equations which have been useful in the experiments presented
here.
d/dt [ I_1 ]   [ 0      0      I_1     0       0       I_1    ] [ u_0 ]
     [ I_x ]   [ I_1    0      2I_x    I_y     0       I_x    ] [ v_0 ]
     [ I_y ] = [ 0      I_1    I_y     0       I_x     2I_y   ] [ u_x ]   (19)
     [ I_xx]   [ 2I_x   0      3I_xx   2I_xy   0       I_xx   ] [ u_y ]
     [ I_xy]   [ I_y    I_x    2I_xy   I_yy    I_xx    2I_xy  ] [ v_x ]
     [ I_yy]   [ 0      2I_y   I_yy    0       2I_xy   3I_yy  ] [ v_y ]

(Here I_1 = a is the area, and the rows follow from substituting f = 1, x, y, x², xy and y² into (17).)
(Note that in this equation subscripts are used to label the moments of area. The left-hand
side represents the temporal derivative of the moments in the column vector.) In practice
certain contours may lead to equations which are not independent and their solution is
ill-conditioned. The interpretation of this is that the normal components of image velocity
are insufficient to recover the true image velocity field globally, e.g. a fronto-parallel circle
rotating about the optical axis. This was termed the "aperture problem in the large" by
Waxman and Wohn [30] and investigated by Bergholm and Carlsson [2]. Note however,
that it is always possible to recover the divergence from a closed contour.
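The divergence estimate from a closed contour is easy to sketch: the contour is sampled as a polygon, its area computed by the shoelace formula, and div v estimated as the relative rate of change of area, the result attributed to Maybank [22] above. The central-value normalisation below is an illustrative choice.

```python
import numpy as np

def polygon_area(pts):
    """Shoelace formula for a closed polygon sampled along the contour."""
    x, y = pts[:, 0], pts[:, 1]
    return 0.5 * abs(np.sum(x * np.roll(y, -1) - y * np.roll(x, -1)))

def divergence_from_areas(a0, a1, dt):
    """div v ~ (da/dt)/a: relative rate of change of the enclosed area."""
    return (a1 - a0) / (dt * 0.5 * (a0 + a1))
```

For a contour carried by a uniform expansion of rate d, the area grows as exp(d t), and the finite-difference estimate recovers d.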
Applications of the estimates of the image divergence and deformation of the image
velocity field are summarised below. It has already been noted that measurement of the
differential invariants in a single neighbourhood is insufficient to completely solve for
the structure and motion, since (9,10,11,12) are four equations in the six unknowns of
scene structure and motion. In a single neighbourhood a complete solution would require
the computation of second order derivatives [21, 29] to generate sufficient equations to
solve for the unknowns. Even then the solution of the resulting set of non-linear equations
is non-trivial.
In the following, the information available from the first-order differential invariants
alone is investigated. It will be seen that the differential invariants are sufficient to con-
strain surface position and orientation and that this partial solution can be used to
perform useful visual tasks when augmented with additional information. Useful appli-
cations include providing information which is used by pilots when landing aircraft [10],
estimating time to contact in braking reactions [20] and in the recovery of 3D shape up
to a relief transformation [18, 19]. We now show how surface orientation and position
(expressed as a time to contact) can be recovered from the estimates of image divergence
and the magnitude and axis of the deformation.
1. With knowledge of translation but arbitrary rotation
An estimate of the direction of translation is usually available when the viewer is
making deliberate movements (in the case of active vision) or in the case of binocular
vision (where the camera or eye positions are constrained). It can also be estimated
from image measurements by motion parallax [21, 24].
If the viewer translation is known, (10),(11) and (12) are sufficient to unambiguously
recover the surface orientation and the distance to the object in temporal units. Due
to the speed-scale ambiguity the latter is expressed as a time to contact. A solution
can be obtained in the following way.
(a) The axis of expansion (μ) of the deformation component and the projection in
the image of the direction of translation (∠A) allow the recovery of the tilt of
the surface from (12).
(b) We can then subtract the contribution due to the surface orientation and viewer
translation parallel to the image axis from the image divergence (10). This is
equal to |def v| cos(τ - ∠A). The remaining component of divergence is due to
movement towards or away from the object. This can be used to recover the time
to contact, t_c. This can be recovered despite the fact that the viewer translation
may not be parallel to the visual direction.
(c) The time to contact fixes the viewer translation in temporal units. It allows the
specification of the magnitude of the translation parallel to the image plane (up
to the same speed-scale ambiguity), A. The magnitude of the deformation can
then be used to recover the slant, σ, of the surface from (11).
The advantage of this formulation is that camera rotations do not affect the estima-
tion of shape and distance. The effects of errors in the direction of translation are
clearly evident as scalings in depth or by a relief transformation [15].
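Steps (a)-(c) can be sketched numerically. The relations assumed here are div v = 2/t_c + |def v| cos(τ - ∠A) (from (10)), tan σ = |def v|/|A| (from (11)) and τ = 2μ - ∠A (from (12)); |A| is taken as an input rather than derived from t_c as in the text, which is a simplifying assumption.

```python
import numpy as np

def recover_orientation_and_ttc(div, defm, mu, angle_A, mag_A):
    """Recover tilt, time to contact and slant from the invariants,
    given the (assumed known) image direction angle_A and scaled
    magnitude |A| of the viewer translation."""
    tilt = 2.0 * mu - angle_A                           # step (a), from (12)
    t_c = 2.0 / (div - defm * np.cos(tilt - angle_A))   # step (b), from (10)
    slant = np.arctan2(defm, mag_A) if mag_A > 0 else 0.0  # step (c), from (11)
    return tilt, t_c, slant
```

For a frontal surface (zero deformation) the divergence is entirely due to motion along the ray and t_c = 2/div v, in agreement with the bounds discussed below.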
2. With fixation
If the cameras or eyes rotate to keep the object of interest in the middle of the image
(null the effect of image translation) the magnitude of the rotations needed to bring
the object back to the centre of the image determines A and hence allows us to solve
for surface orientation, as above. Again the major effect of any error in the estimate
of rotation is to scale depth and orientations.
3. With no additional information - constraints on motion
Even without any additional assumptions it is still possible to obtain useful infor-
mation from the first-order differential invariants. The information obtained is best
expressed as bounds. For example inspection of (10) and (11) shows that the time to
contact must lie in an interval given by:
(div v)/2 - |def v|/2 ≤ 1/t_c ≤ (div v)/2 + |def v|/2 .   (20)
The upper bound on time to contact occurs when the component of viewer transla-
tion parallel to the image plane is in the opposite direction to the depth gradient.
The lower bound occurs when the translation is parallel to the depth gradient. The
upper and lower estimates of time to contact are equal when there is no deformation
component. This is the case in which the viewer translation is along the ray or when
viewing a fronto-parallel surface (zero depth gradient locally). The estimate of time
to contact is then exact. A similar equation was recently described by Subbarao [25].
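The interval (20) can be evaluated directly; a small helper converting measured divergence and deformation into earliest and latest time-to-contact estimates:

```python
def ttc_bounds(div, defm):
    """Time-to-contact interval implied by (20):
    (div - |def|)/2 <= 1/t_c <= (div + |def|)/2."""
    hi_rate = 0.5 * (div + defm)   # largest 1/t_c -> earliest contact
    lo_rate = 0.5 * (div - defm)   # smallest 1/t_c -> latest contact
    earliest = 1.0 / hi_rate if hi_rate > 0 else float('inf')
    latest = 1.0 / lo_rate if lo_rate > 0 else float('inf')
    return earliest, latest
```

When the deformation vanishes the two bounds coincide and the estimate is exact, as noted in the text.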
4. With no additional information - the constraints on 3D shape
Koenderink and Van Doorn [18] showed that surface shape information can be ob-
tained by considering the variation of the deformation component alone in a small field
of view when weak perspective is a valid approximation. This allows the recovery of
3D shape up to a scale and relief transformation. That is, they effectively recover the
axis of rotation of the object but not the magnitude of the turn. This yields a family
of solutions depending on the magnitude of the turn. Fixing the latter determines the
slants and tilts of the surface. This has recently been extended in the affine structure
from motion theorem [19].
The solutions presented above use knowledge of a single viewer translation and mea-
surement of the divergence and deformation of the image velocity field. An alternative
solution exists if the observer is free to translate along the ray and also in two orthogonal
directions parallel to the image plane. In this case measurement of divergence alone is
sufficient to recover the surface orientation and the time to contact.
5.1 Tracking Closed Loop Contours
The implementation and results follow. Multi-span closed loop B-spline snakes [6] are
used to localise and track closed image contours. The B-spline is a curve in the image
plane

x(s) = Σ_i f_i(s) q_i   (21)

where f_i are the spline basis functions with coefficients q_i (the control points of the curve)
and s is a curve parameter (not necessarily arc length) [1]. The snakes are initialised as
points in the centre of the image and are forced to expand radially outwards until they
are in the vicinity of an edge, where image "forces" make the snake stabilise close to a
high contrast closed contour. Subsequent image motion is automatically tracked by the
snake [5].
B-spline snakes have useful properties such as local control and continuity. They also
compactly represent image curves. In our applications they have the additional advantage
that the area enclosed is a simple function of the control points. This also applies to the
other area moments. From Green's theorem in the plane it is easy to show that the area
enclosed by a curve with parameterisation x(s) and y(s) is given by:
a = ∮ x(s) y′(s) ds   (22)
where x(s) and y(s) are the two components of the image curve and y′(s) is the derivative
with respect to the curve parameter s. For a B-spline, substituting (21) and its derivative:
Note that for each span of the B-spline and at each time instant the basis functions
remain unchanged. The integrals can thus be computed off-line in closed form. (At most
16 coefficients need be stored. In fact due to symmetry there are only 10 possible values
for a cubic B-spline). At each time instant multiplication with the control point positions
gives the area enclosed by the contour. This is extremely efficient, giving the exact area
enclosed by the contour. The same method can be used for higher moments of area as
well. The temporal derivatives of the area and its moments are then used to estimate image
divergence and deformation.
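A sketch of the area computation for a uniform closed cubic B-spline. Instead of the closed-form basis-integral coefficients described in the text, this version simply samples the curve densely and applies the shoelace formula to the Green's-theorem integral; it converges to the same area but without the off-line precomputation.

```python
import numpy as np

def bspline_point(q, s):
    """Point on a uniform closed cubic B-spline with control points q
    (shape (n, 2)) at global parameter s in [0, n)."""
    n = len(q)
    i = int(np.floor(s)) % n
    t = s - np.floor(s)
    # uniform cubic B-spline basis functions at local parameter t
    b = np.array([(1 - t) ** 3,
                  3 * t ** 3 - 6 * t ** 2 + 4,
                  -3 * t ** 3 + 3 * t ** 2 + 3 * t + 1,
                  t ** 3]) / 6.0
    idx = [(i - 1) % n, i, (i + 1) % n, (i + 2) % n]
    return b @ q[idx]

def spline_area(q, samples_per_span=32):
    """a = (1/2) |oint (x y' - y x') ds| by dense sampling + shoelace."""
    q = np.asarray(q, float)
    n = len(q)
    s = np.arange(n * samples_per_span) / samples_per_span
    pts = np.array([bspline_point(q, si) for si in s])
    x, y = pts[:, 0], pts[:, 1]
    return 0.5 * abs(np.sum(x * np.roll(y, -1) - y * np.roll(x, -1)))
```

Because the cubic B-spline is a smoothing approximation, control points on a unit circle yield a curve of slightly smaller radius (about 0.9 for eight control points), so the enclosed area is a little under π.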
5.2 Applications
Here we present the results of a preliminary implementation of the theory. The examples
are based on a camera mounted on a robot arm whose translations are deliberate while
the rotations around the camera centre are performed to keep the target of interest in the
centre of its field of view. The camera intrinsic parameters (image centre, scaling factors
and focal length) and orientation are unknown. The direction of translation is assumed
known and expressed with bounds due to uncertainty.
t_c(0) = λ(0) / (U·Q)   (27)
This is in close agreement with the data (Fig. 2a). This is more easily seen if we look at
the variation of the time to contact with time. For uniform motion this should decrease
linearly. The experimental results are plotted in Fig. 2b. These are obtained by dividing
the area of the contour at a given time by its temporal derivative (estimated by finite
differences). The variation is linear, as predicted. These results are of useful accuracy,
predicting the collision time to the nearest half time unit (corresponding to 50 cm in this
example).
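The time-to-contact experiment can be reproduced in simulation. Under uniform approach the image area grows as 1/λ(t)², so t_c recovered from the area and its finite-difference derivative should decrease linearly, as in Fig. 2b. The factor 2 below comes from the assumed a ∝ 1/λ² scaling; the text's description (area divided by its temporal derivative) differs only by this constant factor.

```python
import numpy as np

def time_to_contact_series(areas, dt):
    """t_c per frame from the contour area and its finite-difference
    derivative; the factor 2 assumes area scales as 1/depth^2."""
    a = np.asarray(areas, float)
    dadt = (a[1:] - a[:-1]) / dt
    return 2.0 * a[:-1] / dadt

# Simulated uniform approach: depth lam(t) = lam0 - U t, image area ~ 1/lam^2
lam0, U, dt = 10.0, 1.0, 0.1
t = np.arange(50) * dt
areas = 1.0 / (lam0 - U * t) ** 2
tc = time_to_contact_series(areas, dt)   # should fall linearly toward zero
```

The recovered series tracks the true value λ(t)/U with only a small constant bias from the forward difference.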
For non-uniform motion the profile of the time to contact as a function of time is a
very important cue for braking and landing reactions [20].
Qualitative visual navigation Existing techniques for visual navigation have typically
used stereo or the analysis of image sequences to determine the camera ego-motion and
then the 3D positions of feature points. The 3D data are then analysed to determine, for
example, navigable regions, obstacles or doors. An example of an alternative approach
6 Conclusions
We have presented a simple and efficient method for estimating image divergence and
deformation by tracking closed image contours with B-spline snakes. This information
has been successfully used to estimate surface orientation and time to contact.
Acknowledgements
The authors acknowledge discussions with Mike Brady, Kenichi Kanatani, Christopher
Longuet-Higgins, and Andrew Zisserman. This work was partially funded by Esprit BRA
3274 (FIRST) and the SERC. Roberto Cipolla also gratefully acknowledges the support
of the IBM UK Scientific Centre, St. Hugh's College, Oxford and the Toshiba Research
and Development Centre, Japan.
References
1. R.H. Bartels, J.C. Beatty, and B.A. Barsky. An Introduction to Splines for Use in Computer
Graphics and Geometric Modeling. Morgan Kaufmann, 1987.
2. F. Bergholm. Motion from flow along contours: a note on robustness and ambiguous case.
Int. Journal of Computer Vision, 3:395-415, 1989.
3. J.D. Boissonnat. Representing solids with the Delaunay triangulation. In Proc. ICPR, pages
745-748, 1984.
4. M. Campani and A. Verri. Computing optical flow from an overconstrained system of linear
algebraic equations. In Proc. 3rd Int. Conf. on Computer Vision, pages 22-26, 1990.
5. R. Cipolla. Active Visual Inference of Surface Shape. PhD thesis, University of Oxford,
1991.
6. R. Cipolla and A. Blake. The dynamic analysis of apparent contours. In Proc. 3rd Int.
Conf. on Computer Vision, pages 616-623, 1990.
7. R. Cipolla and P. Kovesi. Determining object surface orientation and time to impact from
image divergence and deformation. (University of Oxford (Memo)), 1991.
8. H.F. Davis and A.D. Snider. Introduction to vector analysis. Allyn and Bacon, 1979.
9. E. Francois and P. Bouthemy. Derivation of qualitative information in motion analysis.
Image and Vision Computing, 8(4):279-288, 1990.
10. J.J. Gibson. The Ecological Approach to Visual Perception. Houghton Mifflin, 1979.
11. C.G. Harris. Structure from motion under orthographic projection. In O. Faugeras, editor,
Proc. 1st European Conference on Computer Vision, pages 118-123. Springer-Verlag, 1990.
12. E.C. Hildreth. The measurement of visual motion. The MIT press, Cambridge Mas-
sachusetts, 1984.
13. K. Kanatani. Detecting the motion of a planar surface by line and surface integrals. Com-
puter Vision, Graphics and Image Processing, 29:13-22, 1985.
14. K. Kanatani. Structure and motion from optical flow under orthographic projection. Com-
puter Vision, Graphics and Image Processing, 35:181-199, 1986.
15. J.J. Koenderink. Optic flow. Vision Research, 26(1):161-179, 1986.
16. J.J. Koenderink and A.J. Van Doorn. Invariant properties of the motion parallax field due
to the movement of rigid bodies relative to an observer. Optica Acta, 22(9):773-791, 1975.
17. J.J. Koenderink and A.J. Van Doorn. How an ambulant observer can construct a model
of the environment from the geometrical structure of the visual inflow. In G. Hauske and
E. Butenandt, editors, Kybernetik. Oldenburg, Munchen, 1978.
18. J.J. Koenderink and A.J. Van Doorn. Depth and shape from differential perspective in the
presence of bending deformations. J. Opt. Soc. Am., 3(2):242-249, 1986.
19. J.J. Koenderink and A.J. van Doorn. Affine structure from motion. Journal of the Optical
Society of America, 1991.
20. D.N. Lee. The optic flow field: the foundation of vision. Phil. Trans. R. Soc. Lond., 290,
1980.
21. H.C. Longuet-Higgins and K. Prazdny. The interpretation of a moving retinal image. Proc.
R. Soc. Lond., B208:385-397, 1980.
22. S. J. Maybank. Apparent area of a rigid moving body. Image and Vision Computing,
5(2):111-113, 1987.
23. R.C. Nelson and J. Aloimonos. Using flow field divergence for obstacle avoidance: towards
qualitative vision. In Proc. 2nd Int. Conf. on Computer Vision, pages 188-196, 1988.
24. J.H. Rieger and D.L. Lawton. Processing differential image motion. J. Optical Soc. of
America, A2(2), 1985.
25. M. Subbarao. Bounds on time-to-collision and rotational component from first-order
derivatives of image flow. Computer Vision, Graphics and Image Processing, 50:329-341,
1990.
26. D.W. Thompson and J.L. Mundy. Three-dimensional model matching from an uncon-
strained viewpoint. In Proceedings of IEEE Conference on Robotics and Automation, 1987.
27. S. Ullman. The interpretation of visual motion. MIT Press, Cambridge, USA, 1979.
28. H. Wang, C. Bowman, M. Brady, and C. Harris. A parallel implementation of a structure
from motion algorithm. In Proc. 2nd European Conference on Computer Vision, 1992.
29. A.M. Waxman and S. Ullman. Surface structure and three-dimensional motion from image
flow kinematics. Int. Journal of Robotics Research, 4(3):72-94, 1985.
30. A.M. Waxman and K. Wohn. Contour evolution, neighbourhood deformation and global
image flow: planar surfaces in motion. Int. Journal of Robotics Research, 4(3):95-108, 1985.
Four samples of a video sequence taken from a moving observer approaching a stationary car
at a uniform velocity (approximately 1m per time unit). A B-spline snake automatically tracks
the area of the rear windscreen (Fig. 2a). The image divergence is used to estimate the time to
contact (Fig. 2b). The next image in the sequence corresponds to collision!
Fig. 2. Apparent area of windscreen for approaching observer and the estimated time to contact.
Fig. 3. Visually guided object manipulation using image divergence and deformation.
(a) The image of a planar contour (zero tilt and positive slant, i.e. the direction of increasing
depth, F, is horizontal and from left to right). The image contour is localised automatically by
a B-spline snake initialised in the centre of the field of view. (b) The effect on apparent shape
when the viewer translates to the right while fixating on the target (i.e. A is horizontal, left to
right). The apparent shape undergoes an isotropic expansion (positive divergence which increases
the area) and a deformation in which the axis of expansion is horizontal. Measurement of the
divergence and deformation can be used to estimate the time to contact and surface orientation.
This is used to guide the manipulator so that it comes to rest perpendicular to the surface with
a pre-determined clearance. Estimates of divergence and deformation made approximately 1m
away were sufficient to estimate the target object position and orientation to the nearest 2cm
in position and 1° in orientation. This information is used to position a suction gripper in the
vicinity of the surface. A contact sensor and small probing motions can then be used to refine
the estimate of position and guide the suction gripper before manipulation (d).
(a) The image of a door and an object of interest, a pallet. (b) Movement towards the door
and pallet produces a deformation in the image seen as an expansion in the apparent area of
the door and pallet. This can be used to determine the distance to these objects, expressed as a
time to contact - the time needed for the viewer to reach the object if it continued with the same
speed. (c) A movement to the left produces combinations of image deformation, divergence and
rotation. This is immediately evident from both the door (positive deformation and a shear with
a horizontal axis of expansion) and the pallet (clockwise rotation with shear with diagonal axis of
expansion). These effects, combined with knowledge of the movement between the images, are
consistent with the door having zero tilt, i.e. horizontal direction of increasing depth, while
the pallet has a tilt of approximately 90°, i.e. vertical direction of increasing depth. They are
sufficient to determine the orientation of the surface qualitatively (d). This has been done with
no knowledge of the intrinsic properties of the camera (camera calibration), its orientations or
the translational velocities. Estimates of divergence and deformation can also be recovered by
comparison of apparent areas and the orientation of edge segments.
Robust and fast computation of unbiased intensity
derivatives in images
1 Introduction
Edges are important features in an image. Detecting them in static images is now a well
understood problem. In particular, an optimal edge-detector using Canny's criterion has
been designed [8,7]. In subsequent studies this method has been generalized to the com-
putation of 3D-edges [5]. This edge-detector however has not been designed to compute
edge geometric and dynamic characteristics, such as curvature and velocity.
It is also well known that robust estimates of the image geometric and dynamic
characteristics should be computed at points in the image with high contrast, that is, at
edges. Several authors have attempted to combine an edge-detector with other operators
in order to obtain a relevant estimate of some components of the image features, or of the
motion field [2], but they use the same derivative operators for both problems.
However, it is not obvious that the computation of edge characteristics should be done
in the same way as edge detection, and this is the question we analyse in this paper.
Since edge geometric characteristics are related to the spatial derivatives of the picture
intensity [2], we have to study how to compute "good" intensity derivatives, that is,
derivatives suitable for estimating edge characteristics.
In this paper, we attempt to answer this question, and propose a way to compute
image optimal intensity derivatives, in the discrete case.
- A derivative filter is unbiased if it outputs only the required derivative, and no lower
or higher order derivatives of the signal.
- Among these filters, a derivative filter is optimal if it minimizes the noise present in
the signal. In our case we minimize the output noise.
Please note that we are not dealing with filters for detecting edges here, but rather
- edges having been already detected - with derivative filters to compute edge charac-
teristics. It is thus not relevant to consider other criteria used in optimal edge detection
such as localization or false edge detection [1].
In fact, spatial derivatives are often computed in order to detect edges with accuracy
and robustness. The performance of edge detectors is given in terms of localization and
signal-to-noise ratio [1,8]. Although the related operators are optimal for this task, they might
not be suitable to compute unbiased intensity derivatives on the detected edge. Moreover
it has been pointed out [9] that an important requirement of derivative filters, in the
case where one wants to use differential equations of the intensity is the preservation
of the intensity derivatives, which is not the case for the usual filters. However, this author
limits his discussion to Gaussian filters, whereas we would like to derive a general set of
optimal filters for the computation of temporal or spatial derivatives. We are first going
to demonstrate some properties of such filters in the continuous or discrete case and then
use an equivalent formulation in the discrete case.
Minimizing the output noise In the last paragraph we obtained a set of conditions
for unbiasedness. Among all filters which satisfy these conditions, let us compute the
best one, according to a criterion related to the noise.
The mean-squared noise response, for a white noise of variance 1, is (see [1], for
instance) \int f_r(t)^2 dt, and a reasonable optimality condition is to find the filter
which minimizes this quantity while satisfying the constraints given by equation (1).
Introducing Lagrange multipliers \lambda_p, this may be written as the minimization of:

    J = \frac{1}{2} \int f_r(t)^2 dt - \sum_{p=0}^{Q} \lambda_p \left( \int f_r(t) \frac{t^p}{p!} dt - \delta_{pr} \right)
From the calculus of variations one can derive the Euler equation, which is a necessary
condition and which, together with the constraints, turns out also to be sufficient in our
case, since we have a positive quadratic criterion with linear constraints.
The optimal filter equations (Euler equations and constraints) are then:

    f_r(t) = \sum_{p=0}^{Q} \lambda_p t^p, \qquad \int f_r(t) \frac{t^q}{q!} dt = \delta_{qr}, \quad 0 \le q \le Q
These equations are necessary conditions for the filter to be optimal, and they yield
polynomial filters. Since polynomials are square-integrable only on finite supports, these
equations have a solution if and only if f_r(t) has a finite support. That is, we obtain
optimal filters minimizing the output noise only on a finite window.
These equations have the following consequence: the optimal derivative filter is a
polynomial filter and is thus only defined on a finite window; otherwise the Euler equations
are no longer defined. In fact, we also studied infinite response filters, but arrived at a
negative answer: even when considering special families of infinite response filters (such as
products of polynomials with Gaussians or exponentials) and applying the same constrained
optimality criterion, it is not possible to obtain analytic filters as an infinite series in the
original basis functions, because the summation is divergent (see however section 2.5
for a discussion of sub-optimal solutions).
We thus have to work on finite windows, and in this case we can compute the values
of \lambda_p from a set of linear equations, since from the Euler equation and the constraints
we obtain:

    \sum_{p=0}^{Q} \lambda_p \int \frac{t^{p+q}}{q!} dt = \delta_{qr}    (2)

for 0 \le q \le Q.
Equations (2) define a unique optimal unbiased r-order filter. However, if f_r is this
optimal unbiased r-order filter, f_{r+1} = f_r' is not the optimal unbiased (r+1)-order filter,
as can easily be verified; each filter therefore has to be computed separately.
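As a concrete check, the linear system (2) can be solved numerically. The symmetric window [-T, T] with T = 1 is an illustrative assumption; under it, the first-derivative filter for Q = 2 comes out as f_1(t) = (3/2)t:

```python
import math
import numpy as np

def optimal_filter_coeffs(r, Q, T=1.0):
    """Solve the linear system (2) for the coefficients lambda_p of the
    minimum-output-noise unbiased r-th order derivative filter
    f_r(t) = sum_p lambda_p t^p on the finite window [-T, T].
    The symmetric window and T = 1 are illustrative assumptions."""
    M = np.zeros((Q + 1, Q + 1))
    for q in range(Q + 1):
        for p in range(Q + 1):
            n = p + q + 1
            M[q, p] = (T**n - (-T)**n) / (n * math.factorial(q))
    rhs = np.zeros(Q + 1)
    rhs[r] = 1.0                          # the delta_{qr} right-hand side
    return np.linalg.solve(M, rhs)

lam = optimal_filter_coeffs(r=1, Q=2)     # -> f_1(t) = (3/2) t

# Unbiasedness check: integral of f_1(t) t^q/q! over [-1, 1] is delta_{q1}.
dt = 1e-4
t = np.arange(-1.0, 1.0, dt) + dt / 2     # midpoint rule
f = sum(l * t**p for p, l in enumerate(lam))
moments = [float(np.sum(f * t**q / math.factorial(q)) * dt) for q in range(3)]
```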
2.3 An equivalent parametric approach using polynomial approximation
There is another way to compute these derivatives, considering the Taylor expansion of
the input as a parametric model. Writing:

    x(t) \simeq \sum_{q=0}^{Q} x_q \frac{t^q}{q!} + \text{noise}    (3)

one can minimize:

    J = \frac{1}{2} \int \left( x(t) - \sum_{q=0}^{Q} x_q \frac{t^q}{q!} \right)^2 dt

which is just a least-squares criterion with a similar interpretation, since we minimize the
variance of the residual error.
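A discrete counterpart of this parametric approach can be sketched as follows; the window size and order are illustrative choices, not values from the paper. The r-th derivative estimate reduces to a fixed linear filter (a row of the design matrix's pseudo-inverse), and its unbiasedness can be checked directly:

```python
import math
import numpy as np

# Fit the model x(t) ~= sum_q x_q t^q/q! to the samples by least squares.
# The estimate of the r-th derivative x_r is then a fixed linear filter:
# the r-th row of the pseudo-inverse of the design matrix.
K, Q = 3, 2                               # 2K+1 samples, model order Q
t = np.arange(-K, K + 1, dtype=float)
design = np.stack([t**q / math.factorial(q) for q in range(Q + 1)], axis=1)
filters = np.linalg.pinv(design)          # row r: unbiased r-th order filter

# Unbiasedness: applied to samples of t^q/q!, the r-th filter returns
# delta_{qr}, i.e. filters @ design is the identity.
response = filters @ design
```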
[Table: output-noise variances of the optimal smoother and of the first-, second- and
third-order derivative filters, as inverse powers of the window size W, for model orders
Q = 0 to 7.]
Consider now the parametric family of exponential filters,
which correspond to the set of recursively implemented digital filters (see for instance
[6]), having an implementation of the form :
    y_t = \sum_{i=0}^{P} b_i x_{t-i} - \sum_{j=1}^{q} a_j y_{t-j}
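This recursion can be implemented directly; the sketch below is a generic direct-form implementation, and the coefficients in the example are illustrative, not those of the Canny-Deriche filters discussed later:

```python
def recursive_filter(x, b, a):
    """Direct-form implementation of the recursive (IIR) filter
    y_t = sum_{i=0}^{P} b_i x_{t-i} - sum_{j=1}^{q} a_j y_{t-j},
    with the sign convention of the equation above."""
    y = []
    for t in range(len(x)):
        acc = sum(b[i] * x[t - i] for i in range(len(b)) if t - i >= 0)
        acc -= sum(a[j - 1] * y[t - j] for j in range(1, len(a) + 1) if t - j >= 0)
        y.append(acc)
    return y

# Example (illustrative coefficients): a first-order smoother
# y_t = 0.5 x_t + 0.5 y_{t-1}, i.e. b = [0.5] and a = [-0.5].
impulse_response = recursive_filter([1.0, 0.0, 0.0, 0.0], b=[0.5], a=[-0.5])
```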
Applying the unbiasedness condition of equation (1) to these functions leads to a finite
set of linear equations:

    \int f(t) \frac{t^p}{p!} dt = \delta_{pr}, \quad 0 \le p \le Q,

among which one selects the filter whose output noise \int f(t)^2 dt
is minimum. This leads to the minimization of a positive quadratic criterion in the presence
of linear constraints, having a unique solution obtained from the related normal equations.
In order to illustrate this point, we derive these equations for Q \le 2 and d \ge 2 for
r = 1. In that case we obtain f_d(t) = \beta t e^{-\alpha |t|}, which corresponds precisely to
the Canny-Deriche recursive optimal derivative filters. More generally, if the signal contains
derivatives up to the order of the desired derivative, usual derivative filters such as Canny-
Deriche filters are unbiased filters and can be used to estimate edge characteristics.
However, such a filter is not optimal among all infinite response operators, but only
within the small parametric family of exponential filters². The problem of finding an
optimal filter among all infinite response operators is an undefined problem, because the
Euler equation obtained in the previous section (a necessary condition for the optimum)
is undefined, as pointed out.
Since this family is dense in the functional space of derivable functions it is indeed
possible to approximate any optimal filters using a combination of exponential filters,
but the order n might be very high, while the computation window has to be increased.
Moreover, in practice, on real-time vision machines, these operators are truncated (and
thus biased!) and it is much more relevant to consider finite response filters.
2 The same parametric approach could have been developed using Gaussian kernels.
Generalizing the previous approach to 2D data, we use the following model of the
intensity, a Taylor expansion about the origin:

    I(x,y) = I_0 + I_x x + I_y y + \frac{I_{xx}}{2} x^2 + I_{xy} x y + \frac{I_{yy}}{2} y^2
           + \frac{I_{xxx}}{6} x^3 + \frac{I_{xxy}}{2} x^2 y + \frac{I_{xyy}}{2} x y^2 + \frac{I_{yyy}}{6} y^3 + \cdots

where the expansion is carried not up to the order of the derivative to be computed, but up
to the order of derivative the signal is supposed to contain.
Let us now model the fact that the intensity obtained for one pixel is related to the
image irradiance over its surface. We consider rectangular pixels, with homogeneous
surfaces, and no gap between two pixels. Since one pixel of a CCD camera integrates the
light received on its surface, a realistic model for the intensity measured at a pixel (i, j)
is, under the previous assumptions:

    I_{ij} = \int_i^{i+1} \int_j^{j+1} I(x,y) \, dx \, dy
           = I_0 P_0(i) + I_x P_1(i) + I_y P_1(j) + I_{xx} P_2(i) + I_{xy} P_1(i) P_1(j) + I_{yy} P_2(j) + \cdots

where P_k(i) = \int_i^{i+1} \frac{x^k}{k!} dx, a polynomial in i.
Now, the related least-squares problem is

    J = \frac{1}{2} \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} \left[ I_{ij} - \left( I_0 P_0(i) + I_x P_1(i) + \cdots \right) \right]^2

and its resolution provides optimal estimates of the intensity derivatives
{I_0, I_x, I_y, I_{xx}, I_{xy}, I_{yy}, ...} as a function of the intensity values I_{ij} in the N x N window.
In other words, we obtain the intensity derivatives as a linear combination of the
intensity values I_{ij}, as for the usual finite response digital filters.
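The construction can be sketched end to end: build the design matrix of pixel-integrated basis terms P_k, synthesize pixel values from a known polynomial intensity, and recover the derivatives by least squares. The window placement and the second-order model are assumptions made for illustration:

```python
import math
import numpy as np

def P(k, i):
    """P_k(i) = integral from i to i+1 of x^k/k! dx, a polynomial in i."""
    return ((i + 1) ** (k + 1) - i ** (k + 1)) / ((k + 1) * math.factorial(k))

# Second-order model on a 5x5 window.  Each row integrates the basis
# terms [I_0, I_x, I_y, I_xx, I_xy, I_yy] over the surface of pixel (i, j).
idx = range(-2, 3)
design = np.array([[P(0, i) * P(0, j), P(1, i) * P(0, j), P(0, i) * P(1, j),
                    P(2, i) * P(0, j), P(1, i) * P(1, j), P(0, i) * P(2, j)]
                   for i in idx for j in idx])

# Synthesize pixel values by exact integration of a known quadratic
# intensity surface, then recover its derivatives by least squares.
true_derivs = np.array([3.0, 1.0, -2.0, 0.5, 0.25, -0.75])
pixels = design @ true_derivs            # exact pixel integrals
estimate, *_ = np.linalg.lstsq(design, pixels, rcond=None)
```

Because the pixel values are generated by the same integrated model, the least-squares solution recovers the derivatives exactly; with noise it returns the minimum-variance estimate.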
For a 5 x 5 or 7 x 7 window, for instance, and for an intensity model taken up to the
fourth order, one obtains the convolution masks given in Fig. 3.
This approach is very similar to what was proposed by Haralick [3], and we call these
filters Haralick-like filters. In both methods the filters depend upon two integers:
(1) the size of the window, and (2) the order of expansion of the model. In both methods
we obtain polynomial linear filters. However, it has been shown [4] that Haralick filters
reduce to Prewitt filters, while our filters do not correspond to already existing filters.
The key point, which is - we think - the main improvement, is to consider the intensity
at one pixel not as the simple value at that location, but as the integral of the intensity
over the pixel surface, which is closer to reality.
Contrary to Haralick's original filters, these filters are not all separable; however, this is not
a drawback, because separable filters are only useful when the whole image is processed.
In our case we only compute the derivatives in a small area along edges, and for that
reason efficiency is not as much of an issue³.
2.7 Conclusion
We have designed a new class of unbiased optimal filters dedicated to the computation
of intensity derivatives, as required for the computation of edge characteristics. Because
these filters are computed through a simple least-squares minimization problem, we have
been able to implement these operators in the discrete case, taking the CCD subpixel
mechanisms into account.
These filters are dedicated to the computation of edge characteristics; they are well
implemented in finite windows, and correspond to unbiased differentiators with minimum
output noise. They do not correspond to optimal filters for edge detection.
³ In any case, separable filters are quicker than general filters if and only if they are used on a
whole image, not on a small set of points.
[Fig. 3: convolution masks for the smoother and for the derivative estimators I_x, I_xx,
I_xy, I_yy = I_xx^T, I_xyy = I_xxy^T and I_yyy = I_xxx^T on 5 x 5 and 7 x 7 windows.]
I-Noise            2%    5%    10%    0     0     0
P-Noise            0     0     0      0.5   1     2
Error (in pixels)  2.1   6.0   10.4   6.0   12.2  huge
5 x 5 window, and that our model is only locally valid. In the last case, the second order
derivatives are used at the border of the neighbourhood and are no longer valid.
References
1. J.F. Canny. Finding edges and lines in images. Technical Report AI Memo 720, MIT,
Cambridge, 1983.
2. R. Deriche and G. Giraudon. Accurate corner detection : an analytical study. In Proceedings
of the 3rd ICCV, Osaka, 1990.
3. R.M. Haralick. Digital step edges from zero crossing of second directional derivatives. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 6, 1984.
4. A. Huertas and G. Medioni. Detection of intensity changes with subpixel accuracy using
Laplacian-Gaussian masks. IEEE Transactions on Pattern Analysis and Machine Intelli-
gence, 8:651-664, 1986.
5. O. Monga, J. Rocchisani, and R. Deriche. 3D edge detection using recursive filtering. CVGIP:
Image Understanding, 53, 1991.
6. R. Deriche. Separable recursive filtering for efficient multi-scale edge detection. In Int. Work-
shop on Machine Vision and Machine Intelligence, Tokyo, pages 18-23, 1987.
7. R. Deriche. Fast algorithms for low-level vision. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 12, 1990.
8. R. Deriche. Using Canny's criteria to derive a recursively implemented optimal edge detector.
International Journal of Computer Vision, pages 167-187, 1987.
9. I. Weiss. Noise-resistant invariants of curves. In Applications of Invariance in Computer
Vision, DARPA-ESPRIT Workshop, Iceland, 1991.
This article was processed using the LaTeX macro package with ECCV92 style
Testing Computational Theories of Motion
Discontinuities: A Psychophysical Study *
1 Introduction
Normal naive observers with good acuity and contrast sensitivity, and no known neu-
rological or psychiatric disorders, and three patients (A.F., C.D., and O.S.) with focal
bilateral brain damage resulting from a single stroke participated in an extensive psy-
chophysical study of motion perception. MRI studies revealed that the patients' lesions
directly involved or disconnected anatomical areas believed to mediate visual analysis.
The rationale for including these three patients in the study was their good performance
on static psychophysical tasks, their normal contrast sensitivity, and good performance
on some motion tasks, but their selectively poor performance on several other visual mo-
tion tasks. All the patients and healthy volunteers signed the Informed Consent form
according to the Boston University human subjects committee regulations. Details of
the psychophysical experiments and of the experimental setting can be found in [Vea1],
[Vea2], and [Vea3].
* This research was supported by the NIH grant EY07861-3 to L.V. and by the AFOSR grant
92-00564 to Dr. Suzanne McKee and N.M.G.
3 Results
Figure lb shows that the normal subjects and O.S. performed the task essentially
without error for all conditions. In contrast, patient C.D. was severely impaired at all
conditions. Because the patients A.F. and C.D. performed at chance in the pure temporal-
frequency condition (0°), we conclude that they could not use this cue well enough to
localize discontinuities.
compared to five other speeds, giving speed ratios of 1.1, 1.47, 2.2, 3.6, and 5.5. The
assignment of the highest speed to the top or bottom aperture was pseudo-randomly
selected.
[Fig. 2 (panels A and B): performance as a function of speed ratio for the control group
and the patients.]
Subjects were asked to determine which of the two apertures contained the faster
moving dots. Figure 2b shows that in comparison to the control group and O.S., who
were performing almost perfectly for the 1.47 speed ratio, A.F. had a very severe deficit
on this speed discrimination task. Similarly, C.D. was also impaired on this task, but to
a lesser degree than A.F.
3.3 Experiment 3: Motion Coherence
In the third experiment, the stimuli were dynamic random-dot cinematograms with a
correlated motion signal of variable strength embedded in motion noise. The strength of
the motion signal, that is, the percentage of the dots moving in the same, predetermined
direction, varied from 0% to 100% (Figure 3a). The algorithm by which the dots were
generated was similar to that of Newsome and Paré ([NP]), which is described in detail
in [Vea1], [Vea2], and [Vea3]. The aim of this task was to determine the threshold of
motion correlation for which a subject could reliably discriminate the direction of motion.
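A stimulus of this kind can be sketched as follows, in the spirit of the cited algorithm ([NP]); the dot count, step size and jitter range are illustrative assumptions, not the experimental parameters:

```python
import random

def cinematogram_step(positions, coherence, step=(1, 0), jitter=5, rng=random):
    """One frame update of a random-dot cinematogram: a fraction `coherence`
    of the dots is displaced coherently by `step` (the signal), while the
    remainder are re-plotted at random (the noise).  A sketch only; all
    parameters are illustrative assumptions."""
    n_signal = int(coherence * len(positions))
    moved = []
    for k, (x, y) in enumerate(positions):
        if k < n_signal:                              # signal dot
            moved.append((x + step[0], y + step[1]))
        else:                                         # noise dot
            moved.append((x + rng.randint(-jitter, jitter),
                          y + rng.randint(-jitter, jitter)))
    return moved, n_signal

dots = [(0, 0)] * 100
frame2, n_signal = cinematogram_step(dots, coherence=0.10)   # 10% coherence
```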
Figure 3b shows that the mean of the motion coherence threshold of the normal
subjects (n=16) was 6.5% for left fixation and 6.9% for right fixation. The patient A.F.
was significantly impaired on this task. His direction discrimination threshold was 28.4%
for left fixation and 35.2% for right fixation. Similarly, O.S. was very impaired on this
task. In contrast, C.D.'s performance was normal when the stimulus was presented in
the intact visual field, but she could not do the task when the stimulus was presented in
the blind visual field.
Fig. 3. Motion coherence
4 Discussion
4.1 C o m p u t a t i o n of Visual M o t i o n
It has been theorized that comparisons between fully encoded velocity signals underlie
localization of discontinuities ([NL]; [Clo]). However, our data do not support this sug-
gestion, since A.F., who could not discriminate speed well, had a good performance in
the localization-of-discontinuities task.
Our data also address the issue of whether the computations underlying discontinuity
localization and motion coherence occur simultaneously in the brain. This possibility
is suggested by theories based on Markov random fields and line processes ([Kea]).
Such theories would predict that if the computation of coherence is impaired, then so
is the computation of discontinuity. Our data do not support this simultaneous-damage
prediction as C.D. performed well on the coherence task, but failed in the localization-
of-discontinuities task. Further evidence against a simultaneous computation of discon-
tinuities and coherence comes from A.F. and O.S., who were good in the localization-
of-discontinuities task but very impaired in the coherence task.
From a computational perspective, it seems desirable to account for the data by
postulating that the computation of motion coherence receives inputs from two parallel
pathways (Figure 4). One pathway would bring data from basic motion measurements
(directional, temporal, and speed signals). The other pathway would bring information
about discontinuity localization (see [H] and [GY] for theoretical models) to provide
boundary conditions for the spatial integration in the computation of motion coherence
([YG1]; [YG2]). According to this hypothesis, it is possible that different lesions may
cause independent impairments of discontinuity localization and motion coherence.
Notes and Comments. In Experiment 1, flicker was not completely eliminated, since at
0° angular difference the notch was still visible, as if it were a twinkling border. A possible
explanation for this apparent flicker is that the dots inside the notch had shorter lifetimes
and thus were turned on and off at a higher temporal frequency.
References
[Clo] Clocksin, W.F.: Perception of surface slant and edge labels from optical flow: A compu-
tational approach. Perception, 9 (1980) 253-269
[GY] Grzywacz, N.M., Yuille, A.L.: A model for the estimate of local image velocity by cells
in the visual cortex. Phil. Trans. R. Soc. Lond. B, 239 (1990) 129-161
[H] Hildreth, E.C.: The Measurement of Visual Motion. Cambridge, USA: MIT Press, (1984)
[Kea] Koch, C., Wang, H.T., Mathur, B., Hsu, A., Suarez, H.: Computing optical flow in
resistive networks and in the primate visual system. Proc. of the IEEE Workshop on
Visual Motion, Irvine, CA, USA (1989) 62-72
[NL] Nakayama, K., Loomis, J.M.: Optical velocity patterns, velocity-sensitive neurons, and
space perception: A hypothesis. Perception, 3 (1974) 63-80
[NP] Newsome, W.T., Paré, E.B.: A selective impairment of motion perception following le-
sions of the middle temporal visual area (MT). J. Neurosci. 8 (1988) 2201-2211
[Vea1] Vaina, L.M., LeMay, M., Bienfang, D.C., Choi, A.Y., Nakayama, K.: Intact "biological
motion" and "structure from motion" perception in a patient with impaired motion
mechanisms. Vis. Neurosci. 5 (1990) 353-371
[Vea2] Vaina, L.M., Grzywacz, N.M., LeMay, M.: Structure from motion with impaired local-
speed and global motion-field computations. Neural Computation, 2 (1990) 420-435
[Vea3] Vaina, L.M., Grzywacz, N.M., LeMay, M.: Perception of motion discontinuities in pa-
tients with selective motion deficits. (Submitted for publication)
[YG1] Yuille, A.L., Grzywacz, N.M.: A computational theory for the perception of coherent
visual motion. Nature, 333 (1988) 71-74
[YG2] Yuille, A.L., Grzywacz, N.M.: A mathematical analysis of the motion coherence theory.
Intl. J. Comp. Vision, 3 (1989) 155-175
This article was processed using the LaTeX macro package with ECCV92 style
Motion and Structure Factorization and Segmentation of Long
Multiple Motion Image Sequences
Abstract. This paper presents a computer algorithm which, given a dense tem-
poral sequence of intensity images of multiple moving objects, will separate the
images into regions showing distinct objects, and for those objects which are ro-
tating, will calculate the three-dimensional structure and motion. The method in-
tegrates the segmentation of trajectories into subsets corresponding to different
objects with the determination of the motion and structure of the objects. Trajecto-
ries are partitioned into groups corresponding to the different objects by fitting the
trajectories from each group to a hierarchy of increasingly complex motion mod-
els. This grouping algorithm uses an efficient motion estimation algorithm based
on the factorization of a measurement matrix into motion and structure compo-
nents. Experiments are reported using two real image sequences of 50 frames each
to test the algorithm.
1 Introduction
This paper is concerned with three-dimensional structure and motion estimation for scenes
containing multiple independently moving rigid objects. Our algorithm uses the image motion to
separate the multiple objects from the background and from each other, and to calculate the three-
dimensional structure and motion of each such object. The two-dimensional motion in the image
sequence is represented by the image plane trajectories of feature points. The motion of each ob-
ject, which describes the three-dimensional rotation and translation of the object between the im-
ages of the sequence, is computed from the object's feature trajectories. If the object on which a
particular group of feature points lie is rotating, the relative three-dimensional positions of the
feature points, called the structure of the object, can also be calculated.
Our algorithm is based on the following assumptions: (1) the objects in the scene are rigid,
i.e., the three-dimensional distance between any pair of feature points on a particular object is
constant over time, (2) the feature points are orthographically projected onto the image plane, and
(3) the objects move with constant rotation per frame. This algorithm integrates the task of seg-
menting the images into distinctly moving objects with the task of estimating the motion and
structure for each object. These tasks are performed using a hierarchy of increasingly complex
motion models, and using an efficient and accurate factorization-based motion and structure es-
timation algorithm.
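The rank property underlying such factorizations can be illustrated with a synthetic example (constant rotation about a vertical axis under orthography, matching the assumptions above); this is the rank argument only, not the authors' full algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic rigid object: 20 feature points rotating at a constant rate
# about the vertical axis, projected orthographically over 50 frames.
points3d = rng.standard_normal((3, 20))
rows = []
for f in range(50):
    th = 0.02 * f                                     # constant rotation/frame
    proj = np.array([[np.cos(th), 0.0, np.sin(th)],   # first two rows of a
                     [0.0,        1.0, 0.0       ]])  # rotation: orthography
    rows.append(proj @ points3d)
W = np.vstack(rows)                                   # 2F x P measurements
W = W - W.mean(axis=1, keepdims=True)                 # remove translation

# For a single rigid object the centred measurement matrix has rank <= 3;
# its SVD therefore factors it into motion and structure components.
U, s, Vt = np.linalg.svd(W, full_matrices=False)
motion, structure = U[:, :3] * s[:3], Vt[:3]
```

The fourth singular value is at noise level, and the product of the two rank-3 factors reproduces the measurement matrix.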
This paper makes use of an algorithm for factorization of a measurement matrix into separate
motion and structure matrices as reported by the authors in [DA1]. Subsequently in [TK1], To-
masi and Kanade present a similar factorization-based method which allows arbitrary rotations,
but does not have the capability to process trajectories starting and ending at arbitrary frames.
Furthermore, it appears that some assumptions about the magnitude or smoothness of motion are
* Supported by DARPA and the NSF under grant IRI-89-02728, and the State of Illinois Department of
Commerce and Community Affairs under grant 90-103.
still necessary to obtain feature trajectories. Kanade points out [Ka1] that with our assumption of
constant rotation we are absorbing the trajectory noise primarily in the structure parameters
whereas their algorithm absorbs them in both the motion and structure parameters.
Most previous motion-based image sequence segmentation algorithms use optical flow to
segment the images based on consistency of image plane motion. Adiv in [Ad1] and Bergen et al.
in [BB1] instead segment on the basis of a fit to an affine model. Adiv further groups the resulting
regions to fit a model of a planar surface undergoing 3-D motions in perspective projection. In
[BB2] Boult and Brown show how Tomasi and Kanade's motion factorization method can be
used to split the measurement matrix into parts consisting of independently moving rigid objects.
If no adequate model can be found for the motion in the region, the region is split using a region growing technique. When splitting
a region, a measure of motion consistency is computed in a small neighborhood around each tra-
jectory in the region. If the motion is consistent for a particular trajectory, we assume that the
trajectories in the neighborhood all arise from points on a single object. Thus the initial subre-
gions for the split consist of groups of trajectories with locally consistent motion, and these are
grown out to include the remaining trajectories.
Initially all the trajectories are in a single region. Processing then continues in a uniform fashion: the new point positions in each new frame are added to the trajectories of the existing regions,
and then the regions are processed to make them compatible with the new data. The processing
of the regions is broken into four steps: (1) if the new data does not fit the old region motion mod-
el, find a model which does fit the data or split the region, (2) add any newly visible points or
ungrouped points to a compatible region, (3) merge adjacent regions with compatible motions,
(4) remove outliers from the regions.
Compatibility among feature points is checked using the structure and rotational motion estimation algorithm or the translational motion estimation algorithm described in [Del]. A region's
feature points are considered incompatible if the fit error returned by the appropriate motion es-
timation algorithm is above a threshold. We assume that the trajectory detection algorithm can
produce trajectories accurate to the nearest pixel, and therefore we use a threshold (which we call
the error threshold) of one half of a pixel per visible trajectory point per frame. The details of the
four steps listed above may be found in [Del] or [DA3].
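The threshold test itself is simple bookkeeping: accumulate the prediction residuals over all visible trajectory points in the region and compare against the half-pixel-per-point budget. A hypothetical sketch — the function name and array layout are ours, not from [Del]:

```python
import numpy as np

ERROR_THRESHOLD = 0.5  # pixels per visible trajectory point per frame

def region_compatible(predicted, observed, visible):
    """predicted/observed: (F, N, 2) trajectory point positions;
    visible: (F, N) boolean mask of which points are seen in which frame.
    True if the total fit error stays within the per-point budget."""
    err = np.linalg.norm(predicted - observed, axis=2)   # (F, N) pixel errors
    return err[visible].sum() <= ERROR_THRESHOLD * visible.sum()

# Residuals of 0.1 pixel per coordinate are accepted...
obs = np.zeros((5, 4, 2))
pred = obs + 0.1
vis = np.ones((5, 4), dtype=bool)
print(region_compatible(pred, obs, vis))        # True
# ...while a systematic 1-pixel residual triggers a model change or split.
print(region_compatible(obs + 1.0, obs, vis))   # False
```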
4 Experiments
Our algorithm was tested on two real image sequences of 50 frames: (1) the cylinder sequence,
consisting of images of a cylinder rotating around a nearly vertical axis and a box moving right
with respect to the cylinder and the background, and (2) the robot arm sequence, consisting of
images of an Unimate PUMA Mark III robot arm with its second and third joints rotating in
opposite directions. These sequences show the capabilities of the approach, and also demonstrate
some inherent limitations of motion based segmentation and of monocular image sequence based
motion estimation.
Trajectories were detected using the algorithm described in [Del] (using a method described
in [BH1]), which found 2598 trajectories in the cylinder sequence and 202 trajectories in the robot
arm sequence. These trajectories were input to the image sequence segmentation algorithm de-
scribed in Section 3, which partitioned the trajectories into groups corresponding to different rigid
objects and estimated the motion and structure parameters.
The segmentation for the cylinder sequence is shown in Fig. 1. The algorithm separated out
the three image regions: the cylinder, the box, and the background. The cylinder is rotating, and
thus its structure can be recovered from the image sequence. Fig. 2 shows a projection along the
cylinder axis of the 3D point positions calculated from the 1456 points on the cylinder. The points
lie very nearly on a cylindrical surface. Table 1 shows the estimated and the actual motion param-
Table 1. Comparison of the parameters estimated by the algorithm and the true parameters for the cylinder image sequence experiment.

    Parameter        Estimated      Actual
    ω                -0.022         -0.017
    rotation axis    (0,.99,.12)    (0,.98,.19)
    translation      (.29,-.19)     (.14,0)
Fig. 1. The image sequence segmentation found for the cylinder sequence (the segmentation is superimposed on the last frame of the sequence).
Fig. 2. An end-on view of the three-dimensional point positions calculated by our structure and motion estimation algorithm from point trajectories derived from the cylinder image sequence.
    ω                .0133            .0131
    rotation axis    (-.67,-.01,.74)  (-.62,.02,.79)
    translation      (.02,-.07)       (0,0)
eters for the cylinder. The error in the ω estimate is large because the cylinder is rotating around
an axis nearly parallel to the image plane and, as pointed out in [WH1], a rotation about an axis
parallel to the image plane is inherently difficult to distinguish from translation parallel to the im-
age plane and perpendicular to the rotation axis (this also explains the error in T). Note that the
predicted trajectory point positions still differ from the actual positions by an average of less than
the error threshold of 0.5 pixel. The accuracy of the motion and structure estimation algorithm for
less ambiguous motion is illustrated in the experiments on the robot arm sequence.
The image sequence segmentation for the robot arm sequence is shown in Fig. 3. Note that
several stationary feature points (only two visible in Fig. 3) on the background are grouped with
the second segment of the arm. This occurs because any stationary point lying on the projection
of a rotation axis with no translational motion will fit the motion parameters of the rotating object.
Thus these points are grouped incorrectly due to an inherent limitation of segmenting an image
sequence on the basis of motion alone. The remaining points are grouped correctly into three im-
age regions: the second and the third segments of the robot arm, and the background. The two
robot arm segments are rotating and their three-dimensional structure was recovered by the mo-
tion and structure estimation algorithm. Only a small number of feature points were associated with the robot arm segments, making it difficult to illustrate the structure on paper, but the estimated motion parameters of the second and third robot arm segments are shown in Table 2 and
Table 3, respectively. Note that all the motion parameters were very accurately determined.
5 Conclusions
The main features of our method are: (1) motion and structure estimation and segmentation
processes are integrated, (2) frames are processed sequentially with continual update of motion
and structure estimates and segmentation, (3) the motion and structure estimation algorithm fac-
tors the trajectory data into separate motion and structure matrices, (4) aside from SVDs, the mo-
tion and structure estimation algorithm is closed form with no nonlinear iterative optimization
required, (5) the motion and structure estimation algorithm provides a confidence measure for
evaluating any particular segmentation.
References
[Adl]Adiv, G.: Determining Three-Dimensional Motion and Structure from Optical Flow
Generated by Several Moving Objects. IEEE Transactions on PAMI 7 (1985) 384-401
[BB1]Bergen, J., Burt, P., Hingorani, R., Peleg, S.: Multiple Component Image Motion: Motion
Estimation. Proc. of the 3rd ICCV, Osaka, Japan (December 1990) 27-32
[BB2]Boult, T., Brown, L.: Factorization-based Segmentation of Motions. Proc. of the IEEE
Motion Workshop, Princeton NJ (October 1991) 21-28
[BH1]Blostein, S., Huang, T.: Detecting Small, Moving Objects in Image Sequences using
Sequential Hypothesis Testing. IEEE Trans. on Signal Proc. 39 (July 1991) 1611-1629
[DA1]Debrunner, C., Ahuja, N.: A Direct Data Approximation Based Motion Estimation
Algorithm. Proc. of the 10th ICPR, Atlantic City, NJ (June 1990) 384-389
[DA2]Debrunner, C., Ahuja, N.: Estimation of Structure and Motion from Extended Point
Trajectories. (submitted)
[DA3]Debrunner, C., Ahuja, N.: Motion and Structure Factorization and Segmentation of Long
Multiple Motion Image Sequences. (submitted)
[Del]Debrunner, C.: Structure and Motion from Long Image Sequences. Ph.D. dissertation,
University of Illinois at Urbana-Champaign, Urbana, IL (August 1990)
[Kal]Kanade, T.: personal communication (October 1991)
[KA1]Kung, S., Arun, K., Rao, D.: State-Space and Singular-Value Decomposition-Based
Approximation Methods for the Harmonic Retrieval Problem. J. of the Optical Society of
America (December 1983) 1799-1811
[TK1] Tomasi, C., Kanade, T.: Factoring Image Sequences into Shape and Motion. Proc. of the
IEEE Motion Workshop, Princeton NJ (October 1991) 21-28
[WH1]Weng, J., Huang, T., Ahuja, N.: Motion and Structure from Two Perspective Views:
Algorithms, Error Analysis, and Error Estimation. IEEE Transactions on PAMI 11 (1989)
451-476
Motion and Surface Recovery Using
Curvature and Motion Consistency
1 Introduction
This paper describes an algorithm for reconstructing surfaces obtained from a sequence of
overlapping range images in a common frame of reference. It does so without explicitly
computing correspondence and without invoking a global rigidity assumption. Motion
parameters (rotations and translations) are recovered locally under the assumption that
the curvature structure at a point on a surface varies slowly under transformation. The
recovery problem can thus be posed as finding the set of motion parameters that preserves
curvature across adjacent views. This might be viewed as a temporal form of the curvature
consistency constraint used in image and surface reconstruction [2, 3, 4, 5, 6].
To reconstruct a 3-D surface from a sequence of overlapping range images, one can
attempt to apply local motion estimates to successive pairs of images in pointwise fashion.
However this approach does not work well in practice because estimates computed locally
are subject to the effects of noise and quantization error. This problem is addressed
by invoking a second constraint that concerns the properties of physical surfaces. The
motions of adjacent points are coupled through the surface, where the degree of coupling is
proportional to surface rigidity. We interpret this constraint to mean that motion varies
smoothly from point to point and attempt to regularize local estimates by enforcing
smooth variation of the motion parameters. This is accomplished by a second stage of
minimization which operates after local motion estimates have been applied.
The remainder of this paper will focus on the key points of the algorithm, namely:
how the problem of locally estimating motion parameters can be formulated as a convex
minimization problem, how local estimates can be refined using the motion consistency
constraint, and how the combination of these two stages can be used to piece together
a 3-D surface from a sequence of range images. An example of the performance of the
algorithm on laser rangefinder data is included in the paper.
Our approach is based on the minimization of a functional form that measures the similarity between a local neighbourhood in one image and a corresponding neighbourhood in an adjacent image. Following the convention of [6], we describe the local structure of a surface S in the vicinity of a point P with the augmented Darboux frame D(P) = (P, Mp, M̄p, Np, κMp, κM̄p), where Np is the unit normal vector to S at P, Mp and M̄p are the directions of the principal maximum and minimum curvatures respectively at P, and κMp and κM̄p are the scalar magnitudes of these curvatures [1].
Now let x and x′ be points on adjacent surfaces S and S′ with corresponding frames D(x) and D(x′), and let Ω and T be the rotation matrix and translation vector that map x to x′, i.e., x′ = Ωx + T. The relationship between D(x) and D(x′) then follows by mapping the point by Ωx + T and rotating the frame direction vectors by Ω; the principal curvatures are invariant under rigid motion.
The task is to find Ω and T such that ‖D(x) − D(x′)‖ is minimized. However, for reasons of uniqueness and robustness which are beyond the scope of this paper, the minimization must be defined over an extended neighbourhood that includes the frames in an i × i neighbourhood of x,

    min over Ω, T of  Σi ‖Di(x′) − 𝒯(Di(x), Ω, T)‖ .    (3)
If an appropriate functional D_ΩT can be found that is convex in Ω and T, then these parameters can easily be determined by an appropriate gradient descent procedure without explicitly computing correspondence. Let Λ ⊂ S be a patch containing a point x and Di(x) a set of frames that describe the local neighbourhood of x. The patch Λ is now displaced according to Ω and T. Specifically, we claim that if
1. Ω and T are such that the projection of x on S′ lies somewhere on the image of Λ, Λ′, on S′,
2. Λ meets existence and uniqueness requirements with respect to Di(x),
3. Λ is completely elliptic or hyperbolic,
then

    D_ΩT = Σi ‖𝒯(Di(x), Ω, T) − Di(x′)‖ / (3 + |κMxi| + |κMx′i| + |κM̄xi| + |κM̄x′i|) ,    (4)

where (κMxi, κM̄xi, Mxi, M̄xi, Nxi) and (κMx′i, κM̄x′i, Mx′i, M̄x′i, Nx′i) are the components of Di(x) ∈ S and Di(x′) ∈ S′ respectively.
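The quantity inside the norm can be made concrete: under a rigid motion the point of an augmented Darboux frame maps as Ωx + T, its direction vectors rotate by Ω, and the principal curvatures are invariant, so the frame residual vanishes exactly at the true motion and grows away from it. A small numpy sketch in our own notation:

```python
import numpy as np

def rot(axis, angle):
    """Rodrigues formula for a rotation matrix about a unit axis."""
    axis = np.asarray(axis, float); axis /= np.linalg.norm(axis)
    K = np.cross(np.eye(3), axis)          # skew (cross-product) matrix
    return np.eye(3) + np.sin(angle) * K + (1 - np.cos(angle)) * (K @ K)

def transport(frame, Omega, T):
    """Map a Darboux frame rigidly: point by Omega x + T, direction vectors
    by Omega; principal curvatures are invariant under rigid motion."""
    P, M, Mb, N, kM, kMb = frame
    return (Omega @ P + T, Omega @ M, Omega @ Mb, Omega @ N, kM, kMb)

def frame_dist(f1, f2):
    return sum(np.linalg.norm(np.atleast_1d(a) - np.atleast_1d(b))
               for a, b in zip(f1, f2))

frame = (np.array([0.2, -0.1, 0.4]), np.array([1., 0, 0]),
         np.array([0, 1., 0]), np.array([0, 0, 1.]), 0.8, 0.1)
Om, T = rot([0, 1, 0], 0.4), np.array([0.5, -0.2, 1.0])
observed = transport(frame, Om, T)          # frame seen in the next view
# The residual of (3) vanishes at the true motion, grows for a wrong one:
print(frame_dist(transport(frame, Om, T), observed))                          # 0.0
print(frame_dist(transport(frame, rot([0, 1, 0], 0.1), T), observed) > 0.1)   # True
```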
An algorithm for the local recovery of Ω and T follows directly from (4) and is described in [8]. Given two range images R(i, j) and R′(i, j), the curvature consistency algorithm described in [2, 3, 4] is applied to each image to obtain D(x) and D(x′) for each discrete sample. Then for each point x for which Ω and T are required, the following steps are performed:
A brute force solution to the 3-D reconstruction problem would be to estimate Ω and T for each element of R(i, j) and map it to R′(i, j), eliminating multiple instantiations of points in overlapping regions in the process. However a more efficient and effective approach is to determine Ω and T for a subset of points on these surfaces and then use the solutions for these points to map the surrounding neighbourhood. Besides being more efficient, this strategy acknowledges that solutions may not be found for each point, either because a point has become occluded, or because a weak solution has been vetoed.
Still, as proposed, this strategy does not take the behaviour of real surfaces into account.
Because of coupling through the surface, the velocities of adjacent points are related. A
reasonable constraint would be to insist that variations in the velocities between adjacent
regions be smooth. We refer to this as motion consistency and apply this constraint to the
local estimates of Ω and T in a second stage of processing. However, rather than explicitly correcting for each locally determined Ω and T, we correct instead for the positions and orientations of the local neighbourhoods to which these transformations are applied.
3.1 Motion Consistency
The updating of position and orientation can be dealt with separately provided that certain constraints are observed, as there are different compositions of rotation and translation that can take a set of points from one position to another. While the final position depends on the amount of rotation and translation applied, the final orientation depends solely on the rotation applied. This provides the necessary insight into how to separate the problem.
Within the local neighbourhood of a point P we assume that the motion is approximately rigid, i.e. that the motion of P and its near-neighbours, Qi, can be described by the same Ω and T. The problem, then, is to update the local motion estimate at P, Ω_P and T_P, given the estimates computed independently at each of its neighbours, Ω_i and T_i. Since motion is assumed to be locally rigid, the relative displacements between P and each Qi should be preserved under transformation. One can exploit this constraint by noting that
    p1 = q1i + r1i ,    (5)

where p1 and q1i are vectors corresponding to P and Qi in view 1, and r1i is the relative displacement of point P from point Qi. It is straightforward to show that the position of P in view 2 as predicted by its neighbour Qi is p2i = q2i + Ω_i r1i, and that the combined estimate of the position of P in view 2 is

    p̂2 = ( Σ_{i=1..n} wi p2i ) / ( Σ_{i=1..n} wi ) ,    (6)
where the wi take into account the rigidity of the object and the distance between neighbouring points. The weights wi and the size of the local neighbourhood determine the rigidity of the reconstructed surface. In our experiments a Gaussian weighting was used over neighbourhood sizes ranging from 3 × 3 to 11 × 11.
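The position update can be sketched in a few lines; we assume here that each neighbour Qi carries its own rotation estimate, used to transport the stored relative displacement (names and array layout are ours):

```python
import numpy as np

def update_position(q2, Omegas, r1, w):
    """Weighted average of the view-2 positions of P predicted by each
    neighbour Qi: its view-2 position q2[i] plus the relative displacement
    r1[i] transported by that neighbour's rotation estimate Omegas[i]."""
    preds = q2 + np.einsum('nij,nj->ni', Omegas, r1)   # (n, 3) predictions
    w = np.asarray(w, float)
    return (w[:, None] * preds).sum(axis=0) / w.sum()

# Three neighbours with identity rotations and Gaussian distance weights.
q2 = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
r1 = np.array([[0.0, 0.5, 0.5], [0.5, 0.0, 0.5], [0.5, 0.5, 0.0]])
Om = np.stack([np.eye(3)] * 3)
d = np.array([1.0, 1.0, 2.0])               # distances from P to each Qi
w = np.exp(-d ** 2 / 2.0)                   # Gaussian weighting, as in the text
p2 = update_position(q2, Om, r1, w)
```

With a single neighbour the update simply returns that neighbour's prediction; with several, nearby neighbours dominate through the weights.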
The second part of the updating procedure seeks to enforce the relative orientation
between point P and each of its neighbours Qi. However, under the assumption of locally
rigid motion, this is equivalent to saying that each point in the local neighbourhood
should have the same rotation component in its individual motion parameters. Of course
one cannot simply average the parameters of Ω as in the case of position; rotation parameters are not invariant with respect to reference frame. To get around this problem we convert estimates of Ω_i into their equivalent unit quaternions Qi [9, 7]. The locus of
unit quaternions traces out a unit sphere in 4-dimensional space. One of their desirable
properties is that metric distance between two quaternions is given by the great circle
distance on this sphere. For small distances, which is precisely the case for the rotations
associated with each Qi, a reasonable approximation is given by their scalar product.
The computational task is now to estimate the quaternion at P, Q_P, given the quaternions Qi at each Qi, by minimizing the distance between Q_P and each Qi. Using standard
methods it can be shown that the solution to this minimization problem amounts to an
average of the quaternions Qi normalized to unit length. An example of the effect of
applying these updating rules is shown later on in this section.
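The orientation update thus reduces to summing nearby unit quaternions and renormalizing, after folding antipodal representatives (q and −q encode the same rotation) into one hemisphere. A minimal numpy sketch:

```python
import numpy as np

def average_quaternions(quats):
    """Average nearby unit quaternions by normalizing their sum -- valid for
    small angular spreads, where great-circle distance ~ scalar product."""
    Q = np.asarray(quats, float)
    # Fold q and -q (same rotation) into one hemisphere before summing.
    Q = np.where((Q @ Q[0])[:, None] < 0, -Q, Q)
    mean = Q.sum(axis=0)
    return mean / np.linalg.norm(mean)

def from_axis_angle(axis, angle):
    axis = np.asarray(axis, float); axis /= np.linalg.norm(axis)
    return np.concatenate([[np.cos(angle / 2)], np.sin(angle / 2) * axis])

# Small rotations about a common axis average to an intermediate rotation.
qs = [from_axis_angle([0, 0, 1], a) for a in (0.10, 0.12, 0.14)]
q_mean = average_quaternions(qs)
print(2 * np.arccos(q_mean[0]))   # ~0.12 rad
```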
3.2 Experimental Results
4 Conclusions
In this paper we have sketched out an algorithm for reconstructing a sequence of over-
lapping range images based on two key constraints: minimizing the local variation of
curvature across adjacent views, and minimizing the variation of motion parameters
across adjacent surface points. It operates without explicitly computing correspondence
and without invoking a global rigidity assumption. Preliminary results indicate that
the resulting surface reconstruction is both robust and accurate.
Fig. 1. Laser rangefinder images of an owl statuette at (a) 0°, (b) 45° and (c) 90°. Resolution is 256 × 256 by 10 bits/rangel.
Fig. 2. First pair of images: Surface of the owl at 0° rotation mapped into the coordinates of a second frame at 45° using local motion estimates. (a) Motion consistency not applied. (b) Motion consistency applied. Second pair of images: (c) A laser rangefinder image rendered as a shaded surface showing the owl from the viewpoint of the reconstructed surfaces shown next. (d) Reconstruction of three views of the owl taken at 0°, 45° and 90° and rendered as a shaded surface.
References
another segment in the same G1i is matched to one segment in G2k (k ≠ j). We call this condition the completeness of grouping. We show in Fig. 2 a noncomplete grouping in the second frame. The completeness of a grouping implies that we need only apply the hypothesis generation process to each pairing of groups. As we have g1 × g2 such pairings and the complexity of each pairing is O((m/g1)²(n/g2)²), the total complexity is now O(m²n²/(g1g2)). Thus we have a speedup of O(g1g2).
Fig. 2. A grouping which is not complete.
Fig. 3. Illustration of how grouping speeds up the hypothesis generation process.
Take a concrete example (see Fig. 3). We have 6 and 12 segments in Frame 1 and Frame 2, respectively. If we directly apply the hypothesis generation algorithm to these two frames, we need 6 × 5 × 12 × 11/2 = 1980 operations. If the first frame is divided into 2 groups and the second into 3, we have 6 pairings of groups. Applying the hypothesis generation algorithm to each pairing requires 3 × 2 × 4 × 3/2 = 36 operations. The total number of operations is 6 × 36 = 216, and we have a speedup of about 9. We should remember that the O(g1g2) speedup is achieved at the cost of a prior grouping process. Whether the speedup is significant depends upon whether the grouping process is efficient.
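The operation counts above can be checked with a two-line cost model (our reading of the counts: ordered pairs of distinct segments in frame 1 against unordered pairs in frame 2):

```python
def cost(m, n):
    """Hypothesis-generation cost for m segments in frame 1 and n in
    frame 2: m(m-1) * n(n-1)/2."""
    return m * (m - 1) * n * (n - 1) // 2

m, n, g1, g2 = 6, 12, 2, 3
direct = cost(m, n)                          # 1980, as in the text
grouped = g1 * g2 * cost(m // g1, n // g2)   # 6 * 36 = 216
print(direct, grouped, round(direct / grouped, 1))   # 1980 216 9.2
```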
One of the most influential grouping techniques is perceptual grouping, pioneered by Lowe [5]. In our algorithm, grouping is performed based on proximity and
coplanarity of 3D line segments. Use of proximity allows us to roughly divide the scene
into clusters, each constituting a geometrically compact entity. As we mainly deal with indoor environments, many polyhedral objects can be expected. Use of coplanarity allows
us to further divide each cluster into semantically meaningful entities, namely planar
facets.
Two segments are said to be proximally connected if one segment is in the neighborhood of the other. There are many possible definitions of a neighborhood. Our definition is: the neighborhood of a segment S is a cylindrical space C with radius r whose axis coincides with the segment S. This is shown in Fig. 4. The top and bottom of the cylinder
230
are chosen such that the cylinder C completely contains the segment S. S intersects the two planes at A and B. The segment AB is called the extended segment of S. The distance from one endpoint of S to the top or bottom of the cylinder is b. Thus the volume V of the neighborhood of S is equal to πr²(l + 2b), where l is the length of S. We choose b = r. The volume of the neighborhood is then determined by r.
Fig. 4. Definition of a segment's neighborhood
A segment Si is said to be in the neighborhood of S if Si intersects the cylindrical space C. From this definition, Si intersects C if either of the following conditions is satisfied:
1. At least one of the endpoints of Si is in C.
2. The distance between the supporting lines of S and Si is less than r and the common
perpendicular intersects both Si and the extended segment of S.
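Condition 1 reduces to a point-in-cylinder test against the extended segment; condition 2 additionally needs a line-to-line distance computation, omitted here. A sketch of the first condition (names are ours):

```python
import numpy as np

def point_in_cylinder(p, a, b, r):
    """Is point p inside the cylinder of radius r whose axis runs from a
    to b (the extended segment of S, which already includes the b = r
    margins at both ends)?"""
    axis = b - a
    t = np.dot(p - a, axis) / np.dot(axis, axis)   # axial coordinate in [0, 1]
    if not 0.0 <= t <= 1.0:
        return False
    radial = p - (a + t * axis)                    # offset from the axis
    return np.linalg.norm(radial) <= r

# Neighborhood of a unit segment along x with r = b = 0.25: the extended
# segment runs from (-0.25, 0, 0) to (1.25, 0, 0).
a, b_end, r = np.array([-0.25, 0, 0.]), np.array([1.25, 0, 0.]), 0.25
print(point_in_cylinder(np.array([0.5, 0.2, 0.0]), a, b_end, r))   # True
print(point_in_cylinder(np.array([0.5, 0.3, 0.0]), a, b_end, r))   # False
```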
We define a cluster as a group of segments, every two of which are proximally con-
nected in the sense defined above either directly or through one or more segments in the
same group. A simple implementation to find clusters by testing the above conditions
leads to a complexity of O(n²) in the worst case, where n is the number of segments in
a frame. In the following, we present a method based on a bucketing technique to find
clusters. First, the minima and maxima of the x, y and z coordinates are computed, denoted by xmin, ymin, zmin and xmax, ymax, zmax. Then the parallelepiped formed by the minima and maxima is partitioned into p³ buckets Wijk (p = 16 in our implementation).
To each bucket Wijk we attach the list of segments Lijk intersecting it. The key idea of
bucketing techniques is that on the average the number of segments intersecting a bucket
is much smaller than the total number of segments in the frame. The computation of
attaching segments to buckets can be performed very fast by an algorithm whose com-
plexity is linear in the number of segments. Finally, a recursive search is performed to
find clusters. We can write an algorithm to find the cluster containing a segment S in pseudo-C code as:
List find_cluster(S)
Segment S;
{
    List cluster = NULL;

    if (is_visited(S)) return NULL;
    mark_segment_visited(S);
    list_buckets = find_buckets_intersecting_neighborhood_of(S);
    list_segments = union_of_all_segments_in(list_buckets);
    for (Si = each_segment_in(list_segments))
        cluster = cluster ∪ {Si} ∪ find_cluster(Si);
    return cluster;
}
where List is a structure storing a list of segments, defined as

typedef struct cell {
    Segment seg;
    struct cell *next;
} *List;
From the above discussion, we see that the operations required to find a cluster are really very simple with the aid of a bucketing technique, except perhaps for the function find_buckets_intersecting_neighborhood_of(S). This function, as indicated by its name, should find all buckets intersecting the neighborhood of the segment S. That is,
we must examine whether a bucket intersects the cylindrical space C or not, which is by
no means a simple operation. The gain in efficiency through using a bucketing technique
may become insignificant. Fortunately, we have a very good approximation as described
below which allows for an efficient computation.
Fig. 5. Approximation of a neighborhood of a segment
Fill the cylindrical space C with many spheres, each just fitting into the cylinder, i.e., whose radius is equal to r. We allow intersection between spheres. Fig. 5a illustrates the situation using only a section passing through the segment S. The union of the set of spheres gives an approximation of C. When the distance d between successive sphere centers approaches zero (i.e., d → 0), the approximation is almost perfect, except at the top and bottom of C. The error of the approximation in this case is (2/3)πr³. This part is not very important because it is the farthest from the segment. Although the operation
to examine the intersection between a bucket and a sphere is simpler than between a bucket and a cylinder, it is not beneficial if we use too many spheres. What we do is go further, as illustrated in Fig. 5b. Spheres are not allowed to intersect with each other (i.e., d = 2r), with the exception of the last sphere. The center of the last sphere is always at the endpoint of S, so it may intersect with the previous sphere. The number of spheres required is equal to ⌈l/(2r) + 1⌉, where ⌈a⌉ denotes the smallest integer greater than or equal to a. It is obvious that the union of these spheres is always smaller than the
cylindrical space C. Now we replace each sphere by a cube circumscribing it and aligned
with the coordinate axes (represented by a dashed rectangle in Fig. 5b). Each cube has
a side length of 2r. It is now almost trivial to find which buckets intersect a cube. Let
the center of the cube be [x, y, z]ᵀ. Let

    imin = max[0, ⌊(x − xmin − r)/dx⌋] ,  imax = min[m − 1, ⌈(x − xmin + r)/dx − 1⌉] ,
    jmin = max[0, ⌊(y − ymin − r)/dy⌋] ,  jmax = min[m − 1, ⌈(y − ymin + r)/dy − 1⌉] ,
    kmin = max[0, ⌊(z − zmin − r)/dz⌋] ,  kmax = min[m − 1, ⌈(z − zmin + r)/dz − 1⌉] ,

where ⌊a⌋ denotes the greatest integer less than or equal to a, m is the dimension of the buckets, dx = (xmax − xmin)/m, dy = (ymax − ymin)/m, and dz = (zmax − zmin)/m. The buckets intersecting the cube are simply {Wijk | i = imin, ..., imax, j = jmin, ..., jmax, k = kmin, ..., kmax}.
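Per axis, the index range is the same clamped floor/ceiling pair; a one-axis sketch (the helper name is ours):

```python
import math

def bucket_range(c, c_min, r, dc, m):
    """Index range of the buckets (along one axis) overlapped by a cube of
    half-side r centred at coordinate c; dc is the bucket size and m the
    number of buckets per axis."""
    lo = max(0, math.floor((c - c_min - r) / dc))
    hi = min(m - 1, math.ceil((c - c_min + r) / dc - 1))
    return lo, hi

# 16 unit buckets spanning [0, 16): a cube of half-side 1.5 centred at
# x = 5.0 spans [3.5, 6.5] and so overlaps buckets 3..6.
print(bucket_range(5.0, 0.0, 1.5, 1.0, 16))   # (3, 6)
```

The max/min clamps keep the range inside the parallelepiped when the cube pokes past its boundary.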
If a segment is parallel to the x, y or z axis, it can be shown that the union of the cubes is bigger than the cylindrical space C. However, if a segment is near 45° or 135° to an axis (as shown in Fig. 5b), there are some gaps in approximating C. These gaps are
usually filled by buckets, because a whole bucket intersecting one cube is now considered
as part of the neighborhood of S.
4 Finding Planes
Several methods have been proposed in the literature to find planes. One common method
is to directly use data from range finders [6,7]. The expert system of Thonnat [8] for
scene interpretation is capable of finding planes from 3D line segments obtained from
stereo. The system being developed by Grossmann [9,10] aims at extracting, also from
3D line segments, visible surfaces including planar, cylindrical, conical and spherical
ones. In the latter system, each coplanar crossing pair of 3D line segments (i.e., they are
neither collinear nor parallel) forms a candidate plane. Each candidate plane is tested for
compatibility with the already existing planes. If compatibility is established, the existing
plane is updated by the candidate plane. In this section, we present a new method to
find planes from 3D line segments.
As we do not know how many planes there are or where they lie, we first try to find two coplanar line segments which can define a plane, and then try to find more line segments which lie in this hypothetical plane until all segments are processed. The segments in this plane are marked visited. For those unvisited segments, we repeat the above process to find new planes.
Let a segment be represented by its midpoint m, its unit direction vector u and its length l. Because segments reconstructed from stereo are always corrupted by noise, we attach to each segment an uncertainty measure (covariance matrix) of m and u, denoted by Λm and Λu. The uncertainty measure of l is not required. For two noncollinear segments (m1, u1) with (Λm1, Λu1) and (m2, u2) with (Λm2, Λu2), the coplanarity condition is

    c ≜ (m2 − m1) · (u1 ∧ u2) = 0 ,    (1)

where · and ∧ denote the dot product and the cross product of two vectors, respectively.
In reality, condition (1) is unlikely to be met exactly. Instead, we impose that |c| be less than some threshold. We determine the threshold in a dynamic manner by relating it to the uncertainty measures. The variance of c, denoted by Λc, is computed from the covariance matrices of the two segments by

    Λc = (u1 ∧ u2)ᵀ (Λm1 + Λm2) (u1 ∧ u2) + [u1 ∧ (m2 − m1)]ᵀ Λu2 [u1 ∧ (m2 − m1)]
       + [u2 ∧ (m2 − m1)]ᵀ Λu1 [u2 ∧ (m2 − m1)] .

Here we assume there is no correlation between the two segments. Since c²/Λc follows a χ² distribution with one degree of freedom, two segments are said to be coplanar if

    c²/Λc < κ ,    (2)

where κ can be chosen by looking up the χ² table such that Pr(χ² ≤ κ) = α. In our implementation, we set α = 50%, or κ = 0.5.
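The whole coplanarity decision is a few lines of numpy; the three quadratic forms below propagate the segment covariances through c to first order, assuming no cross-correlation, as in the text (helper name and example covariances are ours):

```python
import numpy as np

def coplanarity_test(m1, u1, m2, u2, L_m1, L_u1, L_m2, L_u2, kappa=0.5):
    """Chi-square test of the coplanarity measure c = (m2-m1).(u1 x u2),
    with variance obtained by first-order propagation of the segment
    covariances (no correlation between the two segments assumed)."""
    d = m2 - m1
    n = np.cross(u1, u2)
    c = d @ n
    var = (n @ (L_m1 + L_m2) @ n
           + np.cross(u2, d) @ L_u1 @ np.cross(u2, d)
           + np.cross(u1, d) @ L_u2 @ np.cross(u1, d))
    return c * c / var < kappa

L = (1e-2 ** 2) * np.eye(3)      # isotropic segment covariances (example)
# Two segments lying in the z = 0 plane pass the test...
print(coplanarity_test(np.zeros(3), np.array([1., 0, 0]),
                       np.array([0, 1., 0]), np.array([0, 1., 0]),
                       L, L, L, L))            # True
# ...two clearly skew segments fail it.
print(coplanarity_test(np.zeros(3), np.array([1., 0, 0]),
                       np.array([0, 1., 1.]), np.array([0, 1., 0]),
                       L, L, L, L))            # False
```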
As discussed in the last paragraph, the two segments used must not be collinear. Two segments are collinear if and only if the following two conditions are satisfied:

    u1 − u2 = 0 , and u1 ∧ (m2 − m1) = 0 .    (3)

The first says that two collinear segments should have the same orientation (remark: segments are oriented in our stereo system). The second says that the midpoint of the second segment lies on the first segment. In reality, of course, these conditions are rarely satisfied exactly. A treatment similar to that for coplanarity can be performed.
Once two segments are identified to lie in a single plane, we estimate the parameters of the plane. A plane is described by

    ux + vy + wz + d = 0 ,    (4)

where n = [u, v, w]ᵀ is parallel to the normal of the plane, and |d|/‖n‖ is the distance of the origin to the plane. It is clear that for an arbitrary scalar λ ≠ 0, λ[u, v, w, d]ᵀ describes the same plane as [u, v, w, d]ᵀ. Thus the minimal representation of a plane has only three parameters. One possible minimal representation is to set w = 1, which gives

    ux + vy + z + d = 0 .
However, it cannot represent planes parallel to the z-axis. To represent all planes in 3D
space, we should use three maps [11]:
    Map 1: ux + vy + z + d = 0 for planes nonparallel to the z-axis, (5)
    Map 2: x + vy + wz + d = 0 for planes nonparallel to the x-axis, (6)
    Map 3: ux + y + wz + d = 0 for planes nonparallel to the y-axis. (7)

In order to choose which map to use, we first compute an initial estimate of the plane normal n0 = u1 ∧ u2. If the two segments are parallel, n0 = u1 ∧ (m2 − m1). If the z component of n0 has a maximal absolute value, Map 1 (5) will be used; if the x component has a maximal absolute value, Map 2 (6) will be used; otherwise, Map 3 (7) will be used.
An initial estimate of d can then be computed using the midpoint of a segment. In the
sequel, we use Map 1 for explanation. The derivations are easily extended to the other
maps.
We use an extended Kalman filter [12] to estimate the plane parameters. Let the state vector be x = [u, v, d]ᵀ. We have an initial estimate x0 available, as described just above. Since this estimate is not good, we set the diagonal elements of the initial covariance matrix Λx0 to a very big number and the off-diagonal elements to zero. Suppose a 3D segment with parameters (u, m) is identified as lying in the plane. Define the measurement vector as z = [uᵀ, mᵀ]ᵀ. We have two equations relating z to x:

    f(x, z) = [ nᵀu , nᵀm + d ]ᵀ = 0 ,    (8)
where n = [u, v, 1]T. The first equation says that the segment is perpendicular to the
plane normal, and the second says that the midpoint of the segment is on the plane. In
order to apply the Kalman filter, we must linearize the above equation [13], which gives

    y = Mx + ξ ,    (9)

where y is the new measurement vector, M is the observation matrix, and ξ is the noise disturbance in y; they are given by

    y = (∂f(x̂, z)/∂x) x̂ − f(x̂, z) ,    (10)
    M = ∂f(x̂, z)/∂x ,    (11)
    ξ = −(∂f(x̂, z)/∂z) ξz ,  with  ∂f/∂z = [ nᵀ 0ᵀ ; 0ᵀ nᵀ ] ,    (12)

where 0 is the 3D zero vector and ξz is the noise disturbance in z. Now that we have two
segments which have been identified to be coplanar and an initial estimate of the plane
parameters, we can apply the extended Kalman filter based on the above formulation to
obtain a better estimate of x and its error covariance matrix Ax.
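A single measurement update of this filter can be sketched as follows. The Jacobians follow from Eq. (8) with x = [u, v, d] and z = [uᵀ, mᵀ]ᵀ, but the helper and its variable names are illustrative, not code from the paper:

```python
import numpy as np

def ekf_plane_update(x, P, useg, m, Lz):
    """One EKF measurement update of the plane parameters x = [u, v, d]
    (Map 1: u*X + v*Y + Z + d = 0) from a segment with direction `useg`
    and midpoint `m`; Lz is the 6x6 covariance of z = [useg, m]."""
    u, v, d = x
    n = np.array([u, v, 1.0])
    f = np.array([n @ useg, n @ m + d])                  # Eq. (8), ideally 0
    M = np.array([[useg[0], useg[1], 0.0],               # df/dx
                  [m[0], m[1], 1.0]])
    D = np.vstack([np.concatenate([n, np.zeros(3)]),     # df/dz, cf. Eq. (12)
                   np.concatenate([np.zeros(3), n])])
    S = M @ P @ M.T + D @ Lz @ D.T                       # innovation covariance
    K = P @ M.T @ np.linalg.inv(S)                       # Kalman gain
    return x - K @ f, (np.eye(3) - K @ M) @ P            # updated x, P
```

With a very large initial covariance and two non-parallel in-plane segments, the estimate collapses onto the true plane parameters, mirroring the initialization described above.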
Once we have estimated the parameters of the plane, we try to find more evidence
for the plane, i.e., more segments in the same plane. If a 3D segment z = [uᵀ, mᵀ]ᵀ with
covariances (Λu, Λm) lies in the plane, it must satisfy Eq. (8). Since data are noisy, we do not expect
to find a segment for which p ≡ f(x, z) is exactly 0. Instead, we compute the covariance
matrix Λp of p as follows:

    Λp = (∂f/∂x) Λx (∂f/∂x)ᵀ + (∂f/∂z) Λz (∂f/∂z)ᵀ . (13)

If pᵀ Λp⁻¹ p is small enough, then the segment is considered as lying in the plane. Since pᵀ Λp⁻¹ p follows a χ² distri-
bution with 2 degrees of freedom, we can choose an appropriate εp by looking up the χ²
table such that Pr(χ² ≤ εp) = ap. We choose ap = 50%, i.e., εp = 1.4.
a new segment in the plane, we update the plane parameters x and Ax and try to find
still more. Finally, we obtain a set of segments supporting the plane and also an estimate
of the plane parameters accounting for all these segments.
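The gating test can be sketched likewise; again a hypothetical helper, with Λp computed from Eq. (13) using the same Jacobians as in the filter above:

```python
import numpy as np

def supports_plane(x, Px, z, Lz, eps_p=1.4):
    """Chi-square gating of Eq. (8): does segment z = [useg, m] support the
    plane x = [u, v, d]?  Px, Lz are the covariances of x and z; eps_p = 1.4
    gives Pr(chi2 with 2 d.o.f. <= eps_p) of about 50%, as in the text."""
    u, v, d = x
    useg, m = z[:3], z[3:]
    n = np.array([u, v, 1.0])
    p = np.array([n @ useg, n @ m + d])                  # residual of Eq. (8)
    Fx = np.array([[useg[0], useg[1], 0.0],              # dp/dx
                   [m[0], m[1], 1.0]])
    Fz = np.vstack([np.concatenate([n, np.zeros(3)]),    # dp/dz
                    np.concatenate([np.zeros(3), n])])
    Lp = Fx @ Px @ Fx.T + Fz @ Lz @ Fz.T                 # Eq. (13)
    return float(p @ np.linalg.solve(Lp, p)) <= eps_p    # Mahalanobis test
```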
5 Experimental Results
In this section we show the results of grouping using an indoor scene. A stereo rig takes
three images, one of which is displayed in Fig. 6. After performing edge detection, edge
linking and linear segment approximation, the three images are supplied to a trinocular
stereo system, which reconstructs a 3D frame consisting of 137 3D line segments. Fig-
ure 7 shows the front view (projection on the plane in front of the stereo system and
perpendicular to the ground plane) and the top view (projection on the ground plane)
of the reconstructed 3D frame.
We then apply the bucketing technique to this 3D frame to sort segments into buckets,
which takes about 0.02 seconds of user time on a Sun 4/60 workstation. The algorithm
described in Sect. 3 is then applied, which again takes 0.02 seconds of user time to find
two clusters. They are respectively shown in Figs. 8 and 9. Comparing these with Fig. 7,
we observe that the two clusters do correspond to two geometrically distinct entities.
Fig. 8. Front and top views of the first cluster Fig. 9. Front and top views of the second cluster
Finally we apply the algorithm described in Sect. 4 to each cluster, and it takes 0.35
seconds of user time to find in total 11 planes. The four largest planes contain 17, 10,
25 and 13 segments, respectively, and they are shown in Figs. 10 to 13. Other planes
contain fewer than 7 segments, corresponding to the box faces, the table and the terminal.
From these results, we observe that our algorithm can reliably detect planes from 3D line
segments obtained from stereo, but a detected plane does not necessarily correspond to
a physical plane. The planes shown in Figs. 11 to 13 correspond respectively to segments
on the table, the wall and the door. The plane shown in Fig. 10, however, is composed of
segments from different objects, although they do satisfy the coplanarity. This is because
in our current implementation any segment in a cluster satisfying the coplanarity is
retained as a support of the plane. One possible solution to this problem is to grow
the plane by looking for segments in the neighborhood of the segments already retained as
supports of the plane.
Fig. 10. Front and top views of the first plane Fig. 11. Front and top views of the second plane
Due to space limitation, the reader is referred to [4] and [13] for application to 3D
motion determination.
Fig. 12. Front and top views of the third plane Fig. 13. Front and top views of the fourth plane
6 Conclusion
We have described how to speed up the motion determination algorithm through group-
ing. A formal analysis has been done. A speedup of O(g1 g2) can be achieved if two con-
secutive frames have been segmented into g1 and g2 groups. Grouping must be complete
in order not to miss a hypothesis in the hypothesis generation process. Two criteria sat-
isfying the completeness condition have been proposed, namely proximity, to find clusters
which are geometrically compact, and coplanarity, to find planes. Implementation details
have been described. Many real stereo data have been used to test the algorithm and
good results have been obtained. We should note that the two procedures are also useful
for scene interpretation.
References
1. F. Lustman, Vision St~r~oscopique et Perception du Mouvement en Vision Artificielle. PhD
thesis, University of Paris XI, Orsay, Paris, France, December 1987.
2. N. Ayache, Artificial Vision for Mobile Robots: Stereo Vision and Multisensory Perception.
MIT Press, Cambridge, MA, 1991.
3. Z. Zhang, O. Faugeras, and N. Ayache, "Analysis of a sequence of stereo scenes containing
multiple moving objects using rigidity constraints," in Proc. Second Int'l Conf. Comput.
Vision, (Tampa, FL), pp. 177-186, IEEE, December 1988.
4. Z. Zhang and O. D. Faugeras, "Estimation of displacements from two 3D frames obtained
from stereo," Research Report 1440, INRIA Sophia-Antipolis, 2004 route des Lucioles, F-
06565 Valbonne cedex, France, June 1991.
5. D. Lowe, Perceptual Organization and Visual Recognition. Kluwer Academic, Boston, MA,
1985.
6. W. Grimson and T. Lozano-Perez, "Model-based recognition and localization from sparse
range or tactile data," Int'l J. Robotics Res., vol. 5, pp. 3-34, Fall 1984.
7. O. Faugeras and M. Hebert, "The representation, recognition, and locating of 3D shapes
from range data," Int'l J. Robotics Res., vol. 5, no. 3, pp. 27-52, 1986.
8. M. Thonnat, "Semantic interpretation of 3-D stereo data: Finding the main structures,"
Int'l J. Pattern Recog. Artif. Intell., vol. 2, no. 3, pp. 509-525, 1988.
9. P. Grossmann, "Building planar surfaces from raw data," Technical Report R4.1.2, ESPRIT
Project P940, 1987.
10. P. Grossmann, "From 3D line segments to objects and spaces," in Proc. IEEE Conf. Com-
put. Vision Pattern Recog., (San Diego, CA), pp. 216-221, 1989.
11. O. D. Faugeras, Three-Dimensional Computer Vision. MIT Press, Cambridge, MA, 1991.
to appear.
12. P. Maybeck, Stochastic Models, Estimation and Control. Vol. 2, Academic, New York, 1982.
13. Z. Zhang, Motion Analysis from a Sequence of Stereo Frames and its Applications. PhD
thesis, University of Paris XI, Orsay, Paris, France, 1990. in English.
This article was processed using the LaTeX macro package with ECCV92 style
Hierarchical Model-Based Motion Estimation
1 Introduction
A large body of work in computer vision over the last 10 or 15 years has been con-
cerned with the extraction of motion information from image sequences. The motivation
of this work is actually quite diverse, with intended applications ranging from data com-
pression to pattern recognition (alignment strategies) to robotics and vehicle navigation.
In tandem with this diversity of motivation is a diversity of representation of motion
information: from optical flow, to affine or other parametric transformations, to 3-d ego-
motion plus range or other structure. The purpose of this paper is to describe a common
framework within which all of these computations can be represented.
This unification is possible because all of these problems can be viewed from the
perspective of image registration. That is, given an image sequence, compute a repre-
sentation of motion that best aligns pixels in one frame of the sequence with those in
the next. The differences among the various approaches mentioned above can then be
expressed as different parametric representations of the alignment process. In all cases
the function minimized is the same; the difference lies in the fact that it is minimized
with respect to different parameters.
The key features of the resulting framework (or family of algorithms) are a global
model that constrains the overall structure of the motion estimated, a local model that is
used in the estimation process 1, and a coarse-fine refinement strategy. An example of a
global model is the rigidity constraint; an example of a local model is that displacement
is constant over a patch. Coarse-fine refinement or hierarchical estimation is included in
this framework for reasons that go well beyond the conventional ones of computational
efficiency. Its utility derives from the nature of the objective function common to the
various motion models.
1.1 Hierarchical estimation
Hierarchical approaches have been used by various researchers (e.g., see [2, 10, 11, 22, 19]).
More recently, a theoretical analysis of hierarchical motion estimation was described in
1 Because this model will be used in a multiresolution data structure, it is "local" in a slightly
unconventional sense that will be discussed below.
[8] and the advantages of using parametric models within such a framework have also
been discussed in [5].
Arguments for use of hierarchical (i.e. pyramid based) estimation techniques for mo-
tion estimation have usually focused on issues of computational efficiency. A matching
process that must accommodate large displacements can be very expensive to compute.
Simple intuition suggests that if large displacements can be computed using low resolu-
tion image information great savings in computation will be achieved. Higher resolution
information can then be used to improve the accuracy of displacement estimation by
incrementally estimating small displacements (see, for example, [2]). However, it can also
be argued that it is not only efficient to ignore high resolution image information when
computing large displacements, in a sense it is necessary to do so. This is because of
aliasing of high spatial frequency components undergoing large motion. Aliasing is the
source of false matches in correspondence solutions or (equivalently) local minima in the
objective function used for minimization. Minimization or matching in a multiresolution
framework helps to eliminate problems of this type. Another way of expressing this is
to say that many sources of non-convexity that complicate the matching process are not
stable with respect to scale.
With only a few exceptions ([5, 9]), much of this work has concentrated on using a
small family of "generic" motion models within the hierarchical estimation framework.
Such models involve the use of some type of a smoothness constraint (sometimes allow-
ing for discontinuities) to constrain the estimation process at image locations containing
little or no image structure. However, as noted above, the arguments for use of a mul-
tiresolution, hierarchical approach apply equally to more structured models of image
motion.
In this paper, we describe a variety of motion models used within the same hierar-
chical framework. These models provide powerful constraints on the estimation process
and their use within the hierarchical estimation framework leads to increased accuracy,
robustness and efficiency. We outline the implementation of four new models and present
results using real images.
1.2 Motion Models
Because optical flow computation is an underconstrained problem, all motion estimation
algorithms involve additional assumptions about the structure of the motion computed.
In many cases, however, this assumption is not expressed explicitly as such; rather, it is
presented as a regularization term in an objective function [14, 16] or described primarily
as a computational issue [18, 4, 2, 20].
Previous work involving explicitly model-based motion estimation includes direct
methods [17, 21], [13] as well as methods for estimation under restricted conditions [7, 9].
The first class of methods uses a global egomotion constraint while those in the second
class of methods rely on parametric motion models within local regions. The description
"direct methods" actually applies equally to both types.
With respect to motion models, these algorithms can be divided into three categories:
(i) fully parametric, (ii) quasi-parametric, and (iii) non-parametric. Fully parametric
models describe the motion of individual pixels within a region in terms of a parametric
form. These include affine and quadratic flow fields. Quasi-parametric models involve
representing the motion of a pixel as a combination of a parametric component that is
valid for the entire region and a local component which varies from pixel to pixel. For
instance, the rigid motion model belongs to this class: the egomotion parameters constrain
the local flow vector to lie along a specific line, while the local depth value determines the
exact value of the flow vector at each pixel. By non-parametric models, we mean those
such as are commonly used in optical flow computation, i.e. those involving the use of
some type of a smoothness or uniformity constraint.
A parallel taxonomy of motion models can be constructed by considering local models
that constrain the motion in the neighborhood of a pixel and global models that describe
the motion over the entire visual field. This distinction becomes especially useful in ana-
lyzing hierarchical approaches where the meaning of "local" changes as the computation
moves through the multiresolution hierarchy. In this scheme fully parametric models are
global models, non-parametric models such as smoothness or uniformity of displacement
are local models, and quasi-parametric models involve both a global and a local model.
The reason for describing motion models in this way is that it clarifies the relationship
between different approaches and allows consideration of the range of possibilities in
choosing a model appropriate to a given situation. Purely global (or fully parametric)
models in essence trivially imply a local model so no choice is possible. However, in the
case of quasi- or non-parametric models, the local model can be more or less complex.
Also, it makes clear that by varying the size of local neighborhoods, it is possible to move
continuously from a partially or purely local model to a purely global one.
The reasons for choosing one model or another are generally quite intuitive, though
the exact choice of model is not always easy to make in a rigorous way. In general,
parametric models constrain the local motion more strongly than the less parametric
ones. A small number of parameters (e.g., six in the case of affine flow) are sufficient
to completely specify the flow vector at every point within their region of applicability.
However, they tend to be applicable only within local regions, and in many cases are
approximations to the actual flow field within those regions (although they may be very
good approximations). From the point of view of motion estimation, such models allow
the precise estimation of motion at locations containing no image structure, provided the
region contains at least a few locations with significant image structure.
Quasi-parametric models constrain the flow field less, but nevertheless constrain it
to some degree. For instance, for rigidly moving objects under perspective projection,
the rigid motion parameters (same as the egomotion parameters in the case of observer
motion), constrain the flow vector at each point to lie along a line in the velocity space.
One dimensional image structure (e.g., an edge) is generally sufficient to precisely esti-
mate the motion of that point. These models tend to be applicable over a wide region
in the image, perhaps even the entire image. If the local structure of the scene can be
further parametrized (e.g., planar surfaces under rigid motion), the model becomes fully
parametric within the region.
Non-parametric models require local image structure that is two-dimensional (e.g.,
corner points, textured areas). However, with the use of a smoothness constraint it is
usually possible to "fill-in" where there is inadequate local information. The estimation
process is typically more computationally expensive than the other two cases. These
models are more generally applicable (not requiring parametrizable scene structure or
motion) than the other two classes.
The remainder of the paper consists of an overview of the hierarchical motion estimation
framework, a description of each of the four models and their application to specific
examples, and a discussion of the overall approach and its applications.
2 Hierarchical Motion Estimation
Figure 1 describes the hierarchical motion estimation framework. The basic components
of this framework are: (i) pyramid construction, (ii) motion estimation, (iii) image warp-
ing, and (iv) coarse-to-fine refinement.
There are a number of ways to construct the image pyramids. Our implementation
uses the Laplacian pyramid described in [6], which involves simple local computations
and provides the necessary spatial-frequency decomposition.
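For illustration, a minimal and simplified version of such a pyramid might look as follows. The REDUCE kernel is the standard [1 4 6 4 1]/16 filter of [6], but the nearest-neighbour EXPAND used here is a simplification of the interpolating filter actually used, and all names are our own:

```python
import numpy as np

def gaussian_reduce(img):
    """Blur with a separable [1 4 6 4 1]/16 kernel and subsample by 2
    (the REDUCE step of the Burt-Adelson pyramid [6])."""
    k = np.array([1.0, 4.0, 6.0, 4.0, 1.0]) / 16.0
    pad = np.pad(img, 2, mode='edge')
    # separable convolution: columns, then rows
    tmp = sum(k[i] * pad[:, i:i + img.shape[1]] for i in range(5))
    low = sum(k[i] * tmp[i:i + img.shape[0], :] for i in range(5))
    return low[::2, ::2]

def expand(img, shape):
    """Upsample back to `shape` (nearest-neighbour here for brevity)."""
    return np.repeat(np.repeat(img, 2, axis=0), 2, axis=1)[:shape[0], :shape[1]]

def laplacian_pyramid(img, levels=3):
    """Each entry stores a band-pass difference image; the last entry is
    the final low-pass residual."""
    pyr = []
    for _ in range(levels):
        low = gaussian_reduce(img)
        pyr.append(img - expand(low, img.shape))
        img = low
    pyr.append(img)
    return pyr
```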
The motion estimator varies according to the model. In all cases, however, the estima-
tion process involves SSD minimization, but instead of performing a discrete search (such
as in [3]), Gauss-Newton minimization is employed in a refinement process. The basic
assumption behind SSD minimization is intensity constancy, as applied to the Laplacian
pyramid images. Thus,
    I(x, t) = I(x − u(x), t − 1) ,

where x = (x, y) denotes the spatial image position of a point, I the (Laplacian pyramid)
image intensity and u(x) = (u(x, y), v(x, y)) denotes the image velocity at that point.
The SSD error measure for estimating the flow field within a region is:

    E({u}) = Σx ( I(x, t) − I(x − u(x), t − 1) )² , (1)

where the sum is computed over all the points within the region and {u} is used to denote
the entire flow field within that region. In general this error (which is actually the sum
of individual errors) is not quadratic in terms of the unknown quantities {u}, because
of the complex pattern of intensity variations. Hence, we typically have a non-linear
minimization problem at hand.
Note that the basic structure of the problem is independent of the choice of a motion
model. The model is in essence a statement about the function u(x). To make this
explicit, we can write,
    u(x) = u(x; pm) , (2)
where pm is a vector representing the model parameters.
A standard numerical approach for solving such a problem is to apply Newton's
method. However, for errors which are sum of squares a good approximation to Newton's
method is the Gauss-Newton method, which uses a first order expansion of the individual
error quantities before squaring. If {u}i is the current estimate of the flow field during the ith
iteration, the incremental estimate {Su} can be obtained by minimizing the quadratic
error measure
    E({δu}) = Σx ( ΔI + ∇I · δu(x) )² , (3)
where
    ΔI(x) = I(x, t) − I(x − ui(x), t − 1) ,
that is the difference between the two images at corresponding pixels, after taking the
current estimate into account.
As such, the minimization problem described in Equation 3 is underconstrained. The
different motion models constrain the flow field in different ways. When these are used
to describe the flow field, the estimation problem can be reformulated in terms of the un-
known (incremental) model parameters. The details of these reformulations are described
in the various sections corresponding to the individual motion models.
The third component, image warping, is achieved by using the current values of the
model parameters to compute a flow field, and then using this flow field to warp I(t - 1)
towards I(t), which is used as the reference image. Our current warping algorithm uses
bilinear interpolation. The warped image (as against the original second image) is then
used for the computation of the error ΔI for further estimation². The spatial gradient
∇I computations are based on the reference image.
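A minimal sketch of such a warping step follows, using the convention I(x, t) = I(x − u(x), t − 1) from above; the function name and the row/column flow layout are our own assumptions, and out-of-image samples are simply clamped:

```python
import numpy as np

def warp_bilinear(img, flow_u, flow_v):
    """Warp the previous image towards the reference frame: the output at
    pixel (r, c) bilinearly samples `img` at (r - flow_v, c - flow_u)."""
    h, w = img.shape
    rr, cc = np.meshgrid(np.arange(h), np.arange(w), indexing='ij')
    r = np.clip(rr - flow_v, 0, h - 1.0)          # sample coordinates,
    c = np.clip(cc - flow_u, 0, w - 1.0)          # clamped to the image
    r0, c0 = np.floor(r).astype(int), np.floor(c).astype(int)
    r1, c1 = np.minimum(r0 + 1, h - 1), np.minimum(c0 + 1, w - 1)
    fr, fc = r - r0, c - c0
    top = (1 - fc) * img[r0, c0] + fc * img[r0, c1]
    bot = (1 - fc) * img[r1, c0] + fc * img[r1, c1]
    return (1 - fr) * top + fr * bot              # bilinear blend
```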
The final component, coarse-to-fine refinement, propagates the current motion esti-
mates from one level to the next level where they are then used as initial estimates. For
the parametric component of the model, this is easy; the values of the parameters are
simply transmitted to the next level. However, when a local model is also used, that
information is typically in the form of a dense image (or images)---e.g., a flow field or a
depth map. This image (or images) must be propagated via a pyramid expansion opera-
tion as described in [6]. The global parameters in combination with the local information
can then be used to generate the flow field necessary to perform the initial warping at
this next level.
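The four components above can be tied together in a schematic coarse-to-fine loop. This is only a skeleton: the per-level estimator, the warper and the flow-expansion operator are passed in as callables, and all names are illustrative rather than from the paper.

```python
import numpy as np

def coarse_to_fine(pyr_ref, pyr_prev, estimate_level, warp, expand_flow):
    """Coarse-to-fine refinement.  Pyramids are ordered fine -> coarse.
    estimate_level(ref, warped) returns a flow increment at one level;
    warp(img, flow) and expand_flow(flow) are supplied by the caller."""
    flow = np.zeros((2,) + pyr_ref[-1].shape)      # start at coarsest level
    for ref, prev in zip(reversed(pyr_ref), reversed(pyr_prev)):
        if flow.shape[1:] != ref.shape:
            flow = expand_flow(flow)               # propagate to next level
        warped = warp(prev, flow)                  # warp I(t-1) towards I(t)
        flow = flow + estimate_level(ref, warped)  # Gauss-Newton refinement
    return flow
```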
3 Motion Models
3.1 Affine Flow

The Model: The affine flow model describes the motion within a region as

    u(x) = X(x) a , (5)

where a denotes the vector (a1, a2, a3, a4, a5, a6)ᵀ, and

    X(x) = [ 1 x y 0 0 0 ; 0 0 0 1 x y ] . (6)

Thus, the motion of the entire region is completely specified by the parameter vector a,
which is the unknown quantity that needs to be estimated.
The Estimation Algorithm: Let ai denote the current estimate of the affine param-
eters. After using the flow field represented by these parameters in the warping step, an
incremental estimate δa can be determined. To achieve this, we insert the parametric
form of δu = X δa into Equation 3, obtain an error measure that is a function of δa,
and minimize it, which leads to the equation

    [ Σ Xᵀ (∇I)(∇I)ᵀ X ] δa = − Σ Xᵀ (∇I) ΔI . (7)
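A single Gauss-Newton step of this kind can be sketched as a least-squares solve. The helper below is illustrative; Ix, Iy are assumed to be the spatial gradients of the reference image and It the warped difference image ΔI:

```python
import numpy as np

def affine_increment(Ix, Iy, It):
    """One Gauss-Newton step for the affine parameters: solves
    [sum X^T (grad I)(grad I)^T X] da = -sum X^T (grad I) dI, with
    X(x) = [[1, x, y, 0, 0, 0], [0, 0, 0, 1, x, y]]."""
    h, w = It.shape
    yy, xx = np.meshgrid(np.arange(h), np.arange(w), indexing='ij')
    # g = X^T (grad I) per pixel, stacked as an (h*w, 6) design matrix
    G = np.stack([Ix, Ix * xx, Ix * yy, Iy, Iy * xx, Iy * yy],
                 axis=-1).reshape(-1, 6)
    A = G.T @ G
    b = -G.T @ It.ravel()
    return np.linalg.lstsq(A, b, rcond=None)[0]   # increments (da1..da6)
```

On synthetic data where ΔI is generated exactly from an affine flow and textured gradients, the step recovers the true parameters in one solve, since the linearized system is then exact.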
2 We have avoided using the standard notation k in order to avoid any confusion about this
point.
3.2 Planar Surface Flow

The Model: It is generally known that the instantaneous motion of a planar surface
undergoing rigid motion can be described as a second order function of image coordinates
involving eight independent parameters (e.g., see [15]). In this section we provide a brief
derivation of this description and make some observations concerning its estimation.
We begin by observing that the image motion induced by a rigidly moving object (in
this case a plane), can be written as:
    u(x) = (1/Z(x)) A(x) t + B(x) ω , (8)

where Z(x) is the distance from the camera of the point (i.e., depth) whose image position
is (x), and

    A(x) = [ −f 0 x ; 0 −f y ] ,   B(x) = [ (xy)/f −(f² + x²)/f y ; (f² + y²)/f −(xy)/f −x ] .
The A and the B matrices depend only on the image positions and the focal length f
and not on the unknowns: t, the translation vector, w the angular velocity vector, and
Z.
A planar surface can be described by the equation

    k1 X + k2 Y + k3 Z = 1 ,

where (k1, k2, k3) relate to the surface slant, tilt, and the distance of the plane from
the origin of the chosen coordinate system (in this case, the camera origin). Dividing
throughout by Z, we get

    1/Z = k1 x/f + k2 y/f + k3 .

Using k to denote the vector (k1, k2, k3) and r to denote the vector (x/f, y/f, 1) we obtain

    1/Z(x) = r(x)ᵀ k .

Substituting this into Equation 8 yields a flow field that is a second order polynomial in
the image coordinates, where the 8 coefficients (a1, ..., a8) are functions of the motion
parameters t, ω and the surface parameters k. Since this 8-parameter form is rather
well-known (e.g., see [15]) we omit its details.
If the egomotion parameters are known, then the three parameter vector k can be used
to represent the motion of the planar surface. Otherwise the 8-parameter representation
can be used. In either case, the flow field is linear in the unknown parameters.
The problem of estimating planar surface motion has been extensively stud-
ied before [21, 1, 23]. In particular, Negahdaripour and Horn [21] suggest iterative meth-
ods for estimating the motion and the surface parameters, as well as a method of estimat-
ing the 8 parameters and then decomposing them into the five rigid motion parameters
and the three surface parameters in closed form. Besides the embedding of these computations
within the hierarchical estimation framework, we also take a slightly different approach
to the problem.
We assume that the rigid motion parameters are already known or can be estimated
(e.g., see Section 3.3 below). Then, the problem reduces to that of estimating the three
surface parameters k. There are several practical reasons to prefer this approach: First, in
many situations the rigid motion model may be more globally applicable than the planar
surface model, and can be estimated using information from all the surfaces undergoing
the same rigid motion. Second, unless the region of interest subtends a significant field
of view, the second order components of the flow field will be small, and hence the
estimation of the eight parameters will be inaccurate and the process may be unstable.
On the other hand, the information concerning the three parameters k is contained in the
first order components of the flow field, and (if the rigid motion parameters are known)
their estimation will be more accurate and stable.
The Estimation Algorithm: Writing the flow in incremental form,

    δu = u − u0
       = (A(x)t) (r(x)ᵀ(k0 + δk)) + B(x)ω − [ (A(x)t) (r(x)ᵀk0) + B(x)ω ]
       = (A(x)t) r(x)ᵀ δk , (12)

and inserting this in Equation 3, we can obtain the incremental estimate δk as the vector
that minimizes the resulting quadratic error; setting its derivative to zero leads to

    [ Σ r (tᵀAᵀ)(∇I)(∇I)ᵀ(At) rᵀ ] δk = − Σ r (tᵀAᵀ)(∇I) ΔI . (14)
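With t known, these normal equations reduce to a 3×3 solve for δk. A sketch follows, under the A(x) sign convention assumed above; the helper name, and the abbreviation s = (∇I)ᵀA(x)t, are our own:

```python
import numpy as np

def plane_increment(Ix, Iy, It, t, f=1.0):
    """One Gauss-Newton step for the surface parameters k with known
    translation t: solves [sum s^2 r r^T] dk = -sum s dI r, where
    s = (grad I)^T A(x) t and r = (x/f, y/f, 1)."""
    h, w = It.shape
    yy, xx = np.meshgrid(np.arange(h, dtype=float),
                         np.arange(w, dtype=float), indexing='ij')
    # s: per-pixel scalar (grad I)^T A(x) t, with A(x) = [[-f,0,x],[0,-f,y]]
    s = Ix * (-f * t[0] + xx * t[2]) + Iy * (-f * t[1] + yy * t[2])
    r = np.stack([xx / f, yy / f, np.ones_like(xx)], axis=-1)  # (h, w, 3)
    M = np.einsum('hw,hwi,hwj->ij', s * s, r, r)               # 3x3 system
    b = -np.einsum('hw,hwi->i', s * It, r)
    return np.linalg.solve(M, b)
```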
3.3 Rigid Body Model

The Model: The motion of arbitrary surfaces undergoing rigid motion cannot usually
be described by a single global model. We can however make use of the global rigid body
model if we combine it with a local model of the surface. In this section, we provide a
brief derivation of the global and the local models. Hanna [12] provides further details
and results, and also describes how the local and global models interact at corner-like
and edge-like image structures.
As described in Section 3.2, the image motion induced by a rigidly moving object
can be written as:

    u(x) = (1/Z(x)) A(x) t + B(x) ω , (15)
where Z(x) is the distance from the camera of the point (i.e., its depth), whose image
position is (x), and
    A(x) = [ −f 0 x ; 0 −f y ] ,   B(x) = [ (xy)/f −(f² + x²)/f y ; (f² + y²)/f −(xy)/f −x ] .
The A and the B matrices depend only on the image positions and the focal length
f and not on the unknowns: t, the translation vector, ω, the angular velocity vector, and
Z. Equation 15 relates the parameters of the global model, ω and t, with parameters of
the local scene structure, Z(x).
A local model we use is the frontal-planar model, which means that over a local image
patch, we assume that Z(x) is constant. An alternative model uses the assumption that
6Z(x)--the difference between a previous estimate and a refined estimate--is constant
over each local image patch.
We refine the local and global models in turn, using initial estimates of the local struc-
ture parameters, Z(x), and the global rigid body parameters ω and t. This local/global
refinement is iterated several times to refine the parameters of the local and global models.
We now show how these models are refined.
We begin by writing Equation 15 in an incremental form, so that

    δu(x) = (1/Z(x)) A(x) t + B(x) ω − (1/Zi(x)) A(x) ti − B(x) ωi . (16)

Inserting this parametric form of δu into Equation 3 we obtain the pixel-wise error as

    E(t, ω, 1/Z(x)) = ( ΔI + (∇I)ᵀ δu(x) )² . (17)

Differentiating Equation 17 with respect to 1/Z(x), summing over a 5×5 window, and
setting the result to zero, we get

    1/Z(x) = − Σ5×5 (∇I)ᵀ(At) ( ΔI − (∇I)ᵀ(A ti)/Zi(x) + (∇I)ᵀBω − (∇I)ᵀBωi )
             / Σ5×5 ((∇I)ᵀ(At))² . (19)
To refine the global model, we minimize the error in Equation 17 summed over the
entire image:

    Eglobal = ΣImage E(t, ω, 1/Z(x)) . (20)
We insert the expression for 1/Z(x) given in Equation 19, not the current numerical
value of the local parameter, into Equation 20. The result is an expression for Eglobal
that is non-quadratic in t but quadratic in ω. We recover refined estimates of t and ω
by performing one Gauss-Newton minimization step using the previous estimates of the
global parameters, ti and ωi, as starting values. Expressions are evaluated numerically
at t = ti and ω = ωi.
We then repeat the estimation algorithm several times at each image resolution. A
table of rigid-body motion parameters that were recovered at the end of each resolution
of analysis is given with the experimental results.
More experimental results and a detailed discussion of the algorithm's performance
on various types of scenes can be found in [12].
3.4 General Flow

Here the local model is simply that the flow is constant within a small window: the error
of Equation 3 is computed within a 5×5 window around each point. Minimizing this
error with respect to δu leads to the equation

    [ Σ5×5 (∇I)(∇I)ᵀ ] δu = − Σ5×5 (∇I) ΔI .
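Each window therefore yields a 2×2 linear system per pixel. A sketch, with an illustrative helper name; least squares is used so that rank-deficient (aperture-limited) windows do not fail outright:

```python
import numpy as np

def flow_increment(Ix, Iy, It, r, c, half=2):
    """Local flow increment at pixel (r, c) from a (2*half+1)^2 window
    (5x5 for half=2): solves [sum (grad I)(grad I)^T] du = -sum (grad I) dI."""
    sl = (slice(r - half, r + half + 1), slice(c - half, c + half + 1))
    gx, gy, dt = Ix[sl].ravel(), Iy[sl].ravel(), It[sl].ravel()
    A = np.array([[gx @ gx, gx @ gy],       # 2x2 gradient outer-product sum
                  [gx @ gy, gy @ gy]])
    b = -np.array([gx @ dt, gy @ dt])
    return np.linalg.lstsq(A, b, rcond=None)[0]
```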
We make some observations concerning the singularities of this relationship. If the sum-
ming window consists of a single element, the 2×2 matrix on the left-hand side is an
outer product of a 2×1 vector and hence has a rank of at most unity. In our case, when
the summing window consists of 25 points, the rank of the matrix on the left-hand side
will be two unless the directions of the gradient vectors ∇I everywhere within the window
coincide. This situation is the general case of the aperture effect.
In our implementation of this technique, the flow estimate at each point is obtained by
using a 5×5 window centered around that point. This amounts to assuming implicitly
that the flow field varies smoothly over the image.
Experiments with the general flow model: We demonstrate the general flow algo-
rithm on an image sequence containing several independently moving objects, a case for
which the other motion models described here are not applicable. Figure 5a shows one
image of the original sequence. Figure 5b shows the difference between the two frames
that were used to compute image flow. Figure 5c shows little difference between the com-
pensated image and the other original image. Figure 5d shows the horizontal component
of the computed flow field, and figure 5e shows the vertical component. In local image
regions where image structure is well-defined, and where the local image motion is sim-
ple, the recovered motion estimates appear plausible. Errors predictably occur however
at motion boundaries. Errors also occur in image regions where the local image structure
is not well-defined (like some parts of the road), but for the same reason, such errors do
not appear as intensity errors in the compensated difference image.
4 Discussion
Thus far, we have described a hierarchical framework for the estimation of image motion
between two images using various models. Our motivation was to generalize the notion
of direct estimation to model-based estimation and unify a diverse set of model-based
estimation algorithms into a single framework. The framework also supports the combined
use of parametric global models and local models which typically represent some type of
a smoothness or local uniformity assumption.
One of the unifying aspects of the framework is that the same objective function
(SSD) is used for all models, but the minimization is performed with respect to different
parameters. As noted in the introduction, this is enabled by viewing all these problems
from the perspective of image registration.
It is interesting to contrast this perspective (of model-based image registration) with
some of the more traditional approaches to motion analysis. One such approach is to
compute image flow fields, which involves combining the local brightness constraint with
some sort of a global smoothness assumption, and then interpret them using appropriate
motion models. In contrast, the approach taken here is to use the motion models to
constrain the flow field computation. The obvious benefit of this is that the resulting
flow fields may generally be expected to be more consistent with models than general
smooth flow fields. Note, however, that the framework also includes general smooth flow
field techniques, which can be used if the motion model is unknown.
In the case of models that are not fully parametric, local image information is used to
determine local image/scene properties (e.g., the local range value). However, the accu-
racy of these can only be as good as the available local image information. For example,
in homogeneous areas of the scene, it may be possible to achieve perfect registration even
if the surface range estimates (and the corresponding local flow vectors) are incorrect.
However, in the presence of significant image structures, these local estimates may be
expected to be accurate. On the other hand, the accuracy of the global parameters (e.g.,
the rigid motion parameters) depends only on having sufficient and sufficiently diverse
local information across the entire region. Hence, it may be possible to obtain reliable
estimates of these global parameters, even though estimated local information may not
be reliable everywhere within the region. For fully parametric models, this problem does
not exist.
The image registration problem addressed in this paper occurs in a wide range of
image processing applications, far beyond the usual ones considered in computer vision
(e.g., navigation and image understanding). These include image compression via motion
compensated encoding, spatiotemporal analysis of remote sensing type of images, image
database indexing and retrieval, and possibly object recognition. One way to state this
general problem is as that of recovering the coordinate transformation that relates two images of
a scene taken from two different viewpoints. In this sense, the framework proposed here
unifies motion analysis across these different applications as well.
References
1. G. Adiv. Determining three-dimensional motion and structure from optical flow generated
by several moving objects. IEEE Trans. on Pattern Analysis and Machine Intelligence,
7(4):384-401, July 1985.
2. P. Anandan. A unified perspective on computational techniques for the measurement of
visual motion. In International Conference on Computer Vision, pages 219-230, London,
May 1987.
3. P. Anandan. A computational framework and an algorithm for the measurement of visual
motion. International Journal of Computer Vision, 2:283-310, 1989.
4. J. R. Bergen and E. H. Adelson. Hierarchical, computationally efficient motion estimation
algorithm. J. Opt. Soc. Am. A., 4:35, 1987.
5. J. R. Bergen, P. J. Burt, R. Hingorani, and S. Peleg. Computing two motions from three
frames. In International Conference on Computer Vision, Osaka, Japan, December 1990.
6. P. J. Burt and E. H. Adelson. The Laplacian pyramid as a compact image code. IEEE
Transactions on Communications, 31:532-540, 1983.
7. P. J. Burt, J. R. Bergen, R. Hingorani, R. Kolczynski, W. A. Lee, A. Leung, J. Lubin, and
H. Shvaytser. Object tracking with a moving camera, an application of dynamic motion
analysis. In IEEE Workshop on Visual Motion, pages 2-12, Irvine, CA, March 1989.
8. P. J. Burt, R. Hingorani, and R. J. Kolczynski. Mechanisms for isolating component patterns
in the sequential analysis of multiple motion. In IEEE Workshop on Visual Motion,
pages 187-193, Princeton, NJ, October 1991.
9. Stefan Carlsson. Object detection using model based prediction and motion parallax. In
Stockholm workshop on computational vision, Stockholm, Sweden, August 1989.
10. J. Dengler. Local motion estimation with the dynamic pyramid. In Pyramidal systems for
computer vision, pages 289-298, Maratea, Italy, May 1986.
11. W. Enkelmann. Investigations of multigrid algorithms for estimation of optical flow fields in
image sequences. Computer Vision, Graphics, and Image Processing, 43:150-177, 1988.
12. K. J. Hanna. Direct multi-resolution estimation of ego-motion and structure from motion.
In Workshop on Visual Motion, pages 156-162, Princeton, NJ, October 1991.
13. J. Heel. Direct estimation of structure and motion from multiple frames. Technical Report
1190, MIT AI LAB, Cambridge, MA, 1990.
14. E. C. Hildreth. The Measurement of Visual Motion. The MIT Press, 1983.
15. B. K. P. Horn. Robot Vision. MIT Press, Cambridge, MA, 1986.
16. B. K. P. Horn and B. G. Schunck. Determining optical flow. Artificial Intelligence, 17:185-
203, 1981.
17. B. K. P. Horn and E. J. Weldon. Direct methods for recovering motion. International
Journal of Computer Vision, 2(1):51-76, June 1988.
18. B.D. Lucas and T. Kanade. An iterative image registration technique with an application
to stereo vision. In Image Understanding Workshop, pages 121-130, 1981.
19. L. Matthies, R. Szeliski, and T. Kanade. Kalman filter-based algorithms for estimating
depth from image sequences. In International Conference on Computer Vision, pages 199-
213, Tampa, FL, 1988.
20. H. H. Nagel. Displacement vectors derived from second order intensity variations in in-
tensity sequences. Computer Vision, Pattern recognition and Image Processing, 21:85-117,
1983.
21. S. Negahdaripour and B.K.P. Horn. Direct passive navigation. IEEE Trans. on Pattern
Analysis and Machine Intelligence, 9(1):168-176, January 1987.
22. A. Singh. An estimation theoretic framework for image-flow computation. In International
Conference on Computer Vision, Osaka, Japan, November 1990.
23. A.M. Waxman and K. Wohn. Contour evolution, neighborhood deformation and global
image flow: Planar surfaces in motion. International Journal of Robotics Research, 4(3):95-
108, Fall 1985.
Resolution   ω                         T
(true)       (.0000, .0000, .0000)     (.0000, .0000, 1.0000)
32 x 30      (.0027, .0039, -.0001)    (-.3379, -.1352, .9314)
64 x 60      (.0038, .0041, .0019)     (-.3319, -.0561, .9416)
128 x 120    (.0037, .0012, .0008)     (-.0660, -.0383, .9971)
256 x 240    (.0029, .0006, .0013)     (-.0255, -.0899, .9956)
V. Sundareswaran*
1 Introduction
We consider the motion of a sensor in a rigid, static environment. The motion produces a
sequence of images containing the changing scene. We want to estimate the motion of the
sensor, given the optical flow fields computed from the sequence. We model the motion
using a translational velocity T and a rotational velocity w. These are the instantaneous
motion parameters.
Many procedures exist to compute the optical flow field [1,4]. Also, several methods
have been proposed to compute the motion parameters from the optical flow field. One
feature of most of these methods is that they operate locally. Recovering structure, which
is contained in local information, seems to be the motivation for preferring local methods.
However, the motion parameters are not local and they are better estimated by employing
global techniques. In addition, using more data usually results in better performance in
the presence of noise. Non-local algorithms are given in [3] and [8], and more recently, in
[6]. The algorithm presented in [3] requires search over grid points on a unit sphere. The
method of Prazdny [8] is based on a non-linear minimization. Faster methods have been
presented recently [7,10]. Though all these methods work well on noiseless flow fields,
there is insufficient data about their performance on real images. The work in this paper
has been motivated by the observation that making certain approximations to an exact
procedure gives a method that produces robust results from real data.
The algorithm presented here determines the location of the focus of expansion (FOE)
which is simply the projection of the translation vector T on the imaging plane. It is well
known that once the FOE is located, the rotational parameters can be computed from
the optical flow equations [2]. Alternative methods to directly compute the rotational
parameters from the flow field have also been proposed [11]. We begin by reviewing the
flow equations, then describe the algorithm and present experimental results.
2 The flow equations
We consider the case of motion of a sensor in a static environment. We choose the
coordinate system to be centered at the sensor which uses perspective projection for
* Supported under Air Force contract F33615-89-C-1087 reference 87-02-PMRE. The author
wishes to thank Bob Hummel for his guidance.
imaging onto a planar image surface (Fig. 1). The sensor moves with a translational
velocity of T = (v1, v2, v3) and an angular velocity of ω = (ω1, ω2, ω3).
The transformation from spatial coordinates to the image coordinates is given by the
equations
x = fX/Z,  y = fY/Z
where (X, Y, Z) = (X(x, y), Y(x, y), Z(x, y)) is the position of the point in three-space
that is imaged at (x, y) and f is the focal length. The optical flow V = (u, v) at the
image point (x, y) is easily obtained [2,5,9]:

u(x, y) = (x v3 - f v1)/Z + ω1 x y / f - ω2 (f + x^2/f) + ω3 y,     (1)
v(x, y) = (y v3 - f v2)/Z + ω1 (f + y^2/f) - ω2 x y / f - ω3 x.
Here, u(x, y) and v(x, y) are the x and y components of the optical flow field V(x, y).
In this context, we are interested in determining the location (τ = f v1/v3, η = f v2/v3),
which is nothing but the projection of the translational velocity T onto the image plane.
This location is referred to as the Focus of Expansion (FOE).
Looking at Eqn. 1, we note that the vector flow field V(x, y) is simply the sum of the
vector field V_T(x, y) arising from the translation T and the vector field V_ω(x, y) due to
the rotation ω:

V(x, y) = V_T(x, y) + V_ω(x, y).
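As a concrete illustration of these instantaneous-motion equations, the following sketch (the function name is ours, not from the paper; the standard perspective flow equations with translation T and rotation ω are assumed) evaluates the two flow components at an image point:

```python
def flow_field(x, y, Z, T, w, f=1.0):
    """Optical flow (u, v) at image point (x, y) with depth Z, for
    translation T = (v1, v2, v3) and rotation w = (w1, w2, w3).
    A sketch of the standard instantaneous-motion equations (Eqn. 1)."""
    v1, v2, v3 = T
    w1, w2, w3 = w
    # translational part: depends on depth, radial about the FOE (f*v1/v3, f*v2/v3)
    uT = (x * v3 - f * v1) / Z
    vT = (y * v3 - f * v2) / Z
    # rotational part: independent of depth
    uR = w1 * x * y / f - w2 * (f + x**2 / f) + w3 * y
    vR = w1 * (f + y**2 / f) - w2 * x * y / f - w3 * x
    return uT + uR, vT + vR
```

With ω = 0 the field is purely radial about the FOE and vanishes there, which is the property the NCC algorithm below exploits.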
3 Algorithm Description
The observation behind the algorithm is that a certain circular component computed from
the flow field by choosing a center (x0, y0) is a scalar function whose norm is quadratic in
the two variables x0 and y0. The norm is zero (in the absence of noise) at the FOE. This
procedure will be referred to as the Norm of the Circular Component (NCC) algorithm.
3.1 The circular component
For each candidate (x0, y0), we consider the circular component of the flow field about
(x0, y0), i.e., the projection of the flow onto the tangent direction of circles centered at
(x0, y0), U_(x0,y0) = U^T_(x0,y0) + U^ω_(x0,y0), defined by:

U^T_(x0,y0)(x, y) = V_T(x, y) · (-(y - y0), (x - x0))     (3)

and

U^ω_(x0,y0)(x, y) = V_ω(x, y) · (-(y - y0), (x - x0)).     (4)

At the focus of expansion, when (x0, y0) = (τ, η), the translational flow V_T(x, y) is
parallel to (x - τ, y - η), so that

U^T_(τ,η)(x, y) = 0,     (5)

and U_(x0,y0) = U^ω_(x0,y0) for (x0, y0) = (τ, η). Eqn. 5 is merely a result of the radial
structure of the translational component of the flow field. In other words, pure translation
produces a field that is orthogonal to concentric circles drawn with the FOE as the center.
Observations about the quadratic nature of U^ω_(x0,y0) (Eqn. 4) lead to the convolution
and subspace projection methods described in [6]. Here, we obtain a method that is
approximate but is quick and robust.
To this end, we define an error function E(x0, y0) as the norm of U_(x0,y0)(x, y):

E(x0, y0) = ||U_(x0,y0)(x, y)||^2.
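In a discrete setting this error is simply the sum of squared circular components over the flow samples; a minimal sketch (our own helper, assuming NumPy arrays of sample coordinates and flow components):

```python
import numpy as np

def ncc_error(x, y, u, v, x0, y0):
    """Norm of the circular component of the flow (u, v) about (x0, y0):
    E(x0, y0) = sum over samples of [-(y - y0) u + (x - x0) v]^2.
    For a purely radial (translational) field this is zero at the FOE."""
    circ = -(y - y0) * u + (x - x0) * v
    return np.sum(circ**2)
```

Note that a sparse flow field poses no difficulty: the sum simply runs over whatever samples are available.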
3.2 The NCC algorithm
The first step is to choose six sets of values for (x0, y0) in a non-degenerate configuration
(in this case, non-collinear). Next, for each of these candidates, compute the circular
component and define E(xo, yo) to be the norm of the circular component (NCC). In a
discrete setting, the error value is simply the sum of the squares of the circular component
values. Note that this can be done even in the case of a sparse flow field. The error
function values at these six points completely define the error surface because of its
quadratic nature and so the location of the minimum can be found using a closed-form
expression. That location is the computed FOE.
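The closed-form step can be sketched as follows: six error samples at non-collinear candidates determine the six coefficients of the quadratic surface by a linear solve, and the minimum follows from setting the gradient to zero (the function name is illustrative, not from the paper):

```python
import numpy as np

def fit_quadratic_min(points, values):
    """Fit E = a x^2 + b x y + c y^2 + d x + e y + g through six samples
    in a non-degenerate configuration, then return the stationary point
    in closed form by solving grad E = 0."""
    A = np.array([[x * x, x * y, y * y, x, y, 1.0] for x, y in points])
    a, b, c, d, e, g = np.linalg.solve(A, np.asarray(values, float))
    H = np.array([[2 * a, b], [b, 2 * c]])   # Hessian of the quadratic
    return np.linalg.solve(H, -np.array([d, e]))
```

In the NCC algorithm the six values would be E(x0, y0) evaluated at six non-collinear candidate centers, and the returned point is the computed FOE.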
Let us now examine the claim about the minimum being at the location of the FOE.
Note that the function U_(x0,y0)(x, y) is made up of two parts; one is the translational
part shown in Eqn. 3, and the other is the rotational part (Eqn. 4). The translational
part U^T_(x0,y0)(x, y) vanishes at the FOE, as shown in Eqn. 5, and it is non-zero elsewhere.
Thus, the norm ||U^T_(x0,y0)(x, y)||^2 is positive quadratic with minimum (equal to zero) at
the FOE. This is no longer true once we add the rotational part. However, as long as the
contribution from the rotational part is small compared to that from the translational
part, we can approximate the behavior of ||U_(x0,y0)(x, y)||^2 by ||U^T_(x0,y0)(x, y)||^2.
The method is exact for pure translation and is approximate when the rotation is small
compared to the translation or when the depth of objects is small (i.e., high inverse depth p(x, y)), as
would be the case in indoor situations. Also, there is no apparent reason for this method
to fail in the case where a planar surface occupies the whole field of view. Previous
methods [7,10] are known to fail in such a case. Indeed, in two experiments reported
here, a large portion of the view contains a planar surface. In all experiments done with
synthetic as well as actual data, this algorithm performs well. We present results from
actual image sequences here.
4 Experiments
For all the sequences used in the experiments, the flow field was computed using an
implementation of Anandan's algorithm [1]. The dense flow field thus obtained (on a 128
by 128 grid) is used as input to the NCC algorithm. The execution time per frame is on
average less than 0.45 seconds for a casual implementation on a SUN Sparcstation-2.
The helicopter sequences, provided by NASA, consist of frames shot from a moving
helicopter that is flying over a runway. For the straight line motion, the helicopter has
a predominantly forward motion, with little rotation. The turning flight motion has
considerable rotation. The results of applying the circular component algorithm to these
sequences are shown in Figure 2 for ten frames (nine flow fields). This is an angular
error plot, the angular error being the angle between the actual and computed directions
of translation. The errors are below 6 degrees for all the frames of the straight flight
sequence. Notice the deterioration in performance towards the end of the turning flight
sequence due to the high rotation (about 0.15 rad/sec).
The results from a third sequence (titled ridge, courtesy David Heeger) are shown in
Figure 2. Only frames 10 through 23 are shown because the actual translation data was
readily available only for these frames. In this sequence, the FOEs are located relatively
high above the optical axis. Such sequences are known to be hard for motion parameter
estimation because of the confounding effect between the translational and rotational
parameters (see the discussion in [6]). The algorithm presented here performs extremely
well, in spite of this adverse situation.
5 Conclusions
References
Fig. 1. The coordinate systems and the motion parameters
Fig. 2. Angular error plots for the helicopter sequences (left: straight line flight in solid line and
turning flight in dotted line) and the ridge sequence (right)
Identifying multiple motions from optical flow *
Abstract. This paper describes a method which uses optical flow, that is,
the apparent motion of the image brightness pattern in time-varying images,
in order to detect and identify multiple motions. Homogeneous regions are
found by analysing local linear approximations of optical flow over patches
of the image plane, which determine a list of the possibly viewed motions,
and, finally, by applying a technique of stochastic relaxation. The presented
experiments on real images show that the method is usually able to identify
regions which correspond to the different moving objects, is also rather
insensitive to noise, and can tolerate large errors in the estimation of optical
flow.
1 Introduction
by means of relaxation techniques was first proposed in [2,8]. The presented method has
several very good features. Firstly, although accurate pointwise estimates of optical flow
are difficult to obtain, the spatial coherence of optical flow appears to be particularly well
suited for a qualitative characterisation of regions which correspond to the same moving
surface independently of the complexity of the scene. Secondly, even rather cluttered
scenes are segmented into a small number of parts. Thirdly, the computational load is
almost independent of the data. Lastly, the choice of the method for the computation
of optical flow is hardly critical since the proposed algorithm is insensitive to noise and
tolerates large differences in the flow estimates.
The paper is organised as follows. Section 2 discusses the approximation of optical
flow in terms of linear vector fields. In Section 3, the proposed method is described in
detail. Section 4 presents the experimental results which have been obtained on sequences
of real images. The main differences between the proposed method and previous schemes
are briefly discussed in Section 5. Finally, the conclusions are summarised in Section 6.
2 Linear approximations of optical flow

The interpretation of optical flow over small regions of the image plane is often ambiguous
[9]. Let us discuss this fact in some detail by looking at a simple example of a sequence
of real images.
Fig. 1A shows a frame of a sequence in which the camera is moving toward a picture
posted on the wall. The angle between the optical axis and the direction orthogonal to
the wall is 30°. The optical flow which is obtained by applying the method described in
[10] to the image sequence and relative to the frame of Fig. 1A is shown in Fig. 1B. It
is evident that the qualitative structure of the estimated optical flow is correct. It can
be shown [7] that the accuracy with which the optical flow of Fig. 1B and its first order
properties can be estimated is sufficient to recover quantitative information, like depth
and slant of the viewed planar surface. The critical assumption that makes it possible to
extract reliable quantitative information from optical flow is that the relative motion is
known to be rigid and translational.
In the absence of similar "a priori" information (or in the presence of more complex
scenes) the interpretation of optical flow estimates is more difficult. In this case, a local
analysis of the spatial properties of optical flow could be deceiving. Fig. 1C, for example,
shows the vector field which has been obtained by dividing the image plane in 64 non-
overlapping squared patches of 32 x 32 pixels and computing the linear rotating vector
field which best approximates the optical flow of Fig. 1B over each patch. Due to the
presence of noise and to the simple spatial structure of optical flow, the correlation
coefficient of this "bizarre" local approximation is very high. On a simple local (and
deterministic) basis there is little evidence that the vector field of Fig. 1B is locally
expanding. However, a more coherent interpretation can be found by looking at the
distributions of Fig. 1D. The squares locate the foci of expansion of the linear expanding
vector fields which best approximate the estimated optical flow in each patch, while the
crosses locate the centers of rotation of the rotating vector field which have been used to
produce the vector field of Fig. 1C. It is evident that while the foci of expansion tend to
clusterise in the neighbourhood of the origin of the image plane (identified by the smaller
frame), the centers of rotation are spread around. This observation lies at the basis of the
method for the identification of multiple motion which is described in the next Section.
Fig. 1. A) A frame of a sequence in which the viewing camera is moving toward a picture
posted on the wall. The angle between the normal vector to the wall and the optical axis is 30°.
B) Optical flow computed by means of the method described in [10] associated with the frame
of A. C) The optical flow which is obtained dividing the field of view in 64 squared patches
(32 x 32 pixels each) and computing the linear rotating field which best approximates the
optical flow of B in each patch (in the least mean square sense). D) Distributions of the foci of
expansion (squares) and centers of rotation (crosses) of the linear expanding and clockwise rotating
vector fields respectively which lie within an area four times larger than the field of view (identified by
the solid frame).

3 A method for detecting multiple motions

In this Section a method for detecting and identifying multiple motions from optical flow
is proposed. The three main steps of the method are discussed separately by looking at
an example of a synthetic image sequence.

3.1 Computing linear approximations of optical flow

Fig. 2A shows a frame of a computer generated sequence in which the larger sphere is
translating toward the image plane, while the smaller sphere is moving away and the
background is rotating. The optical flow relative to the frame of Fig. 2A, and computed
through a procedure described in [10], is shown in Fig. 2B. In order to identify the
different motions in Fig. 2B, the first step of the method analyses the first
order spatial properties of optical flow. The optical flow is divided into patches of fixed
size and the expanding (EVF), contracting (CVF), clockwise (CRVF) and anticlockwise
(ARVF) rotating, and constant (TVF) vector fields which best approximate the optical
flow in each patch si, i = 1, ..., N, are computed. Roughly speaking, this is equivalent to
reducing the possible 3D motions to translation in space with a fairly strong component
along the optical axis (EVF and CVF), rotation around an axis nearly orthogonal to
the image plane (CRVF and ARVF), and translation nearly parallel to the image plane
(TVF). This choice, which is somewhat arbitrary and incomplete, does not allow an
accurate recovery of 3D motion and structure (the shear terms, for example, are not
taken into account), but usually appears to be sufficient in order to obtain a qualitative
segmentation of the viewed image in the different moving objects (see Section 4).
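Fitting one of these elementary fields to a patch is a small linear least-squares problem. As an illustration, a sketch for the expanding field under the simple parameterisation u = a(x − x0), v = a(y − y0) (an assumption of ours; the paper does not give its fitting procedure):

```python
import numpy as np

def fit_expanding(x, y, u, v):
    """Least-squares fit of a linear expanding field u = a (x - x0),
    v = a (y - y0) to flow samples; returns (a, (x0, y0)).
    a > 0 corresponds to expansion (EVF), a < 0 to contraction (CVF)."""
    n = len(x)
    # unknowns p = (a, bx, by) in the linear model u = a x + bx, v = a y + by
    A = np.zeros((2 * n, 3))
    A[:n, 0], A[:n, 1] = x, 1.0
    A[n:, 0], A[n:, 2] = y, 1.0
    rhs = np.concatenate([u, v])
    a, bx, by = np.linalg.lstsq(A, rhs, rcond=None)[0]
    # the singular point (focus of expansion/contraction) of the fitted field
    return a, (-bx / a, -by / a)
```

Analogous three-parameter fits give the rotating fields (CRVF/ARVF) and a two-parameter fit the constant field (TVF).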
As a result of the first step, five vectors x_si^j, j = 1, ..., 5, are associated with each
patch si: the vector x_si^1, position over the image plane of the focus of expansion of the
EVF; x_si^2, position of the focus of contraction of the CVF; x_si^3, position of the center of
the CRVF; x_si^4, position of the center of the ARVF; and the unit vector x_si^5, parallel to
the direction of the TVF.
In order to produce a list of the "possible" motions in the second step, global properties
of the obtained EVFs, CVFs, CRVFs, ARVFs, and TVFs are analysed. This step is
crucial, since the pointwise agreement between each of the computed local
vector fields and the optical flow of each patch usually makes it difficult, if not impossible,
to select the most appropriate label (see Section 2). Figs. 2C and D respectively show
the distribution of the foci of expansion and contraction, and centers of clockwise and
anticlockwise rotation, associated with the EVFs, CVFs, CRVFs, and ARVFs of the
optical flow of Fig. 2B. A simple clustering algorithm has been able to find two clusters
in the distribution of Fig. 2C, and these clusters clearly correspond to the expansion
and contraction along the optical axis of Fig. 2B. The same algorithm, applied to the
distribution of the centers of rotation (shown in Fig. 2D), reveals the presence of a
single cluster in the vicinity of the image plane center corresponding to the anticlockwise
rotation in Fig. 2B. On the other hand, in the case of translation, the distribution of the
unit vectors parallel to the directions of the TVFs is considered (see Fig. 2E). For the
optical flow of Fig. 2B the distribution of Fig. 2E is nearly flat indicating the absence
of preferred translational directions. Therefore, as a result of this second step, a label l
is attached to each "possible" motion which can be characterised by a certain cluster of
points x_s^c(l), where c(l) equals 1, ..., 4, or 5 depending on l. In the specific example of
Fig. 2, one label of expansion, one of contraction, and one of anticlockwise rotation, are
found.
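The paper does not specify the clustering algorithm used in this second step; as one possible sketch, a greedy distance-threshold grouping of the singular points (both the procedure and its parameters are assumptions of ours):

```python
import numpy as np

def find_clusters(points, radius, min_size):
    """Greedy clustering sketch: group points within `radius` of a seed,
    keep groups with at least `min_size` members, and return the centers
    of mass of the surviving clusters."""
    pts = list(map(tuple, points))
    centers = []
    while pts:
        seed = np.array(pts[0])
        group = [p for p in pts if np.linalg.norm(np.array(p) - seed) <= radius]
        pts = [p for p in pts if p not in group]   # remove grouped points
        if len(group) >= min_size:
            centers.append(np.mean(group, axis=0))
    return centers
```

Each surviving cluster center plays the role of the center of mass x̄_l used by the relaxation step.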
In the third and final step, each patch of the image plane is assigned one of the possible
labels by means of an iterative relaxation procedure [11]. The key idea is that of defining
a suitable energy function which not only depends on the optical flow patches but also on
the possible motions, and reaches its minimum when the correct labels are attached to
the flow patches. In the current implementation, the energy function is a sum extended
over each pair of neighbouring patches in which the generic term u(si, sj), where si and
sj are a pair of neighbouring patches, is given by the formula
Fig. 2. A) A frame of a synthetic sequence in which the larger sphere is translating toward the
image plane, while the smaller sphere is moving away and the background is rotating anticlock-
wise. B) The corresponding optical flow computed by means of the method described in [10].
C) Distributions of the foci of expansion (squares) and contraction (crosses) of the EVFs and
CVFs respectively which lie within an area four times larger than the field of view (identified
by the solid frame). D) Distribution of the centers of anticlockwise rotation of the ARVFs. E)
Distribution of the directions of the TVFs on the unit circle. F) Colour coded segmentation of
the optical flow of B) obtained through the algorithm described in the text.
u(si, sj) = ( ||x_si^c(l) - x̄_l|| + ||x_sj^c(l) - x̄_l|| ) δ     (1)

where x̄_l is the center of mass of the cluster corresponding to the label l, and δ = 1
if the labels of the two patches, li and lj respectively, equal l, otherwise δ = 0. The
relaxation procedure has been implemented through an iterative deterministic algorithm
in which, at each iteration, each patch is visited and assigned the label which minimises
the current value of the energy function, keeping all the other labels fixed. The procedure
applied to the optical flow of Fig. 2B, starting from a random configuration, produces
the colour coded segmentation shown in Fig. 2F after twenty iterations. From Fig. 2F, it
is evident that the method is able to detect and correctly identify the multiple motions
of the optical flow of Fig. 2B. Extensive experimentation indicates that the deterministic
version usually converges on the desired solution. This is probably due to the fact that,
for the purpose of detecting multiple motions, the true solution can be approximated
equally well by nearly optimal solutions.
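The deterministic relaxation can be sketched as an ICM-style sweep over patches, written generically in the pairwise energy term (the function signature and data layout are our assumptions, not the authors' code):

```python
def relax(n_patches, n_labels, neighbors, pair_energy, labels, iters=20):
    """Iterative deterministic relaxation sketch: repeatedly visit each
    patch and assign it the label minimising the current energy over its
    neighbours, all other labels fixed. `pair_energy(i, j, li, lj)` plays
    the role of the generic term u(si, sj); stops when no label changes."""
    for _ in range(iters):
        changed = False
        for i in range(n_patches):
            def local_energy(l, i=i):
                return sum(pair_energy(i, j, l, labels[j]) for j in neighbors[i])
            best = min(range(n_labels), key=local_energy)
            if best != labels[i]:
                labels[i] = best
                changed = True
        if not changed:
            break
    return labels
```

Starting from a random labelling, a few tens of such sweeps suffice in practice, consistent with the twenty iterations reported for Fig. 2F.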
To conclude, it has to be said that the profile of the segmented regions can be suitably
modeled by adding ad hoc terms to the energy (or "penalty functions") which tend
to penalise regions of certain shapes. The choice of the appropriate penalty functions
reflects the available "a priori" knowledge, if any, on the expected shapes. In the current
implementation, in which no "a priori" knowledge is available, only narrow regions have
been inhibited (configurations in which, within a square region of 3 x 3 patches, fewer
than five patches share the same label are given infinite energy).
4 Experimental results

Let us now discuss two experiments on real images. Fig. 3A shows a frame of a sequence
in which the viewing camera is translating toward the scene while the box is moving
toward the camera. The optical flow associated with the frame of Fig. 3A is shown in
Fig. 3B. From Fig. 3B it is evident that the problem of finding different moving objects
from the reconstructed optical flow is difficult. Due to the large errors in the estimation of
optical flow, simple deterministic (and local) procedures which detect flow edges, or sharp
changes in optical flow, are doomed to failure. In addition, the viewed motion consists
of two independent expansions and even in the presence of precisely computed optical
flow, no clear flow edge can be found as the flow direction in the vicinity of the top, right
side, and bottom of the box agrees with the flow direction of the background. Fig. 3C
shows the distribution of the foci of expansion associated with the EVFs computed as
described above. Two clusters are found which correspond to the (independent) motion of
the camera and of the box of Fig. 3A. On the contrary, no clusters are found in the other
distributions. Therefore, it can be concluded that, at most, two different motions (mainly
along the optical axis) are present in the viewed scene. The colour coded segmentation
which is obtained by applying the third step of the proposed method is shown in Fig. 3D.
It is evident that the algorithm detects and correctly identifies the two different motions
of the viewed scene.
In the second experiment (Fig. 4A), a puppet is moving away from the camera, while
the plant in the lower part of Fig. 4A is moving toward the image plane. The optical
flow associated with the frame of Fig. 4A is reproduced in Fig. 4B. As can be easily seen
from Fig. 4C both the distributions of the foci of expansion (squares) and contraction
(crosses) clusterise in the neighbourhood of the origin. No cluster has been found in the
other distributions, which is consistent with the optical flow of Fig. 4B. The segmentation
which is obtained by applying the relaxation step is shown in Fig. 4D.
Fig. 3. A) A frame of a sequence in which the box is translating toward the camera, while the
camera is translating toward an otherwise static environment. B) The corresponding optical flow
computed by means of the method described in [10]. C) Distribution of the foci of expansion
of the EVFs. D) Colour coded segmentation of the optical flow of B) obtained through the
algorithm described in the text.
This example clarifies the need for two distinct labels for expansion and contraction
(and, similarly, for clockwise and anticlockwise rotation). The energy term of Eq. 1,
which simply measures distances between singular points, would not be sufficient to
distinguish between expanding and contracting patches. In order to minimise the number
of parameters which enter the energy function, it is better to consider a larger number
of different local motions than to add extra-terms to the right-hand-side of Eq. 1.
To summarise, the proposed method appears to be able to detect multiple motion and
correctly segment the viewed image in the different moving objects even if the estimates
of optical flow are rather noisy and imprecise.
5 Discussion

It is evident that the presented method is very different from the deterministic schemes
which attempt to identify multiple motions by extracting flow edges [12-13]. Important
similarities, instead, can be found with the technique proposed in [2]. Firstly, the same
mathematical machinery (stochastic relaxation) is used. Secondly, in both cases first
Fig. 4. A) A frame of a sequence in which the puppet is moving away from the camera, while
the plant is translating toward the image plane. B) The corresponding optical flow computed
by means of the method described in [10]. C) Distribution of the foci of expansion (squares) and
contraction (crosses) of the EVFs and CVFs respectively. D) Colour coded segmentation of the
optical flow of B) obtained through the algorithm described in the text.
order spatial properties of optical flow, such as expansion and rotation, are employed to
determine the different types of motion. However, the two methods are basically different.
In [2] regions are segmented and only at a later stage local spatial properties of optical
flow are used to interpret the viewed motions. The possible motions are data-independent
and the resolution is necessarily fairly low. On the contrary, the method described in the
previous Section computes the possible motions first and then identifies the regions which
correspond to the different moving objects. Consequently, the number of labels remains
small and stochastic relaxation always runs efficiently. In addition, since the possible
motions are data-dependent, the resolution is sufficiently high to allow for the detection
of "expansion within expansion" (see Fig. 3D) or the determination of arbitrary direction
of translation.
6 Conclusion
In this paper a method for the detection and identification of multiple motions from
optical flow has been presented. The method, which makes use of linear approximations of
optical flow over relatively large patches, is essentially based on a technique of stochastic
relaxation. Experimentation on real images indicates that the method is usually capable
of segmenting the viewed image into the different moving parts robustly against noise,
and independently of large errors in the optical flow estimates. Therefore, the technique
employed in the reconstruction of optical flow does not appear to be critical. Due to the
coarse resolution at which the segmentation step is performed, the proposed algorithm
only takes a few seconds on a Sun SPARCStation for a 256x256 image, apart from the
computation of optical flow.
To conclude, future work will focus on the extraction of quantitative information on
the segmented regions and will be directed toward the theoretical (and empirical) study of
the local motions which must be added in order to increase the capability of the method.
References
1. Adiv, G. Determining three-dimensional motion and structure from optical flow gen-
erated by several moving objects. IEEE Trans. Pattern Anal. Machine Intell. 7
(1985), 384-401.
2. Francois, E. and P. Bouthemy. Derivation of qualitative information in motion anal-
ysis. Image and Vision Computing 8 (1990), 279-288.
3. Gibson, J.J. The Perception of the Visual World. (Boston, Houghton Mifflin, 1950).
4. Koenderink, J.J. and Van Doorn, A.J. How an ambulant observer can construct
a model of the environment from the geometrical structure of the visual inflow. In
Kybernetik 1977, G. Hauske and E. Butenandt (Eds.), (Oldenbourg, München, 1977).
5. Verri, A., Girosi, F., and Torre, V. Mathematical properties of the two-dimensional
motion field: from Singular Points to Motion Parameters. J. Optical Soc. Amer. A 6
(1989), 698-712.
6. Subbarao, M. Bounds on time-to-collision and rotational component from first order
derivatives of image flow. CVGIP 50 (1990), 329-341.
7. Campani, M., and Verri, A. Motion analysis from first order properties of optical
flow. CVGIP: Image Understanding in press (1992).
8. Bouthemy, P. and Santillana Rivero, J. A hierarchical likelihood approach for region
segmentation according to motion-based criteria. In Proc. 1st Intern. Conf. Comput.
Vision London (UK) (1987), 463-467.
9. Adiv, G. Inherent ambiguities in recovering 3D motion and structure from a noisy
flow field. Pattern Anal. Machine Intell. 11 (1989), 477-489.
10. Campani, M. and A. Verri. Computing optical flow from an overconstrained system
of linear algebraic equations. Proc. 3rd Intern. Conf. Comput. Vision Osaka (Japan)
(1990), 22-26.
11. Geman, D., Geman, S., Graffigne, C., and P. Dong. Boundary detection by con-
strained optimization. IEEE Trans. Pattern Anal. Machine Intell. 12 (1990), 609-
628.
12. Thompson, W.B. and Ting Chuen Pong. Detecting moving objects. IJCV 4 (1990),
39-58.
13. Verri, A., Girosi, F., and Torre, V. Differential techniques for optical flow. J. Optical
Soc. Amer. A 7 (1990), 912-922.
A Fast Obstacle Detection Method based on Optical
Flow *
Nicola Ancona
1 Introduction
The exploitation of robust techniques for visual processing is certainly a key aspect in
robotic vision applications. In this work we investigate an approach for the detection
of static obstacles on the ground by evaluating optical flow fields. A simple way to
define an obstacle is as a plane lying on the ground, orthogonal to it and high enough
to be perceived. In this framework, we are interested in the changes happening on the
ground plane, rather than in the environmental aspect of the scene. Several constraints
help in tackling the problem. Among them:
1. the camera's attention is on the ground plane, considerably reducing the amount of required
computational time and data;
2. the motion of a robot on a plane exhibits only three degrees of freedom; further, the
height of the camera from this plane remains constant in time.
The last constraint is a powerful one on the system geometry, because, in pure transla-
tional motion, only differences of the vehicle's velocity and depth variations can cause
changes in the optical flow field. The optical flow can then be analysed by looking for
anomalies with respect to a predicted velocity field [SA1].
A number of computational issues have to be taken into account: a) the on-line
aspect, that is, the possibility of computing the optical flow using at most two frames;
b) the capability of detecting obstacles on the vehicle's path in a reliable and fast way;
c) the possibility of updating the system status when a new frame is available. The above
considerations led us to use a recursive token matching scheme as suitable for the problem
at hand. The developed algorithm is based on a correlation scheme [LI1] for the estimation
* Acknowledgements: this paper describes research done at the Robotic and Automation
Laboratory of the Tecnopolis CSATA. Partial support is provided by the Italian PRO-ART
section of PROMETHEUS.
of the optical flow fields. It uses two frames at a time to compute the optical flow and
so it is a suitable technique for on-line control strategies. We show how the estimation
of the optical flow on only one row of reference on the image plane is robust enough to
predict the presence of an obstacle. We estimate the flow field in the next frame using
a predictive Kalman filter, in order to have an adaptive search space of corresponding
patches, according to the environmental conditions. The possibility of changing the search
space is one of the key aspects of the algorithm's performance. We have to enlarge the
search space only when it is needed that is only when an obstacle enters the camera's
field of view. Moreover, it is important to point out that no calibration procedure is
required. The developed methodology is different from [SA1] and [EN1] because we are
interested in the temporal evolution of the predicted velocity field.
2 Obstacle model
Let us suppose that a camera moves with pure translational motion over a floor plane β
(fig. 1a), and that the distance h between the optical center C and the reference plane
stays constant in time. Let V(Vx, Vy, Vz) be the velocity vector of C and let us suppose
that it is constant and parallel to the plane β. Let us consider a plane γ (obstacle)
orthogonal to β, having its normal parallel to the motion direction. Moreover, let us
consider a point P(Px, Py, Pz) lying on β and let p(pu, pv) be its perspective projection
on the image plane: π(P) = p. When the camera moves, γ intersects the ray projected
by p in a point Q(Qx, Qy, Qz), with Qz < Pz. In other words, Q lies on the straight line
through C and P. It is worth pointing out that the points P and Q are acquired from
the element p at different temporal instants, because we have assumed the
objects' surfaces in the scene to be opaque.
Fig. 1. (a) The geometrical aspects of the adopted model. The optical axis of the camera is
directed toward the floor plane β, forming a fixed angle with it. (b) The field vectors relative
to the points p and q on r in a general situation.
Let us consider the field vectors Wp and Wq projected on the image plane by the
camera motion, relative to P and Q. At this point, let us make some useful considerations:
1. Wp and Wq have the same direction. This statement holds because, in pure trans-
lational motion, the vectors of the 2D motion field converge to the focus of expansion
F, independently of the objects' surfaces in the scene. Then Wp and Wq are
parallel, and so the following proposition holds: ∃λ > 0 : Wq = λWp.
2. As Qz < Pz, the following relation holds: ||Wp|| < ||Wq||.
So we can claim that, under the constraint of a constant velocity vector, a point Q
(obstacle) rising from the floor plane does not change the direction of the flow vector
with respect to the corresponding point P on the floor plane, but only increases its
length. The variation of the modulus of the optical flow at one point, generated by the
presence of a point rising from the floor plane, is therefore a useful indicator for detecting
obstacles along the robot path.
We know that the estimation of the optical flow is very sensitive to noise. To this end,
we extract the flow field along a row on the image plane rather than at single points;
the considerations above still hold. Let us suppose the X axis of the camera
coordinate system to be parallel to the plane β. We consider a straight line r : v = k on
the image plane and let w be the corresponding line on β obtained by back projecting r:
w = π⁻¹(r) ∩ β. Under these constraints, all points on w have the same depth value. Let
us consider two elements of r, p(pu, k) and q(qu, k), and let P(Px, Py, z) and Q(Qx, Qy, z)
be two points on w such that π(P) = p and π(Q) = q. We can state that the end points
of the 2D motion field vectors, Wp and Wq, lie on a straight line s (fig. 1b). In particular,
when the camera views the floor plane without obstacles, the straight lines r and s stay
parallel and maintain the same displacement over time. When an obstacle enters
the camera's field of view, the line parameters change and they can be used to detect
the presence of obstacles.
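Fitting the line s through the end points of the flow vectors measured on r, and monitoring its parameters, can be sketched as a small least-squares problem (function name and fitting choice are ours):

```python
import numpy as np

def line_through_endpoints(row_points, flow_vectors):
    """Least-squares fit of y = m*x + n through the end points of the
    flow vectors measured on the reference row r.

    row_points:   (N, 2) image coordinates (u, v) on row r
    flow_vectors: (N, 2) flow vectors at those points
    Returns (m, n).  On a clear floor, m and n stay nearly constant
    from frame to frame; an obstacle entering the field of view
    changes them.
    """
    ends = row_points + flow_vectors            # end points of the vectors
    A = np.column_stack([ends[:, 0], np.ones(len(ends))])
    (m, n), *_ = np.linalg.lstsq(A, ends[:, 1], rcond=None)
    return m, n
```

Tracking (m, n) over the sequence is then what the experimental section (Fig. 3) plots.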
A first analysis of the optical flow estimation process shows that the performance of
the algorithm is related to the magnitude of the expected velocity of a point on the
image plane. This quantity is proportional to the search space (SS) of corresponding
brightness patches in two successive frames. In other words, SS is the set of possible
displacements of a point during the time unit. We focus our attention on the reduction
of the search space, one of the key aspects of many correspondence based algorithms, to make the
performance of our approach close to real time and to obtain more reliable results. The
idea is to adapt the size of the search space according to the presence or absence of an
obstacle in the scene.
Let us consider (fig. 1b) a row r on the image plane and let p and q be two points
on it. Let Wp and Wq be the corresponding field vectors. Moreover, let st be the straight line
on which all the end points of the field vectors lie at time t. The search space SSt at
time t is defined by a rectangular region.
We want to stress that SSt is constrained by the focus of expansion Ft and by the straight line st. In light
of these considerations, to predict SSt+Δt at time t + Δt, it is enough
to predict st+Δt, knowing Ft and st at the previous time. To realize this step, which we
call optical flow prediction, we assume the temporal continuity constraint to be true; in
other words, the possibility of using a high sample rate of input data holds.
Suppose we know an estimate F̂t of the FOE at time t [AN1]. The end points of Wp
and Wq determine a straight line st, whose equation is y = mt x + nt. As we are only
considering pure translational motion, the straight lines determined by the vectors Wp
and Wq always pass through the FOE.
The equation describing the temporal evolution of λ can be written in a recursive way.
Setting τ = (k - 1)T and t = kT, where T is the unit time, we get:
x1(k) = x1(k-1) + x2(k-1)T + (1/2)a(k-1)T^2
x2(k) = x2(k-1) + a(k-1)T
(3)
where x1(k) denotes the value of the parameter λ, x2(k) its velocity and a(k) its acceler-
ation. In this model, a(k) is regarded as white noise. Using a vectorial representation,
the following equation holds: x(k) = Ax(k-1) + w(k-1), describing the dynamical model
of the signal. At each instant of time it is possible to observe only the value of λ, so the
observation model of the signal is given by the following equation: y(k) = Cx(k) + v(k)
with C = (1, 0), where E[v(k)] = 0 and E[v^2(k)] = σv^2. The last two equations, describing
the system and observation models, can be solved by using the predictive Kalman filtering
equations. At each step we get the best prediction of the parameter λ, and
so we are able to predict the estimate of the optical flow field for all points of r.
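A minimal sketch of this predictive filter for one line parameter follows; the noise variances, initialization, and function name are illustrative assumptions, not values from the paper:

```python
import numpy as np

def kalman_predict_line_param(measurements, T=1.0, sigma_a=0.1, sigma_v=0.05):
    """Predictive Kalman filter for one line parameter (m or n) of s.

    State x = (value, rate) with white-noise acceleration, mirroring
    x(k) = A x(k-1) + w(k-1),  y(k) = C x(k) + v(k),  C = (1, 0).
    Returns the one-step-ahead predictions of the parameter, which
    drive the size of the search space for the next frame.
    """
    A = np.array([[1.0, T], [0.0, 1.0]])
    C = np.array([[1.0, 0.0]])
    G = np.array([[T**2 / 2.0], [T]])    # how the acceleration enters the state
    Q = sigma_a**2 * (G @ G.T)           # process noise covariance
    R = np.array([[sigma_v**2]])         # measurement noise covariance

    x = np.array([[float(measurements[0])], [0.0]])
    P = np.eye(2)
    predictions = []
    for y in measurements:
        x = A @ x                        # predict the state one step ahead
        P = A @ P @ A.T + Q
        predictions.append(float(x[0, 0]))
        S = C @ P @ C.T + R              # innovation covariance
        K = P @ C.T @ np.linalg.inv(S)   # Kalman gain
        x = x + K @ (np.array([[float(y)]]) - C @ x)
        P = (np.eye(2) - K @ C) @ P
    return predictions
```

On a parameter that drifts smoothly, the one-step-ahead prediction tracks closely, which is what allows the search space to stay small until an obstacle perturbs the line.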
4 Experimental results
The sequence of fig. 2 was acquired from a camera mounted on a mobile platform moving
at a speed of 100 mm/sec. The camera optical axis was pointing towards the ground.
In this experiment we used human legs as the obstacle. The size of each image is
64 x 256 pixels. The estimation of the optical flow was performed only on the central
row (the 32nd) of each image. Fig. 3 shows the parameters m and n of s during the
sequence. Note that at the beginning of the sequence the variations of
the parameters m and n are not very strong. Only when the obstacle is close to the
camera can the perception module detect its presence. This phenomenon
is due to the experimental set-up: the camera's focal length and the angle between the optical axis
and the ground plane. The algorithm perceives the presence of an obstacle when one of the
above parameters increases or decreases monotonically. Our implementation runs on
an IBM RISC 6000 at a rate of 0.25 sec per frame.
Acknowledgements: we would like to thank Piero Cosoli for helpful comments on the
paper. Antonella Semerano checked the English.
Fig. 2. Ten images of the sequence and the relative flow fields computed on the reference row.
Fig. 3. The values of the parameters m and n of s computed during the sequence.
References
[LI1] Little J., Bulthoff H. and Poggio T.: Parallel Optical Flow Using Local Voting. IEEE
2nd International Conference on Computer Vision, 1988.
[SA1] Sandini G. and Tistarelli M.: Robust Obstacle Detection Using Optical Flow. IEEE
Workshop on Robust Computer Vision, 1-3 October 1990, Seattle, USA.
[AN1] Ancona N.: A First Step Toward a Temporal Integration of Motion Parameters.
IECON'91, October 28 1991, Kobe, Japan.
[EN1] Enkelmann W.: Obstacle Detection by Evaluation of Optical Flow Fields from Image
Sequences. First European Conference on Computer Vision, April 1990, Antibes, France.
A parallel implementation
of a structure-from-motion algorithm
1 Introduction
PARADOX [5] is a hybrid parallel architecture which has been commissioned at Oxford
in order to improve the execution speed of vision algorithms and to facilitate their in-
vestigation in time-critical applications such as autonomous vehicle guidance. Droid [3] is
a structure-from-motion vision algorithm which estimates 3-dimensional scene structure
from an analysis of passive image sequences taken from a moving camera. The motion of
the camera (ego-motion) is unconstrained, and so is the structure of the viewed scene.
Until recently, because of the large amount of computation required, Droid has been
applied off-line using prerecorded image sequences, thus making real-time evaluation of
performance difficult.
Droid functions by detecting and tracking discrete image features through the image
sequence, and determining from their image-plane trajectories both their 3D locations
and the 3D motion of the camera. The extracted image features are assumed to be
the projection of objective 3D features. Successive observations of an image feature are
combined by use of a Kalman filter to provide optimum 3D positional accuracy.
The image features originally used by Droid are determined from the image, I, by
forming at each pixel location the 2x2 matrix A = w * [(∇I)(∇I)^T], where w is a
Gaussian smoothing mask. Feature points are placed at maxima of the response function
R [3], R = det(A) - k(trace(A))^2, where k is a weighting constant. Often, features are
located near image corners, and so the operator tends to be referred to as a corner finder.
In fact, it also responds to local textural variations in the grey-level surface where there
are no extracted edges. Such features arise naturally in unstructured environments such
as natural scenes. Manipulation and matching of corners are quite straightforward and
relatively accurate geometric representation of the viewed scene can be achieved. In the
current implementation, the depth map is constructed from tracked 3D points using a
local interpolation scheme based on Delaunay triangulation [2].
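The corner response described above can be sketched in plain NumPy; this is an illustration of the formula, not the Droid implementation, and the constants (k = 0.04, sigma) are conventional choices rather than values from the paper:

```python
import numpy as np

def harris_response(image, k=0.04, sigma=1.0):
    """Corner/texture response R = det(A) - k * trace(A)^2, where
    A = w * [(grad I)(grad I)^T] and w is a Gaussian smoothing mask.
    """
    I = image.astype(float)
    Iy, Ix = np.gradient(I)                  # central-difference gradients

    # Build a small normalised Gaussian window w
    r = int(3 * sigma)
    xs = np.arange(-r, r + 1)
    g1 = np.exp(-xs**2 / (2 * sigma**2))
    w = np.outer(g1, g1)
    w /= w.sum()

    def smooth(f):
        # 2D correlation with w (same-size output, zero padding)
        out = np.zeros_like(f)
        padded = np.pad(f, r)
        for i in range(f.shape[0]):
            for j in range(f.shape[1]):
                out[i, j] = np.sum(padded[i:i + 2*r + 1, j:j + 2*r + 1] * w)
        return out

    Axx, Axy, Ayy = smooth(Ix * Ix), smooth(Ix * Iy), smooth(Iy * Iy)
    det = Axx * Ayy - Axy**2
    trace = Axx + Ayy
    return det - k * trace**2
```

R is strongly positive at corners, negative along straight edges, and near zero in flat regions, which is why thresholded local maxima of R give well-localised, trackable features.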
Droid runs in two stages: the first stage is the booting stage, called boot mode, in
which Droid uses the first two images to start the matching process; the second is the
run stage, called run mode.
In the boot mode, points in the two 2D images are matched using epipolar constraints.
The matched points provide disparity information which is then used for estimation of
ego-motion and 3D instantiation. Ego-motion is described as a 6-vector (3 in translation
and 3 in rotation).
The run mode of Droid includes a 3D-2D match which associates the 3D points with
the newly detected 2D points, an updated ego-motion estimation and a 2D-2D match,
between residual points in the feature points list and unmatched points from the previous
frame, to identify new 3D features. Also, 3D points which have been unmatched over a
period are retired.
2 PARADOX Architecture
3 Performance Evaluation
Figure 2 shows an image from a sequence of frames with a superimposed Cartesian grid
plot of the 3D surface interpreted by Droid. The driveable region can be clearly identified.
An algorithm has been developed by D. Charnley [1] to extract the driveable region by
computing the surface normal of each grid cell.
The above demonstrates qualitatively the performance of Droid in live situations, but
not quantitatively. A series of experiments has been conducted at Oxford and at Roke
Manor to measure the performance of Droid in both live and static environments. The
intention has been to demonstrate the competence of dynamic vision in a real environ-
ment.
The performance obtained from PARADOX for parallel Droid was 0.87 seconds per
frame, which is 17 times faster than a pure Sun-4 implementation. The overall performance
is limited primarily by the parallel execution of the 3D-isation and corner detection
algorithms, which have comparable execution times. The Datacube control and visual
display functions contribute a negligible fraction of the execution time.
The laser scanner on the vehicle can determine the AGV's location (2D position and
orientation) by detecting fixed bar-coded navigation beacons. This allows comparison
between the "true" AGV trajectory and that predicted by Droid. The following results
were obtained from an experiment where the AGV was programmed to run in a straight
line with varying speeds.
Fig. 3. Plan view of AGV trajectory. Solid line: Droid predicted motion; dashed line: laser
scanner readings.
Figure 3 depicts a plan view of the AGV's trajectories: the solid line represents the
AGV trajectory reported by Droid and the dashed line that reported by the laser scanner.
In this particular run, the AGV was programmed to move in a straight line at
two different speeds. For the first part of the run it travels at about 8 cm/sec and for
the second at about 4 cm/sec. Droid reports the camera position (6 degrees of
freedom: 3 in translation and 3 in rotation) from the starting point of the vehicle, which has
coordinates (0, 0), and the laser reported trajectory is re-aligned accordingly.
It can be seen from Figure 3 that the alignment between the laser scanner readouts and
the Droid prediction is very close.
During the run, the vehicle was stopped twice manually to test the system's
tolerance under different situations. Figure 4 shows the speed of the AGV as determined
by Droid (solid line) and by the on-board laser scanner (dashed line). The speed plots
in figure 4 agree closely, apart from the moments when the vehicle alters its speed, where
Droid consistently overshoots. This can be improved using non-critical damping.
Fig. 4. Comparison of AGV speed (cm/sec), Sequence 12. Solid line: speed reported using
Droid; dashed line: speed reported using laser scanner.
A new corner detection algorithm has been developed and is under test at Oxford [7]. This uses second order directional deriva-
tives with the direction tangential to an edge. This algorithm has improved accuracy of
corner localisation and reduced computational complexity. Consequently, it allows faster
execution (14 frames per second) than the original Droid corner detection algo-
rithm. This, together with parallelisation of the 3D-isation algorithms, will offer further
improvements to overall execution speed. Future work will include (1) the incorporation
of the new fast corner detection algorithm into Droid, (2) the use of odometry informa-
tion taken from the AGV to provide Droid with more accurate motion estimates, and
(3) eventually closing the control loop of the AGV, that is, controlling the AGV by utilising
the information provided by Droid.
References
1. D Charnley and R Blisset. Surface reconstruction from outdoor image sequences. Image
and Vision Computing, 7(1):10-16, 1989.
2. L. De Floriani. Surface representation based on triangular grids. The Visual Computer,
3:27-50, 1987.
3. C G Harris and J M Pike. 3D positional integration from image sequences. In Proc. 3rd
Alvey Vision Conference, Cambridge, Sept. 1987.
4. J.A. Sheen. A parallel architecture for machine vision. In Colloquium on Practical applica-
tions of signal processing. Institution of the Electrical Engineers, 1988. Digest no: 1988/111.
5. H Wang and C C Bowman. The Oxford distributed machine for 3D vision system. In
IEE colloquium on Parallel Architectures for Image Processing Applications, pages 1/2-5/2,
London, April 1991.
6. H Wang, P M Dew, and J A Webb. Implementation of Apply on a transputer array. CON-
CURRENCY: Practice and Experience, 3(1):43-54, February 1991.
7. Han Wang and Mike Brady. Corner detection for 3D vision using array processors. In
BARNAIMAGE 91, Barcelona, Sept. 1991. Springer-Verlag.
Structure from Motion Using the Ground Plane Constraint
T. N. Tan, G. D. Sullivan & K. D. Baker
Department of Computer Science, University of Reading
Reading, Berkshire RG6 2AY, ENGLAND
1. Introduction
The work described here was carried out as part of the ESPRIT II project P2152 (VIEWS -
Visual Inspection and Evaluation of Wide-area Scenes). It is concerned with semi-automatic
methods to construct geometric object models using monocular monochromatic image
sequences. A common feature in the images used in the VIEWS project is that the movement of
objects satisfies the ground plane constraint [1, 2], i.e., the objects move on a ground surface
which, locally at least, is approximately flat. We approximate the flat ground surface by the X-Y
plane of a world coordinate system (WCS), whose Z-axis points upwards. In this WCS, an
object can only translate along the X- and Y-axes, and rotate about the Z-axis, leaving 3 degrees
of freedom of motion.
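These 3 degrees of freedom amount to a rigid motion in the WCS of the form sketched below; the function name and parameterization are ours, for illustration:

```python
import numpy as np

def ground_plane_motion(points, theta, tx, ty):
    """Apply the 3-DOF ground plane motion to world points: rotation
    by theta about the Z-axis followed by translation (tx, ty, 0).

    points: (N, 3) array in the WCS whose X-Y plane is the ground
    and whose Z-axis points upwards.  Z coordinates are unchanged.
    """
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, -s, 0.0],
                  [s,  c, 0.0],
                  [0.0, 0.0, 1.0]])
    return points @ R.T + np.array([tx, ty, 0.0])
```

With only (theta, tx, ty) to recover, the motion estimation problem is far better conditioned than the general 6-DOF case.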
We show in this paper that, in order to make the most effective use of the ground plane
constraint in structure from motion (SFM), it is necessary to formulate structure (and motion)
constraint equations in the WCS. This allows us to derive simple yet robust SFM algorithms.
The paper is organised as follows. We first discuss the use of the ground plane constraint to
simplify the constraint equations on the relative depths (i.e., the structure) of the given rigid
points. We then describe simple robust methods for solving the constraint equations to recover
the structure and motion parameters. Experimental studies of both a conventional 6 degrees of
freedom SFM algorithm [3] and the proposed 3 degrees of freedom algorithm are then reported.
2. Constraint Equations
We assume a pinhole camera model with perspective projection as shown in Fig. 1. Under this
imaging model, the squared distance d_mn^2 measured in the WCS between two points P_m, with
image coordinates (u_m, v_m), and P_n, with image coordinates (u_n, v_n), is given by

d_mn^2 = (λ_m U_m - λ_n U_n)^2 + (λ_m V_m - λ_n V_n)^2 + (λ_m W_m - λ_n W_n)^2    (1)

where λ_m and λ_n are the depths (scales) of P_m and P_n respectively, and U, V and W
are terms computable from known camera parameters and image coordinates [1, 2]. A similar
equation can be written for the point pair in a subsequent frame. Using primed notation to
indicate the new frame, we have

d'_mn^2 = (λ'_m U'_m - λ'_n U'_n)^2 + (λ'_m V'_m - λ'_n V'_n)^2 + (λ'_m W'_m - λ'_n W'_n)^2    (2)
where λ_i^n = 1 for i = n and the superscript n indicates the depths computed with
P_n as the reference point. The depths of each set in (6) are normalised with respect to the depth of the
same single point (say P_1) in the corresponding set, to get N normalised sets of depths.
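Equation (1) is a direct computation once the depths and the back-projection terms are known; a sketch, with the (U, V, W) terms assumed precomputed from the camera parameters:

```python
import numpy as np

def squared_distance(lam_m, uvw_m, lam_n, uvw_n):
    """Squared world-space distance between two back-projected image
    points, as in eq. (1):
        d_mn^2 = |lam_m * (U_m, V_m, W_m) - lam_n * (U_n, V_n, W_n)|^2
    where lam_m, lam_n are the depths (scales) and uvw_m, uvw_n are
    the (U, V, W) terms for the two image points.
    """
    d = lam_m * np.asarray(uvw_m, float) - lam_n * np.asarray(uvw_n, float)
    return float(d @ d)
```

Rigidity then requires this quantity to be equal in the two frames, which is what ties the unprimed and primed depth sets together.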
4 Experimental Results
We have compared the performance of the new algorithm with that of a recent linear SFM
algorithm proposed by Weng et al. [3]. For convenience, we call the proposed algorithm the
TSB algorithm, and the algorithm in [3] the WHA algorithm in the subsequent discussions.
Using synthetic image data, Monte Carlo simulations were conducted to investigate the
noise sensitivity of, and the influence of the number of point correspondences on, the two
algorithms. Comprehensive testing has been carried out [1, 2]. Numerous results show that in
general the TSB algorithm performs much better than the WHA algorithm especially under high
noise conditions.
With real image sequences, the assessment of the accuracy of the recovered structure is not
straightforward as the ground truth is usually unknown. Proposals have been made [3] to use the
standard image error (SIE) Ae defined as follows [3]:

Ae = sqrt( (1/(2N)) Σ_{i=1}^N (d_i^2 + d_i'^2) )    (9)

where N is the number of points, and d_i and d_i' are the distances in the two images between the
projection of the reconstructed 3D point i and its observed positions in the images. The SIEs of
the two algorithms applied to five different moving objects are listed in Table 1. In these terms,
Table 1
SIE (in pixels) of Algorithms TSB and WHA on Real Image Sequences

Moving Object | SIE (WHA) | SIE (TSB) | Error Ratio (WHA/TSB)
Lorry         | 0.160     | 0.00134   | 119
Estate 1      | 0.455     | 0.00144   | 316
Estate 2      | 0.532     | 0.00147   | 362
Saloon 1      | 0.514     | 0.00177   | 290
Saloon 2      | 1.188     | 0.00171   | 695
the TSB algorithm performs several hundred times better than the WHA algorithm. It is argued
in [1, 2] that the SIE gives a poor measure of performance. In addition to the observed error in
the two frames, the performance should be analysed by a qualitative assessment of the projected
wire-frame model of all given points from other views. For example, Fig.2 shows two images of
Figure 3. Three different views of the lorry model recovered by the new algorithm
against the lorry image sequence and tracked automatically using the methods reported in [9] as
illustrated in Fig.4. The match is very good.
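The standard image error of eq. (9) is simple to evaluate given the per-point residual distances; in the sketch below the 1/(2N) root-mean-square normalisation is our reading of the formula:

```python
import numpy as np

def standard_image_error(d, d_prime):
    """Standard image error (SIE) over N reconstructed points:
    the RMS of the distances d_i, d_i' between the projections of the
    reconstructed 3D points and their observed image positions in the
    two frames, i.e. sqrt(sum(d_i^2 + d_i'^2) / (2N)).
    """
    d = np.asarray(d, float)
    dp = np.asarray(d_prime, float)
    n = len(d)
    return float(np.sqrt(np.sum(d**2 + dp**2) / (2 * n)))
```

As the text argues, a low SIE alone is a weak guarantee; re-projecting the recovered wire-frame from other viewpoints remains the more telling check.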
We have also investigated the sensitivity of the proposed algorithm to systematic errors
such as errors in rotational camera parameters. It was found that such errors have small effects
Figure 4. Matching between the recovered lorry model and four lorry images
on the estimation of the rotation angle, and a moderate impact on that of the translational
parameters. Detailed results cannot be reported here due to space limitations.
5. Discussion
In the real world, the movement of many objects (e.g., cars, objects on conveyor belts, etc.) is
constrained in that they only move on a fixed plane or surface (e.g., the ground). A new SFM
algorithm has been presented in this paper which, by formulating motion constraint equations in
the world coordinate system, makes effective use of this physical motion constraint (the ground
plane constraint). The algorithm is computationally simple and gives a unique and closed-form
solution for the motion and structure parameters of rigid 3-D points. It is non-iterative, and
usually requires two points in two frames. The algorithm has been shown to be greatly superior
to existing linear SFM algorithms in accuracy and robustness, especially under high noise
conditions and when there are only a small number of corresponding points. The recovered 3-D
coordinates of object points from outdoor images enable us to construct 3-D geometric object
models which match the 2-D image data with good accuracy.
References
[1] T.N. Tan, G. D. Sullivan, and K. D. Baker, Structure from Constrained Motion, ESPRIT II
P2152 project report, RU-03-WP.T411-01, University of Reading, March 1991.
[2] T.N. Tan, G. D. Sullivan, and K. D. Baker, Structure from Constrained Motion Using Point
Correspondences, Proc. of British Machine Vision Conf., 24-26 September 1991, Glasgow,
Scotland, Springer-Verlag, 1991, pp.301-309.
[3] J.Y. Weng, T. S. Huang, and N. Ahuja, Motion and Structure from Two Perspective Views:
Algorithms, Error Analysis, and Error Estimation, IEEE Trans. Pattern Anal. Mach. Intell.,
vol. 11, no. 5, 1989, pp. 451-477.
[4] A. Mitiche and J. K. Aggarwal, A Computational Analysis of Time-Varying Images, in
Handbook of Pattern Recognition and Image Processing, T.Y. Young and K. S. Fu, Eds.
New York: Academic Press, 1986.
[5] S. Ullman, The Interpretation of Visual Motion, MIT Press, 1979.
[6] J.K. Aggarwal and N. Nandhakumar, On the Computation of Motion from Sequences of
Images - A Review, Proc. of IEEE, vol. 76, no. 8, 1988, pp. 917-935.
[7] T.N. Tan, G. D. Sullivan, and K. D. Baker, 3-D Models from Motion (MFM) - an
application support tool, ESPRIT II P2152 project report, RU-03-WP.T411-02, University
of Reading, June 1991.
[8] D.J. Heeger and A. Jepson, Simple Method for Computing 3D Motion and Depth, Proc. of
IEEE 3rd Inter. Conf. on Computer Vision, December 4-7, 1990, Osaka, Japan, pp. 96-100.
[9] A.D. Worrall, G. D. Sullivan, and K. D. Baker, Model-based Tracking, Proc. of British
Machine Vision Conf., 24-26 September 1991, Glasgow, Scotland, Springer-Verlag, 1991,
pp.310-318.
Detecting and Tracking Multiple Moving Objects
Using Temporal Integration*
Michal Irani, Benny Rousso, Shmuel Peleg
Dept. of Computer Science
The Hebrew University of Jerusalem
91904 Jerusalem, ISRAEL
1 Introduction
Motion analysis, such as optical flow [7], is often performed on the smallest possible re-
gions, both in the temporal domain and in the spatial domain. Small regions, however,
carry little motion information, and such motion computation is therefore very inac-
curate. Analysis of multiple moving objects based on optical flow [1] suffers from this
inaccuracy.
The major difficulty in increasing the size of the spatial region of analysis is the
possibility that larger regions will include more than a single motion. This problem
has been treated for image-plane translations with the dominant translation approach
[3, 4]. Methods with larger temporal regions have also been introduced, mainly using a
combined spatio-temporal analysis [6, 10]. These methods assume motion constancy in
the temporal regions, i.e., the motion should be constant throughout the analyzed sequence.
In this paper we propose a method for detecting and tracking multiple moving objects
using both a large spatial region and a large temporal region without assuming temporal
motion constancy. When the large spatial region of analysis has multiple moving objects,
the motion parameters and the locations of the objects are computed for one object after
another. The method has been applied successfully to parametric motions such as affine
and projective transformations. Objects are tracked using temporal integration of images
registered according to the computed motions.
Sec. 2 describes a method for segmenting the image plane into differently moving
objects and computing their motions using two frames. Sec. 4 describes a method for
tracking the detected objects using temporal integration.
* This research has been supported by the Israel Academy of Sciences.
To detect differently moving objects in an image pair, a single motion is first computed,
and a single object which corresponds to this motion is identified. We call this motion the
dominant motion, and the corresponding object the dominant object. Once a dominant
object has been detected, it is excluded from the region of analysis, and the process is
repeated on the remaining region to find other objects and their motions.
The motion parameters of a single translating object in the image plane can be recovered
accurately, by applying the iterative translation detection method mentioned in Sec. 3 to
the entire region of analysis. This can be done even in the presence of other differently
moving objects in the region of analysis, and with no prior knowledge of their regions of
support [5]. It is, however, rarely possible to compute the parameters of a higher order
parametric motion of a single object (e.g. affine, projective, etc.) when differently moving
objects are present in the region of analysis.
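The iterative exclude-and-repeat structure of dominant-object detection can be sketched as follows; the motion-estimation and segmentation routines are placeholders standing in for the paper's gradient-based estimator and registration-based segmentation:

```python
import numpy as np

def detect_dominant_objects(frame1, frame2, compute_motion, segment_region,
                            max_objects=5, min_pixels=100):
    """Iteratively detect differently moving objects between two frames.

    compute_motion(frame1, frame2, mask) -> motion parameters of the
        dominant motion inside the masked region of analysis.
    segment_region(frame1, frame2, params, mask) -> boolean map of the
        region moving with those parameters (the dominant object).
    Both callables are illustrative placeholders.
    """
    mask = np.ones(frame1.shape, bool)       # current region of analysis
    objects = []
    for _ in range(max_objects):
        params = compute_motion(frame1, frame2, mask)          # dominant motion
        region = segment_region(frame1, frame2, params, mask)  # dominant object
        if region.sum() < min_pixels:
            break                            # nothing substantial left
        objects.append((params, region))
        mask &= ~region                      # exclude it and repeat on the rest
    return objects
```

Each pass removes the currently dominant object from the region of analysis, so later passes can lock onto the remaining, less dominant motions.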
This procedure segments an object (the dominant object) and computes its
motion parameters (the dominant motion) using two frames. An example of the determination
of the dominant object using an affine motion model between two frames
is shown in Fig. 2.c. In this example, noise has strongly affected the segmentation and
motion computation. The problem of noise is overcome once the algorithm is extended
to handle longer sequences using temporal integration (Sec. 4).
3 Motion Analysis and Segmentation
This section briefly describes the methods used for motion computation and segmenta-
tion; a more detailed description can be found in [9].
Motion Computation. It is assumed that the motion of the objects can be approximated
by 2D parametric transformations in the image plane. We have chosen to use an iter-
ative, multi-resolution, gradient-based approach for motion computation [2, 3, 4]. The
parametric motion models used in our current implementation are: pure translations (two
parameters), affine transformations (six parameters [3]), and projective transformations
(eight parameters [1]).
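The three model families can be illustrated as point mappings. The parameterization below is one common convention and is this sketch's assumption, not necessarily the exact implementation of [1, 2, 3]:

```python
import numpy as np

def warp_points(params, pts, model):
    """Map image points (N, 2) under the three parametric motion models named
    in the text: 'translation' (2 params), 'affine' (6), 'projective' (8)."""
    x, y = pts[:, 0], pts[:, 1]
    if model == "translation":
        a, b = params
        return np.stack([x + a, y + b], axis=1)
    if model == "affine":
        a, b, c, d, e, f = params
        return np.stack([a + b * x + c * y, d + e * x + f * y], axis=1)
    if model == "projective":
        a, b, c, d, e, f, g, h = params
        w = 1.0 + g * x + h * y           # projective denominator
        return np.stack([(a + b * x + c * y) / w, (d + e * x + f * y) / w], axis=1)
    raise ValueError(model)
```

With g = h = 0 the projective model reduces to the affine one, which in turn contains pure translation as a special case.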
Segmentation. Once a motion has been determined, we would like to identify the region
having this motion. To simplify the problem, the two images are registered using the
detected motion. The motion of the corresponding region is therefore cancelled, and the
problem becomes that of identifying the stationary regions.
In order to classify correctly regions having uniform intensity, a multi-resolution
scheme is used, as in low resolution pyramid levels the uniform regions are small. The
lower resolution classification is projected onto the higher resolution level, and is updated
according to higher resolution information (gradient or motion) when that information
conflicts with the classification from the lower resolution level.
Moving pixels are detected in each resolution level using only local analysis. A simple
grey level difference is not sufficient for determining the moving pixels. However, the grey
level difference normalized by the gradient gives better results, and was sufficient for our
experiments. Let I(x, y, t) be the gray level of pixel (x, y) at time t, and let ∇I(x, y, t) be
its spatial intensity gradient. The motion measure D(x, y, t) used is the weighted average
of the intensity differences normalized by the gradients over a small neighborhood N(x, y)
of (x, y).
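One plausible reading of this measure is sketched below; weighting each pixel's normalized difference by |∇I|², so that low-gradient pixels contribute little, is this sketch's assumption (the text says only "weighted average"):

```python
import numpy as np

def motion_measure(I0, I1, radius=1, eps=1e-6):
    """Sketch of D(x, y): average, over a (2*radius+1)^2 neighborhood, of the
    temporal difference normalized by the spatial gradient magnitude."""
    gy, gx = np.gradient(I0)
    gmag = np.sqrt(gx**2 + gy**2)
    norm_diff = np.abs(I1 - I0) / (gmag + eps)  # difference normalized by gradient
    w = gmag**2                                  # assumed weighting

    def box(a):
        # box filter over the neighborhood, edge-padded
        k = 2 * radius + 1
        p = np.pad(a, radius, mode="edge")
        out = np.zeros_like(a)
        for dy in range(k):
            for dx in range(k):
                out += p[dy:dy + a.shape[0], dx:dx + a.shape[1]]
        return out

    return box(w * norm_diff) / (box(w) + eps)
```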
The algorithm for the detection of multiple moving objects described in Sec. 2 is extended
to track objects in long image sequences. This is done by using temporal integration of
images registered with respect to the tracked motion, without assuming temporal motion
constancy. The temporally integrated image serves as a dynamic internal representation
image of the tracked object.
Let {I(t)} denote the image sequence, and let M(t) denote the segmentation mask of
the tracked object computed for frame I(t), using the segmentation method described in
Sec. 3. Initially, M(0) is the entire region of analysis. The temporally integrated image
is denoted by Av(t), and is constructed as follows:
Av(0) := I(0)   (1)
Av(t + 1) := w · I(t + 1) + (1 − w) · register(Av(t), I(t + 1))   (2)
where currently w = 0.3, and register(P, Q) denotes the registration of images P and
Q by warping P towards Q according to the motion of the tracked object computed
between them. A temporally integrated image is shown in Fig. 1.
Following is a summary of the algorithm for detecting and tracking the dominant
object in an image sequence, starting at t = 0:
1. Compute the dominant motion parameters between the integrated image Av(t) and
the new frame I(t + 1), in the region M(t) of the tracked object (Sec. 2).
2. Warp the temporally integrated image Av(t) and the segmentation mask M(t) to-
wards the new frame I(t + 1) according to the computed motion parameters.
3. Identify the stationary regions in the registered images above (Sec. 3), using the
registered mask M(t) as an initial guess. This will be the tracked region in I(t + 1).
4. Compute the integrated image Av(t + 1) using (2), and process the next frame.
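The four steps above can be sketched as a loop; `register` and `segment` stand in for the methods of Secs. 2-3, and the identity defaults (plus folding the mask warp into `segment`) are simplifications of this sketch, not the authors' code. The weight w = 0.3 is as in the text.

```python
import numpy as np

def track(frames, w=0.3, register=None, segment=None):
    """Sketch of the tracking loop: register the integrated image to each new
    frame, re-identify the stationary region, and update Av via eq. (2)."""
    register = register or (lambda P, Q, m: P)      # identity stand-in
    segment = segment or (lambda av, f, m: m)       # identity stand-in
    av = frames[0].astype(float)                    # Av(0) = I(0)
    mask = np.ones(av.shape, dtype=bool)            # M(0): whole region of analysis
    for f in frames[1:]:
        reg = register(av, f, mask)                 # steps 1-2: register Av toward I(t+1)
        mask = segment(reg, f, mask)                # step 3: stationary region
        av = w * f + (1 - w) * reg                  # step 4: eq. (2)
    return av, mask
```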
When the motion model approximates well enough the temporal changes of the
tracked object, shape changes relatively slowly over time in registered images. There-
fore, temporal integration of registered frames produces a sharp and clean image of the
tracked object, while blurring regions having other motions. An example of a temporally
integrated image of a tracked rolling ball is shown in Fig. 1. Comparing each new frame
to the temporally integrated image, rather than to the previous frame, gives a strong
bias to keep tracking the same object. Since additive noise is reduced in the average
image of the tracked object, and since image gradients outside the tracked object decrease
substantially, both segmentation and motion computation improve significantly.
In the example shown in Fig. 2, temporal integration is used to detect and track
a single object. Comparing the segmentation shown in Fig. 2.c to the segmentation in
Fig. 2.d emphasizes the improvement in segmentation using temporal integration.
Fig. 2. Detecting and tracking the dominant object using temporal integration.
a-b) Two frames in the sequence. Both the background and the helicopter are moving.
c) The segmented dominant object (the background) using the dominant affine motion computed
between the first two frames. Black regions are those excluded from the dominant object.
d) The segmented tracked object after a few frames using temporal integration.
Another example for detecting and tracking the dominant object using temporal inte-
gration is shown in Fig. 3. In this sequence, taken by an infrared camera, the background
moves due to camera motion, while the car has another motion. It is evident that the
tracked object is the background, as other regions were blurred by their motion.
Fig. 3. Detecting and tracking the dominant object in an image sequence using temporal inte-
gration.
a-b) Two frames in an infrared sequence. Both the background and the car are moving.
c) The temporally integrated image of the tracked object (the background). The background
remains sharp with less noise, while the moving car blurs out.
d) The segmented tracked object (the background) using an affine motion model. White regions
are those excluded from the tracked region.
This temporal integration approach has characteristics similar to human motion de-
tection. For example, when a short sequence is available, processing the sequence back
and forth improves the segmentation and motion computation, much as repeated viewing
helps human observers understand a short sequence.
4.1 Tracking Other Objects
After segmentation of the first object, and the computation of its affine motion between
every two successive frames, attention is given to other objects. This is done by applying
once more the tracking algorithm to the "rest" of the image, after excluding the first
detected object. To increase stability, the displacement between the centers of mass of
the regions of analysis in successive frames is given as the initial guess for the computation
of the dominant translation. This increases the chance to detect fast small objects.
After computing the segmentation of the second object, it is compared with the
segmentation of the first object. In case of overlap between the two segmentation masks,
pixels which appear in the masks of both the first and the second objects are examined.
They are reclassified by finding which of the two motions fits them better.
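The overlap rule can be sketched as follows; `warp1` and `warp2`, which apply the two computed motions to frame t, are hypothetical stand-ins, and using the absolute registration residual as the "fit" criterion is this sketch's assumption:

```python
import numpy as np

def resolve_overlap(I0, I1, mask1, mask2, warp1, warp2):
    """Reassign pixels claimed by both objects to the motion that predicts
    them better, measured by the residual between I1 and I0 warped by each."""
    overlap = mask1 & mask2
    r1 = np.abs(I1 - warp1(I0))     # residual under motion 1
    r2 = np.abs(I1 - warp2(I0))     # residual under motion 2
    take1 = overlap & (r1 <= r2)    # motion 1 fits these overlap pixels better
    mask1 = (mask1 & ~overlap) | take1
    mask2 = (mask2 & ~overlap) | (overlap & ~take1)
    return mask1, mask2
```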
Following the analysis of the second object, the scheme is repeated recursively for
additional objects, until no more objects can be detected. In cases when the region of
analysis consists of many disconnected regions and motion analysis does not converge,
the largest connected component in the region is analyzed.
In the example shown in Fig. 4, the second object is detected and tracked. The de-
tection and tracking of several moving objects can be performed in parallel, by keeping
a delay of one or more frames between the computations for different objects.
5 Concluding Remarks
Fig. 4. Detecting and tracking the second object using temporal integration.
a) The initial segmentation is the complement of the first dominant region (from Fig. 3.d).
b) The temporally integrated image of the second tracked object (the car). The car remains
sharp while the background blurs out.
c) Segmentation of the tracked object after 5 frames.
References
1. G. Adiv. Determining three-dimensional motion and structure from optical flow generated
by several moving objects. IEEE Trans. on Pattern Analysis and Machine Intelligence,
7(4):384-401, July 1985.
2. J.R. Bergen and E.H. Adelson. Hierarchical, computationally efficient motion estimation
algorithm. J. Opt. Soc. Am. A., 4:35, 1987.
3. J.R. Bergen, P.J. Burt, K. Hanna, R. Hingorani, P. Jeanne, and S. Peleg. Dynamic
multiple-motion computation. In Y.A. Feldman and A. Bruckstein, editors, Artificial Intel-
ligence and Computer Vision: Proceedings of the Israeli Conference, pages 147-156. Elsevier
(North Holland), 1991.
4. J.R. Bergen, P.J. Burt, R. Hingorani, and S. Peleg. Computing two motions from three
frames. In International Conference on Computer Vision, pages 27-32, Osaka, Japan,
December 1990.
5. P.J. Burt, R. Hingorani, and R.J. Kolczynski. Mechanisms for isolating component patterns
in the sequential analysis of multiple motion. In IEEE Workshop on Visual Motion, pages
187-193, Princeton, New Jersey, October 1991.
6. D.J. Heeger. Optical flow using spatiotemporal filters. International Journal of Computer
Vision, 1:279-302, 1988.
7. B.K.P. Horn and B.G. Schunck. Determining optical flow. Artificial Intelligence, 17:185-
203, 1981.
8. M. Irani and S. Peleg. Improving resolution by image registration. CVGIP: Graphical
Models and Image Processing, 53:231-239, May 1991.
9. M. Irani, B. Rousso, and S. Peleg. Detecting multiple moving objects using temporal inte-
gration. Technical Report 91-14, The Hebrew University, December 1991.
10. M. Shizawa and K. Mase. Simultaneous multiple optical flow estimation. In International
Conference on Pattern Recognition, pages 274-278, Atlantic City, New Jersey, June 1990.
This article was processed using the LaTeX macro package with ECCV92 style
A Study of Affine Matching With Bounded Sensor
Error *
1 Introduction
* This report describes research done in part at the Artificial Intelligence Laboratory of the
Massachusetts Institute of Technology. Support for the laboratory's research is provided in
part by an ONR URI grant under contract N00014-86-K-0685, and in part by DARPA un-
der Army contract number DACA76-85-C-0010 and under ONR contract N00014-85-K-0124.
WELG is supported in part by NSF contract number IRI-8900267. DPH is supported at Cor-
nell University in part by NSF grant IRI-9057928 and matching funds from General Electric
and Kodak, and in part by AFOSR under contract AFOSR-91-0328.
1.1 Affine Transformations and Invariant Representations
Thus the pair (α, β) constitutes affine-invariant coordinates of x with respect to the basis
(m1, m2, m3). We can think of (α, β) as a point in a 2D space, termed the α-β-plane.
The main issue we wish to explore is: Given a model basis of three points and some
other model point, what sets of four image features are possible transformed instances of
these points? The exact location of each image feature is unknown, and thus we model
image features as discs of radius e. The key question is what effect this uncertainty has
on which image quadruples are possible transformed instances of a model quadruple.
We assume that a set of model points is given in a Cartesian coordinate frame, and
some distinguished basis triple is also specified. Similarly a set of image points is given
in their coordinate frame. Two methods can be used to map between the model and
the image. One method, used by geometric hashing [20], maps both model and image
points to (α, β) values using the basis triples. The other method, used by alignment [15],
computes the transformation mapping the model basis to the image basis, and uses it to
map all model points to image coordinates. In both cases, a distinguished set of three
model and image points is used to map a fourth point (or many such points) into some
other space. We consider the effects of uncertainty on these two methods.
First we characterize the range of image measurements in the x-y (Euclidean) plane
that are consistent with the (α, β) pair computed for a given quadruple of model points,
as specified by equation (1). This corresponds to explicitly computing a transformation
from one Cartesian coordinate frame (the model) to another (the image). We find that if
sensor points' locational uncertainty is bounded by a disc of radius e, then the range of
possible image measures consistent with a given (α, β) pair is a disc with radius between
e(1 + |α| + |β|) and 2e(1 + |α| + |β|). This defines the set of image points that could match
a specific model point, given both an image and model basis.
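These bounds are easy to evaluate; a small helper (its name is this sketch's own) that returns the lower and upper radius bounds for a given (α, β) and sensor uncertainty e:

```python
def disc_radius_bounds(alpha, beta, e):
    """Bounds on the radius of the image disc consistent with affine
    coordinates (alpha, beta) under sensor uncertainty e: between
    e*(1 + |alpha| + |beta|) and 2*e*(1 + |alpha| + |beta|)."""
    base = e * (1.0 + abs(alpha) + abs(beta))
    return base, 2.0 * base
```

For |α|, |β| < 1 the upper bound is below 6e, matching the 6e figure quoted later in the text.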
We then perform the same analysis for the range of affine coordinate values (α, β)
that are consistent with a given quadruple of points. This corresponds to mapping both
the model and image points to (α, β) values. To do this, we use the expressions derived
for the Euclidean case to show that the region of α-β-space that is consistent with a
given point and basis is, in general, an ellipse containing the point (α, β). The parameters
of the ellipse depend on the actual locations of the points defining the basis. Hence the
set of possible values in the α-β-plane cannot be computed independent of the actual
locations of the image basis points. In other words there is an interaction between the
uncertainty in the sensor values and the actual locations of the sensor points. This limits
the applicability of methods which assume that these are independent of one another. For
example, the geometric hashing method requires that the α-β coordinates be independent
of the actual location of the basis points in order to construct a hash table.
2 Image Uncertainty and Affine Coordinates
Consider a set of three model points, m1, m2, m3, and the affine coordinates (α, β) of a
fourth model point x defined by
plus a set of three sensor points s1, s2, s3, such that si = T(mi) + ei, where T is some
affine transformation, and ei is an arbitrary vector of magnitude at most ei. That is, T
is some underlying affine transformation that cannot be directly observed in the data
because each data point is known only to within a disc of radius ei.
We are interested in the possible locations of a fourth sensor point, call it ŝ, such that
ŝ could correspond to the ideally transformed point T(x). The possible positions of ŝ are
affected both by the error in measuring each image basis point si, and by the error in
measuring the fourth point itself. Thus the possible locations are given by transforming
equation (2) and adding in the error e0 from measuring x:

ŝ = T(m1 + α(m2 − m1) + β(m3 − m1)) + e0
  = s1 + α(s2 − s1) + β(s3 − s1) − e1 + α(e1 − e2) + β(e1 − e3) + e0.
The measured point ŝ can lie in a range of locations about the ideal location s1 + α(s2 −
s1) + β(s3 − s1), with deviation given by the linear combination of the four error vectors:
The set of possible locations specified by a given ei is a disc of radius ei about the origin.
Similarly, the product of any constant k with ei yields a disc C(k ei) of radius |k| ei
centered about the origin. Thus, substituting the expressions for the discs in equation (3),
the set of all locations about the ideal point s1 + α(s2 − s1) + β(s3 − s1) is:
Claim 1. C(r1) ⊕ C(r2) = C(r1) ⊖ C(r2) = C(r1 + r2), where C(ri) is a disc of radius
ri centered about the origin, ri ≥ 0.
The absolute values arise because α and β can become negative, but the radius of a
disc is a positive quantity. Clearly the radius of the error disc grows with increasing
magnitude of α and β, but the actual expression governing this growth is different for
different portions of the α-β-plane, as shown in figure 1.
Fig. 1. Diagram of error effects. The region of feasible points is a disc, whose radius is given by
the indicated expression, depending on the values of α and β. The diagonal line is 1 − α − β = 0.
We can bound the expressions defining the radius of the uncertainty disc by noting:
Proposition 1. The range of image locations that is consistent with a given pair of affine
coordinates (α, β) is a disc of radius r, where
and where e > 0 is a constant bounding the positional uncertainty of the image data.
Fig. 2. Diagram of error effects. On the left are four model points, on the right are four image
points, three of which are used to establish a basis. The actual position of each transformed
model point corresponding to the basis image points is offset by an error vector of bounded
magnitude. The coordinates of the fourth point, written in terms of the basis vectors, can thus
vary from the ideal case, shown in solid lines, to cases such as that shown in dashed lines. This
leads to a disc of variable size in which the corresponding fourth model point could lie.
The expression in Proposition 1 allows the calculation of error bounds for any method
based on 2D affine transformations, such as [2, 15, 24]. In particular, if |α| and |β| are
both less than 1, then the error in the position of a point is at most 6e. This condition
can be met by using as the affine basis three points m1, m2 and m3 that lie on the
convex hull of the set of model points and are maximally separated from one another.
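One way to realize this basis-selection condition is sketched below; searching all triples for the largest minimum pairwise separation (which in practice favors hull points) is this sketch's heuristic, not the authors' stated algorithm:

```python
from itertools import combinations

import numpy as np

def pick_affine_basis(points):
    """Choose the triple of model points with the largest minimum pairwise
    distance, i.e. a maximally separated basis triple."""
    best, best_score = None, -1.0
    for i, j, k in combinations(range(len(points)), 3):
        tri = points[[i, j, k]]
        d = [np.linalg.norm(tri[a] - tri[b]) for a, b in ((0, 1), (0, 2), (1, 2))]
        score = min(d)  # maximize the smallest pairwise separation
        if score > best_score:
            best_score, best = score, (i, j, k)
    return best
```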
The expression is independent of the actual locations of the model or image points,
so that the possible positions of the fourth point vary only with the sensor error and the
values of α and β. They do not vary with the configuration of the model basis (e.g., even
if close to collinear) nor do they vary with the configuration of the image basis. Thus,
the error range does not depend on the viewing direction. Even if the model is viewed
end on, so that all three model points appear nearly collinear, or if the model is viewed
at a small scale, so that all three model points are close together, the size of the region
of possible locations of the fourth model point in the image will remain unchanged.
The viewing direction does, however, greatly affect the affine coordinate system de-
fined by the three projected model points. Thus the set of possible affine coordinates of
the fourth point, when considered directly in α-β-space, will vary greatly. Proposition 1
defines the set of image locations consistent with a fourth point. This implicitly defines
the set of affine transformations that produce possible fourth image point locations, which
can be used to characterize the range of (α, β) values consistent with a set of four points.
We will do the analysis using the upper bound on the radius of the error disc from
Proposition 1. In actuality, the analysis is slightly more complicated, because the expres-
sion governing the disc radius varies as shown in figure 1. For our purposes, however,
considering the extreme case is sufficient. It should also be noted from the figure that the
extreme case is in fact quite close to the actual value over much of the range of α and β.
Given a triple of image points that form a basis, and a fourth image point, s4, we
want the range of affine coordinates for the fourth point that are consistent with the
possibly erroneous image measurements. In effect, each sensor point si takes on a range
of possible values, and each quadruple of such values produces a possibly distinct value
using equation (1). As illustrated in figure 3 we could determine all the feasible values
by varying the basis vectors over the uncertainty discs associated with their endpoints,
finding the set of (a',/~ ~) values such that the resulting point in this affine basis lies
within e of the original point. By our previous results, however, it is equivalent to find
affine coordinates (a',/~') such the Euclidean distance from s 1 --]-O~t(S2 -- Sl) -~- fl'(S 3 -- 81)
Fig. 3. On the left is a canonical example of affine coordinates. The fourth point is offset from
the origin by a scaled sum of basis vectors, αu + βv. On the right is a second consistent set of
affine coordinates. By taking other vectors that lie within the uncertainty regions of each image
point, we can find different sets of affine coordinates α', β' such that the new fourth point based
on these coordinates also lies within the uncertainty bound of the image point.
The boundary of the region of such points (α', β') occurs when the distance from the
nominal image point s4 = s1 + α(s2 − s1) + β(s3 − s1) is 2e(1 + |α'| + |β'|), i.e.

[2e(1 + |α'| + |β'|)]² = [(α − α')u]² + 2(α − α')(β − β')uv cos φ + [(β − β')v]²   (5)

where u and v are the lengths of the image basis vectors s2 − s1 and s3 − s1, φ is the
angle between them, and sα, sβ are the signs of α' and β'. Expanding yields a conic in
(α', β') with coefficients

a11 = u² − 4e²                    a22 = v² − 4e²
a12 = uv cos φ − 4 sα sβ e²       a13 = −u[αu + βv cos φ] − 4 sα e²
a23 = −v[αu cos φ + βv] − 4 sβ e² a33 = α²u² + 2αβ uv cos φ + β²v² − 4e²   (6)

I = u² + v² − 8e²   (7)
D = u²v² sin² φ − 4e²(u² − 2uv sα sβ cos φ + v²)   (8)
Δ = −4e²u²v² sin² φ (1 + sα α + sβ β)²   (9)
If u² + v² > 8e², then I > 0 and hence Δ/I < 0. Furthermore, if
then D > 0 and the conic defined by equation (5) is an ellipse. These conditions are not
met only when the image basis points are very close together, or when the image basis
points are nearly collinear. For instance, if the image basis vectors u and v are each at
least 2e in length, then u² + v² > 8e². Similarly, if sin φ is not small, then D > 0. In fact, cases
where these conditions do not hold will be very unstable and should be avoided.
We can now compute characteristics of the ellipse. The area of the ellipse is given by
Hence, given four points whose locations are only known to within e-discs, there is an
elliptical region of possible (α, β) values specifying the location of one point with respect
to the other three. Thus if we compare (α, β) values generated by some object model with
those specified by an e-uncertain image, each image datum actually specifies an ellipse
of (α, β) values, whose area depends on e, α, β, and the configuration of the three image
points that form the basis. To compare the model values with image values, one must see
if the affine-invariant coordinates for each model point lie within the elliptical region of
possible affine-invariant values associated with the corresponding image point.
The elliptical regions of consistent parameters in α-β-space cause some difficulties
for discrete hashing schemes. For example, geometric hashing uses affine coordinates
of model points, computed with respect to some choice of basis, as the hash keys to
store the basis in a table. In general, implementations of this method use square
buckets to tessellate the hash space (the α-β-space). Even if we choose buckets whose size
is commensurate with the ellipse, several such buckets are likely to intersect any given
ellipse due to the difference in shape. Thus, one must hash to multiple buckets, which
increases the probability that a random pairing of model and image bases will receive a
large number of votes.
A further problem for discrete hashing schemes is that the size of the ellipse increases
as a function of (1 + |α| + |β|)². Thus points with larger affine coordinates give rise to
larger ellipses. Either one must hash a given value to many buckets, or one must account
for this effect by sampling the space in a manner that varies with (1 + |α| + |β|)².
The most critical issue for discrete hashing schemes, such as geometric hashing, is
that the shape, orientation and position of the ellipse depend on the specific image basis
chosen. Because the error ellipse associated with a given (α, β) pair depends on the
characteristics of the image basis, which are not known until run time, there is no way
to pre-compute the error regions and thus no clear way to fill the hash table as a pre-
processing step, independent of a given image. It is thus either necessary to approximate
the ellipses by assuming bounds on the possible image basis, which will allow both false
positive and false negative hits in the hash table, or to compute the ellipse to access
at run time. Note that the geometric hashing method does not address these issues. It
simply assumes that some 'appropriate' tessellation of the image space exists.
In summary, in this section we have characterized the range of image coordinates and
the range of (α, β) values that are consistent with a given point, with respect to some
basis, when there is uncertainty in the image data. In the following section we analyze
what fraction of all possible points (in some bounded image region) are consistent with a
given range of (α, β) values. This can then be used to estimate the probability of a false
match for various recognition methods that employ affine transformations.
What is the probability that an object recognition system will erroneously report an
instance of an object in an image? Recall that such an instance in general is specified by
giving a transformation from model coordinates to image coordinates, and a measure of
'quality' based on the number of model features that are paired with image features under
this transformation. Thus we are interested in whether a random association of model
and image features can occur in sufficient number to masquerade as a correct solution.
We use the results developed above to determine the probability of such a false match.
There are two stages to this analysis; the first is a statistical analysis that is independent
of the given recognition method, and the second is a combinatorial analysis that depends
on the particular recognition method. In this section we examine the first stage. In the
following section we apply the analysis to the alignment method.
To determine the probability that a match will be falsely reported we need to know
the 'selectivity' of a quadruple of model points. Recall that each model point is mapped
to a point in α-β-space, with respect to a particular model basis. Similarly each image
point, modeled as a disc, is mapped to an elliptical region of possible points in α-β-space.
Each such region that contains one or more model points specifies an image point that is
consistent with the given model. Thus we need to estimate the probability that a given
image basis and fourth image point chosen at random will map to a region of α-β-space
that is consistent with one of the model points written in terms of some model basis.
This is characterized by the proportion of α-β-space consistent with a given basis and
fourth point (where the size of the space is bounded in some way). As shown above, the
elliptical regions in α-β-space are equivalent to circular regions in image space. Thus, for
ease of analysis we use the formulation in terms of circles in image space.
To determine the selectivity, assume we are given some image basis and a potential
corresponding model basis. Each of the remaining m − 3 model points is defined by affine
coordinates relative to the model basis. These can then be transformed into the image
domain, by using the same affine coordinates, with respect to the image basis. Because
of the uncertainty of the image points, there is an uncertainty in the associated affine
transformation. This manifests itself as a range of possible positions for the model points,
as they are transformed into the image. Previously we determined that a transformed
model point had to be within 2e(1 + |α| + |β|) of an image point in order to match it.
That calculation took into account error in the matched image point as well as the basis
image points. Therefore, placing an appropriately sized disc about each model point is
equivalent to placing an e-sized disc about each image point. We thus represent each
transformed model point as giving rise to a disc of some radius, positioned relative to the
nominal position of the model point with respect to the image basis. For convenience, we
use the upper bound on the size of the radius, 2e(1 + |α| + |β|). For each model point,
we need the probability that at least one image point lies in the associated error disc
about the model point transformed to the image, because if this happens then there is
a consistent model and image point for the given model and image basis. To estimate
this probability, we need the expected size of the disc. Since the disc size varies with
|α| + |β|, this means we need an estimate of the distribution of points with respect to
affine coordinates. By figure 1 we should find the distribution of points as a function of
(α, β). This is messy, and thus we use an approximation instead.
For this approximation, we measure the distribution with respect to p = |α| + |β|, since
both the upper and lower bounds on the disc size are functions of p. Intuitively we expect
the distribution to vary inversely with p. To verify this, we ran the following experiment.
A set of 25 points was generated at random, such that their pairwise separation was
between 25 and 250 pixels. All possible bases were selected, and for each basis for which
the angle between the axes was at least π/16, all the other model points were rewritten
in terms of affine invariant coordinates (α, β). This gave roughly 300,000 samples, which
we histogrammed with respect to p. We found that the maximum value for p in this case
was roughly 51. In general, however, almost all of the values were much smaller, and
indeed, the distribution showed a strong inverse drop-off (see figure 4). Thus, we use
the following distribution of points in affine coordinates:
Fig. 4. Histogram of the distribution of |α| + |β| values. The vertical axis is the ratio of the
number of samples to total samples; the horizontal axis is the value of |α| + |β|. The maximum
over 300,000 samples was 51. Only the first portion of the graph is displayed. Overlayed with
this is the distribution given in equation (13).
Note that this model underestimates the probability for large values of p, while over-
estimating it for small values of p. Since we want the expected size of the error disc, and
this grows with p, such an approximation will underestimate the size of the disc.
First, we integrate equation (13) and normalize to 1 to deduce the constant:

k = pm / (pm − 1)   (14)
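A quick numeric check of this normalization, assuming (as the k p⁻² factor in the later area integral suggests) a density of the form k p⁻² supported on 1 ≤ p ≤ pm:

```python
def normalizer(pm):
    """Constant k making the density k * p**-2 integrate to 1 on [1, pm]."""
    return pm / (pm - 1.0)

def total_mass(pm, n=100000):
    """Midpoint-rule integral of k * p**-2 over [1, pm]; should be ~1."""
    k = normalizer(pm)
    h = (pm - 1.0) / n
    return sum(k * (1.0 + (i + 0.5) * h) ** -2 for i in range(n)) * h
```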
The second case considers discs that lie entirely within the image. For convenience,
assume that the coordinate frame of the basis is centered at the image center (since the
circle is entirely inside the image), and the image dimensions are 2r by 2r. In this case,
we require r − ρ ≥ γ, where ρ is the distance of the disc center from the image center
and γ = 2e(1 + p) is the disc radius. In general, ρ ≤ pd, where d is the separation
between two of the basis points in the image, and thus if 1 ≤ p ≤ c1, where

c1 = min { pm, (r − 2e)/(d + 2e) },
then the discs will all lie entirely within the image. Thus the second case is

A2 = ∫ (from 1 to c1) π[2e(1 + p)]² k p⁻² dp = 4πe²k [ c1 + 2 log c1 − 1/c1 ]
   = 4πe²k [ (r − 2e)/(d + 2e) + 2 log((r − 2e)/(d + 2e)) − (d + 2e)/(r − 2e) ].   (16)
The final expansion assumes that pm > c1, which is true for virtually all cases of interest.
Two other cases deal with discs that are partially truncated by the image bound-
aries. Details of these areas A3 and A4 are found in [14]. Because these areas contribute
minimally to the overall expected area, we focus on the cases described above.
Depending on the specific values for pm and c1, we can add in the appropriate contri-
butions of A1, ..., A4, together with the value for k (equation 14), to obtain an underesti-
mate for the expected area of an error disc: the expected area of a circle in image space
that will be consistent with a point expressed in terms of some affine basis. Since such
discs can in general occur with equal probability anywhere in the image, the probability
that a model point lies within a disc associated with an image point is simply the ratio
of this area to the area of the image. Thus by normalizing these equations, we have an
underestimate for the selectivity of the scheme. This leads to the following result:
Proposition 3. Given a model basis and a fourth model point, the probability that a
corresponding image basis and fourth image point will map at random to a region of
α-β-space consistent with the model point and basis is given by

p = (A₁ + A₂ + A₃ + A₄) / (4r²)   (17)

where the Aᵢ's are the areas for the four cases considered above.
This uses the upper bound on the radius of the error discs. As noted earlier, a simple
lower bound can be obtained by substituting e/2 in place of e, reflecting the use of the
bound e(1 + ρ) in place of 2e(1 + ρ). In this case, the bounds c₁ and c₂ will change slightly.
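As a numeric illustration of how these pieces combine, the sketch below computes the second-case contribution and the resulting selectivity underestimate. The function names are ours, the closed form assumed for A₂ is 4πe²k(c₁ + 2 log c₁ − 1/c₁) with c₁ = min{ρ_m, (r − 2e)/(d + 2e)}, and the normalization constant k is left as a parameter of the sketch:

```python
import math

def area_case2(e, d, r, rho_m, k=1.0):
    """Expected-area contribution A2 (discs lying entirely inside the
    image), under the assumed closed form
        A2 = 4*pi*e^2 * k * (c1 + 2*log(c1) - 1/c1),
    with c1 = min(rho_m, (r - 2e)/(d + 2e)).  Requires c1 >= 1."""
    c1 = min(rho_m, (r - 2.0 * e) / (d + 2.0 * e))
    return 4.0 * math.pi * e * e * k * (c1 + 2.0 * math.log(c1) - 1.0 / c1)

def selectivity_underestimate(e, d, r, rho_m, k=1.0):
    """Ratio of the (partial) expected disc area to the image area 4r^2;
    the cases A1, A3, A4 are omitted in this sketch."""
    return area_case2(e, d, r, rho_m, k) / (4.0 * r * r)
```

Only A₂ is included here; the remaining areas A₁, A₃, A₄ would be added in the same way before dividing by the image area 4r².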
We can use this to compute example values for the selectivity, which depends on ρ_m
(the maximum value of |α| + |β|). If we allow any possible triple of points to form a basis,
then ρ_m can be arbitrarily large. Consider a point p that makes an angle θ with the u
axis, and where u, v make an angle ψ. The value for ρ associated with the point p can be
expressed in terms of θ, ψ, m and M,
where m and M are the minimum and maximum distances between any two model points.
To evaluate the selectivity, we also need to know d, the length of the basis vector,
1 ≤ d ≤ r. Given a specific value for d, we can compute the selectivity. To get a sense of
the variation of p, it is plotted as a function of d in figure 5, for e = 3.
Fig. 5. The selectivity p plotted as a function of d, for e = 3.
In general, d will take on a variety of values, as the choice of basis points in the
image is varied. To estimate the expected degree of selectivity, we perform the following
analysis. We assume, for simplicity, that the origin of the image basis is at the center of
the image. The second point used to establish the basis vector can lie anywhere in the
image, with equal probability. Hence the probability distribution for d is roughly proportional to d. We
could explicitly integrate equation 17 with respect to this distribution for d to obtain an
expected selectivity. This is messy, and instead we pursue two other options.
First, we can integrate this numerically for a set of examples, shown in Table 1 under
the column marked predicted, which lists values for p as a function of noise in the image
(with an image dimension of 2r = 500). The value of ρ_m was set using ψ₀ = π/16,
and a ratio of maximum to minimum model point separation of M/m = 10. It should
be noted that varying ψ₀ over the range π/8 to π/32 produced results very similar to
those reported in the table. As expected, the probability of a consistent match increases
(selectivity decreases) with increasing error in the measurements. Thus, for ranges of
parameters that one would find in many recognition situations, a considerable fraction
of the space of possible α and β values is consistent with a given feature and basis.
Table 1. Comparison of simulated and predicted selectivities. See text for discussion.

Case    Measured   Predicted   Approximation
e = 1   .000116    .000117     .000118
e = 3   .001146    .001052     .001064
e = 5   .003142    .002911     .002955
Second, we can approximate the selectivity expression. By applying power series
expansions and keeping only first and second order terms, we get:

p ≈ (kπe²/r²) [ 17/4 + 2 log(r/d) + (r² − d²)/(rd) ].   (19)
We can find the expected value for equation 19 over the distribution for d, where g ≤
d ≤ r, for some minimum value g. If we assume g ≪ r, this expected value reduces to
an expression that predicts values close to those in Table 1, as shown in the approximation column.
Note that the selectivity is clearly not linear in sensor error. For a fixed size image, increasing
the error e by some amount should decrease the selectivity by at least a quadratic
effect (perhaps more, since there are higher order terms). This is reflected in Table 1. This
expected value of the selectivity allows us to analyze the probability that a match will be
reported at random by some recognition method that uses affine transformations. The
selectivity, p, in essence reflects the power of a given quadruple of features to distinguish
a particular model. Now we consider the manner in which information from multiple
quadruples is combined. This analysis differs slightly for different recognition methods.
As an illustration of how the analysis applies, we consider the alignment method.
The initial version of the affine-invariant alignment method was restricted to planar
objects [15], whereas later versions operate on 3D models (unlike affine hashing, which
uses 2D models) [16]. We consider the 2D case. The basic method is:
- Choose an ordered triple of image features and an ordered triple of model features,
and hypothesize that these are in correspondence.
- Use this correspondence to compute an affine transformation mapping the model into
the image.
- Apply this transformation to all of the remaining model features, thereby mapping
them into the image.
- Search over an appropriate neighborhood about each projected model feature for a
matching image feature, and count the total number of matched features.
This operation is in principle repeated for each ordered triple of model and image
features, although it may be terminated after one or more matches are found, or after a
certain number of triples are tried without finding a match.
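The four steps above can be sketched as follows; a minimal Python illustration, where the helper names `affine_from_triple` and `alignment_votes` are ours, points are 2-D tuples, and `eps` stands in for the matching radius:

```python
import math

def affine_from_triple(m, s):
    """2x2 part of the affine map taking model triple m to image triple s,
    using m[0] and s[0] as origins: A maps (m[i]-m[0]) to (s[i]-s[0])."""
    u1, v1 = m[1][0] - m[0][0], m[1][1] - m[0][1]
    u2, v2 = m[2][0] - m[0][0], m[2][1] - m[0][1]
    p1, q1 = s[1][0] - s[0][0], s[1][1] - s[0][1]
    p2, q2 = s[2][0] - s[0][0], s[2][1] - s[0][1]
    det = u1 * v2 - u2 * v1          # model basis must be non-degenerate
    return ((p1 * v2 - p2 * v1) / det, (p2 * u1 - p1 * u2) / det,
            (q1 * v2 - q2 * v1) / det, (q2 * u1 - q1 * u2) / det)

def alignment_votes(model, image, eps):
    """Hypothesize model[:3] <-> image[:3], map the remaining model points
    into the image, and count those with an image feature within eps."""
    a11, a12, a21, a22 = affine_from_triple(model[:3], image[:3])
    (ox, oy), (sx, sy) = model[0], image[0]
    votes = 0
    for x, y in model[3:]:
        px = a11 * (x - ox) + a12 * (y - oy) + sx
        py = a21 * (x - ox) + a22 * (y - oy) + sy
        if any(math.hypot(px - ix, py - iy) <= eps for ix, iy in image):
            votes += 1
    return votes
```

In a full implementation this pairing would be repeated over ordered triples of model and image features, terminating as described in the text.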
We can use the expressions derived above to analyze the sensitivity of the alignment
method. The key question is whether a random collection of sensor points can masquerade
as a correct interpretation. In this case, we can investigate the probability of such false
positive identifications as follows. The probability that a particular model point is
consistent with at least one of the s − 3 remaining image points is

p̂ = 1 − (1 − p)^(s−3)

because the probability that a particular model point is not consistent with a particular
image point is (1 − p), and by independence, the probability that all s − 3 points
are not consistent with this model point is (1 − p)^(s−3).
3. The process is repeated for each model point, so the probability of exactly k of them
having a match is

q_k = (m − 3 choose k) p̂^k (1 − p̂)^(m−3−k).   (21)
Further, the probability of a false positive identification of size at least k is

w_k = 1 − Σ_{i=0}^{k−1} q_i.
Note that this is the probability of a false positive for a particular sensor basis and
a particular model basis.
4. This process is repeated for all choices of model bases, so the probability of a false
positive identification for a given sensor basis with respect to any model basis is

e_k = 1 − (1 − w_k)^(m choose 3).   (22)
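The chain of probabilities above can be sketched numerically. This is a hedged reading: we take m − 3 free model points in the binomial step and C(m, 3) choices of model basis in the final step; the function name is ours:

```python
import math

def false_positive_probs(p, s, m, k):
    """p: single point-basis selectivity; s: number of sensor features;
    m: number of model features; k: false-positive size.
    Returns (p_hat, w_k, e_k)."""
    p_hat = 1.0 - (1.0 - p) ** (s - 3)          # some image point consistent
    n = m - 3                                    # free model points
    def q(i):                                    # exactly i of them match
        return math.comb(n, i) * p_hat**i * (1.0 - p_hat)**(n - i)
    w_k = 1.0 - sum(q(i) for i in range(k))      # at least k match
    e_k = 1.0 - (1.0 - w_k) ** math.comb(m, 3)   # over all model bases
    return p_hat, w_k, e_k
```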
To check the correctness of our model, we ran a series of experiments based on equation
21. In particular, we used our analysis to generate a distribution for the probability
of a false positive of size k, given e = 3 and r = 250, and using a model with 25 features
and images with 25, 50, 100 and 200 features. For comparison, we also generated sets of
model and image points of the same sizes, selected bases for each at random, and determined
the size of the vote associated with that pairing of bases. That is, for each model
point (other than the basis points) we computed the affine coordinates relative to the
chosen basis. Then we used the affine coordinates to determine the nominal position of
an associated image point. If at least one image point was contained within a given model
point's error disc, then we incremented the vote for this pairing of bases. This trial was
repeated 1000 times. The results are shown in figure (6).
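A single trial of this experiment can be sketched as follows (our own reconstruction of the described procedure, using the disc-radius bound 2e(1 + |α| + |β|) from the analysis; the helper names are ours):

```python
import random

def affine_coords(p, b0, b1, b2):
    """Affine coordinates (alpha, beta) of p in the basis (b0, b1, b2)."""
    u1, v1 = b1[0] - b0[0], b1[1] - b0[1]
    u2, v2 = b2[0] - b0[0], b2[1] - b0[1]
    det = u1 * v2 - u2 * v1
    dx, dy = p[0] - b0[0], p[1] - b0[1]
    return (dx * v2 - dy * u2) / det, (dy * u1 - dx * v1) / det

def trial_vote(m, s, e, r, rng):
    """One random trial: m model and s image points uniform in a
    2r x 2r image, first three of each taken as the basis pairing."""
    model = [(rng.uniform(-r, r), rng.uniform(-r, r)) for _ in range(m)]
    image = [(rng.uniform(-r, r), rng.uniform(-r, r)) for _ in range(s)]
    b0, b1, b2 = image[0], image[1], image[2]
    votes = 0
    for p in model[3:]:
        a, b = affine_coords(p, model[0], model[1], model[2])
        # nominal image location predicted by the affine coordinates
        nx = b0[0] + a * (b1[0] - b0[0]) + b * (b2[0] - b0[0])
        ny = b0[1] + a * (b1[1] - b0[1]) + b * (b2[1] - b0[1])
        radius = 2.0 * e * (1.0 + abs(a) + abs(b))   # error-disc bound
        if any((ix - nx) ** 2 + (iy - ny) ** 2 <= radius * radius
               for ix, iy in image):
            votes += 1
    return votes
```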
Fig. 6. Comparison of predicted and measured probabilities of false positives. Each graph compares
the probability of a false peak of size k observed at random. The cases are for m = 25
and s = 25 and 50 in the top row, and s = 100 and 200 in the bottom row. In each case, e = 3.
The graphs drawn with triangles show the predicted probability, while the graphs drawn with
squares show the observed empirical probabilities.
Fig. 7. Graph of probability of false positives. In each graph, the vertical axis is the probability of a false
positive of size k, and the horizontal axis is k. Each family of plots represents a different number of
sensor features, starting with s = 25 for the leftmost plot, and increasing by increments of 25.
For the three families in the top row, the model consisted of 25 features, and the sensor error
was e = 1, 3 and 5 (from right to left). For the three families in the bottom row, e was fixed
at 3, and the model had 25, 38 and 50 features (from right to left).
One can see that the cases are in good agreement. In fact, our model tends to overestimate
the probability of small false positives, and underestimate the probability of large
false positives, so our results will tend to be conservative.
Next, what does e_k look like? As an illustration, we graph in figure (7) the probability
of a false positive based on equation (22). In particular, we use a selectivity based on
e = 3, obtained from Table 1, and plot the value of e_k for an object with 25, 38 or 50
model features, for different values of k and a given number of sensor features s.
The process was repeated for different numbers of sensor features.
5 Summary
The computation of an affine-invariant representation in terms of a coordinate frame
( m l , m 2 , m 3 ) has been used in a number of model-based recognition methods. Nearly
all of these recognition methods were developed assuming no uncertainty in the sensory
data, and then various heuristics were used to allow for error in the locations of sensed
points. In this paper we have formally examined the effect of sensory uncertainty on such
recognition methods. This analysis involves considering both the Euclidean plane used
by the alignment method, and the space of affine-invariant (a,/~) coordinates used by
the geometric hashing method. Our analysis models each sensor point in terms a disc of
possible locations, where the size of this disc is bounded by an uncertainty factor, e.
Under the bounded uncertainty error model, in the Euclidean space the set of possible
values for a given point x and a basis (m₁, m₂, m₃) forms a disc whose radius is bounded
by r = ke(1 + |α| + |β|), where 1 ≤ k ≤ 2. That is, assuming that each image point has
a sensing uncertainty of magnitude e, the range of image locations that are consistent
with x forms a circular region. In the α-β-space, the set of possible values of the affine
coordinates of a point x in terms of a basis (m₁, m₂, m₃) forms an ellipse (except in
degenerate cases). The area, center and orientation of this ellipse are given by somewhat
complicated expressions that depend on the actual configuration of the basis points.
The most important consequence of our analysis is that the set of possible values in the
α-β-plane cannot be computed independently of the actual locations of the model or image
basis points. This means that the table constructed by the geometric hashing method
can only approximate the correct values, because the locations of the image points are
not known at construction time. We further find that for even moderately large positional
uncertainty, methods that use affine transformations have a substantial probability of
false positive matches. These methods only check the consistency of each matched point
with a set of basis matches. They do not ensure the global consistency of all matched
points. Our results suggest that such methods will require that a substantial number of
hypothesized matches be ruled out by some subsequent verification stage.
References
1. Ballard, D.H., 1981, "Generalizing the Hough Transform to Detect Arbitrary Patterns,"
Pattern Recognition 13(2): 111-122.
2. Basri, R. & S. Ullman, 1988, "The Alignment of Objects with Smooth Surfaces," Second
Int. Conf. Comp. Vision, 482-488.
3. Besl, P.J. & R.C. Jain, 1985, "Three-dimensional Object Recognition," ACM Computing
Surveys, 17(1):75-154.
4. Costa, M., R.M. Haralick & L.G. Shapiro, 1990, "Optimal Affine-Invariant Point Match-
ing," Proc. 6th Israel Conf. on AI, pp. 35-61.
5. Cyganski, D. & J.A. Orr, 1985, "Applications of Tensor Theory to Object Recognition and
Orientation Determination", IEEE Trans. PAMI, 7(6):662-673.
6. Efimov, N.V., 1980, Higher Geometry, translated by P.C. Sinha. Mir Publishers, Moscow.
7. Ellis, R.E., 1989, "Uncertainty Estimates for Polyhedral Object Recognition," IEEE
Int. Conf. Rob. Aut., pp. 348-353.
8. Forsyth, D., J.L. Mundy, A. Zisserman, C. Coelho, A. Heller, & C. Rothwell, 1991, "Invariant
Descriptors for 3-D Object Recognition and Pose", IEEE Trans. PAMI, 13(10):971-991.
9. Grimson, W.E.L., 1990, Object Recognition by Computer: The role of geometric constraints,
MIT Press, Cambridge.
10. Grimson, W.E.L., 1990, "The Combinatorics of Heuristic Search Termination for Object
Recognition in Cluttered Environments," First Europ. Conf. on Comp. Vis., pp. 552-556.
11. Grimson, W.E.L. & D.P. Huttenlocher, 1990, "On the Sensitivity of the Hough Transform
for Object Recognition," IEEE Trans. PAMI 12(3):255-274.
12. Grimson, W.E.L. & D.P. Huttenlocher, 1991, "On the Verification of Hypothesized Matches
in Model-Based Recognition", IEEE Trans. PAMI 13(12):1201-1213.
13. Grimson, W.E.L. & D.P. Huttenlocher, 1990, "On the Sensitivity of Geometric Hashing",
Proc. Third Int. Conf. Comp. Vision, pp. 334-338.
14. Grimson, W.E.L., D.P. Huttenlocher, & D.W. Jacobs, 1991, "Affine Matching With
Bounded Sensor Error: A Study of Geometric Hashing & Alignment," MIT AI Lab Memo
1250.
15. Huttenlocher, D.P. and S. Ullman, 1987, "Object Recognition Using Alignment",
Proc. First Int. Conf. Comp. Vision, pp. 102-111.
16. Huttenlocher, D.P. & S. Ullman, 1990, "Recognizing Solid Objects by Alignment with an
Image," Inter. Journ. Comp. Vision 5(2):195-212.
17. Jacobs, D., 1991, "Optimal Matching of Planar Models in 3D Scenes," IEEE Conf. Comp.
Vis. and Patt. Recog. pp. 269-274.
18. Klein, F., 1939, Elementary Mathematics from an Advanced Standpoint: Geometry, MacMil-
lan, New York.
19. Korn, G.A. & T.M. Korn, 1968, Mathematical Handbook for Scientists and Engineers,
McGraw-Hill, New York.
20. Lamdan, Y., J.T. Schwartz & H.J. Wolfson, 1988, "Object Recognition by Affine Invariant
Matching," IEEE Conf. Comp. Vis. and Patt. Recog. pp. 335-344.
21. Lamdan, Y., J.T. Schwartz & H.J. Wolfson, 1990, "Affine Invariant Model-Based Object
Recognition," IEEE Trans. Rob. Aut., vol. 6, pp. 578-589.
22. Lamdan, Y. & H.J. Wolfson, 1988, "Geometric Hashing: A General and Efficient Model-
Based Recognition Scheme," Second Int. Conf. Comp. Vis. pp. 238-249.
23. Lamdan, Y. & H.J. Wolfson, 1991, "On the Error Analysis of 'Geometric Hashing'," IEEE
Conf. Comp. Vis. and Patt. Recog. pp. 22-27.
24. Thompson, D. & J.L. Mundy, 1987, "Three-Dimensional Model Matching From an Uncon-
strained Viewpoint", Proc. IEEE Conf. Rob. Aut. p. 280.
25. Van Gool, L., P. Kempenaers & A. Oosterlinck, 1991, "Recognition and Semi-Differential
Invariants," IEEE Conf. Comp. Vis. and Patt. Recog. pp. 454-460.
26. Wayner, P.C., 1991, "Efficiently Using Invariant Theory for Model-based Matching," IEEE
Conf. Comp. Vis. and Patt. Recog. pp. 473-478.
27. Weiss, I., 1988, "Projective Invariants of Shape," DARPA IU Workshop pp. 1125-1134.
28. Wolfson, H.J., 1990, "Model Based Object Recognition by Geometric Hashing," First
Europ. Conf. Comp. Vis. pp. 526-536.
This article was processed using the LaTeX macro package with ECCV92 style
EPIPOLAR LINE ESTIMATION *
Søren I. Olsen
University of Copenhagen, Universitetsparken 1, DK-2100 Copenhagen, Denmark
1 INTRODUCTION
In the past decades a large number of algorithms for solving the computational stereo
correspondence problem have been proposed. To reduce the size of the correspondence
problem, most algorithms have assumed that the equations for the epipolar lines are known
in advance. The right image epipolar line of an observed left image point is the projection
onto the right image plane of the 3D line through the observed point and the focal point
of the left camera. In most algorithms it is assumed that the epipolar lines coincide with
the scan lines. This situation occurs only when the two image planes are contained in a
common 3D plane, and when the vertical camera coordinate axes are parallel. In practice,
it is nearly impossible to orient the cameras to satisfy this assumption. Alternatively,
by using a specially designed calibration object, the transformation between the two
camera coordinate systems may be estimated prior to the application. Based on the
transformation the equation for the epipolar lines can be found. The method of camera
calibration is well suited for experiments in laboratory or industrial environments where
time is available, where the cameras are not moved after the calibration has been made,
or where the change in the stereo geometry is known very accurately. Alternatively,
the epipolar line equation may be estimated directly from the images. This approach
is relevant when measures of disparity suffice, and may be useful when no calibration
object is available. In the present work it is shown that for a broad class of stereo images
the epipolar line equation can be obtained directly from the images without restrictive
assumptions on the stereo geometry, knowledge of the values of the focal lengths, etc.
We assume that the two cameras, referred to as the left and right camera, can be
modeled by pin-hole cameras with focal lengths fL and fR. The two camera coordinate
systems are both assumed to be left-handed, and related by an affine transformation. The
image point of, say, the left camera focal center is denoted by (aL, bL). Each of the image
coordinate systems is assumed to align with the camera coordinate system by fixing the
z-component to the value of the focal length, and by translating the origin by (aL, bL).
A point P with three-dimensional left camera coordinates (X, Y, Z) projects onto the left
image coordinates (xL, yL):

xL = fL X/Z + aL   and   yL = fL Y/Z + bL   (1)
* This work was supported by The Danish Natural Science Research Council, program no.
11-6969, and program no. 11-8345.
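Under the standard reading of projection equation (1), the mapping can be sketched as follows (the helper name is ours):

```python
def project_left(X, Y, Z, fL, aL, bL):
    """Pinhole projection of a left-camera point (X, Y, Z) onto left-image
    coordinates (xL, yL), as in equation (1); Z must be nonzero."""
    return fL * X / Z + aL, fL * Y / Z + bL
```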
c₁ + c₂xL + c₃yL + c₄xR + c₅yR + c₆xLxR + c₇xLyR + c₈yLxR + c₉yLyR = 0   (2)

where the coefficients cᵢ are determined by the affine transformation, by the focal lengths,
and by the image coordinates of the two focal centers. In binocular stereo vision, vertical
epipolar lines rarely occur. To make sure that the simple situation yL = yR can be
modeled well we choose to fix the coefficient c₅ to −1. Assuming that a sufficient number
of true correspondences have been established, we may then solve the set of linear
equations.
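With c₅ fixed to −1, each correspondence ((xL, yL), (xR, yR)) gives one linear equation in the remaining eight coefficients, which can be solved in the least-squares sense via the normal equations. A self-contained sketch (our own helper names; a practical implementation would prefer a numerically stabler solver such as QR):

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting
    (adequate for the small 8x8 normal equations used here)."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def epipolar_coeffs(matches):
    """Least-squares estimate of (c1, c2, c3, c4, c6, c7, c8, c9) in
    equation (2) with c5 = -1: each correspondence ((xL, yL), (xR, yR))
    contributes one equation  row . c = yR."""
    rows = [[1.0, xl, yl, xr, xl * xr, xl * yr, yl * xr, yl * yr]
            for (xl, yl), (xr, yr) in matches]
    rhs = [yr for (_, _), (_, yr) in matches]
    n = 8
    AtA = [[sum(r[i] * r[j] for r in rows) for j in range(n)] for i in range(n)]
    Atb = [sum(r[i] * y for r, y in zip(rows, rhs)) for i in range(n)]
    return solve(AtA, Atb)
```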
To establish the set of linear equations (2), a (relatively small) number of corresponding
feature points have to be localized in the images. In the present approach we will benefit
from the vast amount of knowledge gained from studying the behavior of extrema in the
so-called scale-space [9,3,4]. The local extrema in the scale-space mark centers of light
and dark image blobs. In the present approach, the feature points are detected as the
local extrema of the convolution of the images with the Laplacian of a Gaussian [5,1].
For computational practice the sampling in the scale-space must be sparse. By using
a logarithmic sampling [3], a set of five sampling levels, corresponding to preferable
blob-diameters Bⱼ of about 11, 16, 23, 32, and 45 pixels, was chosen. After all points
of local extremum have been marked for each of the sampling levels in the scale-space,
the redundancy of the representation is pruned by checking, for each local extremum, whether
an extremum of the same sign is localized within a small search area at the neighboring
sampling levels. If so, the extremum with the smaller value is disregarded. Next, weak
extrema (caused by the side-lobes of the Laplacian-Gaussian, or by smooth non-blob
intensity surface changes) are removed.
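The Laplacian-of-Gaussian blob detection described above can be sketched as follows (our own minimal version: a sampled LoG kernel, valid-mode convolution, and 8-neighbour extremum detection; a practical implementation would detect extrema at several scales and prune across levels as described):

```python
import math

def log_kernel(sigma, size):
    """Sampled Laplacian-of-Gaussian kernel of the given odd size."""
    half = size // 2
    s2 = sigma * sigma
    return [[(x * x + y * y - 2 * s2) / (s2 * s2)
             * math.exp(-(x * x + y * y) / (2 * s2))
             for x in range(-half, half + 1)]
            for y in range(-half, half + 1)]

def convolve(img, k):
    """Valid-mode 2-D convolution on plain nested lists (no padding)."""
    kh, kw = len(k), len(k[0])
    return [[sum(img[i + a][j + b] * k[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(len(img[0]) - kw + 1)]
            for i in range(len(img) - kh + 1)]

def local_extrema(resp):
    """Positions whose response dominates (or is dominated by) all
    8 neighbours: candidate blob centres."""
    pts = []
    for i in range(1, len(resp) - 1):
        for j in range(1, len(resp[0]) - 1):
            nb = [resp[i + di][j + dj] for di in (-1, 0, 1)
                  for dj in (-1, 0, 1) if (di, dj) != (0, 0)]
            if resp[i][j] > max(nb) or resp[i][j] < min(nb):
                pts.append((i, j))
    return pts
```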
As in previous methods [8,7], the support function S is chosen to be increasing with the
similarity of disparity, and decreasing in the distance r between the supporting image
points. In contrast to [8,7], we find it useful to make S depend on the a priori belief β of
the candidate matches. This is defined as the ratio of the minimal and maximal extremal
values of the feature points. By this definition, candidate matches between feature points
which mark blobs of significantly different intensities are penalized. Defining Δd as the
length of the disparity difference vector, i.e. Δd = |dP − dQ|, the support function S
was defined by:

S(PL→PR, QL→QR) = βP βQ · (1/r) · 1/(1 + Δd)   (3)
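Assuming the support takes the form βP βQ · (1/r) · 1/(1 + Δd) as read above, a sketch (names ours; disparities are 2-D vectors):

```python
import math

def support(beta_P, beta_Q, r, d_P, d_Q):
    """Support that candidate match Q lends to candidate match P:
    increasing in disparity similarity, decreasing in the distance r
    between the supporting points, weighted by the a priori beliefs."""
    delta_d = math.hypot(d_P[0] - d_Q[0], d_P[1] - d_Q[1])
    return beta_P * beta_Q * (1.0 / r) * (1.0 / (1.0 + delta_d))
```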
5 EXPERIMENTS
In Fig. 1 the results of applying the method on two synthetic ray-traced stereo image
pairs are shown. Both images are overlaid with the (approximately conjugated) estimated
epipolar lines. In the upper row images, the transformation from the left to the right
camera is described by a rotation around the vertical axis and a translation along the
horizontal axis and the z-axis, i.e. vergence stereo. The vergence angle is 30 degrees.
Both of the focal lengths were equal. The accepted matches are shown by small black
squares. In the second row of Fig. 1 the two camera coordinate systems are related by a
translation along all three axes. Rotations are made around the vertical axis and around
the optical axis. The vergence angle is 16 degrees. The rotation around the z-axis is 10
degrees. The values of the two focal lengths differ by about 8%. For clarity, only a subset
of epipolar lines are shown in this example.
Table 1 shows the number of detected feature points, of candidate matches, of accepted
matches (i.e. matches for which w(r_i) ≠ 0), and the number of false matches.
Most false matches are located close to the true epipolar line. For all pixels in the left
image having a visible corresponding point in the right image the distance was measured
between the truly corresponding right image point and the estimated epipolar line.
Table 1 shows that both the average and the standard deviation of this error are small
numbers. The method has also been tested on real laboratory and outdoor stereo im-
ages. For these images the estimation accuracy was measured manually. Typically, the
largest error found was on the order of 2-3 pixels. Due to space limits these results are
not reported further here.
6 CONCLUSION
Based on the limited number of experiments made, the method proposed for estimating
the epipolar lines in stereo images indeed seems promising. The experiments indicate that
if the slope of the epipolar lines is not too steep, then the epipolar lines may be estimated
to within a few pixels. The method is not directly applicable to trinocular stereo. More
experiments are needed to find the range of values (e.g. for the affine transformation)
for which a reliable solution can be found. A weak point is the selection of prominent
extremes. By the present approach there is a risk that some truly corresponding feature
points will be detected at different levels in the scale-space. This may happen if fL and
fR are not approximately equal, or if the amount of foreshortening in the two images
differs significantly. To increase the robustness, alternative approaches might be to link
the feature points across scale, or to accept matches between feature points located
at neighboring levels of scale. However, preliminary experiments indicate that with such
approaches the number of false matches often increases to an unacceptable level.
References
1. Blostein D. and Ahuja N. Shape from Texture: Integrating Texture Element Extraction and
Surface Estimation. IEEE PAMI vol. 11 no. 12, 1989, pp. 1233-1251
2. Huber P. J. Robust Statistics. Wiley, New York, 1981
3. Koenderink J. J. and van Doorn A. J. The Structure of Images. Biol. Cyb. vol. 50, 1984,
pp. 363-370
4. Lindeberg T. On the Behavior in Scale Space of Local Extrema and Blobs. Proc. 7'th SCIA,
1991, pp. 8-17
5. Marr D. and Hildreth E. C. Theory of Edge Detection. Proc. R. Soc. London B, vol. 207
1980, pp. 187-217
Fig. 1. Two synthetic stereo images overlaid with the estimated epipolar lines. The matched
feature points are marked by black squares.
6. Meer P., Mintz D. and Rosenfeld A. Robust Regression Methods for Computer Vision: A
Review. International Journal of Computer Vision 6:1, 1991, pp. 59-70
7. Pollard S. B., Mayhew J. E. W. and Frisby J. P. PMF: A Stereo Correspondence Algorithm
Using a Disparity Gradient Limit. Perception vol. 14; 1985. pp. 449-470
8. Prazdny K. Detection of Binocular Disparities. Biological Cybernetics vol. 52; 1985, pp.
93-99
9. Witkin A. P. Scale Space Filtering. Proc. 7'th IJCAI 1983, pp. 1019-1022
This article was processed using the LaTeX macro package with ECCV92 style
Camera Calibration Using Multiple Images
Paul Beardsley, David Murray, Andrew Zisserman *
Robotics Group, Dept. of Engineering Science, University of Oxford, Oxford OX1 3PJ, UK.
tel: +44-865 273154 fax: +44-865 273908 email: [pab,dwm,az]@uk.ac.ox.robots
1 Introduction
The formation of images in a camera is described by the pinhole model. This model can
be specified by the camera's focal length, principal point, and the aspect ratio. These
parameters are called intrinsic because they refer to properties inherent to the camera.
In contrast, extrinsic parameters describe the translation and rotation of a camera with
respect to an external coordinate frame and are thus solely dependent on the camera's po-
sition, not on its inherent properties. The calibration method in this document measures
intrinsic parameters only.
Camera calibration is a central topic in the field of photogrammetry [Ph80]. In computer
vision, the best-known calibration method is due to Tsai [Ts85] [LT88]. Other important
references include [Ga84], [FT87], [Pn90]. Recently, calibration methods based on
the use of vanishing points have been proposed [CT90], [Ka92]. The calibration method
here uses vanishing points, the novel aspect of the work being that vanishing points are
shown to move in a predictable way for constrained scene motions. This offers a way of
integrating data over many images prior to calibrating the camera. The motivation for
utilising more data is to obtain resilience to noise.
2 Intrinsic Parameters
Figure 1(a) shows the standard pinhole model of a camera. O is the point through which
all rays project and is referred to as the optic centre. The perpendicular from O to the
image plane intersects the image plane at the principal point, P. The line OP is the optic
axis and the distance OP is the focal length. Figure 1(b) shows the customary way in
which the physical system is actually represented, with a right-handed coordinate frame
* This work was supported by SERC Grant No GR/G30003. PB is in receipt of a SERC stu-
dentship. AZ is supported by SERC.
313
(x_c, y_c, z_c)² having origin O and z-axis aligned with OP. The ratio of the scaling along the
y-axis to the scaling along the x-axis is called the aspect ratio.
Fig. 1. The pinhole model. (a) shows the standard model for image formation - rays of light
from the scene pass through the pinhole and strike the image plane creating an inverted image;
(b) the usual diagrammatic representation of the system with a right-handed coordinate frame
set up at O.
A real camera lens does not act like a perfect pinhole, so there are deviations from
the model above. The most significant of these is known as radial distortion. The form
of radial distortion and methods for its correction are described in [Ph80], [Be91].
Fig. 2. The parallel lines in the scene appear as converging lines in the image. Each set of
parallel lines has its own vanishing point in the image. The line through the vanishing points is
the vanishing line (the horizon) of the plane in the scene.
Vanishing points and vanishing lines are important sources of information about the
geometry of the scene. A vector at the optic centre pointing towards a vanishing point
² v is used to denote a vector of arbitrary magnitude, and v̂ to denote a unit vector.
on the image plane is parallel to the physical lines in the scene which have given rise to
the vanishing point. Also, a plane which passes through the optic centre and a vanishing
line on the image plane has the same normal as the physical plane which has given rise
to the vanishing line [Ka92].
The main idea in the proposed calibration method is that, given a static camera observing
constrained motion of a calibration plane, vanishing points and vanishing lines move in
a predictable way. Figure 3 illustrates the physical setup of the system. From an image
of the calibration plane, it is possible to obtain a number of vanishing points and hence
the vanishing line of the plane. This section shows that rotation of the calibration plane
causes a vanishing point of the plane to move along a conic, and causes the vanishing
line of the plane to generate an envelope (Fig. 7) which is a conic.
Fig. 3. The physical system. A camera is pointing at a calibration plane which is slanted away
from the fronto-parallel position. The calibration plane is mounted on an axis r̂ about which it
can be rotated, and r̂ is skewed away from the optic axis. There is no requirement for any
exact positioning of the camera or calibration plane.
Referring to Fig. 3, consider a set of parallel lines on the calibration plane with
direction given by the unit vector l̂. As the plane rotates, the direction of l̂ changes in the
way described by the following parametric representation (in spherical polar coordinates):

l̂(θ) = sin φ cos θ î_r + sin φ sin θ ĵ_r + cos φ r̂   (1)

where (î_r, ĵ_r, r̂) is an orthonormal set of vectors with r̂ parallel to the axis of rotation, φ
is a constant given by cos φ = l̂ · r̂, and θ is the angle of rotation with range 0 ≤ θ < 2π.
It was pointed out in Sect. 3.1 that a vector passing through the optic centre and a
vanishing point is parallel to the physical lines in the scene which have given rise to the
vanishing point, i.e. given a set of parallel lines l(θ) on the calibration plane, there is a
parallel vector l_c(θ) located at the optic centre and passing through the vanishing point
for the lines. Now, the motion of the physical lines is described by (1), so the motion
of the associated vector passing through the optic centre and the vanishing point is also
described by (1). A vector moving in this way sweeps out a circular cone. It follows that
the locus of the vanishing point is given by the intersection of a circular cone and the
image plane, and therefore the locus of the vanishing point is a conic section. This is
illustrated in Fig. 4. An analogous argument can be used to show that, as the calibration
plane rotates, the vanishing lines generate an envelope which is a conic section.
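The cone-and-plane argument can be checked numerically. The sketch below (our construction, with an arbitrary rotation axis and an assumed unit focal length, not the paper's code) sweeps the direction l(θ) of (1), intersects each ray through the optic centre with the image plane, and verifies that the vanishing points all satisfy a single conic equation:

```python
import numpy as np

f = 1.0                      # focal length (assumed)
phi = 0.4                    # cone semi-angle, cos(phi) = l . k_r
# orthonormal frame i_r, j_r, k_r with k_r skewed away from the optic axis
k_r = np.array([0.3, 0.1, 1.0]); k_r /= np.linalg.norm(k_r)
i_r = np.cross(k_r, np.array([0.0, 0.0, 1.0])); i_r /= np.linalg.norm(i_r)
j_r = np.cross(k_r, i_r)

thetas = np.linspace(0, 2 * np.pi, 12, endpoint=False)
pts = []
for t in thetas:
    # direction of the parallel lines, eq. (1)
    l = np.sin(phi) * (np.cos(t) * i_r + np.sin(t) * j_r) + np.cos(phi) * k_r
    pts.append(f * l[:2] / l[2])          # vanishing point on the plane z = f
pts = np.array(pts)

# Fit a conic a x^2 + b xy + c y^2 + d x + e y + g = 0 to the first 5 points
x, y = pts[:, 0], pts[:, 1]
A = np.column_stack([x**2, x * y, y**2, x, y, np.ones_like(x)])
_, _, Vt = np.linalg.svd(A[:5])
conic = Vt[-1]
# the remaining 7 points lie on the same conic: residuals are ~ 0
print(np.abs(A[5:] @ conic).max())
```

Five of the points fix the conic (a conic has five degrees of freedom); the remaining points then lie on it up to rounding error.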
A mathematical analysis is omitted, but the main results are summarised.
1. Although the physical setup is actually capable of generating any type of conic, the
only conic sections considered in the remainder of the paper are ellipses. It can be
shown that the equation of the ellipse on the image plane can be written in closed form.
Fig. 4. This figure illustrates the locus of the vanishing point. The optic axis of the camera is
ẑ_c. The axis of the circular cone is parallel to the axis of rotation of the calibration plane, r̂.
The circular cone is swept out by a vector parallel to the direction of the parallel lines on the
calibration plane, l(θ). The vanishing point moves along the intersection of the cone and the
image plane. The semi-angle of the cone is φ, and the skew angle between the optic axis and
the axis of the cone is ψ. Note that the axis of the cone does not pass through the centre of the
ellipse.
2. For a particular configuration of the system, the vanishing points and vanishing lines
generate ellipses with collinear major axes, although with different minor axes and
eccentricities (depending on their associated φ value). The common major axis passes
through the principal point.
3. As shown in Fig. 4, the direction of the major axis is determined by the optic axis
and the axis of the cone (which is parallel to the axis of rotation of the calibration
plane). This direction is fixed for a particular configuration of the system, but can
be adjusted by moving the camera or the axis of rotation of the calibration plane.
4. The ellipses used in the actual experiments were generated from vanishing line en-
velopes rather than from vanishing points - a vanishing line is a best-fit line through
a number of vanishing points and should therefore be more resilient to noise.
4 Circular Cones and the Optic Centre
This section considers what information an ellipse on the image plane provides about the
position of the optic centre.
Figure 5 shows an ellipse on the image plane, which is either the locus of a vanishing
point, or the envelope of a set of vanishing lines. The discussion in Sect. 3 assumed a
known camera geometry and showed how vanishing point and vanishing line information
can give rise to an ellipse. Camera calibration is the inverse process - given an ellipse
generated from vanishing points or vanishing lines, determine the camera geometry.
Now the ellipse can be regarded as the intersection of the image plane with a circular
cone whose vertex is at the optic centre. Thus, the problem is to use the ellipse to
determine the set of all circular cones which could have given rise to it; more specifically,
it is the vertices of this set of cones which are of interest since the optic centre must
coincide with one of these vertices. It is shown below that the vertices lie on a hyperbola
in a plane perpendicular to the image plane, as illustrated in Fig. 5. Thus, a single ellipse
leaves the position of the optic centre underdetermined, since it can lie anywhere along
the hyperbola. Given two distinct ellipses, the optic centre lies at the intersection of the
two associated hyperbolae, and hence its position is uniquely determined (actually there
are four points of intersection, pairwise symmetric on either side of the image plane - the
incorrect solutions on the wrong side of the image plane are automatically eliminated,
and the other incorrect solution is easily eliminated because it lies far from the centre of
the image).
As shown in Fig. 5, a coordinate frame x̂_e, ŷ_e, ẑ_e is set up with the x̂_e ŷ_e plane coin-
cident with the image plane, the x̂_e axis aligned with the major axis of the ellipse, and
the origin at the centre of the ellipse.
Fig. 5. Note that this is a 3D diagram, with the ellipse on the x̂_e ŷ_e plane (the image plane),
and the hyperbola on the x̂_e ẑ_e plane. The set of circular cones which could give rise to the
ellipse have their vertices on the hyperbola. As described in the text, possible positions of the
optic centre lie on this same hyperbola.
Points x on the surface of a cone with vertex v, axis n̂, and semi-angle φ are described
by

|x - v|² cos²φ - (n̂ · (x - v))² = 0    (2)

By expressing the intersection of the cone and the image plane in the canonical form
for an ellipse, which is

x²/a² + y²/b² = 1    (3)

it can be shown that

v_x²/(a² - b²) - v_z²/b² = 1    (4)

Thus, the cone vertices v lie on a hyperbola in the x̂_e ẑ_e plane. When v_z = 0, v_x =
±√(a² - b²). These are the foci of the ellipse, so the hyperbola goes through the ellipse foci.
Equation (4) will be used in the computation of focal length.
5 Camera Calibration
The determination of aspect ratio relies on the following observation - given a set of
ellipses, all the major axes are concurrent at one point (the principal point). If the aspect
ratio is not correct, however, the ellipses are distorted and the major axes will not be
concurrent. Thus, adjusting the aspect ratio until the major axes are concurrent serves
to identify the true aspect ratio. It is evident that this method requires a minimum of
three ellipses with different major axes in order to carry out the concurrency test.
The experimental procedure is as follows. The system is set up so that the vanishing
line envelope obtained from an image sequence will be an ellipse (the type of conic
produced depends on the setting of the φ and ψ angles defined in Sect. 3.2). An image
sequence is taken as the calibration plane is rotated from 0 to 2π. The envelope of the
vanishing lines measured during the sequence is used to determine an ellipse.
The system configuration is then adjusted so that the major axis of the vanishing line
envelope lies in a different direction (the direction of the major axis is determined by
the skew between the optic axis of the camera and the axis of rotation of the calibration
plane as described in Sect. 3.2). A new image sequence is taken, and again the envelope
of the vanishing lines is used to determine an ellipse.
This is repeated at least three times to produce the minimum of three ellipses required
for the aspect ratio determination. Starting from an initial estimate, the aspect ratio is
iteratively adjusted until the major axes of the ellipses are at their closest approach to
concurrency - at this point, the value of the aspect ratio is recorded as the true aspect
ratio. Concurrency is measured using a weighted least-squares test given in [KagP].
We emphasise here that no accurate positioning is required when setting the camera
and calibration plane using the guidelines above, and rough human estimation of position
is always sufficient.
Once the aspect ratio has been corrected, all ellipses have major axes which are concurrent
at the principal point. Thus, given two or more ellipses, it is possible to compute the
principal point.
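The least-squares intersection can be sketched as follows (an illustration with hypothetical axis lines and uniform weights; the paper's weighted concurrency test is not reproduced). Each major axis is a line a_i x + b_i y + c_i = 0; with (a_i, b_i) normalised, the residual at a point is its perpendicular distance to the line:

```python
import numpy as np

def intersect_lines(lines, weights=None):
    # Weighted least-squares point of concurrency for lines (a, b, c):
    # minimise sum of w_i * (perpendicular distance to line i)^2.
    lines = np.asarray(lines, float)
    lines /= np.linalg.norm(lines[:, :2], axis=1, keepdims=True)
    w = np.ones(len(lines)) if weights is None else np.asarray(weights, float)
    A = lines[:, :2] * w[:, None]
    b = -lines[:, 2] * w
    point, *_ = np.linalg.lstsq(A, b, rcond=None)
    return point

# Three hypothetical major axes that nearly meet at (256, 256):
axes = [(1.0, 0.0, -256.0),        # x = 256
        (0.0, 1.0, -256.0),        # y = 256
        (1.0, -1.0, 0.5)]          # x - y + 0.5 = 0, slightly off
print(intersect_lines(axes))       # close to [256, 256]
```

With noisy axes the returned point is the one whose summed squared perpendicular distances to all axes is minimal, which is the concurrency measure the text describes.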
5.3 Determination of Focal Length
Once the principal point is known, the focal length can be found using a single ellipse.
Focal length is computed using (4) from Sect. 4,

v_x²/(a² - b²) - v_z²/b² = 1

where v_x is the offset of the centre of the ellipse from the principal point and is known; a and
b are obtained directly from the ellipse; v_z is the focal length and is the only unknown.
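Solving (4) for v_z is a one-line rearrangement; the sketch below uses hypothetical numbers, not the paper's measurements:

```python
import math

def focal_length(a, b, v_x):
    # Rearranging eq. (4): v_z = b * sqrt(v_x^2 / (a^2 - b^2) - 1)
    assert a > b, "major semi-axis must exceed minor semi-axis"
    t = v_x**2 / (a**2 - b**2) - 1.0
    assert t > 0, "offset too small: no real solution for v_z"
    return b * math.sqrt(t)

# Hypothetical ellipse parameters in pixels:
print(focal_length(a=400.0, b=300.0, v_x=700.0))
```

The two assertions encode when (4) has a real solution: the ellipse centre must be offset from the principal point by more than the focal distance √(a² − b²).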
Six image sequences were taken, with varying system geometry as described in Sect. 5.1.
The camera-calibration plane distance was about 80cm. There were 36 frames per image
sequence and the calibration plane, which was mounted on a rotating table, was rotated
about 10° between each frame. A typical image is shown in Fig. 6.
Edges were found using a local implementation of the Canny edge detector which
works to sub-pixel accuracy. The aspect ratio of the edge map was updated using a value
which was close to the true aspect ratio, and the edgels were corrected for radial distortion
using a priori knowledge of the radial distortion parameters [Ph80], [Be91].³ Best-fit lines
were found for the 8 horizontal and 8 vertical lines available from the calibration grid, and
64 vertex positions were generated. Vanishing points were found in 16 different directions
available from the vertices (0°, 90°, ±45°, etc.), and the vanishing line was determined
from the vanishing points.
At the end of the sequence, an ellipse was fitted to the envelope of the vanishing lines
using the Bookstein algorithm [Bo79] - the vanishing lines were represented in homoge-
neous coordinates and a line conic was determined. A point conic was then determined
by inverting the matrix for the line conic [SK52].
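The envelope-fitting step can be sketched as follows (our construction, tested on the tangent lines of a unit circle rather than on real vanishing-line data). Each line l tangent to the conic satisfies lᵀDl = 0 for the dual (line) conic D, which is linear in the six entries of D; the point conic is then C ~ D⁻¹:

```python
import numpy as np

def fit_line_conic(lines):
    # l^T D l = 0 is linear in the 6 distinct entries of symmetric D
    L = np.asarray(lines, float)
    M = np.column_stack([L[:, 0]**2, 2*L[:, 0]*L[:, 1], L[:, 1]**2,
                         2*L[:, 0]*L[:, 2], 2*L[:, 1]*L[:, 2], L[:, 2]**2])
    _, _, Vt = np.linalg.svd(M)
    a, b, c, d, e, f = Vt[-1]          # least-squares null vector
    return np.array([[a, b, d], [b, c, e], [d, e, f]])

# Tangent lines of the unit circle x^2 + y^2 = 1: (cos t, sin t, -1)
ts = np.linspace(0, np.pi, 10, endpoint=False)
lines = np.column_stack([np.cos(ts), np.sin(ts), -np.ones_like(ts)])
D = fit_line_conic(lines)
C = np.linalg.inv(D)                   # point conic, up to scale
C /= C[2, 2]
print(np.round(C, 6))                  # diag(-1, -1, 1): the unit circle, up to sign
```

Least-squares fitting of the null vector via the SVD is one standard choice; as the text notes below, Bookstein's algebraic fit (and its behaviour under anisotropic noise) is a separate question.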
Figure 7 shows the vanishing line envelopes for two example sequences. Figure 8 shows
the major axes of ellipses fitted to the vanishing line envelopes of six sequences, after
the aspect ratio has been corrected to bring them as close as possible to concurrency.
The values determined for aspect ratio, principal point, and focal length are given in
table 1. The focal lengths were converted from pixels into millimetres using the pixel size
provided in the camera specification.
The computed aspect ratio is in good agreement with the value obtainable from the
camera and digitisation equipment specifications. The worst-case error for the perpen-
dicular distances between the major axes and the computed principal point is about
12 pixels as shown in Fig. 8. The presence of this error is the subject of current in-
vestigation - Kanatani [Ka92] has pointed out that the Bookstein algorithm for conic
fitting is inappropriate when there is anisotropic error in the data points. It is the case
that there is anisotropic error in vanishing points and vanishing lines, and Kanatani's
own algorithm for conic fitting is to be investigated. The standard deviation of the focal
length indicates a 2% error - whether this is satisfactory or not depends on the particular
application making use of the calibration information.
We are currently working on an error analysis. The calibration method is a linear
process, so it should be possible to work through from the errors present in the Canny
edge detection to an estimate of the variance in the output parameters. We hope that
the analysis will show that the use of data from many frames causes cancelling of errors.
³ Accurate radial distortion correction can be applied based on only approximate estimates of
the aspect ratio and principal point.
Fig. 7. This figure shows the envelope of vanishing lines for two example sequences. Axes are
labelled with image plane coordinates in pixels. The aspect ratio has been corrected to an initial
estimate of 1.51.
Fig. 8. This figure shows the closest approach to concurrency of the major axes of the ellipses,
obtained by adjusting the aspect ratio. Axes are labelled with image plane coordinates in pixels.
The least-squares estimation of the point of concurrency gives the principal point. The outliers
are discussed in the text.
7 Conclusion
The most significant characteristic of the calibration method is that intrinsic parameters
are computed in a sequential manner - first, the aspect ratio is found using a concurrency
test which does not require knowledge of the other parameters; the principal point is then
computed as the intersection point of a set of lines, a graphical computation which does
not require the focal length; finally, the focal length is determined using the computed
principal point together with the ellipse parameters. This breakdown of the processing
gives rise to individual stages which are computationally simple and easy to implement,
but it could cause problems through errors in one stage being passed on to subsequent
stages.
The method makes use of image sequences, but the large amounts of data which this
involves are condensed to a single vanishing line from each image. The potential benefits
of using many images are currently the subject of an error analysis.

Table 1. Results for the measurement of aspect ratio, principal point and focal length.
References

[Be91] P.A. Beardsley et al. The correction of radial distortion in images. Technical report
1896/91, Department of Engineering Science, University of Oxford.
[Bo79] F.L. Bookstein. Fitting conic sections to scattered data. Computer Graphics and Image
Processing, pages 56-91, 1979.
[CT90] B. Caprile and V. Torre. Using vanishing points for camera calibration. International
Journal of Computer Vision, pages 127-140, 1990.
[FT87] O.D. Faugeras and G. Toscani. The calibration problem for stereo. In Proc. of IEEE
Conf. Computer Vision and Pattern Recognition, Miami, 1987.
[Ga84] S. Ganapathy. Decomposition of transformation matrices for robot vision. In Proc. of
IEEE Conference on Robotics, pages 130-139, 1984.
[Ka92] K. Kanatani. Geometric Computation for Machine Vision. Oxford University Press,
due for publication 1992/3.
[LT88] R.K. Lenz and R.Y. Tsai. Techniques for calibration of the scale factor and image center
for high accuracy 3-D machine vision metrology. IEEE Transactions on Pattern Analysis
and Machine Intelligence, pages 713-720, 1988.
[Ph80] Manual of Photogrammetry. American Society of Photogrammetry, 1980.
[Pu90] P. Puget and T. Skordas. An optimal solution for mobile camera calibration. In Proc.
First European Conf. Computer Vision, pages 187-188, 1990.
[SK52] J.G. Semple and G.T. Kneebone. Algebraic Projective Geometry. Oxford University
Press, 1952.
[Ts86] R.Y. Tsai. An efficient and accurate camera calibration technique for 3D machine vision.
In Proc. of IEEE Conf. Computer Vision, pages 364-374, 1986.
This article was processed using the LaTeX macro package with ECCV92 style
Camera Self-Calibration: Theory and Experiments
1 Introduction
Camera calibration is an important task in computer vision. The purpose of camera
calibration is to establish the projection from the 3D world coordinates to the 2D image
coordinates. Once this projection is known, 3D information can be inferred from 2D
information, and vice versa. Thus camera calibration is a prerequisite for any application
where the relation between 2D images and the 3D world is needed. The camera model
considered is the one most widely used. It is the pinhole: the camera is assumed to
perform a perfect perspective transformation. Let [su, sv, s] be the image coordinates,
where s is a non-zero scale factor. The equation of the projection is
[su, sv, s]ᵀ = A [I₃ | 0] G [x, y, z, 1]ᵀ = M [x, y, z, 1]ᵀ    (1)

where [x, y, z, 1] are the projective coordinates of a scene point and G is a 4×4 displacement
matrix accounting for camera position and orientation. Projective coordinates are used in (1)
for the image plane and for 3D space. The matrix M is the perspective transformation
matrix, which relates 3D world coordinates and 2D image coordinates. The matrix G
depends on six parameters, called extrinsic: three defining a rotation of the camera and
three defining a translation of the camera. The matrix A depends on a variable number of
parameters, according to the sophistication of the camera model. These are the intrinsic
parameters. There are five of them in the model used here. It is the intrinsic parameters
that are to be computed.
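Equation (1) can be sketched directly. The five-parameter A below (two focal lengths, a skew term and the principal point) is one common parametrisation; the paper's exact form of A is not reproduced here, and the numbers are hypothetical:

```python
import numpy as np

def project(A, G, X):
    # Eq. (1): s[u, v, 1]^T = A [I3 | 0] G [x, y, z, 1]^T = M [x, y, z, 1]^T
    P = np.hstack([np.eye(3), np.zeros((3, 1))])   # [I3 | 0]
    M = A @ P @ G                                  # 3x4 perspective matrix
    su, sv, s = M @ np.append(X, 1.0)
    return np.array([su / s, sv / s])              # image coordinates

A = np.array([[800.0,   0.5, 320.0],   # au, skew, u0 (assumed values)
              [  0.0, 810.0, 240.0],   # av, v0
              [  0.0,   0.0,   1.0]])
G = np.eye(4)                          # camera at the world origin, no rotation
print(project(A, G, np.array([0.1, -0.2, 2.0])))
```

The division by s at the end is the perspective step; all of the method's intrinsic parameters live in A, while G carries the six extrinsic ones.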
In the usual method of calibration [4] [12] a special object is put in the field of view
of the camera. The 3D shape of the calibration object is known, in other words the 3D
coordinates of some reference points on it are known in a coordinate system attached to
the object. Usually the calibration object is a flat plate with a regular pattern marked
on it. The pattern is chosen such that the image coordinates of the projected reference
points (for example, corners) can be measured with great accuracy. Using a great number
of points, each one yielding an equation of the form (1), the perspective transformation
matrix M can be estimated. This method is widely used. It yields a very accurate de-
termination of the camera parameters, provided the calibration pattern is carefully set.
The drawback of the method is that in many applications a calibration pattern is not
available. Another drawback is that it is not possible to calibrate on-line when the cam-
era is already involved in a visual task. However, even when the camera performs a task,
the intrinsic parameters can change intentionally (for example adjustment of the focal
length), or not (for example mechanical or thermal variations).
The problem of calibrating the extrinsic parameters on-line has already been ad-
dressed [11]. The goal of this paper is to present a calibration method that can be carried
out using the same images required for performing the visual task. The method applies
when the camera undergoes a series of displacements in a rigid scene. The only require-
ment is that the machine vision system is capable of establishing correspondences between points
in different images, in other words it can identify pairs of points, one from each image,
that are projections of the same point in the scene.
Many methods for obtaining matching pairs of points in two images are described
in the literature. For example, points of interest can be obtained by corner and vertex
detection [1] [5]. Matching is then done by correlation techniques or by a tracking method
such as the one described in [2].
A brief introduction is given to the theory underlying the calibration method. A longer
and more detailed account is given in [9].
Kruppa's equations link the epipolar transformation to the image ω of the absolute conic
Ω. The conic ω determines the camera calibration, thus the equations provide a way of
deducing the camera calibration from the epipolar transformations associated with a
sequence of camera motions. Three epipolar transformations, arising from three different
camera motions, are enough to determine ω and hence the camera calibration uniquely.
The absolute conic is a particular conic in the plane at infinity. It is closely associated
with the Euclidean properties of space. The conic Ω is invariant under rigid motions and
under uniform changes of scale. In a Cartesian coordinate system [x1, x2, x3, x4] for the
projective space P³ the equations of Ω are

x4 = 0,    x1² + x2² + x3² = 0

The invariance of Ω under rigid motions ensures that ω is independent of the position
and orientation of the camera. The conic ω = M(Ω) thus depends only on the matrix A
of intrinsic parameters. The converse is also true [9] in that ω determines the intrinsic
parameters.
Let the camera undergo a finite displacement and let k be the line joining the optical
centre of the camera prior to the motion to the optical centre of the camera after the
motion. Let p and p' be the epipoles associated with the displacement. The epipole p
is the projection of k into the first image and p' is the projection of k into the second
image. Let Π be a plane containing k. Then Π projects to lines l and l' in the first and
second images respectively. The epipolar transformation defines a homography from the
lines through p to the lines through p' such that l and l' correspond. The symbol ↔ is
used for the homographic correspondence, l ↔ l'.
If Π is tangent to Ω then l is tangent to ω and l' is tangent to the projection ω' of Ω
into the second image. The conic ω is independent of the camera position, thus ω = ω'. It
follows that the two tangents to ω from p correspond under the epipolar transformation
to the two tangents to ω from p'. The condition that the epipolar lines tangent to ω
correspond gives two constraints linking the epipolar transformation with ω. Kruppa's
equations are an algebraic version of these constraints.
Projective coordinates [y1, y2, y3] are chosen in the first image. Two triples of coordi-
nates [y1, y2, y3] and [u1, u2, u3] specify the same image point if and only if there exists
a non-zero scale factor s such that yi = sui for i = 1, 2, 3. The epipolar lines are param-
eterised by taking the intersection of each line with the fixed line y3 = 0. Let (p, y) be
the line through the two points p and y. A general point x is on (p, y) if and only if
(p × y) · x = 0. Let D be the matrix of the dual conic to ω. It follows from the definition
of D that (p, y) is tangent to ω if and only if it lies on the dual conic,

(p × y)ᵀ D (p × y) = 0    (2)

The entries of D are defined to agree in part with the notation of Kruppa,

    | -δ23   δ3    δ2  |
D = |  δ3   -δ13   δ1  |    (3)
    |  δ2    δ1   -δ12 |

There are six parameters δi, δij in (3), but D is determined by ω only up to a scale factor.
After taking the scale factor into account D has five degrees of freedom. On setting y3 = 0
and on using (3) to substitute for D in (2) it follows that

A11 y1² + 2A12 y1y2 + A22 y2² = 0    (4)

where the coefficients A11, A12, A22 are defined by

A11 = -δ12 p2² - 2δ1 p2p3 - δ13 p3²
A12 = δ12 p1p2 + δ1 p1p3 + δ2 p2p3 - δ3 p3²    (5)
A22 = -δ12 p1² - 2δ2 p1p3 - δ23 p3²

An equation similar to (4) is obtained from the condition that the epipolar line (p', y')
in the second image is tangent to ω,

A'11 y'1² + 2A'12 y'1y'2 + A'22 y'2² = 0    (6)

The coefficients A'11, A'12, A'22 are obtained from (5) on replacing the coordinates pi of
p with the coordinates p'i of p'. The coordinate y'3 is set equal to zero.
The epipolar transformation induces a bilinear transformation N from the line y3 = 0
to the line y'3 = 0. If y = [y1, y2, 0]ᵀ and y' = [y'1, y'2, 0]ᵀ then (p, y) ↔ (p', y') if and only
if y' = Ny. Let τ = y2/y1, τ' = y'2/y'1. Then the transformation N is equivalent to

τ' = (aτ + b) / (cτ + d)    (7)

The parameters a, b, c, d can be easily computed up to a scale factor from the two epipoles
p, p' and a set of point matches qi ↔ q'i, 1 ≤ i ≤ n. A linear least squares procedure
based on (7) is used. The ith image correspondence gives an equation (7) with

τi = (p3qi2 - p2qi3) / (p3qi1 - p1qi3),    τ'i = (p'3q'i2 - p'2q'i3) / (p'3q'i1 - p'1q'i3)    (8)
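The linear least-squares step for (7) can be sketched as follows (our illustration; the function name and the synthetic coefficient values are assumptions). Each pair (τi, τ'i) gives one homogeneous linear equation a·τi + b − c·τi·τ'i − d·τ'i = 0 in (a, b, c, d), and the SVD null vector gives the coefficients up to scale:

```python
import numpy as np

def fit_homography_1d(tau, tau_p):
    # Rows of the homogeneous system a*t + b - c*t*t' - d*t' = 0
    M = np.column_stack([tau, np.ones_like(tau), -tau * tau_p, -tau_p])
    _, _, Vt = np.linalg.svd(M)
    return Vt[-1]                      # (a, b, c, d) up to scale

# Synthetic check with known (hypothetical) coefficients:
a, b, c, d = 2.0, -1.0, 0.5, 3.0
tau = np.linspace(-2, 2, 9)
tau_p = (a * tau + b) / (c * tau + d)
est = fit_homography_1d(tau, tau_p)
print(np.round(est / est[0] * a, 6))   # recovers (2, -1, 0.5, 3)
```

With noisy cross-ratios the SVD null vector is the usual total-least-squares choice for a homogeneous system of this kind.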
Once a, b, c, d have been found, (4), (6) and (7) yield

A11 + 2A12 τ + A22 τ² = 0
A'11 (cτ + d)² + 2A'12 (aτ + b)(cτ + d) + A'22 (aτ + b)² = 0    (9)

Each equation (9) is quadratic in τ. The two equations have the same roots, namely
the values of τ for which (p, y) is tangent to ω. It follows that one equation is a scalar
multiple of the other. Kruppa's equations are obtained by equating ratios of coefficients.
Two camera motions yield two epipolar transformations and hence four constraints on
the image ω of the absolute conic. The conic ω depends on five parameters, thus the
conics compatible with the four constraints form a one dimensional family c. The family
c is an algebraic curve which parameterises the camera calibrations compatible with the
two epipolar transformations.
An algebraic curve can be mapped from one projective space to another using trans-
formations defined by polynomials. A linear transformation is a special case in which the
defining polynomials have degree one. One approach to the theory of algebraic curves
is to regard each transformed curve as a different representation of the same underlying
algebraic object. For example a conic, a plane cubic with a node and a cubic space curve
can all be obtained by applying polynomial transformations to the projective line P¹.
Each curve is a different representation of P¹, even though the three curves appear to
be very different.
The properties of c are obtained in [9]. It is shown that c can be represented as an
algebraic curve of degree seven in P³ or alternatively as an algebraic curve of degree six
in P². The representation of c as a curve in P² is obtained as follows.
Let p1, p'1 be the two epipoles for the first motion of the camera. The epipolar
transformation is defined by the Steiner conic s1 through p1 and p'1; two epipolar lines
(p1, y) and (p'1, y) correspond if and only if y is a point of s1. The two tangents from p1
to ω cut s1 at points x1, x2 as illustrated in Fig. 1. The chord (x1, x2) of s1 corresponds
to the point x1 × x2 in the dual of the image plane. The point x1 × x2 lies on a curve
g which is an algebraic transformation of c. It is shown in [9] that g is of degree six and
genus four. The point p1 × p'1 of g corresponding to the line (p1, p'1) in the image plane
is a singular point of multiplicity three. The curve g has three additional singular points,
each of order two. An algorithm for obtaining these three singular points is described
in [9]. The algorithm produces a cubic polynomial equation in one variable, the roots of
which yield the three singular points.
Three camera displacements yield six conditions on the camera calibration. This is
enough to determine the camera calibration uniquely.
3.1 Sturm's Method
The epipoles and the epipolar transformations can be computed by a method due to
Hesse [6] and nicely summarized by Sturm in [10]. Sturm's method yields the epipoles
compatible with seven image correspondences.
Let qi ↔ q'i, 1 ≤ i ≤ n, be a set of image correspondences. Then p, p' are possible
epipoles if and only if corresponding epipolar lines match,

(p, qi) ↔ (p', q'i),  1 ≤ i ≤ n    (10)

The cross ratio τ of the four lines (p, qj), 1 ≤ j ≤ 4, can be computed as

τ = [((p × q1) · q3)((p × q2) · q4)] / [((p × q2) · q3)((p × q1) · q4)]    (11)

It follows from (10) that τ is equal to the cross ratio of the lines (p', q'j), 1 ≤ j ≤ 4, in
the second image. On equating the two cross ratios the following criterion is obtained,

C = Σi (τi - τ'i)² / (σ²τi + σ²τ'i)    (12)

where τi is the cross ratio of the lines (p, qij), 1 ≤ j ≤ 4, given by the formula (11), τ'i is
the same with primes, and where σ²τi and σ²τ'i are the first order variances on τi and τ'i.
The notation qij for 1 ≤ j ≤ 4 indicates a subsequence of the qi. If the noise distribution
is the same for all image points qi, q'i then σ²τi = σ²pixel ||grad(τi)||², where grad(τi) is
an eight dimensional gradient computed with respect to the qij for 1 ≤ j ≤ 4. The effect
of using the uncertainty in the criterion is that pairs of cross-ratios with large variances
will contribute little, whereas others will contribute more.
The problem is that non-linear minimisation techniques are needed. The results of
non-linear minimisation are often very dependent on the starting point. Another difficulty
is that the position of the minimum is quite sensitive to noise, as will be seen in the
experimental section below.
3.2 The Fundamental Matrix Method
The fundamental matrix F is a generalization of the essential matrix described in [8]. For
a given point m in the first image, the corresponding epipolar line e_m in the second image
is linearly related to the projective representation of m. The 3×3 matrix F describes
this correspondence. The projective representation e_m of the epipolar line is given by

e_m = F m

Since the point m' corresponding to m belongs to the line e_m by definition, it follows
that

m'ᵀ F m = 0    (13)
If the image is formed by projection onto the unit sphere then F is the product of an
orthogonal matrix and an antisymmetric matrix. It is then an essential matrix and (13)
is the so-called Longuet-Higgins equation in motion analysis [8]. If the image is formed by
a general projection, as described in (1), then F is of rank two. The matrix A of intrinsic
parameters (1) transforms the image to the image that would have been obtained by
projection onto the unit sphere. It follows that F = A⁻ᵀ E A⁻¹, where E is an essential
projection onto the unit sphere. It follows that F = A - I T E A -1, where E is an essential
matrix. Unlike the essential matrix, which is characterized by the two constraints found
by Huang and Faugeras [7] which are the nullity of the determinant and the equality of
the two non-zero singular values, the only property of the fundamental matrix is that it
is of rank two. As it is also defined only up to a scale factor, the number of independent
coefficients of F is 7. The essential matrix E is subject to two independent polynomial
constraints in addition to the constraint det(E) = 0. If F is known then it follows from
E = ATFA that the entries of A are subject to two independent polynomial constraints
inherited from E. These are precisely the Kruppa equations. It has also been shown, using
the fundamental matrix, that the Kruppa equations are equivalent to the constraint that
the two non-zero singular values of an essential matrix are equal.
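The two Huang-Faugeras constraints are easy to verify numerically. The sketch below (our construction, with arbitrary motion parameters) builds an essential matrix E = [t]ₓR from a rotation and a translation and checks that one singular value is zero and the other two are equal:

```python
import numpy as np

def skew(t):
    # The antisymmetric matrix [t]_x such that [t]_x v = t x v
    return np.array([[0, -t[2], t[1]],
                     [t[2], 0, -t[0]],
                     [-t[1], t[0], 0]])

theta = 0.3
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0, 0.0, 1.0]])            # rotation about the z axis
t = np.array([0.5, -1.0, 2.0])             # translation
E = skew(t) @ R                            # essential matrix

s = np.linalg.svd(E, compute_uv=False)
print(np.round(s, 6))                      # two equal values |t|, then 0
```

The singular values of [t]ₓ are (|t|, |t|, 0) and multiplying by the orthogonal R leaves them unchanged, which is exactly why the determinant vanishes and the two non-zero singular values coincide.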
The importance of the fundamental matrix has been neglected in the literature, as
almost all the work on motion has been done under the assumption that intrinsic pa-
rameters are known. But if one wants to proceed only from image measurements, the
fundamental matrix is the key concept, as it contains all the geometrical information
relating two different images. To illustrate this, it is shown that the fundamental matrix
determines and is determined by the epipolar transformation. The positions of the two
epipoles and any three of the correspondences l ↔ l' between epipolar lines together de-
termine the epipolar transformation. It follows that the epipolar transformation depends
on seven independent parameters. On identifying the equation (13) with the constraint
on epipolar lines obtained by making the substitutions (8) in (7), expressions are ob-
tained for the coefficients of F in terms of the parameters describing the epipoles and
the homography:
F11 = b p3p'3
F12 = a p3p'3
F13 = -a p2p'3 - b p1p'3
F21 = -d p3p'3
F22 = -c p3p'3
F23 = c p2p'3 + d p1p'3
F31 = d p3p'2 - b p3p'1
F32 = c p3p'2 - a p3p'1
F33 = -c p2p'2 - d p1p'2 + a p2p'1 + b p1p'1    (14)
From these relations, it is easy to see that F is defined only up to a scale factor. Let
c1, c2, c3 be the columns of F. It follows from (14) that p1c1 + p2c2 + p3c3 = 0. The
rank of F is thus at most two. The equations (14) yield the epipolar transformation as
a function of the fundamental matrix:

a = F12
b = F11
c = -F22
d = -F21
p1 = p3 (F23F12 - F22F13) / (F22F11 - F21F12)
p2 = p3 (F13F21 - F11F23) / (F22F11 - F21F12)
p'1 = p'3 (F32F21 - F22F31) / (F22F11 - F21F12)
p'2 = p'3 (F31F12 - F11F32) / (F22F11 - F21F12)    (15)

The determinant of the homography is F22F11 - F21F12. In the case of finite epipoles, it
is not null.
A first method to estimate the fundamental matrix takes advantage of the fact that
equation (13) is linear and homogeneous in the nine unknown coefficients of F. Thus if
eight matches are given then in general F is determined up to a scale factor. In practice,
many more than eight matches are given. A linear least squares method is then used to
solve for F. As there is no guarantee, when noise is present, that the matrix F obtained is
exactly a fundamental one, the formulas (15) cannot be used, and p has to be determined
by solving the classical constrained minimization problem

min ||Fp||²  subject to  ||p||² = 1

A better criterion minimizes the distances between the points and the epipolar lines
predicted by F,

min Σi [ d(q'i, Fqi)² + d(qi, Fᵀq'i)² ]

where d is a distance in the image plane. The criterion has a better physical significance
in terms of image quantities. It is necessary to minimize on F and on Fᵀ simultaneously.
The solution must be of rank two, as all fundamental matrices have this prop-
erty. Rather than performing a constrained minimization with the cubic constraint
det(F) = 0, it is possible to use, almost without loss of generality, the following
representation for F proposed by Luc Robert:

    |      x1            x2            x3      |
F = |      x4            x5            x6      |
    | x7x1 + x8x4   x7x2 + x8x5   x7x3 + x8x6 |

The third row is a linear combination of the first two, so any F of this form automatically
has rank at most two.
This second method for computing the fundamental matrix is more complicated, as it
involves non-linear minimizations. However, it yields more precise results and allows the
direct use of the formulas (15) to obtain the epipolar transformation.
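The first, linear method can be sketched as follows (our illustration; the SVD-based projection to rank two is one common way to enforce the rank constraint, not necessarily the implementation used here). Each match (q, q') gives one row of a homogeneous system in the nine entries of F via (13):

```python
import numpy as np

def estimate_F(q, qp):
    # One row per match: q'^T F q = 0 is linear in the nine entries of F.
    rows = [np.outer(b, a).ravel() for a, b in zip(q, qp)]
    _, _, Vt = np.linalg.svd(np.asarray(rows))
    F = Vt[-1].reshape(3, 3)              # least-squares null vector
    U, s, Vt = np.linalg.svd(F)
    s[2] = 0.0                            # enforce the rank-two property
    return U @ np.diag(s) @ Vt

# Synthetic check: exact matches built from a known rank-two matrix.
rng = np.random.default_rng(0)
F_true = np.array([[0., -1., 2.], [1., 0., -3.], [-2., 3., 0.]])
q = np.column_stack([rng.uniform(-1, 1, 20),
                     rng.uniform(-1, 1, 20), np.ones(20)])
# any q' on the epipolar line F q of q satisfies (13)
qp = np.array([np.cross(F_true @ qi, rng.normal(size=3)) for qi in q])
F_est = estimate_F(q, qp)
F_est /= np.linalg.norm(F_est)
F_ref = F_true / np.linalg.norm(F_true)
err = min(np.linalg.norm(F_est - F_ref), np.linalg.norm(F_est + F_ref))
print(err)
```

With exact matches the recovered F equals the true one up to sign and scale; with noisy matches the linear estimate is only a starting point, as the text explains.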
Symbolic methods for solving Kruppa's equations are described in [9]. These methods
are very sensitive to noise: even ordinary machine precision is not sufficient. Also they
require rational numbers rather than real numbers. In this section Kruppa's equations
are solved by an alternative method which is suitable for real world use. The current
implementation is as follows.
Three displacements yield six equations in the entries of the matrix D defined in Sect.
2.1. The equations are homogeneous so the solution for D is determined only up to a scale
factor. In effect there are five unknowns. Trying to solve the over-determined problem
with numerical methods usually fails, so five equations are picked from the six and solved
first. As the equations are each of degree two, the number of solutions in the general case
is 32. The remaining equation is used to discard the spurious solutions. In addition to the
six equations, the entries of D satisfy certain inequalities that are discussed later. These
are also useful for ruling out spurious solutions. The problem is that solving a polynomial
system by providing an initial guess and using an iterative numerical method will not
generally give all the solutions: many of the start points will yield trajectories that do
not converge and many other trajectories will converge to the same solution. However it
is not acceptable to miss solutions, as there is only one good one amongst the 32.
Recently developed methods in numerical continuation can reliably compute all so-
lutions to polynomial systems. These methods have been improved over a decade to
provide reliable solutions to kinematics problems. The details of these improvements are
omitted. The interested reader is referred to [13] for a tutorial presentation. The solu-
tion of a system of nonlinear equations by numerical continuation is suggested by the
idea that small changes in the parameters of the system usually produce small changes
in the solutions. Suppose the solutions to problem A (the start system) are known and
solutions to problem B (the target system) are required. Solutions to the problem are
tracked as the parameters of the system are slowly changed from those of A to those
of B. Although for a general nonlinear system numerous difficulties can arise, such as
divergence or bifurcation of a solution path, for a polynomial system all such difficulties
can be avoided.
Start System. There are three criteria that guide the choice of a start system: all of
its solutions must be known, each solution must be non-singular, and the system must
have the same homogeneous structure as the target system. The use of m-homogeneous
systems reduces the computational load by eliminating some solutions at infinity, so it is
useful to homogenize, but only inhomogeneous systems are discussed here for the sake of
simplicity. Thus an acceptable start system is: x_j^{d_j} - 1 = 0 for 1 ≤ j ≤ n, where n is the
number of equations and d_j is the degree of equation j of the target system. Each
equation yields d_j distinct solutions for x_j (the d_j-th roots of unity), and the entire set of
∏_{j=1}^{n} d_j solutions is found by taking all possible combinations of these.
Homotopy. The requirement for the choice of the homotopy (the schedule for transform-
ing the start system into the target system) is that as the transformation proceeds there
should be a constant number of solutions which trace out smooth paths and which are
always nonsingular until the target system is reached. It has been shown by years of
practice that the following homotopy suffices:

H(x, t) = (1 - t) G(x) + t F(x) ,   t ∈ [0, 1]

where G(x) is the start system and F(x) is the target system.
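The start-system and homotopy ideas can be sketched on a toy univariate example (our own illustration, not the paper's implementation; production continuation codes such as those surveyed in [13] also multiply G(x) by a random complex constant and use adaptive step control, both omitted here). The tracker steps t from 0 to 1 and re-converges with Newton's method at each step:

```python
def track_root(g, gp, f, fp, x0, steps=200, newton_iters=5):
    """Track one solution of the homotopy H(x, t) = (1 - t) g(x) + t f(x)
    from t = 0 (start system g) to t = 1 (target system f): step t forward
    and correct with Newton's method at each step. gp, fp are derivatives."""
    x = complex(x0)
    for k in range(1, steps + 1):
        t = k / steps
        for _ in range(newton_iters):
            h = (1 - t) * g(x) + t * f(x)
            hp = (1 - t) * gp(x) + t * fp(x)
            x -= h / hp
    return x

# Start system x**2 - 1 = 0 with known roots +1 and -1 (roots of unity);
# target system x**2 - 2 = 0.
g, gp = (lambda x: x * x - 1), (lambda x: 2 * x)
f, fp = (lambda x: x * x - 2), (lambda x: 2 * x)
roots = [track_root(g, gp, f, fp, s) for s in (1.0, -1.0)]
# roots are close to +sqrt(2) and -sqrt(2)
```

Tracking every start solution in this way is what guarantees that all 32 solutions of the five picked Kruppa equations are found, rather than only those reachable from a lucky initial guess.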
In this section the relation between the image of the absolute conic and the intrinsic
parameters is given in detail. The most general matrix A occurring in (1) can be written:

        ( -f k_u   -f k_u cot(θ)     u_0 )
    A = (    0     -f k_v cosec(θ)   v_0 )        (17)
        (    0         0              1  )
- k_u and k_v are the horizontal and vertical scale factors, whose inverses characterize the
size of the pixel in world coordinate units.
- u_0 and v_0 are the image center coordinates, resulting from the intersection between
the optical axis and the image plane.
- f is the focal length.
- θ is the angle between the directions of the retinal axes. This parameter is introduced to
account for the fact that the pixel grid may not be exactly orthogonal. In practice θ
is very close to π/2.
As f cannot be separated from k_u and k_v, it is convenient to define the products
α_u = -f k_u and α_v = -f k_v. This gives five intrinsic parameters, which is exactly the
number of independent coefficients of the image ω of the absolute conic; thus the intrinsic
parameters can be obtained from ω. The equation of ω is [3]:

y^T A^{-T} A^{-1} y = 0
It follows that D = A A^T. Up to a scale factor the entries δ_ij and δ_i of D are related to
the intrinsic parameters by

δ_1 = -v_0
δ_2 = -u_0
δ_3 = u_0 v_0 - α_u α_v cot(θ) cosec(θ)
δ_12 = -1
δ_23 = -u_0² - α_u² cosec²(θ)
δ_13 = -v_0² - α_v² cosec²(θ)
From these relations it is easy to see that the intrinsic parameters can be uniquely
determined from the Kruppa coefficients, provided the five following conditions hold:
δ_13 δ_12 > 0
δ_23 δ_12 > 0
δ_13 δ_12 - δ_1² > 0
δ_23 δ_12 - δ_2² > 0
(δ_3 δ_12 + δ_1 δ_2)² / ((δ_13 δ_12 - δ_1²)(δ_23 δ_12 - δ_2²)) ≤ 1        (18)
If one of the conditions (18) does not hold, then there is no physically acceptable calibration
compatible with the Kruppa coefficients δ_ij and δ_i. This is a strong condition which rules
out many spurious solutions obtained by solving five of the Kruppa equations. It is
interesting to note that if a four-parameter model is used with θ = π/2, then there is the
additional constraint δ_3 = -δ_1 δ_2 / δ_12, which replaces the last one of (18). It can also
be verified that the calibration parameters depend only on the ratios of the Kruppa
coefficients, so that scaling them does not modify the parameter values, as expected.
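Under the sign conventions read off from the relations above (which we reconstruct from a degraded scan — the signs should be checked against the definition of D in Sect. 2.1), the recovery of the five intrinsic parameters from Kruppa coefficients can be sketched as a round trip; all names are ours:

```python
import math

def kruppa_from_intrinsics(au, av, u0, v0, theta):
    """Forward relations: Kruppa coefficients (up to scale) from the five
    intrinsic parameters, under the sign conventions assumed above."""
    cs = 1.0 / math.sin(theta)            # cosec(theta)
    cot = math.cos(theta) * cs            # cot(theta)
    return {
        "d1": -v0, "d2": -u0, "d12": -1.0,
        "d3": u0 * v0 - au * av * cot * cs,
        "d13": -v0 ** 2 - (av * cs) ** 2,
        "d23": -u0 ** 2 - (au * cs) ** 2,
    }

def intrinsics_from_kruppa(d):
    """Invert the relations, assuming the five conditions (18) hold."""
    av2 = d["d13"] * d["d12"] - d["d1"] ** 2      # (alpha_v * cosec)^2 > 0
    au2 = d["d23"] * d["d12"] - d["d2"] ** 2      # (alpha_u * cosec)^2 > 0
    cos_t = (d["d3"] * d["d12"] + d["d1"] * d["d2"]) / math.sqrt(au2 * av2)
    theta = math.acos(cos_t)
    s = math.sin(theta)
    return (math.sqrt(au2) * s, math.sqrt(av2) * s, -d["d2"], -d["d1"], theta)

# Round trip with plausible values (theta close to pi/2):
p = (640.0, 943.0, 246.0, 255.0, math.pi / 2 - 0.01)
rec = intrinsics_from_kruppa(kruppa_from_intrinsics(*p))
```

The inversion shows why the conditions (18) matter: the two quantities passed to `math.sqrt` must be positive, and the ratio passed to `math.acos` must lie in [-1, 1], which is exactly the last condition.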
6 Experimental Results
The results of experiments with computer generated data are described. The coordinates
of the projections of 3D points are computed using a realistic field of view and realistic
values for the extrinsic and intrinsic parameters. For each displacement 20 point matches
are selected and noise is added.
6.1 Computation of the epipoles
The results for the determination of the epipoles in the first image are presented. The
values obtained by the two algorithms (the Sturm method based on weighted cross-ratios,
and the fundamental matrix method) are given, as well as the relative error with respect
to the exact solution. The results in the second image are always similar to those in the
first image.
From these results, it can be seen that the fundamental matrix method is more
robust. It is also computationally very efficient since it involves only a linear least squares
minimisation and a 3 x 3 eigenvector computation. A second point worth noting is that
the stability of the position of the epipole depends strongly on the displacement that is
chosen.
Other experiments not reported here due to lack of space show that if more matches
are available then the precision of the determination of the epipoles can be improved.
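The final eigenvector step of the fundamental matrix method can be sketched as follows (our own code, with a synthetic rank-two matrix standing in for an estimated F): the epipole in the first image satisfies F e = 0, so with noise it is taken as the eigenvector of FᵀF for the smallest eigenvalue; the epipole in the second image is obtained the same way from Fᵀ.

```python
import numpy as np

def epipole_from_F(F):
    """Epipole in the first image: the (approximate) null vector of F,
    i.e. the eigenvector of F^T F with the smallest eigenvalue."""
    w, V = np.linalg.eigh(F.T @ F)   # eigh returns ascending eigenvalues
    e = V[:, 0]
    return e / e[2]                  # normalize the homogeneous vector

# Rank-two matrix with known null vector (10, 20, 1):
e_true = np.array([10.0, 20.0, 1.0])
a = np.array([1.0, 0.0, -10.0])      # both rows orthogonal to e_true
b = np.array([0.0, 1.0, -20.0])
F = np.vstack([a, b, a + b])
e = epipole_from_F(F)                # recovers (10, 20, 1)
```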
6.2 Intrinsic parameters
The intrinsic parameters that have been computed using two displacement sequences
are presented. The first sequence consists of motion 1, motion 4, motion 2. The second
sequence consists of motion 1, motion 4, motion 3.
noise            α_u       α_v       u_0       v_0       θ
0 pixels         640.125   943.695   246.096   255.648   0
0.01 pixels      597.355   940.403   248.922   259.196   0.02
  rel. err (%)     6.68      0.34      1.14      1.38
0.1 pixels       520.126   904.744   275.120   280.601   0.09
  rel. err (%)    18.7       4.1      11.8       9.7
0.2 pixels       175.204   867.214   565.234   291.162   0.4
  rel. err (%)    72.6       8.1     129.6      13.8
References
1. R. Deriche and G. Giraudon. Accurate corner detection: An analytical study. In Proceed-
ings ICCV, 1990.
2. Rachid Deriche and Olivier D. Faugeras. Tracking Line Segments. Image and vision com-
puting, 8(4):261-270, November 1990. A shorter version appeared in the Proceedings of
the 1st ECCV.
This article was processed using the LaTeX macro package with ECCV92 style
Model-Based Object Pose in 25 Lines of Code*

Abstract. We find the pose of an object from a single image when the rel-
ative geometry of four or more noncoplanar visible feature points is known.
We first describe an algorithm, POS (Pose from Orthography and Scaling),
that solves for the rotation matrix and the translation vector of the object
by a linear algebra technique under the scaled orthographic projection ap-
proximation. We then describe an iterative algorithm, POSIT (POS with
ITerations), that uses the pose found by POS to remove the "perspective
distortions" from the image, then applies POS to the corrected image in-
stead of the original image. POSIT generally converges to accurate pose
measurements in a few iterations. Mathematica code is provided in an Ap-
pendix.
1 Introduction
Computation of the position and orientation of an object (object pose) using images of
feature points when the geometric configuration of the features on the object is known
(a model) has important applications, such as calibration, cartography, tracking and
object recognition. Researchers have formulated closed-form solutions when a few feature
points are considered in coplanar and noncoplanar configurations (see [5] for a review).
However, numerical pose computations can make use of larger numbers of feature points
and tend to be more robust: the pose information becomes highly redundant, and
measurement errors and image noise average out between the feature points. Notable
among these computations are the methods proposed by Tsai [7] and by Yuan [9].
The method we describe here can also use many noncoplanar points and applies a
novel iterative approach. Each iteration comprises two steps.
1. In the first step we approximate the "true" perspective projection ( T P P ) with a
scaled orthographic projection approximation (SOP). Finding the rotation matrix
and translation vector from image feature points with this approximation is very
simple. We call this algorithm "POS" (Pose from Orthography and Scaling) (see [6]
for similar solutions without scaling, and [8] for similar equations applied to object
recognition without pose computation).
2. We use the approximate pose from the first step to displace the T P P image feature
points toward the positions they would have if they were SOP projections.
We stop the iteration when the image points are displaced by less than one pixel. Since
the POS algorithm in the first step requires an SOP image instead of a T P P image to
produce an accurate pose, using the displaced points of the second step instead of the T P P
* The support of the Defense Advanced Research Projects Agency (ARPA Order No. 6989)
and the U.S. Army Topographic Engineering Center under Contract DACA76-89-C-0019 is
gratefully acknowledged, as is the help of Sandy German in preparing this report.
336
points yields an improved pose at the second iteration, which in turns leads to displaced
image points closer to SOP points, etc. We call this iterative algorithm "POSIT" (POS
with ITerations). Four or five iterations are typically required to converge to an accurate
pose.
2 Notations
In Fig. 1, we show the classic pinhole camera model, with its center of projection O, its
image plane G at a distance f (the focal length) from O, its axes Ox and Oy pointing
along the rows and columns of the camera sensor, and its third axis Oz pointing along
the optical axis. The unit vectors for these three axes are called i, j and k.
An object with feature points M0, M1, ..., Mi, ..., Mn is positioned in the field of
view of the camera. The coordinate frame of reference for the object is centered at M0
and is (Mou, Mov, Mow). We call M0 the reference point for the object. Only the object
points M0 and Mi are shown in Fig. 1. The shape of the object is assumed to be known;
therefore the coordinates (Ui, Vi, Wi) of the point Mi in the object coordinate frame of
reference are known. The coordinates of the same point in the camera coordinate system
are called (Xi, Yi, Zi).
Fig. 1. Perspective projection and scaled orthographic projection for object point Mi and object
reference point M0.
Consider a point Mi of the object (Fig. 1). In "true" perspective projection (TPP), its
image is a point mi of the image plane G which has coordinates

xi = f Xi/Zi ,   yi = f Yi/Zi    (1)

In scaled orthographic projection (SOP), the image of Mi is a point pi with coordinates

xi' = f Xi/Z0 ,   yi' = f Yi/Z0    (2)
The ratio s = f/Z0 is the scaling factor of the SOP. The reference point M0 has the
same image m0, with coordinates x0 and y0, in SOP and TPP. The image coordinates of
the SOP projection pi can also be written as

xi' - x0 = s (Xi - X0) ,   yi' - y0 = s (Yi - Y0)    (3)
The geometric construction for obtaining the TPP image point mi of Mi and the
SOP image point pi of Mi is shown in Fig. 1. Classically, the TPP image point mi is
the intersection of the line of sight of Mi with the image plane G. In SOP, we draw a
plane K through M0 parallel to the image plane G. This plane is at a distance Z0 from
the center of projection O. The point Mi is projected on K at Pi by an orthographic
projection. Then Pi is projected on the image plane G at pi by a perspective projection.
The vector m0pi is parallel to M0Pi and is scaled down from M0Pi by the scaling factor
s = f/Z0. Eq. (3) simply expresses the proportionality between these two vectors.
We find an approximate pose by assuming that the TPP image points mi can be ap-
proximated by the SOP image points pi (Fig. 1). Our goal is to recover the coordinates
of the three unit vectors i,j, k of the camera coordinate system in the object coordinate
system using the SOP approximation. Indeed these three vectors expressed in the object
coordinate system are the row vectors of the rotation matrix R. The translation vector
T for the object is the vector OM0. Once we find the scaling factor of the SOP, this
vector OM0 is simply a scaled-up version of the image vector Om0. We call this pose
calculation method POS (Pose from Orthography and Scaling).
We modify the two expressions of Eq. (3). After expressing the coordinates Xi - X0
and Yi - Y0 of the vector MoMi as dot products of M o M i with unit vectors i and j, we
obtain
xi - x0 = s i · M0Mi ,   yi - y0 = s j · M0Mi
We define I and J as scaled down versions of the unit vectors i and j
I = s i ,   J = s j    (4)
which yields
xi - x0 = I · M0Mi ,   yi - y0 = J · M0Mi    (5)
These can be viewed as linear equations where the unknowns are the coordinates of vector
I and vector J in the object coordinate system. The other parameters are known.
Writing Eq. (5) for the object points M0, M1, ..., Mi, ..., Mn and their images, we
generate a linear system for the coordinates of the unknown vector I and a linear system
for the unknown vector J:
A I = x ,   A J = y    (6)
338
where A is the matrix of the coordinates of the object points Mi in the object coordinate
system, and x and y are the vectors of the x and y coordinates of the image points mi,
offset by the coordinates of the image point m0.
Generally, if we have at least three visible points other than M0, and all these points
are noncoplanar, matrix A has rank 3, and the solutions of the linear systems in the least
square sense are given by
I=Bx, J=By
where B is the pseudoinverse of the matrix A. We call B the object matrix. Knowing
the geometric distribution of feature points Mi, we can precompute this pseudoinverse
matrix B.
Once we have obtained least square solutions for I and J, the unit vectors i and j are
simply obtained by normalizing I and J. As mentioned earlier, the three elements of the
first row of the rotation matrix of the object are then the three coordinates of vector i
obtained in this fashion. The three elements of the second row of the rotation matrix are
the three coordinates of vector j. The elements of the third row are the coordinates of
vector k of the z-axis of the camera coordinate system and are obtained by taking the
cross-product of vectors i and j.
Now the translation vector of the object can be obtained. It is vector OM0 between
the center of projection, O, and M0, the origin of the object coordinate system. This
vector, OM0, is aligned with vector Ore0 and is equal to ZoOmo/f, i.e. Om0/s. The
scaling factor s is obtained by taking the norm of vector I or vector J. The POS method
uses at least one more point than is strictly necessary to find the object pose. At least
four noncoplanar points including M0 are required for this method, whereas three points
are in principle enough if the constraints that i and j be of equal length and orthogonal
are applied (see [3] for a simple pose solution for three or more coplanar points). Since
we do not use these constraints in POS, we can verify a posteriori how close the vectors
i and j provided by POS are to being orthogonal and of equal length. Alternatively, we
can verify these properties with the vectors I and J which are proportional to i and j
with the same scaling factor s. We construct a goodness measure G, for example as
G = |I · J| + |I · I - J · J|
The goodness measure G becomes large when the results are poor and can be used for
quickly testing the quality of the computed pose and for detecting wrong correspondences
between image points and object points.
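The POS computation described above can be sketched in a few lines of NumPy (our own transcription of Eqs. (4)-(6) and the goodness measure; variable names are ours, not the authors'):

```python
import numpy as np

def pos(image_points, object_points, focal_length):
    """POS (Pose from Orthography and Scaling): solve A I = x, A J = y with
    the pseudoinverse B of A, normalize I and J to get the first two rows of
    the rotation matrix, and recover the translation from OM0 = Om0 / s."""
    A = object_points - object_points[0]          # vectors M0Mi as rows
    B = np.linalg.pinv(A)                         # the "object matrix"
    x = image_points[:, 0] - image_points[0, 0]
    y = image_points[:, 1] - image_points[0, 1]
    I, J = B @ x, B @ y                           # least-squares Eq. (6)
    s = (np.linalg.norm(I) + np.linalg.norm(J)) / 2.0   # scale s = f/Z0
    i, j = I / np.linalg.norm(I), J / np.linalg.norm(J)
    R = np.vstack([i, j, np.cross(i, j)])
    T = np.array([image_points[0, 0], image_points[0, 1], focal_length]) / s
    G = abs(I @ J) + abs(I @ I - J @ J)           # goodness measure
    return R, T, G

# Exact SOP image of four noncoplanar points at Z0 = 10 with f = 1:
obj = np.array([[0., 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]])
img = np.array([[0., 0], [0.1, 0], [0, 0.1], [0, 0]])
R, T, G = pos(img, obj, focal_length=1.0)  # R ~ identity, T ~ (0, 0, 10), G ~ 0
```

Because the image here is an exact SOP projection, G is essentially zero; on a real TPP image, a large G signals perspective distortion or wrong point correspondences.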
The POS algorithm provides a computationally inexpensive method for directly ob-
taining the translation and rotation of an object; the accuracy of POS may be sufficient
for tracking the motions of an object in space, finding initial estimates for iterative meth-
ods, or testing whether image and object points can be matched. Furthermore, when an
object is far from the camera, it is useless to try to improve on the pose found by POS.
In this section, we present an iterative algorithm, POSIT (POS with Iterations), which
uses POS at each iteration. Fewer than five iterations are typically sufficient. The basic
idea for iterating toward a more accurate pose is the following:
Computing an exact SOP image requires knowing the exact pose of the object. However,
once we have applied POS to the actual image, we have an approximate depth for each
feature point, and we position the feature points at these depths on the lines of sight.
Then we can compute an SOP image. At the next iteration, we apply POS to the SOP
image to find an improved SOP image. The algorithm generally converges after a few
iterations and provides an accurate SOP image and an exact pose.
5.2 Finding an SOP image from a TPP image
Eq. (1) and Eq. (2) show that the SOP vector Cpi is aligned with the TPP vector Cmi
and the proportionality factor is Zi/Z0:

Cpi = (Zi/Z0) Cmi    (7)
Zl = Z0 + k . MoMi (8)
where k is the unit vector along the optical axis Oz. Expressed in the object coordinate
system, k is the third row of the rotation matrix of the object, and M0Mi is a known
vector. Eq. (7) and Eq. (8) yield for the SOP image points pi
Cpi = (1 + (s/f)(k · M0Mi)) Cmi    (9)
where we have replaced 1/Zo by s / f , the ratio of the scaling factor of the SOP by the
camera focal length.
Expression (9) provides, at each iteration of the POSIT algorithm, the approximated
positions of the SOP image points pi in relation to the image points mi, if we use the
third row of the computed rotation matrix and the computed scaling factor.
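A sketch of the full POSIT loop, combining the POS step with the correction of Eq. (9) (our own transcription, not the authors' listing in the Appendix): image coordinates are assumed to be measured relative to the image center, and the stopping test uses a fixed tolerance rather than the pixel quantization the paper uses.

```python
import numpy as np

def posit(image_points, object_points, focal_length, max_iters=50, tol=1e-6):
    """POSIT sketch: run POS on the current corrected image points, then use
    Eq. (9) to move the original TPP points toward their SOP positions."""
    A = object_points - object_points[0]          # vectors M0Mi as rows
    B = np.linalg.pinv(A)                         # the "object matrix"
    current = image_points.astype(float).copy()
    for _ in range(max_iters):
        # --- POS step on the current image points ---
        x = current[:, 0] - current[0, 0]
        y = current[:, 1] - current[0, 1]
        I, J = B @ x, B @ y
        s = (np.linalg.norm(I) + np.linalg.norm(J)) / 2.0   # scale f/Z0
        i, j = I / np.linalg.norm(I), J / np.linalg.norm(J)
        k = np.cross(i, j)
        # --- Eq. (9): move the TPP points toward their SOP positions ---
        corrected = image_points * (1.0 + (s / focal_length) * (A @ k))[:, None]
        if np.max(np.abs(corrected - current)) < tol:
            break
        current = corrected
    R = np.vstack([i, j, k])
    T = np.array([image_points[0, 0], image_points[0, 1], focal_length]) / s
    return R, T

# Synthetic TPP image of five noncoplanar points; true pose R = I, T = (1, 2, 30):
obj = np.array([[0., 0, 0], [2, 0, 0], [0, 2, 0], [0, 0, 2], [2, 2, 2]])
T_true = np.array([1.0, 2.0, 30.0])
cam = obj + T_true                              # camera coordinates (R = identity)
img = 100.0 * cam[:, :2] / cam[:, 2:3]          # TPP projection with f = 100
R, T = posit(img, obj, focal_length=100.0)      # converges to the true pose
```

At this distance-to-size ratio (15) the loop converges in a handful of iterations, consistent with the behavior reported below.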
6 Illustration of the Iteration Process in POSIT
To illustrate the iteration process of POSIT, we apply the method to synthetic data. The
object is a cube; the points of interest are the eight corners (one can easily experiment
with eight visible corners using light emitting diodes). The projection on the left of Fig. 2
is the given image for the cube (the shown projections of the cube edges are not used by
the algorithm). The distance-to-size ratio for the cube is small, thus some parallel cube
edges show strong convergence in the image. One can get an idea of the success of the
POS algorithm by computing the TPP image of the cube at the poses found at successive
iterations (Fig. 2, top row). Notice that from left to right these projections become more
similar to the given image. POSIT does not compute these images. Instead, POSIT
computes SOP images using Eq. (9) (Fig. 2, bottom row). Notice that from left to right
the edges of the cube become more parallel in these SOP images, since orthographic
projection preserves parallelism.
Fig. 2. TPP images (top) and SOP images (bottom) for cube poses computed at successive
steps by POSIT algorithm.
7 Performance Characterization
8 Results
At very low to medium range and low to medium noise, POSIT gives poses with less
than 2° rotation error and less than 2% position error.

Fig. 3. Orientation and position errors for a cube at various distances at three image noise
levels.

POSIT provides dramatic improvements over POS when the objects are very close to the
camera, and almost no
improvements when the objects are far from the camera. When the objects are close to
the camera, the so-called perspective distortions are large, and the approximation that
the image is an SOP is poor; therefore the performance of POS is poor. When the ob-
jects are very far, there is almost no difference between SOP and TPP; thus POS gives
the best possible results, and iterating with POSIT cannot improve upon them. Also,
when the object is far, pose errors increase with the distance ratios, since at long range
perturbations of a few pixels are a large percentage of the image size.
9 Convergence Analysis
We now explore with simulations the effect of the distance of an object to the camera
on the convergence of the POSIT algorithm (Fig. 4). The convergence test consists of
quantizing (in pixels) the coordinates of the image points in the SOP images obtained
at successive steps, and terminating when two successive SOP images are identical (see
Appendix A). A cube is displaced along the camera optical axis. One face is kept parallel
to the image plane. The abscissa in the plots is the distance from the center of projection
to that face, in cube size units. Noise of ±2 pixels is added to the perspective projection.
Four iterations are required for convergence until the cube is at three times its size from
the center of projection. The number gradually climbs to eight iterations for a distance
of 1, and 20 iterations for 0.5. Then the number increases sharply to 100 iterations for
a distance ratio of 0.28 from the center of projection. Up to this point the convergence
is monotonic. At still closer ranges the mode of convergence changes to a nonmonotonic
mode, in which SOP images are subjected to somewhat random variations from iteration
to iteration until they hit close to the final result and converge rapidly. The number of
iterations ranges from 20 to 60 in this mode, i.e. fewer than for the worst monotonic case,
with very different results for small variations of object distance. We label this mode
"chaotic convergence" in Fig. 4. Finally, when the distance ratio becomes less than 0.12,
the algorithm clearly diverges. Note, however, that in order to see the close corners of
the cube at this range, a camera would require a total field of more than 150°, i.e. a
focal length of less than 1.5 mm for a 10 mm CCD chip, an improbable configuration. In all our
experiments, the POSIT algorithm has been reliably converging in a few iterations in the
range of practical camera and object configurations. We are in the process of analyzing
the convergence process by analytical means, but so far have succeeded only for objects
and orientations chosen to yield simple expressions. Convergence seems to be guaranteed
if the image features are at a distance from the image center shorter than the focal length.
Fig. 4. Number of iterations as a function of distance to camera at very close ranges (left) and
for a wider range of distances (right).
Compute the pose of an object given a list of 2D image points, a list of corresponding
3D object points, and the object matrix (the pseudoinverse matrix for the list of object
points). The first point of the image point list is taken as a reference point. The outputs
are the pose computed by POS using the given image points and the pose computed by
POSIT.
GetPOSIT[imagePoints_, objectPoints_, objectMatrix_, focalLength_] := Module[
  {objectVectors, imageVectors, IVect, JVect, ISquare, JSquare, IJ,
   imageDifference, row1, row2, row3, scale1, scale2, scale, oldSOPImagePoints,
   SOPImagePoints, translation, rotation, firstPose, count = 0, converged = False},
  objectVectors = (# - objectPoints[[1]])& /@ objectPoints;
  oldSOPImagePoints = imagePoints;
  (* loop until difference between 2 SOP images is less than one pixel *)
  While[!converged,
    If[count == 0,
      (* we get image vectors from image of reference point for POS: *)
      imageVectors = (# - imagePoints[[1]])& /@ imagePoints,
      (* else count > 0, we compute a SOP image first for POSIT: *)
      SOPImagePoints = imagePoints (1 + (objectVectors.row3)/translation[[3]]);
      imageDifference = Apply[Plus, Abs[Round[Flatten[SOPImagePoints]] -
        Round[Flatten[oldSOPImagePoints]]]];
      oldSOPImagePoints = SOPImagePoints;
      imageVectors = (# - SOPImagePoints[[1]])& /@ SOPImagePoints
    ]; (* end else count > 0 *)
(* Example of input: *)
{{POSRot, POSTrans}, {POSITRot, POSITTrans}} =
  GetPOSIT[cubeImage, cube, cubeMatrix, fLength];
References
1. H.S. Baird, "Model-Based Image Matching Using Location", MIT Press, Cambridge, MA,
1985.
2. T.A. Cass, "Feature Matching for Object Localization in the Presence of Uncertainty",
MIT A.I. Memo 1113, May 1990.
3. D. DeMenthon and L.S. Davis, "Model-Based Object Pose in 25 Lines of Code", Center for
Automation Research Technical Report CAR-TR-599, December 1991.
4. R.M. Haralick, "Performance Characterization in Computer Vision", University of Wash-
ington C.S. Technical Report, July 1991.
5. R. Horaud, B. Conio and O. Leboulleux, "An Analytical Solution for the Perspective-4-
Point Problem", Computer Vision, Graphics, and Image Processing, vol. 47, pp. 33-44,
1989.
6. C. Tomasi, "Shape and Motion from Image Streams: A Factorization Method", Technical
Report CMU-CS-91-172, Carnegie Mellon University, September 1991.
7. R.Y. Tsai, "A Versatile Camera Calibration Technique for High-Accuracy 3D Machine
Vision Metrology Using Off-the-Shelf TV Cameras and Lenses," IEEE J. Robotics and
Automation, vol. 3, pp. 323-344, 1987.
8. S. Ullman and R. Basri, "Recognition by Linear Combinations of Models", IEEE Trans.
on Pattern Analysis and Machine Intelligence, vol. 13, pp. 992-1006, 1991.
9. J.S.C. Yuan, "A General Photogrammetric Method for Determining Object Position and
Orientation", IEEE Trans. on Robotics and Automation, vol. 5, pp. 129-142, 1989.
This article was processed using the LaTeX macro package with ECCV92 style
Image Blurring Effects Due to Depth Discontinuities:
Blurring that Creates Emergent Image Details*
Abstract: A new model (called multi-component blurring or MCB) to account for image
blurring effects due to depth discontinuities is presented. We show that blurring processes operating in
the vicinity of large depth discontinuities can give rise to emergent image details, quite distinguishable
but nevertheless unexplained by previously available blurring models. In other words, the maximum
principle for scale space [Per90] does not hold. It is argued that blurring in high-relief 3-D scenes
should be more accurately modeled as a multi-component process. We present results from extensive
and carefully designed experiments, with many images of real scenes taken by a CCD camera with
typical parameters. These results have consistently supported our new blurring model. Due care was
taken to ensure that the image phenomena observed are mainly due to de-focussing and not due to
mutual illuminations [For89], specularity [Hea87], objects' "finer" structures, coherent diffraction, or
incidental image noises [Gla88]. We also hypothesize on the role of blurring in human depth-from-
blur perception, based on correlation with recent results from human blur perception [Hes89].
Keywords: Multi-component image blurring (MCB), depth-from-blur, point-spread functions
(kernels), incoherent imaging of 3-D scenes, human blur perception, active vision.
1 Introduction
The objectives of this paper are: to present a simplified image blurring model that
is sufficiently general to account for blurring effects due to depth discontinuities, and
to explore some implications for depth-from-blur techniques. See [Ens91], [Gar87],
[Gro87], [Pen87&89], [Sub88a], and many others. None of the previously known
depth-from-blur formulations discussed such important cases. We realized that an
accurate model must be composite (i.e. consisting of a possibly unknown number of
sub-processes). The composite nature of blurring due to depth discontinuities gives
rise to the net blurring effects, with new local extrema generated, very much in
discord with commonly employed blurring models.
The organization of this paper is as follows:
Section 2 discusses the radiometry of image formation in the presence of
sharp discontinuities. Only incoherent (or very weakly coherent), polychromatic
lighting was assumed to be present (this was enforced in the experiments), as
is often true in normal, everyday lighting, thus radiometric models approximate
the imaging process adequately. These simple radiometric considerations are then
seen to be capable of predicting blurring instances in which interesting resultant
image structures (emergent details like peaks and valleys) are created. The object-
and-camera configuration used in this analysis is also adapted to the real-scene
experiments presented in section 3.
Section 3 presents the results from extensive experiments with images of realistic
scenes taken with a Cohu-4815 CCD camera with 8-bit accuracy. Temporal averaging
over twenty frames per image was employed to subdue various image noises to
about one gray level in variance. Various settings of camera parameters (focal length
f, aperture D and back-focal distance v) are employed to test the blurring model.
Also, controlled experiments for checking ground truth were performed to ensure
valid interpretation.

* This work is supported by the National Science Foundation under the Creativity in
Engineering Award EID-8811553, and grant IRI-89-02728.
In section 4 we illustrate the implications for depth-from-blur, an active
vision algorithm suitable for close range. We simulated Pentland's "localized
power estimation" algorithm [Pen89] to estimate the blur widths for simple
(single-component) blurring profiles as well as multi-component blurring (MCB)
cases. Finally we discuss implications to the modeling of human monocular depth
perception. This last discussion could suggest further psychophysical investigation.
<w²> = ∫_{-∞}^{+∞} (x - x̄)² K(x) dx ,    x̄ = ∫_{-∞}^{+∞} x K(x) dx    (1)
and this width is linearly related to the blur circle diameter, D|v_i - v_0|/v_0, and
inversely related to the depth (distance) u_i:

w_i = κ D |v_i - v_0| / v_0

where κ is a small constant, f is the focal length, u_i is the distance from the point P_i to
the first principal plane of the lens system, and v_i is the distance from the second principal
plane to the plane of best focus for P_i, the image of point P_i. If u_0 is set at infinity
(farther objects in better focus), then the relation above simplifies to:

w_i ≈ κ D f / u_i

giving only one solution. So, focusing the camera at infinity is desirable to prevent
ambiguities in depth-from-blur. We will assume such a setting henceforth.
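Assuming the infinity-focus relation just described (reconstructed here from a degraded scan; κ is lens dependent and would have to be calibrated), depth recovery is a one-line inversion:

```python
def blur_width(u, f, D, kappa=1.0):
    """Blur width with the lens focused at infinity (v0 = f): w = kappa*D*f/u.
    kappa is a lens-dependent constant (assumed calibrated; 1.0 here)."""
    return kappa * D * f / u

def depth_from_blur(w, f, D, kappa=1.0):
    """Invert the relation above; focusing at infinity leaves one solution."""
    return kappa * D * f / w

w = blur_width(2.0, f=0.025, D=0.008)     # point 2 m away, 25 mm lens, 8 mm aperture
u = depth_from_blur(w, f=0.025, D=0.008)  # recovers the 2 m depth
```

With the lens focused at a finite distance instead, a given blur width would correspond to two candidate depths (one in front of and one behind the plane of best focus), which is exactly the ambiguity the infinity setting avoids.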
coherent diffraction experiments) and the real "edge" used was just a carefully hand-
cut edge (by a sharp blade) out of a high-quality foam-filled cardboard. In fact,
it is our objective to show that MCB effects are detectable in scenes containing
realistic objects.
This is perhaps the most important point of this paper: image blurring near
a depth discontinuity is best analyzed separately for each surface patch at a
different depth. We will concentrate on cases where one of the blurring processes is
dominant (i.e., has a much larger spread than the others) in the image neighborhood.
(For example, in figure 1 the blurring due to E, the edge, is dominant.)
Toward modeling the imaging process of a 3-D scene, [Fri67] found that the
transfer function for a 3-D object cannot be cascaded: for a 3-D object imaged by a
cascade of two lens systems, the transfer function of the cascade is not the same as
the product of each system's transfer function. This is due to the general 3-D nature
of the resulting image (the image of a 3-D object is itself a 3-D distribution of
intensity). Blurring on the image plane is then the result of projecting
the (3-D) image distribution onto the image plane. However, we have chosen to
model the blurring as a two-stage process, as is fairly conventional:
a. Ideal image registration (geometric and radiometric), giving I0(x), the idealized
unblurred image.
b. Blurring with blur width depending on u(x), the depth value of the point P
= (X, Y, u(x)) that has its image at x.
For the one-dimensional model in figure 1, we can see that, at each image
coordinate x, the resulting intensity contains the sum of all blurring
(or diffusion) contributions from neighboring image regions (i.e., pixels), each of
which may have a different blurring kernel. Concisely, then:
where Tb(x), describing the lens occlusion effect due to the edge, is similar to a
smeared step function. Typically, the occlusion effect is small, and an ideal step
function σ(x) can be used for Tb(x). Alternatively, the background blurring kernel
K*b(x, x') is distorted from a simple Kb(x, x'). We will not go into details of the
lens occlusion effect, which is secondary. See figures 1 and 10.
With the analogy between Gaussian blurring and heat diffusion [Hum85],
[Per90], multi-component Gaussian blurring is analogous to multi-component particle
diffusion, where each type of particle has a different diffusion constant and none
of them react chemically with the others. Note that an analogy with heat diffusion
cannot be made as easily, since temperature is a single entity, unless we distinguish
between different types of heat (due to different causes, and propagating at different
rates, for example).
2.3 Emergence of image details by multi-component blurring effects
Continuing from above, we now show that multi-component blurring can give
rise to new image features (or details), as opposed to the consistent suppression
of details by single-component blurring models. By new image details, we mean
specifically new local extrema, i.e., local peaks and valleys. For ease of blur
width estimation/verification later, we assume here that every component kernel is
some shift-invariant Gaussian, but other unimodal kernels can be used.
The 1-dimensional unblurred image is again taken to be approximately two
disjoint step functions with heights Ie0, Ib0. The blurred components are Ie(x), Ib(x)
respectively.
σ(u) = { 1, u ≥ 0;  0, u < 0 }   (unit step function)   (2)
where * denotes convolution. Figures 2(a), 2(b) show the two components, and
figures 2(c) and 2(d) show a resulting MCB profile with emergent extrema. Note
that with ideal step functions as shown, a continuous, uni-modal single blurring
kernel will not introduce new local extrema. (This is the maximum principle, a main
assumption in the Gaussian scale-space concept [Per90]; MCB, however, does not
obey such restrictions.) At the emergent extremum location x_z, the gradient vanishes:
if (σe > σb and σe·Ib0 > σb·Ie0) or (σe < σb and σe·Ib0 < σb·Ie0),
then  x_z² = [2σe²σb² / ((σe + σb)(σe − σb))] · ln(σe·Ib0 / (σb·Ie0))   (8)
Note that the MCB gradient profiles can be quite different from those of SCB
(single-component blurring). See also figures 3(aa), 3(bb), 3(cc) and 3(dd). For some
range of Ie0, Ib0, the MCB gradient is actually a weighted difference of Gaussians, an
interesting fact.
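The emergence condition and extremum location derived above can be checked numerically. The sketch below (our own grid and extrema detector; parameter values taken from the examples of figure 3: Ie0 = 190, σe = 5, σb = 3) builds each blurred step from the Gaussian CDF and locates sign changes of the discrete gradient:

```python
import math

def gauss_cdf(x, sigma):
    # A unit step blurred by a Gaussian of width sigma.
    return 0.5 * (1.0 + math.erf(x / (sigma * math.sqrt(2.0))))

def mcb_profile(x, Ie0, Ib0, sig_e, sig_b):
    """Edge component (height Ie0, blur sig_e) on the left, background
    component (height Ib0, blur sig_b) on the right; interface at x = 0."""
    return Ie0 * (1.0 - gauss_cdf(x, sig_e)) + Ib0 * gauss_cdf(x, sig_b)

def local_extrema(xs, ys):
    """x positions where the discrete gradient changes sign."""
    return [xs[i] for i in range(1, len(ys) - 1)
            if (ys[i] - ys[i-1]) * (ys[i+1] - ys[i]) < 0]

xs = [i * 0.01 - 20.0 for i in range(4001)]      # x in [-20, 20]
Ie0, sig_e, sig_b = 190.0, 5.0, 3.0

# Ib0 = 150 satisfies sig_e*Ib0 > sig_b*Ie0, so new extrema are predicted at
# x_z = +/- sqrt(2*se^2*sb^2*ln(se*Ib0/(sb*Ie0)) / (se^2 - sb^2)).
Ib0 = 150.0
xz = math.sqrt(2 * sig_e**2 * sig_b**2
               * math.log(sig_e * Ib0 / (sig_b * Ie0))
               / (sig_e**2 - sig_b**2))
ext = local_extrema(xs, [mcb_profile(x, Ie0, Ib0, sig_e, sig_b) for x in xs])
print(xz, ext)    # two emergent extrema, near +/- xz

# Ib0 = 105 violates the condition (compare equation (9)): no extrema emerge.
ext2 = local_extrema(xs, [mcb_profile(x, Ie0, 105.0, sig_e, sig_b) for x in xs])
print(ext2)
```

In the Ib0 = 150 case the profile acquires a dip and a spike flanking the interface, in agreement with the figure 3 examples; in the Ib0 = 105 case it stays monotone.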
Examples: Figure 3 shows a comparison of multi-component Gaussian blurring
effects (3(a), 3(b), 3(c), 3(d)) to the effects of comparable-width single-component
Gaussian blurring. Only Ib0 is varied, with Ie0 = 190, σe = 5, σb = 3. Values for
Ib0 are 210, 190, 150 and 105 respectively. The same set of Ie0, σe, and Ib0 was
used for single-kernel blurring (which has σe = σb = 5). It is quite evident that
multi-component blurring is capable of creating interesting new extrema. Note also
that, even though case 3(a) looks like the Mach-band effect due to human retino-optic
ganglion cells [Lev85] (or, in image processing, edge-enhancement schemes using
filters similar to Laplacian-of-Gaussian kernels), MCB effects are not the result of any
purposive image processing. We are talking about images as registered on the camera
imaging sensor plane.
And there could be no confusion with Mach bands or edge-enhancement
processing in cases like figures 3(b) and 3(c), because the "peak" can occur well
below the "brighter" level (into the "darker" side, as long as the "dark" side is not
too dark). Specifically, with the given image parameters, for new extrema to be created,
Ib0 must satisfy:

Ib0 > (σb/σe) · Ie0 = 114   (9)

which is larger than 105 (the value of Ib0, the right image component, in figure 3(d)).
Hence in figure 3(d), no extrema emerged.
Physical limitations such as blooming and smear of the imaging sensor elements
(pixels) [TI86], through the mechanism of charge spilling between adjacent pixels, also
help to blur the intensity difference between neighboring pixels, thus softening MCB
features somewhat. The net effect is that the local extrema created by MCB are most
detectable over some range of Ie0/Ib0, with upper limits dictated by CCD sensor
characteristics, and lower limits at least as high as given by equation (8). This suggests
that, unlike usual blurring effects, MCB effects are more detectable at lower local
contrast, a rather surprising prediction that was actually observed in real images and
has possible implications for human perception. See figures 11, 12, 13, 14, and
especially figure 18.
Let us see what it takes for a single convolution kernel to describe the
blurring effects shown. The resulting kernel Kcomposite(x, x') is given by:

Kcomposite(x, x') = { Gσb(x − x'),  x' > 0;   Gσe(x − x'),  x' < 0 }   (10)
which looks innocuously simple, until we see some sample plots of it in figure 4.
As seen, with σe = 5, σb = 3, Kcomposite(x, x') is neither Gaussian (it is a patching
of two truncated Gaussian segments), nor shift-invariant, and not even continuous at
x' = 0 (the blurring interface). These characteristics are more pronounced for larger
ratios between the blur widths σe, σb, and for smaller absolute values of x_z. MCB
blurring can be very complex to estimate, because even for the simpler case of
shift-invariant single Gaussian blurring (directly analogous to heat diffusion) we
cannot get an exact inverse solution (i.e., for deblurring or estimation of the blur
width) [Hum85].
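The piecewise composite kernel of equation (10) is easy to probe numerically. The sketch below (our own; it assumes the background kernel applies for x' > 0 and the edge kernel for x' < 0, with the illustrative widths σe = 5, σb = 3) exhibits the jump at x' = 0 and the loss of shift invariance:

```python
import math

def gauss(u, sigma):
    return math.exp(-u * u / (2.0 * sigma * sigma)) / (sigma * math.sqrt(2.0 * math.pi))

def k_composite(x, xp, sig_e=5.0, sig_b=3.0):
    """Piecewise kernel: background Gaussian for x' > 0, edge Gaussian for x' < 0."""
    return gauss(x - xp, sig_b) if xp > 0 else gauss(x - xp, sig_e)

# Discontinuity at the blurring interface x' = 0: approach from both sides.
x = 1.0
left, right = k_composite(x, -1e-9), k_composite(x, +1e-9)
print(left, right)    # the two one-sided limits differ

# Not shift-invariant: shifting (x, x') across the interface changes the value
# even though x - x' is the same.
print(k_composite(0.5, -0.5), k_composite(1.5, 0.5))
```

This is exactly why a single shift-invariant kernel cannot reproduce MCB: the would-be kernel must switch width across the interface.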
Note that even with an anisotropic diffusion (blurring) model [Per90], new details
(new local extrema) cannot be created (by the maximum principle); some existing
details can only be preserved and possibly enhanced (i.e., sharpened).
3.1 Experimental setup
The setup is quite similar to the imaging model in figure 1. Distances from the
camera are: 1.33 meters to the edge of board E, and 5.69 meters to the 3 cardboards
that served as background B on the wall. The three background boards have slightly
different reflectivities, thus enabling convenient investigation of the MCB effect due
to local contrast (see figure 3 and figures 13, 14). To make sure that phenomena
other than MCB blurring (defocusing) were excluded from registering onto the
images, we insisted that [Ngu90a]:
a. No specular reflections were present on or near the visible surfaces in the scene.
b. No shadowing of the background patch B by the edge E.
c. No interreflections (or mutual illuminations) between the two. Interreflections
between edge E and background B can give spurious details (local extrema)
rather easily confused with MCB effects. See [For89].
d. Illumination had low partial coherence. See [Gla88].
e. Image noise was reduced to less than about 1 gray level in variance, by temporal
averaging of each image over 20 frames. This also suppresses any noise due to
the neon flicker.
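Control (e) relies on the standard fact that averaging N independent frames reduces noise variance by a factor of N. A quick simulation (synthetic Gaussian noise; frame count and pixel count are our illustrative choices):

```python
import random
import statistics

random.seed(0)
N_FRAMES, N_PIXELS, SIGMA = 20, 10000, 1.0   # illustrative values

# Per-pixel noise of a single frame vs. the 20-frame temporal average.
single = [random.gauss(0.0, SIGMA) for _ in range(N_PIXELS)]
averaged = [statistics.fmean(random.gauss(0.0, SIGMA) for _ in range(N_FRAMES))
            for _ in range(N_PIXELS)]

print(statistics.variance(single))    # close to SIGMA^2 = 1.0
print(statistics.variance(averaged))  # close to SIGMA^2 / 20 = 0.05
```

With SIGMA corresponding to a few gray levels of camera noise, 20-frame averaging brings the residual variance down to roughly one gray level, as reported above.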
3.2 Image data
Since a work of this nature must be extensively tested with carefully controlled
experiments, we have performed extensive experiments (over 300 image frames taken
for tens of scene setups) with consistent results. Here we include three typical sets
of images and their video scan lines for further discussion. Note that all middle scan
lines go through the medium-sized background cardboard.
[] Set {M} (figures 5 through 8) contains M0, an image of the overall scene; M1,
M2, two images of the background (three patches) B (one close-up and one
distant); and M3, a close-up image of edge E. This set serves to check for
uniformity of B and E both separately and together. Note especially the "edge
sharpness" and surface smoothness of the edge E.
[] Set {N} contains N1 (figure 9) and N2 (figure 10). The parameter sets for them are
back-focal distance, aperture diameter, and focal length, respectively (v, D, f):
- N1 taken with (v, D, f) = (6375 mf, 7420 ma, 8760 mz) or (87 mm, 4 mm,
84 mm).
- N2 taken with (v, D, f) = (6375 mf, 9450 ma, 8760 mz) or (87 mm, 6
mm, 84 mm).
All parameters are expressed in machine units corresponding to the zoom
lens digital controller readout: focus (mf), aperture (ma) and zoom (mz).
Corresponding physical values of (v, D, f) are believed to be accurate only
to within 5 percent, due to the lack of precise radiometric calibration for the
aperture (which is a complex entity for any zoom lens).
[] Set {P} has P1 (figure 11) and P2 (figure 12), showing the MCB effects when
camera parameters are fixed but scene lighting is changed non-uniformly (so that
local contrast can be controlled). Both were taken with (v, D, f) = (6375
mf, 9450 ma, 4200 mz) or (48 mm, 6 mm, 46 mm), but P2 with a reduction
in foreground lighting (which illuminates the edge E). This did not affect
background lighting significantly, since the whole room was lit with 44 neon tubes
and only 2 small lamps (~100 watts each) were used for independent illumination
of E.
To estimate independently the blurring widths of the background and the front
edge (so that we can compare the MCB model with real image blurring effects due to
depth discontinuity), we followed the simple method of Subbarao [Sub88b]. The blur
widths (σe, σb) estimated in (horizontal) pixels were found as follows:
Accounting also for video digitizer resampling, the effective pixel size is
approximately 16.5 µm (horizontal) by 13.5 µm (vertical).
3.3 Interpretations
Refer to figures 9 through 14. All images are originally 512×512 pixels, but
only the central 500×420 image portion is shown; image coordinates (x, y) denote the
original column and row indices, left to right and top to bottom. Analyses are done
on horizontal slices at y = 270, called middle slices. The point x = 243 on all slices
is approximately at "the interface" (corresponding to x = 0 in figure 1) between the
image regions of the background {x > 243} and the edge {x ≤ 243}.
The middle slices for the "ground-truth" images M0, M1, M2, M3 (controlled
set), included with the images (figures 5 to 8), show negligible MCB effects. They
reveal nothing very interesting on the background surface, nor across the depth
discontinuity (figures M0 and M2). Even right at the edge in image M2, one can
only see a small dip in intensity, mainly due to the remaining small roughness of the
hand-cut (which absorbed and scattered lighting a little more). However, the thin-lined
curve in figure 5, which is the middle slice of image M0* (taken with the same focal
length as for M0, but with back-focus set so that edge E is blurred), demonstrates
significant MCB blurring. M0 itself (dark dots) shows no such interesting feature.
Middle slices for images N1 and N2 (figures 9 and 10) reveal MCB effects with
rather broad spatial extents, again near x = 243. For this image pair, since
the intensity ratio Ie/Ib is approximately unity (very low local contrast), the MCB
effects are controlled by wb and we. Note also the persistence of MCB effects even
with reduced aperture: overall intensities in N1 are lower, but the "MCB details" are
still very pronounced. Compare these image slices to figures 3(a) through 3(d). Image
N2 shows the effect of aperture occlusion: the best-fitting wb value of 3.0 (for the
background, x > 243) is significantly smaller than the unoccluded background blur
width wb (about 3.45 pixels, see section 3.2 above).
Middle slices of P1 and P2 (figures 11 and 12, whose close-ups are figures 13
and 14) illustrate the detectability of MCB effects as a function of local intensity
contrast Ie/Ib (see also section 2.2): when Ie/Ib is closer to unity (lower local
contrast), MCB effects are more pronounced. This is also suggested by comparing
the slices (y = 86) as well as (y = 270) of P1, P2: reduced Ie reveals the "MCB
spike" unseen with brighter foreground (and hence higher local contrast)! This could
imply that human depth perception may be enhanced naturally by MCB effects in
low-contrast, large depth-range scenes. Section 4.2 discusses this point.
different manifestations of MCB blurring. Power measures for the image set in figure
16 mostly increase with larger blur widths, except perhaps for a small range around
σblur/σsharp < 3. Consequently, Pentland's model cannot be applied for reliable
determination of σblur from these "power data".
Figure 17 gives not even a single case of a valid power-difference measure. This is
because for all σblur = {1, ..., 6}, the "image power" consistently increases with blur
width, completely opposite to the SCB case in figure 15(b). That is, the more blurring
occurred, the higher the power measure. This last data set, as well as most of those
from figure 16, defies any "local power estimation" approach, due to emergent high
frequencies. We believe a gradient-based approach to be more viable.
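The "emergent high frequencies" claim can be illustrated with a crude stand-in for a localized power measure. The sketch below uses the energy of the second difference (our own choice, not Pentland's estimator) and the two-component profiles of section 2.3: the MCB profile carries more high-frequency energy than a single-component blur of the same edge, which is why power-based blur estimation is misled.

```python
import math

def gauss_cdf(x, sigma):
    return 0.5 * (1.0 + math.erf(x / (sigma * math.sqrt(2.0))))

def profile(x, Ie0, Ib0, sig_e, sig_b):
    # Two blurred steps: edge component on the left, background on the right.
    return Ie0 * (1.0 - gauss_cdf(x, sig_e)) + Ib0 * gauss_cdf(x, sig_b)

def hf_power(ys):
    """Crude high-frequency power: energy of the discrete second difference."""
    return sum((ys[i-1] - 2 * ys[i] + ys[i+1]) ** 2 for i in range(1, len(ys) - 1))

xs = [i * 0.1 - 20.0 for i in range(401)]
mcb = [profile(x, 190.0, 150.0, 5.0, 3.0) for x in xs]   # two-component blur
scb = [profile(x, 190.0, 150.0, 5.0, 5.0) for x in xs]   # single sigma = 5 blur

print(hf_power(mcb), hf_power(scb))   # MCB retains much more high-frequency energy
```

Greater blur normally means less high-frequency power; here the extra component reverses the trend, consistent with the figure 16 and 17 observations.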
4.2 MCB blurring effects and human blur perception
During the work in 1989, published in [Ngu90b], we speculated that
MCB effects could play some important role in human visual perception, especially
depth perception at low local contrast. This hypothesis arose naturally from
the observations in section 3.3 on the characteristics of the MCB effects (emergent
extrema). However, we had been unaware of any psychophysical data in favor of our
hypothesis until recently, when we found a paper by Hess [Hes89], who argued:
a. that human blur discrimination (between blurred edges slightly differing in blur
extent) may actually rely more on low-frequency information, rather than high-
frequency, in the vicinity of the blur edge transition.
b. that discrimination is consistently enhanced if one of the blurred edges is pre-
processed so as to give an effect similar to MCB effects (he called it phase-shifted
processing instead), that is, very similar to figures 3(d), 13(a), and 14(a). For
comparison, see figure 18, which contains our reproduction of his figures 10 and
11 in [Hes89].
The above conclusions came from Hess's study on blur discrimination without
any depth information: human subjects looked at computer-generated 2-D intensity
profiles on a screen [Wat83]. However, conclusion (b) above was very favorable in
support of our hypothesis, which also involves depth. We strongly believe that further
investigation into human perception of blurring effects due to depth discontinuities
could provide yet more clues into the workings of human visual functions.
References
[Che88] Chen, Y. C., "Synthetic Image Generation for Highly Defocused Scenes",
Recent Advances in Computer Graphics, Springer-Verlag, 1988, pp. 117-125.
[Ens91] Ens, J., and Lawrence, P., "A Matrix Based Method for Determining
Depth from Focus", Proc. Computer Vision and Pattern Recognition 1991, pp.
600-606.
[For89] Forsyth, D., and Zisserman, A., "Mutual Illuminations", Proc. Computer
Vision and Pattern Recognition, 1989, California, USA, pp. 466-473.
[Fri67] Frieden, B., "Optical Transfer of Three Dimensional Object", Journal of
the Optical Society of America, Vol. 57, No. 1, 1967, pp. 56-66.
[Gar87] Garibotto, G. and Storace, P., "3-D Range Estimate from the Focus
Sharpness of Edges", Proc. of the 4th Intl. Conf. on Image Analysis and Processing
(1987), Palermo, Italy, Vol. 2, pp. 321-328.
[Gha78] Ghatak, A. and Thyagarajan, K., Contemporary Optics, Plenum Press,
New York, 1978.
[Gla88] Glasser, J., Vaillant, J., Chazallet, F., "An Accurate Method for
Measuring the Spatial Resolution of Integrated Image Sensor", Proc. SPIE Vol.
1027 Image Processing II, 1988, pp. 40-47.
[Gro87] Grossman, P., "Depth from Focus", Pattern Recognition Letters, 5,
1987, pp. 63-69.
[Hea87] Healey, G. and Binford, T., "Local Shape from Specularity", Proc.
of the 1st Intl. Conf. on Computer Vision (ICCV'87), London, UK, (1987), pp.
151-160.
[Hes89] Hess, R. F., Pointer, J. S., and R. J. Watt, "How are spatial filters used
in fovea and parafovea?", Journal of the Optical Society of America, A/Vol. 6, No.
2, Feb. 1989, pp. 329-339.
[Hum85] Hummel, R., Kimia, B. and Zucker, S., "Gaussian Blur and the Heat
Equation: Forward and Inverse Solution", Proc. Computer Vision and Pattern
Recognition, 1985, pp. 668-671.
[Kro89] Krotkov, E. P., Active Computer Vision by Cooperative Focus and Stereo,
Springer-Verlag, 1989, pp. 19-41.
[Lev85] Levine, M., Vision in Man and Machine, McGraw-Hill, 1985, pp.
220-224.
357
[Ngu90a] Nguyen, T. C., and Huang, T. S., "Image Blurring Effects Due to Depth
Discontinuities", Technical Note ISP-1080, University of Illinois, May 1990.
[Ngu90b] Nguyen, T. C., and Huang, T. S., "Image Blurring Effects Due to
Depth Discontinuities", Proc. Image Understanding Workshop, 1990, pp. 174-178.
[Per90] Perona, P. and Malik, J., "Scale-space and Edge Detection using
Anisotropic Diffusion", IEEE Transactions on Pattern Analysis and Machine
Intelligence, Vol. PAMI-12, No. 7, July 1990, pp. 629-639.
[Pen87] Pentland, A., "A New Sense for Depth of Field", IEEE Trans. on Pattern
Recognition and Machine Intelligence, Vol. PAMI-9, No. 4 (1987), pp. 523-531.
[Pen89] Pentland, A., Darrell, T., Turk, M., and Huang, W., "A Simple, Real-
time Range Camera", Proc. Computer Vision and Pattern Recognition, 1989, pp.
256-261.
[Sub88a] Subbarao, M., "Parallel Depth Recovery by Changing Camera
Parameters", Proc. of the 2nd Intl. Conf. on Computer Vision, 1988, pp. 149-155.
[Sub88b] Subbarao, M., "Parallel Depth Recovery from Blurred Edges", Proc.
Computer Vision and Pattern Recognition, Ann Arbor, June 1988, pp. 498-503.
[TI86] Texas Instruments Inc., Advanced Information Document for TI Imaging
Sensor TC241, Texas, August 1986.
[Wat83] Watt, R. J., and Morgan M. J., "The Recognition and Representation
of Edge Blur: Evidence for Spatial Primitives in Human Vision", Vision Research,
Vol. 23, No. 12, 1983, pp. 1465-1477.
Fig. 1 Imaging geometry (schematic): edge E in front of background B, the lens, the sensor plane, and the resulting intensity profile Ie(x).
Fig. 7 Image M2: Close-up of the edge Fig. 8 Image M3: Close-up of background
Fig. 9 Image NI: low local contrast Fig. 10 Image N2: lens occlusion effect
Fig. 11 Image P1 with rows 86 and 270. Fig. 12 Image P2 with rows 86 and 270.
Fig. 13(b) Close-up of row 270 of P1. Fig. 14(b) Close-up of row 270 of P2.
For the MCB cases in the middle column (figure 16), Pentland's method is of very
limited use (the bottom row in the middle column is the only valid "power image
difference", between σright = 1 and σright = 2). For the MCB cases in the last column
(figure 17), Pentland's method is inapplicable; no estimator exists for 16(b), 17(b).
See section 4.1. Note the different power scales.
Figure 18. Our reproduction of Hess and Pointer's results on blur discrimination.
The phase-shifted processing, second row of his figure 10, consistently enhanced
human blur discrimination. Compare his figure 10 with our figures 3(d), 13(a), 14(a).
Ellipse based stereo vision 1
Johannes Buurman
Abstract. We propose a new stereo vision algorithm for finding circles in a scene. In
both 2-D images, ellipses are found; the ellipses are then matched in order to find
circles in 3-D space. The method does not require a special camera alignment;
instead, both camera matrices must be known. Some results are presented, showing
that the method is sufficiently fast and accurate for object recognition. After edge
detection, a few seconds of CPU time are sufficient to find full circles with standard
deviations of the order of 1-2% of the radius of the circles.
1. Introduction
We propose to use ellipses as well as straight lines as primitives for stereo vision, resulting
in circles and straight lines in our 3D description. The set of 3-D circles is a significant
extension of the traditional, straight-line-based approach, while the number of parameters is
still kept small enough to allow meaningful estimation. In this paper, we will concentrate on
the ellipse-based stereo algorithm. Some results will be presented.
For an overview of the relevant literature, see [Bu2]. There is very little reference to stereo
vision based on parametrized curves of higher order than a straight line. We have only found
[PP1], where it is mentioned without detail, and [RG1], which is a predecessor to our work.
The reason for this may be that it is mainly attractive when the curve considered is really a
primitive of the scene. Such is the case in our work, where cylindrical objects must be
recognized.
2. Problem Description
When a circle is seen under perspective projection, the result is an ellipse. Projecting lines
from the focal point through each point on the ellipse back into 3-D space yields a "cone"
with an elliptical section. Stereo vision results in one ellipse in each image. However, com-
puting the section of the "cones" does not lead to a single circle. If the ellipses found were
exactly correct, we would find the two different interpretations that are possible: the circle
and a very eccentric ellipse, see Fig. 1.
Fig. 1. Two interpretations of two ellipses in stereo vision. Top: the circle; bottom: an
eccentric ellipse. Fig. 2. 100 points on a typical section of inexact "cones". Both the
circle and the ellipse are still visible.
1. This work has been conducted as part of the Delft Intelligent Assembly Cell project, partially
sponsored by SPIN/FLAIR.
In general however, the two 2-D ellipses will not be exact, and the axes of the two "cones"
will not meet in one point. The section of the two cones will then look like the example in
Fig. 2., a single curve that is not planar, but instead switches between the circle and the
eccentric ellipse. Parts of the circle and the ellipse can still be distinguished. Of course, if
two entirely unrelated ellipses are matched, the section of the "cones" may well be empty.
So, the problem of finding 3-D circles through stereo vision can be split into three steps:
- Finding ellipses in 2-D images.
- Computing the section of the corresponding "cones".
- Identifying the circle in each of these sections.
In this section, we have written "cones" to denote that the set of all lines through one point
and an ellipse does not, in general, have a circular section. In the remainder of this paper, we
will just use the word cone assuming that the reader is aware of this.
3. Overview of the method
An overview of the method is shown in Fig. 3. Both images are treated identically by the
preprocessing steps: edge detection is applied and the detected edges are stored as chain-
codes representing one-pixel-thick lines (for the full procedure see [Bu1]). From these code
strings, candidate elliptical edges are selected, to which an ellipse fit algorithm is applied.
The output of this algorithm is a set of ellipse equations.
One of the sets of ellipse equations is first converted to a set of cone equations using the
appropriate camera matrix. Then, this set of cones and the other set of ellipse equations with
its camera matrix are passed to the stereo algorithm, which outputs a set of 3-D circles.
4. Ellipse selection and fitting
In order to determine which ellipses are present in each image, we resort to fitting, but this
requires that the set of points belonging to one ellipse be identified first. We need to split up
strings of chaincodes until each string contains only one ellipse. Accuracy of the fit
requires that these strings be as large as possible. Splitting based on the derivative of the
curvature at each part of the string does not work very well, because of the eccentricity of
some ellipses. Instead, we apply a simple grammar to the strings and split where the gram-
mar no longer applies. This algorithm is simple and fast, and works as long as the smoothing
effect of the edge detector is sufficient to deal with most of the noise on each ellipse.
Let ci be a Freeman code for direction i, and ci+1 (ci-1) be the code for the next direction
(counter-)clockwise. If we use c* to denote zero or more, and c+ to denote one or more
occurrences of c, a string belonging to an ellipse can be written either as:

(ci+ ci-1+)* ci* (ci+1+ ci+)* ci+1* (ci+2+ ci+1+)* ...   (1)

or
which, although somewhat biased towards eccentric ellipses, is a good and efficient
estimator of ellipse parameters. If the minimal error per pixel is below a certain threshold,
and the set of parameters that minimizes it corresponds to an ellipse (and not another conic
section), we accept the ellipse.
As a final step, we go through the set of ellipses found to see whether two ellipses can be
merged, keeping the average error over the union of the pixels below the threshold. This is
done repeatedly until no more mergers are possible.
5. Ellipse-based stereo
Our method is general in that it does not require the two cameras to be parallel. Rather, it
assumes that each camera can be described by its mapping of world coordinates yi to camera
coordinates xi (in homogeneous coordinates, multiplication with a matrix C),
and that both camera matrices CL and CR are known. Using (4), the set of all world points
y that correspond to a given camera point (x1, x2) is the line described by the two equations

(C1 - x1 C3) · y = 0   and   (C2 - x2 C3) · y = 0   (5)

where y is the column vector (y1, y2, y3, 1)^T and C1, C2 and C3 are the rows of C; see e.g.
[Ho1]. The line corresponding to (5) can also be written in parametric form, as is shown for
instance in [BB1]. In [Bu2], we derive how an ellipse cone equation can be obtained of the
form

(C3 · y)^2 = (t1 · y)^2 + (t2 · y)^2,   (6)
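Equation (5) can be checked with a toy camera matrix. The sketch below is ours (the focal length and the world point are illustrative values, not the paper's calibration): every world point on the viewing ray of an image point, including the camera centre, satisfies both plane equations.

```python
def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def project(C, Y):
    """Pinhole projection of homogeneous world point Y by the 3x4 matrix C."""
    w = dot(C[2], Y)
    return dot(C[0], Y) / w, dot(C[1], Y) / w

# Toy camera: focal length 800, centre of projection at the origin.
f = 800.0
C = [[f, 0, 0, 0],
     [0, f, 0, 0],
     [0, 0, 1, 0]]

Y = [0.2, -0.1, 2.0, 1.0]           # a world point (homogeneous)
x1, x2 = project(C, Y)

# The two planes of equation (5); their intersection is the viewing ray.
plane1 = [c1 - x1 * c3 for c1, c3 in zip(C[0], C[2])]
plane2 = [c2 - x2 * c3 for c2, c3 in zip(C[1], C[2])]

centre = [0.0, 0.0, 0.0, 1.0]       # camera centre: C maps it to (0, 0, 0)
for P in (Y, centre):
    print(dot(plane1, P), dot(plane2, P))   # all zero: both lie on the ray
```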
N, used to describe one ellipse; the size n of the plane-fitting window; the thickness dmax
allowed to such a plane (the distance beyond which points are no longer considered part of
the plane); and a maximum error e allowed for each circle found. Except for the plane
thickness, these are relative parameters and depend only on the noise.
6. Constraints
The process of stereo matching can be sped up considerably by the use of constraints. Our
current implementation uses the following constraints explicitly:
- The epipolar constraint. This well-known constraint indicates that points in one image
can only match points on a single line in the other image. This line is the section of the plane
of the other image with the plane through the point and the centres of projection; see [Ho1].
In our implementation it is applied to ellipse centres: if the centres of two ellipses are not
within a given distance of an epipolar line, they are not considered in the matching process.
- The similarity constraint. Most structures in one image look almost like the corresponding
structures in the other image, if that image is obtained from a viewpoint which is not too
distant (which is usually the case in stereo vision). If the difference in length of the major
axes of two ellipses is not below a certain threshold, a match is not considered.
- Spatial constraints. Only ellipses within a specific volume in space are accepted. Because
of our application, we know very well where real ellipses can be.
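The epipolar test on ellipse centres can be sketched with the same kind of toy camera matrices; the baseline, focal length, world point, and distance threshold below are our own illustrative choices, not the system's calibration:

```python
def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def project(C, Y):
    w = dot(C[2], Y)
    return dot(C[0], Y) / w, dot(C[1], Y) / w

f, base = 800.0, 0.12                      # focal length, 12 cm baseline
CL = [[f, 0, 0, 0],          [0, f, 0, 0], [0, 0, 1, 0]]
CR = [[f, 0, 0, -f * base],  [0, f, 0, 0], [0, 0, 1, 0]]

def epipolar_distance(xl, xr):
    """Distance of right-image point xr from the epipolar line of left point xl.
    The line is found by back-projecting xl (here, a ray through the left
    camera's origin) and projecting two of its points into the right image."""
    p1 = project(CR, [xl[0] / f, xl[1] / f, 1.0, 1.0])          # ray point, depth 1
    p2 = project(CR, [2 * xl[0] / f, 2 * xl[1] / f, 2.0, 1.0])  # ray point, depth 2
    a = p1[1] - p2[1]                      # line a*x + b*y + c = 0 through p1, p2
    b = p2[0] - p1[0]
    c = p1[0] * p2[1] - p2[0] * p1[1]
    return abs(a * xr[0] + b * xr[1] + c) / (a * a + b * b) ** 0.5

Y = [0.3, 0.1, 0.65, 1.0]                  # an ellipse centre 65 cm away
xl, xr = project(CL, Y), project(CR, Y)
print(epipolar_distance(xl, xr))           # ~0: centres may match
print(epipolar_distance(xl, (xr[0], xr[1] + 25.0)))  # far off the line: reject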
7. Results
The algorithm has been implemented and applied to a number of tests, using the following
parameter values: N = 100, n = 5, d = 4 mm, e = 0.15. The principle of the method is illus-
trated in Fig. 4. Shown are: the stereo pair, edges extracted from each image, ellipses found
in each image, and the circles found, projected back into the scene. The object shown is one
of the test objects in our project, with four circles of which three are partially visible. Note
that only the top sides of the discs at each end are found, which is sufficient for recognition.
The object is about 10 cm wide and 5 cm high, and is one of the largest parts in the cell. Both
cameras were placed at 65 cm from the scene, 12 cm apart, without any special alignment.
Instead, the camera matrices were obtained using a robot-controlled calibration procedure.
Processing was done on a Sun 4/330 workstation using 512×512 images. For edge
detection about 25 s of image processing were required; chaincode handling and ellipse fit-
ting took 1 s, and the ellipse stereo algorithm 0.2 s (CPU times). Note that the image process-
ing time is independent of scene complexity and can be improved upon using edge detection
hardware.
An analysis of the accuracy of the method can be found in [Bu2]. The figures show that the
method allows position estimates of circle centres with a standard deviation of the order of
1% of the circle's radius. The standard deviation of radii is about 1.7%. Orientation angles
show standard deviations of the order of 1 degree. Further research is needed for quantitative
analysis of the performance of the algorithm where ellipses are only partially visible. A sig-
nificant increase of errors is to be expected. Furthermore, the integration of ellipse based ste-
reo in the full stereo vision system needs further evaluation.
8. Conclusions
We have described a stereo vision system using ellipses in images to represent circles in
3D. It is based on an algorithm to find ellipses or partial ellipses in gray-value images, and an
algorithm to find the circle in 3D corresponding to two ellipses. The algorithms prove to be
sufficiently fast and accurate for use in robot vision. Speed improvements may be obtained
using edge detection hardware. The performance of the stereo algorithm where ellipses are
only partially visible must still be investigated quantitatively.
9. References
[BB1] D.H. Ballard and C.M. Brown, Computer Vision, Prentice-Hall, Englewood Cliffs, New
Jersey, (1982).
[Bu1] J. Buurman, The Diac Object Recognition System, to be presented at SPIE conference on
Applications of Artificial Intelligence X: Machine Vision and Robotics, Orlando (1992).
[Bu2] J. Buurman, Ellipse based stereo vision, Internal report, Pattern Recognition group, faculty
of Applied Physics, Delft University of Technology, (1992).
[FS 1] N.J. Foster and A.C. Sanderson, Determining Object Orientation Using Ellipse Fitting, SPIE
Intelligent Robots and Computer Vision, vol. 521, (1984) 34-43.
[Hol] B.K.P. Horn, Robot Vision, McGraw-Hill, New York, (1986).
[PP1] S.B. Pollard, J. Porrill, and J.E.W. Mayhew, Recovering partial 3D wire frames descriptions
from stereo data, Image and Vision Computing, vol. 9, no. 1, (1991) 58-65
[RG1] C.J. Rijnierse and F.C.A. Groen, Graph construction and matching for 3D object recogni-
tion, in: Pattern recognition and artificial intelligence - towards an integration, ed. L.N.
Kanal, North Holland, Amsterdam, (1988)
Applying Two-dimensional Delaunay Triangulation to Stereo Data Interpolation
E. Bruzzone 1, M. Cazzanti 2, L. De Floriani 2, F. Mangili 1
1 Introduction
Delaunay triangulation has turned out to be a very powerful tool in many application fields,
including finite element analysis, motion planning, digital terrain modeling and surface
reconstruction in computed tomography [LR1, Ch1, Bo1, DP1]. This representation has
several important properties: it is invariant under rigid transformations, it adapts to the
data distribution, and it is easy to update because of the local effect of inserting new points
or segments.
In classical computer vision problems, like scene reconstruction and autonomous nav-
igation, Delaunay triangulation has often been adopted for both 2D and 3D data. In
particular, its discontinuity-preserving nature makes it especially suitable to interpolate
passive stereo data, which usually correspond to scene discontinuities. The use of 3D
Delaunay triangulation for interpolation of data obtained by a stereo process was first
proposed in [Bo1]. A coherent and comprehensive presentation of this approach can be
found in [FL1], where the authors suggest a modification to the standard Delaunay triangula-
tion to include stereo segments as part of the triangulation, based on the addition of extra
points.
A new approach to 3D surface reconstruction which starts from stereo data and makes
use of a two-dimensional Delaunay triangulation, including the projections of the segments
as part of the triangulation, has been proposed in [BG1]. The basic idea is to interpolate
the image segments which form the input for the stereo reconstruction process. The
computed 2D mesh is then backprojected into 3D space using the corresponding
reconstructed stereo data. The result of the whole process is a triangular-faced piecewise
linear surface, in which the stereo segments are somehow preserved. Interesting features
of this approach are its fairly low computational cost, due to the fact that most of the
processing is done in 2D, and its robustness toward calibration and stereo reconstruction
errors. A drawback of this approach is the splitting of the segments in the image
plane, which requires the computation of the 3D coordinates of the introduced points and
produces many small triangles in special segment configurations.
In this paper, we present a further development of that work by proposing an approach
to 3D surface reconstruction from stereo data based on the computation of a constrained
Delaunay triangulation in the image plane, which avoids the segment splitting and therefore
the computation of the 3D position of the added points [Ch1, LL1, DP1].
The surface reconstruction process consists of three phases: stereo segment reconstruction,
constrained Delaunay triangulation in the image plane, and backprojection of the two-
dimensional tessellation.
The edge-segment-based stereo process developed under the Esprit Project P940 [AL1,
Mu1] has been adopted. Three images are acquired from slightly different points of view.
On each image, low-level processing consisting of edge detection, edge linking and polygonal
approximation is performed, resulting in a set of 2D segments corresponding to relevant
scene features. One of the three images is selected as the reference image. For each segment of
the reference image, possible matches, i.e., segments corresponding to the same feature in
the other two images, are selected, making use of the epipolar constraint. Then, for each
triple of matched segments, the 3D segment is reconstructed on the basis of perspective
projection.
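The epipolar selection step described above can be sketched as follows. This is an illustrative reconstruction, not the P940 implementation; the fundamental matrix F, the distance tolerance, and all function names are assumptions for the sketch.

```python
import numpy as np

def epipolar_line(F, p):
    """Epipolar line l = F p~ (homogeneous) of image point p in the other view."""
    return F @ np.array([p[0], p[1], 1.0])

def point_line_distance(p, l):
    """Perpendicular distance of point p from the homogeneous line l = (a, b, c)."""
    a, b, c = l
    return abs(a * p[0] + b * p[1] + c) / np.hypot(a, b)

def match_segments(segs_ref, segs_other, F, tol=2.0):
    """For each reference segment, keep candidate segments in the other image
    with an endpoint close to the epipolar line of each reference endpoint."""
    matches = []
    for i, (p1, p2) in enumerate(segs_ref):
        l1, l2 = epipolar_line(F, p1), epipolar_line(F, p2)
        cands = [j for j, (q1, q2) in enumerate(segs_other)
                 if min(point_line_distance(q1, l1), point_line_distance(q2, l1)) < tol
                 and min(point_line_distance(q1, l2), point_line_distance(q2, l2)) < tol]
        matches.append((i, cands))
    return matches
```

In the trinocular case the same test is applied against both non-reference images, and only segments surviving both checks enter the 3D reconstruction.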
The triangulation is computed on the 2D segments of the reference image plane selected
by the stereo process. Note that such segments are perspective projections of real observed
features, and therefore they reflect the visibility properties of the world features from
which they originated. As the low-level phases of edge linking and polygonal
approximation guarantee that the segments are disjoint, each triangle is bounded by only
one stereo segment. Moreover, as the image segments are directly the output of the stereo
matching, the triangulation can be computed independently of the stereo reconstruction,
avoiding the errors which may occur in the reconstruction phase.
The 2D mesh is then backprojected into 3D space using the corresponding 3D
segment endpoints evaluated during the stereo phase. The result of the whole process
is a triangular-faced piecewise linear surface, in which the stereo segments are somehow
preserved. For each triangular face of the surface, the unit normal vector is computed,
achieving a space-variant needle-map representation of the observed scene. The geometric
structure resulting from the backprojection can be defined by a function ρ = ρ(θ, φ) in a
system of spherical coordinates centered in the pin-hole of the camera. Therefore, possible
intersections among the triangular faces of the 3D surface can be caused only by errors
occurring in the stereo process.
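The per-face unit normals of the needle map follow directly from the triangle vertices. A minimal sketch; the array layout (one row per 3D vertex, faces as index triples) is an assumption:

```python
import numpy as np

def face_normals(vertices, faces):
    """Unit normal of each triangular face of the backprojected surface.
    vertices: (n, 3) array of 3D points; faces: list of vertex-index triples."""
    normals = []
    for i, j, k in faces:
        # cross product of two edge vectors gives a vector normal to the face
        n = np.cross(vertices[j] - vertices[i], vertices[k] - vertices[i])
        normals.append(n / np.linalg.norm(n))
    return np.array(normals)
```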
The two-dimensional Delaunay triangulation of a set P = {P1, P2, ..., Pn} of points in the
plane is the straight-line dual of the Voronoi diagram [PS1]. The Voronoi diagram of P is
a collection V = {V1, V2, ..., Vn} of convex regions, called Voronoi regions, such that Vi is
the locus of the points of E² closer to Pi than to any other point in P. Given a set P of
points in the plane and a set S of non-intersecting straight-line segments whose endpoints
are contained in P, the pair G = (P, S) defines a planar straight-line graph, called the
constraint graph. A triangulation T of P whose edge set contains S is called a constrained
triangulation of P with respect to S. A Constrained Delaunay Triangulation (CDT) T of
a set of points P with respect to a set S of line segments is a constrained triangulation
of P in which the circumcircle of each triangle t of T does not contain (in its interior)
any other vertex Pi of P which can be joined to each vertex of t by a line segment not
intersecting any constraint segment (see Figure 1).
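The constrained empty-circle criterion just stated can be written as a predicate. The sketch below assumes counter-clockwise triangles and uses the standard in-circle determinant plus a proper segment-crossing test; the function names are mine:

```python
import numpy as np

def in_circumcircle(a, b, c, p):
    """True if p lies strictly inside the circumcircle of CCW triangle (a, b, c)."""
    m = np.array([[a[0]-p[0], a[1]-p[1], (a[0]-p[0])**2 + (a[1]-p[1])**2],
                  [b[0]-p[0], b[1]-p[1], (b[0]-p[0])**2 + (b[1]-p[1])**2],
                  [c[0]-p[0], c[1]-p[1], (c[0]-p[0])**2 + (c[1]-p[1])**2]])
    return np.linalg.det(m) > 0

def _ccw(a, b, c):
    return (b[0]-a[0])*(c[1]-a[1]) - (b[1]-a[1])*(c[0]-a[0])

def segments_cross(p1, p2, q1, q2):
    """Proper (interior) intersection test for segments p1p2 and q1q2."""
    d1, d2 = _ccw(q1, q2, p1), _ccw(q1, q2, p2)
    d3, d4 = _ccw(p1, p2, q1), _ccw(p1, p2, q2)
    return d1 * d2 < 0 and d3 * d4 < 0

def violates_cdt(tri, p, constraints):
    """p violates the constrained empty-circle property of tri only if it lies
    inside the circumcircle AND is visible from every vertex of tri, i.e. no
    constraint segment crosses the segment joining p to that vertex."""
    a, b, c = tri
    if not in_circumcircle(a, b, c, p):
        return False
    return all(not segments_cross(p, v, s1, s2)
               for v in tri for (s1, s2) in constraints)
```

A constraint segment lying between p and a triangle vertex thus shields the triangle, exactly as in the definition.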
Static algorithms for computing a CDT have appeared recently in the computational geom-
etry literature [LL1, Ch1]. The algorithm we use, proposed in [DP1], is instead based on
incremental refinements of a Delaunay triangulation. It starts from an initial Delaunay
triangulation of a specified subset of the input data, and then modifies the triangulation
by inserting the points of P and the segments of S one at a time. Thus, the two major
computational steps of the algorithm are (i) CDT modification when inserting a point P,
(ii) CDT modification when inserting a segment l.
Step 1 is performed by extending a standard method for adding a point to a Delaunay
triangulation to the constrained case [Wa1]. When a new point P is inserted, the trian-
gles whose circumcircle contains P are deleted and the resulting star-shaped polygon is
triangulated by connecting the vertices of such a polygon to P. The worst-case complex-
ity of this step is O(n). Thus, inserting all n data points leads to an O(n²) worst-case
complexity, which reduces to O(n log n) if randomized algorithms are used [GK1].
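The point-insertion step can be sketched compactly. This is an illustration of the method of [Wa1] in the unconstrained case, not the authors' code; the in-circle test is made robust to vertex ordering so that newly created triangles need not be re-oriented.

```python
def _orient(a, b, c):
    """Twice the signed area of triangle (a, b, c)."""
    return (b[0]-a[0])*(c[1]-a[1]) - (b[1]-a[1])*(c[0]-a[0])

def _in_circle(tri, p):
    """True if p is strictly inside the circumcircle of tri, for either vertex
    ordering (the in-circle determinant is scaled by the orientation sign)."""
    rows = [(v[0]-p[0], v[1]-p[1], (v[0]-p[0])**2 + (v[1]-p[1])**2) for v in tri]
    (a1, a2, a3), (b1, b2, b3), (c1, c2, c3) = rows
    det = a1*(b2*c3 - b3*c2) - a2*(b1*c3 - b3*c1) + a3*(b1*c2 - b2*c1)
    return det * _orient(*tri) > 0

def insert_point(triangles, p):
    """One insertion step: delete the triangles whose circumcircle contains p,
    then re-triangulate the star-shaped hole by joining p to its boundary."""
    bad = [t for t in triangles if _in_circle(t, p)]
    count = {}
    for t in bad:
        for e in ((t[0], t[1]), (t[1], t[2]), (t[2], t[0])):
            count[frozenset(e)] = count.get(frozenset(e), 0) + 1
    # an edge shared by two deleted triangles is interior to the hole;
    # edges counted once form the boundary of the star-shaped polygon
    boundary = [e for t in bad
                for e in ((t[0], t[1]), (t[1], t[2]), (t[2], t[0]))
                if count[frozenset(e)] == 1]
    return [t for t in triangles if t not in bad] + [(a, b, p) for a, b in boundary]
```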
Step 2 is performed by intersecting the new segment l with the existing triangulation
and retriangulating the region of the plane defined by the union of the triangles intersected
by l. The edges bounding the region of T intersected by l, called the influence region, form
a simple polygon Ql, called the influence polygon, of which l is a diagonal. l splits Ql into two
simple polygons π1 and π2, which are triangulated by recursively splitting them into three
subpolygons. The resulting triangulation of π1 and π2 is then locally optimized by an
iterative application of the empty-circle criterion for a CDT [DP1].
The time complexity of the influence region computation of a constraint segment l is
linear in the number of triangles intersected by l. Both rebuilding the constrained Delaunay
triangulation of a polygon and its optimization have a quadratic worst-case complexity in
the number of vertices of the influence polygon. The worst-case complexity of the segment
insertion algorithm is O(mn²), where m is the number of constraint segments (n = 2m if
the points of P are the endpoints of the segments of S). By using an asymptotically optimal
Delaunay triangulation algorithm for simple polygons [LL1], the worst-case complexity of
the algorithm could be reduced to O(mn log n), at the price of losing the implementation simplicity.
An alternative approach to including a set of segments in a Delaunay triangulation con-
sists of splitting the segments into subsegments (by adding additional vertices), so that
the constrained Delaunay triangulation of all subsegments is the same as the Delaunay
triangulation of the augmented vertex set. In [FL1] a preprocessing step is used to split
the segments according to their minimum distance.
A comparison between the CDT algorithm and the segment-splitting algorithm de-
scribed in [FL1] has been done. Experimental results show that the average number of
inserted points triples the number of original points (segment endpoints). The number
of points (and triangles) increases dramatically when close parallel segments occur in the
input data. Figure 2 shows the results of the CDT and segment-splitting algorithms on a real
indoor scene.
Figure 2: Reference image of a trinocular stereo system (a) and matched segments (b).
CDT (c) and segment-splitting triangulations (d).
4 Experimental Results on Scene Reconstruction
The complete process of scene reconstruction has been tested on a set of real scenes.
Assuming as reference applications both scene surface characterization and free-space
detection for autonomous navigation tasks, indoor images (i.e., office and laboratory
images) have been acquired.
The DMA machine, developed under the Esprit Project P940, has been used to get
both the 3D reconstructed segments and the corresponding 2D segments of the reference
image. First, an unconstrained Delaunay triangulation is built on the segment endpoints.
Then, the resulting triangulation is updated by adding the input segments as Delaunay
edges.
The surface obtained by backprojecting the CDT into 3D is made of a minimum number
of triangles (for instance, two parallel segments define only two triangles). Besides, very
elongated triangles, which may occur in the image-plane CDT, often correspond to more
equiangular triangles in 3D, due to the perspective projection under which the scene has
been seen.
Experimental results have shown that the running time of the whole surface recon-
struction process is reduced by about 50% using the CDT algorithm rather than the
segment-splitting one. Such a reduction is due to both the triangulation phase (without
the splitting of the constraint segments) and the backprojection phase.
5 Concluding Remarks
The proposed scene reconstruction process starting from stereo segments is based on a
two-dimensional Constrained Delaunay Triangulation done in the image plane and results
in a triangular-faced piecewise linear description of scene surfaces. With respect to what
is presented in [BG1], the main novelty is in the use of a powerful algorithm which constrains
the triangulation to the input segments, avoiding the insertion of extra points. Some ex-
perimental tests on real data have confirmed the foreseen advantages of this new approach
in terms of both computational efficiency and improvement of the resulting surface de-
scription.
As the bottleneck of the whole strategy is in the computation of the CDT, a parallel
implementation of this phase on the Elsag Bailey multiprocessor machine EMMA2 has
been completed.
References
[AL1] Ayache N., Lustman F.: Fast and Reliable Passive Trinocular Stereovision, Proceedings 1st International Conference on Computer Vision, London (1987).
[Bo1] Boissonnat J.D.: Geometric Structures for Three-Dimensional Shape Representation, ACM Transactions on Graphics, 3, 4 (1984).
[BG1] Bruzzone E., Garibotto G., Mangili F.: Three-Dimensional Surface Reconstruction using Delaunay Triangulation in the Image Plane, Proceedings International Workshop on Visual Form, Capri (1991).
[Ch1] Chew L.P.: Constrained Delaunay Triangulation, Proceedings 3rd Symposium on Computational Geometry, Waterloo (1987).
[DP1] De Floriani L., Puppo E.: Constrained Delaunay Triangulation for Multiresolution Surface Description, Proceedings 9th International Conference on Pattern Recognition, Roma (1988).
[FL1] Faugeras O.D., Le Bras-Mehlman E., Boissonnat J.D.: Representing Stereo Data with the Delaunay Triangulation, Artificial Intelligence, 44 (1990).
[GK1] Guibas L.J., Knuth D.E., Sharir M.: Randomized Incremental Construction of Delaunay and Voronoi Diagrams, Proceedings ICALP (1990).
[LL1] Lee D.T., Lin A.K.: Generalized Delaunay Triangulation for Planar Graphs, Discrete & Computational Geometry, 1 (1986).
[LR1] Lewis B.A., Robinson J.S.: Triangulation of Planar Regions with Applications, The Computer Journal, 21, 4 (1979).
[Mu1] Musso G.: Depth and Motion Analysis: the Esprit Project P940, Proceedings 6th Annual ESPRIT Conference, Brussels (1989).
[PS1] Preparata F., Shamos M.I.: Computational Geometry: an Introduction, Springer-Verlag, New York (1985).
[Wa1] Watson D.F.: Computing the n-dimensional Delaunay Tessellation with Application to Voronoi Polytopes, The Computer Journal, 24 (1981).
Local Stereoscopic Depth Estimation Using Ocular
Stripe Maps
1 Introduction
Multiframe analysis of images, such as stereopsis and time-varying image sequences, has
been a primary focus of activities within the last decade of computational vision research.
In both areas, the key problem has been identified as finding the correct correspondences
of homologous image points. This so-called correspondence problem has not yet been solved
in a form applicable to general-purpose vision tasks. For a review of relevant techniques for
finding stereo correspondences, we refer to e.g. [3].
Finding stereo correspondences can be identified as a mathematically ill-posed prob-
lem, which has to be regularized utilizing constraints imposed on the possible solution
(see e.g. [10]). The majority of computational approaches is therefore formulated as
finding a solution in a high-dimensional search or optimization space by minimizing a
functional which usually takes into account a data similarity term as well as a model
term (e.g. for achieving smoothness) to regularize the solution (see e.g. [1, 10]). In order
to avoid the complexity of most of the existing computational techniques, we investigated
biological findings about architectures and mechanisms for seeing stereoscopic depth.
Due to the limited space of the current conference proceedings, this contribution does
not cover all the topics we had to present, nor does it provide the necessary context to
fully understand the presented facts. If required, consult [5, 6] and the references therein
for full background information.
* Email: ludwig@kogs26.informatik.uni-hamburg.de
** Email: neumann_h@rz.informatik.uni-hamburg.de
2 Biology
Biological Data Structures Support Efficient Computation. An alternative to
the most commonly realized strategy in computational vision research ([8]) is to in-
fer information processing capabilities from the identification of structural principles in
the mammalian visual cortex (see e.g. [7]). These general principles include the discrete
mapping of different sensory features like orientation, color, ocularity, depth or motion
in 3D space to positions in a subspace of R² ([4]). The "computational maps" discov-
ered so far have been postulated to optimally support computational mechanisms of
different specificity.³ Our computational model for local depth estimation is based on
the subdivision of the cortex into ocular dominance stripes. Starting with the idea that
a hypothetical disparity-sensitive cell uses a local section of two neighboring stripes as
input to compute local disparity, we can subdivide the original left and right image into
local patches, whose size is chosen according to two major, in principle contradicting,
constraints: increasing stability of the estimate with increasing stripe width, and increas-
ing accuracy with decreasing stripe width. Given the disparity at all single locations, we
obtain a disparity map from which a (relative) depth map can be easily inferred.
3.1 Analysis and Evaluation
To determine the values of the parameters of the technical model, we have evaluated the
relevant and sometimes diverging biological data from various sources to get a reasonable
and consistent parameter setting. For a detailed discussion see [5]. Two corresponding
local image patches, extracted from the left and right image, respectively, can be arranged
³ However, from the set of maps and principles given above, only the principles of retinotopy
and ocular dominance stripe maps are fully established ([4]). The organization of alternating
bands of ipsi- and contralateral eye dominance has been modeled recently as being the result
of a structural transformation principle described via a non-linear mapping function ([7]).
These so-called ocular dominance columns subdivide the whole area of the cortex (area 17,
layer 4B) into alternating bands of ca. 0.5 mm width.
⁴ As a part of an active vision system, a fixating binocular head necessarily requires an atten-
tional control module for the selection of appropriate fixation points and a module for the
vergence movement to fixate the selected points. Proposals have been published on how fixation
points could be selected and how such a point may be tracked in time (see [5] for references).
⁵ In the case of idealized circular retinae, the iso-disparity lines are circles of different radii with the
horopter circle as one element of the set. In the case of flat projection planes, these iso-lines are
conic sections (see [5] for a detailed discussion).
in a local neighborhood to form a single joint signal. This idea was originally utilized
in an algorithm proposed by Yeshurun & Schwartz [11] using rectangular patches butted
against each other. If such a combined signal is filtered with the cepstrum, the filtered
image contains a strong and sharp peak at a position which codes the disparity shift
between the two original subsignals. This can be derived mathematically for the ideal
case of a pure translational shift (see e.g. [5, 11]). Excluding some special cases, which
will be named later in this paper, the disparity between the two subsignals can be
obtained by simple maximum detection in the cepstral plane.
Using this method for computing disparities has several advantages. First, it is fast,
because the disparities are computed in a single step without any iterations. Second,
due to the local and therefore independent computation of the disparities, parallelization
is easy. Third, it is well known from previous work that the cepstrum is extremely
insensitive to noise. We showed in a systematic evaluation ([5]) that the cepstral filter is
insensitive to moderate image degradations due to rotation or scaling (6 degrees, 6%).
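The single-step maximum detection can be illustrated with a 1D power-cepstrum sketch of the butted-patch idea of [11]. The patch layout, the search window, and the sign convention below are my assumptions, not the parameters used in [5]:

```python
import numpy as np

def cepstral_disparity(left, right):
    """Estimate the disparity between two 1D patches from the power cepstrum
    of their concatenation. With patches of length n butted against each other,
    a content shift of d places an echo peak at quefrency n - d; the peak is
    searched in a window of quefrencies around n."""
    left = np.asarray(left, dtype=float)
    right = np.asarray(right, dtype=float)
    n = len(left)
    f = np.concatenate([left, right])                      # double signal
    power = np.abs(np.fft.fft(f)) ** 2                     # power spectrum
    cep = np.abs(np.fft.fft(np.log(power + 1e-12))) ** 2   # power cepstrum
    start = n - n // 2
    window = cep[start : n + n // 2]                       # quefrencies near n
    return n - (start + int(np.argmax(window)))
```

For two identical patches the echo sits exactly at quefrency n, so a disparity of zero is recovered; no iterative optimization is involved, matching the single-step property claimed in the text.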
Geometry. It is reasonable to assume that physical surfaces in the natural envi-
ronment are piecewise smooth and can hence be approximated locally by their Taylor
series expansion. We mathematically investigated the distortions in the disparity field
when fixating planes and second-order surfaces. For a given point in 3D space, let the
left image coordinates be l = (x_L, y_L). Then the right image coordinates r = (x_R, y_R)
can be computed in the first case to be:

    x_R = (a_1 x_L + a_2 y_L) / (b_1 x_L + b_2 y_L + b_3)   and   y_R = c y_L / (b_1 x_L + b_2 y_L + b_3)   (1)

where a_i, b_i and c are constants with respect to a given stereo arrangement and local
surface orientation (see [5] for further details on formulae and (graphical) results).
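The relation (1) is straightforward to evaluate; a minimal sketch (the parameter values in the usage note are an arbitrary illustration, not a calibrated stereo arrangement):

```python
def right_image_coords(xl, yl, a1, a2, b1, b2, b3, c):
    """Right-image coordinates (x_R, y_R) of a left-image point (x_L, y_L)
    under the planar relation of Eq. (1); the constants a_i, b_i and c encode
    the stereo arrangement and the local surface orientation."""
    denom = b1 * xl + b2 * yl + b3
    return (a1 * xl + a2 * yl) / denom, c * yl / denom
```

For the degenerate parameter choice a1 = b3 = c = 1 and all other constants zero, the map reduces to the identity, i.e. zero disparity everywhere.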
Evaluation. The cepstral filter as used in the literature with rectangular windowing
functions suffers from some specific problems: if, for example, the double signal contains
a single straight edge segment, then up to five additional maxima may appear in the
cepstrum. In the case of varying illumination, an additional peak at zero disparity may
appear. These and other deficiencies can be overcome with different support functions.
Fig. 1. Cepstrum with Gaussian support functions. Left: double signal f(x) composed from
data of the left and right image at the same (retinal) locations, multiplied by a Gaussian window
and added with a fixed offset. Center: amplitude spectrum. Right: Cepstrum{f(x, y)}. To enhance
the visual impression, log(.) is displayed in the center and right image, and a small region around
the origin has been removed (right only).
Fig. 2. Local depth map computed by the improved algorithm using equal Gaussian window
functions for the left and right image with a previous LoG filtering step (the rectangles only outline
the subdivision of the image). The image pair has been taken at a distance of 2 meters with stereo
base length 7.00 cm, using a precision adjusting device to produce the fixating arrangement. The
(foveal) angle of extent is 200 minutes of arc. As can be observed, the algorithm fails if one of
the two indicated conditions holds (see arrows).
References
1. S.T. Barnard and M.A. Fischler. Computational and Biological Models of Stereo Vision. In Proc. IU Workshop, Pittsburgh, PA, USA, September 11-13 (1990) 439-448
2. B.P. Bogert, M.J.R. Healy, and J.W. Tukey. The quefrency alanysis of time series for echoes: cepstrum, cross-cepstrum, and saphe cracking. In Proceedings: Symposium on Time Series Analysis (1963) 209-243
3. U.R. Dhond and J.K. Aggarwal. Structure from Stereo - A Review. IEEE Trans. on Systems, Man, and Cybernetics, 19(6) (1989) 1489-1510
4. B.M. Dow. Nested maps in macaque monkey visual cortex. In K.N. Leibovic, editor, The Science of Vision, Springer, New York (1990) 84-124
5. K.-O. Ludwig. Untersuchung der Cepstrumtechnik zur Querdisparitätsbestimmung für die Tiefenschätzung bei fixierenden Stereokonfigurationen. Technical Report, Fachbereich Informatik, Universität Hamburg (1991)
6. K.-O. Ludwig, B. Neumann, and H. Neumann. Robust Estimation of Local Stereoscopic Depth. In International Workshop on Robust Computer Vision (IWRCV '92), Bonn, Germany, October (1992)
7. H.A. Mallot, W. von Seelen, and F. Giannakopoulos. Neural Mapping and Space-Variant Image Processing. Neural Networks, 3 (1990) 245-263
8. D. Marr. Vision. W.H. Freeman and Company, San Francisco (1982)
9. T.J. Olson and D.J. Coombs. Real-Time Vergence Control for Binocular Robots. Technical Report 348, Department of Computer Science, University of Rochester (1990)
10. T. Poggio, V. Torre, and C. Koch. Computational vision and regularization theory. Nature, 317 (1985) 315-319
11. Y. Yeshurun and E.L. Schwartz. Neural Maps as Data Structures: Fast Segmentation of Binocular Images. In E.L. Schwartz, editor, Computational Neuroscience, Chap. 20, The MIT Press (1990) 256-266
This article was processed using the LaTeX macro package with ECCV92 style
Depth Computations from Polyhedral Images *
Gunnar Sparr
Dept. of Mathematics, Lund Institute of Technology,
Box 118, S-22100 Lund, Sweden
1 Introduction
The topic of this paper is depth computation and scene reconstruction in the case when
the scene is built up of planar surface patches, bounded by polygons. Given only this
information, no quantitative information can be drawn from one single image. A common
situation is that the scene contains patches which are parallelograms, often rectangles.
It will be seen that under this rather weak assumption, without any knowledge about
the sizes of these parallelograms, it is possible to compute a depth map over the image,
modulo a common scaling factor. The method may also be used for patches of other
shapes. No camera calibration is needed.
The approach is inspired by the subjective experience that depth information seems to
be contained in the shape of an image of e.g. a rectangle. In a series of papers, e.g. [8], [9],
[10], [11], this hypothesis has been verified in quantitative terms. In the present paper,
emphasis will be laid on examples and experiments, rather than on the mathematical
theory. For a thorough treatment of the latter, see [9].
The organization of the paper is as follows. In Sect. 2, the concept of 'shape' is
described, with examples. In Sect. 3 the same is done for 'depth', with some new
theorems. Sect. 4 describes a simple experiment, illustrating the applicability of
the method for realistic data. The robustness properties are also investigated. In Sect. 5,
some degeneracies that may occur are treated. In Sect. 6, finally, the results and their
relations to previous work are discussed.
Throughout the paper, it is assumed that the correspondence problem is solved be-
forehand, i.e. that a set of point matches between points in the image and in the scene
is established.
2 Shape
Instead of working with individual points, we work with m-point configurations, by which
is meant ordered sets of points, planar or non-planar,

    X = (X^1, ..., X^m).

* The work has been supported by the Swedish National Board for Industrial and Technical
Development (NUTEK).
It turns out to be fruitful to work with a kind of duality (for motivations and proofs,
see e.g. [9]), and consider the linear relations that exist between the points belonging
to a particular configuration. It can be proved that the set (1) below is independent of
coordinate representations. The following definition plays a crucial role in the sequel.

Definition. The shape of the configuration X is the linear space

    s(X) = { ξ = (ξ_1, ..., ξ_m) : Σ_i ξ_i X^i = 0, Σ_i ξ_i = 0 }.   (1)
For a planar 4-point configuration, writing X^4 in terms of X^1, X^2, X^3 gives

    X^1X^4 = ξ_2 X^1X^2 + ξ_3 X^1X^3,   i.e.   X^4 = ξ_1 X^1 + ξ_2 X^2 + ξ_3 X^3,

where the coefficient ξ_1 of the null-vector X^1X^1 is chosen so that the coefficient sum
vanishes. This construction determines the shape-vector (ξ_1, ξ_2, ξ_3, -1) in the case of
4-point configurations. Any multiple of this vector belongs to s(X) too.
Above, the points X^1, X^2, X^3 form what is called an affine basis for the plane. The
coordinates ξ_1, ξ_2, ξ_3, with Σ ξ_i = 1, are called the affine coordinates of X^4. An analo-
gous construction can be done in space. Thus, if X^1, X^2, X^3, X^4 are vertices of a non-
degenerate tetrahedron, they form an affine basis, and X^5 can be described by its affine
coordinates ξ_1, ..., ξ_4, with Σ ξ_i = 1. Again s(X) is a one-dimensional linear space.
For two-dimensional configurations with more than 4 points, and three-dimensional
configurations with more than 5 points, s(X) is a linear space of higher dimension.
Generally, for the shape of an m-point configuration, the following can be said:

    dim s(X) = m - 2 if the points are collinear,
    dim s(X) = m - 3 if the points are coplanar, but not collinear,   (2)
    dim s(X) = m - 4 if the points are not coplanar.
Example 1. Fig. 2 shows two 2D configurations and one 3D configuration. The dotted lines have
no other meaning than to indicate relationships between the points.
In the left configuration, the point X^4 is the centroid of the triangle with vertices
in X^1, X^2, X^3. In the middle one, two sides are parallel. The right configuration is a
"joint" in 3D, consisting of two rectangular 4-point configurations. Bases for the shapes
of these configurations are shown. They may be computed e.g. by means of the affine
basis construction above. A natural way to select a basis for the joint is by means of the
planar subconfigurations, as is done in the figure, but other choices are possible too.
To be of practical use, one needs algorithms for the computation of shape. One such
algorithm suggests itself by the definition, namely solving a homogeneous system
of linear equations. By means of e.g. a row echelon algorithm, a basis for s(X) can be
computed.
In the definition of shape, the coordinate invariance is very important. It makes it
possible to compute the shape of an image configuration from intrinsic measurements
in the image plane, in terms of an arbitrary coordinate representation. The same holds
for the object configuration. These shapes can thus be computed independently, without
reference to the imaging process.
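The homogeneous system above can be solved numerically with an SVD null space; the SVD in place of the row echelon algorithm, and the rank tolerance, are my substitutions for this sketch:

```python
import numpy as np

def shape(points, tol=1e-9):
    """Basis for the shape space s(X): all xi with sum_i xi_i X^i = 0 and
    sum_i xi_i = 0, computed as the null space of the stacked linear system."""
    X = np.asarray(points, dtype=float)        # m points, one per row
    A = np.vstack([X.T, np.ones(len(X))])      # coordinates plus the sum-zero row
    _, s, vt = np.linalg.svd(A)
    rank = int(np.sum(s > tol))
    return vt[rank:]                           # rows span s(X)
```

For four coplanar points the result is one-dimensional, in agreement with formula (2); e.g. for a triangle with its centroid as fourth point the basis vector is proportional to (1, 1, 1, -3).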
In the rest of this paper, point configurations defined by the vertices of polyhedral
objects will be considered. (Here the word 'polyhedral' is used in a wide sense for con-
figurations, not necessarily solid, built up of planar polygonal patches.) Besides being a
point configuration of its own,

    X = (X^1, ..., X^n),

such a configuration has a lot of additional structure. In fact, each of the f polygonal
faces of the object contributes a sub-configuration, defined by the vertices of the
polygon:

    X_i = (X_i^1, ..., X_i^{m_i}),   i = 1, ..., f.

The whole configuration may be considered as an ordered set of these sub-configurations,

    ∂X = (X_1, ..., X_f).

We will work in parallel with both these representations X and ∂X, and by abuse of
notation write ∂X in both instances.
3 Depth
In this section it will be examined how the shape, as defined above, transforms under
projective transformations. The results are fundamental for the applications below.
By a perspectivity with center Z in 3-space is meant a mapping with the property
that every point on a line through Z is mapped onto the intersection of the line with
some plane π, the image plane, where Z is not in π. For a perspectivity between two m-point
configurations, X → Y, there exist α_i, i = 1, ..., m, such that

    ZX^i = α_i ZY^i,   i = 1, ..., m.

Here α_i is called the depth of X^i with respect to Y^i, i = 1, ..., m, and the vector
α = (α_1, ..., α_m) is called the depth of X with respect to Y.
By a projectivity is meant a composition of perspectivities. The product of the depths
of these perspectivities defines the depth of the projectivity. (It can be shown, cf. [9],
that this product is independent of the decomposition.)
The following theorem shows how the knowledge of the shapes of two configurations
X and Y makes it possible to characterise all projectivities P such that Y = P(X). In
particular, pose information is attained about the location of X relative to Y.
Example 2. The right hand configuration of Fig. 2 was said to illustrate a three-dimensio-
nal figure, a joint, with rectangular faces. Looking upon the figure as it is printed on the
paper, i.e. as a two-dimensional perspective image of the joint, measurements in the
image give that its two parts have the following shapes:
s((y1, y2, y3, y4)) = (01, 02, r/s, 74) = (1.1, - 1 , -1.2, 1.1) ,
s((yS, y4, yS, y6)) = (r/s, r/4, 75, 76) = (-1.2, 1.1, 1.1, - 1 ) .
The result of applying Theorem 2 to the two parts separately may be summarised in a
matrix equation

          [ 1   0 ]   [ 1.1   0   ]
          [-1   0 ]   [-1.0   0   ]
  diag(α) [-1   1 ] = [-1.2  -1.2 ] diag(1, -1)     (3)
          [ 1  -1 ]   [ 1.1   1.1 ]
          [ 0  -1 ]   [ 0     1.1 ]
          [ 0   1 ]   [ 0    -1.0 ]

Here the diagonal matrix on the right hand side is needed to adjust for the arbitrariness
in the choice of the columns of the 6 x 2 matrices. The system has the depth solution

  α^T = [α_1 α_2 α_3 α_4 α_5 α_6] = [1.1 1.0 1.2 1.1 1.1 1.0] .
For noisy data, equation (4) can't be expected to be satisfied exactly. Moreover,
the system in α is in general overdetermined. In the next section it will be solved in the
least squares sense.
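In a least-squares reading of equation (3), each row constrains exactly one depth α_i, so the overdetermined system can be solved row by row. The following NumPy sketch is an illustration of this computation, not the authors' implementation; the matrices are those of Example 2:

```python
import numpy as np

def depths_least_squares(S_X, S_Y, D=None):
    """Solve diag(alpha) @ S_X = S_Y @ D for alpha in the least-squares
    sense. Row i gives the small overdetermined system
    alpha_i * S_X[i, :] = (S_Y @ D)[i, :]."""
    B = S_Y if D is None else S_Y @ D
    num = np.einsum('ij,ij->i', S_X, B)   # row-wise dot products <S_X[i], B[i]>
    den = np.einsum('ij,ij->i', S_X, S_X)
    return num / den

# Shape matrices of Example 2 (rectangular faces give entries 0, +-1).
S_X = np.array([[ 1,  0], [-1,  0], [-1,  1],
                [ 1, -1], [ 0, -1], [ 0,  1]], dtype=float)
S_Y = np.array([[ 1.1,  0.0], [-1.0,  0.0], [-1.2, -1.2],
                [ 1.1,  1.1], [ 0.0,  1.1], [ 0.0, -1.0]])

alpha = depths_least_squares(S_X, S_Y, np.diag([1.0, -1.0]))
print(alpha)   # alpha ~ (1.1, 1.0, 1.2, 1.1, 1.1, 1.0), as in the text
```

For exact data the row-wise least-squares solution coincides with the depth solution quoted above; with noise it gives the closest fit per row.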
For projective mappings from one plane to another, the following theorem gives an
analytic expression for the depth function.
4 An Experiment
A simple experiment will illustrate how the theory may be used. In Fig. 4 is shown an
image of a corridor scene, containing a number of rectangular objects: two doors, part of
a wall, a board and the faces of a box. The doors are at approximate distances 6 m and
12 m from the camera. The dimensions of the box are 35 x 35 x 20 cm. The size of the
image is about 300 x 400 pixels, where the box occupies about 50 x 50 pixels.
Wall+doors
         1     2     3     4     5     6     7     8
  comp 1.00  1.01  1.19  1.20  2.08  2.09  2.23  2.23
  meas 1.00  1.02  1.20  1.22  2.07  2.08  2.21  2.22

Board
         9    10    11    12
  comp 1.35  1.35  1.92  1.92
  meas 1.35  1.36  1.90  1.91

Box
         A     B     C     D     E     F     G
  comp 1.02  1.01  1.02  1.00  1.07  1.05  1.05
  meas 1.03  1.01  1.03  1.00  1.07  1.05  1.05
To get a feeling for the robustness, rectangularly distributed noise of ±1 pixels was
added to each coordinate of the points of interest above. New depth-values were then
computed. In order to compare homogeneous depth-vectors, the normalised differences

  α_ref - (|α_ref| / |α_new|) α_new

were computed, where α_ref stands for the computed depth values of Fig. 4. Table 1
shows the outcome of 10 random simulations. For the wall+doors, the results indicate
good robustness properties, while for the box the results are not equally favourable. The
latter isn't surprising, since the object is small and so distant that the perspective effects
are small.
Table 1. Effects of noise.
 Wall+doors                                                Box
     1     2     3     4     5     6     7     8     A     B     C     D     E     F     G
 -0.01  0.00  0.00 -0.01  0.02 -0.01  0.00 -0.01  0.00  0.00  0.00 -0.06  0.01 -0.03  0.08
  0.00  0.00  0.02  0.01  0.03 -0.02  0.00 -0.02 -0.04  0.02 -0.01  0.00 -0.01  0.00  0.04
  0.01  0.00 -0.01  0.00 -0.01  0.00  0.00  0.01 -0.02 -0.06  0.02  0.02  0.02  0.05 -0.03
 -0.01  0.00  0.00 -0.01  0.02 -0.01  0.02 -0.01  0.03  0.01  0.05  0.05 -0.05 -0.02 -0.05
  0.02 -0.02  0.01 -0.01 -0.01  0.01  0.00  0.01 -0.02 -0.04  0.03 -0.01  0.06  0.00 -0.03
  0.01  0.00  0.00  0.01  0.00 -0.02  0.02 -0.02  0.00 -0.01  0.00  0.01  0.04  0.01 -0.04
  0.00  0.00  0.00  0.01  0.00  0.02 -0.01 -0.01  0.01  0.01 -0.04  0.01  0.09  0.02 -0.09
 -0.01  0.01 -0.01  0.01  0.00  0.04 -0.03 -0.01 -0.02  0.01 -0.01  0.04 -0.08  0.02  0.05
 -0.02  0.00  0.00  0.01  0.00  0.00  0.00  0.00  0.00  0.03 -0.06 -0.04  0.00  0.01  0.06
  0.00 -0.01  0.01  0.00  0.03 -0.03  0.01 -0.02 -0.03  0.01 -0.01  0.02 -0.02  0.02  0.02
5 Degeneration
By construction, cf. Theorem 3, all columns of S_Y belong to the shape s(O_Y). The
same holds for S_X, s(O_X). From the depth consistency (4) it follows that S_Y and S_X have
the same ranks. Hence dim s(O_X) ≥ rank S_X = rank S_Y. From (2) it is known that the
m-point configuration O_X is non-planar if and only if dim s(O_X) = m - 4. Combining
these facts and definitions, we have proved the sufficiency part of the following theorem.
The necessity is omitted here.
Theorem 5. O_Y is an impossible picture if and only if rank S_Y ≥ m - 3.
Having an image of a true three-dimensional polyhedral scene, this rank condition
must be fulfilled. If it is violated because of noise, it may be possible to "deform" O_Y to
fulfill the condition. In doing this, not every deformation can be allowed. Let us say that a
deformation is admissible if it doesn't change the topological and shape properties of the
configuration, where the latter claim may be formulated: Two configurations O_Y and O'_Y
are topologically shape-equivalent iff for every choice of matrix S_Y in Theorem 3, there
exists a corresponding matrix S'_Y for O'_Y, such that their non-vanishing elements have the
same distributions of signs. This gives a constructive criterion, possible to use in testing.
As a final definition, we say that O_Y is a correctable impossible picture if there exists an
admissible deformation which makes rank S_Y ≤ m - 4. An example of a configuration
with this property is given in Fig. 5. For a method to find admissible deformations, see
[7].
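The rank condition of Theorem 5 is straightforward to test numerically once a matrix S_Y is available. A hedged sketch follows: the numerical rank is taken from the singular values above a tolerance, and both example matrices are hypothetical 6-point illustrations, not data from the paper:

```python
import numpy as np

def is_impossible_picture(S_Y, m, tol=1e-6):
    """Theorem 5: the picture is impossible iff rank S_Y >= m - 3,
    with m the number of points. The rank is computed numerically
    from the singular values above a relative tolerance."""
    s = np.linalg.svd(S_Y, compute_uv=False)
    rank = int(np.sum(s > tol * s[0]))
    return rank >= m - 3

# A rank-2 matrix for m = 6 points satisfies rank <= m - 4 = 2: consistent.
S_ok = np.array([[ 1.,  0.], [-1.,  0.], [-1.,  1.],
                 [ 1., -1.], [ 0., -1.], [ 0.,  1.]])
print(is_impossible_picture(S_ok, m=6))    # False

# A generic 6 x 3 matrix has rank 3 >= m - 3: an impossible picture.
rng = np.random.default_rng(0)
S_bad = rng.standard_normal((6, 3))
print(is_impossible_picture(S_bad, m=6))   # True
```

The tolerance plays the role of the noise threshold discussed above: a rank that is violated only marginally may indicate a correctable, rather than absolutely, impossible picture.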
[Fig. 5 (line drawing omitted). The matrix printed with the figure reads:

    0.65  -0.60   0.00
   -0.65   0.00  -0.61
    0.00   0.60   0.54
   -1.00   1.00   0.00
    1.00   0.00   1.06
    0.00  -1.00  -1.00  ]
A more severe situation is met when it is impossible to correct the picture by means
of admissible deformations. We then talk about an absolutely impossible picture. When
dealing with such an image, one knows that the topology of the object isn't what it
seems to be in the image. Accidental alignments or occlusions have occurred, and must
be discovered and loosened.
A celebrated example of an "impossible picture" in the human sense is the tribar of
Fig. 5. It is alternately called the "Reutersvärd tribar" or the "Penrose tribar", after two
independent discoveries (1934 and 1958, respectively). For historical facts, see the article
of Ernst in [1]. For this configuration it can be proved that there exists no admissible
deformation which makes the tribar fulfill the rank condition of Theorem 5. For more
details, see [10], [11]. In terms of the concepts introduced above, the discussion of this
section may be summarised:
- The truncated tetrahedron is a correctable impossible picture.
- The Reutersvärd-Penrose tribar is an absolutely impossible picture.
6 Discussion
Above, a method has been presented for the computation of depth, modulo scale, from
one single image of a polyhedral scene, under the assumption of known point correspondences
between scene and image. Only affine information about the scene is used, e.g. that the
objects contain parallelogram patches; nothing about their sizes. Other affine shapes may
be used as well. In the image, no absolute measurements are needed, only relative (affine)
ones. The image formation is supposed to be projective, but the method is insensitive to
affine deformations in the image plane. No camera parameters are needed. The problem
considered may be called an "affine calibration problem", with a solution in terms of
relative depth values. The weak assumptions give the method good robustness properties. All
computations are linear.
The relative depth values may be combined with metrical information to solve the full
(metrical) calibration problem (cf. [8], [9]). That problem is usually solved by methods
that make extensive use of distances and angles, cf. [3] for an overview.
Relative depth information is also of interest in its own right. For instance, in the case
of rectangular patches in the scene, the relative depth values may be interpreted as the
"motion" of the camera relative to a location from which the patch looks like a rectangle.
Looked upon in this way, our approach belongs to the same family as [2], [4], [6].
Crucial to the approach is the use of affine invariants (the 'shape'). In this respect
the work is related to methods for recognition and correspondence, cf. [5].
In the last part of the paper an approach to the line drawing interpretation
problem is sketched. Its relations to other methods, notably the one of [12], need further
investigation.
References
1. Coxeter, H.S.M., Emmer, M., Penrose, R., Teuber, M.L.: M.C. Escher: Art and Science.
Elsevier, Amsterdam (1986)
2. Faugeras, O.D.: What can be seen in three dimensions with an uncalibrated stereo rig?
Proc. ECCV92 (1992) (to appear)
3. Horn, B.K.P.: Robot Vision. MIT Press, Cambridge, MA. (1986)
4. Koenderink, J.J., van Doorn, A.J.: Affine Structure from Motion. J. of the Opt. Soc. of
America (1992) (to appear)
5. Lamdan, Y., Schwartz, J.T., Wolfson, H.J.: Affine Invariant Model-Based Object Recogni-
tion. IEEE Trans. Robotics and Automation 6 (1990) 578-589
6. Mohr, R., Morin, L., Grosso, E.: Relative positioning with poorly calibrated cameras. In
Proc. DARPA-ESPRIT Workshop on Applications of Invariance in Computer Vision (1991)
7. Persson, A.: A method for correction of images of origami/polyhedral objects. Proc. Swedish
Society for Automated Image Analysis. Uppsala, Sweden. (1992) (to appear)
8. Sparr, G., Nielsen, L.: Shape and mutual cross-ratios with applications to exterior, interior
and relative orientation. Proc. Computer Vision - ECCV90. Springer Verlag, Lect. Notes
in Computer Science (1990) 607-609
9. Sparr, G.: Projective invariants for affine shapes of point configurations. In Proc. DARPA-
ESPRIT Workshop on Applications of Invariance in Computer Vision (1991)
10. Sparr, G.: Depth computations from polyhedral images, or: Why is my computer so dis-
tressed about the Penrose triangle. CODEN:LUFTD2(TFMA-91)/7004, Lund (1991)
11. Sparr, G.: On the "reconstruction" of impossible objects. Proc. Swedish Society for Auto-
mated Image Analysis. Uppsala, Sweden. (1992) (to appear)
12. Sugihara, K.: Mathematical Structures of Line Drawings of Polyhedrons - Toward Man-
Machine Communication by Means of Line Drawings. IEEE Trans. Pattern Anal. Machine
Intell. 4 (1982) 458-469
This article was processed using the LaTeX macro package with ECCV92 style
Parallel Algorithms for the Distance Transformation
1 Introduction
The sequential algorithm is a known algorithm [1] consisting of two passes during which
the image is traversed, once from top to bottom and from left to right, and the second
time in reverse order. When a pixel is processed, its distance value (infinity if not yet
determined) is compared to the distance value of a number of neighbours augmented by
their relative distance and is replaced by the smallest resulting value. This causes the
distance values to propagate from the object boundaries in the direction of the scan and
yields, after the second pass, the correct DT-values.
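The two-pass scheme can be rendered directly in code. The following is a straightforward, unoptimised Python sketch of the sequential algorithm, not the authors' implementation; the mask costs a = 3, b = 4 give the Chamfer 3-4 distance of Fig. 1 (a = 1, b = 2 would reproduce the City Block distance):

```python
import numpy as np

INF = 10**9  # stands in for "not yet determined"

def chamfer_dt(image, a=3, b=4):
    """Two-pass sequential chamfer distance transform.
    `image` is boolean, True for object (feature) pixels; a is the cost
    of an edge neighbour, b the cost of a diagonal neighbour."""
    h, w = image.shape
    d = np.where(image, 0, INF).astype(np.int64)
    # Forward pass: top-to-bottom, left-to-right, relaxing against the
    # already-visited neighbours (above and to the left).
    for y in range(h):
        for x in range(w):
            for dy, dx, c in ((0, -1, a), (-1, 0, a), (-1, -1, b), (-1, 1, b)):
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w:
                    d[y, x] = min(d[y, x], d[ny, nx] + c)
    # Backward pass: reverse order, symmetric neighbours.
    for y in range(h - 1, -1, -1):
        for x in range(w - 1, -1, -1):
            for dy, dx, c in ((0, 1, a), (1, 0, a), (1, 1, b), (1, -1, b)):
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w:
                    d[y, x] = min(d[y, x], d[ny, nx] + c)
    return d

img = np.zeros((5, 5), dtype=bool)
img[2, 2] = True                 # one foreground pixel in the centre
print(chamfer_dt(img))           # centre 0, edge neighbours 3, diagonals 4, ...
```

After the two passes the distance values have propagated from the object boundary across the whole image, as described above.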
* The following text presents research results of the Belgian Incentive Program "Information
Technology" - Computer Science of the future, initiated by the Belgian State - Prime Minis-
ter's Service - Science Policy Office. The scientific responsibility is assumed by its authors.
     +1                +4 +3 +4
  +1  0 +1             +3  0 +3
     +1                +4 +3 +4

  City Block distance  Chamfer 3-4 distance
Fig. 1. These masks show for the indicated distance measures the distance between the central
pixel and the neighbouring pixels. The distance between two image points a and b is defined
as the sum of the distances between neighbouring pixels in the path connecting a and b, that
minimizes this sum.
Fig. 2. The DT of an image with one foreground pixel centered in the middle of the image
for the City Block and Chamfer 3-4 distances. Growing distance is represented by a greytone
repeatedly varying from black to white (to accentuate the contours of the DT).
Parallelism is introduced by the 'divide-and-conquer' principle. This means that the im-
age is subdivided into as many subregions as there are processors available; the operation
to be parallelized, in our case the DT, is computed on each subregion separately and
these local DTs have to be used to compute the global DT on the image. Let LDT (local
DT) denote the DT applied to a subregion or, where indicated, a union of neighbouring
subregions, and let GDT (global DT) denote the DT applied to the whole image.
The algorithm consists of the next three steps:
I. On each subregion the LDT is computed for the boundary pixels of that subregion.
II. The GDT values for the boundary pixels are computed out of the LDT values.
III. On each subregion the GDT values for the internal pixels are determined out of the
GDT values for the boundary pixels and the local image information. We call this
part IDT (internal DT).
The first step could be done by executing the sequential DT algorithm on each sub-
region and retaining the boundary values. However, in [3] we present a shorter one pass
algorithm which traverses each pixel at most once.
For step II we consider two possible solutions. In the first solution (hierarchical
algorithm) we consider a sequence of gradually coarser partitions P_l (l =
1, 2, ..., L = log2 p) of the image, with the finest partition P_1 being the chosen parti-
tion of the image containing as many subregions as there are processors available. Each
of the other partitions P_l (l > 1) consists of subregions that are the union of two subre-
gions of P_{l-1}. The coarsest partition P_L contains as its only subregion the image itself. The
LDT on partition P_l is defined as the result of the DT on each of the subregions of P_l
separately. In this approach we calculate from the LDT on P_l for the boundary pixels of
its subregions the corresponding values on P_{l+1}, for l = 1, 2, ..., L - 1. The values of the
LDT on partition P_L are by definition the GDT values. Then the GDT values for the
boundary pixels of the subregions of P_l are computed for decreasing l. This approach is
similar to the hierarchical approach we used for component labelling [4].
These computations can be implemented in two ways. In the first approach (agglom-
erated computation), on a particular recursion level l each subregion of P_l is processed
by one processor. This means that processors become idle on higher recursion levels. In
an alternative implementation (distributed computation), pixel values of a subregion are
not agglomerated into one processor, but are distributed in such a way that each processor
contains a part of the boundary of one subregion.
The second solution (directional algorithm) for step II consists of an inter-subregion
propagation in successive directions. The feasibility of this approach, however, and the
complexity of the resulting algorithm depend on the distance measure used.
Step III of the parallel algorithm is done by executing the sequential algorithm
on each subregion, starting from the original image and the GDT values obtained in
step II.
We refer to [3] for a full description and correctness proof of the algorithms.
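Steps I-III above can be illustrated by a sequential simulation. The sketch below decomposes the image into vertical strips and uses the City Block distance; a brute-force minimisation over boundary pixels stands in for the hierarchical and directional schemes of step II, so it illustrates the three-step structure, not the paper's complexity bounds:

```python
import numpy as np

INF = 10**9

def two_pass(d):
    """Seeded two-pass City Block relaxation: afterwards
    d[p] = min over all pixels q of d0[q] + |p - q|_1 (d0 = initial seeds)."""
    h, w = d.shape
    for y in range(h):
        for x in range(w):
            if x > 0: d[y, x] = min(d[y, x], d[y, x - 1] + 1)
            if y > 0: d[y, x] = min(d[y, x], d[y - 1, x] + 1)
    for y in range(h - 1, -1, -1):
        for x in range(w - 1, -1, -1):
            if x < w - 1: d[y, x] = min(d[y, x], d[y, x + 1] + 1)
            if y < h - 1: d[y, x] = min(d[y, x], d[y + 1, x] + 1)
    return d

def parallel_dt_sketch(image, p):
    """Steps I-III on p vertical strips (sequential simulation)."""
    h, w = image.shape
    edges = np.linspace(0, w, p + 1).astype(int)
    d = np.where(image, 0, INF).astype(np.int64)
    for x0, x1 in zip(edges[:-1], edges[1:]):    # Step I: LDT per strip
        two_pass(d[:, x0:x1])
    bcols = sorted(set(edges[:-1]) | set(edges[1:] - 1))
    bpts = [(y, x) for x in bcols for y in range(h)]
    ldt = {q: int(d[q]) for q in bpts}
    # Step II: GDT on boundary pixels (brute force over boundary pixels).
    gdt = {(y, x): min(ldt[(yy, xx)] + abs(y - yy) + abs(x - xx)
                       for (yy, xx) in bpts) for (y, x) in bpts}
    # Step III: IDT per strip, seeded with the GDT boundary values.
    d = np.where(image, 0, INF).astype(np.int64)
    for (y, x), v in gdt.items():
        d[y, x] = min(d[y, x], v)
    for x0, x1 in zip(edges[:-1], edges[1:]):
        two_pass(d[:, x0:x1])
    return d

img = np.zeros((12, 18), dtype=bool)
img[2, 3] = img[9, 15] = True
# The decomposed computation agrees with the global DT.
assert np.array_equal(parallel_dt_sketch(img, 3),
                      two_pass(np.where(img, 0, INF).astype(np.int64)))
```

The agreement with the global DT relies on the City Block distance and on rectangular subregions, for which shortest paths can be taken monotone and hence decompose at strip boundaries.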
4 Asymptotical Complexity
The calculation of the LDT-values of the boundary pixels of a subregion, as well as the
IDT, is local and can be performed in an amount of time asymptotically proportional to
the number of pixels of the image.
The calculation of GDT-values out of LDT-values for the border pixels of the subre-
gions is global and consists of computation and communication. The latter can be divided
into the initiation and the actual transfer of messages. A summary of the complexity
figures for the global operations, derived in this section, is shown in Table 1. We assume an
image of n x n pixels and p processors.
Table 1. A summary of the complexity analysis of the global computations of the presented
DT algorithms for the CB distance.
Agglomerated Computation. Since the number of messages sent on each recursion level
is constant and since initiating a message takes constant time, the total start up time is
proportional to the number of recursion levels L = log 2 p.
The transfer time is proportional to the amount of data sent. The amount of data
sent on recursion level l is proportional to the boundary size of a subregion of P_l, being

  S_l = O(n 2^(l/2) / √p) .     (1)

Therefore the total transfer time is t_transfer = O(S_1 + ... + S_L) = O(n). The computational
complexity is also O(n), as the data are processed in linear time.
The total start-up time is therefore t_start-up = O(L) = O(log2 p), and the total
amount of execution and transfer time is t_transfer + t_comp = O(S_1 + ... + S_L) = O(n).
We used as test images a number of realistic images and a few artificial images, among
which the one of Fig. 2. The execution time of the sequential DT algorithm on one node
of the iPSC/2 is proportional to the number of pixels of the image and is typically about
800 ms for a 256 x 256 image. For images of this size the LDT is typically 100 ms.
The parallel efficiency, as a function of the size of the image, is shown in Fig. 3 for
a sample image. From the asymptotical complexity figures of section 4 we learn that for
large image sizes the execution time of the global computations is negligible with respect
to the execution time of the IDT and the LDT parts of the algorithm. The ratio of
the latter two mainly determines the parallel efficiency. For smaller images the LDT part
gets more important with respect to the IDT part. The image size for which the two
parts take an equal amount of time is typically 32 pixels for both distance measures. For
smaller images also the global computations get more important.
A factor that influences the efficiency too, is the load imbalance of the algorithm. It
occurs, when a part of the algorithm takes more time on one processor than on the others
and the processors have to wait for one another. A measure for the load imbalance of a
part of the algorithm is
  I = t_max / t_avg     (3)
[Plot omitted: two panels of parallel efficiency (25-75 %) versus image size (64-1024 pixels).]
Fig. 3. The parallel efficiency, as a function of the image size, for the image of Fig. 2, when the
hierarchical algorithm with agglomerated calculation or the directional algorithm is used.
with t_max and t_avg the maximal and average execution times of the part of the algorithm
under investigation. We can distinguish two sources of load imbalance.
A first source of load imbalance is caused by the data dependence of the LDT part of
the algorithm. This is practically unavoidable, because for most images at least one sub-
region contains a considerable amount of background pixels and determines the execution
time of the LDT part of the algorithm.
A second source of load imbalance is the data dependence of the IDT algorithm. This
part of the load imbalance I grows with the number of subregions. However, we can find
a hard upper limit for the possible load imbalance similar to the analysis in [4].
Acknowledgements
We wish to thank Oak Ridge National Laboratory for letting us use their iPSC/2 machine.
References
1 McGill University, Dept. of Electrical Engineering, Montréal, PQ, Canada H3A 2A7
2 University of California, Berkeley, Computer Science Division, Berkeley, CA, USA 94720
1 Introduction
Binocular stereopsis is based on the cue of disparity -- two eyes (or cameras) receive
slightly different views of the three-dimensional world. This disparity cue, which includes
differences in position, both horizontal and vertical, as well as differences in orientation
or spacing of corresponding features in the two images, can be used to extract the three-
dimensional structure in the scene. This depends, however, upon first obtaining a solution
to the correspondence problem. The principal constraints that make this feasible are:
* This work has been supported by a grant to DJ from the Natural Sciences and Engineering
Research Council of Canada (OGP0105912) and by a National Science Foundation PYI award
(IRI-8957274) to JM.
The difficulties with approaches based on area correlation are well known. Because of
the difference in viewpoints, the effects of shading can give rise to differences in brightness
for non-lambertian surfaces. A more serious difficulty arises from the effects of differing
amounts of foreshortening in the two views whenever a surface is not strictly fronto-
parallel. Still another difficulty arises at surface boundaries, where a depth discontinuity
may run through the region of the image being used for correlation. It is not even guar-
anteed in this case that the computed disparity will lie within the range of disparities
present within the region.
In typical edge-based stereo algorithms, edges are deemed compatible if they are near
enough in orientation and have the same sign of contrast across the edge. To cope with
the enormous number of false matches, a coarse-to-fine strategy may be adopted (e.g.,
Marr and Poggio, 1979; Grimson, 1981). In some instances, additional limits can be imposed,
such as a limit on the rate at which disparity is allowed to change across the
image (Mayhew, 1983; Pollard et al., 1985). Although not always true, assuming that
corresponding edges must obey a left-to-right ordering in both images can also be used
to restrict the number of possible matches and lends itself to efficient dynamic program-
ming methods (Baker and Binford, 1982). With any edge-based approach, however, the
resulting depth information is sparse, available only at edge locations. Thus a further
step is needed to interpolate depth across surfaces in the scene.
A third approach is based on the idea of first convolving the left and right images with
a bank of linear filters tuned to a number of different orientations and scales (e.g., Kass,
1983). The responses of these filters at a given point constitute a vector that characterizes
the local structure of the image patch. The correspondence problem can be solved by
seeking points in the other view where this vector is maximally similar.
Our contribution in this paper is to develop this filter-based framework. We present
techniques that exploit the constraints arising from viewing geometry and the assumption
that the scene is composed of piecewise smooth surfaces. A general viewing geometry is
assumed, with the optical axes converged at a fixation point, instead of the simpler
case of parallel optical axes frequently assumed in machine vision. Exploiting piecewise
smoothness raises a number of issues - - the correct treatment of depth discontinuities,
and associated occlusions, where unpaired points lie in regions seen only in one view.
We develop an iterative framework (Fig. 1) which exploits all these constraints to obtain
a dense disparity map. Our algorithm maintains a current best estimate of the viewing
parameters (to constrain vertical disparity to be consistent with epipolar geometry), a
visibility map (to record whether a point is binocularly visible or occluded), and a scale
map (to record the largest scale of filter not straddling a depth discontinuity).
[Fig. 1 (diagram): from a stereo pair of images, the framework iteratively refines estimates
of the viewing geometry (viewing parameters), occluded regions (visibility map) and depth
boundaries (scale map).]
In order to solve the correspondence problem, stereo algorithms attempt to match features
in one image with corresponding features in the other. Central to the design of these
algorithms are two choices: What are the image features to be matched? How are these
features compared to determine corresponding pairs?
It is important to recall that stereo is just one of many aspects of early visual pro-
cessing: stereo, motion, color, form, texture, etc. It would be impractical for each of
these to have its own specialized representation different from the others. The choice of
a "feature" to be used as the basis for stereopsis must thus be be constrained as a choice
of the input representation for many early visual processing tasks, not just stereo. For
the human visual system, a simple feature such as a "pixel" is not even available in the
visual signals carried out of the eye. Already the pattern of light projected on the retina
has been sampled and spatially filtered. At the level of visual inputs to the cortex, vi-
sual receptive fields are well approximated as linear spatial filters, with impulse response
functions that are the Laplacian of a two-dimensional Gaussian, or simply a difference of
Gaussians. Very early in cortical visual processing, receptive fields become oriented and
are well approximated by linear spatial filters, with impulse response functions that are
similar to partial derivatives of a Gaussian (Young, 1985).
Since "edges" are derived from spatial filter outputs, the detection and localization of
edges may be regarded as an unnecessary step in solving the correspondence problem. A
representation based on edges actually discards information useful in finding unambigu-
ous matches between image features in a stereo pair. An alternative approach, explored
here, is to treat the spatial filter responses at each image location, collectively called
the filter response vector, as the feature to be used for computing stereo correspondence.
Although this approach is loosely inspired by the current understanding of processing
in the early stages of the primate visual system (for a recent survey, DeValois and DeVal-
ois, 1988), the use of spatial filters may also be viewed analytically. The filter response
vector characterizes a local image region by a set of values at a point. This is similar to
characterizing an analytic function by its derivatives at a point. From such a representa-
tion, one can use a Taylor series approximation to determine the values of the function
at neighboring points. Because of the commutativity of differentiation and convolution,
the spatial filters used are in fact computing "blurred derivatives" at each point. The
advantages of such a representation have been described in some detail (Koenderink and
van Doom, 1987; Koenderink, 1988). Such a representation provides an efficient basis
for various aspects of early visual processing, making available at each location of the
computational lattice, information about a whole neighborhood around the point.
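A filter response vector of this kind can be sketched in a few lines. The following illustration builds a small bank of oriented first-derivative-of-Gaussian filters; the particular scales, orientation count and sampling below are arbitrary choices for the sketch, not the filter set of the paper:

```python
import numpy as np

def gaussian_deriv_filter(sigma, theta, size=None):
    """First derivative of a 2-D Gaussian, oriented along theta."""
    if size is None:
        size = int(np.ceil(3 * sigma)) * 2 + 1
    r = size // 2
    y, x = np.mgrid[-r:r + 1, -r:r + 1].astype(float)
    g = np.exp(-(x**2 + y**2) / (2 * sigma**2))
    gx = -x / sigma**2 * g          # d/dx of the Gaussian
    gy = -y / sigma**2 * g          # d/dy of the Gaussian
    return np.cos(theta) * gx + np.sin(theta) * gy

def response_vector(image, y, x, sigmas=(1.0, 2.0), n_orient=4):
    """Filter response vector at pixel (y, x): the inner product of each
    filter with the image patch centred there ('blurred derivatives').
    Assumes (y, x) is far enough from the border for the largest filter."""
    responses = []
    for sigma in sigmas:
        for k in range(n_orient):
            f = gaussian_deriv_filter(sigma, k * np.pi / n_orient)
            r = f.shape[0] // 2
            patch = image[y - r:y + r + 1, x - r:x + r + 1]
            responses.append(float(np.sum(f * patch)))
    return np.array(responses)

rng = np.random.default_rng(1)
img = rng.standard_normal((64, 64))
v = response_vector(img, 32, 32)
print(v.shape)                      # (8,): 2 scales x 4 orientations
```

Since each filter integrates to zero, a constant image patch yields the zero response vector; the vector only encodes local structure.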
The primary goal in using a large number of spatial filters, at various orientations,
phases, and scales is to obtain rich and highly specific image features suitable for stereo
matching, with little chance of encountering false matches. At this point, one might be
change of basis matrix. As an example of this decomposition, the orthonormal basis for
the set of filters in Fig. 2A is shown in Fig. 2B.
Fig. 2. A. Linear spatial filter set. B. Orthonormal basis set for vector space spanned by filters
in A.
One telltale sign of a poorly chosen set of filters is the presence of singular values that
are zero, or very close to zero. Consider, for example, a filter set consisting of the first
derivative of a Gaussian at four different orientations θ.
The vector space spanned by these four filters is only two dimensional. Only two filters
are needed, since the other two may be expressed as the weighted sum of these, and
thus carry no additional information. If one did not already know this analytically, this
procedure quickly makes it apparent. Such filters for which responses at a small number
of orientations allow the easy computation of filter responses for other orientations have
been termed steerable filters (Koenderink, 1988; Freeman and Adelson, 1991; Perona,
1991). For Gaussian derivatives in particular, it turns out that n + 1 different orientations
are required for the n th Gaussian derivative.
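This two-dimensionality is easy to verify numerically: stack first-derivative-of-Gaussian filters at four orientations and inspect the singular values. A sketch, with arbitrary filter size and scale:

```python
import numpy as np

def g1(theta, sigma=2.0, size=13):
    """First derivative of a 2-D Gaussian oriented along theta."""
    r = size // 2
    y, x = np.mgrid[-r:r + 1, -r:r + 1].astype(float)
    g = np.exp(-(x**2 + y**2) / (2 * sigma**2))
    return np.cos(theta) * (-x) * g + np.sin(theta) * (-y) * g

# Rows of the filter matrix: G1 at orientations 0, 45, 90, 135 degrees.
thetas = [k * np.pi / 4 for k in range(4)]
F = np.stack([g1(t).ravel() for t in thetas])
s = np.linalg.svd(F, compute_uv=False)
print(np.round(s / s[0], 6))   # -> [1. 1. 0. 0.]: only two significant values
```

Only two singular values are significantly non-zero, confirming that two of the four filters carry no additional information: G1 at angle θ is cos θ times the x-derivative plus sin θ times the y-derivative.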
As a further example, the reader who notes the absence of unoriented filters in Fig. 2A
and is tempted to enrich the filter set by adding a ∇²G, Laplacian of Gaussian, filter,
should think twice. This filter is already contained in the filter set in the sense that it may
be expressed as the weighted sum of the oriented filters G_{2,θ}(x, y). Similar filters, such
as a difference of Gaussians, may not be entirely redundant, but they result in singular
values close to zero, indicating that they add little to the filter set.
At the coarsest scales, filter responses vary quite smoothly as one moves across an
image. For this reason, the filter response at one position in the image can quite accurately
be computed from filter responses at neighboring locations. This means it is not strictly
necessary to have an equal number of filters at the coarser scales, and any practical
implementation of this approach would take advantage of this by using progressively
lower resolution sampling for the larger filter scales. Regardless of such an implementation
decision, it may be assumed that the output of every filter in the set is available at every
location in the image, whether it is in fact available directly or may be easily computed
from the outputs of a lower resolution set of filters.
2.3 Image Encoding and Reconstruction
What information is actually carried by the filter response vector at any given position
in an image? This important question is surprisingly easy to answer. The singular value
decomposition described earlier provides all that is necessary for the best least-squares
reconstruction of an image patch from its filter response vector. Since v = F^T I, and
F^T = U Σ V^T, the reconstructed image patch can be computed using the generalized
inverse (or the Moore-Penrose pseudo-inverse) of the matrix F^T:

  I' = V (1/Σ) U^T v .

The matrix 1/Σ is a diagonal matrix obtained from Σ by replacing each non-zero diagonal
entry σ_i by its reciprocal, 1/σ_i.
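The reconstruction formula can be checked with a small numerical sketch (random filters standing in for an actual filter bank): the reconstructed patch equals the orthogonal projection of the original patch onto the span of the filters.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pixels, n_filters = 100, 20
F = rng.standard_normal((n_pixels, n_filters))   # columns = filters
I = rng.standard_normal(n_pixels)                # image patch as a vector

v = F.T @ I                                      # filter response vector

# F^T = U S V^T, so (F^T)^+ = V (1/S) U^T and I' = V (1/S) U^T v.
U, s, Vt = np.linalg.svd(F.T, full_matrices=False)
I_rec = Vt.T @ np.diag(1.0 / s) @ U.T @ v

# Sanity checks: this matches the library pseudo-inverse, and I' is the
# projection of I onto the column space of F.
assert np.allclose(I_rec, np.linalg.pinv(F.T) @ v)
P = F @ np.linalg.pinv(F)                        # projector onto span(F)
assert np.allclose(I_rec, P @ I)
```

Because there are fewer filters than pixels, only the component of the patch lying in the filters' span is recovered, which is exactly the "necessarily incomplete" reconstruction discussed below.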
An example of such a reconstruction is given in Fig. 3. The finest detail is preserved
in the center of the patch where the smallest filters are used. The reconstruction is
progressively less accurate as one moves away from the center. Because there are fewer
filters than pixels in the image patch to be reconstructed, the reconstruction is necessar-
ily incomplete. The high quality of the reconstructed image, however, confirms the
fact that most of the visually salient features have been preserved. The reduction in the
number of values needed to represent an image patch means this is an efficient encoding
- - not just for stereo, but for other aspects of early visual processing in general. Since
this same encoding is used throughout the image, this notion of efficiency should be used
with caution. In terms of merely representing the input images, storing a number of filter
responses for each position in the image is clearly less efficient than simply storing the
individual pixels. In terms of carrying out computations on the image, however, there
is a considerable savings for even simple operations such as comparing image patches.
Encoded simply as pixels, comparing 30 x 30 image regions requires 900 comparisons.
Encoded as 60 filter responses, the same computation requires one-fifteenth as much
effort.
Fig. 3. Image reconstruction. Two example image patches (left) were reconstructed (right)
from spatial filter responses at their center. Original image patches masked by a Gaussian
(middle) are shown for comparison.
How should filter response vectors be compared? Although corresponding filter response
vectors in the two views should be very similar, differences in foreshortening and shading
mean that they will rarely be identical. A variety of measures can be used to compare
two vectors, including the angle between them, or some norm of their vector difference.
These and similar measures are zero when the filter response vectors are identical and
otherwise their magnitude is proportional to some aspect of the difference between po-
tentially corresponding image patches. It turns out that any number of such measures
do indistinguishably well at identifying corresponding points in a pair of stereo images,
except at depth discontinuities. Near depth discontinuities, the larger spatial filters lie
across an image patch containing the projection of more than one surface. Because these
surfaces lie at different depths and thus have different horizontal disparities, the filter
responses can differ considerably in the two views, even when they are centered on points
that correspond. While the correct treatment of this situation requires the notion of an
adaptive scale map (developed in the next section), it is helpful to use a measure such
as the L1 norm, the sum of absolute differences of corresponding filter responses, which
is less sensitive to the effect of such outliers than the L2 norm.
This matching error e_m is computed for a set of candidate choices of (h_r, v_r) in a win-
dow determined by a priori estimates of the range of horizontal and vertical disparities.
The (h_r, v_r) value that minimizes this expression is taken as the best initial estimate
of positional disparity at pixel (i,j) in the right view. This procedure is repeated for
each pixel in both images, providing disparity maps for both the left and right views.
Though these initial disparity estimates can be quite accurate, they can be substantially
improved using several techniques described in the next section.
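The windowed search just described can be sketched as follows. The cost function `toy` is a hypothetical stand-in with a known minimum, not the paper's filter-based matching error.

```python
import numpy as np

def best_disparity(cost, h_range, v_range):
    # Exhaustive search of a disparity window: evaluate the matching
    # error for every candidate (h, v) offset and keep the minimizer,
    # as described in the text.
    best = (None, None, np.inf)
    for h in h_range:
        for v in v_range:
            e = cost(h, v)
            if e < best[2]:
                best = (h, v, e)
    return best

# Toy cost with a unique minimum at (h, v) = (3, -1); a stand-in for the
# true filter-response matching error, which is not reproduced here.
toy = lambda h, v: (h - 3) ** 2 + (v + 1) ** 2
h, v, e = best_disparity(toy, range(-8, 9), range(-2, 3))
```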
An implementation of this approach using the outputs of a number of spatial filters
at a variety of orientations and scales as the basis for establishing correspondence has
proven to give quite good results, for random-dot stereograms, as well as natural and
artificial grey-level images. Some typical examples are presented here.
The recovered disparity map for a Julesz random-dot stereogram is presented in
Fig.4A. The central square standing out in depth is clearly detected. Disparity values
at each image location are presented as grey for zero horizontal disparity, and brighter
or darker shades for positive or negative disparities. Because these are offsets in terms
of image coordinates, the disparity values for corresponding points in the left and right
images should have equal magnitudes, but opposite signs. Whenever the support of the
filter set lies almost entirely on a single surface, the disparity estimates are correct.
Even close to depth discontinuities, the recovered disparity is quite accurate, despite the
responses from some of the larger filters being contaminated by lying across surfaces at
different depths.
In each view, there is a narrow region of the background just to one side of the near
central square that is visible only in one eye. In this region, there is no corresponding
point in the other view and the recovered disparity estimates appear as noise. Methods
for coping with these initial difficulties are discussed in later sections. In the lower panels
of the same figure, the measure of dissimilarity, e_m, between corresponding filter response
vectors is shown, with darker shades indicating larger differences. Larger differences are
clearly associated with depth discontinuities.
Fig. 4. Initial disparity estimates: random-dot stereogram and fruit. For the stereo pairs shown
(top), the recovered disparity map (middle) and dissimilarity or error map (bottom) are shown.
(fruit images courtesy Prof. N. Ahuja, Univ. Illinois)
When approached as a problem of determining which black dot in one view cor-
responds with which black dot in the other, the correspondence problem seems quite
difficult. In fact, Julesz random-dot stereograms are among the richest stimuli, con-
taining information at all orientations and scales. When the present approach based on
spatial filters is used, the filter response vector at each point proves to be quite distinctive,
making stereo-matching quite straightforward and unambiguous.
As an example of a natural grey-level image, a stereo pair of fruit lying on a table
cloth is shown in Fig. 4B. The recovered disparity values clearly match the shapes of
the familiar fruit quite well. Once again, some inaccuracies are present right at object
boundaries. The measure of dissimilarity, or error shown at the bottom of the figure
provides a blurry outline of the fruit in the scene. A mark on the film, present in one
view and not the other (on the cantaloupe), is also clearly identified in this error image.
As a final example, a ray-traced image of various geometric shapes in a three-sided
room is depicted in Fig. 5. For this stereo pair, the optical axes are not parallel, but
converged to fall on a focal point in the scene. This introduces vertical disparities between
corresponding points. Estimated values for both the horizontal and vertical disparities
are shown. Within surfaces, recovered disparity values are quite accurate, and there are
some inaccuracies right at object boundaries. Just to the right of the polyhedron in this
scene is a region of the background visible only in one view. The recovered disparity
values are nonsense, since even though there is no correct disparity, this method will
always choose one candidate as the "best". Another region in this scene where there
are some significant errors is along the room's steeply slanted left wall. In this case, the
large differences in foreshortening between the two views pose a problem, since the filter
responses at corresponding points on this wall will be considerably different. A method
for handling slanted surfaces such as this has been discussed in detail elsewhere (Jones,
1991; Jones and Malik, 1992).
Fig. 5. Initial disparity estimates: a simple raytraced room. For the stereo pair (top), the
recovered estimates of the horizontal (middle) and vertical (bottom) components of positional
disparity are shown.
4.1 Epipolar Geometry
By virtue of the basic geometry involved in a pair of eyes (or cameras) viewing a three-
dimensional scene, corresponding points must always lie along epipolar lines in the im-
ages. These lines correspond to the intersections of an epipolar plane (the plane through
a point in the scene and the nodal points of the two cameras) with the left and right
image planes. Exploiting this epipolar constraint reduces an initially two-dimensional
search to a one-dimensional one. Obviously, determination of the epipolar lines requires
knowledge of the viewing geometry.
The core ideas behind the algorithms to determine viewing geometry date back to
work in the photogrammetry community at the beginning of this century (for some histor-
ical references, see Faugeras and Maybank, 1990) and have been rediscovered and developed
in the work on structure from motion in the computational vision community. Given a
sufficient number of corresponding pairs of points in two frames (at least five), one can
recover the rigid body transformation that relates the two camera positions except for
some degenerate configurations. In the context of stereopsis, Mayhew (1982) and Gillam
and Lawergren (1983) were the first to point out that the viewing geometry could be
recovered purely from information present in the two images obtained from binocular
viewing.
Details of our algorithm for estimating viewing parameters may be found in (Jones
and Malik, 1991). We derive an expression for vertical disparity, v_r, in terms of image
coordinates, (i_r, j_r), horizontal disparity, h_r, and viewing parameters. This condition
must hold at all positions in the image, allowing a heavily over-constrained determination
of certain viewing parameters. With the viewing geometry known, the image coordinates
and horizontal disparity determine the vertical disparity, thus reducing an initially two-
dimensional search for corresponding points to a one-dimensional search.
4.2 Piecewise Smoothness
Since the scene is assumed to consist of piecewise smooth surfaces, the disparity map is
piecewise smooth. Exploiting this constraint requires some subtlety. Some previous work
in this area has been done by Hoff and Ahuja (1989). In addition to making sure that we
do not smooth away the disparity discontinuities associated with surface boundaries in
the scene, we must also deal correctly with regions which are only monocularly visible.
Whenever there is a surface depth discontinuity which is not purely horizontal, distant
surfaces are occluded to different extents in the two eyes, leading to the existence of
unpaired image points which are seen in one eye only. The realization of this goes back
to Leonardo da Vinci (translation in Kemp, 1989). This situation is depicted in Fig. 6.
Recent psychophysical work has convincingly established that the human visual sys-
tem can exploit this cue for depth in a manner consistent with the geometry of the
situation (Nakayama and Shimojo, 1990).
Any computational scheme which blindly assigns a disparity value to each pixel is
bound to come up with nonsense estimates in these regions. Examples of this can be
found by inspecting the occluded regions in Fig. 5. At the very minimum, the matching
algorithm should permit the labeling of some features as 'unmatched'. This is possible
in some dynamic programming algorithms for stereo matching along epipolar lines (e.g.,
Arnold and Binford, 1980) where vertical and horizontal segments in the path through
the transition matrix correspond to skipping features in either the left or right view.
In an iterative framework, a natural strategy is to try and identify at each stage the
regions which are only monocularly visible. The hope is that while initially this classifi-
cation will not be perfect (some pixels which are binocularly visible will be mislabeled
as monocularly visible and vice versa), the combined operation of the different stereopsis
constraints would lead to progressively better classification in subsequent iterations. Our
empirical results bear this out.
Fig. 6. Occlusion. In this view from above, it is clear that at depth discontinuities there are
often regions visible to one eye, but not the other. To the right of each near surface is a region
r that is visible only to the right eye, R. Similarly, to the left of a near surface is a monocular
region, l, visible only to the left eye, L.
The problem of detecting and localizing occluded regions in a pair of stereo images is
made much easier when one recalls that there are indeed a pair of images. The occluded
regions in one image include exactly those points for which there is no corresponding
point in the other image. This suggests that the best cue for finding occluded regions in
one image lies in the disparity estimates for the other image!
Fig. 7. Visibility map. The white areas in the lower panels mark the regions determined to be
visible only from one of the two viewpoints.
Define a binocular visibility map, B(i, j), for one view as being 1 at each image
position that is visible in the other view, and 0 otherwise (i.e., an occluded region). The
horizontal and vertical disparity values for each point in, say, the left image are signed
offsets that give the coordinates of the corresponding point in the right image. If the
visibility map for the right image is initially all zero, it can be filled in systematically as
follows. For each position in the left image, set the corresponding position in the right
visibility map to 1. Those positions that remain zero had no corresponding point in the
other view and are quite likely occluded. An example of a visibility map computed in
this manner is shown in Fig. 7.
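The fill-in procedure above can be sketched directly. This is a minimal sketch assuming the stated convention that left-image disparities are signed offsets into the right image; the 1-row example is hypothetical.

```python
import numpy as np

def visibility_map(h_left, v_left):
    # Fill in the binocular visibility map B for the *right* image from the
    # disparity maps of the *left* image: every left pixel marks its
    # corresponding right-image position; positions that remain 0 received
    # no correspondence and are quite likely occluded.
    rows, cols = h_left.shape
    B = np.zeros((rows, cols), dtype=int)
    for i in range(rows):
        for j in range(cols):
            ii = i + int(round(v_left[i, j]))    # vertical offset
            jj = j + int(round(h_left[i, j]))    # horizontal offset
            if 0 <= ii < rows and 0 <= jj < cols:
                B[ii, jj] = 1
    return B

# 1-row example: left pixels 0-4 all map 2 columns to the right, pixels 5-7
# map to themselves, so right columns 0-1 receive no correspondence.
h = np.full((1, 8), 2.0)
h[0, 5:] = 0.0
v = np.zeros((1, 8))
B = visibility_map(h, v)
```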
Having established a means for finding regions visible only from one viewpoint,
what has been achieved? If the disparity values are accurate, then the visibility map,
besides simply identifying binocularly visible points, also explicitly delimits occluding
contours. After the final iteration, occluded regions can be assigned the same disparity
as the more distant neighboring visible surface.
4.3 Depth Discontinuities and Adaptive Scale Selection
The output of a set of spatial filters at a range of orientations and scales provides a
rich description of an image patch. For corresponding image patches in a stereo pair of
images, it is expected that these filter outputs should be quite similar. This expectation
is reasonable when all of the spatial filters are applied to image patches which are the
projections of single surfaces. When larger spatial filters straddle depth discontinuities,
possibly including occluded regions, the response of filters centered on corresponding
image points may differ quite significantly. This situation is depicted in Fig. 8. Whenever
a substantial area of a filter is applied to a region of significant depth variation, this
difficulty occurs (e.g., in Fig. 5).
Fig. 8. Scale selection. Schematic diagram depicting a three-sided room similar to the one
in Fig. 5. When attempting to determine correspondence for a point on a near surface, larger
filters that cross depth boundaries can result in errors. If depth discontinuities could be detected,
such large-scale filters could be selectively ignored in these situations.
the image patch has a greater effect on the filter response. The sum of these weighted
disparity differences provides a measure of the amount of depth variation across the image
patch affecting the response of this spatial filter. When this sum exceeds an appropriately
chosen threshold, it may be concluded that the filter is too large for its response to be
useful in computing correspondence. Otherwise, continuing to make use of the outputs
of large spatial filters provides stability in the presence of noise.
To record the results of applying the previous procedure, the notion of a scale map is
introduced (Fig. 9). At each position in an image, the scale map, S(i, j), records the scale
of the largest filter to be used in computing stereo correspondence. For the computation
of initial disparity estimates, all the scales of spatial filters are used. From initial disparity
estimates, the scale map is modified using the above criterion. At each position, if it is
determined that an inappropriately large scale filter was used, then the scale value at that
position is decremented. Otherwise, the test is redone at the next larger scale, if there is
one, to see if the scale can be incremented. It is important that this process of adjusting
the scale map is done in small steps, with the disparity values being recalculated between
each step. This prevents an initially noisy disparity map, which seems to have a great
deal of depth variation, from causing the largest scale filters to be incorrectly ignored.
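The small-step adjustment of the scale map can be sketched as follows. The predicate `flag` is a hypothetical stand-in for the thresholded depth-variation test; disparities would be recomputed between successive calls, as the text requires.

```python
def update_scale_map(S, too_coarse, n_scales):
    # One small adjustment step of the scale map S(i, j): decrement where the
    # current largest filter straddles too much depth variation, otherwise
    # cautiously test whether the next larger scale can be re-admitted.
    rows, cols = len(S), len(S[0])
    S_new = [row[:] for row in S]
    for i in range(rows):
        for j in range(cols):
            s = S[i][j]
            if too_coarse(i, j, s):
                S_new[i][j] = max(1, s - 1)       # drop the largest scale
            elif s < n_scales and not too_coarse(i, j, s + 1):
                S_new[i][j] = s + 1               # re-admit one scale
    return S_new

# Toy test: position (0, 0) sits near a depth discontinuity, so any scale
# above 2 is too coarse there; elsewhere all 4 scales are fine.
flag = lambda i, j, s: (i, j) == (0, 0) and s > 2
S = update_scale_map([[4, 4], [4, 4]], flag, n_scales=4)
```

Note that (0, 0) steps down only one scale per call; repeated calls bring it to 2, which is the small-step behavior the text asks for.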
Fig. 9. Scale map. The darker areas in the lower panels mark the regions where larger scale
filters are being discarded because they lie across depth discontinuities.
Once initial estimates of horizontal and vertical disparity have been made, additional
information becomes available which can be used to improve the quality of the disparity
estimates. This additional information includes estimates of the viewing parameters, the
location of occluded regions, and the appropriate scale of filters to be used for matching.
Our algorithm can be summarized as follows:
1. For each pixel P with coordinates (i, j) in the left image, and for each candidate dis-
   parity value (h, v) in the allowable disparity range, compute the error measure e_ij(h, v).
2. Declare h(i, j) and v(i, j) to be the values of (h, v) that minimize e_ij.
3. Use the refined values of h(i, j) and v(i, j) to compute the new visibility map B(i, j)
and scale map S(i, j).
4. Perform steps 1-3 for disparity, visibility, and scale maps but this time with respect
to the right image.
5. Goto step 1 or else stop at convergence.
The error function e(h, v) is the sum of the following terms:
e(h, v) = λ_m e_m(h, v) + λ_v e_v(h, v) + λ_c e_c(h, v) + λ_s e_s(h, v).
Each term enforces one of the constraints discussed: similarity, viewing geometry, consis-
tency, and smoothness. The λ parameters control the weight of each of these constraints,
and their specific values are not particularly critical. The terms are:
• e_m(h, v) is the matching error due to dissimilarity of putative corresponding points.
It is 0 if B(i, j) = 0 (i.e., the point is occluded in the other view); otherwise it is
Σ_k |F_k * I_r(i, j) − F_k * I_l(i + h, j + v)|, where k ranges from the smallest scale to
the scale specified by S(i, j).
• e_v(h, v) is the vertical disparity error |v − v*|, where v* is the vertical disparity consis-
tent with the recovered viewing parameters. This term enforces the epipolar geometry
constraint.
• e_c(h, v) is the consistency error between the disparity maps for the left and right
images. Recall that in our algorithm the left and right disparity maps are computed
independently. This term provides the coupling: positional disparity values for
corresponding points should have equal magnitudes, but opposite signs. If (h′, v′) is
the disparity assigned to the corresponding point P′ = (i + h, j + v) in the other
image, then h′ = −h and v′ = −v at binocularly visible points. If only one of P and
P′ is labelled as monocularly visible, then this is consistent only if the horizontal
disparities place this point further than the binocularly visible point. In this case,
e_c = 0; otherwise, e_c = |h + h′| + |v + v′|.
• e_s(h, v) = |h − h̄| + |v − v̄| is the smoothness error used to penalize candidate dispar-
ity values that deviate significantly from h̄, v̄, the 'average' values of horizontal and
vertical disparity in the neighborhood of P. These are computed either by a local
median filter, within binocularly visible regions, or by a local smoothing operation
within monocularly visible regions. These operations preserve boundaries of binocu-
larly visible surfaces while providing stable depth estimates near occluded regions.
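The weighted sum of constraint terms can be sketched concretely. The four term functions below are toy stand-ins (pure assumptions of this sketch), chosen only to show how the weighted combination selects a disparity that balances the constraints.

```python
def total_error(h, v, terms, weights):
    # The combined error of the text: e(h, v) = sum_k lambda_k * e_k(h, v),
    # one term per constraint (similarity, epipolar geometry,
    # consistency, smoothness).
    return sum(lam * e_k(h, v) for lam, e_k in zip(weights, terms))

# Toy terms: matching favours (2, 0), the epipolar constraint pins v near 1,
# consistency is assumed satisfied, smoothness pulls toward (1, 1).
e_m = lambda h, v: abs(h - 2) + abs(v - 0)
e_v = lambda h, v: abs(v - 1)
e_c = lambda h, v: 0.0
e_s = lambda h, v: abs(h - 1) + abs(v - 1)
e = lambda h, v: total_error(h, v, [e_m, e_v, e_c, e_s], [1.0, 1.0, 1.0, 0.5])

# Exhaustive minimization over a small candidate window.
best = min(((h, v) for h in range(-3, 4) for v in range(-3, 4)),
           key=lambda hv: e(*hv))
```

With these weights the minimizer is a compromise: the matching term alone would pick (2, 0), but the epipolar and smoothness terms pull v to 1.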
The computational complexity of this algorithm has two significant terms. The first
is the cost of the initial linear spatial filtering at multiple scales and orientations. Imple-
mentations can be made quite efficient by using separable kernels and pyramid strategies.
The second term corresponds to the cost of computing the disparity map. This cost is
proportional to the number of iterations (typically 10 or so in our examples). The cost in
each iteration is dominated by the search for the pixel in the other view with minimum e.
This is O(n² w_h w_v) for images of size n × n and horizontal and vertical disparity ranges
w_h and w_v. After the first iteration, when the viewing parameters have been estimated,
the approximate vertical disparity is known at each pixel. This enables w_v to be restricted
to 3 pixels, which is adequate to handle quantization errors of ±1 pixel.
6 Experimental Results
The algorithm described in the previous section has been implemented and tested on
a variety of natural and artificial images. In practice, this process converges (i.e., stops
producing significant changes) in under ten iterations. Disparity maps obtained using
this algorithm are shown in Fig. 10. The reader may wish to compare these with Figures
4 and 5 which show the disparity map after a single iteration when the correspondence is
based solely on the similarity of the filter responses. The additional constraints of epipolar
geometry and piecewise smoothness have clearly helped, particularly in the neighborhood
of depth discontinuities. Also note that the visibility map for the random dot stereogram
as well as the room image (bottom of Fig. 7) are as expected. From these representations,
the detection and localization of depth discontinuities is straightforward.
Fig. 10. Refined disparity estimates. For the stereo pairs (top), the recovered horizontal dis-
parities are shown in the middle panel. For the random dot stereogram, the lower panel shows
the visibility map. For the room image, the bottom panel shows the recovered vertical disparity.
We have demonstrated in this paper that convolution of the image with a bank of
linear spatial filters at multiple scales and orientations provides an excellent substrate on
which to base an algorithm for stereopsis, just as it has proved for texture and motion
analysis. Starting out with a much richer description than edges was extremely useful for
solving the correspondence problem. We have developed this framework further to enable
the utilization of the other constraints of epipolar geometry and piecewise smoothness as
well.
References
Arnold RD, Binford TO (1980) Geometric constraints on stereo vision. Proc SPIE
238:281-292
Ayache N, Faverjon B (1987) Efficient registration of stereo images by matching graph
descriptions of edge segments. Int J Computer Vision 1(2):107-131
Baker HH, Binford TO (1981) Depth from edge- and intensity-based stereo.
Proc 7th IJCAI 631-636
Barnard ST, Thompson WB (1980) Disparity analysis of images. IEEE Trans PAMI
2(4):333-340
Burt P, Julesz B (1980) A disparity gradient limit for binocular function. Science
208:651-657
DeValois R, DeValois K (1988) Spatial vision. Oxford Univ Press
Faugeras O, Maybank S (1990) Motion from point matches: multiplicity of solutions. Int J
Computer Vision 4:225-246
Freeman WT, Adelson EH (1991) The design and use of steerable filters. IEEE Trans
PAMI 13(9):891-906
Gennery DB (1977) A stereo vision system for autonomous vehicles. Proc 5th IJCAI
576-582
Gillam B, Lawergren B (1983) The induced effect, vertical disparity, and stereoscopic
theory. Perception and Psychophysics 36:559-564
Golub GH, Van Loan CF (1983) Matrix computations. The Johns Hopkins Univ Press,
Baltimore, MD
Grimson WEL (1981) From images to surfaces. MIT Press, Cambridge, Mass
Hannah MJ (1974) Computer matching of areas in images. Stanford AI Memo #239
Hoff W, Ahuja N (1989) Surfaces from stereo: integrating stereo matching, disparity
estimation and contour detection. IEEE Trans PAMI 11(2):121-136
Jones, DG (1991) Computational models of binocular vision. PhD Thesis, Stanford Univ
Jones DG, Malik J (1991) A computational framework for determining stereo
correspondence from a set of linear spatial filters. U.C. Berkeley Technical Report
UCB-CSD 91-655
Jones DG, Malik J (1992) Determining three-dimensional shape from orientation and
spatial frequency disparities. Proc ECCV, Genova
Kass M (1983) Computing visual correspondence. DARPA IU Workshop 54-60
Kemp M (Ed) (1989) Leonardo on painting. Yale Univ Press: New Haven 65-66
Koenderink JJ, van Doorn AJ (1987) Representation of local geometry in the visual
system. Biol Cybern 55:367-375
Koenderink JJ (1988) Operational significance of receptive field assemblies. Biol Cybern
58:163-171
Marr D, Poggio T (1979) A theory for human stereo vision. Proc Royal Society London B
204:301-328
Mayhew JEW (1982) The interpretation of stereo disparity information: the computation
of surface orientation and depth. Perception 11:387-403
Mayhew JEW (1983) Stereopsis. in Physiological and Biological Processing of Images.
Braddick OJ, Sleigh AC (Eds) Springer-Verlag, Berlin.
Medioni G, Nevatia R (1985) Segment-based stereo matching. CVGIP 31:2-18
Moravec HP (1977) Towards automatic visual obstacle avoidance. Proc 5th IJCAI
Nakayama K, Shimojo S (1990) Da Vinci stereopsis: depth and subjective occluding
contours from unpaired image points. Vision Research 30(11):1811-1825
Perona P (1991) Deformable kernels for early vision. IEEE Proc CVPR 222-227
Pollard SB, Mayhew JEW, Frisby JP (1985) PMF: a stereo correspondence algorithm
using a disparity gradient limit. Perception 14:449-470
Young R (1985) The Gaussian derivative theory of spatial vision: analysis of cortical cell
receptive field line-weighting profiles. General Motors Research TR #GMR-4920
On Visual Ambiguities Due to Transparency in
Motion and Stereo *
Masahiko Shizawa
ATR Communication Systems Research Laboratories, Advanced Telecommunications Research
Institute International, Sanpeidani, Inuidani, Seika-cho, Soraku-gun, Kyoto 619-02, Japan
1 Introduction
Transparency perception arises when we see scenes with complex occlusions such as picket
fences or bushes, with shadows such as those cast by trees, and with physically transpar-
ent objects such as water or glass. Conventional techniques for segmentation problems
using relaxation-type techniques, such as coupled MRF (Markov Random Field) models
with a line process which explicitly models discontinuities [5][13], statistical decision on
velocity distributions using statistical voting [1][2][3], or the outlier rejection paradigm of
robust statistics [14] and weak continuity [15], cannot properly handle these complex situations,
since transparency is beyond the assumptions of these techniques. More recently, an iter-
ative estimation technique for two-fold motion from three frames has been proposed[16].
The principle of superposition (PoS), a simple and elegant mathematical technique,
has been introduced to build motion transparency constraints from conventional single
motion constraints [25]. PoS resolves the difficulties in analyzing motion transparency and
multiple motions at the level of basic constraints, i.e., at the level of computational theory, in
contrast to conventional algorithm-level segmentation techniques [21]. Using PoS, we can analyze
the nature of transparent motion, such as the minimum number of sets of measurements,
signal components or correspondences needed to determine motion parameters in finite
multiplicity and to determine them uniquely. Another advantage is its computational
simplicity in optimization algorithms, such as convexity of the energy functionals.
In this paper, the constraints of two-fold transparent optical flow are examined
and ambiguities in determining multiple velocities are discussed. It is shown that con-
ventional statistical voting type techniques and a previously described constraint-based
approach [23][24] behave differently for some particular moving patterns. This behavioral
difference will provide a scientific test for the biological plausibility of motion perception
models regarding transparency.
Then, I show that transparency in binocular stereo vision can be interpreted similarly
to transparent motion using PoS. The constraint equations for transparent stereo match-
ing are derived by PoS. Finally, recent results in studies on human perception of multiple
transparent surfaces in stereo vision[19] are explained by this computational theory.
* Part of this work was done while the author was at NTT Human Interface Laboratories,
Yokosuka, Japan.
2 Principle of Superposition
2.1 The Operator Formalism and Constraints of Transparency
Most of the constraint equations in vision can be written as

a(p)f(x) = 0, (1)

where f(x) is a data distribution on data space G. f(x) may be the image intensity
data itself or the output of a previous visual process, p is a point on a parameter space
H which represents a set of parameters to be estimated, and a(p) is a linear operator
parametrized by p. The linearity of the operator is defined by a(p){f_1(x) + f_2(x)} =
a(p)f_1(x) + a(p)f_2(x) and a(p)0 = 0. We call the operator a(p) the amplitude operator.
The amplitude operator and the data distribution may take vector values.
Assume n data distributions f_i(x) (i = 1, 2, …, n) on G, and suppose they are con-
strained by the operators a(p_i) (p_i ∈ H_i, i = 1, 2, …, n) as a(p_i)f_i(x) = 0. The data
distribution f(x) having transparency is assumed to be an additive superposition of the
f_i(x): f(x) = Σ_{i=1}^{n} f_i(x). According to PoS, the transparency constraint for f(x) can be
represented simply by

a(p_1)a(p_2) ⋯ a(p_n)f(x) = 0. (2)
It should be noted that if the constraint of n-fold transparency holds, then the con-
straint of m-fold transparency holds for any m > n. However, parameter estimation
problems based on the constraint of m-fold transparency are ill-posed because extra
parameters can take arbitrary values, i.e. are indefinite. Therefore, appropriate multi-
plicity n may be determined by a certain measure of well-posedness or stability of the
optimization as in [24].
2.2 Superposition Under Occlusion and Transparency
An important property of the transparency constraint equation is its insensitivity to
occlusion. If some region of data f_i(x) is occluded by another pattern, we can assume
that f_i(x) is zero in the occluded region. The transparency constraint equation still holds
because of its linearity. Therefore, in principle, occlusion does not violate the assumption
of additive superposition.
In the case of transparency, there are typically two types of superposition: additive
and multiplicative. Multiplicative superposition is highly non-linear and therefore sub-
stantially violates the additivity assumption. However, taking the logarithm of the data
distribution transforms the problem into a case of additive superposition.
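The logarithm trick for multiplicative superposition is easy to check numerically; a minimal sketch with two hypothetical positive-valued patterns:

```python
import numpy as np

# Multiplicative transparency f = f1 * f2 violates the additivity assumption,
# but taking logarithms restores it: log f = log f1 + log f2.
rng = np.random.default_rng(1)
f1 = np.exp(rng.normal(size=100))    # two positive-valued patterns
f2 = np.exp(rng.normal(size=100))
f = f1 * f2                          # multiplicative superposition
additive = np.log(f)                 # equals log f1 + log f2
```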
where (u, v) is a flow vector. Then, the fundamental constraints of optical flow can be writ-
ten as a(u, v)f(x, y, t) = 0 and â(u, v)F(ω_x, ω_y, ω_t) = 0, where f(x, y, t) and F(ω_x, ω_y, ω_t)
denote a space-time image and its Fourier transform [9][10][11][12]. Using PoS, the con-
straints for the two-fold transparent optical flow are simply a(u_1, v_1)a(u_2, v_2)f(x, y, t) = 0
and â(u_1, v_1)â(u_2, v_2)F(ω_x, ω_y, ω_t) = 0, where (u_1, v_1) and (u_2, v_2) are two flow vectors
which coexist at the same image location.
These two constraints of two-fold motion transparency can be expanded into

d_xx u_1 u_2 + d_yy v_1 v_2 + d_xy(u_1 v_2 + v_1 u_2) + d_xt(u_1 + u_2) + d_yt(v_1 + v_2) + d_tt = 0, (4)

where the components of d = (d_xx, d_yy, d_xy, d_xt, d_yt, d_tt) are, for example,
d_xt = ∂²f(x, y, t)/∂x∂t for the spatial domain representation and d_xt = (2πi)² ω_x ω_t F(ω_x, ω_y, ω_t)
for the frequency domain representation. Therefore, we can simultaneously discuss brightness
measurements and frequency components.
G_uv(d^(i), d^(j); u, v) = q_u^(i) q_t^(j) − q_t^(i) q_u^(j),  G_vu(d^(i), d^(j); u, v) = q_v^(i) q_t^(j) − q_t^(i) q_v^(j). (5)
If q_x^(i) = 0 then we can substitute the index i by another index i′ which is not equivalent to
i, j and k. Then q_x^(i′) = 0 cannot hold if we have transparency, because the two equations
q^(i) = 0 and q^(i′) = 0 would imply a single optical flow. Thus, we can substitute i by i′ without
loss of generality. Therefore, the cubic equation G_uv(d^(i), d^(j), d^(k); u, v) = 0 with respect
to u and v gives the constraint curve on the velocity space (u, v) under the assumption
of two-fold transparency. The intersection points of two curves in uv-space,

C_1 : G_uv(d^(1), d^(2), d^(3); u, v) = 0,  C_2 : G_uv(d^(1), d^(2), d^(4); u, v) = 0, (7)

provide the candidate flow estimates for (u_1, v_1) and (u_2, v_2). By using (5), we can make
pairs of solutions for {(u_1, v_1), (u_2, v_2)} from these intersections.
Equation (4) is linear in the pairwise products of the flow components, so it can be solved for

c = (c_xx, c_yy, c_xy, c_xt, c_yt) = (u_1 u_2, v_1 v_2, ½(u_1 v_2 + v_1 u_2), ½(u_1 + u_2), ½(v_1 + v_2)) (8)
as a linear system. Component flow parameters u_1, u_2, v_1 and v_2 can be obtained by
solving the two quadratic equations u² − 2c_xt u + c_xx = 0 and v² − 2c_yt v + c_yy = 0. We
denote their solutions as u = c_xt ± √(c_xt² − c_xx) and v = c_yt ± √(c_yt² − c_yy). The
constraints c_xt² ≥ c_xx and c_yt² ≥ c_yy must hold for real solutions to exist. We now have two
possible solutions for (u_1, v_1) and (u_2, v_2): {(u_1, v_1), (u_2, v_2)} = {(u_+, v_+), (u_−, v_−)}
and {(u_1, v_1), (u_2, v_2)} = {(u_+, v_−), (u_−, v_+)}. However, we can determine the true solution
by checking their consistency with the remaining relation c_xy = ½(u_1 v_2 + v_1 u_2) of (8).
Therefore, we have a unique interpretation for the general case.
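The quadratic-plus-cross-check recovery just described can be sketched end to end; the ground-truth velocities below are arbitrary test values.

```python
import math

def two_flows_from_c(c):
    # Recover the two flow vectors from the pairwise products c of (8):
    # u solves u^2 - 2 c_xt u + c_xx = 0, v solves v^2 - 2 c_yt v + c_yy = 0,
    # and the cross term c_xy = (u1 v2 + v1 u2)/2 disambiguates the pairing.
    cxx, cyy, cxy, cxt, cyt = c
    u_p = cxt + math.sqrt(cxt * cxt - cxx)
    u_m = cxt - math.sqrt(cxt * cxt - cxx)
    v_p = cyt + math.sqrt(cyt * cyt - cyy)
    v_m = cyt - math.sqrt(cyt * cyt - cyy)
    # Two candidate pairings; keep the one consistent with c_xy.
    for (u1, v1), (u2, v2) in [((u_p, v_p), (u_m, v_m)),
                               ((u_p, v_m), (u_m, v_p))]:
        if abs(0.5 * (u1 * v2 + v1 * u2) - cxy) < 1e-9:
            return (u1, v1), (u2, v2)
    return None

# c built directly from ground-truth velocities (1, 2) and (3, -1).
u1, v1, u2, v2 = 1.0, 2.0, 3.0, -1.0
c = (u1 * u2, v1 * v2, 0.5 * (u1 * v2 + v1 * u2),
     0.5 * (u1 + u2), 0.5 * (v1 + v2))
flows = two_flows_from_c(c)
```

For this input the wrong pairing predicts a cross term of −0.5 against the true 2.5, so the check selects the correct assignment {(3, −1), (1, 2)}.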
Figure 2(a) shows an example of moving patterns which produces this behavioral
difference between the proposed approach and conventional statistical voting. The two
moving patterns A and B are superposed. Pattern A has two frequency components which
may be produced by two plaids G_1 and G_2; its velocity is v_A. The other pattern B, which
has velocity v_B, contains three frequency components produced by three plaids G_3, G_4
and G_5. If the superposed pattern is given to our algorithm based on the transparent
optical flow constraint, the two flow vectors v_A and v_B can be determined uniquely as
shown in the previous subsection. Figure 2(b) shows plots of conventional optical flow
constraint lines on the velocity space (u, v). There are generally seven intersection points,
only one of which is an intersection of three constraint lines; the other six points are
intersections of two constraint lines.³ The intersection of three lines is the velocity v_B
and can be detected by a certain peak detection or clustering technique on the velocity
space. However, the other velocity v_A cannot be discriminated from among the six
two-line intersections!
Fig. 2. Moving pattern from which statistical voting schemes cannot estimate the correct two
flow vectors
4.1 The Constraint of Stereo Matching
The constraints on stereo matching can also be written in the operator formalism. We
denote the left image pattern by L(x) and the right image pattern by R(x), where x
denotes a coordinate along an epipolar line. Then, the constraint for single surface stereo
3 Figure 2(b) actually contains only five two-line intersections. However, in general, it will
contain six.
S(D) is a shift operator which transforms L(x) into L(x - D) and R(x) into R(x - D).^4
It is easy to see that the vector amplitude operator a(D) is linear, i.e. both a(D)0 = 0
and a(D){f_1(x) + f_2(x)} = a(D)f_1(x) + a(D)f_2(x) hold.
Figure 3 is a schematic diagram showing the function of the vector amplitude operator
a(D). The operator a(D) eliminates the signal components of disparity D from the pair of
stereo images, L(x) and R(x), by subtraction.
4.2 The Constraint of Transparent Stereo
According to PoS, the constraint of the n-fold transparency in stereo can be hypothesized
as
a(D_n) \cdots a(D_2) a(D_1) f(x) = 0,   (10)

where f(x) = \sum_{i=1}^{n} f_i(x), and each f_i(x) is constrained by a(D_i)f_i(x) = 0. It is easily proved,
using the commutativity of the shift operator S(D), that the amplitude operators a(D_i) and
a(D_j) commute, i.e. a(D_i)a(D_j) = a(D_j)a(D_i) for i \neq j under the condition of constant
D_i and D_j. Further, the additivity assumption on superposition is reasonable for random
dot stereograms of small dot density.
4.3 Perception of Multiple Transparent Planes
In this section, the human perception of transparent multiple planes in stereo vision
reported in [19] is explained by the hypothesis provided in the previous section.
We utilize a random dot image P(x). If L(x) = P(x - d) and R(x) = P(x), then the
constraint of single surface stereo holds for disparity D = d, since

a(d)f(x) = \begin{pmatrix} P(x-d) - S(d)P(x) \\ P(x) - S(-d)P(x-d) \end{pmatrix} = \begin{pmatrix} P(x-d) - P(x-d) \\ P(x) - P(x-d+d) \end{pmatrix} = 0.   (11)
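Equation (11) is easy to verify numerically on a circular domain. The sketch below takes the component form of a(D) exactly as displayed in (11), with the discrete shift implemented by np.roll (an assumption made for the circular boundary):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 128, 5
P = rng.normal(size=N)                  # random dot pattern (circular)
L, R = np.roll(P, d), P                 # L(x) = P(x - d), R(x) = P(x)

def shift(D, g):
    """Shift operator S(D): (S(D)g)(x) = g(x - D) on a circular domain."""
    return np.roll(g, D)

def amplitude(D, f):
    """Vector amplitude operator a(D) applied to a stereo pair f = (L, R),
    using the component form displayed in eq. (11)."""
    L, R = f
    return (L - shift(D, R), R - shift(-D, L))

# a(d) annihilates the single-surface pair, as eq. (11) states.
out = amplitude(d, (L, R))
```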
^4 We can write this shift operator explicitly in a differential form as
S(D) = \exp(-D \frac{\partial}{\partial x}) = 1 - D \frac{\partial}{\partial x} + \frac{D^2}{2!} \frac{\partial^2}{\partial x^2} - \frac{D^3}{3!} \frac{\partial^3}{\partial x^3} + \cdots.
However, only the shifting property of the operators is essential
in the following discussions.
where d_L and d_R are shift displacements for the pattern repetitions in the left and right
image planes. According to [19], when d_L \neq d_R, we perceive four transparent planes
which correspond to the disparities D = 0, d_L, d_R and d_L + d_R. An interesting phenomenon
occurs in the case of d_L = d_R = d_c. The stereogram produces a single plane perception
despite the fact that the correlation of the two image patterns L(x) and R(x) has three strong
peaks at the disparities D = 0, d_c and 2d_c.
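The three correlation peaks are easy to reproduce. The sketch below assumes the repeated-pattern stereogram is built as L(x) = P(x) + P(x - d_L) and R(x) = P(x) + P(x + d_R) (a plausible reading of the construction, whose defining equation is garbled in the scanned original), and computes the circular cross-correlation over all candidate disparities:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d_c = 1024, 9                     # pattern length and common shift (illustration values)
P = rng.normal(size=N)               # random dot pattern on a circular domain

# Assumed construction of the repeated-pattern stereogram with d_L = d_R = d_c.
L = P + np.roll(P, d_c)              # L(x) = P(x) + P(x - d_c)
R = P + np.roll(P, -d_c)             # R(x) = P(x) + P(x + d_c)

def correlation(D):
    """Circular cross-correlation of L(x) with R(x - D)."""
    return float(np.dot(L, np.roll(R, D))) / N

C = np.array([correlation(D) for D in range(N)])
# Three strong correlation peaks appear, at D = 0, d_c and 2 * d_c, even
# though the percept reported in [19] is a single plane.
peaks = sorted(int(i) for i in np.argsort(C)[-3:])
```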
From the viewpoint of the constraints of transparent stereo, these phenomena can be
explained as shown below.
First, it should be pointed out that the data distribution f(x) can be represented as
a weighted linear sum of four possible unique matching components.
f(x) = \alpha f_1(x) + \alpha f_2(x) + (1 - \alpha) f_3(x) + (1 - \alpha) f_4(x),   (13)

where f_1(x) and f_2(x) are the components matched with disparities 0 and d_L + d_R, and
f_3(x) and f_4(x) are those matched with disparities d_R and d_L.
Note that the weights have only one degree of freedom, parameterized by \alpha.
When assuming d_L \neq d_R, the following observations can be made regarding the
constraints of transparent stereo.
1. The constraint of single surface stereo a(D_1)f(x) = 0 cannot hold for any value of
the disparity D_1.
2. The constraint of two-fold transparent stereo a(D_2)a(D_1)f(x) = 0 can hold only
for two sets of disparities {D_1, D_2} = {0, d_L + d_R} and {D_1, D_2} = {d_R, d_L}, which
correspond to \alpha = 1 and \alpha = 0, respectively.
We can conclude that the stereo constraint of n-fold transparency is valid only for n = 2
and n = 4 by using the criterion of Occam's razor, i.e., the disparities should not take
arbitrary continuous values. Then, in both cases n = 2 and n = 4, the theory predicts
the coexistence of the four disparities 0, d_L, d_R and d_L + d_R.
When d_L = d_R = d_c, the constraint of single surface stereo a(D_1)f(x) = 0 can hold
only for D_1 = d_c, since
Therefore, the case of d_L = d_R must produce the single surface perception, if we apply
the criterion of Occam's razor to the disparities.
5 Conclusion
I have analyzed visual ambiguities in transparent optical flow and transparent stereo
using the principle of superposition formulated by parametrized linear operators. Ambi-
guities in velocity estimates for particular transparent motion patterns were examined by
mathematical analyses of the transparent optical flow constraint equations. I also pointed
out that conventional statistical voting schemes on velocity space cannot estimate mul-
tiple velocity vectors correctly for a particular transparent motion pattern. Further, the
principle of superposition was applied to transparent stereo and human perception of
multiple ambiguous transparent planes was explained by the operator formalism of the
transparent stereo matching constraint and the criterion of Occam's razor on the number
of disparities.
Future work may include development of a stereo algorithm based on the constraints of
transparent stereo. The research reported in this paper will not only lead to modification
and extension of the computational theories of motion and stereo vision, but will also
help with modeling human motion and stereo vision by incorporating transparency.
Acknowledgments. The author would like to thank Dr. Kenji Mase of NTT Human Interface
Laboratories and Dr. Shin'ya Nishida of ATR Auditory and Visual Perception Labora-
tories for helpful discussions.
He also thanks Drs. Jun Ohya, Ken-ichiro Ishii, Yukio Kobayashi and Takaya Endo of
NTT Human Interface Laboratories as well as Drs. Fumio Kishino, Nobuyoshi Terashima
and Kohei Habara of ATR Communication Systems Research Laboratories for their kind
support.
References
Abstract. In this work, we look at mean field annealing (MFA) from two
different perspectives: information theory and statistical mechanics. An iterative,
deterministic algorithm is developed to obtain the mean field solution for disparity
calculation in stereo images.
1 Introduction
Recently, a deterministic version of the simulated annealing (SA) algorithm, called mean field
approximation (MFA) [1], was utilized to approximate the SA algorithm efficiently and success-
fully in a variety of early vision modules, such as image restoration [8], image
segmentation [3], stereo [12], motion [11], surface reconstruction [4], etc.
In this paper, we apply the approximation to the stereo matching problem. We show that
the optimal Bayes estimate of disparity is, in fact, equivalent to the mean field solution which
minimizes the relative entropy between an approximated distribution and the given posterior
distribution, if (i) the approximated distribution has a Gibbs form and (ii) the mass of the dis-
tribution is concentrated near the mean as the temperature goes to zero. The approximated
distribution can be appropriately tuned to behave as close to the posterior distribution as pos-
sible. Alternatively, from the angle of statistical mechanics, the system defined by the states
of disparity variables can be viewed as isomorphic to that in magnetic materials, where the
system energy is specified by the binary states of magnetic spins. According to the MRF model,
the distribution of a specific disparity variable is determined by two factors: one due to the
observed image data (external field) and the other due to its dependence (internal field) upon
the neighboring disparity variables. We follow the mean field theory of the usual Ising model of
magnetic spins [7] to modify the Gibbs sampler [5] into an iterative, deterministic version.
2 An Information Theoretic Analysis of MFA
The optimal Bayes estimate of the disparity values at uniformly spaced grid points, given a pair
of images, is the maximum a posteriori (MAP) estimate when a uniform cost function is assumed.
To impose the prior constraints (e.g., surface smoothness, etc.), we can add energy terms to the
objective energy (performance) functional and/or introduce an approximated distribution. The
posterior energy functional of the disparity map d given the stereo images, f_l and f_r, can usually be
formulated in the form [2]:

U_p(d|f_l, f_r) = \alpha(T) \sum_{i=1}^{M} |g_l(x_i) - g_r[x_i + (d_{x_i}, 0)]|^2 + \sum_{i=1}^{M} \sum_{x_j \in N_{x_i}} (d_{x_i} - d_{x_j})^2,   (1)

where the first term measures the fidelity to the observed image data and the second term imposes
surface smoothness. If the disparity is modelled as an MRF given the image data, the posterior
distribution of disparity is given as
P(d|f_l, f_r) = \frac{1}{Z_p} \exp\left[-\frac{U_p(d|f_l, f_r)}{T}\right],   (2)

where Z_p and T are the normalization and temperature constants, respectively. The MAP
estimate of the disparity map is the minimizer of the corresponding posterior energy functional
U_p(d|f_l, f_r). It is desirable to describe the above equation by a simpler parametric form. Let the
approximated distribution be P_a, which depends on adjustable parameters represented by the
vector \bar{d} = \{\bar{d}_x, x \in D\} and has the Gibbs form:

P_a(d|\bar{d}) = \frac{1}{Z_a} \exp\left[-\frac{U_a(d|\bar{d})}{T}\right],  with  U_a(d|\bar{d}) = \sum_{i=1}^{M} (d_{x_i} - \bar{d}_{x_i})^2,   (3)

where Z_a is the partition function and U_a(d|\bar{d}) is the associated energy functional. For this
specific U_a, the approximated distribution is Gaussian. In information theory, relative entropy
is an effective measure of how well one distribution is approximated by another [9]. Alternative
names in common use for this quantity are discrimination, Kullback-Leibler number, direct
divergence and cross entropy. The relative entropy of a measurement d with distribution P_a
relative to the reference distribution P(d|f_l, f_r) is defined as

S_r(\bar{d}) \triangleq \int P_a(d|\bar{d}) \log \frac{P_a(d|\bar{d})}{P(d|f_l, f_r)} \, dd.   (4)

Kullback's principle of minimum relative entropy [9] states that, of the approximated distributions
P_a with the given Gibbs form, one should choose the one with the least relative entropy. If \bar{d}
is chosen as the mean of the disparity field d, the optimal mean field solution is apparently the
minimizer of the relative entropy measure.
After some algebraic manipulations, we can get

S_r(\bar{d}) = \frac{1}{T}\left[F_a - F_p + E(U_p) - E(U_a)\right],   (5)

where the expectations, E(\cdot), are defined with respect to the approximated distribution P_a, and
F_a \triangleq -T \log Z_a, F_p \triangleq -T \log Z_p are called free energies. In statistical mechanics [10], the difference
between the average energy and the free energy, scaled by temperature, is equal to entropy,
or F = E - TS. From the divergence inequality in information theory, the relative entropy is
always non-negative [6], S_r(\bar{d}) \geq 0, with equality holding if and only if P_a \equiv P. And since
the temperature is positive,

F_p \leq F_a + E(U_p) - E(U_a),   (6)

which is known as Peierls's inequality [1]. The MFA solution, realized as the minimizer of
relative entropy, can alternatively be represented as the parameter \bar{d} yielding the tightest bound
in (6). In other words, we have

\min_{\bar{d}} S_r(\bar{d}) \;\Longleftrightarrow\; \min_{\bar{d}} \left[F_a + E(U_p) - E(U_a)\right],   (7)
since F_p in (5) is not a functional of the parameter \bar{d} at all. The choice of U_a relies on prior
knowledge of the distribution of the solution. The Gibbs measure provides flexibility in
defining the approximated distribution P_a, as it depends solely on the energy function U_a, which
in turn can be expressed as a sum of clique potentials [5]. Next we discuss an example of U_a
which is both useful and interesting. For the energy function given in (3), the corresponding
approximated distribution is Gaussian and the adjustable parameters are, in fact, the mean
values of the disparity field. As the temperature (variance) approaches zero, the distribution
concentrates at the mean value with probability one. Since the disparity values at lattice points are assumed
to be independent Gaussian random variables, both the free energy and expected approximate
energy can be obtained as:
F_a = -T \log Z_a = -\frac{MT}{2}\log(\pi T),  E(U_a) = \sum_{i=1}^{M} E\left[(d_{x_i} - \bar{d}_{x_i})^2\right] = \frac{MT}{2}.   (8)
The expected posterior energy follows from (1) as

E(U_p) = \alpha(T) \sum_{i=1}^{M} E\left(|g_l(x_i) - g_r[x_i + (d_{x_i}, 0)]|^2\right) + \sum_{i=1}^{M} \sum_{x_j \in N_{x_i}} E\left[(d_{x_i} - d_{x_j})^2\right].   (9)

The second term on the right hand side (RHS) of (9) can be rewritten as:
\sum_{i=1}^{M} \sum_{x_j \in N_{x_i}} E\left[(d_{x_i} - d_{x_j})^2\right] = \sum_{i=1}^{M} \sum_{x_j \in N_{x_i}} \left[T + (\bar{d}_{x_i} - \bar{d}_{x_j})^2\right].   (10)
On the other hand, if the first term on the RHS of (9) can be approximated by (the validity of
this approximation will be discussed later)

\alpha(T) \sum_{i=1}^{M} E\left(|g_l(x_i) - g_r[x_i + (d_{x_i}, 0)]|^2\right) \approx \alpha(T) \sum_{i=1}^{M} |g_l(x_i) - g_r[x_i + (\bar{d}_{x_i}, 0)]|^2,   (11)
then, by combining (10) and (11), the upper bound in Peierls's inequality becomes

F_a + E(U_p) - E(U_a) \approx \alpha(T) \sum_{i=1}^{M} |g_l(x_i) - g_r[x_i + (\bar{d}_{x_i}, 0)]|^2 + \sum_{i=1}^{M} \sum_{x_j \in N_{x_i}} (\bar{d}_{x_i} - \bar{d}_{x_j})^2 + \mathrm{const}.   (12)
It is interesting to note that the form of the above functional of the mean disparity field,
\bar{d}, is identical to that of the posterior energy functional U_p(\bar{d}|f_l, f_r) up to a constant. Hence,
it is inferred that the MAP estimate of the disparity function is, in fact, equivalent to the mean
field solution minimizing the relative entropy between the posterior and the approximated Gaussian
distributions. Regarding the approximation in (11), as the temperature T \to 0, all the mass of
P_a(d|\bar{d}) becomes concentrated at the mean vector d = \bar{d} and (11) holds exactly. Thus, at least
at low temperatures, the MFA solution coincides with the MAP solution.
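The inequality (6) is easy to check numerically. The sketch below (an illustration on an arbitrary finite state space, not the disparity model itself) draws random energy functions U_p and U_a, computes the free energies and expectations under P_a, and verifies that the bound holds, with equality when the trial energy equals the posterior energy:

```python
import numpy as np

rng = np.random.default_rng(1)
T, n_states = 1.0, 8

def free_energy(U):
    """F = -T log Z for the Gibbs distribution of energy U on a finite state space."""
    return -T * np.log(np.sum(np.exp(-U / T)))

U_p = rng.normal(size=n_states)          # an arbitrary "posterior" energy
F_p = free_energy(U_p)

for _ in range(200):
    U_a = rng.normal(size=n_states)      # an arbitrary trial energy
    P_a = np.exp(-U_a / T)
    P_a /= P_a.sum()                     # Gibbs distribution for U_a
    bound = free_energy(U_a) + P_a @ U_p - P_a @ U_a
    assert F_p <= bound + 1e-12          # Peierls's inequality (6)

# The bound is tight when the trial energy equals the posterior energy.
P_eq = np.exp(-U_p / T)
P_eq /= P_eq.sum()
bound_eq = free_energy(U_p) + P_eq @ U_p - P_eq @ U_p
```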
3 A Statistical Mechanics Analysis of MFA

When a system possesses a large number of interacting degrees of freedom, equilibrium can be attained
through the mean field [10]. The mean field serves as a general model to approximate a complicated physical
system. In our case, each pixel is updated by the expected (mean) value given the mean values
of its neighbors [7].
With the Gibbs sampler [2], we visit each site x_i and update the associated site variable d_{x_i}
with a sample from the local characteristics

P(d_{x_i} | d_y, \forall y \neq x_i, f_l, f_r) = \frac{1}{Z_i} \exp\left[-\frac{U_i(d_{x_i})}{T}\right],   (14)

where the marginal energy function U_i(d_{x_i}) is derived from (1) and (2). If the system is fully
specified by the interactions of the site (disparity) variables and the given data, the uncertainty of
each variable is, in fact, defined by the local characteristics. In a magnetic material, each of
the spins is influenced by the magnetic field at its location. This magnetic field consists of any
external field imposed by the experimenter, plus an internal field due to other spins. During the
annealing process, the mean contribution of each spin to the internal field is considered. The
first term in (1) can be interpreted as the external field due to the given image data and the
second term as the internal field contributed by the other disparity variables. SA with the Gibbs sampler
simulates the system with samples obtained from the embedded stochastic rules, while MFA
tries to depict the system with the mean of each system variable.
In summary, the MFA version of Gibbs sampler can then be stated as:
1. Start with an arbitrary initial mean disparity field \bar{d}_0 and a relatively high initial temperature.
2. Visit a site x_i and calculate the marginal energy function \bar{U}_i(d_{x_i}) contributed by the given
image data and the mean disparities in the neighborhood N_{x_i}.
3. Replace the mean disparity at x_i by the expected value

\bar{d}_{x_i} = \sum_{d_{x_i} \in R_D} d_{x_i} P(d_{x_i} | \bar{d}_y, \forall y \neq x_i, f_l, f_r) = \sum_{d_{x_i} \in R_D} d_{x_i} \frac{\exp[-\bar{U}_i(d_{x_i})/T]}{Z_i}.   (15)

4. Update in accordance with steps 2 and 3 until a steady state is reached at the current
temperature, T.
5. Lower the temperature according to a schedule and repeat steps 2, 3 and 4 until there
are few changes.
Consequently, MFA consists of a sequence of iterative, deterministic relaxations approximating
the SA. It converts a hard optimization problem into a sequence of easier ones.
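A toy implementation of this deterministic relaxation, for a one-dimensional disparity field with a quadratic smoothness prior (the signal sizes, the candidate disparity range R_D, and the annealing schedule are all made-up illustration values):

```python
import numpy as np

rng = np.random.default_rng(0)
N, true_d = 48, 3
g_r = rng.normal(size=N)                  # right scanline (circular domain)
g_l = np.roll(g_r, -true_d)               # left scanline: g_l(x) = g_r(x + true_d)

R_D = np.arange(6)                        # candidate disparity range R_D = {0, ..., 5}
d_bar = np.full(N, R_D.mean())            # step 1: initial mean disparity field
T = 5.0                                   # step 1: high initial temperature

while T > 1e-3:                           # step 5: halving annealing schedule
    for _ in range(5):                    # step 4: relax to a steady state at this T
        for i in range(N):                # step 2: marginal energy at site x_i
            data = (g_l[i] - g_r[(i + R_D) % N]) ** 2
            smooth = sum((R_D - d_bar[j]) ** 2 for j in ((i - 1) % N, (i + 1) % N))
            U = data + smooth
            p = np.exp(-(U - U.min()) / T)   # local characteristics, eq. (15)
            d_bar[i] = (p / p.sum()) @ R_D   # step 3: deterministic mean update
    T *= 0.5
# At the final temperature, d_bar has collapsed onto the true integer disparity.
```

Subtracting U.min() before exponentiating keeps the update numerically stable at low temperatures, where exp(-U/T) would otherwise underflow everywhere.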
4 Experimental Results
We have used a wide range of image examples to demonstrate that SA can be closely approxi-
mated by MFA. Due to space limitations, we only provide one image example: Pentagon (256 x
256). The matching primitives used in the experiments are intensity and directional intensity gra-
dients (along the horizontal and vertical directions), i.e., g_s(x, y) = (f_s(x, y), \partial f_s/\partial x, \partial f_s/\partial y), \forall (x, y) \in \Omega, s = l, r. We try to minimize the functional U_p(\bar{d}|f_l, f_r) by deterministic relaxation at each
temperature and use the result at the current temperature as the initial state for the relaxation at
the next lower temperature. The initial temperature is set to 5.0 and the annealing schedule
reduces the temperature by 50% relative to the previous value. The second-order neighborhood
system is used in describing surface smoothness. The computer simulation results are shown in Fig.
1. One can compare the result with those obtained by the SA algorithm using the Gibbs sampler.
In the MFA version of SA with the Gibbs sampler, we follow the algorithm presented in Section 3.
The initial temperature and the annealing schedule are identical to those above. The results
are also shown in Fig. 1. When they are compared with the previous results, we can see that the
MFA from both approaches yields roughly the same mean field solution, and both approximate
the MAP solution closely.
5 Conclusion
In this paper, we have discussed, for the stereo matching problem, two general approaches to MFA
which provide a good approximation to the optimal disparity estimate. The underlying models can
easily be modified and applied to other computer vision problems, such as image restoration,
surface reconstruction, optical flow computation, etc. As the Gaussian distribution is the most
natural distribution of an unknown variable given both mean and variance [9], it is nice to see
that the mean values of these independent variables that minimize the relative entropy between
the assumed Gaussian and the posterior distribution are equivalent to the optimal Bayes estimate
in the MAP sense.
Fig. 1. Upper row (left to right): the left and right images of the Pentagon stereo pair, the mean
field result based on the information theoretic approach, and the result using SA. Bottom row
(left to right): the result using the deterministic Gibbs sampler, the three-dimensional (3-D) surface
corresponding to the information theoretic MFA, and the 3-D surface corresponding to the deterministic
Gibbs sampler.
References
1. G.L. Bilbro, W.E. Snyder, and R.C. Mann. Mean-field approximation minimizes relative
entropy. Journal of the Optical Society of America, Vol. 8(No. 2):290-294, Feb. 1991.
2. C. Chang. Area-Based Methods for Stereo Vision: the Computational Aspects and Their
Applications. PhD thesis, University of California, San Diego, 1991. Dept. of ECE.
3. C. Chang and S. Chatterjee. A hybrid approach toward model-based texture segmentation.
Pattern Recognition, 1990. Accepted for publication.
4. D. Geiger and F. Girosi. Mean field theory for surface reconstruction. In Proc. DARPA
Image Understanding Workshop, pages 617-630, Palo Alto, CA, May 1989.
5. S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions and the Bayesian
restoration of images. IEEE Trans. on Pattern Analysis and Machine Intelligence, Nov. 1984.
6. R.M. Gray. Entropy and Information Theory. Springer-Verlag, New York, NY, 1990.
7. J. Hertz, A. Krogh, and R.G. Palmer. Introduction to the Theory of Neural Computation.
Addison Wesley, Reading, MA, 1991.
8. H.P. Hiriyannaiah, G.L. Bilbro, W.E. Snyder, and R.C. Mann. Restoration of piecewise-
constant images by mean field annealing. Journal of the Optical Society of America,
Vol. 6(No. 12):1901-1912, Dec. 1989.
9. S. Kullback. Information Theory and Statistics. John Wiley & Sons, New York, NY, 1959.
10. G. Parisi. Statistical Field Theory. Addison Wesley, Reading, MA, 1988.
11. A. Yuille. Generalized deformable models, statistical physics and matching problems. Neural
Computation, Vol. 2(No. 1):1-24, 1990.
12. A. Yuille, D. Geiger, and H. Bulthoff. Stereo integration, mean field theory and psy-
chophysics. In Proc. 1st European Conf. on Computer Vision, pages 73-88, Antibes, France,
April 1990.
This article was processed using the LaTeX macro package with the ECCV92 style
Occlusions and Binocular Stereo
Abstract.
Binocular stereo is the process of obtaining depth information from a
pair of left and right cameras. In the past occlusions have been regions
where stereo algorithms have failed. We show that, on the contrary, they
can help stereo computation by providing cues for depth discontinuities.
We describe a theory for stereo based on the Bayesian approach. We
suggest that a disparity discontinuity in one eye's coordinate system always
corresponds to an occluded region in the other eye, thus leading to an oc-
clusion constraint or monotonicity constraint. The constraint restricts the
space of possible disparity values, simplifying the computations, and gives
a possible explanation for a variety of optical illusions. Using dynamic pro-
gramming we have been able to find the optimal solution to our system and
the experimental results support the model.
1 Introduction
Binocular stereo is the process of obtaining depth information from a pair of left and
right camera images. The fundamental issues of stereo are: (i) how are the geometry and
calibration of the stereo system determined, (ii) what primitives are matched between
the two images, (iii) what a priori assumptions are made about the scene to determine
the disparity and (iv) the estimation of depth from the disparity.
Here we assume that (i) is solved, and so the corresponding epipolar lines (see figure 1)
between the two images are known. We also consider (iv) to be given and then we
concentrate on the problems (ii) and (iii).
A number of researchers, including Sperling [Sperling70]; Julesz [Julesz71]; Marr and
Poggio [MarPog76] [MarPog79]; Pollard, Mayhew and Frisby [PolMayFri87]; Grimson [Grimson81];
Ohta and Kanade [OhtKan85]; and Yuille, Geiger and Bülthoff [YuiGeiBul90] have provided a
basic understanding of the matching problem on binocular stereo. However, we argue that
more information exists in a stereo pair than that exploited by current algorithms. In
particular, occluded regions have always caused difficulties for stereo algorithms. These
are regions where points in one eye have no corresponding match in the other eye. Despite
the fact that they occur often and represent important information, there has not been a
consistent attempt at modeling these regions. Therefore most stereo algorithms give poor
results at occlusions. We address the problem of modeling occlusions by introducing a
constraint that relates discontinuities in one eye with occlusions in the other eye.
Our modeling starts by considering adaptive-window matching techniques [KanOku90],
taking also into account changes of illumination between the left and right images, which
provide robust dense input data to the algorithm. We then define an a priori probability
for the disparity field, based on (1) a smoothness assumption preserving discontinuities,
and (2) an occlusion constraint. This constraint immensely restricts the possible solutions
of the problem, and provides a possible explanation for a variety of optical illusions that
so far could not be explained by previous theories of stereo. In particular, illusory dis-
continuities, perceived by humans as described in Nakayama and Shimojo [NakShi90],
may be explained by the model. We then apply dynamic programming to exactly solve
the model.
Some of the ideas developed here were initiated in collaboration with A. Chambolle
and S. Mallat and are partially presented in [ChaGeiMal91]. We also briefly mention
that an alternative theory dealing with stereo and occlusions has been developed by
Belhumeur and Mumford [BelMum91].
It is interesting to notice that, despite the fact that good modelling of discontinuities
has been done for the problem of segmentation (for example, [BlaZis87][GeiGir91]), the
modeling of discontinuities is still poor for problems with multiple views, like stereopsis.
We argue that the main difficulty with multiple views is to model discontinuities together
with occlusions. In a single view, there are no occlusions!
Fig. 1. (a) A pair of frames (eyes) and an epipolar line in the left frame. (b) The two windows
in the left image and the respective ones in the right image. In the left image each window shares
the "center pixel" l. Window-1 extends one pixel to the right of l and window-2 one pixel to the
left of l.
2.1 Probability of matching
If a feature vector in the left image, say W_l^L, matches a feature vector in the right image,
say W_r^R, then \|W_l^L - W_r^R\| should be small. As in [MarPog76][YuiGeiBul90], we use a
427
matching process M_{lr} that is 1 if a feature at pixel l in the left eye matches a feature at
pixel r in the right eye, and is 0 otherwise. Within the Bayesian approach we define the
probability of generating a pair of inputs, W^L and W^R, given the matching process M,
by

P_{input}(W^L, W^R | M) = \frac{1}{C_1} e^{-\sum_{l,r} \{ M_{lr} \|W_l^L - W_r^R\| + \epsilon (1 - M_{lr}) \}},   (1)

where the second term pays a penalty for unmatched points (M_{lr} = 0), with \epsilon being
a positive parameter to be estimated. C_1 is a normalization constant. This distribution
favors lower correlation between the input pair of images.
Notice that these restrictions guarantee that there is at most one match per feature,
and permit unmatched features to exist. There are some psychophysical experiments
where one would think that multiple matches occur, as in the two-bar experiment (see
figure 5). However, we argue that this is not the case: indeed a disparity is assigned
to all the features, even without a match, giving the sensation of multiple matches. This
point will become clearer in the next two sections, and we will assume that uniqueness holds.
Then, it is natural to consider an occlusion process, O, for the left (O^L) and for the right
(O^R) coordinate systems, such that

O_l^L(M) = 1 - \sum_{r=0}^{N-1} M_{lr}  and  O_r^R(M) = 1 - \sum_{l=0}^{N-1} M_{lr}.   (2)
The occlusion processes are 1 when no match occurs and 0 otherwise. In analogy, we
can define a disparity field for the left eye, D^L, and another for the right eye, D^R, as

D_l^L(M)(1 - O_l^L) = \sum_{r=0}^{N-1} M_{lr}(r - l)  and  D_r^R(M)(1 - O_r^R) = \sum_{l=0}^{N-1} M_{lr}(r - l),   (3)

where D^L and D^R are defined only if a match occurs. This definition leads to integer
values for the disparity. Notice that D_l^L = D^R_{l + D_l^L} and D_r^R = D^L_{r - D_r^R}. These two variables,
O(M) and D(M) (depending upon the matching process M), will be useful to establish
a relation between discontinuities and occlusions.
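These definitions translate directly into code. A small sketch (the function name is hypothetical) computes the occlusion processes and disparity fields of equations (2) and (3) from a binary match matrix M:

```python
import numpy as np

def occlusions_and_disparities(M):
    """Occlusion processes and disparity fields, eqs. (2)-(3).

    M is an N x N 0/1 matrix with M[l, r] = 1 iff left pixel l matches
    right pixel r (at most one match per row and per column).
    """
    M = np.asarray(M)
    N = M.shape[0]
    l_idx, r_idx = np.arange(N)[:, None], np.arange(N)[None, :]
    O_L = 1 - M.sum(axis=1)                  # eq. (2), left occlusion process
    O_R = 1 - M.sum(axis=0)                  # eq. (2), right occlusion process
    D_L = (M * (r_idx - l_idx)).sum(axis=1)  # eq. (3), defined where O_L == 0
    D_R = (M * (r_idx - l_idx)).sum(axis=0)  # eq. (3), defined where O_R == 0
    return O_L, O_R, D_L, D_R

# A uniform shift by one pixel: left pixels 0..2 match right pixels 1..3,
# so left pixel 3 and right pixel 0 remain occluded.
M = np.zeros((4, 4), dtype=int)
M[0, 1] = M[1, 2] = M[2, 3] = 1
O_L, O_R, D_L, D_R = occlusions_and_disparities(M)
```

The identity D_l^L = D^R_{l + D_l^L} can be checked on this example: D_L[0] = 1 and D_R[0 + 1] = 1.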
3 Piecewise smooth functions
Since surface changes are usually small compared to the viewer distance, except at depth
discontinuities, we first impose that the disparity field, at each eye, should be a smooth
function but with discontinuities (for example, [BlaZis87]). An effective cost to describe
these functions, (see [GeiGir91]), is given by
where \mu and \gamma are parameters to be estimated. We have imposed the smoothness criteria
on the left disparity field and on the right one. Assigning a Gibbs probability distribution
to this cost and combining it with (1), within the Bayesian rule, we obtain the effective cost

E(M) = \sum_{l,r} M_{lr} \|W_l^L - W_r^R\| + \mu \sum_l (D_{l+1}^L - D_l^L)^2 (1 - O_l^L)(1 - O_{l+1}^L) + \mu \sum_r (D_{r+1}^R - D_r^R)^2 (1 - O_r^R)(1 - O_{r+1}^R) + \gamma \sum_l O_l^L + \gamma \sum_r O_r^R,   (5)

where we have discarded the constant 2\gamma + \epsilon(N - 1)N. This cost, dependent just upon
the matching process (the disparity fields and the occlusion processes are functions of
Ml,), is our starting point to address the issue of occlusions.
4 Occlusions
Giving a stereoscopic image pair, occlusions are regions in space that cannot be seen by
both eyes and therefore a region in one eye does not have a match in the other image,
To model occlusions we consider the matching space, a two-dimensional space where the
axis are given by the epipolar lines of the left and right eyes and each element of the
space, Mz~, decides whether a left intensity window at pixel / matches a right intensity
window at pixel r. A solution for the stereo matching problem is represented as a path
in the matching space(see figure 2).
4.1 Occlusion constraint
We notice that in order for a stereo model to admit disparity discontinuities it also has
to admit occlusion regions, and vice versa. Indeed most of the discontinuities in one eye's
coordinate system correspond to an occluded region in the other eye's coordinate system.
This is best understood in the matching space. Let us assume that the left epipolar line is
the abscissa of the matching space. A path can be broken vertically when a discontinuity
is detected in the left eye and can be broken horizontally when a region of occlusion
is found. Since we do not allow multiple matches to occur, by imposing uniqueness,
a vertical break (jump) in one eye almost always corresponds to a horizontal break
(jump) in the other eye (see figure 2).
Fig. 2. (a) A ramp occluding a plane. (b) The matching space, where the left and right epipolar
lines are for the image of (a). Notice the symmetry between occlusions and discontinuities. Dark
lines indicate where a match occurs, M_{lr} = 1.
4.2 Monotonicity constraint
An alternative way of considering the occlusion constraint is by imposing the monotonic-
ity of the function F_l^L = l + D_l^L, for the left eye, or the monotonicity of F_r^R = r + D_r^R.
This is called the monotonicity constraint (see also [ChaGeiMal91]). Notice that F_l^L and
F_r^R are not defined at occluded regions, i.e. the functions F_l^L and F_r^R do not have support
at occlusions. The monotonicity of F_l^L, for an occlusion of size o, is then given by

F_{l+o+1}^L - F_l^L > 0,  or  D_{l+o+1}^L - D_l^L \geq -o,  \forall l,

where

O_{l+o+1}^L = O_l^L = 0  and  \sum_{l'=l+1}^{l+o} (1 - O_{l'}^L) = 0,   (8)
and analogously for F_r^R. The monotonicity constraint proposes an ordering type of con-
straint. It differs from the known ordering constraint in that it explicitly assumes (i)
occlusions with discontinuities, horizontal and vertical jumps, and (ii) uniqueness. We point
out that the monotonicity of F^L is equivalent to the monotonicity of F^R. The mono-
tonicity constraint can be applied to simplify the optimization of the effective cost (5), as
we discuss next.
5 Dynamic Programming
Since the interactions of the disparity fields D_l^L and D_r^R are restricted to a small neighbor-
hood, we can apply dynamic programming to exactly solve the problem.
We first constrain the disparity to take on integer values in the range (-\theta, \theta)
(Panum's limit, see [MarPog79]). We impose the boundary condition, for now, that the
disparity at the end sides of the image must be 0.
The dynamic program works by solving many subproblems of the form: what is the
lowest cost path from the beginning to a particular (l, r) pair and what is its cost? These
subproblems are solved column by column, from left to right, finally resulting in a solu-
tion of the whole problem (see figure 5). At each column the subproblem is considered
using a set of subproblems previously solved. Because of the monotonicity constraint
the set of previously solved subproblems is reduced. More precisely, solving the subprob-
lem (l, r) requires the information from the solutions of the previous subproblems (m, y),
where y < r and m < l (see shaded pixels in figure 5). Notice that the monotonicity
constraint was used to reduce the required set of previously solved subproblems, thus
helping the efficiency of the algorithm.
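A minimal sketch of this style of dynamic programming on one pair of epipolar lines (using a simple absolute-difference match cost and a fixed occlusion penalty rather than the full effective cost (5); names and values are illustrative):

```python
import numpy as np

def dp_epipolar_match(left, right, occ=0.5):
    """Lowest-cost monotone path through the matching space.

    Vertical/horizontal moves declare a pixel occluded at cost `occ`;
    diagonal moves match left[i] with right[j] at cost |left[i] - right[j]|.
    Returns the matched (l, r) pairs along the optimal path.
    """
    n, m = len(left), len(right)
    cost = np.full((n + 1, m + 1), np.inf)
    move = np.zeros((n + 1, m + 1), dtype=int)   # 1: match, 2: left occ, 3: right occ
    cost[0, 0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i and j and cost[i - 1, j - 1] + abs(left[i - 1] - right[j - 1]) < cost[i, j]:
                cost[i, j] = cost[i - 1, j - 1] + abs(left[i - 1] - right[j - 1])
                move[i, j] = 1
            if i and cost[i - 1, j] + occ < cost[i, j]:
                cost[i, j] = cost[i - 1, j] + occ
                move[i, j] = 2
            if j and cost[i, j - 1] + occ < cost[i, j]:
                cost[i, j] = cost[i, j - 1] + occ
                move[i, j] = 3
    matches, (i, j) = [], (n, m)
    while i or j:                                 # backtrack the optimal path
        if move[i, j] == 1:
            matches.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif move[i, j] == 2:
            i -= 1
        else:
            j -= 1
    return matches[::-1]

# The right scanline has one extra (occluded) pixel in front: every left
# pixel then matches the right pixel one position over, i.e. disparity r - l = 1.
matches = dp_epipolar_match([1.0, 5.0, 2.0, 7.0, 3.0], [9.0, 1.0, 5.0, 2.0, 7.0, 3.0])
```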
In some unusual situations the monotonicity constraint can be broken while still preserving
uniqueness. We show in figure 4 an example where a discontinuity does not correspond
to an occlusion. More psychophysical investigation is necessary to assess the agreement
of human perception for this experiment with our theory. This experiment is a gen-
eralization of the double-nail illusion [KroGri82]: since the head of the nail is of finite
size (not a point), we call it the double-hammer illusion.
Fig. 3. A pair of (a) left and (b) right images of the Pentagon, with horizontal epipolar lines.
Each image is 8-bit and 512 by 512 pixels. (c) The final disparity map, where the values range
from -9 to +5. The parameters used were: \gamma = 10; \mu = 0.15; \epsilon = 0.15; \theta = 40; \omega = 3. On a
SPARCstation 1+, the algorithm takes about 1000 seconds, mostly for matching windows (about 75%
of the time). (d) The occlusion regions in the right image. They are approximately correct.
Fig. 4. The double-hammer illusion. This figure has a square in front of another, larger square.
There is no region of occlusion and yet there is a depth discontinuity.
Fig. 5. (a) An illustration of the dynamic programming. The subproblem being considered is the
(l, l + D_l^L) one. To solve it we need the solutions from all the shaded pixels. (b) When fused,
a 3-dimensional sensation of two bars, one in front of the other, is obtained. This
suggests that a disparity value is assigned to both bars in the left image. (c) A stereo pair of the
type of the Nakayama and Shimojo experiments. When fused, a vivid sensation of depth and depth
discontinuity is obtained at the occluded regions (not matched features). We have displaced the
occluded features with respect to each other to give a sensation of different depth values for the
occluded features, supporting the disparity limit conjecture. A cross fuser should fuse the left and
the center images to perceive the blocks behind the planes. An uncross fuser should use the center
and right images.
References

[BelMum91] P. Belhumeur and D. Mumford, A Bayesian treatment of the stereo corre-
    spondence using half-occluded regions, Harvard Robotics Lab, Tech. Report,
    December 1991.
[GeiGir91] D. Geiger and F. Girosi, "Parallel and deterministic algorithms for MRFs: surface
    reconstruction," IEEE Transactions on Pattern Analysis and Machine Intelli-
    gence, vol. PAMI-13, no. 5, pp. 401-412, May 1991.
[KanOku90] T. Kanade and M. Okutomi, "A stereo matching algorithm with an adaptive
    window: theory and experiments," in Proc. Image Understanding Workshop
    DARPA, PA, September 1990.
[KroGri82] J.D. Krol and W.A. van de Grind, "The double-nail illusion: experiments on
    binocular vision with nails, needles and pins," Perception, vol. 11, pp. 615-619,
    1982.
[MarPog79] D. Marr and T. Poggio, "A computational theory of human stereo vision,"
    Proceedings of the Royal Society of London B, vol. 204, pp. 301-328, 1979.
[NakShi90] K. Nakayama and S. Shimojo, "Da Vinci stereopsis: depth and subjective
    occluding contours from unpaired image points," Vision Research, vol. 30,
    pp. 1811-1825, 1990.
[OhtKan85] Y. Ohta and T. Kanade, "Stereo by intra- and inter-scanline search using
    dynamic programming," IEEE Transactions on Pattern Analysis and Machine
    Intelligence, vol. PAMI-7, no. 2, pp. 139-154, 1985.
[YuiGeiBul90] A. Yuille, D. Geiger, and H. Bulthoff, "Stereo, mean field theory and psy-
    chophysics," in 1st ECCV, pp. 73-82, Springer-Verlag, Antibes, France, April
    1990.
This article was processed using the LaTeX macro package with ECCV92 style
Model-Based Object Tracking in Traffic Scenes
1 Introduction
The higher the level of abstraction of descriptions in image sequence evaluation, the more
a priori knowledge is necessary to reduce the number of possible interpretations, as, for
example, in the case of automatic association of trajectory segments of moving vehicles
with motion verbs as described in [Koller et al. 91]. In order to obtain more robust results,
we take into account more a priori knowledge about the physical inertia and dynamic
behaviour of the vehicle motion.
For this purpose we establish a motion model which describes the dynamic vehicle
motion in the absence of knowledge about the intention of the driver. The result is a
simple circular motion with constant magnitude of velocity and constant angular velocity
around the normal of a plane on which the motion is assumed to take place. The unknown
intention of the driver in maneuvering the car is captured by the introduction of process
noise. The motion model is described in Section 2.2.
The motion parameters for this motion model are estimated using a recursive maximum
a posteriori (MAP) estimator, which is described in Section 4.
Initial states for the first frames are provided by a step which consists of a motion
segmentation and clustering approach for moving image features as described in [Koller
et al. 91]. Such a group of coherently moving image features gives us a rough estimate
for moving regions in the image. The assumption of a planar motion then yields a rough
estimate for the position of the object hypothesis in the scene by backprojecting the
center of the group of moving image features into the scene, based on a calibration
of the camera.
To update the state description, straight line segments extracted from the image
(we call them data segments) are matched to the 2D edge segments - - a view sketch - -
obtained by projecting a 3D model of the vehicle into the image plane using a hidden-line
algorithm to determine their visibility.
The 3D vehicle model for the objects is parameterized by 12 length parameters. This
enables the instantiation of different vehicles, e.g. limousine, hatchback, bus, or van from
the same generic vehicle model. The estimation of model shape parameters is possible by
including them into the state estimation process. Modeling of the objects is described in
Section 2.1.
The matching of data and model segments is based on the Mahalanobis distance of
attributes of the line segments as described in [Deriche & Faugeras 90]. The midpoint
representation of line segments is suitable for using different uncertainties parallel and
perpendicular to the line segments.
In order to track moving objects in long image sequences which are recorded by a
stationary camera, we are forced to use a wide field of view. As a consequence, each
individual moving object maps onto only a small image area. In bad cases, there are few
and/or only poor line segments associated with the image of a moving object. In order
to track even objects mapped onto very small areas in the image, we decided to include
the shadow edges in the matching process if possible. In a first implementation of the
matching process it was necessary to take the shadow edges into account to track some
small objects. In the current implementation the shadow edges appear not to be necessary
for tracking these objects, but they yield more robust results. The improvement from the
first implementation to the current one was only possible by testing the algorithms on
various real-world traffic scenes. The results of the latest experiments are illustrated in
Section 5.
2.1 The Parameterized Vehicle Model
Fig. 1. Example of five different vehicle models derived from the same generic model.
2.2 The Motion Model
We use a motion model which describes the dynamic behaviour of a road vehicle without
knowledge about the intention of the driver. This assumption leads to a simple vehicle
motion on a circle with a constant magnitude of the velocity v = |v| and a constant
angular velocity ω. The deviation of this idealized motion from the real motion is captured
by process noise in v and ω. In order to recognize pure translational motion in
the noisy data, we evaluate the angle difference ωτ (τ = t_{k+1} - t_k is the time interval). In
case ωτ is less than a threshold, we use a simple translation with the estimated (constant)
angle φ and ω = 0.
Since we assume the motion to take place on a plane, we have only one angle φ and
one angular velocity ω = φ̇. The angle φ describes the orientation of the model around
the normal (the z-axis) of the plane on which the motion takes place. This motion model
is described by the following differential equation:

    ẋ = v cos φ,   φ̇ = ω,
    ẏ = v sin φ,   v̇ = 0,   ω̇ = 0.                    (1)
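Integrating (1) over one sampling interval τ yields a circular-arc state transition. The sketch below is our illustration (the threshold `eps` for the ωτ test is a hypothetical parameter); it also falls back to pure translation for small ωτ, as described above:

```python
import math

def transition(state, tau, eps=1e-6):
    """One prediction step of the circular-arc motion model (1):
    xdot = v cos(phi), ydot = v sin(phi), phidot = omega,
    vdot = 0, omegadot = 0, integrated exactly over tau."""
    x, y, phi, v, omega = state
    if abs(omega * tau) < eps:
        # pure translation when the turned angle is below threshold
        x += v * tau * math.cos(phi)
        y += v * tau * math.sin(phi)
    else:
        # exact integration along the circle of radius v/omega
        x += (v / omega) * (math.sin(phi + omega * tau) - math.sin(phi))
        y += (v / omega) * (math.cos(phi) - math.cos(phi + omega * tau))
        phi += omega * tau
    return (x, y, phi, v, omega)
```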
The matching between the predicted model data and the image data is performed on edge
segments. The model edge segments are the edges of the model, which are projected
from the 3D scene into the 2D image. The invisible model edge segments are removed by
a hidden-line algorithm. The position t and orientation φ of the model are given by the
output of the recursive motion estimation described in Section 4. This recursive motion
estimation also yields values for the determination of a window in the image in which
edge segments are extracted. The straight line segments are extracted and approximated
using the method of [Korn 88].
Fig. 2. To illustrate the complexity of the task to detect and track small moving objects,
the following four images are given: the upper left image shows a small enlarged image
section, the upper right figure shows the greycoded maxima gradient magnitude in the
direction of the gradient of the image function, the lower left figure shows the straight
line segments extracted from these data, and the lower right figure shows the matched
model.
Like the method of [Lowe 85; Lowe 87] we use an iterative approach to find the set
with the best correspondence between 3D model edge segments and 2D image edge
segments. The iteration is necessary to take into account the visibility of edge segments
depending on the viewing direction and the estimated state of position and orientation,
respectively. At the end of each iteration a new correspondence is determined according
to the estimated state of position and orientation. The iteration is terminated if a certain
number of iterations has been reached or the new correspondence found has already been
investigated previously. Out of the set of correspondences investigated in the iteration,
the correspondence which leads to the smallest residual is then used as a state update.
The algorithm is sketched in Figure 3. We use the average residual per matched edge
segment, multiplied by a factor which accounts for long edge segments, as a criterion for
the selection of the smallest residual.
i ← 0
C_i ← get_correspondences( x⁻ )
DO
    x⁺ ← update_state( C_i )
    r_i ← residual( C_i )
    C_{i+1} ← get_correspondences( x⁺ )
    i ← i + 1
WHILE ( (C_i ≠ C_j ; j = 0, 1, ..., i-1) ∧ i < IMAX )
i_min ← {i | r_i = min(r_j) ; j = 0, 1, ..., IMAX}
x⁺ ← x⁺_{i_min}
Fig. 3. Algorithm for the iterative matching process. C_i is the set of correspondences between
p data segments D = {D_j}_{j=1,...,p} and n model segments M = {M_j}_{j=1,...,n} for the model inter-
pretation i: C_i = {(M_j, D_ij)}_{j=1,...}
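In code, the loop of Figure 3 can be sketched as follows. This is a schematic rendering, not the paper's implementation; the callables for correspondence search, state update and residual computation are hypothetical, and correspondence sets must be comparable for the repetition test:

```python
def iterative_match(x_pred, get_correspondences, update_state, residual, imax=10):
    """Iterate correspondence search and state update; stop when a
    correspondence set repeats or IMAX is reached, then return the
    state update with the smallest residual."""
    seen, states, residuals = [], [], []
    c = get_correspondences(x_pred)
    while c not in seen and len(seen) < imax:
        seen.append(c)
        states.append(update_state(c))      # x+ from this correspondence set
        residuals.append(residual(c))       # r_i for the final selection
        c = get_correspondences(states[-1])
    return states[residuals.index(min(residuals))]
```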
Finding Correspondences
Correspondences between model and data segments are established using the Maha-
lanobis distance between attributes of the line segments as described in [Deriche &
Faugeras 90]. We use the representation X = (x_m, y_m, θ, l) of a line segment, defined
as:

    x_m = (x_1 + x_2)/2,   y_m = (y_1 + y_2)/2,
    θ = arctan((y_2 - y_1)/(x_2 - x_1)),   l = √((x_2 - x_1)² + (y_2 - y_1)²),      (2)

where (x_1, y_1)ᵀ and (x_2, y_2)ᵀ are the endpoints of the line segment.
Denoting by σ_∥ the uncertainty in the position of the endpoints along an edge chain
and by σ_⊥ the positional uncertainty perpendicular to the linear edge chain approxi-
mation, a covariance matrix Λ is computed, depending on σ_∥, σ_⊥ and l. Given the
attribute vector X_m of a model segment and the attribute vector X_d of a data segment,
the Mahalanobis distance between X_m and X_d is defined as

    d = (X_m - X_d)ᵀ (Λ_m + Λ_d)⁻¹ (X_m - X_d).        (3)
The data segment with the smallest Mahalanobis distance to the model segment is
used for correspondence, provided the Mahalanobis distance is less than a given threshold.
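This matching criterion can be sketched as follows. The sketch is our illustration: the covariance matrices are taken as given rather than derived from σ_∥, σ_⊥ and l as in [Deriche & Faugeras 90]:

```python
import numpy as np

def segment_attributes(p1, p2):
    """Midpoint representation X = (x_m, y_m, theta, l) of a line segment, eq. (2)."""
    (x1, y1), (x2, y2) = p1, p2
    return np.array([(x1 + x2) / 2.0, (y1 + y2) / 2.0,
                     np.arctan2(y2 - y1, x2 - x1),
                     np.hypot(x2 - x1, y2 - y1)])

def mahalanobis(X_m, X_d, L_m, L_d):
    """d = (X_m - X_d)^T (Lambda_m + Lambda_d)^-1 (X_m - X_d), eq. (3)."""
    diff = X_m - X_d
    return float(diff @ np.linalg.solve(L_m + L_d, diff))

def best_match(X_m, L_m, data_segments, threshold):
    """Index of the data segment closest to the model segment,
    or None if the smallest distance exceeds the threshold."""
    d, i = min((mahalanobis(X_m, X_d, L_m, L_d), i)
               for i, (X_d, L_d) in enumerate(data_segments))
    return i if d < threshold else None
```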
Due to the structure of vehicles this is not always the best match. The known vehicles
and their models consist of two essential sets of parallel line segments: one set along
the orientation of the modeled vehicle and one set perpendicular to this direction. But
evidence from our experiments so far supports our hypothesis that in most cases the
initialisation for the model instantiation is good enough to obviate the necessity for a
combinatorial search, such as, e.g., in [Grimson 90b].
The search window for corresponding line segments in the image is a rectangle around
the projected model segments. The dimensions of this rectangle are intentionally set by
us to a higher value than the values obtained from the estimated uncertainties in order
to overcome the optimism of the IEKF as explained in Section 4.
In this section we elaborate on the recursive estimation of the vehicle motion parameters.
As we have already described in Section 2.2, the assumed model is the uniform motion
of a known vehicle model along a circular arc.
By integrating the differential equations (1) we obtain the following discrete plant
model describing the state transition from time point tk to time point tk+l:
We introduce the usual dynamical systems notation (see, e.g., [Gelb 74]). The sym-
bols (x̂⁻_k, P⁻_k) and (x̂⁺_k, P⁺_k) are used, respectively, for the estimated states and their
covariances before and after updating based on the measurements at time t_k.
By denoting the transition function of (5) by f(·) and assuming white Gaussian
process noise w_k ~ N(0, Q_k), the prediction equations read as follows,
where (t_{x,k}, t_{y,k}, φ_k) are the state parameters and x_{m,i} are the known positions of the
vehicle vertices in the model coordinate system.
As already mentioned, we have included the projection of the shadow contour in the
measurements in order to obtain more predicted edge segments for matching and to avoid
false matches to data edge segments arising from shadows that lie in the neighborhood
of predicted model edges. The measurement function of projected shadow edge segments
differs from the measurement function of the projections of model vertices in one step.
Instead of only one point in the world coordinate system, we get two: one point x_s as
vertex of the shadow on the street and a second point x_w = (x_w, y_w, z_w) as vertex on the
object which is projected onto the shadow point x_s. We assume a parallel projection in
[Figure: camera coordinate system, model coordinate system and the street plane, illustrating the shadow generation; diagram not reproduced.]
shadow generation. Let the light source direction be (cos α sin β, sin α sin β, cos β)ᵀ, where
α and β - set interactively off-line - are the azimuth and polar angle, respectively,
described in the world coordinate system. The following expression for the shadow point
in the xy-plane (the road plane) of the world coordinate system can be easily derived:

    x_s = x_w - z_w cos α tan β,
    y_s = y_w - z_w sin α tan β.                    (8)
The point x_w can then be expressed as a function of the state using (7). A problem arises
with endpoints of line segments in the image which are not projections of model vertices
but intersections of occluding line segments. Due to the small length of the possibly
occluded edges (for example, the side edges of the hood and of the trunk of the vehicle)
we cover this case by the already included uncertainty σ_∥ of the endpoints in the edge
direction. A formal solution uses a closed form for the endpoint position in the image as a
function of the coordinates of the model vertices belonging to the occluded and occluding
edge segments. Such a closed-form solution has not yet been implemented in our system.
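Equation (8) amounts to intersecting the light ray through a vertex with the road plane z = 0; a minimal sketch (our illustration):

```python
import math

def shadow_point(x_w, y_w, z_w, alpha, beta):
    """Parallel projection of the object vertex (x_w, y_w, z_w) along the
    light direction (cos a sin b, sin a sin b, cos b) onto the road plane
    z = 0, as in eq. (8)."""
    x_s = x_w - z_w * math.cos(alpha) * math.tan(beta)
    y_s = y_w - z_w * math.sin(alpha) * math.tan(beta)
    return x_s, y_s
```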
The measurement function h_k is nonlinear in the state x_k. Therefore, we have tested
three possibilities for the updating step of our recursive estimation. In all three approaches
we assume that the state after the measurement z_k is normally distributed around the
estimate x̂⁺_{k-1} with covariance P⁺_{k-1}, which is only an approximation to the actual a
posteriori probability density function (PDF) after an update step based on a nonlinear
measurement. An additional approximation is the assumption that the PDF after the
nonlinear prediction step remains Gaussian. Thus we state the problem as the search for
the maximum of the following a posteriori PDF after measurement z_k:

    p(x_k | z_k) ∝ exp{ -½ (z_k - h_k(x_k))ᵀ R_k⁻¹ (z_k - h_k(x_k)) }
                   · exp{ -½ (x_k - x̂⁻_k)ᵀ (P⁻_k)⁻¹ (x_k - x̂⁻_k) },        (9)
resulting in the updated estimate x̂⁺_k. In this context the well-known Iterated Extended
Kalman Filter (IEKF) [Jazwinski 70; Bar-Shalom & Fortmann 88] is actually the Gauss-
Newton iterative method [Scales 85] applied to the above objective function, whereas the
Extended Kalman Filter (EKF) is only one iteration step of this method. We have found
such a clarification [Jazwinski 70] of the meaning of EKF and IEKF to be important
for understanding the performance of each method.
A third possibility we have considered is the Levenberg-Marquardt iterative minimiza-
tion method applied to (10), which we call the Modified IEKF. The Levenberg-Marquardt
strategy is a standard method for least squares minimization, guaranteeing a steepest descent
direction far from the minimum and a Gauss-Newton direction near the minimum, thus in-
creasing the convergence rate. If the initial values are in the close vicinity of the minimum,
then the IEKF and the Modified IEKF yield almost the same result.
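The relation between EKF, IEKF and Gauss-Newton can be made concrete. The sketch below is our illustration, not the authors' code; the measurement covariance `R` is an assumed input. It relinearizes the measurement function at each iterate, and `n_iter=1` reduces to the plain EKF update:

```python
import numpy as np

def iekf_update(x_pred, P_pred, z, h, H_jac, R, n_iter=5):
    """IEKF update as Gauss-Newton on the MAP objective
    (z - h(x))^T R^-1 (z - h(x)) + (x - x_pred)^T P_pred^-1 (x - x_pred)."""
    x = np.asarray(x_pred, dtype=float).copy()
    for _ in range(n_iter):
        H = H_jac(x)                          # relinearize at current iterate
        S = H @ P_pred @ H.T + R              # innovation covariance
        K = P_pred @ H.T @ np.linalg.inv(S)   # gain at the current linearization
        x = x_pred + K @ (z - h(x) - H @ (x_pred - x))
    P = (np.eye(len(x)) - K @ H) @ P_pred     # linearized (optimistic) covariance
    return x, P
```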
Due to the mentioned approximations, all three methods are suboptimal and the com-
puted covariances are optimistic [Jazwinski 70]. In practice, this affects the matching
process by narrowing the search region and making the matcher believe that the current
estimate is much more reliable than it actually is. Practical compensation methods in-
clude the addition of artificial process noise or a multiplication with an amplification
matrix. We did not apply such methods in our experiments in order to avoid a severe
violation of the smoothness of the trajectories. We have just added process noise to the
velocity magnitudes v and ω (about 10% of the actual value) in order to compensate for
the inadequacy of the motion model with respect to the real motion of a vehicle.
We have tested all three methods [Thórhallson 91] and it turned out that the IEKF
and the Modified IEKF are superior to the EKF regarding convergence as well as retention
of a high number of matches. As [Maybank 90] suggested, these suboptimal filters come
the closer to the optimal filter in a Minimum Mean Square Error sense the nearer the initial
value lies to the optimal estimate. This criterion is actually satisfied by the initial posi-
tion and orientation values in our approach, obtained by backprojecting image features
clustered into objects onto a plane parallel to the street. In addition to the starting values
for position and orientation, we computed initial values for the velocity magnitudes v and
ω during a bootstrap process. During the first n_boot (= 2, usually) time frames, position
and orientation are statically computed. Then initial values for the velocities are taken
from the discrete time derivatives of these positions and orientations.
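The bootstrap can be sketched as simple finite differences over the statically computed poses (our illustration; the pose tuples and interval τ are hypothetical values):

```python
import math

def bootstrap_velocities(poses, tau):
    """Initial v and omega from discrete time derivatives of the poses
    (x, y, phi) computed statically during the first n_boot frames."""
    (x0, y0, phi0), (x1, y1, phi1) = poses[0], poses[-1]
    n = len(poses) - 1                      # number of sampling intervals
    v = math.hypot(x1 - x0, y1 - y0) / (n * tau)
    omega = (phi1 - phi0) / (n * tau)
    return v, omega
```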
Concluding the estimation section, we should mention that the above process requires
only a slight modification for the inclusion of the shape parameters of the model as un-
knowns in the state vector. Since shape parameters remain constant, the prediction step
is the same, and the measurement function must be modified by substituting the model
points x_{m,i} with the respective functions of the shape parameters instead of considering
them to have constant coordinates in the model coordinate system.
Figure 5). The image of the moving car covers about 60 x 100 pixels of a frame. In this
example it was not necessary, and due to the illumination conditions not even possible,
to use shadow edges in the matching process. The matched models for the three upper
frames are illustrated in the middle row of Figure 5, with more details given in the lower
three figures. In the lower figures we see the extracted straight lines, the backprojected
model segments (dashed lines) and the matched data segments, emphasized by thick
lines.
Fig. 5. The first row shows the 4th, 41st and 79th frame of an image sequence. The three
images in the middle row give an enlarged section of the model matched to the car moving
in the image sequence. The lower three figures exhibit the correspondences (thick lines)
between image line segments and model segments (dashed lines) in the same enlarged
section as in the middle row.
process noise of σ_v = 10⁻³ m/s and σ_ω = 10⁻⁴ rad/s. Given this σ_v and σ_ω, the majority of
the translational and angular accelerations are assumed to be v̇ < σ_v/τ = 0.025 m/s² and
ω̇ < σ_ω/τ = 2.5·10⁻³ rad/s², respectively, with τ = t_{k+1} - t_k = 40 ms.
The bootstrap phase is performed using the first two frames in order to obtain initial
estimates for the magnitudes of the velocities v and ω. Since the initially detected moving
region does not always correctly span the image of the moving object, we used values equal
to approximately half of the average model length, i.e. σ_{x0} = σ_{y0} = 3 m. An initial value
for the covariance in the orientation φ is roughly estimated by considering the differences
in the orientation between the clustered displacement vectors, i.e. σ_{φ0} = 0.35 rad.
The car has been tracked during the entire sequence of 80 frames with an average
number of about 16 line segment correspondences per frame. The computed trajectory
for this moving car is given in Figure 6.
Fig. 6. The estimated position as well as the translational and angular velocity of the
moving car of Figure 5.
Fig. 7. The first row shows the 3rd, 25th and 49th frame of an image sequence recorded
at a much frequented multilane street intersection. The middle row shows an enlarged
section of the model matched to the taxi (object #6) moving in the center of the frame.
The lower three figures exhibit the correspondences between image line segments and
model segments in the same enlarged section as in the middle row.
segments. We explicitly present this Figure in order to give an idea of the complexity of
the task to detect and track a moving vehicle spanning such a small area in the image. We
used the same values for the process noise and the initial covariances as in the previous
experiment. As in the previous example we used the first two frames for initial estimation
of v and w. In this experiment we used the shadow edges as additional line segments in
the matching process as described in Section 4.
Five of the vehicles appearing in the first frame have been tracked throughout the
entire sequence. The reason for the failure in tracking the other vehicles has been the
inability of the initialization step to provide the system with appropriate initial values.
To handle this inability an interpretation search tree is under investigation.
In the upper part of Figure 7 we see three frames out of this image sequence. In the
middle part of Figure 7, the matched model of a taxi is given as an enlarged section
Fig. 8. The first row shows the 3rd, 25th and 49th frame of an image sequence recorded
at a much frequented multilane street intersection. The middle row shows an enlarged
section of the model matched to the small car (object #5) moving left of the center of the
frame. The lower three figures exhibit the correspondences between image line segments
and model segments in the same enlarged section as in the middle row.
for the three upper images. In the lower three figures the correspondences of image line
segments and the model line segments are given. Figure 9 shows the resultant object
trajectory. Figure 8 shows another car of the same image sequence with the resultant
trajectory also displayed in Figure 9.
6 Related Work
In this section we discuss related investigations about tracking and recognizing object
models from image sequences. The reader is referred to the excellent book by [Grimson
90a] for a complete description of research on object recognition from a single image.
[Gennery 82] has proposed the first approach for tracking 3D-objects of known struc-
ture. A constant velocity six degrees of freedom (DOF) model is used for prediction and
449
Fig. 9. The estimated positions as well as the translational and angular velocities of the
moving cars in Figure 8 (object # 5) and Figure 7 (object # 6).
an update step similar to the Kalman filter - without addressing the nonlinearity - is
applied. Edge elements closest to the predicted model line segments are associated as
corresponding measurements.
[Thompson & Mundy 87] emphasize the object recognition aspect of tracking by ap-
plying a pose clustering technique. Candidate matches between image and model vertex
pairs define points in the space of all transformations. Dense clusters of such points
indicate a correct match. Object motion can be represented by a trajectory in the trans-
formation space. Temporal coherence then means that this trajectory should be smooth.
Predicted clusters from the last time instant establish hypotheses for the new time in-
stants which are verified as matches if they lie close to the newly obtained clusters. The
images we have been working on did not contain the necessary vertex pairs in order to
test this novel algorithm. Furthermore, we have not been able to show that the approach
of [Thompson & Mundy 87] is extensible to handling of parameterized objects.
[Verghese et al. 90] have implemented in real-time two approaches for tracking 3D-
known objects. Their first method is similar to the approach of [Thompson & Mundy 87]
(see the preceding discussion). Their second method is based on the optical flow of line
segments. Using line segment correspondences, of which initial (correct) correspondences
are provided interactively at the beginning, a prediction of the model is validated and
spurious matches are rejected.
[Lowe 90, 91] has built the system that has been the main inspiration for our match-
ing strategy. He does not enforce temporal coherence, however, since he does not employ
a motion model. Pose updating is carried out by minimization of a sum of weighted
least squares including a priori constraints for stabilization. Line segments are used for
matching, but distances of selected edge points from infinitely extending model lines are
used in the minimization. [Lowe 90] uses a probabilistic criterion to guide the search for
correct correspondences and a match iteration cycle similar to ours.
A gradient-ascent algorithm is used by [Worrall et al. 91] in order to estimate the
pose of a known object in a car sequence. Initial values for this iteration are provided
interactively at the beginning. Since no motion model is used the previous estimate is
used at every time instant to initialize the iteration. [Marslin et al. 91] have enhanced
the approach by incorporating a motion model of constant translational acceleration and
angular velocity. Their filter optimality, however, is affected by use of the speed estimates
as measurements instead of the image locations of features.
[Schick & Dickmanns 91] use a generic parameterized model for the object types. They
solve the more general problem of estimating both the motion and the shape parameters.
The motion model of a car moving on a clothoid trajectory is applied including trans-
lational as well as angular acceleration. The estimation machinery of the simple EKF is
used and, so far, the system is tested on synthetic line images only.
The following approaches do not consider the correspondence search problem but
concentrate only on the motion estimation. A constant velocity model with six DOF
is assumed by [Wu et al. 88] and [Harris & Stennet 90; Evans 90], whereas [Young &
Chellappa 90] use a precessional motion model.
A quite different paradigm is followed by [Murray et al. 89]. They first try to solve
the structure from motion problem from two monocular views. In order to accomplish
this, they establish temporal correspondence of image edge elements and use these cor-
respondences to solve for the infinitesimal motion between the two time instants and the
depths of the image points. Based on this reconstruction [Murray et al. 89] carry out a
3D-3D correspondence search. Their approach has been tested with camera motion in a
laboratory set-up.
Our task has been to build a system that is able to compute smooth trajectories of
vehicles in traffic scenes and is extensible to incorporate a solution to the problem
of classifying the vehicles according to computed shape parameters. We have considered
the task to be difficult because of the complex illumination conditions and the cluttered
environment of real-world traffic scenes and the small effective field of view that is
spanned by the projection of each vehicle given a stationary camera. In all experiments
mentioned in the cited approaches in the last section, the projected area of the objects
covers a quite high portion of the field of view. Furthermore, only one of them [Evans
90] is tested under outdoor illumination conditions (landing of an aircraft).
In order to accomplish the above mentioned tasks we have applied the following
constraints. We restricted the degrees of freedom of the transformation between model
and camera from six to three by assuming that a vehicle is moving on a plane known a
priori by calibration. We considered only a simple time-coherent motion model because
of the high sampling rate (25 frames per second) and the knowledge that vehicles do not
maneuver abruptly.
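The reduction from six to three degrees of freedom can be written as a homogeneous model-to-world transform parameterized only by (t_x, t_y, φ); a sketch of the constraint (our illustration):

```python
import numpy as np

def planar_pose(t_x, t_y, phi):
    """Rotation about the road-plane normal (z-axis) plus translation
    in the plane: the 3 remaining DOF of the model-to-world transform."""
    c, s = np.cos(phi), np.sin(phi)
    return np.array([[c, -s, 0.0, t_x],
                     [s,  c, 0.0, t_y],
                     [0.0, 0.0, 1.0, 0.0],
                     [0.0, 0.0, 0.0, 1.0]])
```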
The second critical point we have been concerned about is the establishment of good
initial matches and pose estimates. Most tracking approaches do not emphasize the sever-
ity of this problem of establishing a number of correct correspondences in the starting
phase and feeding the recursive estimator with quite reasonable initial values. Again we
have used the a priori knowledge of the street plane position and the results of clustering
picture domain descriptors into object hypotheses of a previous step. Thus we have been
able to start the tracking process with a simple matching scheme and feed the recursive
estimator with values of low error covariance.
The third essential point we have addressed is the additional consideration of shadows.
Data line segments arising from shadows are no longer treated as disturbing data like
markings on the road, but contribute to the stabilization of the matching process.
Our work will be continued by the following steps. First, the matching process should
be enhanced by introducing a search tree. In spite of the good initial pose estimates,
we are still confronted occasionally with totally false matching combinations due to the
highly ambiguous structure of our current vehicle model. Second, the generic vehicle
model enables a simple adaptation to the image data by varying the shape parameters.
These shape parameters should be added as unknowns and estimated over time.
Acknowledgements
The financial support of the first author by the Deutsche Forschungsgemeinschaft (DFG,
German Research Foundation) and of the second as well as the third author by the
Deutscher Akademischer Austauschdienst (DAAD, German Academic Exchange Service)
are gratefully acknowledged.
References
[Bar-Shalom & Fortmann 88] Y. Bax-Shalom, T.E. Fortmann, Tracking and Data Association,
Academic Press, New York, NY, 1988.
[Deriche & Faugeras 90] R. Deriche, O. Faugeras, Tracking line segments, Image and Vision
Computing 8 (1990) 261-270.
[Evans 90] R. Evans, Kalman Filtering of pose estimates in applications of the RAPID video
rate tracker, in Proc. British Machine Vision Conference, Oxford, UK, Sept. 24-27, 1990,
pp. 79-84.
[Gelb 74] A. Gelb (ed.), Applied Optimal Estimation, The MIT Press, Cambridge, MA and
London, UK, 1974.
[Gennery 82] D.B. Gennery, Tracking known three-dimensional objects, in Proc. Conf. Ameri-
can Association of Artificial Intelligence, Pittsburgh, PA, Aug. 18-20, 1982, pp. 13-17.
[Grimson 90a] W.E.L. Grimson, Object recognition by computer: The role of geometric con-
straints, The MIT Press, Cambridge, MA, 1990.
[Grimson 90b] W. E. L. Grimson, The combinatorics of object recognition in cluttered environ-
ments using constrained search, Artificial Intelligence 44 (1990) 121-165.
[Harris & Stennet 90] C. Harris, C. Stennet, RAPID - A video rate object tracker, in Proc.
British Machine Vision Conference, Oxford, UK, Sept. 24-27, 1990, pp. 73-77.
452
[Jazwinski 70] A.H. Jazwinski, Stochastic Processes and Filtering Theory, Academic Press, New
York, NY and London, UK, 1970.
[Koller et al. 91] D. Koller, N. Heinze, H.-H. Nagel, Algorithmic Characterization of Vehicle
Trajectories from Image Sequences by Motion Verbs, in IEEE Conf. Computer Vision and
Pattern Recognition, Lahaina, Maui, Hawaii, June 3-6, 1991, pp. 90-95.
[Korn 88] A. F. Korn, Towards a Symbolic Representation of Intensity Changes in Images, IEEE
Transactions on Pattern Analysis and Machine Intelligence PAMI-10 (1988) 610-625.
[Lowe 85] D. G. Lowe, Perceptual Organization and Visual Recognition, Kluwer Academic Publishers, Boston, MA, 1985.
[Lowe 87] D. G. Lowe, Three-Dimensional Object Recognition from Single Two-Dimensional
Images, Artificial Intelligence 31 (1987) 355-395.
[Lowe 90] D. G. Lowe, Integrated Treatment of Matching and Measurement Errors for Robust
Model-Based Motion Tracking, in Proc. Int. Conf. on Computer Vision, Osaka, Japan,
Dec. 4-7, 1990, pp. 436-440.
[Lowe 91] D.G. Lowe, Fitting parameterized three-dimensional models to images, IEEE Trans-
actions on Pattern Analysis and Machine Intelligence 13 (1991) 441-450.
[Marslin et al. 91] R.F. Marslin, G.D. Sullivan, K.D. Baker, Kalman Filters in Constrained
Model-Based Tracking, in Proc. British Machine Vision Conference, Glasgow, UK, Sept.
24-26, 1991, pp. 371-374.
[Maybank 90] S. Maybank, Filter based estimates of depth, in Proc. British Machine Vision
Conference, Oxford, UK, Sept. 24-27, 1990, pp. 349-354.
[Murray et al. 89] D.W. Murray, D.A. Castelow, B.F. Buxton, From image sequences to rec-
ognized moving polyhedral objects, International Journal of Computer Vision 3 (1989)
181-208.
[Scales 85] L. E. Scales, Introduction to Non-Linear Optimization, Macmillan, London, UK,
1985.
[Schick & Dickmanns 91] J. Schick, E. D. Dickmanns, Simultaneous estimation of 3D shape
and motion of objects by computer vision, in Proc. IEEE Workshop on Visual Motion,
Princeton, NJ, Oct. 7-9, 1991, pp. 256-261.
[Thompson & Mundy 87] D.W. Thompson, J.L. Mundy, Model-based motion analysis - motion
from motion, in The Fourth International Symposium on Robotics Research, R. Bolles and
B. Roth (eds.), MIT Press, Cambridge, MA, 1987, pp. 299-309.
[Thórhallson 91] T. Thórhallson, Untersuchung zur dynamischen Modellanpassung in monokularen Bildfolgen, Diplomarbeit, Fakultät für Elektrotechnik der Universität Karlsruhe (TH),
durchgeführt am Institut für Algorithmen und Kognitive Systeme, Fakultät für Informatik
der Universität Karlsruhe (TH), Karlsruhe, August 1991.
[Tsai 87] R. Tsai, A versatile camera calibration technique for high accuracy 3D machine vision
metrology using off-the-shelf TV cameras and lenses, IEEE Trans. Robotics and Automation
3 (1987) 323-344.
[Verghese et al. 90] G. Verghese, K.L. Gale, C.R. Dyer, Real-time, parallel motion tracking of
three dimensional objects from spatiotemporal images, in V. Kumar, P.S. Gopalakrishnan,
L.N. Kanal (eds.), Parallel Algorithms for Machine Intelligence and Vision, Springer-Verlag,
Berlin, Heidelberg, New York, 1990, pp. 340-359.
[Worrall et al. 91] A.D. Worrall, R.F. Marslin, G.D. Sullivan, K.D. Baker, Model-Based Track-
ing, in Proc. British Machine Vision Conference, Glasgow, UK, Sept. 24-26, 1991, pp. 310-
318.
[Wu et al. 88] J.J. Wu, R.E. Rink, T.M. Caelli, V.G. Gourishankar, Recovery of the 3-D location
and motion of a rigid object through camera image (an Extended Kalman Filter approach),
International Journal of Computer Vision 3 (1988) 373-394.
[Young & Chellappa 90] G. Young, R. Chellappa, 3-D Motion estimation using a sequence of
noisy stereo images: models, estimation and uniqueness results, IEEE Transactions on
Pattern Analysis and Machine Intelligence PAMI-12 (1990) 735-759.
This article was processed using the LaTeX macro package with ECCV92 style
Tracking Moving Contours Using
Energy-Minimizing Elastic Contour Models
Naonori Ueda and Kenji Mase
1 Introduction
Detecting and tracking moving objects is one of the most fundamental and important
problems in motion analysis. When the actual shapes of moving objects are important,
higher level features like object contours, instead of points, should be used for the track-
ing. Furthermore, since these higher level features make it possible to reduce ambiguity
in feature correspondences, the correspondence problem is simplified.
However, in general, the higher the level of the features, the more difficult their extraction becomes. This results in a tradeoff, which is essentially unsolvable as long as a two-stage processing is employed. Therefore, in order to establish high-level tracking, object models which embody a priori knowledge about the object shapes are utilized [1][2].
On the other hand, Kass et al. [3] have recently proposed active contour models (snakes) for contour extraction. Once the snake is interactively initialized on an object contour in the first frame, it will automatically track the contour from frame to frame. That is, contour tracking by snakes can be achieved. It is a very elegant and attractive approach because it makes it possible to solve the extraction and tracking problems simultaneously. That is, the above tradeoff is completely eliminated.
However, this approach is restricted to the case where the movement and deformation of an object are very small between frames. As also pointed out in Ref. [2], this is mainly due to the excessive flexibility of the spline composing the snake model.
In this paper, we propose a robust contour tracking method which can solve the
above problem while preserving the advantages of snakes. In the proposed method, since
the contour model itself is defined by elastics with moderate "stiffness" which does not
permit local major deformations, the influence of texture and occluding edges in or near
the target contour is minimal. Hence, the proposed method becomes more robust than
the original snake models in that it is applicable to more general tracking problems.
In this paper, we also present a new algorithm for solving energy minimization prob-
lems using dynamic programming technique. Amini et al.[4] have already proposed a
Lecture Notes in Computer Science, Vol. 588
G. Sandini (Ed.)
Computer Vision - ECCV '92
© Springer-Verlag Berlin Heidelberg 1992
2.1 Elastic contour models
A model contour is defined as a polygon with n discrete vertices. That is, the polygonally approximated contour model is represented by an ordered list of its vertices: C = {v_i = (x_i, y_i)}, 1 ≤ i ≤ n. A contour model is constrained by two kinds of "springs" so that it has a moderate "stiffness" which preserves the form of the tracked object contour in the previous frame as much as possible. That is, each side of the polygon is composed of a spring with a restoring force proportional to its expansion and contraction, while the adjacent sides are constrained by another spring with a restoring force proportional to the change of the interior angle. Assume that these springs are at their natural length when the contour model is at the initial contour position {v_i^0}, i = 1…n, in the current frame; at that time, therefore, no spring force is at work. Clearly, the initial position in the current frame corresponds to the tracking result in the previous frame.
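The two-spring construction above can be sketched in code. This is an illustrative sketch, not the authors' implementation; the spring constants p1 and p2 and the interior-angle convention are assumptions of the sketch:

```python
import math

def elastic_energy(contour, rest_contour, p1=1.0, p2=1.0):
    """Elastic energy of a closed polygonal contour model.

    Side springs penalize length changes and angle springs penalize
    interior-angle changes, both relative to the rest contour (the
    tracking result of the previous frame). Both contours must have
    the same number of vertices; p1, p2 are assumed spring constants.
    """
    n = len(contour)

    def side_len(c, i):
        (x1, y1), (x2, y2) = c[i], c[(i + 1) % n]
        return math.hypot(x2 - x1, y2 - y1)

    def angle(c, i):
        # interior angle at vertex i, measured between the two adjacent sides
        (xp, yp), (x0, y0), (xn, yn) = c[i - 1], c[i], c[(i + 1) % n]
        a1 = math.atan2(yp - y0, xp - x0)
        a2 = math.atan2(yn - y0, xn - x0)
        return (a2 - a1) % (2 * math.pi)

    e = 0.0
    for i in range(n):
        e += 0.5 * p1 * (side_len(contour, i) - side_len(rest_contour, i)) ** 2
        e += 0.5 * p2 * (angle(contour, i) - angle(rest_contour, i)) ** 2
    return e
```

Note that a pure translation of the contour leaves both side lengths and interior angles unchanged, so the elastic energy is zero, as the model requires: the springs only resist deformation, not rigid motion.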
2.2 Energy minimization framework
Let {v_i^0}, i = 1…n, denote the tracked contour in the preceding frame. Then our goal is to move and deform the contour model from {v_i^0} to the best position {v_i^*} in the current frame such that the following total energy functional is minimized:

$E_{total} = \sum_{i=1}^{n} \left\{ E_{elastic}(v_i) + E_{field}(v_i) \right\} \quad (1)$
Here, E_elastic is the elastic energy functional derived from the deformation of the contour model and can be defined as:

$E_{elastic}(v_i) = \frac{1}{2} \left\{ p_1 \left( |v_{i+1} - v_i| - |v_{i+1}^0 - v_i^0| \right)^2 + p_2 \left( \theta_i - \theta_i^0 \right)^2 \right\} \quad (2)$

where p_1 and p_2 are the spring constants and θ_i, θ_i^0 denote the current and initial interior angles at v_i.
a long distance. Therefore, it can influence the contour model even if the contour model is remote from the target contour.
Assuming that z(v_i) denotes the height or potential value at v_i on the potential field, the potential energy E_field can easily be defined by the classical gravitational potential energy equation. That is,

$E_{field}(v_i) = m\, g\, z(v_i) \quad (3)$

with m and g constants.
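The reference list cites Rosenfeld and Pfaltz [5] for distance functions on digital pictures; a potential field with this long-range behaviour can be obtained as a distance transform of the edge image. Below is a minimal two-pass city-block sketch, not the paper's code; all names are illustrative:

```python
def distance_potential(edge_map):
    """Two-pass city-block distance transform (in the style of
    Rosenfeld & Pfaltz [5]).

    edge_map: 2-D list of 0/1 values, 1 marking edge pixels of the
    target contour. Returns z: the distance of each pixel to the
    nearest edge pixel, a potential that decreases toward the target
    contour and acts over long distances.
    """
    h, w = len(edge_map), len(edge_map[0])
    INF = h + w  # larger than any possible city-block distance
    z = [[0 if edge_map[y][x] else INF for x in range(w)] for y in range(h)]
    # forward pass: propagate distances from the top-left
    for y in range(h):
        for x in range(w):
            if y > 0:
                z[y][x] = min(z[y][x], z[y - 1][x] + 1)
            if x > 0:
                z[y][x] = min(z[y][x], z[y][x - 1] + 1)
    # backward pass: propagate distances from the bottom-right
    for y in range(h - 1, -1, -1):
        for x in range(w - 1, -1, -1):
            if y < h - 1:
                z[y][x] = min(z[y][x], z[y + 1][x] + 1)
            if x < w - 1:
                z[y][x] = min(z[y][x], z[y][x + 1] + 1)
    return z
```

The gradient of such a field points toward the contour from far away, which is what lets the model attract a contour that starts remote from the target.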
3 Optimization algorithm
From Eqs. (2) and (3), the total energy functional shown in Eq. (1) can be formally brought to the general form:

$E_{total} = \sum_{i=1}^{n} \left[ f_i(v_i) + g_i(v_i, v_{i+1}) + h_i(v_i, v_{i+1}, v_{i+2}) \right] \quad (4)$

with indices taken modulo n. The terms of Eq. (4) that contain v_1 are

$S = f_1(v_1) + g_1(v_1, v_2) + h_1(v_1, v_2, v_3) + h_{n-1}(v_{n-1}, v_n, v_1) + g_n(v_n, v_1) + h_n(v_n, v_1, v_2). \quad (5)$
Then, the minimization of E_total can be written as:

$\min_{V} E_{total} = \min_{V - \{v_1\}} \left\{ (E_{total} - S) + \min_{v_1} S \right\} \quad (6)$

Since S collects all the terms containing v_1, the inner term $\min_{v_1} S$ is a function of v_2, v_3, v_{n-1}, and v_n. Therefore, this minimization is made and stored for all possible assignments of v_2, v_3, v_{n-1}, and v_n. Formally, the minimization can be written as:

$\phi(v_2, v_3, v_{n-1}, v_n) = \min_{v_1} S \quad (7)$

$\min_{V} E_{total} = \min_{V - \{v_1\}} \left\{ (E_{total} - S) + \phi(v_2, v_3, v_{n-1}, v_n) \right\} \quad (8)$
This remaining minimization is of the same form as the original problem, and the function φ(v_2, v_3, v_{n-1}, v_n) can be regarded as a component of the new objective function.
Applying the same minimization procedure to the rest of the variables, v_2, v_3, …, in this order, we can derive the DP equations for 2 ≤ i ≤ n − 4.
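The dynamic-programming idea can be illustrated on a simplified chain energy. The sketch below handles only a first-order open chain E = Σ f_i(v_i) + g_i(v_i, v_{i+1}); the paper's energy also contains the second-order terms h_i and a closed contour, which enlarge the DP state but not the principle:

```python
def dp_minimize(candidates, f, g):
    """Dynamic-programming minimization of the chain-form energy
    E = sum_i f(i, v_i) + g(i, v_i, v_{i+1}) over an open chain.

    candidates[i]: list of possible positions for vertex i.
    f(i, v): unary cost of placing vertex i at v.
    g(i, u, v): pairwise cost between vertex i at u and vertex i+1 at v.
    Returns (minimal energy, optimal vertex assignment).
    """
    n = len(candidates)
    # cost[i][k]: best energy of the chain up to vertex i, with
    # v_i = candidates[i][k]; back stores the minimizing predecessor.
    cost = [[f(0, v) for v in candidates[0]]]
    back = []
    for i in range(1, n):
        row, brow = [], []
        for v in candidates[i]:
            best, arg = min(
                (cost[i - 1][k] + g(i - 1, u, v), k)
                for k, u in enumerate(candidates[i - 1])
            )
            row.append(best + f(i, v))
            brow.append(arg)
        cost.append(row)
        back.append(brow)
    # backtrack the optimal assignment
    k = min(range(len(cost[-1])), key=lambda j: cost[-1][j])
    path = [k]
    for brow in reversed(back):
        path.append(brow[path[-1]])
    path.reverse()
    return cost[-1][k], [candidates[i][path[i]] for i in range(n)]
```

With m candidates per vertex, each first-order DP step costs O(m²); carrying the second-order h_i terms raises the state to vertex pairs and the step cost accordingly, which is the price of the stiffness terms.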
4 Experiments
The proposed contour tracking method has been tested experimentally on several synthetic and real scenes. Figure 1 compares the snake model (Fig. 1a) with our model (Fig. 1b) when occluding edges exist. The scene in Fig. 1 is an actual indoor scene and corresponds to one frame from a sequence of a moving bookend on a turntable over a static grid. Since the snake model is influenced by occluding edges, it was not able to track the target contour. On the other hand, the proposed model successfully tracked it without being influenced by the occluding edges. We also obtained successful results for the tracking of a moving car, a deforming ball, and so on.
In this approach, since the contour model itself moves toward the target contour,
point correspondences are established between frames. That is, correspondence based
optical flows are also obtained. Therefore, feature point trajectories over several frames
can easily be obtained by the proposed method.
5 Conclusions
Fig. 1. Comparison of the results of tracking a contour with occluding edges. I denotes the number of iterations. (a) Tracking by the snake model: I = 0, 4, 12 (result). (b) Tracking by the proposed model: I = 0, 2, 7 (result).
References
1. Dreschler L. and Nagel H.-H.: "Volumetric model and 3D trajectory of a moving car derived
from monocular TV-frame sequence of a street scene", in Proc. IJCAI 81, 1981.
2. Yuille A. L., Cohen D. S. and Hallinan P. W.: "Feature extraction from faces using deformable
templates", in Proc. CVPR 89, pp. 104-109, 1989.
3. Kass M., Witkin A. and Terzopoulos D.: "Snakes: Active contour models", Int. J. Comput.
Vision, 1, 3, pp. 321-331, 1988.
4. Amini A. A., Weymouth T. E. and Jain R. C.: "Using dynamic programming for solving
variational problems in vision", IEEE Trans. Pattern Anal. Machine Intell., PAMI-12, 9,
pp. 855-867, 1990.
5. Rosenfeld A. and Pfaltz J. L.: "Distance functions on digital pictures", Pattern Recognition,
1, pp. 33-61, 1968.
This article was processed using the LaTeX macro package with ECCV92 style
Tracking Points on Deformable Objects Using Curvature Information*
Isaac Cohen, Nicholas Ayache, and Patrick Sulger
Abstract
The objective of this paper is to present a significant improvement to the approach of
Duncan et al. [1, 8] to analyze the deformations of curves in sequences of 2D images.
This approach is based on the paradigm that high curvature points usually possess an
anatomical meaning, and are therefore good landmarks to guide the matching process,
especially in the absence of a reliable physical or deformable geometric model of the
observed structures.
Like Duncan's team, we therefore propose a method based on the minimization of an
energy which tends to preserve the matching of high curvature points, while ensuring a
smooth field of displacement vectors everywhere.
The innovation of our work stems from the explicit description of the mapping between
the curves to be matched, which ensures that the resulting displacement vectors actually
map points belonging to the two curves, which was not the case in Duncan's approach.
We have actually implemented the method in 2-D and we present the results of the
tracking of a heart structure in a sequence of ultrasound images.
1 Introduction
Non-rigid motion of deformable shapes is becoming an increasingly important topic in
computer vision, especially for medical image analysis. Within this topic, we concentrate
on the problem of tracking deformable objects through a time sequence of images.
The objective of our work is to improve the approach of Duncan et al. [1, 8] to
analyze the deformations of curves in sequences of 2D images. This approach is based
on the paradigm that high curvature points usually possess an anatomical meaning, and
are therefore good landmarks to guide the matching process. This is the case for instance
when studying deforming patient skulls (see for instance [7, 9]), when matching faces of the same patient taken at different ages, when matching faces of multiple patients, or when analyzing images of a beating heart. In these cases, many lines of extremal curvature (or ridges) are stable features which can be reliably tracked between the images (on a face they will correspond for instance to the nose, chin and eyebrow ridges; on a skull to the orbital, sphenoid, falx, and temporal ridges; on a heart ventricle to the papillary muscle, etc.).
Like Duncan's team, we therefore propose a method based on the minimization of an
energy which tends to preserve the matching of high curvature points, while ensuring a
smooth field of displacement vectors everywhere.
The innovation of our work stems from the explicit description of the mapping between
the curves to be matched, which ensures that the resulting displacement vectors actually
* This work was partially supported by Digital Equipment Corporation.
map points belonging to the two curves, which was not the case in Duncan's approach.
Moreover, the energy minimization is obtained through the mathematical framework of
Finite Element analysis, which provides a rigorous and efficient numerical solution. This
formulation can be easily generalized in 3-D to analyze the deformations of surfaces.
Our approach is particularly attractive in the absence of a reliable physical or de-
formable geometric model of the observed structures, which is often the case when
studying medical images. When such a model is available, other approaches would in-
volve a parametrization of the observed shapes [14], a modal analysis of the displacement
field [12], or a parametrization of a subset of deformations [3, 15]. In fact we believe that
our approach can always be used when some sparse geometric features provide reliable
landmarks, either as a preprocessing to provide an initial solution to the other approaches,
or as a post-processing to provide a final smoothing which preserves the matching of re-
liable landmarks.
Let C_P and C_Q be two boundaries of the image sequence; the contour C_Q is obtained by a non-rigid (or elastic) transformation of the contour C_P. The curves C_P and C_Q are parameterized by P(s) and Q(s') respectively.
The problem is to determine for each point P on C_P a corresponding point Q on C_Q. To do this, we must define a similarity measure which compares locally the neighborhoods of P and Q.
As explained in the introduction, we assume that points of high curvature correspond
to stable salient regions, and are therefore good landmarks to guide the matching of the
curves. Moreover, we can assume as a first order approximation, that the curvature itself
remains invariant in these regions. Therefore, we can introduce an energy measure in
these regions of the form:
$E_{curve} = \int \left( K_Q(s') - K_P(s) \right)^2 ds \quad (1)$

where K_P and K_Q denote the curvatures and s, s' parameterise the curves C_P and C_Q
respectively. In fact, as shown by [8, 13], this is proportional to the energy of deformation
of an isotropic elastic planar curve.
We also wish the displacement field to vary smoothly around the curve, in particular
to insure a correspondence for points lying between two salient regions. Consequently we
consider the following functional (similar to the one used by Hildreth to smooth a vector
flow field along a contour [11]) :
$E = E_{curve} + R\, E_{regul} \quad (2)$

where

$E_{regul} = \int_{C_P} \left\| \frac{\partial \overrightarrow{PQ}}{\partial s} \right\|^2 ds$

measures the variation of the displacement vector PQ along the curve C_P, and ‖·‖ denotes the norm associated to the euclidean scalar product (·,·) in the space R².
The regularization parameter R(s) depends on the shape of the curve C_P. Typically, R is inversely proportional to the curvature at P, to give a larger weight to E_curve in salient regions and conversely to E_regul at points in between. This is done continuously, without totally annihilating the weight of either of these two energies (see [4]).
Given two curves C_P and C_Q parameterized by s ∈ [0, 1] and s' ∈ [0, α] (where α is the length of the curve C_Q), we have to determine a function f : [0, 1] → [0, α]; s ↦ s'

satisfying $f(0) = 0$ and $f(1) = \alpha \quad (3)$

and $f = \mathrm{ArgMin}(E(f)) \quad (4)$

where

$E(f) = \int_{C_P} \left( K_Q(f(s)) - K_P(s) \right)^2 ds + R \int_{C_P} \left\| \frac{\partial}{\partial s} \left( Q(f(s)) - P(s) \right) \right\|^2 ds \quad (5)$
The condition (3) means that the displacement vector is known for one point of the curve.
In the model defined above we assumed that:
- the boundaries have already been extracted,
- the curvatures K are known on the pair of contours (see [9]).
These necessary data are obtained by preprocessing the image sequence (see [4] for more details).
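For illustration, a discrete version of the energy of Eq. (5) can be evaluated as follows. This is a finite-difference sketch under the assumption that both curves are sampled at fixed indices and f is given as an index map; all names are illustrative, not the paper's implementation:

```python
def matching_energy(KP, KQ, P, Q, f, R=1.0):
    """Discrete counterpart of E(f): a curvature-difference term plus
    R times a smoothness term on the displacement field.

    KP[i], P[i]: curvature and (x, y) position sampled along C_P.
    KQ[j], Q[j]: the same quantities sampled along C_Q.
    f[i]: index on C_Q matched to sample i of C_P.
    """
    n = len(P)
    # curvature term: penalize curvature mismatch at matched points
    e_curve = sum((KQ[f[i]] - KP[i]) ** 2 for i in range(n))
    # smoothness term: finite differences of the displacement vector PQ
    e_regul = 0.0
    for i in range(n - 1):
        d0 = (Q[f[i]][0] - P[i][0], Q[f[i]][1] - P[i][1])
        d1 = (Q[f[i + 1]][0] - P[i + 1][0], Q[f[i + 1]][1] - P[i + 1][1])
        e_regul += (d1[0] - d0[0]) ** 2 + (d1[1] - d0[1]) ** 2
    return e_curve + R * e_regul
```

For identical curves matched by the identity map, both terms vanish, which is the global minimum the variational method seeks.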
The characterization of a function f satisfying f = ArgMin(E(f)) and the condition (3) is performed by a variational method. This method characterizes a local minimum f of the functional E(f) as the solution of the Euler-Lagrange equation ∇E(f) = 0, leading to the solution of the partial differential equation:
$f'' \left\| Q'(f) \right\|^2 + K_P \left( N_P, Q'(f) \right) + \frac{1}{R} \left[ K_P - K_Q(f) \right] K'_Q(f) = 0$
+ boundary conditions (i.e. condition (3)). $\quad (6)$
where Q is a parametrization of the curve C_Q, Q'(f) the tangent vector of C_Q, K'_Q the derivative of the curvature of the curve C_Q, and N_P is the normal vector to the curve C_P.
The term $\int_{C_P} \left( K_Q(f(s)) - K_P(s) \right)^2 ds$ measures the difference between the curvature of the two curves. This induces a non-convexity of the functional E. Consequently, solving
the partial differential equation (6) will give us a local minimum of E. To overcome this
problem we will assume that we have an initial estimation f_0 which is a good approximation of the real solution (the definition of the initial estimation f_0 will be explained
later). This initial estimation defines a starting point for the search of a local minimum of
the functional E. To take into account this initial estimation we consider the associated
evolution equation:
$\begin{cases} \dfrac{\partial f}{\partial t} + f''(s) \left\| Q'(f(s)) \right\|^2 + K_P(s) \left( N_P(s), Q'(f(s)) \right) + \dfrac{1}{R} \left[ K_P(s) - K_Q(f(s)) \right] K'_Q(f(s)) = 0 \\ f(0, s) = f_0(s) \quad \text{(initial estimation)} \end{cases} \quad (7)$
Equation (7) can also be seen as a gradient descent algorithm toward a minimum of the energy E; it is solved by a finite element method and leads to the solution of a sparse linear system (see [4] for more details).
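The gradient-descent reading of the evolution equation can be sketched with an explicit forward-Euler scheme. The paper actually uses a finite element discretization and a sparse linear solver; this toy iteration only illustrates the descent interpretation, and `rhs` is an assumed user-supplied function, not part of the paper:

```python
def evolve(f0, rhs, dt=0.01, steps=1000):
    """Explicit time discretization of an evolution equation of the
    form df/dt + G(f) = 0, i.e. forward-Euler steps
    f^{k+1} = f^k - dt * G(f^k).

    f0:  initial estimation, sampled at discrete points (a list).
    rhs: function returning G(f) as a list of the same length.
    """
    f = list(f0)
    for _ in range(steps):
        g = rhs(f)
        f = [fi - dt * gi for fi, gi in zip(f, g)]
    return f
```

With `rhs` the gradient of a convex energy, the iterates converge to its minimizer for a small enough step dt; for the non-convex E of Eq. (5) they converge only to the local minimum nearest the initial estimation f_0, which is why the quality of f_0 matters in the next section.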
3.1 Determining the Initial Estimation f_0
The definition of the initial estimation f_0 has an effect upon the convergence of the algorithm. Consequently a good estimation of the solution f will lead to fast convergence.
The definition of f_0 is based on the work of Duncan et al. [8]. The method is as follows:
Let s_i ∈ [0, 1], i = 1…n, be a subdivision of the interval [0, 1]. For every point P_i = (X(s_i), Y(s_i)) of the curve C_P we search for a point Q_i = (X(s'_i), Y(s'_i)) on the curve C_Q, and the function f_0 is then defined by f_0(s_i) = s'_i.
To do so we have to define a pair of points P_0, Q_0 which correspond to each other. But first of all, let us describe the search method. In the following, we identify a point and its arc length (i.e. the point s_i denotes the point P_i of the curve C_P such that P(s_i) = P_i, where P is the parametrisation of the curve C_P).
With each point s_i of C_P we associate a set of candidates S_i on the curve C_Q. The set S_i defines the search area. This set is defined by the point s'_i which is the point of C_Q nearest in distance to s_i, along with (N_search − 1)/2 points of the curve C_Q on each side of s'_i (where N_search is a given integer defining the length of the search area).
Among these candidates, we choose the point which minimizes the deformation en-
ergy (1).
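The local search for f_0 described above can be sketched as follows. This is an illustrative sketch working on index maps; the window-size parameter (here `n_search`) and all other names are assumptions, not the paper's code:

```python
import math

def initial_estimation(P, Q, KP, KQ, n_search=7):
    """Initial correspondence f0 by local search: for each sample of
    C_P, take the nearest point on C_Q plus (n_search - 1)/2
    neighbours on each side, and keep the candidate minimizing the
    curvature-difference energy of Eq. (1).

    P, Q:   (x, y) samples along C_P and C_Q (closed curves).
    KP, KQ: curvatures at those samples.
    Returns f0 as a list of indices into Q.
    """
    m = len(Q)
    half = (n_search - 1) // 2
    f0 = []
    for i, p in enumerate(P):
        # nearest point on C_Q in euclidean distance
        j0 = min(range(m), key=lambda j: math.dist(p, Q[j]))
        # within the window around j0, minimize the curvature mismatch
        window = [(j0 + d) % m for d in range(-half, half + 1)]
        f0.append(min(window, key=lambda j: (KQ[j] - KP[i]) ** 2))
    return f0
```

Because the search is purely local, this sketch fails in exactly the situations discussed next, where the nearest point in distance lies on the wrong part of C_Q; the arc-length criterion below is what repairs it.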
In some situations this method fails, and the obtained estimation f_0 is meaningless, leading to a bad solution. Figure 1 shows an example where the method described in [8] fails. This is due to the bad computation of the search area S_i.
Fig. 1. This example shows the problem that can occur in the computation of the initial estimate
based only on the search in a given area. The initial estimation of the displacement field and
the obtained solution.
To compute this set more accurately, we have added a criterion based on arc length. Consequently, the set defining the search area S_i is defined by the point s'_i which is the point of C_Q nearest in distance to s_i such that s'_i ≃ α s_i, along with (N_search − 1)/2 points of the curve C_Q on each side of s'_i. Figure 2 illustrates the use of this new definition of the set S_i for the same curves given in Fig. 1. This example shows the ability to handle more general situations with this new definition of the search area S_i.
Fig. 2. In the same case of the previous example, the computation of the initial estimate based
on the local search and the curvilinear abscissa, gives a good estimation fo, which leads to an
accurate computation of the displacement function.
As noted above, the search area S_i can be defined only if we have already chosen a point P_0 and its corresponding point Q_0. The most salient features in a temporal sequence undergo small deformations at each time step; thus a good method for choosing the point P_0 is to take the most distinctive point, so that the search for the corresponding point becomes a trivial task. Consequently the point P_0 is chosen among the points of C_P with maximal curvature. In many cases this method provides a unique point P_0. Once we have chosen the point P_0, the point Q_0 is found by the local search described above.
4 Experimental Results
The method was tested on a set of synthetic and real image sequences. The results are given by specifying at each discretization point P_i, i = 1…N, of the curve C_P the displacement vector u_i = P_iQ_i. At each point P_i the arrow represents the displacement vector u_i.
The first experiments were made on synthetic data. In Fig. 3, the curve CQ (a square)
is obtained by a similarity transformation (translation, rotation and scaling) of the curve
Cp (a rectangle). The obtained displacement field and a plot of the function f are
given. We can note that the algorithm computes accurately the displacements of the
Fig. 3. The rectangle (in grey) is deformed by a similarity (translation, rotation and scaling)
to obtain the black square. In this figure we represent the initial estimation of the displacement
vector of the curves, the obtained displacement field and the plot of the solution f.
four corners. This result was expected since the curves Cp and CQ have salient features
which help the algorithm to compute accurately the displacement vector u_i. Figure 4 gives an example of the tracking of each point on an ellipse deformed by a similarity.
In this case, the points of high curvature are matched together although the curvature
varies smoothly.
As described in Section 3.1, the computation of the initial estimation is crucial. In the following experiment we have tried to define the maximal error that can be made in the estimation of f_0 without disturbing the final result. In Fig. 5 we have added gaussian noise (σ = 0.05) to a solution f obtained by solving (7). This noisy function was taken as an initial estimation for Eq. (7). After a few iterations the solution f is recovered (Fig. 5).
It appears that if |f − f_0| ≤ 4h (where h is the space discretization step), then starting with f_0 the iterative scheme (7) will converge toward the solution f. The inequality |f − f_0| ≤ 4h means that for each point P on the curve C_P the corresponding point Q can be determined with an error of 4 points over the grid of the curve C_Q.
Fig. 4. Another synthetic example; in this case the curvature along the curves C_P and C_Q varies smoothly. As a consequence, in the computation of the initial estimation f_0, several points of the curve C_P (in grey) often match the same point of C_Q (in black). We remark that, for the optimal solution obtained by the algorithm, each point of the black curve matches a single point of the grey curve, and that maximum curvature points are matched together.
Fig. 5. In this example we have corrupted an obtained solution with gaussian noise (σ = 0.05) and considered this corrupted solution as an initial estimate f_0. The initial displacement field, the initial estimate f_0 and the obtained solution are shown in this figure.
The tracking of the moving boundaries of the valve of the left ventricle on an ultrasound image helps to diagnose some heart diseases. The segmentation of the moving boundaries over the whole sequence was done by the snake model [6, 2, 10]. In Fig. 6 a global tracking of a part of the image sequence is shown. This set of curves is processed (as described in [4]) to obtain the curvatures and the normal vector of the curves. Figure 7 shows a temporal tracking of some points of the valve in this image sequence. The results are presented by pairs of successive contours. One can see that the results meet perfectly the objectives of preserving the matching of high curvature points while ensuring a smooth displacement field.
Fig. 6. Temporal tracking of the mitral valve, obtained by the snake model [10], for images 1
to 6.
5 3-D Generalization
In this section we give a 3-D generalization of the algorithm described in the previous sections. In 3-D imaging we must track points located on surfaces, since the object boundaries are surfaces (as in [1]). In [16] the authors have shown, on a set of experimental data, that the extrema of the larger principal curvature often correspond to significant intrinsic features (i.e. invariant by the group of rigid transformations) which might characterize the surface structure, even in the presence of small anatomic deformations.
Let S_P and S_Q be two surfaces parameterized by P(s, r) and Q(s', r'), and let κ_P denote the larger principal curvature of the surface S_P at point P.
Thus the matching of two surfaces, leads to the following problem:
find a function
$f : \mathbb{R}^2 \to \mathbb{R}^2;\ (s, r) \mapsto (s', r')$
which minimizes the functional:

$E(f) = \int_{S_P} \left( \kappa_Q(f(s,r)) - \kappa_P(s,r) \right)^2 ds\, dr + R \int_{S_P} \left( \left\| \frac{\partial \overrightarrow{PQ}}{\partial s} \right\|^2 + \left\| \frac{\partial \overrightarrow{PQ}}{\partial r} \right\|^2 \right) ds\, dr$

where ‖·‖ denotes the euclidean norm in R³. Its resolution by a finite element method
can be done as in [5], and the results should be compared to those obtained by [1]. This
generalization has not been implemented yet.
Fig. 7. Applying the point-tracking algorithm to the successive pairs of contours of Fig. 6 (from left to right and top to bottom).
6 Conclusion
References
1. A. Amini, R. Owen, L. Staib, P. Anandan, and J. Duncan. Non-rigid motion models for
tracking the left ventricular wall. Lecture Notes in Computer Science: Information Processing
in Medical Images. Springer-Verlag, 1991.
2. Nicholas Ayache, Isaac Cohen, and Isabelle Herlin. Medical image tracking. In Active
Vision, Andrew Blake and Alan Yuille, editors, chapter 20. MIT Press, 1992. In press.
3. Fred L. Bookstein. Principal warps: Thin-plate splines and the decomposition of deformations.
IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-11(6):567-585, June 1989.
4. Isaac Cohen, Nicholas Ayache, and Patrick Sulger. Tracking points on deformable objects
using curvature information. Technical Report 1595, INRIA, March 1992.
5. Isaac Cohen, Laurent D. Cohen, and Nicholas Ayache. Using deformable surfaces to segment
3-D images and infer differential structures. Computer Vision, Graphics, and Image
Processing: Image Understanding, 1992. In press.
6. Laurent D. Cohen and Isaac Cohen. A finite element method applied to new active contour
models and 3-D reconstruction from cross sections. In Proc. Third International Conference
on Computer Vision, pages 587-591. IEEE Computer Society Conference, December 1990.
Osaka, Japan.
7. Court B. Cutting. Applications of computer graphics to the evaluation and treatment of
major craniofacial malformation. In Jayaram K. Udupa and Gabor T. Herman, editors, 3-D
Imaging in Medicine. CRC Press, 1989.
8. J.S. Duncan, R.L. Owen, L.H. Staib, and P. Anandan. Measurement of non-rigid motion
using contour shape descriptors. In Proc. Computer Vision and Pattern Recognition, pages
318-324. IEEE Computer Society Conference, June 1991. Lahaina, Maui, Hawaii.
9. A. Guéziec and N. Ayache. Smoothing and matching of 3D-space curves. In Proceedings
of the Second European Conference on Computer Vision 1992, Santa Margherita Ligure,
Italy, May 1992.
10. I.L. Herlin and N. Ayache. Features extraction and analysis methods for sequences of
ultrasound images. In Proceedings of the Second European Conference on Computer Vision
1992, Santa Margherita Ligure, Italy, May 1992.
11. Ellen Catherine Hildreth. The Measurement of Visual Motion. The MIT Press, Cambridge,
Massachusetts, 1984.
12. Bradley Horowitz and Alex Pentland. Recovery of non-rigid motion and structures. In
Proc. Computer Vision and Pattern Recognition, pages 325-330. IEEE Computer Society
Conference, June 1991. Lahaina, Maui, Hawaii.
13. L.D. Landau and E.M. Lifshitz. Theory of Elasticity. Pergamon Press, Oxford, 1986.
14. Dimitri Metaxas and Demetri Terzopoulos. Constrained deformable superquadrics and
nonrigid motion tracking. In Proc. Computer Vision and Pattern Recognition, pages 337-
343. IEEE Computer Society Conference, June 1991. Lahaina, Maui, Hawaii.
15. Sanjoy K. Mishra, Dmitry B. Goldgof, and Thomas S. Huang. Motion analysis and epicardial
deformation estimation from angiography data. In Proc. Computer Vision and
Pattern Recognition, pages 331-336. IEEE Computer Society Conference, June 1991. Lahaina,
Maui, Hawaii.
16. O. Monga, N. Ayache, and P. Sander. From voxel to curvature. In Proc. Computer Vision
and Pattern Recognition, pages 644-649. IEEE Computer Society Conference, June 1991.
Lahaina, Maui, Hawaii.
An Egomotion Algorithm Based on the Tracking of Arbitrary Curves
1 Introduction
We are interested in the analysis of non polyhedral scenes using a camera in motion. We
want in particular to determine the movement of the camera, relatively to the object
which is observed, by using visual cues only. This problem is known as the egomotion
problem.
The case of polyhedral objects is already well understood, and one knows how to
determine the motion parameters of the camera by tracking points [FLT87, WHA89,
TF87, LH81] or straight lines [LH86, FLT87] throughout a sequence of images. Dealing
with non polyhedral objects means that contours extracted from the images are no longer
necessarily straight. The problem is therefore expressed as the estimation of the motion
parameters of the camera using arbitrary curves.
O. Faugeras [Fau90] pioneered the field by working on a more general problem: determine the movement and the deformation of a curve whose arclength is constant (the curve is perfectly flexible but not extensible). He is able to conclude that this estimation is impossible, but that a solution exists in the restricted case of rigid curves, when the movement is reduced to a rigid movement. The approach described here derives a constraint on the motion that can be set at each point of the tracked curve and therefore provides a redundant set of equations which allows motion and acceleration to be extracted. The equations are of the 5th degree in the unknowns.
Faugeras has recently derived the same type of conclusion [Fau91]: for rigid curves the full real motion field does not have to be computed, and this leads to 5th-degree equations. This paper is therefore just a simpler way to reach this point. However, we show here a result hidden by the mathematics: the parameterization of the spatiotemporal surface
has to be close to the epipolar parameterization in order to get accurate results. Such an
* This work was partially supported by the Esprit project First and the French project Orasis
within the GDR-PRC "Communication Homme-Machine".
observation was already made by Blake and Cipolla [BC90] for the problem of surface
reconstruction from motion.
This paper first discusses what contours are and introduces some notations and
concepts such as the spatiotemporal surface. The next section then provides the basic
equation, which is mathematically transformed into an equation in the motion parameters,
and the algorithm for computing the motion is derived. Section 5 discusses the results
obtained on both synthetic and real data; it highlights the parameterization issue and the
quality of the motion estimation.
2 Notations
2.1 Contour classification
Non-polyhedral scenes present multiple types of curves to the viewer. Contours in the
image plane are the projections of particular curves on the surface of the observed objects.
These curves can be classified into categories whose intrinsic properties differ, each
requiring a specific treatment.
Discontinuity curves: A discontinuity curve is a curve on the surface of an object
where the gradient of the surface is discontinuous. It is therefore a frontier between two
C^1 surfaces. The edges of a polyhedron belong to this category.
Extremal curves: An extremal curve is a curve on the surface of an object such
that the line of sight is tangent to the surface.
Spatiotemporal surface: When the camera moves relative to the object, the perceived
contours move on the image plane. These moving contours, stacked on top of
each other, describe a surface improperly called a spatiotemporal surface; spatiospatial
would be a more appropriate name, since the displacement of the contours is due to the
displacement of the camera and not to time. The spatiotemporal surface represents the
integration of all observations of the object. We will prove that, under certain conditions,
this surface is sufficient to determine the motion parameters of the camera relative to
the object.
2.2 Notations of the problem
Case of an extremal curve:
Fig. 2. Case of an extremal curve
When the camera moves relative to the surface of an object (Y), a set of contours
is observed in the camera reference frame, creating the spatiotemporal surface. A curve
T(s0, t) at s0 = constant that passes through a point p = T(s0, t0) of that spatiotemporal
Case of a discontinuity curve:
Fig. 3. s = constant curve for a discontinuity curve (Y)
In the case of a discontinuity curve, the general configuration is that of Figure 3, where
the object (Y), a curve of discontinuity, is reduced to the arc corresponding to r(s, t0),
and where the curve r(s0, t) is a subset of this arc: for an arbitrary parameterization
of the spatiotemporal surface, the curve T(s0, t) indeed corresponds to a curve r(s0, t)
necessarily placed on that arc (and possibly locally degenerated into a single point). This
property enables us to state the fundamental property for egomotion:
The curves r(s, t0) and r(s0, t) both pass through the point P and correspond locally to
the same 3D arc. Their differentials in s and t at the point r(s0, t0) are therefore parallel,
which is expressed by:
$$r_s \wedge {}^{O}r_t = 0 \qquad (1)$$
This constraint, expressed in the frame {C}, becomes an equation that relates only mea-
sures and motion parameters, which permits the computation of a solution for the motion
parameters. Notice that there is a particular case in which the parameterization r(s0, t)
leads locally to a constant function with respect to time. From one image to the next,
the point P is in correspondence with itself, and equation (1) again holds: ${}^{O}r_t$ equals 0.
This case corresponds to the epipolar parameterization [BC90].
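As a hedged illustration (not the authors' code), constraint (1) can be checked numerically on a sampled spatiotemporal surface; the sampling `r[k, i]`, the step sizes, and the function name are assumptions introduced here:

```python
import numpy as np

# Hypothetical sketch: evaluate constraint (1) on a sampled curve that is
# fixed in its own frame {O}.  r[k, i] holds the 3-D position of sample i
# of the contour at time step k; with the "identity" correspondence (each
# point matched with itself), the s- and t-tangents must be parallel, so
# their cross product must vanish.

def constraint_residual(r, ds=1.0, dt=1.0):
    """Norm of r_s x r_t at interior samples of the spatiotemporal surface."""
    r_s = (r[:, 2:] - r[:, :-2]) / (2 * ds)      # tangent along the curve
    r_t = (r[2:, :] - r[:-2, :]) / (2 * dt)      # tangent along time
    cross = np.cross(r_s[1:-1], r_t[:, 1:-1])    # shape (T-2, N-2, 3)
    return np.linalg.norm(cross, axis=-1)

# A static 3-D arc observed over 5 "time" steps: the curve does not move
# in {O}, so r_t = 0 and the residual is exactly zero.
s = np.linspace(0.0, 1.0, 20)
arc = np.stack([np.cos(s), np.sin(s), 0.3 * s], axis=-1)
r = np.repeat(arc[None, :, :], 5, axis=0)
print(np.allclose(constraint_residual(r), 0.0))  # True
```

For a real sequence, the residual would be driven to zero over the motion unknowns rather than evaluated at a known configuration.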
4 Mathematical analysis
The egomotion problem is related to kinematics and differential geometry. We introduce
a few results from kinematics before actually solving the problem at hand. The constraint
(1) is indeed expressed in the frame {O}, while it should be expressed in the frame {C},
where the measures are known.
4.1 Kinematics of the solid
Given a vector U, function of t, mobile in {C}: $^{C}U_t$ denotes the derivative of U in {C},
and $^{O}U_t$ the derivative of U in {O}; $^{C}({}^{O}U_t)$ is therefore the derivative of U in {O},
expressed in {C}.
The two key equations of kinematics concern the derivative of U in {O}, at first
and second order ($\wedge$ denotes the cross product):
$$^{O}U_t = {}^{O}R_C\,{}^{C}U_t + \Omega_{C/O} \wedge U$$
$$^{O}U_{tt} = {}^{O}R_C\,{}^{C}U_{tt} + 2\,\Omega_{C/O} \wedge U_t + \Omega_{C/O} \wedge (\Omega_{C/O} \wedge U) + \dot{\Omega}_{C/O} \wedge U$$
These equations, expressed in {C}, simplify, and the rotation matrix $^{O}R_C$ disappears:
$$^{C}({}^{O}U_t) = {}^{C}U_t + {}^{C}\Omega_{C/O} \wedge {}^{C}U \qquad (2)$$
$$^{C}({}^{O}U_{tt}) = {}^{C}U_{tt} + 2\,{}^{C}\Omega_{C/O} \wedge {}^{C}U_t + {}^{C}\Omega_{C/O} \wedge ({}^{C}\Omega_{C/O} \wedge {}^{C}U) + {}^{C}\dot{\Omega}_{C/O} \wedge {}^{C}U \qquad (3)$$
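A minimal sketch of equation (2), the first-order transport rule; the function name and the toy vectors are my own assumptions, not part of the paper:

```python
import numpy as np

# Sketch of equation (2): the derivative of U in {O}, expressed in {C},
# equals the derivative in {C} plus the cross product with the angular
# velocity of {C} relative to {O}.  All quantities are expressed in {C}.

def derivative_in_O(U, U_t, omega):
    """c(oU_t) = cU_t + cOmega_{C/O} x cU   (equation (2))."""
    return U_t + np.cross(omega, U)

# Toy check: a vector fixed in {O} (oU_t = 0) seen from a frame rotating
# with omega appears to move in {C} with cU_t = -omega x cU.
omega = np.array([0.0, 0.0, 0.1])
U = np.array([1.0, 0.0, 0.0])
U_t = -np.cross(omega, U)
print(derivative_in_O(U, U_t, omega))  # [0. 0. 0.]
```

Equation (3) is the same rule applied twice, which produces the Coriolis, centripetal, and angular-acceleration terms.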
4.2 Egomotion equations
The key equation for the egomotion problem is the constraint (1), which expresses the
fact that the observed curve is fixed in its reference frame {0}:
$$r_s \wedge {}^{O}r_t = 0$$
Two independent scalar equations can be extracted for each point considered. It is pos-
sible to obtain a unique equivalent equation by stating that the norm of the vector is
zero.
Equation (1) is expressed in {O}; we have to express each term in {C}, using the
kinematics results presented earlier, in order to make the motion parameters explicit.
One degenerate case exists, in which the constraint vanishes identically. It corresponds
to the case where the camera motion is in the plane tangent to the surface. No information
on the movement can then be extracted for this particular point.
4.3 Algorithmic solution
Equation (1) is a differential equation whose unknowns correspond to four vectors: the
translational velocity and acceleration expressed in {C}, and the angular velocity
$^{C}\Omega_{C/O}$ and acceleration $^{C}\dot{\Omega}_{C/O}$.
It is possible to solve this problem using finite differences. At each point, constraint
(1) is evaluated. The correct motion is the one that minimizes the sum of the constraints
over all available points. Degenerate cases may arise where multiple solutions exist;
nevertheless, by using a least-squares approach instead of one with the minimal number
of points, we most likely avoid this problem.
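The redundant-equations idea can be sketched with a linearized stand-in system (the real constraint is a degree-5 polynomial in the 6 motion unknowns, which is why a good initial guess matters); the matrix `J` and the synthetic data below are assumptions for illustration only:

```python
import numpy as np

# Sketch of the over-determined least-squares setup: each contour point
# contributes one constraint row on the 6 motion unknowns; with many more
# points than unknowns, the system is solved in the least-squares sense.

rng = np.random.default_rng(0)
J = rng.normal(size=(50, 6))         # one linearized constraint row per point
true_motion = rng.normal(size=6)
b = J @ true_motion                  # noise-free constraint values

estimate, *_ = np.linalg.lstsq(J, b, rcond=None)
print(np.allclose(estimate, true_motion))  # True
```

With the actual polynomial constraint, the same normal-equations structure appears inside each iteration of a nonlinear minimizer started from the approximate motion.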
Solving the equations requires initial values of the angular and translational velocities, $({}^{C}\Omega_{C/O})_0$ and $({}^{C}({}^{O}V))_0$.
Equation (1) is in fact a ratio of polynomial functions, and it is possible to transform
it into a 5th-degree polynomial constraint in its 6 unknowns. This is why a good
approximation of the solution must be known in order to converge toward the right
values.
This result is to be compared with that of O. Faugeras [Fau91], where the constraint
is of similar complexity but intuitively hard to comprehend.
5 Experimental results
The first part of the experiments uses synthetic data in order to validate the theoretical
approach as well as the implementation; the second part deals with real data. The
algorithm's sensitivity to noise is also analyzed.
5.1 Synthetic data
The contour used for testing is an arc of an ellipse moving relative to the camera. A purely
translational movement is used first, and a rotation is then added. The main axes
of the ellipse have lengths 2 and 2.6 meters, and the plane containing the ellipse lies at
a distance of 8 meters from the camera.
Finite differences require initial values of the motion parameters to be known at time
step t = 0. Exact values of the movement at t = 0 are used and the motion parameters
for the translation and the rotation are then computed at time step 1.
Remark: For a linear velocity of 0.4 meter per time unit, the displacement of the
contour on the image plane is of the order of 100 pixels, which is very far from the original
hypothesis of infinitesimal displacement.
For the same movement but an arbitrary correspondence, for instance a direction
orthogonal to the contour, results degrade more quickly than with the epipolar
correspondence. If the movement of the camera is approximately known, it is suggested
to use it; if the planar movement of the contour in the image plane can be estimated, it
is wise to use it to establish the correspondence.
Noise on the contour obviously has very little influence on the computed motion
parameters.
Figure 4 presents 4 images of the sequence of the pear. The thicker contours are the ones
used to estimate the motion parameters.
Table 4 shows the egomotion results on the pear sequence. A priori values of the
motion parameters (obtained with the robot sensors) are plotted against the computed
values.
6 Conclusions
A new egomotion technique was presented which can be applied when no point or
straight-line correspondences are available. It generalizes egomotion to the case of
arbitrarily shaped contours, which is especially valuable for non-polyhedral objects. The
computation uses a very simple finite-difference scheme and quickly provides a good
estimate of the motion parameters. This technique is robust against noise on the contours
since it uses a least-squares approach. The experiments we conducted on synthetic
as well as real data show the validity of the approach. Finite differences do not, however,
allow the computation to be performed over a long sequence, since the errors accumulate.
It has also been observed experimentally that a close-to-epipolar parameterization provides
better accuracy, so a rough motion estimate should be used to compute such a
parameterization. It has to be noted that all discontinuity contours in the image can be
used, by building as many spatiotemporal surfaces as contours; robustness is thus increased.
If multiple rigid objects are moving independently in the scene, it is important to compute
each movement separately. It is then necessary to segment the image into regions where
the associated 3-D movement is homogeneous. It would be interesting at this point to
study more sophisticated techniques, such as finite elements, to obtain more precise
results, now that we have proved feasibility with this simple finite-difference scheme.
References
[BC90] A. Blake and R. Cipolla. Robust estimation of surface curvature from deformation of
apparent contours. In O. Faugeras, editor, Proceedings of the 1st European Conference
on Computer Vision, Antibes, France, pages 465-474. Springer Verlag, April 1990.
[Cra86] John J. Craig. Introduction to robotics. Mechanics and control. Addison-Wesley,
1986.
[Fau90] O. Faugeras. On the motion of 3D curves and its relationship to optical flow. Rapport
de Recherche 1183, INRIA, Sophia-Antipolis, March 1990.
[Fau91] O. Faugeras. On the motion field of curves. In Proceeding of the workshop on
Applications of Invariants in Computer Vision, Reykjavik, Iceland, March 1991.
[FLT87] O.D. Faugeras, F. Lustman, and G. Toscani. Motion and structure from point and
line matches. In Proceedings of the 1st International Conference on Computer Vision,
London, England, June 1987.
[LH81] H.C. Longuet-Higgins. A computer program for reconstructing a scene from two
projections. Nature, 293:133-135, September 1981.
[LH86] Y. Liu and T.S. Huang. Estimation of rigid body motion using straight line cor-
respondences, further results. Proceedings of the 8th International Conference on
Pattern Recognition, Paris, France, pages 306-307, October 1986.
[TF87] G. Toscani and O.D. Faugeras. Mouvement par reconstruction et reprojection. In
11ème Colloque sur le Traitement du signal et des images (GRETSI), Nice, France,
pages 535-538, 1987.
[WHA89] J. Weng, T.S. Huang, and N. Ahuja. Motion and structure from two perspective
views: algorithms, error analysis and error estimation. IEEE Transactions on PAMI,
11(5):451-476, May 1989.
This article was processed using the LaTeX macro package with ECCV92 style
Region-Based Tracking in an Image Sequence *
1 Introduction
Digitized time-ordered image sequences provide a rich support for analyzing and
interpreting temporal events in a scene. Obviously the interpretation of dynamic scenes has
to rely somehow on the analysis of displacements perceived in the image plane. During
the 80's, most work focused on the two-frame problem, that is, recovering the
structure and motion of the objects present in the scene either from the optical flow field
derived between time t and time t + 1, or from the matching of distinguished features
(points, contour segments, ...) previously extracted from two successive images.
Both approaches usually suffer from different shortcomings, such as intrinsic ambiguities
and, above all, numerical instability in the case of noisy data. Performance can obviously
be improved by considering a more distant time interval between the two considered
frames (by analogy with an appropriate stereo baseline), but matching problems then
become overwhelming. Therefore, an attractive solution is to take into account more than
two frames and to perform tracking over time using recursive temporal filtering [1].
Tracking thus represents one of the central issues in dynamic scene analysis.
First investigations were concerned with the tracking of points [2] and contour segments
[3, 4]. However, the use of vertices or edges leads to a sparse set of trajectories and can
make the procedure sensitive to occlusion. The interpretation process requires grouping
these features into consistent entities. This task is more easily achieved when working
with a limited class of a priori known objects [5]. It appears that the ability to
directly track complete and coherent entities should enable occlusion problems to be solved
more efficiently, and should also make the subsequent scene interpretation step easier.
This paper addresses this issue. Solving it requires dealing with dense spatio-temporal
information. We have developed a new tracking method which takes regions as features
and relies on 2D motion models.
2.1 The Motion-Based Segmentation Algorithm
The algorithm is fully described in [7]. The motion-based segmentation method ensures
stable motion-based partitions owing to a statistical regularization approach. This ap-
proach requires neither explicit 3D measurements nor the estimation of optic flow
fields. It mainly relies on the spatio-temporal variations of the intensity function while
making use of 2D first-order motion models. It also manages to link those partitions in
time, but of course only to a short-term extent.
When a moving object is occluded for a while by another object of the scene and then
reappears, the motion-based segmentation process may not maintain the same label for
the corresponding region over time. The same problem arises when the trajectories of
objects cross each other: labels present before the occlusion may disappear and give way
to new labels corresponding to regions reappearing after the occlusion. Consequently,
tracking regions over long periods of time requires a steady filtering procedure. A true
trajectory representation and determination is required, since the segmentation process
provides only instantaneous measurements. In order to work with regions, the concept of
a region must be defined in some mathematical sense. We describe hereafter the region
descriptor used throughout this paper.
2.2 The Region Descriptor
The region representation
We need a model to represent regions. The representation of a region is not intended
to capture its exact boundary; it should give a description of the shape and location
that supports the task of tracking even in the presence of partial occlusion.
We choose to represent a region by some of its boundary points. The contour is
sampled in such a way that it preserves the shape information of the silhouette. We must
select the points that best capture the global shape of the region. This is achieved through
a polygonal approximation of the region. A good approximation should be "close" to
the original shape and have a minimum number of vertices. We use the approach
developed by Wall and Danielsson in [8]. A criterion controls the closeness between the
shape and the polygon.
The region can be approximated accurately by this set of vertices. This representation
is flexible enough to follow the deformations of the tracked silhouette. Furthermore, it
results in a compact description, which decreases the amount of data required to represent
the boundary, and it yields easily tractable models to describe the dynamic evolution of
the region.
Our region-tracking algorithm requires the matching of a prediction and an
observation. The matching is achieved more easily when dealing with convex hulls. Among
the boundary points approximating the silhouette of the region, we retain only those
which are also vertices of the convex hull of the considered set of points. It must
be pointed out that these polygonal approximations only play a role as "internal items"
in the tracking algorithm, to ease the correspondence step between prediction and ob-
servation. They do not restrict the type of objects that can be handled, as shown in the
results reported further on.
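As a hedged sketch of the descriptor construction (the paper uses the polygonal approximation of Wall and Danielsson; Andrew's monotone chain stands in here for the convex-hull step, and all names are my own):

```python
# Sketch: keep only the boundary points that are vertices of the convex
# hull, then flatten them into the 2n descriptor vector of Sect. 2.2.

def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in order."""
    pts = sorted(set(map(tuple, points)))
    def cross(o, a, b):
        return (a[0]-o[0])*(b[1]-o[1]) - (a[1]-o[1])*(b[0]-o[0])
    def half(seq):
        h = []
        for p in seq:
            while len(h) >= 2 and cross(h[-2], h[-1], p) <= 0:
                h.pop()
            h.append(p)
        return h
    return half(pts)[:-1] + half(pts[::-1])[:-1]

boundary = [(0, 0), (2, 0), (2, 2), (0, 2), (1, 1)]   # (1, 1) is interior
hull = convex_hull(boundary)
descriptor = [c for v in hull for c in v]             # [x1, y1, ..., xn, yn]
print(len(hull))  # 4
```

The interior point (1, 1) is discarded, leaving the four corners as the "internal items" used for matching.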
The region descriptor
This descriptor is intended to represent the silhouette of the tracked region all along
the sequence. We represent the tracked region with the same number of points during
successive time intervals of variable size. At the beginning of an interval we determine
in the segmented image the number of points, n, necessary to represent the concerned
region. We keep this number fixed as long as the distance, defined in Sect. 2.3, between
the predicted region and the observation extracted from the segmentation is not too
large. The moment the distance becomes too large, the region descriptor is reset to
an initial value equal to the observation; this marks the beginning of a new interval.
We can represent the region descriptor by a vector of dimension 2n. This vector is
the juxtaposition of the coordinates (x_i, y_i) of the vertices of the polygonal approximation
of the region: [x1, y1, x2, y2, ..., xn, yn]^T.
2.3 The Measurement Vector
Measurement definition
This approach does not require the usual matching of specific features, which is often
a difficult issue; indeed, the measurement algorithm works on the region taken as a whole.
Fig. 1. The measurement algorithm : (1) Observation obtained by the segmentation (grey re-
gion), and prediction (solid line) ; (2) Convex hull of the observation ; (3) Matching of polygons ;
(4) Effective measurement : vertices of the grey region.
Measurement algorithm
If we represent the convex hull of the silhouette obtained by the segmentation and the
prediction vector as two polygons, the problem of superimposing the observation and the
prediction reduces to the problem of matching two convex polygons with possibly
different numbers of vertices.
Matching is achieved by moving one polygon and finding the best translation and rota-
tion to superimpose it on the other one. We did not include scaling in the transformation;
otherwise, in the case of occlusion, the minimization process would scale the prediction to
achieve a best match with the occluded observation. A distance is defined on the space
of shapes [9], and we seek the geometrical transformation that minimizes the distance
between the two polygons. If P1 and P2 are two polygons and T is the transform applied
to the polygon P2, we minimize f with respect to T:
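A sketch of the rigid superposition step, under the simplifying assumption that the two polygons already have their vertices in correspondence (the paper's matching also handles differing vertex counts); a 2D Kabsch-style closed form stands in for the minimization of f:

```python
import numpy as np

# Find the rotation R and translation t minimizing sum ||R p2_i + t - p1_i||^2.
# No scaling is allowed, so an occluded observation cannot shrink the prediction.

def best_rigid_transform(P1, P2):
    c1, c2 = P1.mean(axis=0), P2.mean(axis=0)
    H = (P2 - c2).T @ (P1 - c1)          # cross-covariance of centered vertices
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:             # keep a proper rotation (no reflection)
        Vt[-1] *= -1
        R = Vt.T @ U.T
    return R, c1 - R @ c2

theta = 0.3
R_true = np.array([[np.cos(theta), -np.sin(theta)],
                   [np.sin(theta),  np.cos(theta)]])
P2 = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
P1 = P2 @ R_true.T + np.array([2.0, -1.0])   # rotated and translated copy
R, t = best_rigid_transform(P1, P2)
print(np.allclose(P2 @ R.T + t, P1))  # True
```

In the paper's setting, the residual distance after superposition is what triggers a descriptor reset when it grows too large.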
A previous version of the region-tracking algorithm, in which each vertex of the region
could evolve independently of the others with constant acceleration, is proposed in [10].
The measurement is generated by the algorithm described in Sect. 2.3. A Kalman filter
gives estimates of the position of each vertex. Though the model used to describe the
evolution of the region is not very accurate, we nevertheless obtain good results with this
method. We propose hereafter a more realistic model to describe the evolution of the
region. More details can be found in [10].
Our approach has some similarities with the one proposed in [11]. The authors con-
strain the target motion in the image plane to be a 2D affine transform. An over-
determined system allows the motion parameters to be computed. However, their region
representation and segmentation step are quite different and less efficient. Besides, their
approach does not take into account possible occlusions or junctions of trajectories.
We propose an approach with a complete model for the prediction and update of the
object geometry and kinematics.
We make use of two models : a geometric model and a motion model, (Fig. 2). The
geometric filter and the motion filter estimate shape, position and motion of the region
from the observations produced by the segmentation. The two filters interact : the esti-
mation of the motion parameters enables the prediction of the geometry of the region in
the next frame. The shape of the region obtained by the segmentation is compared with
the prediction. The parameters of the region geometry are updated. A new prediction of
the shape and location of the region in the next frame is then calculated.
When there is no occlusion, the segmentation process assigns the same label over time
to a region; the correspondence between prediction labels and observation labels is thus
easy. If the trajectories of regions cross each other, new labels corresponding to regions
reappearing after occlusion are created, while the labels from before the occlusion
disappear. In this case more complex methods must be derived to estimate the complete
trajectories of the objects.
[Fig. 2: block diagram — the motion-based segmentation of frames k and k+1 supplies the motion parameters and the region shape to the interacting motion model and geometric model]
We assume that each region R in the image at time t + 1 is the result of an affine
transformation of the region R in the image at time t. Hence every point (x(t), y(t)) ∈ R
satisfies:
$$\begin{pmatrix} x \\ y \end{pmatrix}(t+1) = \Phi(t) \begin{pmatrix} x \\ y \end{pmatrix}(t) + b(t) \qquad (2)$$
The affine transform has already been used to model small transformations between two
images [11]. The matrix Φ(t) and the vector b(t) can be derived from the parameters of
the affine model of the velocity field, calculated in the segmentation algorithm for each
region moving in the image. Let M(t) and u(t) be the parameters of the affine model of
the velocity within the region R. We have:
Even if 2nd-order terms generally result from the projection in the image of a rigid
motion, they are sufficiently small to be neglected in this context of tracking, which
does not involve accurate reconstruction of 3D motion from 2D motion. Affine models
of the velocity field have already been proposed in [12] and [13]. The following relations
apply:
$$\Phi(t) = I_2 + M(t) \quad \text{and} \quad b(t) = u(t) \qquad (3)$$
For the n vertices (x1, y1), ..., (xn, yn) of the region descriptor we obtain the following
system model, where $I_2$ is the 2 × 2 identity matrix, Φ(t) and b(t) have been defined
above in (3), and $\xi = [\xi_x, \xi_y]^T$ is a two-dimensional, zero-mean Gaussian noise
vector. We choose a simplified model of the noise covariance matrix; we assume that
$$\Gamma = \sigma^2 I_{2n}$$
where $I_{2n}$ is the 2n × 2n identity matrix. This assumption enables us to break the filter
of dimension 2n into n filters of dimension 2.
The matrix Φ(t) and the vector b(t) account for the displacements of all the points
within the region between t and t + 1. Therefore the equation captures the global de-
formation of the region. Even though each vertex is tracked independently, the system
model provides a "region-level" representation of the evolution of the points.
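The propagation step of this system model can be sketched as follows; the numerical values of M and u are illustrative assumptions:

```python
import numpy as np

# Sketch of the system model: every vertex of the region descriptor is
# propagated by the same affine transform, Phi(t) = I2 + M(t) and
# b(t) = u(t), where M and u come from the affine velocity model
# estimated by the segmentation.

def predict_vertices(vertices, M, u):
    """vertices: (n, 2) array; returns the predicted positions at t+1."""
    Phi = np.eye(2) + M
    return vertices @ Phi.T + u

M = np.array([[0.05, 0.0], [0.0, 0.05]])   # small isotropic expansion
u = np.array([1.0, -0.5])                  # translation per frame
vertices = np.array([[0.0, 0.0], [10.0, 0.0], [10.0, 10.0], [0.0, 10.0]])
print(predict_vertices(vertices, M, u))    # all vertices move coherently
```

One shared (M, u) pair moves every vertex, which is what makes the per-vertex filters consistent at the "region level".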
For each tracked vertex the measurement is given by the position of the vertex in the
segmented image. The measurement process generates the measurement as explained in
Sect. 2.3.
The following system describes the dynamic evolution of each vertex (x_i, y_i) of the
region descriptor of the tracked region. Let $s(t) = [x_i, y_i]^T$ be the state vector and m(t)
the measurement vector, which contains the coordinates of the measured vertex:
$$s(t+1) = \Phi(t)\,s(t) + b(t) + \xi(t)$$
$$m(t) = s(t) + \eta(t) \qquad (4)$$
$\xi(t)$ and $\eta(t)$ are two sequences of zero-mean Gaussian white noise; b(t) is interpreted
as a deterministic input and Φ(t) is the matrix of the affine transform. We assume that the
above linear dynamic system is sufficiently accurate to model the motion of the region
in the image. We want to estimate the vector s(t) from the measurements m(t). The
Kalman filter [14] provides the optimal linear estimate of the unknown state vector from
the measurements, in the sense that it minimizes the mean-square estimation error and,
by choosing the optimal weight matrix, gives a minimum-variance unbiased estimate. We
use a standard Kalman filter to generate recursive estimates ŝ(t).
The first measurement is taken as the initial value of the estimate; hence we have
ŝ(0) = m(0). The covariance matrix of the initial estimate is set to a diagonal matrix
with very large coefficients, expressing our lack of confidence in this first value.
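A minimal per-vertex filter consistent with system (4) might look like this (a sketch, not the authors' implementation; the noise covariances and the static toy data are assumptions):

```python
import numpy as np

# Minimal per-vertex Kalman filter for system (4): state s = [x, y],
# transition s(t+1) = Phi s(t) + b(t), measurement m(t) = s(t) + noise.
# A huge initial covariance makes the first update adopt the first
# measurement, mirroring the initialization described in the text.

def kalman_step(s, P, m, Phi, b, Q, R):
    # predict
    s_pred = Phi @ s + b
    P_pred = Phi @ P @ Phi.T + Q
    # update (measurement matrix H = I)
    K = P_pred @ np.linalg.inv(P_pred + R)
    s_new = s_pred + K @ (m - s_pred)
    P_new = (np.eye(2) - K) @ P_pred
    return s_new, P_new

Phi, b = np.eye(2), np.zeros(2)                  # static region, for the toy run
Q, R = 0.01 * np.eye(2), 1.0 * np.eye(2)
s, P = np.zeros(2), 1e6 * np.eye(2)              # vague prior on the vertex
for m in [np.array([1.1, 2.0]), np.array([0.9, 2.1]), np.array([1.0, 1.9])]:
    s, P = kalman_step(s, P, m, Phi, b, Q, R)
print(s.round(1))  # converges near the noisy measurements' mean, [1., 2.]
```

In the full tracker, Phi and b would change at every frame according to the affine model of equation (3).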
3.2 The Kinematic Filter
The attributes of the kinematic model are the six parameters of the 1st-order approxima-
tion of the velocity field. These variables are determined with a least-squares regression
method. These instantaneous measurements are therefore corrupted by noise, and we
need a recursive estimator to convert the observation data into accurate estimates. We use
a Kalman filter to perform this task. We work with the equivalent decomposition:
$$M = \frac{1}{2}\begin{pmatrix} \mathrm{div} + \mathrm{hyp}_1 & \mathrm{hyp}_2 - \mathrm{rot} \\ \mathrm{hyp}_2 + \mathrm{rot} & \mathrm{div} - \mathrm{hyp}_1 \end{pmatrix}$$
This formulation has the advantage that the variables div, rot, hyp1 and hyp2 correspond
to four particular vector fields that can be easily interpreted [7].
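The decomposition of the 2 × 2 velocity-gradient matrix into divergence, rotation, and the two hyperbolic (deformation) terms can be sketched directly; the matrix M below is an illustrative assumption:

```python
import numpy as np

# Decompose M into div, rot, hyp1, hyp2 following
#   M = 1/2 [[div + hyp1, hyp2 - rot], [hyp2 + rot, div - hyp1]],
# and rebuild it to check the decomposition is exact.

def decompose(M):
    div  = M[0, 0] + M[1, 1]     # divergence (expansion)
    rot  = M[1, 0] - M[0, 1]     # rotation (vorticity)
    hyp1 = M[0, 0] - M[1, 1]     # first hyperbolic / shear term
    hyp2 = M[0, 1] + M[1, 0]     # second hyperbolic / shear term
    return div, rot, hyp1, hyp2

M = np.array([[0.2, -0.3], [0.5, 0.0]])
div, rot, hyp1, hyp2 = decompose(M)
rebuilt = 0.5 * np.array([[div + hyp1, hyp2 - rot],
                          [hyp2 + rot, div - hyp1]])
print(np.allclose(rebuilt, M))  # True
```

Each of the four scalars corresponds to an elementary flow pattern (expansion, rotation, and two shears), which is what makes them easy to interpret and to filter separately.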
The measurement is given by the least-squares estimates of the six variables. We have
observed on many sequences that the correlation coefficients between the six estimates are
negligible. For this reason, we have decided to decouple the six variables. The advantage
is that we can work with six separate filters.
In the general case, in the absence of any explicit simple analytical function describing
the evolution of the variables, we use a Taylor-series expansion of each function about
t. After having experimented with different approximations, it appears that using the
first three terms provides a good tradeoff between the complexity of the filter and the
accuracy of the estimates. Let $\theta(t) = [\varphi(t), \dot{\varphi}(t), \ddot{\varphi}(t)]^T$ be the state vector, where $\varphi$ is
any of the six variables a, b, div, rot, hyp1 and hyp2, and let z(t) be the measurement
variable. We derive the following linear dynamic system:
$\xi(t)$ and $\eta(t)$ are two sequences of zero-mean Gaussian white noise, of covariance
matrix Q and variance $\sigma^2$ respectively.
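Keeping the first three Taylor terms amounts to a constant-acceleration state model; the transition matrix below is a sketch with a unit time step (an assumption, since the paper does not state dt):

```python
import numpy as np

# Constant-acceleration model for theta = [phi, phi_dot, phi_ddot],
# dt = 1: phi advances by its velocity plus half its acceleration,
# the velocity advances by the acceleration, the acceleration is constant.

F = np.array([[1.0, 1.0, 0.5],
              [0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0]])
H = np.array([[1.0, 0.0, 0.0]])    # only phi itself is measured (z = H theta)

theta = np.array([2.0, 0.3, 0.1])  # current value, velocity, acceleration
theta_next = F @ theta
print(theta_next)
```

One such three-state filter runs for each of the six decoupled variables.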
3.3 Results
We present in Fig. 3 the results of an experiment performed on a sequence of real images.
The polygons representing the tracked regions are superimposed onto the original pictures
at times t1, t9 and t12. The corresponding segmented pictures at the same instants are
presented on the right. The scene takes place at a crossroads. A white van is coming
from the left of the picture and going to the right (Fig. 3a). A black car is driving behind
the van so closely that the segmentation is unable to split the two objects (Fig. 3d). A
white car is coming from the opposite side and going left. The algorithm accurately
tracks the white car, even at the end of the sequence, where the car almost disappears
behind the van (Fig. 3e and f). Since the segmentation process delivers a single global
region for the van and the black car (Fig. 3d), the filter follows this global region; thus
the tracked region does not correspond exactly to the boundary of the van. This example
illustrates the good performance of region-based tracking in the presence of occlusion.
An improved version of the method, in which the kinematic parameters are estimated
using a multiresolution approach, is being tested. More experiments are presented in [10].
4 Conclusion
This paper has explored an original approach to the issue of tracking objects in a se-
quence of monocular images. We have presented a new region-based tracking method
which delivers dense trajectory maps. It allows entities to be handled directly at an
"object level". It exploits the output of a motion-based segmentation. The algorithm relies
on two interacting filters: a geometric filter, which predicts and updates the region position
and shape, and a motion filter, which gives a recursive estimate of the motion param-
eters of the region. Experiments have been carried out on real images to validate the
performance of the method. The promising results obtained indicate the strength of the
"region approach" to the problem of tracking objects in image sequences.
References
1. T.J. Broida, R. Chellappa. Estimating the kinematics and structure of a rigid object from
a sequence of monocular images. IEEE Trans. PAMI, Vol. 13, No. 6:pp 497-513, June 1991.
2. I.K. Sethi and R. Jain. Finding Trajectories of Feature Points in a Monocular Image
Sequence. IEEE Trans. PAMI, Vol. PAMI-9, No. 1:pp 56-73, January 1987.
3. J.L. Crowley, P. Stelmaszyk, C. Discours. Measuring Image Flow by Tracking Edge-Lines.
Proc. 2nd Int. Conf. Computer Vision, Tarpon Springs, Florida, pp 658-664, Dec. 1988.
4. R. Deriche, O. Faugeras. Tracking Line Segments. Proc. 1st European Conf. on Computer
Vision, Antibes, pp 259-268, April 1990.
5. J. Schick, E.D. Dickmanns. Simultaneous estimation of 3d shape and motion of objects by
computer vision. Proceedings of the IEEE Workshop on Visual Motion, Princeton, New
Jersey, pp 256-261, October 1991.
6. G.L. Gordon. On the tracking of featureless objects with occlusion. Proc. Workshop on
Visual Motion, Irvine, California, pp 13-20, March 1989.
7. E. Francois, P. Bouthemy. Multiframe-based identification of mobile components of a scene
with a moving camera. Proc. CVPR, Hawaii, pp 166-172, June 1991.
8. Karin Wall and Per-Erik Danielsson. A fast sequential method for polygonal approximation
of digitized curves. Computer Vision, Graphics and Image Processing, 28:pp 220-227, 1984.
9. P. Cox, H. Maitre, M. Minoux, C. Ribeiro. Optimal Matching of Convex Polygons. Pattern
Recognition Letters, Vol 9 No 5:pp 327-334, June 1989.
10. F. Meyer, P. Bouthemy. Region-based tracking in an image sequence. Research Report in
preparation, IRISA/INRIA Rennes, 1992.
11. R.J. Schalkoff, E.S. McVey. A model and tracking algorithm for a class of video targets.
IEEE Trans. PAMI, Vol. PAMI-4, No. 1:pp 2-10, Jan. 1982.
12. P.J. Burt, J.R. Bergen, R. Hingorani, R. Kolczynski, W.A. Lee, A. Leung, J. Lubin, H.
Shvaytser. Object tracking with a moving camera. IEEE Workshop on Visual Motion, pp
2-12, March 1989.
13. G. Adiv. Determining three-dimensional motion and structure from optical flow generated
by several moving objects. IEEE Trans. PAMI, Vol 7:pp 384-401, July 1985.
14. Arthur Gelb. Applied Optimal Estimation. MIT Press, 1974.
Fig. 3. Left: original images at times t1, t9, t12 with the tracked regions. Right: segmented
images at the same instants
Combining Intensity and Motion for Incremental
Segmentation and Tracking Over Long Image
Sequences*
Michael J. Black
1 Introduction
Our goal is to efficiently and dynamically build useful and perspicuous descriptions of
the visible world over a sequence of images. In the case of a moving observer or a dy-
namic environment this description must be computed from a constantly changing retinal
image. Recent work in Markov random field models [7], recovering discontinuities [2], segmentation [6], motion estimation [1], motion segmentation [3, 5, 8, 10], and incremental
algorithms [1, 9] makes it possible to begin building such a structural description of the
scene over time by compensating for and exploiting motion information.
As an initial step towards the goal, this paper proposes a method for incrementally
segmenting images over time using both intensity and motion information. The result is
a robust and dynamic segmentation of the scene over a sequence of images. The approach
has a number of benefits. First, discontinuities are extracted and tracked simultaneously.
Second, a segmentation is always available and it improves over time. Finally, by com-
bining motion and intensity, the structural properties of discontinuities can be recovered;
that is, discontinuities can be classified as surface markings or actual surface boundaries.
By jointly modeling intensity and motion we extract those regions which correspond
to perceptually and physically significant properties of a scene. The approach we take
is to formulate a simple model of image regions using local constraints on intensity
and motion. These regions correspond to the location of possible surface patches in the
image plane. The formulation of the constraints accounts for surface patch boundaries as
discontinuities in intensity and motion. The segmentation problem is then modeled as a
Markov random field with line processes.
* This work was supported in part by grants from the National Aeronautics and Space Administration (NGT-50749 and NASA RTOP 506-47), by ONR Grant N00014-91-J-1577, and
by a grant from the Whitaker Foundation.
To model our assumptions about the intensity structure and motion in the scene we adopt
a Markov random field (MRF) approach [7]. We formalize the prior model in terms of
constraints, defined as energy functions over local neighborhoods in a grid. For an image
of size n × n pixels we define a grid of sites S = { s = (i, j) : 1 ≤ i, j ≤ n }.
Associated with each site s is a random vector X(t) = [u, i, l] which represents the
horizontal and vertical image motion u = (u, v), the intensity i, and the discontinuity
estimates l at time t. A discrete state space Λ_s(t) defines the possible values that the
random vector can take on at time t.
To model surface patches we formulate three energy terms, E_M, E_I, and E_L, which express our prior beliefs about the motion field, the intensity structure, and the organization of discontinuities, respectively. The energy terms are combined into an objective function which is to be minimized:

    E(u, i, l) = Σ_s [ E_M(I_n, I_{n+1}, u, u^-, l, s) + E_I(I, i, i^-, l, s) + E_L(l, s) ]    (1)

The terms u^-, i^-, and l^- are predicted values given the history of the sequence, and are used to express temporal continuity.
We convert the energy function, E, into a probability measure Π by exploiting the equivalence between Gibbs distributions [7, 10] and MRFs:

    Π = (1/Z) exp(-E / T(t))    (2)

where Z is the normalizing constant, and where T(t) is a temperature constant at time t. Minimizing the objective function is equivalent to finding the maximum of Π.
The constraints are summarized in figure 1 and described briefly below:
Intensity Model
E_I(I, i, i^-, l, s) = w_{D_I} D_I(I, i, s) + w_{T_I} T_I(i, i^-, s) + w_{S_I} S_I(i, l, s)    (3)

D_I(I, i, s) = (I(s) - i(s))^2    (4)

T_I(i, i^-, s) = (i(s) - i^-(s))^2    (5)

S_I(i, l, s) = Σ_{n ∈ G_s} l(s, n) (i(s) - i(n))^2    (6)
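As a concrete (and purely illustrative) reading of the intensity terms, the sketch below evaluates D_I, T_I, and S_I at one site over a 4-neighborhood; the function name, the dictionary-based line process, and the unit weights are our assumptions, not the paper's implementation:

```python
import numpy as np

def intensity_energy(I, i, i_prev, l, s, w_d=1.0, w_t=1.0, w_s=1.0):
    """Intensity energy E_I at site s = (r, c): data term D_I (eq. 4), temporal
    term T_I (eq. 5), and spatial term S_I (eq. 6) gated by the line process.
    l maps a pair of sites to its 0/1 line-process value (sketch assumption)."""
    r, c = s
    d = (I[r, c] - i[r, c]) ** 2                                # D_I
    t = (i[r, c] - i_prev[r, c]) ** 2                           # T_I
    smooth = 0.0
    for n in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):  # neighborhood G_s
        if 0 <= n[0] < I.shape[0] and 0 <= n[1] < I.shape[1]:
            smooth += l.get((s, n), 1) * (i[r, c] - i[n]) ** 2  # S_I
    return w_d * d + w_t * t + w_s * smooth                     # E_I, eq. (3)
```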
Boundary Model
Motion Model
E_M(I_n, I_{n+1}, u, u^-, l, s) =
    w_{D_M} D_M(I_n, I_{n+1}, u, s) + w_{T_M} T_M(u, u^-, s) + w_{S_M} S_M(u, l, s)    (8)

D_M(I_n, I_{n+1}, u, s) = Σ_{r ∈ G_s} ψ_D(I_n(i_r, j_r) - I_{n+1}(i_r + u, j_r + v))    (9)
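The data term (9) can likewise be sketched as a warped-intensity comparison under a robust penalty; the Lorentzian penalty, the integer rounding of the displacement, and the 3x3 neighborhood are assumptions of this sketch:

```python
import numpy as np

def motion_data_term(I_n, I_np1, u, v, s, sigma=10.0):
    """Motion data term D_M (eq. 9): sum, over the neighborhood G_s of site s,
    of a robust penalty applied to the intensity difference between frame n and
    frame n+1 displaced by (u, v)."""
    def psi(x):
        return np.log1p(0.5 * (x / sigma) ** 2)     # Lorentzian robust penalty
    r, c = s
    total = 0.0
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            rr, cc = r + dr, c + dc                          # pixel in G_s
            wr, wc = rr + int(round(v)), cc + int(round(u))  # displaced pixel
            if (0 <= rr < I_n.shape[0] and 0 <= cc < I_n.shape[1]
                    and 0 <= wr < I_np1.shape[0] and 0 <= wc < I_np1.shape[1]):
                total += psi(I_n[rr, cc] - I_np1[wr, wc])
    return total
```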
Fig. 2. Examples of local surface patch discontinuities (sites: o; discontinuities: |, -).
The objective function defined in the previous section will typically have many local minima. Simulated annealing (in this case a Gibbs sampler [7]) can be used to find the minimum X(t) by sampling from the state space Λ according to the distribution Π with logarithmically decreasing temperatures.
As mentioned earlier, each site contains a random vector X(t) = [u, i, l] which represents the motion, intensity, and discontinuity estimates at time t. The discontinuity component of this state space is taken to be binary, so that l ∈ {0, 1}.
The intensity component i can take on any intensity value in the range [0,255]. For
efficiency, we can restrict i to take on only integer values in that range. We make the
further approximation that the value of i at site s is taken from the union of intervals of
intensity values about i(s), the neighbors i(t) of s, and the current data value I_n(s). Small
intervals result in a smaller state space without any apparent degradation in performance.
The motion component u = (u, v) is defined over a continuous range of displacements
u and v. Continuous annealing techniques [1] allow accurate sub-pixel motion estimates
by making the state space for the flow component adapt to the local properties of the
function being minimized.
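A toy one-dimensional version of the annealing step may help fix ideas. The sketch below minimizes a simplified data-plus-smoothness energy with a Gibbs sampler and a logarithmic cooling schedule; it omits the line process, the motion terms, and the adaptive state space, and all names are ours:

```python
import numpy as np

def gibbs_anneal(I, n_sweeps=30, C=4.0, w_s=1.0, seed=1):
    """Minimize a toy 1-D energy  sum_s (i(s) - I(s))^2 + w_s (i(s) - i(s'))^2
    with a Gibbs sampler and logarithmic cooling T(k) = C / log(1 + k).
    A simplified stand-in for the full objective; no line process here."""
    rng = np.random.default_rng(seed)
    i = I.astype(float).copy()
    states = np.arange(256.0)                  # discrete intensity state space
    for k in range(1, n_sweeps + 1):
        T = C / np.log(1.0 + k)                # logarithmically decreasing T
        for s in range(len(i)):
            nbrs = [i[t] for t in (s - 1, s + 1) if 0 <= t < len(i)]
            # conditional (local) energy of every candidate state at site s
            E = (states - I[s]) ** 2 + w_s * sum((states - n) ** 2 for n in nbrs)
            p = np.exp(-(E - E.min()) / T)     # Gibbs conditional distribution
            i[s] = states[rng.choice(states.size, p=p / p.sum())]
    return i
```

At low temperatures the sampler behaves nearly greedily; an outlier pixel is pulled toward the weighted average of its data value and its neighbors.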
3.1 Incremental Minimization
[Figure: the incremental stochastic minimization scheme. Images n-1 and n feed the surface intensity and motion models; their energy E(u, v) drives the incremental stochastic minimization, which feeds back predicted intensity, boundary, and flow estimates.]
without many of the shortcomings. As opposed to minimizing the objective function for
a pair of frames, the ISM approach is designed to minimize an objective function which
is changing slowly over time. The assumption of a slowly changing objective function is
made possible by exploiting current motion estimates to compensate for the effects of
the motion on the objective function. With each new image, current estimates are propa-
gated by warping the grid of sites using the current optic flow estimate. The estimates are
then refined using traditional stochastic minimization techniques. Additionally, during
the warping process motion discontinuities are classified as occluding or disoccluding.
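The propagation step can be sketched as follows; the nearest-neighbor forward warp and the NaN marking of disoccluded sites are assumptions of this sketch, not the paper's exact scheme:

```python
import numpy as np

def warp_estimates(est, flow_u, flow_v):
    """Propagate per-site estimates to the next frame by warping the grid of
    sites with the current flow estimate (nearest-neighbor forward warp).
    Sites that receive no estimate are marked NaN: they were disoccluded and
    must be re-estimated from scratch."""
    H, W = est.shape
    out = np.full_like(est, np.nan)
    for r in range(H):
        for c in range(W):
            rr = r + int(round(flow_v[r, c]))   # vertical displacement v
            cc = c + int(round(flow_u[r, c]))   # horizontal displacement u
            if 0 <= rr < H and 0 <= cc < W:
                out[rr, cc] = est[r, c]
    return out
```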
4 Experimental Results
Fig. 6. Incremental Feature Extraction. The images show the evolution (left to right, top
to bottom) of features over a ten image sequence.
There are a number of issues to be addressed regarding the approach described. First, the
current implementation employs only simple first order models of intensity and motion.
To cope with textured surfaces more complicated image segmentation models will be
required.
A second issue, shared by many minimization approaches, is the parameter estimation problem. The construction of an objective function with weights controlling the importance of the various terms is often based on intuition or empirical studies. The problem becomes more pronounced as the complexity of the model increases. Experiments with the current model indicate that it is relatively insensitive to changes in the parameters.
6 Conclusion
References
1. Black, M.J., and Anandan, P., "Robust dynamic motion estimation over time,"
Proc. Comp. Vision and Pattern Recognition, CVPR-91, Maui, Hawaii, June 1991, pp. 296-
302.
2. Blake, A. and Zisserman, A., Visual Reconstruction, The MIT Press, Cambridge, Mas-
sachusetts, 1987.
3. Bouthemy, P. and Lalande, P., "Detection and tracking of moving objects based on a sta-
tistical regularization method in space and time," Proc. First European Conf. on Computer
Vision, ECCV-90, Antibes, France, April 1990, pp. 307-311.
4. Chou, P. B., and Brown, C. M., "The theory and practice of Bayesian image labeling," Int. Journal of Computer Vision, Vol. 4, No. 3, 1990, pp. 185-210.
5. Francois, E. and Bouthemy, P., "Multiframe-based identification of mobile components of a
scene with a moving camera," Proc. Comp. Vision and Pattern Recognition, CVPR-91, Maui,
Hawaii, June 1991, pp. 166-172.
6. Geman, D., Geman, S., Graffigne, C., and Dong, P., "Boundary detection by constrained
optimization," 1EEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 12,
No. 7, July 1990, pp. 609-628.
7. Geman, S. and Geman, D., "Stochastic relaxation, Gibbs distributions, and Bayesian
restoration of images," IEEE Transactions on Pattern Analysis and Machine Intelligence,
Vol. PAMI-6, No. 6, November 1984.
8. Heitz, F. and Bouthemy, P., "Multimodal motion estimation and segmentation using Markov
random fields," Proc. IEEE Int. Conf. on Pattern Recognition, June, 1990, pp. 378-383.
9. Matthies, L., Szeliski, R., Kanade, T., "Kalman filter-based algorithms for estimating depth
from image sequences," Int. J. of Computer Vision, 3(3), Sept. 1989, pp. 209-236.
10. Murray, D. W. and Buxton, B. F., "Scene segmentation from visual motion using global op-
timization," IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. PAMI-9, No. 2,
March 1987, pp. 220-228.
11. Navab, N., Deriche, R., and Faugeras, O. D., "Recovering 3D motion and structure from
stereo and 2D token tracking cooperation," Proc. Int. Conf. on Comp. Vision, ICCV-90, Os-
aka, Japan, Dec. 1990, pp. 513-516.
12. Thompson, W. B., "Combining motion and contrast for segmentation," IEEE Transactions
on Pattern Analysis and Machine Intelligence, Vol. PAMI-2 1980, pp. 543-549.
This article was processed using the LaTeX macro package with ECCV92 style
Active Egomotion Estimation: A Qualitative
Approach
Abstract.
Passive navigation refers to the ability of an organism or a robot that
moves in its environment to determine its own motion precisely on the ba-
sis of some perceptual input, for the purposes of kinetic stabilization. The
problem has been treated, for the most part, as a general recovery from
dynamic imagery problem, and it has been formulated as the general 3-D
motion estimation (or structure from motion) module. Consequently, if a
robust solution to the passive navigation problem--as it has been formu-
lated in the recovery paradigm--is achieved, we will immediately be able to
solve many other important problems, as simple applications of the general
principle. However, despite numerous theoretical results, no technique has
found applications in systems that can perform well in the real world. In
this paper, we outline some of the reasons behind this and we develop a
robust solution to the passive navigation problem which is
- purposive, in the sense that it does not claim any generality. It just
solves the kinetic stabilization problem and cannot be used as it is for
other problems related to 3-D motion.
- qualitative, in the sense that the solution comes as the answer to a
series of simple yes/no questions and not as the result of complicated
numerical processing.
- active, in the sense that the activity of the observer (in this case "saccades") is an essential part of the solution process.
1 Introduction
The problem of passive navigation (kinetic stabilization) has attracted a lot of attention
in the past ten years (Bruss and Horn, 1983; Longuet-Higgins, 1981; Longuet-Higgins and Prazdny, 1980; Ullman, 1979; Spetsakis and Aloimonos, 1988; Tsai and Huang, 1984)
because of the generality of a potential solution. The problem has been formulated as fol-
lows: Given a sequence of images taken by a monocular observer undergoing unrestricted
rigid motion in a stationary environment, to recover the 3-D motion of the observer. In
particular, if (U, V, W) and (ω_x, ω_y, ω_z) are the translation and rotation, respectively, comprising the general rigid motion of the observer, the problem is to recover the following five numbers: the direction of translation (U/W, V/W) and the rotation (ω_x, ω_y, ω_z). (See Fig. 1 for a pictorial description of the geometric model of the observer; O is the nodal point of the eye.)
The problem has thus been formulated as the general 3-D motion estimation problem
(kinetic depth or structure from motion) and its solution would solve a series of problems
(for example target pursuit, visual rendezvous, etc.) as simple applications. In this paper
we study the problem of passive navigation in the framework of purposive vision. Later
sections will clarify our point of view but our basic thesis is that we must seek a robust
solution for the problem under consideration only. If our proposed solution for the passive
navigation problem also solves the problem of determining the 3-D motion of an object
moving in the field of view of a static observer, then we have solved a more general
problem than the one we initially considered.
2 Previous Work
Previous research can be classified into two broad categories: methods based on optic
flow or correspondence and direct methods. 1
In the first category, under the assumption that optic flow or correspondence is known
with some uncertainty, finding the best solution results in a non-linear optimization
problem. One develops an error measure (usually a function of the input error) that is
minimized in some way. Treating the problem as one of statistical estimation has lately given rise to very sophisticated approaches. Although such research on general recovery
1 One can also differentiate a category of methods that use correspondence of macrofeatures
(contours, lines, sets of points, etc.) (Aloimonos, 1990b; Spetsakis and Aloimonos, 1990), but
we don't discuss them here, due to the lack of literature on the stability of such techniques.
is making tremendous progress, the existing general recovery results cannot yet survive
in the real world, because small amounts of error in the input can produce catastrophic
results in the output (Spetsakis and Aloimonos, 1988; Horn, 1988; Young and Chellappa,
1990; Adiv, 1985a, 1985b; Weng et al., 1987). Although it is true that if a human operator
corresponded features in the successive image frames, 2 most of these algorithms would
give practical results, it is highly questionable that these algorithms could be used in a
real time navigational system, when an average of 1% input noise is enough to create an
error of 100% in the output, 3 and especially when the problem of computing optic flow
or displacements (correspondence) is ill-posed and any algorithm for computing them
must rely on assumptions about the world that might not always be valid. There is no
doubt that research on the topic will continue and will shed more light on the difficulties
associated with the general problem of 3-D motion computation.
In the second category, direct methods attempt to recover 3-D motion using as
input the spatiotemporal derivatives of the image intensity function, thus getting rid
of the correspondence problem. These techniques, pioneered in (Aloimonos and Brown,
1984) and developed much further by Horn and his associates (Horn and Weldon, 1984;
Negahdaripour, 1986), can be considered closer to our work, since the spatiotemporal
derivatives at a point define the normal flow at that point. The difference is, of course,
that thinking in terms of the normal flow provides much more geometric intuition about
the problem. In addition, existing direct methods attempt to solve the general problem,
while we are only interested in the kinetic stabilization problem, and we can also treat
the problem of extracting 3-D translation without knowing the rotation (see Section 6).
Most of the research on visual motion has been concentrated along the lines of general re-
covery, i.e. recovering from a sequence of images relative 3-D motion (passive navigation)
and structure. If this general recovery problem is solved, then a series of questions such
as: is there a moving object in the field of view of the observer?, is it getting closer to the
observer?, is it going to hit the observer?, how can the observer intercept it or avoid it?,
etc., become simple applications of the general recovery module. This point of view is
consistent with the ideas on the modular design of vision systems put forth by D. Marr (1982). Although it is clear that a modular, general recovery approach to the analysis of
visual motion will uncover the general principles behind this perceptual ability (or mod-
ule), it is not at all clear that a good, working system capable of understanding visual
motion (i.e., capable of accomplishing visual tasks involving motion) must be designed
in a modular fashion. That is, it is not clear that any system handling visual motion
and capable of displaying intelligent behaviors is best designed by first going through
the stage of 3-D motion and structure computation. True, that would be convenient, as
vision would be studied in isolation and its results would be given to other modules, such
as motion planning or reasoning, to accomplish tasks such as target pursuit, obstacle
avoidance, and the like. Purposive vision, on the other hand, with its goal being to close
the gap between theory (general recovery) and application (actual visual systems), is
2 As in photogrammetry, for example, for solving the problem of relative orientation (Horn,
1988).
3 Since measurements are in focal length units, 1% error in displacements amounts to about
4-8 pixels for commercially available cameras.
a more general theory of vision involving multiple functions and multiple relationships
between subsystems of the final intelligent system. With regard to visual motion, a purposive design should allow any output that the visual system can be trained to associate
reliably with features derived from a changing image. In other words, we can attempt
different robust solutions to various motion related problems (such as passive navigation,
detecting obstacles, avoiding moving objects, etc.), even though all these problems are
simple applications of the general structure from motion module (Aloimonos, 1990a). 4
In this spirit, we study the perceptual task of kinetic stabilization.
4 Kinetic Stabilization
Consider a monocular observer as in Fig. 2. We assume that the observer moves only forward (see Fig. 3). 5 It is assumed that the observer is equipped with inertial sensors which provide the rotation (ω_x, ω_y, ω_z) of the observer at any time. As the observer moves in its environment, normal flow fields are computed in real time. Since optic flow due to rotation does not depend on depth but only on image position (x, y), we know (and can compute in real time) its value (u_r, v_r) at every image point along with the normal flow. 6 That means that we know the optic flow due to translation (see Fig. 3a). In other words, since we can derotate, we assume that the normal flow is due to translation only. In later sections we analyze the case where rotation is present. When the observer moves forward 7 in a static scene, it is approaching everything in the scene and the flow is expanding. From Fig. 3b, it is clear that the focus of expansion (FOE) = (U/W, V/W) (when the gradient space of directions is superimposed on the image space) lies in the half plane defined by line (e); thus every point in that half plane receives one vote for being the FOE. Clearly, at every point we obtain a constraint line which constrains the FOE to lie in a half plane. If the FOE lies on the image plane (i.e. the direction of translation is anywhere in the solid sector OABCD, Fig. 4), then the FOE is constrained to lie in an area on the image plane and thus it can be localized (see Fig. 5). When the FOE does not lie inside the image, a closed area cannot be found, but the votes collected by the half planes indicate its general direction. By making a "saccade", i.e. a rotation of the camera, the observer can then bring the FOE inside the image and localize it (Fig. 6 explains the process).
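The voting scheme can be sketched in a few lines of code; the grid resolution and all names here are illustrative, not the paper's:

```python
import numpy as np

def vote_for_foe(points, normal_flows, grid):
    """Half-plane voting for the focus of expansion (FOE). A normal-flow
    measurement nf at image point p constrains the FOE q to the half plane
    nf . (p - q) > 0: purely translational flow points away from the FOE,
    and projecting it on any direction preserves that sign."""
    qx, qy = np.meshgrid(grid, grid)
    votes = np.zeros(qx.shape, dtype=int)
    for (px, py), (nx, ny) in zip(points, normal_flows):
        votes += (nx * (px - qx) + ny * (py - qy)) > 0
    return votes

# Synthetic check: pure translation, FOE at the origin.
rng = np.random.default_rng(0)
pts = rng.uniform(-10, 10, size=(200, 2))
dirs = rng.normal(size=(200, 2))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)        # gradient directions
nflows = np.sum(pts * dirs, axis=1, keepdims=True) * dirs  # normal flow of p - FOE

grid = np.linspace(-10, 10, 41)
votes = vote_for_foe(pts, nflows, grid)                    # peak at grid index (20, 20)
```

With dense measurements the tied-peak region shrinks to a few cells around the true FOE, matching the behavior reported for Fig. 5.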
5 The Algorithm
We assume that the computation of the normal flow, the voting and the localization
of the area containing the highest number of votes can be done in real time. In this
4 It is becoming quite clear that biological organisms are designed in a purposive manner, i.e.
their visual systems do not create a central database containing the recovery of the scene and
its properties which is then used as input to other cognitive processes. This doesn't mean that
a purposive design of a robot system (a design around behaviors) is better than a modular
one; biological systems might be suboptimal. Still, a purposive analysis of vision makes sense
because we have found that various behaviors can be robustly obtained (Aloimonos, 1990).
5 In the case of backward movement the situation is symmetric (maximum - minimum) and handled similarly.
6 If computation of normal flow at some points is unreliable, we just don't compute normal flow there.
7 In the sense of Fig. 2; we assume that the observer usually looks towards where it is moving.
Fig. 2. [Schematic of the observer's "eye": motors rotate the camera, with a cable from the brain to the motors and a cable from the eye to the brain.]
Fig. 3. Given the normal flow u^n and the rotational flow u_r at a point O(x, y), and given that the projection of the sum u_r + u_t on (e1) should equal u^n, we conclude that the translational flow is OD, where D is anywhere on (e2). Clearly, in such a case, the focus of expansion lies on the half plane defined by (e) that does not contain u_t. This statement is equivalent to the following algebraic inequality (Horn and Weldon, 1987). If f(x, y, t) is the image intensity function, then we have f_x u + f_y v + f_t = 0, where (u, v) is the flow. If we only have translation (or we know the rotation), then we get f_x (-U + xW)/Z + f_y (-V + yW)/Z + f_t = 0, or f_x (W/Z)(x - U/W) + f_y (W/Z)(y - V/W) + f_t = 0, and if W/Z > 0, (f_x (x - U/W) + f_y (y - V/W)) / f_t < 0. However, thinking in terms of normal flow gives more geometric intuition. Indeed, if the normal flow due to translation is as in Fig. 3b, the FOE must lie in the half plane (dotted line) of (e). But this assumes that the flow u can be arbitrarily large, which is absurd. If there is a bound on the flow, then the FOE is constrained further (Fig. 3c).
Fig. 4. Consider the camera coordinate system. If the translation vector (U, V, W) is anywhere inside the solid OABCD defined by the nodal point of the eye and the boundaries of the image, then the FOE is somewhere on the image.
paper we don't get involved with real time implementation issues as we wish to analyze
the theoretical aspects of the technique. However, it is quite clear that computation of
normal flow can happen in real time (there already exist chips performing edge detection).
According to the literature on connectionist networks (Ballard, 1984), voting can also be
done in real time. Let S denote the area with the highest number of votes. Let L(S) be a
Boolean function that is true when the intersection of S with the image boundary is the
null set, and false otherwise. Then, the following algorithm finds the area S, i.e., solves
the passive navigation problem. We assume that the inertial sensors provide the rotation
and thus we know the normal flow due to translation.
1. begin {
2.   find area S
3.   repeat until L(S)
4.   { rotate camera around x, y axes so that the
       optical axis passes through the center of S (saccade)
5.     find area S
     }
6.   output S
   }
If the camera has a wide angle lens, then image points can represent many orientations,
and only one saccade (or none) may be necessary. But if we have a small angle lens, then
we may have to make more than one saccade.
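The termination predicate L(S) of the algorithm can be sketched directly; the names are ours:

```python
import numpy as np

def foe_localized(votes):
    """L(S): S is the set of cells attaining the highest vote count. L(S) is
    true when S does not touch the image boundary, i.e. the FOE has been
    localized inside the image and no further saccade is needed."""
    S = votes == votes.max()
    border = np.zeros_like(S)
    border[0, :] = border[-1, :] = border[:, 0] = border[:, -1] = True
    return not np.any(S & border)
```

The saccade loop simply re-votes and re-tests this predicate after each camera rotation.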
We have assumed that the inertial sensors will provide the observer with accurate infor-
mation about rotation. Although expensive accelerometers can achieve very high accu-
Fig. 5. (a) From a measurement u of the normal flow due to translation at a point (x, y) of the image, every point of the image belonging to the half plane defined by (e) that does not contain u is a candidate for the position of the focus of expansion, and collects one vote. The voting is done in parallel for every image measurement. (b) If the FOE lies within the image boundaries, then the area containing the highest number of votes is the area containing the FOE. Using only a few measurements can result in a large area. Using many measurements (all possible) results in a small area (in our experiments a small area means a few pixels, usually at most three or four).
Fig. 6. (a) If the area containing the highest number of votes has a piece of the image boundary as part of its boundary, then the FOE is outside the image plane (see (b)). (b) The position of the area containing the highest number of votes indicates the general direction in which the translation vector lies. (c) The camera ("eye") rotates so that the area containing the highest number of votes becomes centered. With a rotation around the x and y axes only, the optical axis can be positioned anywhere in space. The process stops when the highest vote area is entirely inside the image.
racy, the same is not true for inexpensive inertial sensors and so we are bound to have some error. Thus we must assume that some unknown rotational part still exists and contributes to the value of the normal flow. As a result, the method for finding the FOE (previous section), which is based on translational normal flow information (since we have "derotated"), might be affected by the presence of some rotational flow. Thus, we need to study the effect of rotation (the error of the inertial sensor) on the technique for finding the FOE. An extensive analysis of the distortion of the solution area in the presence of rotation is given in (Duric and Aloimonos, 1991). For the purposes of this analysis it will suffice to show that if we consider for voting only normal flows whose value is greater than some threshold, then voting will always be correct. The analysis is done for a spherical eye. Indeed, voting will clearly be correct only if the direction of the translational normal flow is the same as the direction of the actual normal flow, that is when

    sign(n · u) = sign(n · u_t).

So, if we set the threshold T_t = max |n · u_r|, then there are two possibilities: either |n · u| is below the threshold, in which case it is of no interest to voting, or the sign of n · u is the same as the sign of n · u_t. In other words, if we can set the threshold equal to the maximum value of the normal rotational flow, then our voting will always be correct. But at point r of the sphere the rotational flow is -ω × r, whose magnitude on the unit sphere is at most |ω|.
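The thresholding rule can be sketched as a simple filter over the measurements; the unit-radius bound and all names are assumptions of this sketch:

```python
import numpy as np

def reliable_votes_mask(normal_flows, omega_max):
    """Keep only normal-flow measurements whose magnitude exceeds the largest
    possible rotational normal flow. On a spherical eye of unit radius the
    rotational flow magnitude is bounded by |omega| (sketch assumption), so a
    measurement above that bound keeps the sign of its translational part
    and therefore votes correctly."""
    mags = np.linalg.norm(np.asarray(normal_flows, dtype=float), axis=1)
    return mags > omega_max
```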
7 Experimental Results
We have performed several experiments with both synthetic and real image sequences in order to demonstrate the stability of our method. From experiments on real images it was found that in the case of pure translation the method computes the Focus of Expansion very robustly. In the case of general motion it was found from experiments on synthetic data that the behavior of the method is as predicted by our theoretical analysis.
7.1 Synthetic Data
Fig. 7. Sphere OXYZ represents a spherical retina (frame OXYZ is the frame of the observer). The translation vector t is along the z axis and the rotation axis lies on the plane OZY. Although a spherical retina is used here, information is used only from a patch of the sphere defined by the solid angle FOV containing the viewing direction v_d. The spherical image patch is projected stereographically with center S' on the plane P tangent to the sphere at N', having a natural coordinate system (ξ, η). All results (solution areas, voting functions, actual and normal flow fields) are projected and shown on the tangential plane.
7.2 Real Data
Fig. 13a shows one of the images from a dense sequence collected in our laboratory using a Merlin American Robot arm that translated while acquiring images with the camera it carried (a Sony miniature TV camera). Fig. 13b shows the last frame in the sequence
Fig. 8.
Fig. 9.
and Fig. 13c shows the first frame with the solution area (where the FOE lies), which agrees with the ground truth. Figs. 14a, 14b and 14c show results similar to those above, but for the image sequence provided by NASA Ames Research Center and made public for the 1991 IEEE Workshop on Motion. One can observe that the solution area is not very small, primarily due to the absence of features around the FOE area.
Fig. 10.
Fig. 11.
8 Conclusions
We have presented a technique for computing the direction of motion of a moving observer
using as input the normal flow field. In particular, for the actual computation only the
direction of the normal flow is used. We showed theoretically that the method works
very robustly even when some amount of rotation is present, and we quantified the
relationship between time-to-collision and magnitude of rotation that allows the method
Fig. 12. Left: thresholding; right: no thresholding.
to work correctly. It has been shown that the position of the estimated FOE is displaced
in the presence of rotation and this displacement is explained in (Duric and Aloimonos,
1991). The practical significance of this research is that if we have at our disposal an
inertial sensor whose error bounds are known, we can use the method described in this
paper to obtain a machine vision system that can robustly compute the heading direction.
However, if rotation is not large, then the method can still reliably compute the direction
of motion, without using inertial sensor information.
Note: The theoretical analysis in this paper was done by Y. Aloimonos. All experiments
reported here have been carried out by Zoran Duric. The support of DARPA (ARPA
Order No. 6989, through Contract DACA 76-89-C-0019 with the U.S. Army Engineer
Topographic Laboratories), NSF (under a Presidential Young Investigator Award, Grant
IRI-90-57934), Alliant Techsystems, Inc. and Texas Instruments, Inc. is gratefully ac-
knowledged, as is the help of Barbara Burnett in preparing this paper.
Fig. 13.
Fig. 14.
References
Adiv, G.: Determining three-dimensional motion and structure from optical flow generated by
several moving objects. IEEE Trans. PAMI 7 (1985a) 384-401.
Adiv, G.: Inherent ambiguities in recovering 3D motion and structure from a noisy flow field.
Proc. IEEE Conference on Computer Vision and Pattern Recognition (1985b) 70-77.
Aloimonos, J.: Purposive and qualitative active vision. Proc. DARPA Image Understanding
Workshop (1990a) 816-828.
Aloimonos, J.: Perspective approximations. Image and Vision Computing 8 (1990b) 177-192.
Aloimonos, J., Brown, C.M.: The relationship between optical flow and surface orientation. Proc.
International Conference on Pattern Recognition, Montreal, Canada (1984).
Aloimonos, J., Weiss, I., Bandopadhay, A.: Active vision. Int'l. J. Comp. Vision 2 (1988) 333-
356.
Ballard, D.H.: Parameter networks. Artificial Intelligence 22 (1984) 235-267.
Bruss, A., Horn, B.K.P.: Passive navigation. Computer Vision, Graphics Image Processing 21
(1983) 3-20.
Durić, Z., Aloimonos, J.: Passive navigation: An active and purposive solution. Technical Report
CAR-TR-560, Computer Vision Laboratory, Center for Automation Research, University
of Maryland, College Park (1991).
Horn, B.K.P.: Relative Orientation. MIT AI Memo 994 (1988).
Horn, B.K.P., Weldon, E.J.: Computationally efficient methods of recovering translational mo-
tion. Proc. International Conference on Computer Vision (1987) 2-11.
Longuet-Higgins, H.C.: A computer algorithm for reconstructing a scene from two projections.
Nature 293 (1981) 133-135.
Longuet-Higgins, H.C., Prazdny, K.: The interpretation of a moving retinal image. Proc. Royal
Soc. London B 208 (1980) 385-397.
Marr, D.: Vision. W.H. Freeman, San Francisco (1982).
Negahdaripour, S.: Ph.D. Thesis, MIT Artificial Intelligence Laboratory (1986).
Nelson, R.C., Aloimonos, J.: Finding motion parameters from spherical flow fields (Or the ad-
vantages of having eyes in the back of your head). Biological Cybernetics 58 (1988) 261-273.
Nelson, R., Aloimonos, J.: Using flow field divergence for obstacle avoidance in visual navigation.
IEEE Trans. PAMI 11 (1989) 1102-1106.
Spetsakis, M.E., Aloimonos, J.: Optimal computing of structure from motion using point corre-
spondences in two frames. Proc. International Conference on Computer Vision (1988).
Spetsakis, M.E., Aloimonos, J.: Unification theory of structure from motion. Technical Report
CAR-TR-482, Computer Vision Laboratory, Center for Automation Research, University
of Maryland, College Park (1989).
Spetsakis, M.E., Aloimonos, J.: Structure from motion using line correspondences. Int'l. J. Com-
puter Vision 4 (1990) 171-183.
Tsai, R.Y., Huang, T.S.: Uniqueness and estimation of three dimensional motion parameters of
rigid objects with curved surfaces. IEEE Trans. PAMI 6 (1984) 13-27.
Ullman, S.: The Interpretation of Visual Motion. MIT Press, Cambridge, MA (1979).
Weng, J., Huang, T.S., Ahuja, N.: A two step approach to optimal motion and structure esti-
mation. Proc. IEEE Computer Society Workshop on Computer Vision (1987).
Young, G.S., Chellappa, R.: 3-D motion estimation using a sequence of noisy stereo images.
Proc. IEEE Conference on Computer Vision and Pattern Recognition (1988).
This article was processed using the LaTeX macro package with ECCV92 style
Active Perception Using DAM and Estimation Techniques
Wolfgang Pölzleitner¹ and Harry Wechsler²
¹ Joanneum Research, Wastiangasse 6, A-8010 Graz, Austria
² Dept. of Computer Science, George Mason University, Fairfax, VA 22030, USA
Machine vision research is currently recognizing the impact of connectionist and parallel
models of computation to approach the problem of huge amounts of data to be processed.
Complete parallelism, however, is not possible because it requires too large a number of
processors to be feasible (Sandon [12], Tsotsos [14]). A balance has to be made between
processor-intensive parallel implementation and time-intensive sequential implementation.
This motivates the need for multiresolution image representations, which is also justified by
the organization of the human retina, the availability of retina-like sensors [11], multichan-
nel pyramidal structures [15], and last but not least by the fact that sequential and hierar-
chical decision trees are basic tools employed in statistical pattern recognition to decrease
the computational load.
The approach we suggest in this paper is to distribute the stimulus vectors (i.e., the fea-
tural representation) to parallel processors. Preattentive selection is not in terms of special
features triggering selection, but in terms of the full (learned) knowledge the system has
obtained during training, and which is present in each processor. We use the Distributed
Associative Memory (DAM) as a generic recognition tool. It has several useful properties
to cope with incomplete input (e.g., in the presence of occlusion) and noisy patterns [7].
During recall, weights βk are computed to indicate how well the unknown stimulus vector
matches the k-th memorized stimulus vector. We have enhanced the DAM scheme using
results from statistical regression theory, and have replaced the conventionally used weights
βk by t-statistics, where the basic equations defining the DAM are
tk = βk / σ(βk)   (1)
Wolfgang Pölzleitner has been supported by Joanneum Research. Harry Wechsler has been partly
supported by DARPA under Contract #MDA972-91-C-004 and the C3I Center at George Mason
University.
A long version of this paper is available from the first author as Joanneum Research technical report
DIB-56.
Lecture Notes in Computer Science, Vol. 588
G. Sandini (Ed.)
Computer Vision - ECCV '92
© Springer-Verlag Berlin Heidelberg 1992
R² = (var(x) - RSS) / var(x)   (2)
Here tk is the t-statistic indicating how well a specific (new) input is associated with the k-th
stimulus vector stored in the memory, and R² is the coefficient of determination, a number
between 0 and 1, measuring the total goodness of the association. RSS = ||x - x̂||² and the
variance var(x) = ||x - x̄||², where x̄ is the mean over all n elements of x.
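The recall statistics above can be sketched as an ordinary least-squares recall over the stored stimulus matrix. The following is an illustrative sketch only, assuming the standard regression t-statistic tk = βk / se(βk); the authors' exact estimator may differ:

```python
import numpy as np

def dam_recall_stats(S, x):
    """Recall statistics for a DAM, viewed as a linear regression.

    S : (n, m) matrix whose columns are the m memorized stimulus vectors.
    x : (n,) unknown input stimulus.
    Returns the least-squares weights beta, their t-statistics, and R^2.
    """
    n, m = S.shape
    beta, _, _, _ = np.linalg.lstsq(S, x, rcond=None)
    residual = x - S @ beta
    rss = float(residual @ residual)              # RSS = ||x - x_hat||^2
    var_x = float(np.sum((x - x.mean()) ** 2))    # ||x - x_bar||^2
    r2 = (var_x - rss) / var_x                    # coefficient of determination
    # t-statistic of each weight: beta_k divided by its standard error
    sigma2 = rss / max(n - m, 1)
    cov = sigma2 * np.linalg.pinv(S.T @ S)
    t = beta / np.sqrt(np.diag(cov))
    return beta, t, r2
```

An input close to a stored stimulus yields R² near 1 and a dominant t-statistic for that stimulus, which is what the reject function and the focusing step exploit.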
To use the DAM in a preattentive manner requires the following properties [9]. First, a
reject function based on R² is incorporated that enables the memory to decide whether an
input stimulus is known or is a novelty. Second, the memory is made selective by allowing
it to iteratively discard insignificant stimulus vectors from memory. This is called the attentive
mode of operation (or coefficient focusing [9]).
Fixation Points, Saccade Generation, and Focused Recognition
Our DAM-based system achieves the balance between parallel and sequential processing
by segmenting the receptive field in preattentive and attentive processes: From the high
resolution visual input a pyramid is computed using Gaussian convolution, and one of its
low resolution layers is input to an array of preattentive DAMs working in parallel. Each
DAM has stored a low-resolution representation of the known stimuli and can output the
coefficient of determination R². Thus a two-dimensional array of R²-values is available at
the chosen pyramid level. These coefficients indicate to what extent the fraction of the input
covered by the particular DAM contains useful information.
Now a maximum selection mechanism takes this array of R²-values as input and selects
the top-ranked FOVs. The top-ranked FOVs constitute possible fixation points and a saccade
sequence is generated to move among them, starting at the FOV for which the maximum R²
was obtained. Full resolution receptive fields are then centered at these fixation points and
attentive recognition follows.
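The maximum selection step can be sketched as a simple non-maximum suppression over the R² array; the neighbourhood radius and reject threshold below are illustrative assumptions, not values from the paper:

```python
import numpy as np

def fixation_sequence(r2_map, radius=1, threshold=0.5):
    """Pick local maxima of a 2-D array of R^2 values and order them.

    A stand-in for the lateral-inhibition maximum selection described in
    the text: a cell survives if it is the maximum of its
    (2*radius+1)^2 neighbourhood and exceeds `threshold`.
    Returns (row, col) fixation points, best first, so a saccade
    sequence starts at the global maximum.
    """
    h, w = r2_map.shape
    peaks = []
    for i in range(h):
        for j in range(w):
            v = r2_map[i, j]
            if v < threshold:
                continue
            i0, i1 = max(0, i - radius), min(h, i + radius + 1)
            j0, j1 = max(0, j - radius), min(w, j + radius + 1)
            if v >= r2_map[i0:i1, j0:j1].max():
                peaks.append((v, (i, j)))
    peaks.sort(reverse=True)
    return [p for _, p in peaks]
```

The returned list plays the role of the saccade sequence: attentive, full-resolution recognition is then run at each point in order.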
2 Experiments
We describe next two experiments where active vision, implemented as saccades between
fixation points, is essential for recognition and safe navigation purposes. Specifically, one
has to trade the quality of recognition and safety considerations for almost real-time per-
formance, and as we show below, parallel hierarchical processing is the means to achieve
such goals.
2.1 Experiment 1: Quality Control
The methods described in the previous sections were tested on textured images of wooden
boards. The goal was to recognize defects in the form of resin galls, knots and holes, where
the knots appear in three different subclasses: normal (i.e., bright knots), knots with partially
dark surrounding, and dark knots so that the resulting six classes can be discriminated.
The preattentive selection methods and the related generation of visual saccades are il-
lustrated in Fig. 1. Here the 2nd level (P = 2) of the gray-level pyramid is input to the array of
low-resolution DAMs. The output of each DAM is shown in Fig. 1 b), where the diameters
of the circles are proportional to the value of R² computed in each DAM. The local maxima
of the R² values are shown in Fig. 1 c), where only circles that are local maxima are kept.
The maximum selection can be done efficiently by lateral inhibition. The sequence of fix-
ation points is also indicated in Fig. 1 d). At each such position the attentive mode was
initiated, but only on positions marked by a full circle was the value of R² large enough after
coefficient focusing [9] to indicate recognition. These positions indeed correctly indicated a
wooden board defect, which was classified as such.
References
1. R. Bajcsy. Active Perception. IEEE Proceedings, 76(8):996-1005, August 1988.
2. P. J. Burt. Smart Sensing within a Pyramid Vision Machine. IEEE Proceedings, 76(8):1006-1015,
August 1988.
3. V. Cherkassky. Linear Algebra Approach to Neural Associative Memories. To appear.
4. R. W. Conners and C. T. Ng. Developing a Quantitative Model of Human Preattentive Vision.
IEEE Trans. Syst. Man Cybern., 19(6):1384-1407, November/December 1989.
5. T. Kohonen. Self-Organization and Associative Memory. Springer-Verlag, 2nd edition, 1988.
6. U. Neisser. Direct Perception and Recognition as Distinct Perceptual Systems. Cognitive Science
Society Address, 1989.
7. P. Olivier. Optimal Noise Rejection in Linear Associative Memories. IEEE Trans. Syst. Man
Cybern., 18(5):814-815, 1988.
8. W. Pölzleitner and H. Wechsler. Invariant Pattern Recognition Using Associative Memory. Techni-
cal Report DIB-48, Joanneum Research, Institute for Image Processing and Computer Graphics,
March 1990.
9. W. Pölzleitner and H. Wechsler. Selective and Focused Invariant Recognition Using Distributed
Associative Memories (DAM). IEEE Trans. Pattern Anal. Machine Intell., 11(8):809-814, August
1990.
10. W. Pölzleitner, G. Paar, and G. Schwingshakl. Natural Feature Tracking for Spacecraft Guid-
ance by Vision in the Descent and Landing Phases. In ESA Workshop on Computer Vision and
Image Processing for Spaceborne Applications, Noordwijk, June 10-12, 1991.
11. G. Sandini and M. Tistarelli. Vision and Space-Variant Sensing. Volume 1, Academic Press, 1991.
12. P. A. Sandon. Simulating Visual Attention. Journal of Cognitive Neuroscience, 2(3):213-231,
1990.
13. B. Telfer and D. Casasent. Ho-Kashyap Optical Associative Processors. Applied Optics, 29:1191-
1202, March 10, 1990.
14. J. K. Tsotsos. Analyzing Vision at the Complexity Level. Technical Report RCBV-TR-78-20, Univ.
Toronto, March 10, 1987.
15. L. Uhr. Parallel Computer Vision. Academic Press, 1987.
Fig. 1. a) Level P = 0 of the gray-level image. The positions of the various DAMs are marked on the input
image. b) A prototype 'active' receptive field consisting of 15 × 15 preattentive DAMs working in par-
allel. The array is first centered at the location of maximum R². Coefficient focusing [9] is performed
only for this foveal DAM and the other top-ranked FOVs. Each circle is a DAM and the diameters of
the circles are scaled to the respective values of R². The stimuli stored in the DAMs are low-resolution
versions of those stored in the attentive DAM. They operate on the 2nd level of a gray-level pyramid
that has a 4 × 4 reduced resolution with respect to the input image. The DAMs are spaced 4 pixels
apart, and their receptive fields have a radius of 5 pixels on this level (which corresponds to a radius of
5 × 4 = 20 pixels at the high-resolution level). c) The saccades (changes of fixation points) generated
by sorting the local maxima of the preattentive R² values from b). The full circles represent locations
where the attentive recognition system has recognized an object. The numbers indicate the sequence
in which the fixation points were changed. d) Illustrates the scaling used in parts b) and c) and shows
the resulting diameter of the circle for R² = 1.
Fig. 2. a) The left image of an image pair used as a test image. Its size is 736 × 736 pixels. b) The 4th
pyramid level of the elevation model derived from this image pair is shown. Bright areas correspond
to high locations (mountain peaks, alpine glaciers), whereas dark areas are low altitude locations
(e.g., river valleys). c) The output of each preattentive memory is coded by circles with radii scaled to
the respective values of R². The circular receptive fields of these memories are placed 4 pixels apart,
with a radius of 14 pixels. d) The double circles show the values of R² in the preattentive (parallel)
mode (outer circles) and the values of R² after the last step of coefficient focusing (inner circles). The
fixation points selected by the preattentive system are shown in b).
Active/Dynamic Stereo for Navigation
Enrico Grosso, Massimo Tistarelli and Giulio Sandini
University of Genoa
Department of Communication, Computer and Systems Science
Integrated Laboratory for Advanced Robotics (LIRA- Lab)
Via Opera Pia 11A - 16145 Genoa, Italy
Abstract. Stereo vision and motion analysis have been frequently used
to infer scene structure and to control the movement of a mobile vehicle
or a robot arm. Unfortunately, when considered separately, these methods
present intrinsic difficulties, and a simple fusion of the respective results has
proved to be insufficient in practice.
The paper presents a cooperative schema in which the binocular dis-
parity is computed for corresponding points in several stereo frames and
it is used, together with optical flow, to compute the time-to-impact. The
formulation of the problem takes into account translation of the stereo set-
up and rotation of the cameras while tracking an environmental point and
performing one degree of freedom active vergence control. Experiments on
a stereo sequence from a real scene are presented and discussed.
1 Introduction
In figure 1 the first stereo pair, from a sequence of 15, is shown. The vehicle was moving
forward at about 100 mm per frame. The sequence has been taken inside the LIRA lab.
Many objects were in the scene at different depths. The vehicle was undergoing an almost
straight trajectory with a very small steering toward left, while the cameras were fixating
a stick on the desk in the foreground.
where α and β are the vergence angles, γ = arctan(x_l/F) and δ = arctan(x_r/F) define
the position of two corresponding points on the image planes, and x_r = x_l + D, where
D is the known disparity. The depth is computed as:
Z_s = d · K(α, β, γ, δ)   (2)
The knowledge of the focal length is required to compute the angular quantities.
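The function K(α, β, γ, δ) is not written out above; purely as an illustration, the following sketch uses a standard triangulation for a verging pair, assuming the two ray angles are already combined into angles measured from the baseline (in the paper's notation something like α − γ and β − δ; the sign convention here is an assumption):

```python
import math

def depth_verging_stereo(d, theta_l, theta_r):
    """Triangulated depth for a verging stereo pair.

    theta_l, theta_r : angles (radians) between the baseline and the ray
        through the observed point from the left/right camera.
    d : interocular baseline.
    Depth is measured orthogonal to the baseline, matching the
    Z_s = d * K(...) form of equation (2).
    """
    # Law of sines on the triangle (left camera, right camera, point):
    # the point's height above the baseline is d*sin(tl)*sin(tr)/sin(tl+tr).
    K = (math.sin(theta_l) * math.sin(theta_r)
         / math.sin(theta_l + theta_r))
    return d * K
```

For symmetric 45-degree rays the point sits at half the baseline above it, which is a quick sanity check on the geometry.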
3.2 Motion analysis
The temporal evolution of image features (corresponding to objects in the scene) is de-
scribed as the instantaneous image velocity (optical flow). The optical flow V = (u, v)
is computed from a monocular image sequence by solving an over-determined system of
linear equations in the unknown terms (u, v) [HS81, UGVT88, TS90]:

dI/dt = 0    d(∇I)/dt = 0

where I represents the image intensity of the point (x, y) at time t. The least squares
solution of these equations can be computed for each point on the image plane [TS90].
In figure 4 the optical flow of the sixth image of the sequence is shown. The image
velocity can be described as a function of the camera parameters and split into two terms
depending on the rotational and translational components of camera velocity respectively.
If the rotational part of the flow field Vr can be computed (for instance from pro-
prioceptive data), Vt is determined by subtracting Vr from V. From the translational
optical flow, the time-to-impact can be computed as:
T = A / |V_t|   (3)
where A is the distance of the considered point, on the image plane, from the FOE.
The estimation of the FOE position is still a critical step; we will show how it can be
avoided by using stereo disparity.
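The two steps above, a least-squares flow estimate followed by equation (3), can be sketched as follows. This is a rough window-based gradient illustration; the actual per-point solver of [TS90] differs in detail, and the FOE is taken as given here, whereas the paper's point is precisely how to avoid it:

```python
import numpy as np

def flow_at_point(Ix, Iy, It):
    """Least-squares optical flow (u, v) from image gradients in a window.

    Ix, Iy, It : 1-D arrays of spatial and temporal derivatives sampled in
    a neighbourhood of the point. Stacking the constraints
    Ix*u + Iy*v + It = 0 gives an over-determined linear system,
    solved here in the least-squares sense.
    """
    A = np.stack([Ix, Iy], axis=1)
    uv, _, _, _ = np.linalg.lstsq(A, -It, rcond=None)
    return uv  # (u, v)

def time_to_impact(p, foe, v_t):
    """Equation (3): T = A / |V_t|, with A the image distance of point p
    from the focus of expansion and v_t the translational flow at p."""
    A = np.linalg.norm(np.asarray(p, float) - np.asarray(foe, float))
    return A / np.linalg.norm(v_t)
```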
Fig. 3. Disparity computed for the 6th stereo pair of the sequence; negative values are depicted using darker gray levels.
Fig. 4. Optical flow relative to the 6th left image of the sequence.
Even though the depth estimates from stereo and motion are expressed using the same
metric, they are not homogeneous because they are related to different reference frames.
In the case of stereo, depth is referred to an axis orthogonal to the baseline (it defines
the stereo camera geometry) while for motion it is measured along a direction parallel
to the optical axis of the (left or right) camera. We have to consider the two reference
frames and a relation between them:
Z_s(x, y) = Z_m(x, y) · h(x)    h(x) = sin α + (x/F) cos α   (4)

where α is the vergence angle of the left camera, F is the focal length of the camera in
pixels and x is the horizontal coordinate of the considered point on the image plane (see
figure 2). We choose to adopt the stereo reference frame, because it is symmetric with
respect to the cameras; therefore all the measurements derived from motion are corrected
according to the factor h(x).
Fig. 5. Rotation of the stereo system during the motion.
Fig. 6. Correction of the relative depth using rotation.
corresponding left and right image points. However, camera resolution and numerical
instability problems make a practical application of the theory difficult. Moreover, as
appears in figure 5, the computation of the vergence angle is insufficient to completely
determine the position of the stereo pair in space. For this reason the angle φ1 or, alter-
natively, φ2 must be computed. We first assume the vergence angles of the cameras to
be known at a given time instant; for example they can be measured by optical encoders
mounted directly on the motors.
The basic idea for computing the rotational angle is to locate two invariant points
on the scene space and use their projection, along with the disparity and optical flow
measurements, to describe the temporal evolution of the stereo pair. The first point
considered is the fixation point, as it is "physically" tracked over time and kept in the
image center. Other points are obtained by computing the image velocity and tracking
them over successive frames.
In figure 7 the position of the two invariant points (F and P), projected on the ZX
plane, with the projection rays is shown.
Considering now the stereo system at time t1 we can compute, by applying basic
trigonometric relations, the oriented angle between the 2D vectors FP and LR:

φ1 = θ1 - θ2   (6)

In this formulation φ1 represents the rotation of the baseline at time t2 with respect
to the position at time t1; the measurement of φ1 can be performed using a subset or
all the image points: in the noiseless case, all the image points will produce identical
estimates. In the case of objects moving within the field of view it should be easy to
separate the different peaks in the histogram of the computed values corresponding to
moving and still objects. Figure 8 shows two histograms related to frames 5-6 and 7-8,
respectively.

Fig. 8. Rough and smoothed histograms of the angles computed from frames 5-6 and frames 7-8,
respectively. The abscissa scale goes from -0.16 radians to 0.16 radians. The maxima computed
in the smoothed histograms correspond to 0.00625 and 0.0075 radians respectively.
In order to compute the angle φ1 the following parameters must be known or measured:
- α and β, the vergence angles of the left and right camera respectively, referred to the
stereo baseline.
- γ and δ, the angular coordinates of the considered environmental point computed
on the left and right camera respectively. The computed image disparity is used to
establish the correspondence between image points on the two image planes.
- The optical flow computed at time t1.
|PO| = Z_s,tr(t2) = Z_s(t2) · [cos φ1 + sin φ1 / tan(δ2 - γ2)]   (7)

In the remainder of the paper we denote with Z the translational component Z_s,tr.
From the results presented in the previous sections we can observe that both stereo-
and motion-derived relative-depth depend on some external parameters. More explicitly,
writing the equations related to a common reference frame (the one adopted by the stereo
algorithm):

K_i = Z_i / d    T_i' = (h(x_i) / h(0)) · T_i    T_i' = Z_i / W_z   (8)

where d is the interocular baseline, W_z is the velocity of the camera along the stereo ref-
erence frame, T_i represents the time-to-impact measured in the reference frame adopted
by the motion algorithm and T_i' is the time-to-impact referred to the symmetric, stereo
reference system.
We consider now two different expressions derived from equations (8):

T_i' = (d / W_z) · K_i    Z_i / Z_j = K_i / K_j   (9)

T_i = (d / W_z) · K_i · (h(0) / h(x_i))   (10)

Using the first equation of (9) and equation (10) we can compute a new expression
for W_z / d:

W_z / d = (K_i - K_j) · h(0) / (T_i · h(x_i) - T_j · h(x_j))   (11)
where (x_i, y_i) and (x_j, y_j) are two points on the image plane of the left (or right) camera.
This formulation is possible if we are not measuring the distance of a flat surface, because
of the difference of K in the numerator and the difference of T in the denominator.
Substituting now in equations (9) we obtain equations (12).
The two equations are the first important result. In particular the second equation
directly relates the relative-depth to the time-to-impact and stereo disparity (i.e. the
K function). The relative-depth or time-to-impact can be computed more robustly by
integrating several measurements over a small neighborhood of the considered point
(x_i, y_i), for example with a simple average. The only critical factor in the second equation
of (12) is the time-to-impact, which usually requires the estimation of the FOE position.
However, with a minimum effort it is possible to exploit further the motion equations to
directly relate time-to-impact and also relative-depth to stereo disparity (the K function)
and optical flow only.
We will exploit now the temporal evolution of disparity. If the optical flow and the
disparity map are computed at time t, the disparity relative to the same point in space
at the successive time instant can be obtained by searching for a matching around the
predicted disparity, which must be shifted by the velocity vector to take into account the
motion.
As W_z/d is a constant factor for a given stereo image, it is possible to compute a robust
estimate by taking the average over a neighborhood [TGS91]:

W_z / d = ΔK / Δt ≈ (1/N²) Σ_i [K_i(t) - K_i(t + Δt)]   (13)
Given the optical flow V = (u, v) and the map of the values of the K function at time
t, the value of K_i(t + Δt) is obtained by considering the image point (x_i + u_i, y_i + v_i)
on the map at time t + Δt.
The value of ΔK for the 6th stereo pair has been computed by applying equation (13) at
each image point. By taking the average of the values of ΔK over the whole image, a value of
W_z/d equal to 0.23 has been obtained. This value must be compared to the velocity of the
vehicle, which was about 100 millimeters per frame along the Z axis, and the interocular
baseline, which was about 335 millimeters. Due to the motion drift of the vehicle and the
fact that the baseline has been measured by hand, it is most likely that the given values
of the velocity and baseline are slightly wrong.
By using equation (13) to substitute W_z/d in the first equation of (9), it is possible
to obtain a simple relation for the time-to-impact, which involves the optical flow to
estimate ΔK:

T_i' = T_i · h(x_i) / h(0) = K_i / ΔK   (14)
Fig. 9. Time-to-impact computed using eq. (14) for the 6th pair of the sequence; darker regions
correspond to closer objects.
This estimate of the time-to-impact is very robust and does not require the compu-
tation of the FOE (see figure 9). From equation (13) and the second equation of (9), it
is possible to obtain a new expression for the relative-depth:
Z_i / Z_j = (ΔK / K_j) · (h(x_i) / h(0)) · T_i   (15)
There is, in principle, a different way to exploit the optical flow. For completeness we
will briefly outline this aspect.
From the knowledge of the translational component V_t of V it is possible to write,
for a generic point (x_h, y_h):
Considering two points (x_i, y_i) and (x_j, y_j) and eliminating F/Z and W_x/W_y we express
a relation between T_i and T_j:
Now, combining equation (16) with the first of (12) we can obtain a new equation
of order two for T_i. T_i is expressed in this case as a function of the coordinates and
the translational flow of the two considered points. As in the case of equations (12) the
time-to-impact can be computed more robustly by averaging the measurements over a
small neighborhood of the considered point (x_i, y_i).
8 Conclusions
References
[Bro86] R.A. Brooks. A robust layered control system for a mobile robot. IEEE Trans. on
Robotics and Automat., RA-2:14-23, April 1986.
[CGS91] G. Casalino, G. Germano, and G. Sandini. Tracking with a robot head. In Proc. of
ESA Workshop on Computer Vision and Image Processing for Spaceborne Applica-
tions, Noordwijk, June 10-12, 1991.
[FGMS90] F. Ferrari, E. Grosso, M. Magrassi, and G. Sandini. A stereo vision system for real
time obstacle avoidance in unknown environment. In Proc. of Intl. Workshop on
Intelligent Robots and Systems, Tokyo, Japan, July 1990. IEEE Computer Society.
[GST89] E. Grosso, G. Sandini, and M. Tistarelli. 3D object reconstruction using stereo
and motion. IEEE Trans. on Syst. Man and Cybern., SMC-19, No. 6, Novem-
ber/December 1989.
[HS81] B. K. P. Horn and B. G. Schunck. Determining optical flow. Artificial Intelligence,
17 No.1-3:185-204, 1981.
[KP86] B. Kamgar-Parsi. Practical computation of pan and tilt angles in stereo. Technical
Report CS-TR-1640, University of Maryland, College Park, MD, March 1986.
[LD88] L. Li and J.H. Duncan. Recovering three-dimensional translational velocity and
establishing stereo correspondence from binocular image flows. Technical Report
CS-TR-2041, University of Maryland, College Park, MD, May 1988.
[Mut86] K.M. Mutch. Determining object translation information using stereoscopic motion.
IEEE Trans. on PAMI-8, No. 6, 1986.
[OC90] T.J. Olson and D.J. Coombs. Real-time vergence control for binocular robots. Tech-
nical Report 348, University of Rochester - Dept. of Computer Science, 1990.
[TGS91] M. Tistarelli, E. Grosso, and G. Sandini. Dynamic stereo in visual navigation. In
Proc. of Int. Conf. on Computer Vision and Pattern Recognition, Lahaina, Maui,
Hawaii, June 1991.
[TK91] C. Tomasi and T. Kanade. Shape and motion from image streams: a factorization
method. Technical Report CS-91-105, Carnegie Mellon University, Pittsburgh, PA,
January 1991.
[TS90] M. Tistarelli and G. Sandini. Estimation of depth from motion using an anthropo-
morphic visual sensor. Image and Vision Computing, 8, No. 4:271-278, 1990.
[UGVT88] S. Uras, F. Girosi, A. Verri, and V. Torre. Computational approach to motion per-
ception. Biological Cybernetics, 1988.
[WD86] A.M. Waxman and J.H. Duncan. Binocular image flows: Steps toward stereo-motion
fusion. IEEE Trans. on P A M I - 8, No. 6, 1986.
This article was processed using the LaTeX macro package with ECCV92 style
Integrating Primary Ocular Processes
Kourosh Pahlavan, Tomas Uhlin and Jan-Olof Eklundh
Computational Vision and Active Perception Laboratory (CVAP)
Royal Institute of Technology
S-100 44 Stockholm, Sweden
Email: kourosh@bion.kth.se, tomas@bion.kth.se, joe@bion.kth.se
1 Introduction
In recent years, there has been an increasing interest in studying active vision using head-
eye systems, see e.g. IBm88, ClF88, Kkv87]. Such an approach raises some fundamental
questions about control of attention.
Although one can point to work on cue integration, it is striking how computer vision
research generally treats vision problems in total isolation from each other. Solutions
are obtained as a result of imposed constraints or chosen parameters, rather than as an
outcome of several rivalling/cooperating processes, as occurs in biological systems. The
main reason why such systems are so fault-tolerant and perform so well is that they
engage several processes doing almost the same task, while these processes are functional
and stable under different conditions.
The present paper presents an approach based on this principle, by integration of basic
behaviors of a head-eye system, here called primary ocular processes. These are low-level
processes that under integration, guarantee a highly reliable fixation on both static and
dynamic objects. The primary ocular processes build the interface between what we'll
call the reactive and active processes in the visual system, that is, the processes engaged,
either if the observer voluntarily wants to look at a specific point in the world, or if he
involuntarily is attracted by an event detected somewhere in the world.
In summary, the presented work actually addresses two separate issues. The first and
main one is that of building a complex behavior by integrating a number of independent,
In experimental psychology, two kinds of gaze shifting eye movements are clearly sepa-
rated [Ybs67]. These are vergence and saccades¹. Since we are also interested in vergence
on dynamic objects or vergence under ego-motion, we change this division into pursuit
and saccades.
The general distinction between these two movements is based on the speed, am-
plitude and the inter-ocular cooperation of the movements. While saccades in the two
eyes are identical with respect to speed and the amplitude of the motion², pursuit is
identified by its slow, image-driven motion. Saccades are sudden, pre-computed
movements (jumps), while pursuit is a continuous smooth movement.
With this categorization in mind, the elements of the human oculomotor system
can be classified into saccades, consisting of a spectrum of large to micro saccades, and
pursuit, consisting of accommodation, vergence and stabilization/tracking.
Here, we argue that in a computational vision system the processes consti-
tuting pursuit should be hardwired together. This is because they are not only similar
processes with regard to their qualitative features; they also serve a similar purpose:
sequential and successive adjustment of the fixation point.
Although they belong to the same class of movements, there is a very important
difference between convergence/divergence on one hand and stabilization on the other.
Stabilization is very successful and reliable in keeping track of lateral movements, while
the accommodation process (focusing and convergence/divergence) is very successful and
good at doing the same in depth. The combination of the two yields a very reliable
cooperative process.
In human vision, stabilization is a result of matching on the temporal side of one eye
and the nasal side of the other, while convergence/divergence is a result of matching on
either the temporal or the nasal side of both eyes [Hub88, Jul71].
Let's start by defining each process in our computational model of an oculomotor
system.
Saccades are sudden rotations in both eyes with identical speed and amplitude. They
do not depend on the visual information during the movement, and their destination
point is pre-computed. The trace of a saccade is along the Vieth-Müller
circle³ for the point of fixation at the time. The motion is ramped up and down as a
function of its amplitude. Since the motion strategy is an open-loop one, the speed
is limited by the physical constraints of the motoric system.
Pursuit is a cooperative but normally not identical movement of both eyes. The motion
is image-driven, i.e. depending on the information from both images at the time, the
eyes are continuously rotated so that the point of interest falls upon each fovea cen-
tralis, in a closed-loop manner; the speed is limited by the computational constraints
of the visual system and its image flow.
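The closed-loop character of pursuit, as opposed to the pre-computed open-loop saccade, can be sketched as a simple feedback controller. This is a hypothetical illustration, not our actual motor control code; the function name, gain, and step count are invented:

```python
# Illustrative sketch (not the authors' controller): pursuit as a
# closed-loop process. Each iteration measures the retinal offset of
# the target and issues a small corrective rotation, so the speed is
# bounded by how fast the images can be processed, not by the motors.

def pursue(target_angle, eye_angle, gain=0.5, steps=20):
    """Proportional closed-loop pursuit; returns the trace of eye angles."""
    trace = [eye_angle]
    for _ in range(steps):
        retinal_error = target_angle - eye_angle   # measured from the image
        eye_angle += gain * retinal_error          # small corrective rotation
        trace.append(eye_angle)
    return trace

trace = pursue(target_angle=10.0, eye_angle=0.0)
assert abs(trace[-1] - 10.0) < 1e-2   # the eye converges onto the target
```

A saccade, by contrast, would jump directly to a pre-computed destination without consulting the image during the movement.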
3 Fixation in vertebrates
The biological approach to the problem of fixation suggests a highly secure technique
[Ybs67]. Figure 1 (left) illustrates the model of human vergence combined with a saccade.
In this model, verging from one point to another starts by a convergence/divergence of
the optical axes of the eyes. This process continues for a certain short time. The initial
convergence/divergence is then followed by a saccade to the cyclopic axis through the new
fixation point. The remaining convergence/divergence is carried out along the cyclopic
axis until the fixation on the desired point is complete. Quite often, a small correcting
saccade is performed in the vicinity of the fixation point.
There are some points here worth noting. To begin with, the model is based on a
cyclopic representation of the world. Although this does not have to be a representation
like that in [Jul71], one needs at least some kind of direction information about the point of
interest. Changing the gaze direction to the cyclopic direction is preceded by an initial
convergence or divergence, which could be caused by the preparation time needed to find the
direction. The search along the cyclopic axis is very interesting. This search strategy
could transform the matching problem into zero-disparity detection. Besides, with a
non-uniform retina, this is perhaps the only way to do reliable vergence, simply because
the matching/zero-disparity detection is carried out at the same level of resolution. In
the end, the last matching is performed at the finest-resolution part of the fovea centralis.
More interesting still is the observation that even with one eye closed,
the vergence process forces the closed eye to accompany the motion of the open eye.
Sudden opening of the closed eye shows that the closed eye was actually aiming fairly well
at the point of interest, suggesting that a monocular cue (focus in our implementation)
plays an active role in vergence.
3 The circle (or sphere) which goes through the fixation point and the optical centers of the two
eyes. All points on the circle have the same horizontal disparity. The horopter doesn't follow
this circle exactly. See e.g. [BaF91] and further references there.
Fig. 1. Left: A model of human vergence suggested by A. Yarbus. In this model, the con-
vergence/divergence movement is superimposed on the saccade movement. The conver-
gence/divergence shifts the fixation point along the cyclopic axis, while a saccade is a rotation
of the cyclopic axis towards the new fixation point. Right: The KTH-head.
4 The KTH-head
The KTH-head is a head-eye system performing motions with 7 mechanical and 6 optical
DOFs. It utilizes a total of 15 motors and is capable of simulating most of the movements
found in mammals. Currently, it utilizes a network of 11 transputers, configured with a
symmetric layout for executing the primary behaviors of the system. Figure 1 (right)
illustrates the KTH-head. The eye modules in the head-eye system are mechanically
independent of each other. There is also a neck module which can be controlled separately.
All DOFs can be controlled in parallel, and the task of coordinating them is
carried out by the control scheme of the system. See [PaE90, PaE92].
The design philosophy of the KTH-head was to allow a very flexible modular combination
of different units, so that the control system would have few restrictions in
integrating the module movements. The message here is that the design has been adjusted
to a fairly flexible model of a mammalian head. In particular, our system allows
exploration and exploitation of the principle of competing and cooperating independent
primary processes proposed above. A motivation for our design also derives from the
observation that in mammalian visual systems the mechanical structure supports
the visual processing.
Three major features distinguish our construction from earlier attempts 4. These are:
The first two items are essential for adapting the mammalian control strategy to the
system. There is a very delicate but important difference between eye movements and
other body movements. When the eyes rotate in their orbits, the image is not distorted 5.
We are simply suggesting that eye movements are not a means of seeing things in the
world from another angle of view, or of achieving more information about objects by means
of motion parallax. Instead they seem to change the gaze direction and bring the point
of interest onto the fovea. Smaller saccadic movements are also assumed to be a means of
correspondence detection 6.
Naturally, the control strategy depends on the DOFs and the construction scheme.
The KTH-head is designed to cope with the two different movements of the model
discussed earlier. By isolating the individual motor processes and allowing them to
communicate via common processes, they can be synchronized with one another.
4.1 Performance data
To give a better idea of the performance of the KTH-head, some data about it is
briefly presented here:
- General data:
Total number of motors: 15
Total number of degrees of freedom: 13
Number of mechanical degrees of freedom in each eye: 2
Number of optical degrees of freedom in each eye: 3
Number of degrees of freedom in neck: 2
Number of degrees of freedom in the base-line: 1
Top speed on rotational axes (when all motors run in parallel): 180 deg/s
Resolution on eye and neck axes: 0.0072 deg
Resolution on the base-line: 20 µm
Repeatability on mechanical axes: virtually perfect
4 By earlier constructions, we mean those like [Brn88] which basically allow separate eye move-
ments and thereby asymmetric vergence, and not other constructions like [ClF88] and [Kkv87]
which despite their flexibilities follow a strategy based on symmetric vergence.
5 For rotations up to almost 20 degrees, i.e. normal movements, human eyes rotate about a
specific center. Deviations from this are observed for larger angles. For rotations smaller than
20 degrees, the image is not distorted [Ybs67, Jul71].
6 Or maybe these are only corrective oscillations generated by the eye muscles. There is evidence
showing that micro saccades and tremors in the eyes disappear when the eye balls are not in a slip-
stick friction contact with the orbit [Gal91].
- Motors:
7 5-phase stepper motors on the mechanical axes
2 4-phase stepper motors for keeping the optical center in place
6 DC motors on optical axes, 3 on each lens
In these experiments, only a single transputer for indexing and controlling the motors
and one transputer-based frame-grabber for the primary control processes were used.
For the purpose of extending the primary ocular processes, a network of 11 transputers
has recently been installed.
The details of the design and its motivations can be found in [PaE90, PaE92].
5 Implementation issues
Presently, our implementation differs somewhat from the model. This is because we so
far lack a cyclopic representation and have to refer to the left or right image depending on the
task in question. In addition, we have images with uniform resolution and no foveated
sensor.
Prior to integration, we implemented a set of primary ocular processes which run
continuously. These are:
has some advantages in our case, when precise localization is required. Methods
based on e.g. bandpass filter techniques are presently beyond what we can compute
at video rate.
The processes described do not mean much if they are not put together as cooperating
processes compensating for each other's shortcomings. The three processes build a
circular loop, where each process interacts with the two others in the ring
and in this way confirms or rejects the incoming signals. At the "vergence" node, the left
ring and the right ring (see also Figure 2) are joined, and this very node sends the action
command to the motor control process.
Figure 2 illustrates how these processes communicate with each other. The vergence
process here has a coordinative task and is the intersection of the two separate sets of
processes on the left and right sides respectively.
Fig. 2. Process configuration in the implementation. The meeting point for the processes dedi-
cated to the left and the right eye (the two rings) is the vergence process.
- A similar process searching for the left foveal image along the associated epipolar
band in the right image. The confirmation procedure is also similar to that of the other
process.
If none of the processes succeed, then the track of the pattern is lost; the object has gone
outside the common stabilization area of the two eyes.
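Such a band search can be sketched as a one-dimensional sum-of-squared-differences scan. This is our own simplification for illustration (`search_band` and the sample values are invented, and the real band is a 2-D strip):

```python
# Minimal sketch (hypothetical) of searching a foveal patch along an
# epipolar band: slide the patch over the band and keep the displacement
# with the smallest sum of squared differences (SSD).

def ssd(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def search_band(patch, band):
    """Return (best_displacement, best_score) of patch along a 1-D band."""
    best = None
    for d in range(len(band) - len(patch) + 1):
        score = ssd(patch, band[d:d + len(patch)])
        if best is None or score < best[1]:
            best = (d, score)
    return best

band = [5, 1, 2, 3, 9, 1, 2, 4]
patch = [1, 2, 3]
d, score = search_band(patch, band)
assert (d, score) == (1, 0)   # exact match found at displacement 1
```

Note that the near-repetition at the end of this toy band already scores almost as well, which is exactly the ambiguity the repetitive-pattern experiments below exhibit.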
Although moving objects might seem troublesome, they have their own advantages.
Moving objects trigger the stabilization processes in both eyes, so that a new binocular
position for fixation will be suggested by these processes. The matching task of the
vergence process is then simplified. Contrary to vergence on static objects, the vergence
process has a binocular cue to judge by. This cooperation is however not implemented
yet, so it is too early to speculate about its applicability.
- We have chosen a large piece of wallpaper with a checkerboard pattern and placed
it so that it is viewed frontally.
- The real distance to the wallpaper is 2215 mm.
7 We did not mention pure saccades or versions among our implemented primary processes,
though we use them. The reason is simply the fact that we do not yet have a cyclopic repre-
sentation in our system. The saccades, under these circumstances, cannot be represented by a
process. The process would not have much to do, other than sending a command to the motor
control process to turn both eyes by the same amount that the retinal displacement of the
dominant eye requires. In the existing implementation, the focusing process alone decides if
the destination point is inside the zero-disparity circle, or outside it.
Fig. 4. The repetitive pattern without the band limits of accommodation. The band (top). The
pattern square superimposed on the best match (bottom). The match here, represented by the
least value, is false.
Fig. 5. The evaluation function for matching. A good match here is represented by minimum
points. As shown in the figure, without the dynamic constraints from the focusing process, there
are multiple solutions to the problem. The output match here is false.
Figure 4 illustrates one sequence of vergence without focusing information. The image
is the repetitive pattern, and the task is to find the pattern in the right eye corresponding
to the pattern selected from the left eye (the square on the bottom stripe), by a search
along the epipolar band (the top stripe) in the right image.
As expected, and as illustrated in Figure 5, the algorithm finds many candidates for
a good match (the minimum points). The problem here is that the best match is actually
not the real correspondent of the pattern in the left image, though from a comparative
point of view they are identical.
Figure 6 illustrates the result of the cooperation with the focusing process. The
focusing sharpness puts limit constraints on the epipolar band. The band is clearly shorter,
and as illustrated in Figure 7, there is, in this band, only one minimum point, yielding a
unique correspondence.
There is still a possibility of repetitive structures in the area defined by the depth of
focus. But this probability is very small and, for a given window size, depends on the
period of the structure and the field of view of the objective.
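The way focus can bound the disparity search can be sketched with a simple pinhole triangulation model. This is our own hedged illustration, not the paper's calibration: we assume disparity is proportional to baseline × focal length / depth, and all numbers are invented:

```python
# Hedged sketch: the depth-of-focus interval [z_near, z_far] around the
# focused distance maps, via triangulation, to a short disparity interval
# on the epipolar band. Baseline and focal length are made-up values.

def disparity(z_mm, baseline_mm=200.0, focal_px=1000.0):
    return baseline_mm * focal_px / z_mm   # pinhole triangulation, d = b*f/Z

def disparity_interval(z_near_mm, z_far_mm):
    """Disparity search interval implied by the depth of focus."""
    return disparity(z_far_mm), disparity(z_near_mm)  # farther -> smaller disparity

lo, hi = disparity_interval(2000.0, 2500.0)
assert lo < hi
assert (lo, hi) == (80.0, 100.0)   # only 20 pixels of band need be searched
```

The larger the depth of focus (e.g. at long distances), the wider this interval, which is why the long-distance experiment below retains a second minimum.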
Figure 8 (top) illustrates a case where the distance to the object is increased. The
frequency of the pattern is higher than in the earlier example, while the depth of focus has
become larger. In this case, the vergence process is still matching erroneously (at the
point with the pixel disparity of −14). Figure 8 (bottom) illustrates the same case with
focus cooperation.
It can be observed that, although the cooperation provides limits for the matching
interval and the matching process actually finds the right minimum, the interval still contains
two minimum points, i.e. a potential source of error. The problem here is caused by the large
depth of focus when the image is at a longer distance 8. Note that it actually is not the
frequency which is important here; it is the distance to the pattern. An erroneous fixation
requires exactly similar patterns with small periods and a long distance.
Fig. 6. The repetitive pattern with the band limits provided by accommodation. The band
(left). The pattern square superimposed on the best match (right); here the correct match.
Fig. 7. The evaluation function for matching. A good match here is represented by the minimum
point. To be compared with the curve not using the accommodation limits.
5.3 Stabilization gives focusing a hand
In order for focusing to recognize the sharpness in the center of the image, it must be
capable of handling small movements caused by the object or by the subject itself. The
stabilizer here gives focusing a hand, letting it focus on the same pattern even if the object is
shaking. The focusing process always gets its image through the stabilizer, which has
already decided which part of the image the focusing process is interested
in. Stabilization has a great effect on focusing on dynamic objects. Figure 11 illustrates
the effect of stabilization on focusing on a real object.
At this stage the loop is finally closed. In practice, focusing is a very important cue
for active exploration of the world. Here, we suggest that this cue can also be used in eye
coordination and fixation.
6 Ongoing work
10 That is, the combined speed of the eyes and neck can amount to 360 degrees/s.
Fig. 8. The graph at the top illustrates the result of matching along the epipolar band. The one
at the bottom illustrates the area confirmed by accommodation. The one at the bottom, in spite
of the two minima, detects a correct fixation. However, a potential source of false matching (the
second minimum) exists. The focal parameters are by no means ideal here; the angle of view is
larger than 20 degrees. The displacement is defined to be 0 at the center of the image. Note,
however, that the cameras are at different positions in the two cases. Hence the displacements
do not directly relate.
7 Conclusion
In biological vision, fault-tolerance and high performance are obtained through independent
processes often doing the same task, which are functional and stable under different
Fig. 9. The left and right images of a stereo pair (top). The point marked with a cross (+) in the
left image is the next fixation point. The epipolar band and the superimposed correlation square
on it are also shown. In this case, the matching process alone could manage the fixation problem
(the stripe at the bottom). The cooperation has, however, shrunk the search area drastically (the
stripe in the middle).
conditions. Hence, complex behaviors arise from the cooperation and competition of such
primary processes.
This principle has generally been overlooked in computer vision research, where visual
tasks are often considered in isolation, without regard to information obtainable from
other types of processing. Before the recent interest in "active" or "animate" visual
perception, the importance of having visual feedback from the environment was also
seldom appreciated. We contend that such feedback is essential, implying that processing
should occur in (real) time and that a look at the world offers more information than a
look at prerecorded images.
In this paper, we have presented work on controlling eye movements on the basis
of these principles. We have abstractly modeled the oculomotor behavior of what we
call the active-reactive visuomotor system as divided into two types of motion: open-loop
and closed-loop. We also argued that the processes engaged in the closed-loop action should be
hard-wired to one another, so that they can compensate for the shortcomings of each
other.
This cooperative model has been applied to the tasks of vergence and pursuit, and
in an implementation on our head-eye system it has been shown to be very powerful.
As a side-effect, a simple and robust method for gaze control in a system allowing such
independent processes is obtained. The design of the KTH-head, motivated by the general
philosophy presented above, is also briefly described. This both gives a background to the
experiments and a demonstration of how the general principle of independent cooperative
and competitive processes can be realized.
Fig. 11. The evaluation function for focusing without stabilization (dashed curve), and with
stabilization (solid curve). Stabilization results in a smooth curve free from local minima.
8 Appendix: algorithms
We have deliberately downplayed the role of the algorithms used. In fact there are many
possible approaches that could work.
The focusing algorithm used here is the tenengrad algorithm described by [Kkv87]:
maximize the sum of S(x, y) over the window, for S(x, y) > T, where S is
S(x, y) = (ix ∗ i(x, y))² + (iy ∗ i(x, y))²
ix and iy are the convolution kernels (e.g. for a Sobel operator) and T is a threshold.
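For concreteness, the measure can be rendered directly in code. This is an unoptimized sketch of the tenengrad idea (Sobel gradients squared, summed above a threshold); the toy test images are invented:

```python
# Sketch of the tenengrad focus measure: convolve with Sobel kernels,
# form S(x, y) = gx^2 + gy^2 per pixel, and accumulate values above T.

def tenengrad(image, T=0.0):
    """image: 2-D list of gray values; returns the focus measure."""
    sx = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]   # Sobel ix
    sy = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]   # Sobel iy
    h, w = len(image), len(image[0])
    total = 0.0
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = sum(sx[j][i] * image[y + j - 1][x + i - 1]
                     for j in range(3) for i in range(3))
            gy = sum(sy[j][i] * image[y + j - 1][x + i - 1]
                     for j in range(3) for i in range(3))
            s = gx * gx + gy * gy               # S(x, y)
            if s > T:
                total += s
    return total

sharp   = [[0, 0, 9], [0, 0, 9], [0, 0, 9]]     # strong vertical edge
blurred = [[3, 4, 6], [3, 4, 6], [3, 4, 6]]     # softened edge
assert tenengrad(sharp) > tenengrad(blurred)    # sharper image scores higher
```

Sweeping the focus ring and keeping the position that maximizes this sum yields the focusing behavior used by the process.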
The matching process is based on minimizing the sum of squared differences of the
two image functions.
square area around the center of the image. For efficiency reasons, however, the search is
only performed down the steepest descent.
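The "steepest descent only" search can be sketched as follows. The names and the reduction to a 1-D band are our own simplification; the real search operates over an image area:

```python
# Sketch: instead of evaluating the SSD at every displacement, step to
# whichever neighboring displacement lowers the error, and stop at a
# local minimum. Cheap, but it can get trapped, which is exactly why
# the focus-derived band limits matter.

def ssd_at(patch, band, d):
    return sum((p - b) ** 2 for p, b in zip(patch, band[d:d + len(patch)]))

def descend(patch, band, d0):
    """Follow decreasing SSD from start displacement d0 to a local minimum."""
    d = d0
    while True:
        candidates = [c for c in (d - 1, d + 1)
                      if 0 <= c <= len(band) - len(patch)]
        best = min(candidates, key=lambda c: ssd_at(patch, band, c))
        if ssd_at(patch, band, best) < ssd_at(patch, band, d):
            d = best
        else:
            return d

band = [9, 7, 1, 2, 3, 8, 9, 9]
patch = [1, 2, 3]
assert descend(patch, band, 0) == 2   # slides down to the exact match
```

Starting the descent inside the short interval confirmed by accommodation makes it likely that the nearest local minimum is also the correct one.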
It is rather surprising how well these algorithms work, in spite of their simplicity!
This can be explained by two factors:
- Real-time vision is much less sensitive to relative temporal changes in the scene. The
correlations are, for example, less sensitive to plastic or elastic changes of the object,
smooth changes in lighting conditions, etc.
- The cooperation of several processes gives vision a chance to judge the situation by several
objective criteria, rather than by rigid constraints.
In all cases, the input images have been noisy, non-filtered ones. No edge extraction
or similar low-level image processing operations have preceded the algorithms.
References
[AlK85] M. A. Ali, M. A. Klyne. Vision in Vertebrates, Plenum Press, New York, 1985
[BaF91] S. T. Barnard, M. A. Fischler. Computational and Biological Models of Stereo Vision,
to appear in the Wiley Encyclopedia of Artificial Intelligence (2nd edition), 1991
[Brn88] C. Brown. The Rochester Robot, Tech. Rep., Univ. of Rochester, 1988
[ClF88] J. J. Clark, N. J. Ferrier. Modal Control of an Attentive Vision System, Proc. of the
2nd ICCV, Tarpon Springs, FL, 1988
[Col91] H. Collewijn. Binocular Coordination of Saccadic Gaze Shifts: Plasticity in Time and
Space, Sixth European Conference on Eye Movements, Leuven, Belgium, 1991
[DeF90] R. Deriche, O. Faugeras. Tracking Line Segments, Proc. of the 1st ECCV, Antibes,
France, 1990
[Gal91] V. R. Galoyan. Hydrobiomechanical Model of Eye Placing and Movements, Sixth Euro-
pean Conference on Eye Movements, Leuven, Belgium, 1991
[Hub88] D. H. Hubel. Eye, Brain, and Vision, Scientific American Library, 1988
[Jen91] M. R. M. Jenkin. Using Stereo Motion to Track Binocular Targets, Proc. of CVPR,
Lahaina, Hawaii, 1991
[Jul71] B. Julesz. Foundations of Cyclopean Perception, The University of Chicago Press, 1971
[Kkv87] E. P. Krotkov. Exploratory Visual Sensing for Determining Spatial Layout with an Agile
Stereo System, PhD thesis, 1987
[PaE90] K. Pahlavan, J. O. Eklundh. A Head-Eye System for Active, Purposive Computer Vi-
sion, TRITA-NA-P9031, KTH, Stockholm, Sweden, 1990
[PaE92] K. Pahlavan, J. O. Eklundh. Head, Eyes and Head-Eye Systems, SPIE Machine Vision and
Robotics Conference, Florida, 1992 (to appear)
[Ybs67] A. Yarbus. Eye Movements and Vision, Plenum Press, New York, 1967
This article was processed using the LaTeX macro package with ECCV92 style
Where to Look Next Using a Bayes Net:
Incorporating Geometric Relations *
The University of Rochester, Computer Science Department, Rochester, New York 14627, USA
Abstract.
A task-oriented system is one that performs the minimum effort neces-
sary to solve a specified task. Depending on the task, the system decides
which information to gather, which operators to use at which resolution,
and where to apply them. We have been developing the basic framework of
a task-oriented computer vision system, called TEA, that uses Bayes nets
and a maximum expected utility decision rule. In this paper we present a
method for incorporating geometric relations into a Bayes net, and then
show how relational knowledge and evidence enable a task-oriented system
to restrict visual processing to particular areas of a scene by making camera
movements and by processing only a portion of the data in an image.
1 Introduction
This section summarizes the TEA-1 system, our second implementation of TEA, a general
framework of a task-oriented computer vision system. The reader is referred to [10] for
a detailed description of TEA-1. Earlier work involving TEA-0 and TEA-1 appears in
[8, 9].
Main Control Loop. In TEA, a task is to answer a question about the scene: Where
is the butter? Is this breakfast, lunch, dinner, or dessert? We are particularly interested
in more qualitative tasks: Is this an informal or fancy meal? How far has the eating
progressed? (Our example domain is table settings.) The TEA system gathers evidence
visually and incorporates it into a Bayes net until the question can be answered to a
desired degree of confidence. TEA runs by iteratively selecting the evidence gathering
action that maximizes an expected utility criterion involving the cost of the action and
its benefits of increased certainties in the net: 1) List all the executable actions. 2) Select
the action with highest expected utility. 3) Execute that action. 4) Attach the resulting
evidence to the Bayes net and propagate its influence. 5) Repeat, until the task is solved.
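The five-step loop above can be rendered schematically. This is our own stand-in (action names, the utility table, and the "solved" test are invented; the real system computes utilities from the Bayes net as described below):

```python
# Schematic rendering of the TEA control loop: repeatedly pick and run
# the evidence-gathering action with the highest expected utility.

def tea_loop(actions, execute, utility, solved, max_iters=10):
    """Iterate the TEA control loop until the task is solved."""
    evidence = []
    for _ in range(max_iters):                              # 5) repeat ...
        if solved(evidence):                                # ... until solved
            break
        runnable = [a for a in actions if a not in evidence]  # 1) list actions
        if not runnable:
            break
        best = max(runnable, key=utility)                   # 2) highest utility
        execute(best)                                       # 3) execute it
        evidence.append(best)                               # 4) attach evidence
    return evidence

log = []
order = tea_loop(actions=["cheap", "rich"],
                 execute=log.append,
                 utility=lambda a: {"cheap": 0.2, "rich": 0.9}[a],
                 solved=lambda ev: "rich" in ev)
assert order == ["rich"]   # the high-utility action runs first; task solved
```

In the real system, step 4 also propagates the new evidence through the net, which in turn changes the utilities computed in step 2 on the next iteration.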
Bayes Nets. Nodes in a Bayes net represent random variables with (usually) a
discrete set of values (e.g. a utensil node could have values (knife, fork, spoon)). Links
in the net represent (via tables) conditional probabilities that a node has a particular
value given that an adjacent node has a particular value. Belief in the values for node
X is defined as B E L ( x ) -- P ( x I e), where e is the combination of all evidence present
in the net. Evidence, produced by running a visual action, directly supports the possible
values of a particular node (i.e. variable) in the net. There exist a number of evidence
propagation algorithms, which recompute belief values for all nodes given one new piece
of evidence. Several references provide good introductions to the Bayes net model and
associated algorithms, e.g. [2, 5, 7].
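The quantity BEL(x) = P(x | e) can be illustrated with a two-node toy net; the utensil example echoes the node mentioned above, but the probabilities are invented, and real propagation algorithms organize this computation efficiently for whole networks:

```python
# Toy example (our own numbers): evidence on a child node is propagated
# to beliefs over its parent by Bayes' rule.

def bel_parent(p_a, p_b_given_a, observed_b):
    """P(A = a | B = observed_b) for every value a of A."""
    joint = {a: p_a[a] * p_b_given_a[a][observed_b] for a in p_a}
    z = sum(joint.values())                      # P(observed_b)
    return {a: v / z for a, v in joint.items()}

p_utensil = {"fork": 0.5, "spoon": 0.5}          # prior over the node
p_shape = {"fork": {"tines": 0.9, "bowl": 0.1},  # P(shape | utensil), the
           "spoon": {"tines": 0.2, "bowl": 0.8}} # link's conditional table
bel = bel_parent(p_utensil, p_shape, "tines")
assert abs(bel["fork"] - 0.45 / 0.55) < 1e-12    # belief shifts to "fork"
```

Running a visual action attaches such an observation to the net, and propagation updates BEL for every affected node.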
Composite Bayes Net. TEA-1 uses a composite net, a method for structuring
knowledge into several separate Bayes nets [10]. A PART-OF net models subpart relation-
ships between objects and whether an object is present in the scene or not. An expected
area net models geometric relations between objects and the location of each object.
Section 3 presents the expected area net in detail. Associated with each object is an IS-A
tree, a taxonomic hierarchy modeling one random variable that has many mutually ex-
clusive values [7]. Task-specific knowledge is contained in a task net. There is one task
net for each task, for example "Is this a fancy meal?", that TEA-1 can solve. Each of
the separate nets in the composite net, except the task net, maintains its BEL values
independently of the other nets. Evidence in the other nets affects the task net through a
mechanism called packages, which updates values in evidence nodes in the task net using
copies of belief values in the other nets.
Actions. TEA-1 uses the following description of an action:
- Precondition. The precondition must be satisfied before the action can be executed.
There are four types of precondition: that a particular node in the expected area net
be instantiated, that it not be instantiated, that it be instantiated and within the
field of view for the current camera position, and the empty precondition.
- Function. A function is called to execute the action. All actions are constructed from
one or more low-level vision modules, process either foveal image or peripheral image
data, and may first move the camera or fovea.
- Adding evidence. An action may add evidence to several nets and may do so in several
ways (see [7]): 1) A chance node can be changed to a dummy node, representing
virtual or judgemental evidence bearing on its parent node. 2) A chance node can
Each kind of object usually has several actions associated with it. TEA-1 currently has
20 actions related to 7 objects. For example, the actions related to plates are: The
per-detect-template-plate action moves the camera to a specified position and uses a
model grayscale template to detect the presence and location of a plate in the peripheral
image. Per-detect-hough-plate uses a Hough transform for plate-sized circles for the
same purpose. Per-classify-plate moves the camera to a specified position, centers a
window in the peripheral image there, and uses a color histogram to classify that area
as paper or ceramic. Fov-classify-plate moves the fovea (but not the camera) to a
specified location and uses a color histogram to classify the area as paper or ceramic.
Calculating an Action's Utility. The utility U(a) of an action a is fundamentally
modeled as U(a) = V(a)/C(a), a ratio of value V(a) and cost C(a). The value of an
action, i.e. how useful it is toward the task, is based on Shannon's measure of average
mutual information, V(a) = I(T, ea), where T is the variable representing the goal of
the task and ea is the combination of all the evidence added to the composite net by
action a. An action's cost is its execution time. The exact forms of the cost and utility
functions depend on the expected area net and will be given in Section 4.
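The value term can be sketched directly from its definition. The joint distributions below are invented for illustration; real TEA derives them from the Bayes net:

```python
# Sketch of U(a) = V(a)/C(a) with V(a) = I(T, e_a), the average mutual
# information between the task variable T and the action's evidence.

from math import log2

def mutual_information(p_te):
    """I(T; E) from a joint table {(t, e): p}."""
    p_t, p_e = {}, {}
    for (t, e), p in p_te.items():          # marginals
        p_t[t] = p_t.get(t, 0.0) + p
        p_e[e] = p_e.get(e, 0.0) + p
    return sum(p * log2(p / (p_t[t] * p_e[e]))
               for (t, e), p in p_te.items() if p > 0)

def utility(p_te, cost_seconds):
    return mutual_information(p_te) / cost_seconds

# A perfectly informative (but slow) action vs. an uninformative fast one.
informative = {("fancy", "+"): 0.5, ("plain", "-"): 0.5}
useless = {("fancy", "+"): 0.25, ("fancy", "-"): 0.25,
           ("plain", "+"): 0.25, ("plain", "-"): 0.25}
assert abs(mutual_information(informative) - 1.0) < 1e-12   # 1 bit
assert mutual_information(useless) < 1e-12                  # 0 bits
assert utility(informative, 2.0) > utility(useless, 0.1)
```

The ratio form means a slightly less informative but much cheaper action can still win the selection in step 2 of the control loop.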
An important feature of the TEA-1 design is that a different task net is plugged into
the composite net for each task the system is able to solve. The calculation of an action's
value depends on the task net. Thus the action utilities directly reflect the information
needs of the specific task, and produce a pattern of camera and fovea movements and
visual operations that is unique to the task.
3 The Expected Area Net
Geometric relations between objects are modeled by an expected area net. The expected
area net and the PART-OF net have the same structure: a node in the PART-OF net identifies a
particular object within the sub-part structure of the scene, and the corresponding node
in the expected area net identifies the area in the scene in which that object is expected
to be located. Fig. 1 shows the structure of one example of an expected area net.
In TEA-1 we assume a fixed camera origin. The location of an object in the scene is
specified by the two camera angles, Θ = (θpan, θtilt), that would cause the object to be
centered in the visual field. The height and width of an object's image are also specified
using camera angles.
Thus a node in the expected area net represents a 2-D discrete random variable, Θ.
BEL(Θ) is a function on a discrete 2-D grid, with a high value corresponding to a scene
location at which the object is expected with high probability. Fig. 2(a)-(b) shows two
examples of expected areas. Note that these distributions are for the location of the
center of the object, and not areas of the scene that may contain any part of the object.
Each node also contains values for the height and width of the object. Initially these are
expected values, but once an object is located by a visual action the detected height and
width are stored instead. The height and width are not used in the belief calculation directly,
but will be used to calculate conditional probabilities on the links (see below).
A root node R of an expected area net has an a priori probability, P(ΘR), which we
assume is given. A link from node A to node B has an associated conditional probability,
P(ΘB | ΘA). Given a reasonable discretization, say a 32x32 grid, each conditional
probability table has just over a million entries. Such tables are unreasonable to specify
Fig. 1. The structure of an expected area net. The corresponding PART-OF net is similar.
Fig. 2. The expected area (a) for setting-area (a place setting area) before the location of any
other object is determined, and (b) for napkin after the location of the tabletop and plate have
been determined. The relation maps (c) for setting-area given tabletop, and (d) for napkin given
setting-area.
and cause the calculation of new belief values to be very slow. Next we present a way to
limit this problem.
Relation Maps Simplify Specification of Probabilities. We make the following
observations about the table of P(Θ_B | Θ_A) values: 1) The table is highly repetitious.
Specifically, ignoring edge effects, for every location of object A the distributions are
all the same if they are considered relative to the given location of object A. 2) Belief
calculations can be sped up by detecting terms that will have zero value. Therefore, rather
than calculate all values of the distribution, we should use a function to calculate selective
values. 3) The distribution depends on the size of object A. We assume the expected
height and width of object A's image are known, but whenever an action provides direct
observations of the object's dimensions those values should be used instead.
Our solution is to compute values of the conditional probabilities using a special
simplified distribution called a relation map. A relation map assumes that object A has
unity dimensions and is located at the origin. The relation map is scaled and shifted ap-
propriately to obtain values of the conditional probability. This calculation is performed
by a function that can be called to calculate select values of the conditional probability.
Note that the spatial resolution of the relation map grid can be less than that of the
expected area grid. Fig. 2(c)-(d) shows two examples of relation maps that were used in
the calculation of the two expected areas shown in Fig. 2(a)-(b).
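As a sketch, the shift-and-scale lookup described above might look like the following (our own illustration; the names and the dictionary representation of the relation map are assumptions, not the authors' code):

```python
# Illustrative sketch: a relation map is defined as if object A had unit
# size and sat at the origin; to evaluate a conditional probability it is
# shifted to A's location and scaled by A's expected (or observed) size.

def cond_prob(relation_map, ax, ay, aw, ah, bx, by):
    """Probability that object B is centered at (bx, by), given object A
    centered at (ax, ay) with image width aw and height ah.
    relation_map: dict mapping coarse-grid cells (u, v) to probabilities,
    expressed relative to a unit-sized A at the origin."""
    u = (bx - ax) / aw          # shift to A's frame, scale by A's width
    v = (by - ay) / ah          # ... and by A's height
    return relation_map.get((round(u), round(v)), 0.0)  # outside map: zero

# Toy map: B is expected one "A-width" to the right of A.
rmap = {(1, 0): 0.8, (1, 1): 0.1, (1, -1): 0.1}
p_right = cond_prob(rmap, ax=10.0, ay=20.0, aw=2.0, ah=2.0, bx=12.0, by=20.0)
p_same = cond_prob(rmap, ax=10.0, ay=20.0, aw=2.0, ah=2.0, bx=10.0, by=20.0)
```

Because the function is queried only at the grid cells actually needed, no full N⁴ table is ever materialized.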
Given an expected area grid that is 32×32 (N×N) and a relation map grid that is
16×16 (M×M), all the values for one link's conditional probability table can be obtained
by specifying only 256 (M²) values. The brute force approach would require that 1048576
(N⁴) values be specified.
Speeding up Calculation of Belief Values. When the set of expected locations for
an object covers a relatively small area of the entire scene, the table of P(Θ_B | Θ_A) values
contains a large number of essentially zero values that can be used to speed up the belief
propagation computation. The equations for belief propagation (and our notation) can
be found in [7]. We do not give all the equations here for lack of space. The calculation
of new BEL(x) values for node X, with parent node U, contains two key equations:
π(x) = Σ_j P(x | u_j) π_X(u_j) and λ_X(u) = Σ_i P(x_i | u) λ(x_i). These summations involve
considerable time since x and u both denote a 2-D array (grid) of variables. Time can be
saved in the first equation by not summing a term (which is an array) when it is multiplied
by an essentially zero value. Specifically, for all j where π_X(u_j) is essentially zero, we do
not add the P(x | u_j) π_X(u_j) term (an array) into the summation. Similar savings can
be obtained in the second equation. For any given value of i, the P(x_i | u) λ(x_i) term
(an array) contains essentially zero values everywhere except for a few places (a small
window in the array). We locate that window and only perform the sum for values inside
the window.
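The first saving can be sketched as follows (a minimal illustration with assumed names and an assumed "essentially zero" threshold, not the system's implementation):

```python
# Sketch of the first speed-up: when computing
#   pi(x) = sum_j P(x | u_j) * pi_X(u_j),
# any j whose incoming message pi_X(u_j) is essentially zero contributes
# nothing, so the whole array-valued term is skipped.

EPS = 1e-6  # assumed threshold for "essentially zero"

def pi_message(cond_prob, pi_x_u, cells):
    """cond_prob(x, u) -> P(x | u); pi_x_u: dict u -> message value;
    cells: the grid cells x ranges over. Returns dict x -> pi(x)."""
    pi = {x: 0.0 for x in cells}
    for u, msg in pi_x_u.items():
        if msg <= EPS:              # skip the entire term for this u_j
            continue
        for x in cells:
            pi[x] += cond_prob(x, u) * msg
    return pi

# Toy example: two cells, uniform conditional probability; the u2 term
# is skipped entirely.
pi = pi_message(lambda x, u: 0.5,
                {"u1": 0.6, "u2": 0.0, "u3": 0.4}, ["a", "b"])
```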
Combining Location Information. The expected area for node B is actually calculated
not from a single node like node A, but by combining "messages" about expected
areas sent to it from its parent and all its children. This combination is performed within
the calculation of BEL(B). Generally, it is useful to characterize relations as "must-be",
"must-not-be" and "could-be". Combination of two "must-be" maps would then be by
intersection, and in general map combination would proceed by the obvious set-theoretic
operations corresponding to the inclusive or exclusive semantics of the relation. In TEA-1,
however, all the relations are "could-be", and the maps are essentially unioned by the
belief calculation.
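The set-theoretic combination can be sketched as follows, assuming maps are represented as sets of grid cells (an illustration, not TEA-1's actual representation):

```python
# Sketch of combining location maps: "must-be" maps combine by
# intersection; "could-be" maps, the only kind arising in TEA-1,
# combine by union.

def combine(maps, semantics):
    cells = set(maps[0])
    for m in maps[1:]:
        if semantics == "must-be":
            cells &= set(m)      # object must satisfy every relation
        else:                    # "could-be"
            cells |= set(m)      # object may satisfy any relation
    return cells

parent_msg = {(0, 0), (0, 1), (1, 1)}
child_msg = {(0, 1), (1, 1), (2, 2)}
must = combine([parent_msg, child_msg], "must-be")
could = combine([parent_msg, child_msg], "could-be")
```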
Moving Cameras. Actions in TEA-1 that must move the camera to the expected location
of a specific (expected) object, say X, will move the camera to the center of mass of
the expected area for object X. (This happens even if the expected area, when thresh-
olded to a given confidence level, is larger than the camera's field of view. That case
could be handled by making several camera movements to cover the expected area.)
Processing Only a Portion of an Image. Every action related to a specific object
X processes only the portion of the image that is covered by the expected area of object
X, when thresholded to a given confidence level. Let l ∈ (0, 1) be the confidence level,
which usually will be chosen close to 1 (typically 0.9). Let G'_X be the smallest subset of
all the grid points G_X for node X (that corresponds with object X) in the expected area
net, such that their probabilities add up to l. G'_X is the portion of the scene that should
be analyzed by the action. Each action in TEA-1 creates a mask that corresponds to the
portion of the current image data (i.e. after a camera movement) that overlaps G'_X, and
processes only the image pixels that are covered by that mask.
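The construction of G'_X can be sketched as follows (illustrative names; TEA-1's actual data structures may differ):

```python
# Sketch: the smallest set of grid cells whose probabilities sum to the
# confidence level l, obtained greedily by taking cells in order of
# decreasing probability.

def confidence_cells(bel, l=0.9):
    """bel: dict cell -> probability (summing to 1). Returns the subset
    of cells the action should analyze."""
    chosen, total = set(), 0.0
    for cell, p in sorted(bel.items(), key=lambda kv: -kv[1]):
        chosen.add(cell)
        total += p
        if total >= l:
            break
    return chosen

bel = {(0, 0): 0.5, (0, 1): 0.3, (1, 0): 0.15, (1, 1): 0.05}
mask = confidence_cells(bel, l=0.9)   # 0.5 + 0.3 + 0.15 >= 0.9
```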
Deciding with Expected Areas. TEA-1's utility function for an action has the
following features: 1) Costs are proportional to the amount of image data processed.
2) It deals with peripheral actions that detect an object but don't otherwise generate
information for the task. 3) It considers that an action may have the impact of making
the expected areas of other objects smaller. Recall that the utility of an action a is
fundamentally modeled as a ratio of value V(a) (average mutual information) and cost
C(a) as explained near the end of Section 2.
An action a related to a specific object X has a cost proportional to the amount of
image data that it processes. Thus TEA-1 defines the cost as C(a) = r_X C_0(a), where
C_0(a) is the execution time of action a if it processed a hypothetical image covering
the entire scene. r_X is the ratio of the expected area for object X and the area of the
entire scene. So the value of r_X is the size of the subset G'_X divided by the size of the
set G_X. r_X = 1 means that object X could be located anywhere in the entire scene.
Over time, as other objects in the scene are located and as more and tighter relations
are established, the value of r_X approaches zero. (Soon we will use a more accurate
cost function that has an additional term for the cost of moving the camera or fovea,
C(a) = C_move(a) + r_X C_0(a).)
TEA-1 uses the following "lookahead" utility function U(a) for action a:

U(a) = max_{γ ∈ Pre(a)} V(γ)/C(γ) + · · ·   (1)

The first term in equation (1) accounts for the future value of establishing the location of
an object. Pre(a) is the set of actions γ such that EITHER γ has a precondition satisfied
by executing action a OR γ is already executable and V(γ)/C(γ) < V(a)/C(a). The
second term in equation (1) accounts for the impact of making expected areas smaller so
that future actions will have lower costs. s_X is like r_X except it assumes that the location
of action a's associated object is known. H ∈ (0, 1) is a gain factor that specifies how
much to weigh the second term relative to the first term. See [7] and [10] for more details
about λ and U respectively.
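The first term of the utility can be sketched as follows (action names and numbers are illustrative; the second, expected-area term is omitted):

```python
# Sketch of the first term of U(a): the best value/cost ratio over Pre(a),
# the set of actions that a enables plus the already-executable actions
# whose ratio is below a's own.

def lookahead_term(a, actions, enabled_by):
    """actions: dict name -> (V, C); enabled_by: dict name -> set of
    actions whose precondition executing `name` satisfies."""
    ratio = lambda g: actions[g][0] / actions[g][1]
    pre = set(enabled_by.get(a, ()))
    pre |= {g for g in actions if g != a and ratio(g) < ratio(a)}
    return max(ratio(g) for g in pre) if pre else ratio(a)

acts = {"per-detect-plate": (4.0, 2.0),   # V = 4, C = 2 -> ratio 2.0
        "fov-classify-cup": (3.0, 1.0),   # ratio 3.0
        "fov-verify-butter": (0.5, 1.0)}  # ratio 0.5
u1 = lookahead_term("per-detect-plate", acts,
                    {"per-detect-plate": {"fov-classify-cup"}})
```

Here detecting the plate is credited with the high ratio (3.0) of the classification action it enables.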
5 Experimental Results
 8   4.3   per-detect-napkin       0.026
 9   3.3   fov-classify-cup        0.026
10   2.4   fov-classify-plate      0.026
11   1.7   per-detect-hough-bowl   0.026
12   0.6   per-detect-butter       0.026
13   0.4   fov-verify-butter       0.026
Fig. 3. The sequence of actions selected and executed by TEA-1 is shown in the table at left.
Each line corresponds to one cycle in the main control loop. The belief values listed are those
after incorporating the results from each action. The BEL(i) column shows BEL(informal),
and BEL(formal) = 1 - BEL(i). The path drawn on the wide-angle picture of the table scene
at the right illustrates the camera movements made in the action sequence.
Fig. 4. Processing performed by individual actions. Image pixels outside the expected area mask
are shown as gray values. (a) Results from the per-detect-hough-plate action executed at time
step 4. (b) Results from the per-detect-napkin action executed at time step 8. The mask prevents
the red napkin from being confused with the pink creamer container just above the plate. (c)
Results from the fov-classify-plate action executed at time step 10. A zoomed display of the fovea
centered on the plate is shown. Note: Fig. 5(b) shows results from the per-detect-hough-cup
action executed at time step 2.
location of the cup, and the action mistakenly detects the creamer container as the cup.
The situation improves once the tabletop is located, as shown in parts (b) and (f). The
expected area is (almost) small enough to fit in the field of view and its center corre-
sponds better with the cup's actual location. A small portion of the image is masked out
by the expected area, and the cup is correctly detected, but this is just lucky since the
creamer and many other objects are still in the unmasked area. Parts (c) and (g) show
the situation after the plate has been located. The cup's expected area is much smaller.
Finally, in parts (d) and (h), once the napkin has been located, the cup's expected area
is small enough that the action is very likely to detect the cup correctly.
6 Concluding Remarks
Several people are investigating the use of Bayes nets and influence diagrams in sensing
problems. The most relevant work comes from two groups: Levitt's group was the first to
apply Bayes nets to computer vision [1, 6]. Dean's group is studying applications in sensor
based mobile robot control, using a special kind of influence diagram called a temporal
belief network (TBN) [3, 4]. More recently, they have used sensor data to maintain an
occupancy grid, which in turn affects link probabilities in the TBN.
The current TEA-1 system design, incorporating expected area nets, provides a frame-
work that enables the system to make decisions about moving a camera around and about
selectively gathering information. Thus we can begin using TEA-1 to study questions re-
garding task-oriented vision [8, 9, 10].
Deciding where to move a camera (or fovea) is an interesting problem. TEA-1 does
the simplest thing possible by moving to the center of the expected area of one object.
If several objects of interest should fall in the field of view, then it may for example be
better to move the camera to the center of that set of objects. In our experiments to
date, TEA-1 has relied mainly on camera movements to get the first piece of information
about an object, while fovea movements are mostly used for verification. This behavior
is determined by the costs and other parameters associated with actions. Another inter-
esting problem is to consider the tradeoffs between a camera and a fovea movement. A
camera movement is expensive and an action following one processes a completely new
area of the scene, which means there is risk of not finding anything, but if something
is found it will likely have large impact for the task. Alternatively, a fovea movement is
cheap but produces image data near an area already analyzed, so there is a good chance
of finding some new information, but it will tend to have a small impact on the task.
References
1. J. M. Agosta. The structure of Bayes networks for visual recognition. In Uncertainty in
AI, pages 397-405. North-Holland, 1990.
2. E. Charniak. Bayesian networks without tears. AI Magazine, 12(4):50-63, Winter 1991.
3. T. Dean, T. Camus, and J. Kirman. Sequential decision making for active perception. In
Proceedings: DARPA Image Understanding Workshop, pages 889-894, 1990.
4. T. L. Dean and M. P. Wellman. Planning and Control. Morgan Kaufmann, 1991.
5. M. Henrion, J. S. Breese, and E. J. Horvitz. Decision analysis and expert systems. AI
Magazine, 12(4):64-91, Winter 1991.
6. T. Levitt, T. Binford, G. Ettinger, and P. Gelband. Probability-based control for computer
vision. In Proceedings: DARPA Image Understanding Workshop, pages 355-369, 1989.
7. J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference.
Morgan Kaufmann, 1988.
Fig. 5. Performance of a cup detection action as the cup's expected area narrows over time:
(a) before any objects have been located, (b) after the tabletop has been located, (c) after the
tabletop and plate have been located, (d) after the tabletop, plate and napkin have been located.
The cup's expected areas in (a)-(d) are plotted separately in (e)-(h). These plots must be rotated
90 degrees clockwise to match the images in (a)-(d).
8. R. D. Rimey. Where to look next using a Bayes net: An overview. In Proceedings: DARPA
Image Understanding Workshop, 1992.
9. R. D. Rimey and C. M. Brown. Task-oriented vision with multiple Bayes nets. Technical
Report 398, Department of Computer Science, University of Rochester, November 1991.
10. R. D. Rimey and C. M. Brown. Task-oriented vision with multiple Bayes nets. In A. Blake
and A. Yuille, editors, Active Vision. MIT Press, 1992. Forthcoming.
This article was processed using the LaTeX macro package with ECCV92 style
An Attentional Prototype for Early Vision
Abstract.
Researchers have long argued that an attentional mechanism is required
to perform many vision tasks. This paper introduces an attentional prototype
for early visual processing. Our model is composed of a process-
ing hierarchy and an attention beam that traverses the hierarchy, passing
through the regions of greatest interest and inhibiting the regions that are
not relevant. The type of input to the prototype is not limited to visual
stimuli. Simulations using high-resolution digitized images were conducted,
with image intensity and edge information as inputs to the model. The results
confirm that this prototype is both robust and fast, and promises to
be essential to any real-time vision system.
1 Introduction
Systems for computer vision are confronted with prodigious amounts of visual informa-
tion. They must locate and analyze only the information essential to the current task
and ignore the vast flow of irrelevant detail if any hope of real-time performance is to
be realized. Attention mechanisms support efficient, responsive analysis; they focus the
system's sensing and computing resources on selected areas of a scene and may rapidly
redirect these resources as the scene and task requirements evolve. Vision systems that have
no task guidance, and must provide a description of everything in the scene at a high
level of detail as opposed to searching and describing only a sub-image for a pre-specified
item, have been shown to be computationally intractable [16]. Thus, task guidance, or
attention, plays a critical role in a system that is hoped to function in real time. In short,
attention simplifies computation and reduces the amount of processing.
Computer vision models which incorporate parallel processing are prevalent in the
literature. This strategy appears appropriate for the vast amounts of input data that
must be processed at the low-level [4, 19]. However, complete parallelism is not possible
because it requires too many processors and connections [11, 17]. Instead, a balance
must be found between processor-intensive parallel techniques and time-intensive serial
techniques. One way to implement this compromise is to process all data in parallel at
the early stages of vision, and then to select only part of the available data for further
processing at later stages. Herein lies the role of attention: to tune the early visual input
by selecting a small portion of the visual stimuli to process.
This paper presents a prototype of an attentional mechanism for early visual process-
ing. The attention mechanism consists of a processing hierarchy and an attention beam
that guides selection. Most attention schemes previously proposed are fragile with respect
to the question of "scaling up" with the problem size. However, the model presented here
has been derived with a full regard of the amount of computation required. In addition,
this model provides all of the details necessary to construct a full implementation that
552
is fast and robust. Very few implemented models of attention exist. Of those, ours is
one of the first that performs well with general high-resolution images. Our implemented
attention beam may be used as an essential component in the building of a complete
real-time computer vision system.
Certain aspects of our model are not addressed in this investigation, such as the
implementation of task guidance in the attention scheme. Instead, emphasis is placed on
the bottom-up dimensions of the model that localize regions of interest in the input and
order these regions based on their importance.
The simulations presented in this paper reveal the potential of this attention scheme.
The speed and accuracy of our prototype are demonstrated by using actual 256 × 256
digitized images. The mechanism's input is not constrained to any particular form, and
can be any response from the visual stimuli. For the results presented, image intensity
and edge information are the only input used. For completeness, relationships to existing
computational models of visual attention are described.
2 Theoretical Framework
The structure of the attention model presented in this paper is determined in part by
several constraints derived from a computational complexity analysis of visual search
[17]. This complexity analysis quantitatively confirms that selective attention is a major
contributor in reducing the amount of computation in any vision system. Furthermore,
the proposed scheme is loosely modelled after the increasing neurophysiology literature
on single-cell recordings from the visual cortex of awake and active primates. Moreover,
the general architecture of this prototype is consistent with their neuroanatomy [17, 18].
At the most basic level, our prototype is comprised of a hierarchical representation
of the input stimuli and an attention mechanism that guides selection of portions of
the hierarchy from the highest, most abstract level through to the lowest level. Spatial
attentional influence is applied in a "spotlight" fashion at the top. The notion of a
spotlight appears in many other models such as that of Treisman [15]. However, if the
spotlight shines on a unit at the top of the hierarchy, there seems to be no mechanism
for the rest of the selection to actually proceed through to the desired items.
One way to solve this problem in a computer vision system is to simply address the
unit of interest. Such a solution works in the computer domain because computer memory
is random access. Unfortunately, there is no evidence for random access in the visual
cortex. Another possible solution is to simply connect all the units of interest directly.
This solution also fails to explain how the human visual cortex may function because the
number of such connections is prohibitive. For instance, to connect all possible receptive
fields to the units in a single 1000 × 1000 representation, 10^15 connections are needed to
do so in a brute force manner 1. Given that the cortex contains 10^10 neurons, with an
estimated total number of connections of 10^13, this is clearly not how nature implements
access to high resolution representations.
The spotlight analogy is therefore insufficient, and instead we propose the idea of a
"beam" - something that illuminates and passes through the entire hierarchy. A beam is
required that "points" to a set of units at the top. That particular beam shines throughout
the processing hierarchy with an inhibit zone and a pass zone, such that the units in the
pass zone are the ones that are selected (see Fig. 1). The beam expands as it traverses
the hierarchy, covering all portions of the processing mechanism that directly contribute
to the output at its point of entry at the top. At each level of the processing hierarchy,
a winner-take-all process (WTA) is used to reduce the competing set and to determine
the pass and inhibit zones [18].
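A toy sketch of the traversal, simplified to a single winner per level (the model's WTA can retain a set of winning inputs; names and structure are illustrative):

```python
# Sketch of the beam traversal: starting from the unit selected at the
# top, a winner-take-all among each unit's inputs decides which child
# joins the pass zone; its siblings fall into the inhibit zone.

def beam_select(children, response, root):
    """children: dict unit -> list of input units; response: dict unit ->
    activation. Returns (pass_zone, inhibit_zone) as lists."""
    pass_zone, inhibit_zone = [root], []
    unit = root
    while children.get(unit):
        kids = children[unit]
        winner = max(kids, key=lambda k: response[k])   # WTA at this level
        pass_zone.append(winner)
        inhibit_zone.extend(k for k in kids if k != winner)
        unit = winner
    return pass_zone, inhibit_zone

# Two-level toy hierarchy above a pair of "pixel" units.
children = {"top": ["mid_a", "mid_b"], "mid_a": ["px1", "px2"]}
resp = {"mid_a": 0.9, "mid_b": 0.4, "px1": 0.2, "px2": 0.7}
pz, iz = beam_select(children, resp, "top")
```

Inhibiting the winning inputs after a region has been analyzed (and re-running the selection) yields the "looking for the next area" behavior described in the next section.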
Fig. 1. The inhibitory attentional beam concept. Several levels of the processing hierarchy are
shown. The pass zone of the beam encompasses all winning inputs at each level of the hierarchy.
The darkest beams represent the actual inhibit zones rooted at each level of the hierarchy. The
light-grey beam represents the effective inhibit zone rooted at the most abstract level.
hierarchy will compute their responses based on a weighted summation of their inputs,
resulting in the configuration of Fig. 2(a). The first pass of the WTA scheme is shown in
Fig. 2(b). This is accomplished by applying steps 5 and 6 of Algorithm 1 for each level of
the hierarchy. Once an area of the input is attended to and all the desired information is
extracted, then the winning inputs are inhibited. The attention process continues "look-
ing" for the next area. The result is a very fast, automatic, independent, robust system.
Moreover, it is a continuous and reactive mechanism. In a time-varying image, it can
track an object that is moving if it is the item of highest response. In order to construct
such a tracking system, the input would be based on motion.
Fig. 2. A one-dimensional processing hierarchy. (a) The initial configuration. (b) The most
"important" item is selected - the beam's pass (solid lines) and inhibit zone (dashed lines) are
shown.
F(x) = (α + 1) / (α + β^(−√x))

where x represents the number of basic elements in the receptive field. Varying α affects
the absolute value of the function's asymptote; varying β affects the steepness of the
first part of the function. It was found empirically that values of α = 10 and β = 1.03
generally give good results in most instances (see Figure 3).
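A plausible reading of the weighting function, consistent with the footnote's normalization F(0) = 1 and the Fig. 3 asymptote of 1.10 for α = 10, is F(x) = (α + 1)/(α + β^(−√x)); a numeric sketch:

```python
# Numeric check of the weighting function as read above (our
# reconstruction): F(x) = (alpha + 1) / (alpha + beta ** (-sqrt(x))).
# It is normalized so F(0) = 1 and, since beta ** (-sqrt(x)) -> 0,
# approaches (alpha + 1) / alpha (= 1.10 for alpha = 10) as the RF
# area x grows, matching the plot in Fig. 3.
import math

def weight(x, alpha=10.0, beta=1.03):
    return (alpha + 1.0) / (alpha + beta ** (-math.sqrt(x)))
```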
Fig. 3. F(x) for α = 10, β = 1.03. The weighting factor rises from 1.00 toward 1.10 as the RF
area (plotted in units of 10³) grows.
3 The number 1 in the numerator is a result of normalizing F(x) for x = 0. The √x is used to
account for the area of the RF.
4 Experimental Results
4 In this instance, maxRF was set to 100 pixels so that only a portion of the longest line was
attended to at first.
Fig. 4. Processing hierarchy and attention beam at two time intervals. The input layer is a
256 × 256 8-bit image. The beam is rooted at the highest level and "shines" through the hierarchy
to the input layer. The darker portion of the attention beam is the pass zone. Once a region of
the input is attended to, it is inhibited and the next "bright" area is found. The black areas in
the input layer indicate the regions that have been inhibited.
are attended to in order of the length of the line, much like Sha'ashua and Ullman's work
on saliency of curvature [14].
5 Discussion
The implementation of our attention prototype has a number of important properties that
make it preferable to other schemes. For example, Chapman [3] has recently implemented
a system based on the idea of a pyramid model of attention introduced by Koch and
Ullman [5]. Chapman's model places a log-depth tree above a saliency map. Similar to
our model, at each level nodes receive activation from nodes below. It differs, however,
in that Chapman's model only passes the maximum of these values to the next level.
There are several difficulties with this approach, the most serious being that the focus of
attention is not continuously variable. The restriction this places on Chapman's model
is that it cannot handle real pixel-based images but must assume a prior mechanism for
segmenting the objects and normalizing their sizes. Our scheme permits receptive fields
of all sizes at each level, with overlap. In addition, the time required for Chapman's
model is logarithmic in the maximum number of elements, making it impractical for
high-resolution images. Further, the time required to process any item in a sensory field
Fig. 5. Scan paths for a 256 × 256 8-bit digitized image (minRF = 5, maxRF = 40).
The paths display a priority order in which regions of the image are assessed.
Fig. 6. Scan paths for a 256 × 256 8-bit digitized image consisting of horizontal and
vertical lines (minRF = 10, maxRF = 100). The path displays a priority order in which regions
of the image are assessed. The focus of attention falls on the longest lines first (only a portion
of the longest line is attended to first in this example because maxRF = 100).
Summary
Acknowledgements
Niels da Vitoria Lobo provided helpful suggestions. This research was funded by the
Information Technology Research Centre, one of the Province of Ontario Centres of
Excellence, the Institute for Robotics and Intelligent Systems, a Network of Centres of
Excellence of the Government of Canada, and the Natural Sciences and Engineering
Research Council of Canada.
References
1. C.H. Anderson and D.C. Van Essen. Shifter circuits: A computational strategy for dynamic
aspects of visual processing. In Proceedings of the National Academy of Science, USA,
volume 84, pages 6297-6301, 1987.
2. A. Califano, R. Kjeldsen, and R.M. Bolle. Data and model driven foveation. Technical
Report RC 15096 (#67343), IBM Research Division - T.J. Watson Lab, 1989.
3. D. Chapman. Vision, Instruction and Action. PhD thesis, MIT AI Lab, Cambridge, MA,
1990. TR1204.
4. J.A. Feldman. Four frames suffice: A provisional model of vision and space. The Behavioral
and Brain Sciences, 8:265-313, 1985.
5. C. Koch and S. Ullman. Shifts in selective visual attention: Towards the underlying neural
circuitry. Human Neurobiology, 4:219-227, 1985.
6. B. Kröse and B. Julesz. The control and speed of shifts of attention. Vision Research,
29(11):1607-1619, 1989.
7. D. Marr. Early processing of visual information. Phil. Trans. R. Soc. Lond., B 275:483-524,
1976.
8. J. Moran and R. Desimone. Selective attention gates visual processing in the extrastriate
cortex. Science, 229:782-784, 1985.
9. M.C. Mozer. A connectionist model of selective visual attention in visual perception. In
Proceedings: 9th Conference of the Cognitive Science Society, pages 195-201, 1988.
10. M.C. Mozer. The Perception of Multiple Objects: A Connectionist Approach. MIT Press,
Cambridge, MA, 1991.
11. U. Neisser. Cognitive Psychology. Appleton-Century-Crofts, New York, NY, 1967.
12. M.I. Posner, Y. Cohen, and R.D. Rafal. Neural systems control of spatial orienting. Phil.
Trans. R. Soc. Lond., B 298:187-198, 1982.
13. P.A. Sandon. Simulating visual attention. Journal of Cognitive Neuroscience, 2(3):213-231,
1990.
14. A. Sha'ashua and S. Ullman. Structural saliency: The detection of globally salient structures
using a locally connected network. In Proceedings of the Second ICCV, pages 321-325,
Tampa, FL, 1988.
15. A. Treisman. Preattentive processing in vision. Computer Vision, Graphics, and Image
Processing, 31:156-177, 1985.
16. J.K. Tsotsos. The complexity of perceptual search tasks. In Proceedings, IJCAI, pages
1571-1577, Detroit, 1989.
17. J.K. Tsotsos. Analyzing vision at the complexity level. The Behavioral and Brain Sciences,
13:423-469, 1990.
18. J.K. Tsotsos. Localizing stimuli in a sensory field using an inhibitory attentional beam.
Technical Report RBCV-TR-91-37, University of Toronto, 1991.
19. L.M. Uhr. Psychological motivation and underlying concepts. In S.L. Tanimoto and
A. Klinger, editors, Structured Computer Vision. Academic Press, New York, NY, 1980.
20. A.L. Yarbus. Eye Movements and Vision. Plenum Press, 1967.
What can be seen in three dimensions with an
uncalibrated stereo rig?
Olivier D. Faugeras
1 Introduction
The problem we address in this paper is that of a machine vision system with two cameras,
sometimes called a stereo rig, to which no three-dimensional metric information has been
made available. The only information at hand is contained in the two images. We assume
that this machine vision system is capable, by comparing these two images, of establishing
correspondences between them. These correspondences can be based on some measures
of similitude, perhaps through some correlation-like process. Anyway, we assume that
our system has obtained by some means a number of point correspondences. Each such
correspondence, denoted (m, m'), indicates that the two image points m and m' in the two
retinas are very likely to be the images of the same point out there. It is very doubtful
at first sight that such a system can reconstruct anything useful at all. In the machine
vision jargon, it knows neither its intrinsic parameters (one set for each camera)
nor its extrinsic parameters (relative position and orientation of the cameras).
Surprisingly enough, it turns out that the machine vision system can nonetheless
reconstruct some very rich non-metric representations of its environment. These repre-
sentations are defined up to certain transformations of the environment which we assume
to be three-dimensional and euclidean (a realistic assumption which may be criticized by
some people). These transformations can be either affine or projective transformations
of the surrounding space. This depends essentially on the user (i.e the machine vision
system) choice.
This work has been inspired by the work of Jan Koenderink and Andrea van Doorn
[4], the work of Gunnar Sparr [9,10], and the work of Roger Mohr and his associates [6,7].
We use the following notations. Vectors and matrices will be represented in boldface,
geometric entities such as points and lines in normal face. For example, m represents a
point and m the vector of the coordinates of the point. The line defined by two points
M and N will be denoted by (M, N). We will assume that the reader is familiar with
elementary projective geometry such as what can be found in [8].
2.1 A simple expression for P

We write that

P A_i = p_i a_i,   i = 1, ..., 4

which implies, thanks to our choice of coordinate systems, that P has the form:

P = [ p_1   0    0   p_4
       0   p_2   0   p_4
       0    0   p_3  p_4 ]   (1)

Let a_5 = [α, β, γ]^T; then the relation P A_5 = p_5 a_5 yields the three equations:

p_1 + p_4 = p_5 α,   p_2 + p_4 = p_5 β,   p_3 + p_4 = p_5 γ

We now define μ = p_5 and ν = p_4; matrix P can be written as a very simple function of
the two unknown parameters μ and ν:

P = μ X + ν Y   (2)

where

X = [ α  0  0  0
      0  β  0  0
      0  0  γ  0 ]   (3)

Y = [ -1   0   0  1
       0  -1   0  1
       0   0  -1  1 ]   (4)
A similar expression holds for P', which is a function of the two unknown parameters μ'
and ν':

P' = μ' X' + ν' Y

where X' is obtained from a'_5 = [α', β', γ']^T.
2.2 Optical centers and epipoles

Equation (2) shows that each perspective matrix depends upon two projective parameters,
i.e. on one parameter. Through the choice of the five points A_i, i = 1, ..., 5 as the standard
coordinate system, we have reduced our stereo system to be a function of only two
arbitrary parameters. What have we lost? Well, suppose we have another match (m, m');
it means that we can compute the coordinates of the corresponding three-dimensional
point M as a function of two arbitrary parameters in the projective coordinate system
defined by the five points A_i, i = 1, ..., 5. Our three-dimensional reconstruction is thus
defined up to the (unknown) projective transformation from the absolute coordinate
system to the five points A_i and up to the two unknown parameters, which we can choose
as the ratios x = μ/ν and x' = μ'/ν'. We will show in a moment how to eliminate the
dependency upon x and x' by using a few more point matches.
The optical center C of the first camera satisfies P C = 0, which gives

C = [ ν/(ν − αμ), ν/(ν − βμ), ν/(ν − γμ), 1 ]^T,

a set of remarkably simple expressions. Note that the coordinates of C depend only upon
the ratio x:

C = [ 1/(1 − αx), 1/(1 − βx), 1/(1 − γx), 1 ]^T

Identical expressions are obtained for the coordinates of C' by adding ':

C' = [ 1/(1 − α'x'), 1/(1 − β'x'), 1/(1 − γ'x'), 1 ]^T
If we now use the relation P̃′ C = o′ to define the epipole o′ in the second image, we
immediately obtain its coordinates:

o′ = [ (x′α′ − xα)/(1 − xα), (x′β′ − xβ)/(1 − xβ), (x′γ′ − xγ)/(1 − xγ) ]^T   (5)
We note that they depend only on the ratios x and x′. We have similar expressions for
the epipole o defined by P̃ C′ = o:

o = [ (xα − x′α′)/(1 − x′α′), (xβ − x′β′)/(1 − x′β′), (xγ − x′γ′)/(1 − x′γ′) ]^T   (6)
Constraints on the coordinates of the epipoles. The coordinates of the epipoles are
not arbitrary because of the epipolar transformation. This transformation is well-known
in stereo and motion [3]. It says that the two pencils of epipolar lines are related by a
collineation, i.e. a linear transformation between projective spaces (here two projective
lines). It implies that we have the equalities of two cross-ratios; the two corresponding
relations are derived in appendix A.
3.1 Complete determination of P̃ and P̃′
Assume for a moment that we know the epipoles o and o′ in the two images (we show in
section 3.3 how to estimate their coordinates). This allows us to determine the unknown
parameters as follows. Let, for example, U′, V′ and W′ be the projective coordinates of
o′. According to equation (5), and after some simple algebraic manipulations, we obtain
the solutions:

x = 0, x′ = 0;   x = 1/γ, x′ = 1/γ′;

and the pair (x, x′) defined by the ratio

x′/x = o′ · (a5 ∧ a4) / o′5 · (a5 ∧ a4)

where

o′5 = [αU′, βV′, γW′]^T
One of these points has to be a double point where the two conics are tangent. Since it
is only the last pair (x, x′) which is a function of the epipolar geometry, it is in general
the only solution.
Note that since the equations of the two epipoles are related by the two equations
described in appendix A, they provide only two independent equations rather than four.
The perspective matrixes P̃ and P̃′ are therefore uniquely defined. For each match
(m, m′) between two image points, we can then reconstruct the corresponding three-
dimensional point M in the projective coordinate system defined by the five points Ai.
Remember that those five points are unknown. Thus our reconstruction can be considered
as relative to those five points and depending upon an arbitrary perspective transforma-
tion of the projective space P^3. All this is completely independent of the intrinsic and
extrinsic parameters of the cameras.
We have obtained a remarkably simple result:

In the case where at least eight point correspondences have been obtained between two
images of an uncalibrated stereo rig, if we arbitrarily choose five of those correspondences
and consider that they are the images of five points in general position (i.e. no four of
them are coplanar), then it is possible to reconstruct the other three points and any other
point arising from a correspondence between the two images in the projective coordinate
system defined by the five points. This reconstruction is uniquely defined up to an unknown
projective transformation of the environment.
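In practice, the projective point M associated with a match (m, m′) can be computed by a standard linear (DLT-style) triangulation once P̃ and P̃′ are known. The sketch below is a generic computation of this kind, not the paper's exact (λ, μ) parametrization; all names and the synthetic data are illustrative.

```python
import numpy as np

def triangulate(P1, P2, m1, m2):
    """Linear triangulation: find the homogeneous point M with
    P1 @ M ~ m1 and P2 @ M ~ m2 (up to scale), as the null vector
    of a 4x4 system built from the two matches."""
    A = np.vstack([
        m1[0] * P1[2] - P1[0],
        m1[1] * P1[2] - P1[1],
        m2[0] * P2[2] - P2[0],
        m2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    return Vt[-1]                    # null vector of A = coords of M

# quick check with two synthetic projective cameras
rng = np.random.default_rng(0)
P1 = rng.standard_normal((3, 4))
P2 = rng.standard_normal((3, 4))
M = np.array([0.3, -0.2, 1.5, 1.0])
m1 = P1 @ M; m1 /= m1[2]
m2 = P2 @ M; m2 /= m2[2]
M_hat = triangulate(P1, P2, m1, m2)
M_hat /= M_hat[3]
assert np.allclose(M_hat, M, atol=1e-6)
```

Since the cameras here are arbitrary 3 × 4 matrices, the recovered point is, as in the text, only meaningful up to the projective frame in which P̃ and P̃′ are expressed.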
where the scalars λ and μ are determined by the equation P̃′ M = m′, which says that
m′ is the image of M. Applying P̃′ to both sides of the previous equation, we obtain

P′ P^{−1} = (ν′/ν) diag( (α′x′ − 1)/(αx − 1), (β′x′ − 1)/(βx − 1), (γ′x′ − 1)/(γx − 1) )

Let us note a = (α′x′ − 1)/(αx − 1), b = (β′x′ − 1)/(βx − 1), and c = (γ′x′ − 1)/(γx − 1).
λ and μ are then found by solving the system of three linear equations in two unknowns
3.4 Choosing the five points Ai
As mentioned before, in order for this scheme to work, the three-dimensional points
that we choose to form the standard projective basis must be in general position. This
means that no four of them can be coplanar. The question therefore arises of whether we
can guarantee this only from their projections in the two retinas.
The answer is provided by the following observation. Assume that four of these points
are coplanar, for example A1, A2, A3, and A4 as in figure 1. Therefore, the diagonals of
the planar quadrilateral intersect at three points B1, B2, B3 in the same plane. Because
the perspective projections on the two retinas map lines onto lines, the images of these
diagonals are the diagonals of the quadrilaterals a1, a2, a3, a4 and a′1, a′2, a′3, a′4, which
intersect at b1, b2, b3 and b′1, b′2, b′3, respectively. If the four points Ai are coplanar, then
the points b′j, j = 1, 2, 3 lie on the epipolar lines of the points bj, simply because they
are the images of the points Bj. Since we know the epipolar geometry of the stereo rig,
this can be tested in the two images.
But this is only a necessary condition; what about the converse? Suppose then that b′1
lies on the epipolar line of b1. By construction, the line (C, b1) is a transversal to the
two lines (A1, A3) and (A2, A4): it intersects them in two points C1 and C2. Similarly,
(C′, b′1) intersects (A1, A3) and (A2, A4) in C′1 and C′2. Because b′1 lies on the epipolar
line of b1, the two lines (C, b1) and (C′, b′1) are coplanar (they lie in the same epipolar
plane). The discussion is on the four coplanar points C1, C2, C′1, C′2. Four cases occur:
1. C1 ≠ C′1 and C2 ≠ C′2 implies that (A1, A3) and (A2, A4) are in the epipolar
plane and therefore that the points a1, a2, a3, a4 and a′1, a′2, a′3, a′4 are aligned on
corresponding epipolar lines.
2. C1 = C′1 and C2 ≠ C′2 implies that (A1, A3) is in the epipolar plane and therefore
that the lines (a1, a3) and (a′1, a′3) are corresponding epipolar lines.
3. The case C1 ≠ C′1 and C2 = C′2 is similar to the previous one.
4. C1 = C′1 and C2 = C′2 implies that the two lines (A1, A3) and (A2, A4) are coplanar
and therefore so are the four points A1, A2, A3, A4 (in that case we have C1 = C′1 =
C2 = C′2 = B1).
In conclusion, except for the first three "degenerate cases", which can be easily detected,
the condition that b′1 lies on the epipolar line of b1 is necessary and sufficient for the four
points A1, A2, A3, A4 to be coplanar.
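The image-based test just described reduces to evaluating the epipolar constraint for the pair (b1, b′1). A minimal sketch, assuming normalized cameras P = (I | 0), P′ = (R | t) and the induced matrix F = [t]×R; these modeling choices are illustrative, not the paper's notation:

```python
import numpy as np

def skew(t):
    """3x3 antisymmetric matrix such that skew(t) @ u == np.cross(t, u)."""
    return np.array([[0, -t[2], t[1]], [t[2], 0, -t[0]], [-t[1], t[0], 0]])

def on_epipolar_line(F, b, b_prime, tol=1e-9):
    """b' lies on the epipolar line F @ b of b iff b'^T F b = 0
    (all points in projective coordinates)."""
    return abs(b_prime @ F @ b) < tol

# toy rig: first camera (I | 0), second (R | t); F = [t]x R
theta = 0.25
R = np.array([[np.cos(theta), -np.sin(theta), 0],
              [np.sin(theta),  np.cos(theta), 0],
              [0, 0, 1]])
t = np.array([0.4, -0.1, 0.05])
F = skew(t) @ R

B = np.array([0.2, 0.7, 3.0])      # any 3D point, e.g. a diagonal intersection
b, b_prime = B, R @ B + t          # its two images (projective coordinates)
assert on_epipolar_line(F, b, b_prime)
```

Running the test on the three pairs (bj, b′j) implements the coplanarity check of this section, modulo the degenerate cases listed above.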
The basic idea also works if instead of choosing five arbitrary points in space, we choose
only four, for example Ai, i = 1, …, 4. The transformation of space can now be chosen in
such a way that it preserves the plane at infinity: it is an affine transformation. Therefore,
in the case in which we choose four points instead of five as reference points, the local
reconstruction will be up to an affine transformation of the three-dimensional space.
Let us consider again equation (1), and change notations slightly to rewrite it as:

P̃ = | p  0  0  s |
    | 0  q  0  s |
    | 0  0  r  s |

where matrix

A = | U′  aX |
    | V′  bY |
    | W′  cZ |

is in general of rank 2. We then have
3.3 Determining the epipoles from point matches
The epipoles and the epipolar transformation between the two retinas can be easily
determined from the point matches as follows. For a given point m in the first retina, its
epipolar line o_m in the second retina is linearly related to its projective representation.
If we denote by F the 3 × 3 matrix describing the correspondence, we have:

o_m = F m

where o_m is the projective representation of the epipolar line. Since the corresponding
point m′ belongs to the line o_m by definition, we can write:

m′^T F m = 0   (8)
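Equation (8) is linear in the nine entries of F, so F can be estimated from eight or more matches by taking the singular vector of the design matrix associated with its smallest singular value. A minimal sketch (no coordinate normalization or rank-2 enforcement, both of which matter in practice; the synthetic rig is illustrative):

```python
import numpy as np

def estimate_F(pts1, pts2):
    """Each match (m, m') contributes one equation m'^T F m = 0, whose
    coefficients are the entries of the outer product m' m^T."""
    A = np.array([np.outer(m2, m1).ravel() for m1, m2 in zip(pts1, pts2)])
    _, _, Vt = np.linalg.svd(A)
    return Vt[-1].reshape(3, 3)      # null vector of A, reshaped (up to scale)

# synthetic check: project random 3D points with two cameras (I | 0), (R | t)
rng = np.random.default_rng(1)
R = np.linalg.qr(rng.standard_normal((3, 3)))[0]
if np.linalg.det(R) < 0:
    R = -R
t = np.array([1.0, 0.2, -0.3])

X = rng.standard_normal((10, 3)) + np.array([0, 0, 5.0])
pts1 = X                                # camera 1 images (projective coords)
pts2 = (R @ X.T).T + t                  # camera 2 images
F_est = estimate_F(pts1, pts2)
residuals = [m2 @ F_est @ m1 for m1, m2 in zip(pts1, pts2)]
assert np.max(np.abs(residuals)) < 1e-8
```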
Fig. 1. If the four points A1, A2, A3, A4 are coplanar, they form a planar quadrilateral whose
diagonals intersect at three points B1, B2, B3 in the same plane
with a similar expression with ′ for P̃′. Each perspective matrix now depends upon 4
projective parameters, or 3 parameters, making a total of 6. If we assume, as previously,
that we have been able to compute the coordinates of the two epipoles, then we can write
four equations among these 6 unknowns, leaving two. Here is how it goes.

It is very easy to show that the coordinates of the two optical centers are:

C = [ 1/p, 1/q, 1/r, −1/s ]^T,   C′ = [ 1/p′, 1/q′, 1/r′, −1/s′ ]^T

from which we obtain the coordinates of the two epipoles:

o′ = [ p′/p − s′/s, q′/q − s′/s, r′/r − s′/s ]^T,   o = [ p/p′ − s/s′, q/q′ − s/s′, r/r′ − s/s′ ]^T
Let us note

x1 = p′/p,   x2 = q′/q,   x3 = r′/r,   x4 = s′/s

We thus have for the second epipole:

(x1 − x4)/(x3 − x4) = U′/W′,   (x2 − x4)/(x3 − x4) = V′/W′   (10)
Replacing these values for x1 and x2 in equations (10), we obtain a system of two linear
equations in the two unknowns x3 and x4:

x4 (W′ − U′) = x3 U′(W − U)/U,   x4 (W′ − V′) = x3 V′(W − V)/V

We can therefore express matrixes P̃ and P̃′ as very simple functions of the four projective
parameters p, q, r, s:

P̃′ = | (U′W/(UW′)) p   0               0   (U′(W − U)/(U(W′ − U′))) s |
     | 0               (V′W/(VW′)) q   0   (V′(W − V)/(V(W′ − V′))) s |
     | 0               0               r   s                          |
There is a detail that changes the form of the matrixes P̃ and P̃′, which is the following.
We considered the four points Ai, i = 1, …, 4 as forming an affine basis of the space.
Therefore, if we want to consider that the last coordinates of points determine the plane
at infinity, we should take the coordinates of those points to have a 1 as the last coordinate
instead of a 0. It can be seen that this is the same as multiplying matrixes P̃ and P̃′ on
the right by the matrix

Q = | 1  0  0  0 |
    | 0  1  0  0 |
    | 0  0  1  0 |
    | 1  1  1  1 |
Similarly, the vectors representing the points of P^3 must be multiplied by Q^{−1}. For
example

C = Q^{−1} [ 1/p, 1/q, 1/r, −1/s ]^T = [ 1/p, 1/q, 1/r, −(1/p + 1/q + 1/r + 1/s) ]^T
We have thus obtained another remarkably simple result:

In the case where at least eight point correspondences have been obtained between two
images of an uncalibrated stereo rig, if we arbitrarily choose four of these correspondences
and consider that they are the images of four points in general position (i.e. not coplanar),
then it is possible to reconstruct the other four points and any other point arising from a
correspondence between the two images in the affine coordinate system defined by the four
points. This reconstruction is uniquely defined up to an unknown affine transformation
of the environment.
The main difference with the previous case is that instead of having a unique de-
termination of the two perspective projection matrixes P̃ and P̃′, we have a family of
such matrixes parameterized by the point of P^3 of projective coordinates p, q, r, s. Some
simple parameter counting will explain why. The stereo rig depends upon 22 = 2 × 11
parameters, 11 for each perspective projection matrix. The reconstruction is defined up
to an affine transformation, that is 12 = 9 + 3 parameters; the knowledge of the two
epipoles and the epipolar transformation represents 7 = 2 + 2 + 3 parameters. Therefore
we are left with 22 − 12 − 7 = 3 loose parameters, which are p, q, r, s (defined up to scale).
is in general of rank 2. The projective coordinates of the reconstructed point M are then:

M = Q^{−1} ( μ [ 1/p, 1/q, 1/r, −1/s ]^T + λ [ X, Y, Z, 0 ]^T )
4.2 Choosing the parameters p, q, r, s
The parameters p, q, r, s can be chosen arbitrarily. Suppose we reconstruct the same
scene with two different sets of parameters p1, q1, r1, s1 and p2, q2, r2, s2. Then the
relationship between the coordinates of a point M1 and a point M2 reconstructed with
those two sets from the same image correspondence (m, m′) is very simple in projective
coordinates:

M2 = Q^{−1} diag( p2/p1, q2/q1, r2/r1, s2/s1 ) Q M1

It should therefore come as no surprise that they are not related by an affine transformation;
but it is clearly the case that the above transformation preserves the plane at infinity if
and only if

p2/p1 = q2/q1 = r2/r1 = s2/s1
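This criterion is easy to verify numerically. A small sketch, assuming Q is the 4 × 4 basis-change matrix of the previous section (taken here to be the matrix whose columns are the four affine basis points; the helper name is illustrative): the transformation Q^{-1} diag(p2/p1, q2/q1, r2/r1, s2/s1) Q has last row proportional to (0, 0, 0, 1), i.e. it fixes the plane at infinity, exactly when the four ratios agree.

```python
import numpy as np

# assumed form of Q: columns are the affine basis points A1..A4
Q = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0],
              [0, 0, 1, 0],
              [1, 1, 1, 1]], dtype=float)

def change_of_params(p1, p2):
    """Transformation relating two reconstructions of the same scene
    made with parameter sets p1 and p2 (each a 4-vector (p, q, r, s))."""
    D = np.diag(np.asarray(p2, float) / np.asarray(p1, float))
    return np.linalg.inv(Q) @ D @ Q

H = change_of_params([1, 2, 3, 4], [2, 4, 6, 8])    # equal ratios
assert np.allclose(H[3, :3], 0)        # plane at infinity is preserved

H2 = change_of_params([1, 2, 3, 4], [1, 1, 1, 1])   # unequal ratios
assert not np.allclose(H2[3, :3], 0)   # not an affine transformation
```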
If we have more information about the stereo rig, for example if we know that the two
optical axes are coplanar, or parallel, then we can reduce the number of free parameters.
We have not yet explored experimentally the influence of this choice of parameters on
the reconstructed scene and plan to do so in the future.
6 Experimental results

This theory has been implemented in Maple and C code. We show the results on the
calibration pattern of figure 2. We have been using this pattern over the years to calibrate
our stereo rigs, and it is fair enough to use it to demonstrate that we will not need it
anymore in the forthcoming years.

The pattern is made of two perpendicular planes on which we have painted with
great care black and white squares. The two planes define a natural euclidean coordinate
frame in which we know quite accurately the coordinates of the vertexes of the squares.
The images of these squares are processed to extract the images of these vertexes, whose
pixel coordinates are then also known accurately. The three sets of coordinates, one set
in three dimensions and two sets in two dimensions, one for each image of the stereo
rig, are then used to estimate the perspective matrixes P1 and P2 from which we can
compute the intrinsic parameters of each camera as well as the relative displacement of
each of them with respect to the euclidean coordinate system defined by the calibration
pattern.
We have used as input to our program the pixel coordinates of the vertexes of the
images of the squares as well as the pairs of corresponding points⁴. From these we can
estimate the epipolar geometry and perform the kind of local reconstruction which has

2 In fact there are two, which are obtained as follows: given the family of all lines, if we impose
that a line intersects a given line, this is one condition; therefore there is in general a finite
number of lines which intersect four given lines. This number is in general two, and the two
invariants are the cross-ratios of the two sets of four points of intersection.
3 This is true only, according to section 4.2, if the two reconstructions have been performed
using the same parameters p, q, r and s.
4 In practice, these matches are obtained automatically by a program developed by Régis Vaillant
which uses some a priori knowledge about the calibration pattern.
been described in this paper. Since it is hard to visualize things in a projective space, we
have corrected our reconstruction before displaying it in the following manner.

We have chosen A1, A2, A3 in the first of the two planes, A4, A5 in the second, and
checked that no four of them were coplanar. We then have reconstructed all the vertexes
in the projective frame defined by the five points Ai, i = 1, …, 5. We know that this
reconstruction is related to the real calibration pattern by the projective transformation
that transforms the five points (as defined by their known projective coordinates in the
euclidean coordinate system defined by the pattern; just add a 1 as the last coordinate)
into the standard projective basis. Since in this case this transformation is known to
us by construction, we can use it to test the validity of our projective reconstruction
and in particular its sensitivity to noise. In order to do this we simply apply the inverse
transformation to all our reconstructed points, obtaining their "corrected" coordinates in
euclidean space. We can then visualize them using standard display tools and in particular
look at them from various viewpoints to check their geometry. This is shown in figure 3,
where it can be seen that the quality of the reconstruction is quite good.
7 Conclusion

This paper opens the door to quite exciting research. The results we have presented in-
dicate that computer vision may have been slightly overdoing it in trying at all costs to
obtain metric information from images. Indeed, our past experience with the computa-
tion of such information has shown us that it is difficult to obtain, requiring awkward
calibration procedures and special-purpose patterns which are difficult if not impossible
to use in natural environments with active vision systems. In fact it is not often the case
that accurate metric information is necessary for robotics applications, for example, where
relative information is usually all that is needed.
In order to make this local reconstruction theory practical, we need to investigate in
more detail how the epipolar geometry can be automatically recovered from the environ-
ment and how sensitive the results are to errors in this estimation. We have started doing
Fig. 3. Several rotated views of the "corrected" reconstructed points (see text)
this and some results are reported in a companion paper [2]. We also need to investigate
the sensitivity to errors of the affine and projective invariants which are necessary in
order to establish correspondences between local reconstructions obtained from various
viewpoints.
Acknowledgements:
I want to thank Théo Papadopoulo and Luc Robert for their thoughtful comments on
an early version of this paper as well as for trying to keep me up to date on their latest
software packages, without which I would never have been able to finish this paper on
time.
A Computing some cross-ratios
l3 = U l1 + V l2,   l4 = (W − U) l1 + (W − V) l2

The cross-ratio of the four lines is equal to the cross-ratio of the four "points" l1, l2, l3, l4:

{(o, a1), (o, a2), (o, a3), (o, a4)} = {0, ∞, V/U, (W − V)/(W − U)} = V(W − U)/(U(W − V))
Therefore, the projective coordinates of the two epipoles satisfy the first relation:

V(W − U)/(U(W − V)) = V′(W′ − U′)/(U′(W′ − V′))   (12)
In order to compute the second pair of cross-ratios, we have to introduce the fifth line
(o, a5), compute its projective representation l5 = o ∧ a5, and express it as a linear
combination of l1 and l2. It comes that:

l5 = (Uγ − Wα) l1 + (Vγ − Wβ) l2

From which it follows that the second cross-ratio is equal to:

{(o, a1), (o, a2), (o, a3), (o, a5)} = {0, ∞, V/U, (Vγ − Wβ)/(Uγ − Wα)} = V(Uγ − Wα)/(U(Vγ − Wβ))

Therefore, the projective coordinates of the two epipoles satisfy the second relation:

V(Uγ − Wα)/(U(Vγ − Wβ)) = V′(U′γ′ − W′α′)/(U′(V′γ′ − W′β′))   (13)
We relate here the essential matrix F to the two perspective projection matrixes P̃ and
P̃′. Denoting as in the main text by P and P′ the 3 × 3 left sub-matrixes of P̃ and P̃′,
and by p and p′ the last 3 × 1 column vectors of these matrixes, we write them as:

P̃ = [P p],   P̃′ = [P′ p′]
Knowing this, we can write:

C = | P^{−1} p |        M∞ = | P^{−1} m |
    | −1       |   ,         | 0        |

and we obtain the coordinates of the epipole o′ and of the image m′∞ of M∞ in the
second retina:

o′ = P̃′ C = P′ P^{−1} p − p′,   m′∞ = P′ P^{−1} m

The two points o′ and m′∞ define the epipolar line o_m of m; therefore the projective
representation of o_m is the cross-product of the projective representations of o′ and m′∞:

o_m = o′ ∧ m′∞ = õ′ m′∞

where we use the notation õ′ to denote the 3 × 3 antisymmetric matrix representing the
cross-product with the vector o′.
From what we have seen before, we write:

P′ P^{−1} = (ν′/ν) diag( (α′x′ − 1)/(αx − 1), (β′x′ − 1)/(βx − 1), (γ′x′ − 1)/(γx − 1) )

Thus:

P′ P^{−1} p − p′ = ν′ [ (α′x′ − αx)/(αx − 1), (β′x′ − βx)/(βx − 1), (γ′x′ − γx)/(γx − 1) ]^T
õ′ = | 0                       −(γ′x′ − γx)/(γx − 1)    (β′x′ − βx)/(βx − 1)  |
     | (γ′x′ − γx)/(γx − 1)    0                        −(α′x′ − αx)/(αx − 1) |
     | −(β′x′ − βx)/(βx − 1)   (α′x′ − αx)/(αx − 1)     0                     |

and finally:

F = | 0                                      −((β′x′−1)(γ′x′−γx))/((βx−1)(γx−1))    ((γ′x′−1)(β′x′−βx))/((γx−1)(βx−1))  |
    | ((α′x′−1)(γ′x′−γx))/((αx−1)(γx−1))    0                                       −((γ′x′−1)(α′x′−αx))/((γx−1)(αx−1)) |
    | −((α′x′−1)(β′x′−βx))/((αx−1)(βx−1))   ((β′x′−1)(α′x′−αx))/((βx−1)(αx−1))      0                                   |
References

This article was processed using the LaTeX macro package with ECCV92 style
Estimation of Relative Camera Positions for
Uncalibrated Cameras
Richard I. Hartley
G.E. CRD, Schenectady, NY, 12301.
1 Introduction
A non-iterative algorithm to solve the problem of relative camera placement was given
by Longuet-Higgins ([4]). However, Longuet-Higgins's solution made assumptions about
the camera that may not be justified in practice. In particular, it is assumed implicitly
in his paper that the focal length of each camera is known, as is the principal point (the
point where the focal axis of the camera intersects the image plane). Whereas it is often
a safe assumption that the principal point of an image is at the center pixel, the focal
length of the camera is not easily deduced, and will generally be unknown for images
of unknown origin. In this paper a non-iterative algorithm is given for finding the focal
lengths of the two cameras along with their relative placement, as long as other internal
parameters of the cameras are known. It follows from the derivation of the algorithm,
as well as from counting degrees of freedom that this is all the information that may be
deduced about camera parameters from a set of image correspondences.
In this paper, the term magnification will be used instead of focal length, since it
includes the equivalent effect of image enlargement.
and

(u′, v′, w′)^T = R ( (x, y, z)^T − (tx, ty, tz)^T )   (2)

where R is a rotation matrix, the vectors (u, v, w)^T and (u′, v′, w′)^T are the homogeneous
coordinates of the image points, and (x, y, z)^T and (tx, ty, tz)^T are non-homogeneous
object space coordinates. Writing T = (tx, ty, tz)^T, and using homogeneous coordinates
in both object and image space, the above relations may be written in matrix form as

(u, v, w)^T = (I | 0)(x, y, z, 1)^T = P1 (x, y, z, 1)^T   (3)

and

(u′, v′, w′)^T = (R | −RT)(x, y, z, 1)^T = P2 (x, y, z, 1)^T   (4)

where (I | 0) and (R | −RT) are 3 × 4 matrices divided into a 3 × 3 block and a 3 × 1
column and I is the identity matrix.
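The two camera matrices can be written down directly from R and T. A small sketch (the rotation and translation values are arbitrary illustrations):

```python
import numpy as np

def rot_z(theta):
    """Rotation by theta about the z axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])

R = rot_z(0.3)
T = np.array([1.0, 0.5, -2.0])
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])     # (I | 0), equation (3)
P2 = np.hstack([R, (-R @ T)[:, None]])            # (R | -RT), equation (4)

X = np.array([0.2, -1.0, 4.0, 1.0])               # homogeneous scene point
u = P1 @ X                                        # image in camera 1
u_prime = P2 @ X                                  # image in camera 2
assert np.allclose(u_prime, R @ (X[:3] - T))      # agrees with equation (2)
```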
Now, I will define a transformation between the 2-dimensional projective plane of
image coordinates in image 1 and the pencil of epipolar lines in the second image. As
is well known, given a point (u, v, w)^T in image 1, the corresponding point in image 2
must lie on a certain epipolar line, which is the image under P2 of the set S of all points
(x, y, z, 1)^T which map under P1 to (u, v, w)^T. To determine this line one may identify two
points in S, namely the camera origin (0, 0, 0, 1)^T and the point at infinity, (u, v, w, 0)^T.
The images of these two points under P2 are −RT and R(u, v, w)^T respectively, and the
line that passes through these two points is given in homogeneous coordinates by the
cross product,

(p, q, r)^T = (−RT) × R(u, v, w)^T   (5)
Here (p, q, r)^T represents the line pu′ + qv′ + rw′ = 0. Representing by S = S_T the matrix

S = |  0   −tz   ty |
    |  tz   0   −tx |   (6)
    | −ty   tx   0  |

we may write

(p, q, r)^T = R S (u, v, w)^T   (7)

Since the point (u′, v′, w′)^T corresponding to (u, v, w)^T must lie on the epipolar line, we
have the important relation

(u′, v′, w′) R S (u, v, w)^T = 0   (8)
A proof is contained in [2]. This theorem allows us to give an easy method of factoring
any matrix into a product RS, when possible.
where

E = |  0  1  0 |        Z = |  0  1  0 |
    | −1  0  0 |   ,        | −1  0  0 |   (9)
    |  0  0  1 |            |  0  0  0 |
Proof. That the given factorization is valid is true by inspection. That these are the only
solutions is implicit in the paper of Longuet-Higgins ([4]). □
2.2 Numerical Considerations
In any practical application, the matrix Q found will not factor exactly in the required
manner because of inaccuracies of measurement. In this case, the requirement will be to
find the matrix closest to Q that does factor into a product RS. Using the sum of squares
of matrix entries as a norm (Frobenius norm [1]), we wish to find the matrix Q′ = RS
such that ||Q − Q′|| is minimized. The following theorem shows that the factorization
given in the previous theorem is numerically optimal.
The algorithm for computing relative camera locations for calibrated cameras is as fol-
lows.
( UEV^T | U(0, 0, 1)^T )
( UEV^T | −U(0, 0, 1)^T )
( UE^TV^T | U(0, 0, 1)^T )
( UE^TV^T | −U(0, 0, 1)^T )
The choice between the four transformations for P2 is determined by the requirement
that the point locations (which may be computed once the cameras are known [4]) must
lie in front of both cameras. Geometrically, the camera rotations represented by UEV^T
and UE^TV^T differ from each other by a rotation through 180 degrees about the line
joining the two cameras. Given this fact, it may be verified geometrically that a single
pixel-to-pixel correspondence is enough to eliminate all but one of the four alternative
camera placements.
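The enumeration of the four placements from the SVD can be sketched as follows (a reconstruction assuming the matrix E of equation (9) as read here; sign conventions may differ from the author's, and the test rig is illustrative):

```python
import numpy as np

E = np.array([[0., 1, 0], [-1, 0, 0], [0, 0, 1]])

def skew(t):
    return np.array([[0, -t[2], t[1]], [t[2], 0, -t[0]], [-t[1], t[0], 0]])

def four_placements(Q):
    """The four candidate (rotation, translation) pairs for P2 from
    Q ~ U diag(r, s, 0) V^T: rotations U E V^T and U E^T V^T, and
    translation column +/- U (0, 0, 1)^T."""
    U, _, Vt = np.linalg.svd(Q)
    if np.linalg.det(U) < 0:
        U = -U
    if np.linalg.det(Vt) < 0:
        Vt = -Vt
    u3 = U @ np.array([0., 0, 1])
    return [(U @ E @ Vt, u3), (U @ E @ Vt, -u3),
            (U @ E.T @ Vt, u3), (U @ E.T @ Vt, -u3)]

# check on a synthetic rig: Q = R S, with S the skew matrix of T as in (6)
theta = 0.4
R_true = np.array([[np.cos(theta), -np.sin(theta), 0],
                   [np.sin(theta),  np.cos(theta), 0],
                   [0, 0, 1]])
T = np.array([0.5, -1.0, 2.0])
Q = R_true @ skew(T)
candidates = four_placements(Q)
assert any(np.allclose(Rc, R_true, atol=1e-8) for Rc, _ in candidates)
for Rc, _ in candidates:
    assert np.allclose(Rc @ Rc.T, np.eye(3), atol=1e-8)   # all are rotations
```

The cheirality test of the text (points in front of both cameras) would then select a single pair from the four.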
3 Uncalibrated Cameras
If the internal camera calibration is not known, then the problem of finding the camera
parameters is more difficult. In general one would like to allow arbitrary non-singular
matrices K describing internal camera calibration and consider camera matrices of the
general form (KR | −KRT), that is, general 3 × 4 matrices. Because K is multiplied by a
rotation, R, it may be assumed that K is upper triangular. Allowing for an arbitrary scale
factor, there are 5 remaining independent entries in K representing camera parameters.
Other authors ([6]) have allowed four internal camera parameters, namely principal point
offsets in two directions and different scale factors in two directions. If however different
scaling is allowed in two directions not necessarily aligned with the direction of the
image-space axes, then one more parameter is needed, making up the 5.
It is too much to hope that from a set of image point correspondences one could
retrieve the full set of internal camera parameters for a pair of cameras as well as the
relative external positioning of the cameras. Indeed if {xi} are a set of points visible
in a pair of cameras with transform matrices P1 and P2, and G is an arbitrary non-singular
4 × 4 matrix, then replacing each xi by G^{−1} xi and each camera Pj with Pj G
preserves the object-point to image-space correspondences.
parameters of one of the cameras, P1 say, may be chosen arbitrarily. The situation is not
helped by adding more cameras. This is in contrast to the case of calibrated cameras
in which a finite number of solutions are possible ([2]). The question remains, therefore,
how much can be deduced about the internal camera parameters from a set of image
correspondences.
For uncalibrated cameras, a matrix Q can be defined, analogous to the matrix defined
for calibrated cameras, and this matrix may be computed given matched point pairs,
according to (8). It may be observed that however many pairs of matched points are
given, as far as determining camera models is concerned, the matrix Q encapsulates all
the information available, except as to which points lie behind or in front of the cameras.
As remarked above, the choice of the four possible relative camera placements may be
determined using just one matched point pair - the rest may be thrown away once Q
has been computed. To justify this observation it may be verified that a pair of matching
points (u, v, w)^T and (u′, v′, w′)^T correspond to a possible placement of an object point
if and only if (u′, v′, w′) Q (u, v, w)^T = 0. This means that the addition of match points
beyond 8 does not add any further information except numerical stability. Now, Q has
only 7 degrees of freedom consisting of 9 matrix entries, less one for arbitrary scale
and one for the condition that det(Q) = 0. (Theorem 1 does not hold for uncalibrated
cameras.) Therefore, the total number of camera parameters that may be extracted from
a set of image-point correspondences does not exceed 7. As shown by Longuet-Higgins,
the relative camera placements account for 5 of these (not 6, since scale is indeterminate),
and this paper accounts for two more, the camera magnification factors. It is not possible
to extract any further information from Q, or hence from a set of matched points.
3.1 Form of the Q-matrix
Let K1 and K2 be two matrices representing the internal camera transformations of the
two cameras and let P1 = (K1 | 0) and P2 = (K2R | −K2RT) be the two camera trans-
forms. The task is to obtain R, T, K1 and K2 given a set of image-point correspondences.
For the present, the matrices K1 and K2 will be assumed arbitrary.
As before, it is possible to determine the epipolar line corresponding to a point
(u, v, w)^T in image 1. The two points that must lie on the epipolar line are the im-
ages under P2 of the camera centre (0, 0, 0, 1)^T of the first camera and the point at
infinity (K1^{−1}(u, v, w)^T, 0)^T. Transform P2 takes these two points to the points −K2RT
and K2RK1^{−1}(u, v, w)^T. The line through these points is given by the cross product

K2RT × K2RK1^{−1}(u, v, w)^T   (10)

Now, writing S = S_{K1T} as defined in (6), we have a formula for the epipolar line corre-
sponding to the point (u, v, w)^T in image 1:

(p, q, r)^T ≈ K2^{−T} R K1^T S (u, v, w)^T   (12)

so that

(u′, v′, w′) Q (u, v, w)^T = 0 .   (13)

An alternative factorization for Q that may be derived from (10) and Lemma 4 is

Q ≈ (K2^{−1})^T R S_T K1^{−1}   (14)
3.2 Factorization of Q
Our goal, given Q, is to find the factorization Q ≈ K2^{−T} R K1^T S. As before, we use the
Singular Value Decomposition, Q = U D W^T. By multiplying by −1 if necessary, U and W
may be chosen such that det(U) = det(W) = +1, so that U^* = U and W^* = W. Since Q is
singular, the diagonal matrix D equals diag(r, s, 0), where r and s are positive constants.
Since Q W(0, 0, 1)^T = 0, it follows that S W(0, 0, 1)^T = 0, since K2^{−T}RK1^T is non-singular,
and so S ≈ W Z W^T, where Z is given in (9). The general solution to the problem of
factoring Q into a product R′S′, where R′ is non-singular and S′ is skew-symmetric, is
therefore given by

Q = (U X_{α,β,γ} E^T W^T) · (W Z W^T)   (15)

where X_{α,β,γ} is given by

X_{α,β,γ} = | r  0  α |
            | 0  s  β |   (16)
            | 0  0  γ |

and α, β and γ are arbitrary constants. The two bracketed expressions are R′ and S′
respectively, and the factorization is unique (except for the variables α, β and γ) up to
scale. In contrast to the situation in Section 2.1 we do not need to consider the alternate
solution in which E^T is replaced by E, since that is taken care of by the undetermined
values α, β and γ. Since both E and W are orthogonal matrices, we write V = W E, and
V is also orthogonal.
Now, we turn our attention to the matrix R′ = U X_{α,β,γ} V^T. For some values of α, β
and γ, it must be true that R′ ≈ K2^* R K1^{*−1}, where R is a rotation matrix. From this it
follows that R ≈ K2^{*−1} R′ K1^*. We now apply the property that a rotation matrix is equal
to its cofactor matrix (inverse transpose). This means that K2^{*−1} R′ K1^* ≈ K2^{−1} R′^* K1,
where the cofactor matrix of X is

X^*_{α,β,γ} = | sγ    0     0  |
              | 0     rγ    0  |   (18)
              | −sα   −rβ   rs |
At this point, it is necessary to specialize to the case where K1 and K2 are of the
simple form K1 = diag(1, 1, k1) and K2 = diag(1, 1, k2). In this case, k1 and k2 are the
inverses of the magnification factors. If the entries of U X_{α,β,γ} V^T are (f_ij) and those of
U X^*_{α,β,γ} V^T are (g_ij), then multiplying by (K2^* K2^T) and (K1 K1^{*T}) respectively gives an
equation

| f11  f12  f13 |       | g11       g12       k2² g13      |
| f21  f22  f23 |  = x  | g21       g22       k2² g23      |   (20)
| f31  f32  f33 |       | k1² g31   k1² g32   k1² k2² g33  |

where the f_ij and g_ij are linear expressions in α, β and γ, and x is an unknown scale factor.
The top left-hand block of (20) comprises a set of equations of the form

| f11  f12 |       | g11  g12 |
| f21  f22 |  = x  | g21  g22 |   (21)
If the scale factor were known, then this system could be solved for α, β and γ as a set
of linear equations. Unfortunately, x is not known, and it is necessary to find the value
of x before solving the set of linear equations. Since the entries of the matrices on both
sides of (21) are linear expressions in α, β and γ, it is possible to rewrite (21) in the form

M1 (α, β, γ, 1)^T − x M2 (α, β, γ, 1)^T = 0 ,   (22)

where M1 and M2 are 4 × 4 matrices, each row of M1 or M2 corresponding to one of
the four entries in the matrices in (21). Such a set of equations has a solution only if
det(M1 − x M2) = 0. This leads to a polynomial equation of degree 4 in x: p(x) =
det(M1 − x M2) = 0. It will be seen later that this polynomial reduces to a quadratic.
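The roots of det(M1 − x M2) = 0 need not be found by expanding the determinant symbolically: they are exactly the generalized eigenvalues of the matrix pencil formed by the two 4 × 4 matrices. A sketch, assuming the second matrix (called M2 here) is invertible; a generalized-eigenvalue routine such as scipy.linalg.eig(M1, M2) avoids that assumption:

```python
import numpy as np

def pencil_roots(M1, M2):
    """Solutions x of det(M1 - x * M2) = 0, computed as the
    eigenvalues of M2^{-1} M1 (valid when M2 is invertible)."""
    return np.linalg.eigvals(np.linalg.solve(M2, M1))

# sanity check on a pencil with known roots 2, 3, 4, 5
roots = np.sort(pencil_roots(np.diag([2.0, 3.0, 4.0, 5.0]), np.eye(4)))
assert np.allclose(roots, [2, 3, 4, 5])
```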
The form of the matrix M1 may be written out explicitly. Let X_{α,β,γ} be written in
the form α A13 + β A23 + γ A33 + (r A11 + s A22), where A_ij is the matrix having a one
in position i, j and zeros elsewhere. Then

U X_{α,β,γ} V^T = α U A13 V^T + β U A23 V^T + γ U A33 V^T + r U A11 V^T + s U A22 V^T .

It may be verified that the p,q-th entry of the matrix U A_ij V^T is equal to U_pi V_qj.
Now, suppose that the rows of M1 are ordered corresponding to the entries f11, f12, f21
and f22 of U X_{α,β,γ} V^T. Then

| f11 |   | U11 V13   U12 V13   U13 V13   r·U11 V11 + s·U12 V12 | | α |
| f12 | = | U11 V23   U12 V23   U13 V23   r·U11 V21 + s·U12 V22 | | β |   (23)
| f21 |   | U21 V13   U22 V13   U23 V13   r·U21 V11 + s·U22 V12 | | γ |
| f22 |   | U21 V23   U22 V23   U23 V23   r·U21 V21 + s·U22 V22 | | 1 |

and M1 is the matrix in this expression. The exact form of the matrix M2 may be
computed in a similar manner.
The apparent redundancy in the equations (25) is resolved by the following proposi-
tion.
Proposition 5.
1. If x is either of the roots of p(x), then the two expressions x g31/f31 and x g32/f32
   for k1² in (25.i) are equal. Similarly, the two expressions for k2² in (25.ii) are equal,
   and the relationship (25.iii) is always true.
2. The values k1² and k2² are either both positive or both negative.
3. The estimated value of k1² corresponding to the two opposite roots of p(x) is the
   same. The same holds for the two values of k2².
Proof of this proposition is beyond the scope of this paper. The case where k1² and k2² are
negative implies, as before, that no solution is possible. Once again, selecting a different
value for the principal points (origin of image-space coordinates) may lead to a solution.
At this point, it is possible to continue and compute the values of the rotation matrix
directly. However, it turns out to be more convenient, now that the values of the magnification
are known, to revert to the case of a calibrated camera. More particularly, we
observe that according to (14), Q may be written as Q = K2^-1 Q' K1^-1, where Q' = RS
and R is a rotation matrix. The original method of Section 2.3 may now be used to solve
for the camera matrices derived from Q'. In this way, we find camera models P1 = (I | 0)
and P2 = (R | −RT) for the two cameras corresponding to Q'. Taking account of the
magnification matrices K1 and K2, the final estimates of the camera matrices are (K1 | 0)
and (K2 R | −K2 R T).
In practice it has been observed that greater numerical accuracy is obtained by repeating
the computation of k1 and k2 after replacing Q by Q'. The values of k1 and k2
computed from Q' are very close to 1 and may be used to revise the computed magnifications
very slightly. However, such a revision only compensates for numerical
round-off error in the algorithm and is not strictly necessary.
3.3 Algorithm Outline
Although the mathematical derivation of this algorithm is at times complex, the implementation
is not particularly difficult. The steps of the algorithm are reiterated here.
1. Compute a matrix Q such that (u'i, v'i, 1) Q (ui, vi, 1)^T = 0 for each of several matched
   pairs (at least 8 in number) by a linear least-squares method.
2. Compute the Singular Value Decomposition Q = U D W^T with det(U) = det(W) =
   +1 and set r and s equal to the two largest singular values. Set V = W E.
3. Form the matrices M1 and M2 given by (23) and (24) and compute the determinant
   p(x) = det(M1 − x M2) = a1 x + a3 x³.
4. If −a1/a3 < 0 no solution is possible, so stop. Otherwise, let x = √(−a1/a3), one of
   the roots of p(x).
5. Solve the equation (M1 − x M2)(α, β, γ, 1)^T = 0 to find α, β and γ, and use these
   values to form the matrices X_{α,β,γ} and X*_{α,β,γ} given by (16) and (18).
6. Form the products U X_{α,β,γ} V^T and U X*_{α,β,γ} V^T and observe that the four top left
   elements of these matrices are the same.
7. Compute k1 and k2 from the equations (25), where (fij) and (gij) are the entries of
   the matrices U X_{α,β,γ} V^T and U X*_{α,β,γ} V^T respectively. If k1 and k2 are imaginary,
   then no solution is possible, so stop.
8. Compute the matrix Q' = K2 Q K1, where K1 and K2 are the matrices diag(1, 1, k1)
   and diag(1, 1, k2) respectively.
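Step 1 is the familiar linear formulation: each matched pair (u, v) and (u', v') contributes one equation that is linear in the nine entries of Q. As a minimal sketch (the function names are ours, not the paper's), the coefficient row for one pair is:

```python
def epipolar_row(u, v, up, vp):
    """Coefficients of (q11, ..., q33) in the equation (u', v', 1) Q (u, v, 1)^T = 0."""
    return [up * u, up * v, up, vp * u, vp * v, vp, u, v, 1.0]

def residual(row, q_entries):
    """Value of one such equation for a candidate Q stacked row by row."""
    return sum(r * x for r, x in zip(row, q_entries))
```

Stacking at least eight such rows and taking the least-squares null vector of the resulting system recovers Q up to scale.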
4 Practical Results
This algorithm has been encoded in C and tested on a variety of examples. In the first test,
a set of 25 matched points was computed synthetically, corresponding to an oblique placement
of two cameras with equal magnification values of 1003. The principal point offset
was assumed known. The solution to the relative camera placement problem was computed.
The two cameras were computed to have magnifications of 1003.52 and 1003.71,
very close to the original. Camera placements and point positions were computed and
were found to match the input pixel position data within the limits of accuracy. Similarly,
the positions in 3-space of the object points matched the known positions to within one
part in 10^4.
The algorithm was also tested on a set of matched points derived from a stereo-matching
program, STEREOSYS ([3]). A set of 124 matched points was found by an
unconstrained hierarchical search. The two images used were 1024 x 1024 aerial overhead
images of the Malibu region with about 40% overlap. The algorithm described here was
applied to the set of 124 matched points, and relative camera placements and object-point
positions were computed. The computed model was then evaluated against the original
data. Specifically, the computed camera models were applied to the computed 3-D
object points to give new pixel locations, which were then compared with the original
reference pixel data. The RMS pixel error was found to be 0.11 pixels. In other words,
the derived model matches the actual data with a standard deviation of 0.11 pixels. This
shows the accuracy not only of the derived camera model, but also of the
point-matching algorithm.
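The 0.11-pixel figure is an RMS reprojection error. As an illustrative sketch (the camera matrix, sample data and helper names below are ours, not the paper's code):

```python
import math

def project(P, X):
    """Apply a 3x4 camera matrix P to a 3-D point X; return the pixel (u, v)."""
    Xh = X + [1.0]                      # homogeneous coordinates
    x = [sum(P[i][j] * Xh[j] for j in range(4)) for i in range(3)]
    return x[0] / x[2], x[1] / x[2]

def rms_pixel_error(P, points3d, pixels):
    """Root-mean-square distance between reprojected and measured pixels."""
    se = 0.0
    for X, (u, v) in zip(points3d, pixels):
        up, vp = project(P, X)
        se += (up - u) ** 2 + (vp - v) ** 2
    return math.sqrt(se / len(points3d))
```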
References
1. Atkinson, K. E., An Introduction to Numerical Analysis, Second Edition, John Wiley and
   Sons, New York, 1989.
2. Faugeras, O. and Maybank, S., "Motion from Point Matches: Multiplicity of Solutions,"
   International Journal of Computer Vision, 4, (1990), 225-246.
3. Hannah, M. J., "Bootstrap Stereo," Proc. Image Understanding Workshop, College Park,
   MD, April 1980, 201-208.
4. Longuet-Higgins, H. C., "A computer algorithm for reconstructing a scene from two
   projections," Nature, Vol. 293, Sept. 10, 1981, 133-135.
5. Porrill, J. and Pollard, S., "Curve Fitting and Stereo Calibration," Proceedings of the British
   Machine Vision Conference, University of Oxford, Sept. 1990, pp. 37-42.
6. Strat, T. M., "Recovering the Camera Parameters from a Transformation Matrix," Readings
   in Computer Vision, pp. 93-100, Morgan Kaufmann Publishers, Inc., Los Altos, CA, 1987.
7. Wolfram, S., Mathematica: A System for Doing Mathematics by Computer, Addison-Wesley,
   Redwood City, California, 1988.
This article was processed using the LaTeX macro package with ECCV92 style
Gaze Control for a Binocular Camera Head
James L. CROWLEY, Philippe BOBET and Mouafak MESRABI
LIFIA (IMAG), 46 Ave Felix Viallet, 38031 Grenoble, France
Abstract. This paper describes a layered control system for a binocular stereo head.
It begins with a discussion of the principles of layered control and then describes the
mechanical device for a binocular camera head. A device level controller is presented
which permits an active vision system to command the position of the gaze point. The
final section describes experiments with reflexive control of focus, iris and vergence.
1. Introduction
During the last few years, there has been a growing interest in the use of active control of
image formation to simplify and accelerate scene understanding. Basic ideas which were
suggested by [Bajcsy 88] and [Aloimonos et al. 87] have been extended by several groups.
Examples include [Ballard 91] and [Eklundh 92]. Brown [Brown 90] has demonstrated how
multiple simple behaviours may be used for control of saccadic, vergence, vestibulo-ocular
reflex and neck motion.
This trend has grown from several observations. For example, Aloimonos and others observed
that vision cannot be performed in isolation. Vision should serve a purpose [Aloimonos 87],
and in particular should permit an agent to perceive its environment. This leads to a view of a
vision system which operates continuously and which must furnish results within a fixed
delay. Rather than obtaining a maximum of information from any one image, the camera is an
active sensor giving signals which provide only limited information about the scene. Bajcsy
[Bajcsy 88] observed that many traditional vision problems, such as stereo matching, could
be solved with low-complexity algorithms by using controlled sensor motion. Examples of
such processes were presented by Krotkov [Krotkov 90]. Ballard [Ballard 88] and Brown [Brown
90] demonstrated this principle for the case of stereo matching by restricting matching to a
short range of disparities close to zero, and then varying the camera vergence angles.
The development of binocular camera heads and an integrated vision system has opened a line
of cooperation between the scientific communities of biological vision, machine vision and
robotics. This paper is concerned with a robotics problem posed by such devices: How to
organize the control architecture. We will argue for a layered control architecture in which a
"gaze point" may be commanded by an external process or driven by simple measurements of
information from the scene.
Figure 2.1 A Layered Architecture for Control of a Binocular Head in the SAVA System.
The motor level is concerned with control parameters defined at the motor
shaft. Sensors (typically optical encoders) provide information in terms of position and angular
speed of the motor. Commands are generated in terms of motor position, speed and perhaps
acceleration. Typical control cycle times for robotics are on the order of 1 to 10 milliseconds.
The motor level is typically controlled by a form of PID controller.
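As a hedged sketch of the kind of controller run at this level (the gains and sampling period below are illustrative, not the head's actual tuning):

```python
class PID:
    """Discrete PID update, one call per motor control cycle."""
    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, setpoint, measured):
        """Return the motor command for this tick."""
        error = setpoint - measured
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative
```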
The device level is concerned with the geometric and dynamic state of the
entire device. Control cycles for robotics applications are typically on the order of 10 to 100
milliseconds. For device independence, it is useful to design a controller for an idealized "virtual
device". Our virtual head is based on controlling a "gaze point", defined as the intersection of
the optical axes. The mapping from the virtual head to a particular mechanical head is
performed by a translation layer between the device controller and the motor controllers. The
device level also permits any of the axes of the virtual head to be directly controlled.
The action level concerns procedural control of the device state based on
measurements taken from sensor signals. An action will drive the device through a sequence of
states. The action level for a binocular head involves control of head motion and optical
parameters. The action level often involves control cycles of 0.1 to 1.0 seconds.
A task level controller has a goal expressed in terms of a symbolic state. The
task level controller chooses actions to bring the device and the environment to the desired
state. The selection of actions is based on a symbolic description of the preconditions and
results of actions, as well as a description of the current state of the device and environment.
This leads us to propose a control cycle composed of three phases: Evaluation (of state),
Selection (of an action), and Execution (of an action).
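A toy rendering of that three-phase cycle (the state model and action table below are our invention, purely to fix ideas):

```python
def control_cycle(state, goal, actions):
    """Evaluation, Selection, Execution: one pass of the task-level loop.

    `actions` maps a precondition state to (action_name, resulting_state).
    """
    if state == goal:                      # Evaluation: goal already reached
        return None, state
    if state in actions:                   # Selection: pick an applicable action
        name, result = actions[state]
        return name, result                # Execution: device moves to new state
    return None, state                     # no applicable action found
```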
The vergence mechanism is mounted on the sixth axis of a robot manipulator. The two
cameras are mounted on small platforms that pivot about a point underneath the camera lens. A
precision adjustment screw permits the camera to be moved forward or backward so as to
position the optical center under the rotation point. The gearing on the motors provides
approximately 3 encoder counts per arc second (12.5 encoder counts per pixel), over a range of
20°. The sprocket gears have been mounted on the focus and aperture rings of 25 mm f1.8
C-mount lenses. The gearing on the rings and motors provides approximately 15 000 counts over
the full range of movement for the focus and approximately 12 000 counts for the aperture.
A six axis manipulator serves as a "neck" for this head. The head and neck are mounted on a
mobile robot, permitting experiments in vision guided motion. The neck is mounted at a
point which is midway between the power wheels of the vehicle. This point serves as the
origin for both the vehicle and arm coordinate systems. The neck permits us to command the
position and orientation of the camera in coordinates which are relative to the position of the
mobile robot. The mobile robot provides us with an estimate of its position and orientation in
an arbitrary world coordinate system.
Standard PID motor control software has been developed for the three head motor
microprocessors and burned into ROM. The software protocol for each of the three motor
controllers is the same, except that the maximal values for each motor controller depend on
the axis. The protocol for the motors permits initialization, incremental and absolute
movement, immediate stop, and interrogation of the current motor position (in encoder counts).
The protocol is written in such a way that a command may be issued at any time, and that a
new movement command will replace a current command.
The head controller has four components: Protocol Interpretation (1), State Estimation (2),
Command Generation (3) and Translation (4). These four components are illustrated in figure
4.1 and described in the following sections.
Figure 4.1 The four components of the head controller: Protocol Interpretation (1), State
Estimation (2), Command Generation (3), and Translation (4) from the virtual head to the
real head.
In order to accommodate any configuration of axes, a device table is defined. This device table
is built up from a dynamically allocated structure called an "Axis". Initialization of this
structure defines which axes are present, their units, the initial value for that axis, the
conversion factor from encoder counts, and their maximum and minimum values. Subsequent
access to that axis may either be based on the index of the entry in the table, or by association
with the axis name. The depth and angle to the gaze point are treated as axes.
Absolute and incremental moves change the reference for the axis control. At the end of each
cycle, the translator scans the list of axes and updates the current position. Whenever the
commanded position is different from the current position, the commanded value is transformed
to encoder counts and a move is issued to the motor controller. A command to an axis which is
not currently in the head table will trigger a negative acknowledgement.
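A sketch of such a device table (the field names and the pan axis below are illustrative; the conversion factor uses the 3 encoder counts per arc second quoted earlier, i.e. 10 800 counts per degree):

```python
class Axis:
    """One entry of the device table describing a (real or virtual) axis."""
    def __init__(self, name, units, initial, counts_per_unit, lo, hi):
        self.name = name
        self.units = units
        self.position = initial            # current value, in device units
        self.counts_per_unit = counts_per_unit
        self.lo, self.hi = lo, hi          # travel limits

    def to_counts(self, value):
        """Convert a commanded value into motor encoder counts."""
        return round(value * self.counts_per_unit)

# The table is accessible by index or by axis name; the depth and angle
# to the gaze point would be entered here as axes of the virtual head.
table = [Axis("pan", "deg", 0.0, 3 * 3600, -10.0, 10.0)]
by_name = {axis.name: axis for axis in table}
```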
Let us derive formulas for determining the gaze point within an "eye centered" 2D polar
coordinate system. This eye centered coordinate system has its origin mid-way between the
optical centers of a pair of stereo cameras. Let us define the X axis as coincident with the
baseline, and the Y axis as perpendicular to the baseline and in the plane defined by the optical
axes. Let the separation of the cameras be a distance 2B, so that the optical centers are
located at the points (B, 0) and (−B, 0). Furthermore, let the optical axes be located in the
(X, Y) plane with angles α_l and α_r.
Figure 4.2 The gaze point P is the intersection of the optical axes.
The equation of the left optical axis in the plane defined by the baseline and the optical axes is:
X Sin(α_l) − Y Cos(α_l) + B Sin(α_l) = 0.
The right optical axis is described by: X Sin(π − α_r) − Y Cos(π − α_r) − B Sin(π − α_r) = 0.
Since Sin(π − α_r) = Sin(α_r) while Cos(π − α_r) = −Cos(α_r), the right equation reduces to
X Sin(α_r) + Y Cos(α_r) − B Sin(α_r) = 0.
The position of the fixation point, defined by the intersection of the optical axes, can be
calculated by combining these two equations. This gives

X = B [Cos(α_l)Sin(α_r) − Sin(α_l)Cos(α_r)] / [Cos(α_l)Sin(α_r) + Sin(α_l)Cos(α_r)]
  = B Sin(α_r − α_l) / Sin(α_l + α_r)                                        (4.1)

Y = 2B Sin(α_l)Sin(α_r) / [Cos(α_l)Sin(α_r) + Sin(α_l)Cos(α_r)]
  = 2B Sin(α_l)Sin(α_r) / Sin(α_l + α_r)                                     (4.2)

For the symmetric case α_l = α_r = α, these reduce to
X = 0 ,   Y = 2B Sin(α) / 2Cos(α) = B Tan(α) .
In polar coordinates (Dc, φc), the distance and angle to the gaze point can be expressed as
Dc = √(X² + Y²) ,   φc = arctan(Y / X) .                                     (4.3)
Equation 4.3 describes the gaze point as if at the end of a telescopic stick which we can extend
and point within a plane. We can solve for the position of the end of this stick in the scene
using the position and orientation of the head and the state of the binocular head. The head
"state" parameters on which the gaze point depends are the distance (Dc), the azimuth gaze
angle, or pan (αg), and the elevation gaze angle, or tilt (βg). The gaze azimuth angle αg is
defined as the sum of the head pan αh and a common vergence angle αc:
αg = αh + αc
The elevation angle depends on the state of the manipulator "neck". These values constitute a
polar expression of the gaze point with respect to a head centered coordinate frame.
Transformation to Cartesian form is quite simple. We consider that the pan and tilt axes of the
head are located at Cartesian coordinates (Xh, Yh, Zh). The position of the gaze point is
determined by:
Xg = Xh + Dc Cos(αg)
Yg = Yh + Dc Sin(αg)
Zg = Zh + Dc Sin(βg)
Both the polar and Cartesian forms of the gaze point are stored in a data structure that defines
the head "state". The Cartesian values are computed from the polar values. Commanded values
may be set by messages from other processes. The difference between a commanded value and a
current position triggers the translator to call a device specific procedure to move the necessary
real axes.
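Equations (4.1), (4.2) and the Cartesian transform are easily sketched in code (our function names; angles in radians, B the half baseline):

```python
import math

def gaze_point(B, al, ar):
    """Intersection of the two optical axes, equations (4.1) and (4.2)."""
    s = math.sin(al + ar)
    X = B * math.sin(ar - al) / s
    Y = 2.0 * B * math.sin(al) * math.sin(ar) / s
    return X, Y

def gaze_cartesian(head, Dc, ag, bg):
    """Cartesian gaze point from head position and polar gaze parameters."""
    Xh, Yh, Zh = head
    return (Xh + Dc * math.cos(ag),
            Yh + Dc * math.sin(ag),
            Zh + Dc * math.sin(bg))
```

For symmetric vergence (al = ar = α) the first function returns X = 0 and Y = B tan α, the telescopic-stick case described in the text.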
Command of the gaze point involves controlling an under-constrained system of motors. Our
solution is to simultaneously drive each of the axes with a common error term. In order to assure
stability, we must assure that the sum of the gain terms for the redundant axes is less than one.
Thus each motor moves with its characteristic speed, and the system converges to the specified
gaze point in an over-damped manner. The motor gains are tuned to assure stable convergence
over the range of motions.
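A toy demonstration (our construction, not the LIFIA controller) of that stability rule: redundant axes driven by a shared error term converge when their gains sum to less than one:

```python
def step(axes, target, gains):
    """One control cycle: every axis moves by its own gain times the common error."""
    error = target - sum(axes)
    return [a + g * error for a, g in zip(axes, gains)]

axes = [0.0, 0.0]                        # two redundant axes reaching one gaze value
for _ in range(60):
    axes = step(axes, 1.0, [0.3, 0.4])   # gains sum to 0.7 < 1: stable convergence
```

The combined error shrinks by the factor 1 − (sum of gains) each cycle, so it decays monotonically, exactly the over-damped behaviour described above, whenever that sum lies between zero and one.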
5.2 Focus
It is well known that focus can be controlled by the "sharpness" of contrast. The problem is
how to measure such "sharpness". In [Krotkov 87] we can find a description of several methods
for measuring image sharpness. Horn [Horn 68] proposes to maximize the high-frequency
energy in the power spectrum. Jarvis proposes to sum the magnitude of the first derivative of
neighboring pixels along a scan line [Jarvis 83]. Schlag [Schlag et al. 83] and Krotkov
[Krotkov 87] propose to sum the squared gradient magnitude. Tenenbaum [Tenenbaum 82] and
Schlag compare the gradient magnitude to a threshold and sum only those pixels which are
above the threshold. The problem is then the choice of such a threshold. We have found that such
a measure performs poorly. After experiments with several measures, we have obtained our best
results with the sum of gradient magnitude, without the use of a threshold.
We measure the image gradient at level five of our low-pass pyramid, providing a binomial
smoothing window with a standard deviation of 4√2. The gradient is calculated using compositions
of the filter [1 0 −1] in the row and column directions. By default, the "region of interest" is
at the center of the image, but this region may be placed anywhere in the image by a message
from another software module. Local extrema in the gradient magnitude are summed within the
region of interest. An initialize command causes focus to look for a global maximum in this
sum. Subsequently, the reflex action seeks to keep the focus at a local maximum. Note that
this measure exhibits a plateau around the proper focal value. This region corresponds to the
"depth of field". Reducing the aperture will enlarge the depth of field and thus enlarge this
plateau.
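A minimal sketch of this kind of measure (the toy image and helper names are ours; the text sums local extrema of the gradient magnitude, while for brevity we sum the magnitude itself over the region of interest):

```python
import math

def sharpness(img, r0, r1, c0, c1):
    """Sum of gradient magnitudes inside the region of interest [r0,r1) x [c0,c1)."""
    total = 0.0
    for r in range(max(r0, 1), min(r1, len(img) - 1)):
        for c in range(max(c0, 1), min(c1, len(img[0]) - 1)):
            gx = img[r][c + 1] - img[r][c - 1]   # [1 0 -1] along the row
            gy = img[r + 1][c] - img[r - 1][c]   # [1 0 -1] along the column
            total += math.hypot(gx, gy)
    return total
```

An autofocus loop would scan the lens over its range and keep the setting that maximizes this sum: a well focused edge scores higher than a defocused one, with the plateau behaviour noted above.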
Figure 5.2 shows the values obtained by this measure. The camera was pointed at the
boundary between a dark and a gray face of a calibration cube, at a distance of approximately 1
meter. The sum of the gradient extrema was computed within a 20 by 20 pixel region centered over
this boundary at level 1 of our pyramid, and the focus was scanned over its range of values. At
each focus setting, the sum of extrema was calculated.
6. Conclusions
In this paper we have presented a layered control architecture for a binocular head. We began by
discussing the principles of layered control. We then presented the mechanical and motor
control architecture for the LIFIA/SAVA binocular head.
Section 4 of this paper was concerned with a device level controller. In particular we developed
the control system for estimating and controlling the device state in terms of a "gaze point"
which can be used to explore the scene. By defining a virtual head, we are able to provide a
general head protocol. This head controller should allow algorithms to be easily ported between
different heads, even when the axes are not configured the same.
In section 5 we described some preliminary work on measures for controlling focus, aperture,
and vergence. The measures which we presented are all simple, stable, and of low
computational complexity. The development of such control techniques is a necessary
component in the construction of real-time active vision systems.
Bibliography
[Bajcsy 88] R. Bajcsy, "Active Perception", IEEE Proceedings, Vol. 76, No. 8, pp. 996-1006,
August 1988.
[Ballard 88] Ballard, D. H. and Ozcandarli, A., "Eye Fixation and Early Vision: Kinematic
Depth", IEEE 2nd Intl. Conf. on Computer Vision, Tarpon Springs, Fla., pp. 524-531, Dec.
1988.
[Ballard 91] D. Ballard, "Animate Vision", Artificial Intelligence, Vol. 48, No. 1, pp. 1-27,
February 1991.
[Clark and Ferrier 88] Clark, J. and Ferrier, N., "Modal Control of an Attentive Vision
System", IEEE 2nd Intl. Conf. on Computer Vision, Tarpon Springs, Fla., pp. 514-523, Dec.
1988.
[Eklundh 92] J. O. Eklundh and K. Pahlavan, "Head, Eye and Head-Eye System", SPIE
Applications of AI X: Machine Vision and Robotics, Orlando, Fla., April 1992 (to appear).
[Horn 68] Horn, B. K. P., "Focusing", MIT Artificial Intelligence Lab Memo No. 160,
May 1968.
[Jarvis 83] Jarvis, R. A., "A Perspective on Range Finding Techniques for Computer
Vision", IEEE Trans. on PAMI, 5(2), pp. 122-139, March 1983.
[Krotkov 87] Krotkov, E., "Focusing", International Journal of Computer Vision, 1,
pp. 223-237, 1987.
[Krotkov 90] Krotkov, E., Henriksen, K. and Kories, R., "Stereo Ranging from Verging
Cameras", IEEE Trans. on PAMI, Vol. 12, No. 12, pp. 1200-1205, December 1990.
[Schlag et al. 83] Schlag, J., A. C. Sanderson, C. P. Neumann, and F. C. Wimberly,
"Implementation of Automatic Focusing Algorithms for a Computer Vision System with
Camera Control", CMU-RI-TR-83-14, August 1983.
[Westelius et al. 91] Westelius, C. J., H. Knutsson, and G. H. Granlund, "Focus of
Attention Control", SCIA-91, Seventh Scandinavian Conference on Image Analysis, Aalborg,
August 1991.
Computing Exact Aspect Graphs of Curved Objects:
Algebraic Surfaces*
Jean Ponce1, Sylvain Petitjean1, and David J. Kriegman2
1 Dept. of Computer Science, University of Illinois, Urbana, IL 61801, USA
2 Dept. of Electrical Engineering, Yale University, New Haven, CT 06520, USA
1 Introduction
The aspect graph [25] is a qualitative, viewer-centered representation that enumerates all
possible appearances of an object: The range of all possible viewpoints is partitioned into
maximal regions such that the structure of the image contours, also called the aspect, is
the same from every viewpoint in a region. The change in the aspect at the boundary
between regions is named a visual event. The maximal regions and their boundaries are
organized into a graph, whose nodes represent the regions with their associated aspects
and whose arcs correspond to the visual event boundaries between adjacent regions.
Since their introduction by Koenderink and van Doorn [25] more than ten years ago,
aspect graphs have been the object of very active research. The main focus has been on
polyhedra, whose contour generators are viewpoint-independent. Indeed, approximate
aspect graphs of polyhedra have been successfully used in recognition tasks [7, 18, 20],
and several algorithms have been proposed for computing the exact aspect graph of these
objects [6, 15, 16, 31, 36, 38, 39, 41, 42].
Recently, algorithms for constructing the exact aspect graph of simple curved objects
such as solids bounded by quadric surfaces [8] and solids of revolution [11, 12, 26] have
also been introduced. For more complex objects, it was recognized from the start that
the necessary theoretical tools could be found in catastrophe theory [1, 5, 25]. However,
algorithms based on these tools have, until very recently, remained elusive: Koenderink
[24] and Kergosien [23] show the view sphere curves corresponding to the visual events of
some surfaces, but, unfortunately, neither author details the algorithm used to compute
these curves. Rieger [35] uses cylindrical algebraic decomposition to compute the aspect
graph of a quartic surface of the form z = f(x, y).
This paper is the third in a series on the construction of exact aspect graphs of smooth
objects, based on the catalogue of possible visual events established by Kergosien [22]
(see [33, 40] for the case of piecewise-smooth objects). Previously, we presented a fully
implemented algorithm for solids of revolution whose generating curve is polynomial [26]
(see [11, 12] for a different approach to the same problem), and reported preliminary
results for polynomial parametric surfaces [32]. Here, we present a fully implemented
algorithm for computing the aspect graph of an opaque solid bounded by parametric or
* This work was supported by the National Science Foundation under Grant IRI-9015749.
implicit smooth algebraic surfaces, observed under orthographic projection (see [23, 24,
35, 37] for related approaches).
This algorithm is described in Sect. 3. It relies on a combination of symbolic and
numerical techniques, including curve tracing and cell decomposition [29], homotopy
continuation [30], and "symbolic" ray tracing [21, 28]. An implementation is described
in Sect. 4, and examples are presented (Figs. 4,5). Finally, future research directions are
briefly discussed in Sect. 5. While the main ideas of our approach are presented in the
body of the paper, detailed equations and algorithms are relegated to four appendices.
2 Visual Events
Let us start by reviewing some results from catastrophe theory [1]: From most viewpoints,
the image contours of smooth surfaces are piecewise-smooth curves whose only
singularities are cusps and t-junctions. The contour structure is in general stable with
respect to viewpoint, i.e., it does not change when the camera position is subjected to
a small perturbation. From some viewpoints, however, almost any perturbation of the
viewpoint will alter the contour topology. A catalogue of these "visual events" has been
established by Kergosien [22] for transparent generic smooth surfaces observed under
orthographic projection (Fig. 1).
Fig. 1. Visual events. a. Local events, from top to bottom: swallowtail, beak-to-beak, lip. b.
Multilocal events, from top to bottom: triple point, tangent crossing, and cusp crossing.
Each visual event in this catalogue occurs when the viewing direction has high order
contact with the observed surface along certain characteristic curves [1, 24]. When contact
occurs at a single point on the surface, the event is said to be local; when it occurs at
multiple points, it is said to be multilocal. A catalogue of visual events is also available for
piecewise-smooth surfaces [33, 40], but we will restrict our discussion to smooth surfaces
in the rest of this paper.
2.1 Local Events
As shown in [22], smooth surfaces may exhibit three types of local events: swallowtail,
beak-to-beak, and lip transitions (Fig. 1.a). During a swallowtail transition, a smooth
image contour forms a singularity and then breaks off into two cusps and a t-junction. In
a beak-to-beak transition, two distinct portions of the occluding contour meet at a point
in the image. After meeting, the contour splits and forms two cusps; the connectivity
of the contour changes. Finally, a lip transition occurs when, out of nowhere, a closed
contour is formed with the introduction of two cusps.
Swallowtails occur on flecnodal curves, and both beak-to-beak and lip transitions
occur on parabolic curves [1, 24]. Flecnodal points are inflections of asymptotic curves,
while parabolic points are zeros of the Gaussian curvature. Equations for the parabolic
and flecnodal curves of parametric and implicit surfaces are given in Appendices A.1 and
B.1 respectively. The corresponding viewing directions are asymptotic directions along
these curves.
2.2 Multilocal Events
These events occur when two or more surface points project onto the same contour
point. As shown in [22], there are three types of multilocal events: triple points, tangent
crossings, and cusp crossings (Fig. 1.b). A triple point is formed by the intersection of
three contour segments. For an opaque object, only two branches are visible on one side
of the transition while three branches are visible on the other side. A tangent crossing
occurs when two contours meet at a point and share a common tangent. Finally, a cusp
crossing occurs when the projection of a contour cusp meets another contour.
A multilocal event is characterized by a curve defined in a high dimension space, or
equivalently by a family of surface curves. For example, a triple point is formed when
three surface points are aligned and, in addition, the surface normals at the three points
are all orthogonal to the common line supporting these points. By sweeping this line
while maintaining three-point contact, a family of three curves is drawn on the surface.
Equations for the families of surface curves corresponding to multilocal events are given
in Appendices A.2 and B.2 for parametric and implicit surfaces respectively. The
corresponding viewing directions are parallel to the lines supporting the points forming the
events.
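As an illustrative check of the triple-point condition just described (our code, not the paper's implementation):

```python
def is_triple_point(points, normals, eps=1e-9):
    """True if three surface points are collinear and every surface normal
    is orthogonal to the common supporting line."""
    def sub(a, b):
        return [x - y for x, y in zip(a, b)]
    def cross(a, b):
        return [a[1] * b[2] - a[2] * b[1],
                a[2] * b[0] - a[0] * b[2],
                a[0] * b[1] - a[1] * b[0]]
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    p1, p2, p3 = points
    d = sub(p2, p1)                       # direction of the supporting line
    collinear = max(abs(c) for c in cross(d, sub(p3, p1))) < eps
    normals_ok = all(abs(dot(d, n)) < eps for n in normals)
    return collinear and normals_ok
```

The viewing direction of the event is then parallel to the supporting-line direction d.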
3 The Algorithm
We propose the following algorithm for constructing the aspect graph of an opaque solid
bounded by an algebraic surface:
1. Trace the visual event curves.
2. Eliminate the occluded events.
3. Construct the regions delineated on the view sphere by the remaining events.
4. Construct the corresponding aspects.
We now detail each step of the algorithm. Note that the aspect graph of a transparent
solid can be constructed by using the same procedure but omitting step 2.
3.1 Step 1: Tracing Visual Events
As shown in Sect. 2, a visual event corresponds in fact to two curves: a curve (or family
of curves) Γ drawn on the object surface and a curve Δ drawn on the view sphere.
For algebraic surfaces, the curve Γ is defined implicitly in R^{n+1} by a system of n
polynomial equations in n + 1 unknowns, with 1 ≤ n ≤ 8 (see Appendices A and B):

P1(X0, X1, ..., Xn) = 0,
       ...                                                       (1)
Pn(X0, X1, ..., Xn) = 0.
Fig. 2. An example of curve tracing in R^2. This curve has two extremal points E1, E2, and four
regular branches with sample points S1 to S4; note that E2 is singular.
This algorithm overcomes the main difficulties of curve tracing, namely finding a
sample point on every real branch and marching through singularities. Its output is a
graph whose nodes are extremal or singular points on Γ and whose arcs are discrete
approximations of the smooth curve branches between these points. This graph is similar
to the s-graph representation of plane curves constructed through cylindrical algebraic
decomposition [2]. Using the mapping from Γ onto Δ, a discrete approximation of the
curve Δ is readily constructed.
3.2 Step 2: Eliminating Occluded Events
All visual events of the transparent object are found in step 1 of the algorithm. For an
opaque object, some of these events will be occluded, and they should be eliminated. The
visibility of an event curve Γ can be determined through ray tracing at its sample point
found in step 1.3 [21, 44].
3.3 Step 3: Constructing the Regions
To construct the aspect graph regions delineated by the curves A on the view sphere, we
refine the curve tracing algorithm into a cell decomposition algorithm whose output is a
description of the regions, their boundary curves, and their adjacency relationships. Note
that this refinement is only possible for curves drawn in two-dimensional spaces such as
the sphere.
The algorithm is divided into the following steps (Fig. 3):
3.1. Compute all extremal points of the curves in the X0 direction.
3.2. Compute all the intersection points between the curves.
3.3. Compute all intersections of the curves with the "vertical" lines orthogonal to the
     X0 axis at the extremal and intersection points.
3.4. For each interval of the X0 axis delimited by these lines, do the following:
     3.4.1. Intersect the curves and the line passing through the mid-point of the interval
            to obtain a sample point on each real branch of each curve.
     3.4.2. Sort the sample points in increasing X1 order.
     3.4.3. March from the sample points to the intersection points found in step 3.3.
     3.4.4. Two consecutive branches within an interval of X0 and the vertical segments
            joining their extremities bound a region.
A sample point can be found for each region as the mid-point of the sample points
of the bounding curve branches. This point is used to construct a representative aspect
in Sect. 3.4. Maximal regions are found by merging all regions adjacent along a vertical
line segment (two regions are adjacent if they share a common boundary, i.e., a vertical
line segment or a curve branch).
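As an illustration, steps 3.1-3.4 can be sketched in Python for the simplified case where every curve branch is available as an explicit single-valued function X1 = f(X0) over the strip (a hypothetical setting chosen for brevity; the paper's curves are implicit and must first be traced):

```python
def decompose(branches, event_xs, x_range):
    """Sweep-line cell decomposition (steps 3.1-3.4, simplified sketch).

    `branches` are explicit functions X1 = f(X0); `event_xs` are the
    abscissae of the extremal and intersection points (steps 3.1-3.2).
    Returns one representative sample point per region (step 3.4.4).
    """
    xs = sorted(set(event_xs) | set(x_range))
    regions = []
    for a, b in zip(xs, xs[1:]):                 # intervals of the X0 axis
        mid = 0.5 * (a + b)
        # steps 3.4.1-3.4.2: one sample per branch, sorted in increasing X1
        ys = sorted(f(mid) for f in branches)
        # step 3.4.4: two consecutive branches bound a region; sample it
        regions.extend((mid, 0.5 * (lo + hi)) for lo, hi in zip(ys, ys[1:]))
    return regions

# Two crossing lines: their intersection at X0 = 0.5 splits the strip
# between them into two regions, one sample point on each side.
regions = decompose([lambda x: x, lambda x: 1.0 - x], [0.5], (0.0, 1.0))
```

The merging of maximal regions across vertical segments is omitted; with the two lines above, the routine returns the sample points (0.25, 0.5) and (0.75, 0.5).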
Fig. 3. An example of cell decomposition. Two curves are shown, with their extremal points Ei
and their intersection points Ii; the shaded rectangle delimited by I1 and I2 is divided into five
regions with sample points S1 to S5; the region corresponding to S3 is shown in a darker shade.
3.4 Step 4: Constructing the Aspects
This step involves determining the contour structure of a single view for each region,
first for the transparent object, then for the opaque object. This can be done through
"symbolic" ray tracing of the object contour [28] as seen from the sample point of the
region. Briefly, the contour structure is found using the curve tracing algorithm described
earlier. Since contour visibility only changes at the contour singularities found by the
algorithm, it is determined through ray tracing [21, 44] at one sample point per regular
branch.
The algorithm described in Sect. 3 has been fully implemented. Tracing the visual event
curves (step 1) is by far the most expensive part of the algorithm. Curve tracing and
continuation are parallel processes that can be mapped onto medium-grained MIMD
architectures. We have implemented continuation on networks of Sun SPARC Stations
communicating via Ethernet, networks of INMOS Transputers, and Intel Hypercubes. In
practice, this allows us to routinely solve systems with a few thousand roots, a task
that requires a few hours using a dozen SPARCstations. The elimination of occluded
events in step 2 of the algorithm only requires ray tracing a small number of points on
the surface and takes a negligible amount of time. In our current implementation, the
cell decomposition algorithm of step 3 works directly with discrete approximations of
the visual event curves and only takes a few seconds. Finally, it takes a few minutes to
generate the aspects in step 4.
In the following examples, an object is represented graphically by its silhouette,
parabolic and flecnodal curves, and the corresponding aspect graph is shown (more pre-
cisely, the visual event curves and their intersections are drawn on the view sphere).
Figure 4.a shows an object bounded by a complex parametric surface, and its (partial)
aspect graph. All visual events except certain cusp crossings and triple points have been
traced. Note that the aspect graph is extremely complicated, even though some events
are still missing. Also, note that this object is in fact only piecewise-smooth. As remarked
earlier, the catalogue of visual events used in this paper can be extended to piecewise-
smooth surfaces [33, 40], and corresponding equations can also be derived [32].
The objects considered in the next three examples are described by smooth compact
implicit surfaces of degree 4. The full aspect graph has been computed. Note that it has
the structure predicted in [5, 24] for similarly shaped surfaces.
Figure 4.b shows the silhouette of a bean-shaped object and the corresponding aspect
graph, with vertices drawn as small circles. This object has a hyperbolic patch within a
larger convex region.
Figure 4.c shows a squash-shaped object, its parabolic and flecnodal curves and its
aspect graph. This object has two convex parts separated by a hyperbolic region. Note
the two concentric parabolic curves surrounding the flecnodal curves.
Figure 4.d shows a "dimpled" object and its aspect graph. This object has a concave
island within a hyperbolic annulus, itself surrounded by a convex region. The flecnodal
curve almost coincides with the outer parabolic curve (compare to [24, p. 467]). There
is no tangent crossing in this case. Figure 5.a shows the corresponding decomposition
of the parameter space of the view sphere into 16 maximal regions, with sample points
indicated as black spots. The horizontal axis represents longitude, measured between -π
and π, and the vertical axis represents latitude, measured between -π/2 and π/2. Figure
5.b shows the corresponding 16 aspects.
What do these results indicate? First, it seems that computing exact aspect graphs
of surfaces of high degree is impractical. It can be shown that triple points occur only for
surfaces of degree 6 or more, and that computing the extremal points of the corresponding
curves requires solving a polynomial system of degree 4,315,680 - a very high degree
indeed! Even if this extraordinary computation were feasible (or another method than
ours proved simpler), it is not clear how useful a data structure as complicated as the
aspect graph of Fig. 4.a would be for vision applications.
On the other hand, aspect graphs of low-degree surfaces do not require tracing triple
points, and the necessary amount of computation remains reasonable (for example, a
mere few thousand roots had to be computed for the tangent crossings of the bean-
shaped object). In addition, as demonstrated by Fig. 4.b-d, the aspect graphs of these
objects are quite simple and should prove useful in recognition tasks.
We have presented a new algorithm for computing the exact aspect graph of curved ob-
jects and described its implementation. This algorithm is quite general: as noted in [27],
Fig. 4. A few objects and their aspect graphs: a. A parametric surface, b. A bean-shaped implicit
surface, c. A squash-shaped implicit surface, d. A "dimpled" implicit surface.
Fig. 5. a. Aspect graph regions of the "dimpled" object in parameter space, b. The corresponding
aspects.
algebraic surfaces subsume most representations used in computer aided design and com-
puter vision. Unlike alternative approaches based on cylindrical algebraic decomposition
[3, 9], our algorithm is also practical, as demonstrated by our implementation.
We are investigating the case of perspective projection: the (families of) surface curves
that delineate the visual events under orthographic projection also delineate perspective
projection visual events by defining ruled surfaces that partition the three-dimensional
view space into volumetric cells.
Future research will be dedicated to actually using the aspect graph representation
in recognition tasks. In [27], we have demonstrated the recovery of the position and
orientation of curved three-dimensional objects from monocular contours by using a
purely quantitative process that fits an object-centered representation to image contours.
What is missing is a control structure for guiding this process. We believe that the
qualitative, viewer-centered aspect graph representation can be used to guide the search
for matching image and model features and yield efficient control structures analogous
to the interpretation trees used in the polyhedral world [14, 17, 19].
A.1 Local Events
We recall the equations defining the surface curves (parabolic and flecnodal curves) as-
sociated to the visual events (beaks, lips, swallowtails) of parametric surfaces. There
is nothing new here, but equations for flecnodal curves are not so easy to find in the
literature.
Note: in this appendix, a u (resp. v) subscript is used to denote a partial derivative
with respect to u (resp. v).
Consider a parametric surface X(u, v) and define

N = Xu × Xv,   e = N · Xuu,   f = N · Xuv,   g = N · Xvv,

i.e., N is the surface normal, and e, f, g are (up to normalization) the coefficients of the second fundamental
form in the coordinate system (Xu, Xv) [10], with asymptotic directions satisfying

e u'^2 + 2 f u' v' + g v'^2 = 0.

The asymptotic directions are given by u'Xu + v'Xv, where u' and v' are solutions of
the above equation. A contour cusp occurs when the viewing direction is an asymptotic
direction.
A.1.3 Flecnodal Curves. As shown in [43, p. 85], the flecnodal points are inflections
of the asymptotic curves A, given by:
An equation for the flecnodal curves is obtained by eliminating u', v', u'', v'' among
eqs. (4), (6), and the equation obtained by differentiating (4) with respect to t. Note
that since all three equations are homogeneous in u', v' and in u'', v'', this can be done by
arbitrarily setting u' = 1, u'' = 1, say, and eliminating v', v'' among these three equations.
The resulting equation in u, v characterizes the flecnodal curves.
Note that although it is possible to construct the general equation of flecnodal curves
for arbitrary parametric surfaces, this equation is very complicated, and it is better to
derive it for each particular surface using a computer algebra system. As before, explicit
equations for Ω can be constructed.
A.2 Multilocal Events
Multilocal events occur when the viewing direction V has high order contact with the
surface in at least two distinct points X1 and X2. In that case, V = X1 - X2.
(xx - x ~ ) ( x 2 - X3) = 0,
(x~ - x ~ ) 9 N 3 = 0,
(7)
(x2 - x z ) 9 N 1 = 0,
(x~ - x , ) 9N~ = 0 .
The first equation is a vector equation (or equivalently a set of two independent scalar
equations) that expresses the fact that the three points are aligned with the viewing
direction. The next three equations simply express the fact that the three points belong
to the occluding contour. It follows that triple points are characterized by five equations
in the six variables ui, vi, i = 1,2, 3.
An explicit equation for the curve D corresponding to a triple point can be obtained
by replacing (X2 - X3) by V in (7). Similar comments apply to the other multilocal
events.
N1 × N2 = 0,
(X1 - X2) · N1 = 0.            (8)
Again, remark that the first equation is a vector equation (or equivalently a set of
two independent scalar equations). It follows that tangent crossings are characterized by
three equations in the four variables ui, vi, i = 1, 2.
where e1, f1, g1 are the values of the coefficients of the second fundamental form at X1,
and (a, b) are the coordinates of the viewing direction X1 - X2 in the basis Xu(u1, v1),
Xv(u1, v1) of the tangent plane. Note that a, b can be computed from the dot products of
X1 - X2 with Xu(u1, v1) and Xv(u1, v1). It follows that cusp crossings are characterized
by three equations in the four variables ui, vi, i = 1, 2.
F(X, Y, Z) = F(X) = 0,            (10)
where F is a polynomial in X, Y, Z.
B.1 Local Events
For implicit surfaces, even the equations of parabolic curves are buried in the literature.
Equations for both parabolic and flecnodal curves are derived in this appendix.
Note: in this appendix, X, Y, Z subscripts denote partial derivatives with respect to
these variables.
∇F(X) · V = 0,            (11)
V^T H(X) V = 0,
plus the equation F(X) = 0 itself. For each point X along a parabolic curve, there is only
one asymptotic direction, which is given by (11). It should be noted that one can directly
characterize the beak-to-beak and lip events by adding to (12) the equations F(X) = 0
and (11), and tracing the resulting curve Ω in ℝ⁵; the projection of this curve onto S²
defines the beak-to-beak and lip curves on the view sphere.
∇F(X) · V = 0,
V^T H(X) V = 0,            (13)
V^T (H_X(X) V1 + H_Y(X) V2 + H_Z(X) V3) V = 0.
Since these three equations are homogeneous in the coordinates of V, these coordi-
nates can easily be eliminated to obtain a single equation in X. Along with F(X) = 0,
this system defines the flecnodal curves. As before, explicit equations for Ω can be con-
structed.
B.2 Multilocal Events
B.2.1 Triple Points. Let Xi, i = 1, 2, 3, be three points forming a triple point event.
The corresponding equations are similar to the equations defining triple points of para-
metric surfaces:
F(Xi) = 0,   i = 1, 2, 3,
(X1 - X2) × (X2 - X3) = 0,
(X1 - X2) · N3 = 0,            (14)
(X2 - X3) · N1 = 0,
(X3 - X1) · N2 = 0,
where Ni = ∇F(Xi). It follows that triple points are characterized by eight equations in
the nine variables Xi, Yi, Zi, i = 1, 2, 3.
F(Xi) = 0,   i = 1, 2,
N1 × N2 = 0,            (15)
(X1 - X2) · N1 = 0.
This is a system of five equations in the six variables Xi, Yi, Zi, i = 1, 2.
(∂P1/∂X1) dX1 + ... + (∂P1/∂Xn) dXn = 0
        ...                                        ⟺   J (dX1, ..., dXn)^T = 0            (17)
(∂Pn/∂X1) dX1 + ... + (∂Pn/∂Xn) dXn = 0
where J = (∂Pi/∂Xj), with i, j = 1, ..., n, is the Jacobian matrix. This system has non-
trivial solutions if and only if the determinant D(X0, X1, ..., Xn) = |J| of the Jacobian
matrix vanishes. The extrema of Γ are therefore the solutions of:
P1(X0, X1, ..., Xn) = 0,
        ...                    (18)
Pn(X0, X1, ..., Xn) = 0,
D(X0, X1, ..., Xn) = 0.
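For a single polynomial in two variables, the system (18) reduces to P = 0 and D = ∂P/∂X1 = 0 and can be solved with a generic root finder. The following sketch (my own illustration, not the paper's continuation-based solver) applies Newton's method to the unit circle, whose extrema in the X0 direction are (±1, 0):

```python
def newton2(F, J, x, y, iters=50):
    """Newton's method for a 2x2 system F(x, y) = (0, 0); J returns the
    Jacobian entries (f1x, f1y, f2x, f2y)."""
    for _ in range(iters):
        f1, f2 = F(x, y)
        a, b, c, d = J(x, y)
        det = a * d - b * c
        # solve the 2x2 linear system J * (dx, dy) = (f1, f2) by Cramer's rule
        x -= (d * f1 - b * f2) / det
        y -= (a * f2 - c * f1) / det
    return x, y

# System (18) for P = X0^2 + X1^2 - 1: the extrema of the curve in the X0
# direction satisfy P = 0 and D = dP/dX1 = 2 X1 = 0.
F = lambda x, y: (x * x + y * y - 1.0, 2.0 * y)
J = lambda x, y: (2.0 * x, 2.0 * y, 0.0, 2.0)
x, y = newton2(F, J, 0.9, 0.3)   # converges to the extremum (1, 0)
```

Starting the iteration near each root (as the homotopy method guarantees in the paper's setting) recovers every extremal point in turn.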
These steps correspond to finding all the intersections of Γ with some hyperplane X0 =
X̂0. These intersections are given by:
P1(X̂0, X1, ..., Xn) = 0,
        ...                    (19)
Pn(X̂0, X1, ..., Xn) = 0.
C.3 Step 1.4: Marching on Extrema-Free Intervals
J (dX1, ..., dXn)^T = -dX0 (∂P1/∂X0, ..., ∂Pn/∂X0)^T            (20)
Given a step dX0 in the X0 direction, one can predict the remaining dXi's by solving
this system of linear equations. This is only possible when the determinant of the Jacobian
matrix J is non-zero, which is exactly equivalent to saying that the point (X0, ..., Xn)^T
is not an extremum in the X0 direction.
The correction step uses Newton iterations to converge back to the curve from the
predicted point. We write once more a first order Taylor approximation of the Pi's to
compute the necessary correction (dX1,..., dXn) T for a fixed value of X0:
P1 + (∂P1/∂X1) dX1 + ... + (∂P1/∂Xn) dXn = 0
        ...                                        ⟺   (dX1, ..., dXn)^T = -J^{-1} (P1, ..., Pn)^T            (21)
Pn + (∂Pn/∂X1) dX1 + ... + (∂Pn/∂Xn) dXn = 0
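The prediction step (20) and the correction step (21) combine into a classical predictor-corrector tracer. A minimal sketch for n = 1 (so the Jacobian J is the scalar ∂P/∂X1), marching along the upper branch of the unit circle:

```python
import math

def trace_branch(P, P0, P1, x, y, dx, steps):
    """Predictor-corrector tracing of P(X0, X1) = 0 for n = 1.

    Prediction (Eq. 20): J dX1 = -dX0 dP/dX0, with J = dP/dX1.
    Correction (Eq. 21): Newton iterations on X1 at fixed X0.
    Valid only on an extrema-free interval (J != 0 along the branch).
    """
    points = [(x, y)]
    for _ in range(steps):
        y += -dx * P0(x, y) / P1(x, y)   # prediction
        x += dx
        for _ in range(5):               # correction
            y -= P(x, y) / P1(x, y)
        points.append((x, y))
    return points

# Upper branch of the unit circle, marched from X0 = -0.5 to X0 = 0.
P  = lambda x, y: x * x + y * y - 1.0
P0 = lambda x, y: 2.0 * x                # dP/dX0
P1 = lambda x, y: 2.0 * y                # dP/dX1 (the 1x1 Jacobian)
pts = trace_branch(P, P0, P1, -0.5, math.sqrt(0.75), 0.05, 10)
```

Every marched point lies on the curve to within the Newton tolerance; marching stops at the extremal points, which are handled separately as described above.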
(1 - t) Q(X) + t P(X) = 0.            (22)
The solutions of the target system are found by tracing the curve defined in ℝⁿ⁺¹ by
these equations from t = 0 to t = 1 according to step 1.4 of our curve tracing algorithm.
In this case, however, the sample points are the known solutions of Q(X) = 0 at t = 0,
which allows us to bypass step 1.3 of the algorithm. It can also be shown [30] that with
an appropriate choice of Q, the curve has no extrema or singularities, which allows us to
also bypass steps 1.1-1.2.
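A one-variable sketch of this continuation scheme, with a hand-picked start system Q and a trivial predictor standing in for the full tracer of step 1.4:

```python
def continuation(P, dP, Q, dQ, roots, steps=100):
    """Homotopy continuation (Eq. 22): trace (1 - t) Q(x) + t P(x) = 0
    from the known roots of Q at t = 0 to the roots of P at t = 1."""
    solutions = []
    dt = 1.0 / steps
    for x in roots:
        t = 0.0
        for _ in range(steps):
            t += dt
            for _ in range(5):   # Newton correction at the new t
                H = (1.0 - t) * Q(x) + t * P(x)
                dH = (1.0 - t) * dQ(x) + t * dP(x)
                x -= H / dH
        solutions.append(x)
    return solutions

# Target P(x) = x^2 - 2 = 0, start system Q(x) = x^2 - 1 = 0 with known
# roots +1 and -1; one continuation path is traced from each start root.
roots = continuation(lambda x: x * x - 2.0, lambda x: 2.0 * x,
                     lambda x: x * x - 1.0, lambda x: 2.0 * x, [1.0, -1.0])
```

The two paths end at ±√2, the roots of the target system; with a generic choice of Q (as in [30]) the paths carry every root of P and never cross a singularity.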
References
1. V.I. Arnol'd. Singularities of systems of rays. Russian Math. Surveys, 38(2):87-176, 1983.
2. D.S. Arnon. Topologically reliable display of algebraic curves. Computer Graphics,
17(3):219-227, July 1983.
3. D.S. Arnon, G. Collins, and S. McCallum. Cylindrical algebraic decomposition I and II.
SIAM J. Comput., 13(4):865-889, November 1984.
4. C.L. Bajaj, C.M. Hoffmann, R.E. Lynch, and J.E.H. Hopcroft. Tracing surface intersec-
tions. Computer Aided Geometric Design, 5:285-307, 1988.
5. J. Callahan and R. Weiss. A model for describing surface shape. In Proc. IEEE Conf.
Comp. Vision Patt. Recog., pages 240-245, San Francisco, CA, June 1985.
6. G. Castore. Solid modeling, aspect graphs, and robot vision. In Pickett and Boyse, editors,
Solid modeling by computer, pages 277-292. Plenum Press, NY, 1984.
7. I. Chakravarty. The use of characteristic views as a basis for recognition of three-
dimensional objects. Image Processing Laboratory IPL-TR-034, Rensselaer Polytechnic
Institute, October 1982.
8. S. Chen and H. Freeman. On the characteristic views of quadric-surfaced solids. In 1EEE
Workshop on Directions in Automated CAD-Based Vision, pages 34-43, June 1991.
9. G.E. Collins. Quantifier Elimination for Real Closed Fields by Cylindrical Algebraic De-
composition, volume 33 of Lecture Notes in Computer Science. Springer-Verlag, New York,
1975.
10. M.P. do Carmo. Differential Geometry of Curves and Surfaces. Prentice-Hall, Englewood
Cliffs, NJ, 1976.
11. D. Eggert and K. Bowyer. Computing the orthographic projection aspect graph of solids
of revolution. In Proc. IEEE Workshop on Interpretation of 3D Scenes, pages 102-108,
Austin, TX, November 1989.
12. D. Eggert and K. Bowyer. Perspective projection aspect graphs of solids of revolution: An
implementation. In IEEE Workshop on Directions in Automated CAD-Based Vision, pages
44-53, June 1991.
13. R.T. Farouki. The characterization of parametric surface sections. Comp. Vis. Graph. Im.
Proc., 33:209-236, 1986.
14. O.D. Faugeras and M. Hebert. The representation, recognition, and locating of 3-D objects.
International Journal of Robotics Research, 5(3):27-52, Fall 1986.
15. Z. Gigus, J. Canny, and R. Seidel. Efficiently computing and representing aspect graphs of
polyhedral objects. IEEE Trans. Patt. Anal. Mach. Intell., 13(6), June 1991.
16. Z. Gigus and J. Malik. Computing the aspect graph for line drawings of polyhedral objects.
IEEE Trans. Patt. Anal. Mach. Intell., 12(2):113-122, February 1990.
17. W.E.L. Grimson and T. Lozano-Pérez. Localizing overlapping parts by searching the in-
terpretation tree. IEEE Trans. Patt. Anal. Mach. Intell., 9(4):469-482, 1987.
18. M. Hebert and T. Kanade. The 3D profile method for object recognition. In Proc. IEEE
Conf. Comp. Vision Patt. Recog., pages 458-463, San Francisco, CA, June 1985.
19. D.P. Huttenlocher and S. Ullman. Object recognition using alignment. In Proc. Int. Conf.
Comp. Vision, pages 102-111, London, U.K., June 1987.
20. K. Ikeuchi and T. Kanade. Automatic generation of object recognition programs. Proceed-
ings of the IEEE, 76(8):1016-35, August 1988.
21. J.T. Kajiya. Ray tracing parametric patches. Computer Graphics, 16:245-254, July 1982.
22. Y.L. Kergosien. La famille des projections orthogonales d'une surface et ses singularités.
C.R. Acad. Sc. Paris, 292:929-932, 1981.
23. Y.L. Kergosien. Generic sign systems in medical imaging. IEEE Computer Graphics and
Applications, 11(5):46-65, 1991.
24. J.J. Koenderink. Solid Shape. MIT Press, Cambridge, MA, 1990.
25. J.J. Koenderink and A.J. Van Doorn. The internal representation of solid shape with
respect to vision. Biological Cybernetics, 32:211-216, 1979.
26. D.J. Kriegman and J. Ponce. Computing exact aspect graphs of curved objects: solids of
revolution. Int. J. of Comp. Vision, 5(2):119-135, 1990.
27. D.J. Kriegman and J. Ponce. On recognizing and positioning curved 3D objects from image
contours. IEEE Trans. Patt. Anal. Mach. Intell., 12(12):1127-1137, December 1990.
28. D.J. Kriegman and J. Ponce. Geometric modelling for computer vision. In SPIE Confer-
ence on Curves and Surfaces in Computer Vision and Graphics II, Boston, MA, November
1991.
29. D.J. Kriegman and J. Ponce. A new curve tracing algorithm and some applications. In P.J.
Laurent, A. Le Méhauté, and L.L. Schumaker, editors, Curves and Surfaces, pages 267-270.
Academic Press, New York, 1991.
30. A.P. Morgan. Solving Polynomial Systems using Continuation for Engineering and Scien-
tific Problems. Prentice Hall, Englewood Cliffs, NJ, 1987.
31. H. Plantinga and C. Dyer. Visibility, occlusion, and the aspect graph. Int. J. of Comp.
Vision., 5(2):137-160, 1990.
32. J. Ponce and D.J. Kriegman. Computing exact aspect graphs of curved objects: parametric
patches. In Proc. AAAI Nat. Conf. Artif. Intell., pages 1074-1079, Boston, MA, July 1990.
33. J.H. Rieger. On the classification of views of piecewise-smooth objects. Image and Vision
Computing, 5:91-97, 1987.
34. J.H. Rieger. The geometry of view space of opaque objects bounded by smooth surfaces.
Artificial Intelligence, 44(1-2):1-40, July 1990.
35. J.H. Rieger. Global bifurcation sets and stable projections of non-singular algebraic sur-
faces. Int. J. of Comp. Vision., 1991. To appear.
36. W.B. Seales and C.R. Dyer. Constrained viewpoint from occluding contour. In IEEE
Workshop on Directions in Automated "CAD-Based" Vision, pages 54-63, Maui, Hawaii,
June 1991.
37. T. Sripradisvarakul and R. Jain. Generating aspect graphs of curved objects. In Proc.
IEEE Workshop on Interpretation of 3D Scenes, pages 109-115, Austin, TX, December
1989.
38. J. Stewman and K.W. Bowyer. Aspect graphs for planar-face convex objects. In Proc.
IEEE Workshop on Computer Vision, pages 123-130, Miami, FL, 1987.
39. J. Stewman and K.W. Bowyer. Creating the perspective projection aspect graph of poly-
hedral objects. In Proc. Int. Conf. Comp. Vision, pages 495-500, Tampa, FL, 1988.
40. C.T.C. Wall. Geometric properties of generic differentiable manifolds. In A. Dold
and B. Eckmann, editors, Geometry and Topology, pages 707-774, Rio de Janeiro, 1976.
Springer-Verlag.
41. R. Wang and H. Freeman. Object recognition based on characteristic views. In Interna-
tional Conference on Pattern Recognition, pages 8-12, Atlantic City, N J, June 1990.
42. N. Watts. Calculating the principal views of a polyhedron. CS Tech. Report 234, Rochester
University, 1987.
43. C.E. Weatherburn. Differential Geometry. Cambridge University Press, 1927.
44. T. Whitted. An improved illumination model for shaded display. Comm. of the ACM,
23(6):343-349, June 1980.
SURFACE INTERPOLATION USING WAVELETS
Alex P. Pentland
1 Introduction
Surface interpolation is a common problem in both human and computer vision. Perhaps
the most well-known interpolation theory is regularization [7, 9]. However, this theory
has the drawback that the interpolation network requires hundreds or even thousands of
iterations to produce a smoothly interpolated surface. Thus in computer vision applica-
tions surface interpolation is often the single most expensive processing step. In biological
vision, timing data from neurophysiology makes it unlikely that many iterations of cell
firing are involved in the interpolation process, so that interpolation theories have been
forced to assume some sort of analog processing. Unfortunately, there is little experimen-
tal evidence supporting such processing outside of the retina. In this paper I will show
how efficient solutions to these problems can be obtained by using orthogonal wavelet
filters or receptive fields.
1.1 Background
In computer vision the surface interpolation problem typically involves constructing a
smooth surface, sometimes allowing a small number of discontinuities, given a sparse set
of noisy range or orientation measurements. Mathematically, the problem may be defined
as finding a function U within a linear space H that minimizes an energy functional E:
E(U) = inf_{V ∈ H} E(V) = inf_{V ∈ H} (K(V) + R(V))            (1)
Fig. 1. Wavelet filter family "closest" to Wilson-Gelb filters (arbitrarily scaled for display).
When the functional K is chosen to be the stress within a bending thin plate (as is standard), then K
is the stiffness matrix familiar from physical simulation. Unfortunately, several thousand
iterations are often required to produce the interpolated surface. Although sophisticated multires-
olution techniques can improve performance, the best reported algorithms still require
several hundred iterations.
The cost of surface interpolation is proportional to both the bandwidth and condition
number of K. Both of these quantities can be greatly reduced by choosing the correct
basis (a set of n orthogonal vectors) and associated coordinate system in which to solve
the problem. In neural systems, transformation to a new basis or coordinate system can
be accomplished by passing a data vector through a set of receptive fields; the shapes of
the receptive fields are the new basis vectors, and the resulting neural activities are the
coordinates of the data vector in the coordinate system defined by these basis vectors. If
the receptive fields are orthonormal, then we can convert back to the original coordinate
system by adding up the same receptive fields in amounts proportional to the associated
neuron's activity.
For the class of physically-motivated smoothness functionals, the ideal basis would
be both spatially and spectrally localized, and (important for computer applications)
very fast to compute. The desire for spectral localization stems from the fact that, in the
absence of boundary conditions, discontinuities, etc., these sort of physical equilibrium
problems can usually be solved in closed form in the frequency domain. In similar fash-
ion, a spectrally-localized basis will tend to produce a banded stiffness matrix K. The
requirement for spatial localization stems from the need to account for local variations
in K ' s band structure due to, for instance, boundary conditions, discontinuities, or other
inhomogeneities.
A class of bases that provide the desired properties are generated by functions known
as orthogonal wavelets [5, 2, 8]. Orthogonal wavelet functions and receptive fields are
different from the wavelets previously used in biological and computational modeling
because all of the functions or receptive fields within a family, rather than only the
functions or receptive fields of one size, are orthogonal to one another. A family of
It has been proven that by using wavelet bases, linear operators such as K can be
represented extremely compactly [1]. This suggests that Φw is an effective preconditioning
transform, and thus may be used to obtain very fast approximate solutions. The simplest
method is to transform a previously-defined K to the wavelet basis,
K_w = Φw^T K Φw            (5)
then to discard off-diagonal elements,
K_d = diag(Φw^T K Φw)            (6)
and then to solve. Note that for each choice of K the diagonal matrix K_d is calculated
only once and then stored; further, its calculation requires only O(n) operations. In
numerical experiments I have found that for a typical K the summed magnitude of the
off-diagonals of K_w is approximately 5% of the diagonal's magnitude, so that we expect
to incur only small errors by discarding off-diagonals.
This set of wavelets was developed by applying the gradient-descent QMF design procedure of
Simoncelli and Adelson [8] using the Wilson-Gelb filters as the initial "guess" at an orthogonal
basis. Wavelet receptive fields from only five octaves are shown, although the Wilson-Gelb
model has six channels. Wilson, in a personal communication, has advised us that the Wilson-
Gelb "b" and "e" channels are sufficiently similar that it is reasonable to group them into a
single channel.
Case 1. The simplest case of surface interpolation is when sensor measurements exist for
every node so that the sampling matrix S = I. Substituting Φw Ũ = U and premultiplying
by Φw^T converts Equation 3 to
λ Φw^T K Φw Ũ + Φw^T Φw Ũ = Φw^T D            (7)
Case 2. In the more usual case where not all nodes have sensor measurements, the in-
terpolation solution may require iteration. In this case the sampling matrix S is diagonal
with ones for nodes that have sensor measurements, and zeros elsewhere. Again substi-
tuting Φw Ũ = U and premultiplying by Φw^T converts Equation 3 to
λ Φw^T K Φw Ũ + Φw^T S Φw Ũ = Φw^T D            (9)
The matrix Φw^T S Φw is diagonally dominant so that the interpolation solution U may be
obtained by iterating
U^{t+1} = Φw (λ K_d + S̄)^{-1} Φw^T D^t + U^t            (10)
where S̄ = diag(Φw^T S Φw), K_d = diag(Φw^T K Φw), and D^t = D - (λK + S) U^t is the residual at iteration t. I have
found that normally no more than three to five iterations of Equation 10 are required
to obtain an accurate estimate of the interpolated surface; often a single iteration will
suffice.
Note that for this procedure to be successful, the largest gaps in the data sampling
must be significantly smaller than the largest filters in the wavelet transform. Further,
when λ is small and the data sampling is sparse and irregular, it can happen that the
off-diagonal terms of Φw^T S Φw introduce significant error. When using small λ I have found
that it is best to perform one initial iteration with a large λ, and then reduce λ to the
desired value in further iterations.
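The pipeline of Equations 5, 6 and 10 can be sketched in one dimension, with an orthonormal Haar basis standing in for Φw and a simple second-difference stiffness for K (both illustrative substitutions: the paper uses QMF wavelets fitted to the Wilson-Gelb filters and a thin-plate K):

```python
def haar(n):
    """Orthonormal Haar basis of R^n (n a power of two); the rows play
    the role of the wavelet receptive fields."""
    if n == 1:
        return [[1.0]]
    h, s = haar(n // 2), 2 ** -0.5
    top = [[s * v for v in row for _ in range(2)] for row in h]
    bot = [[0.0] * n for _ in range(n // 2)]
    for i in range(n // 2):
        bot[i][2 * i], bot[i][2 * i + 1] = s, -s
    return top + bot

def mat(A, B):   # dense matrix product
    return [[sum(x * y for x, y in zip(r, c)) for c in zip(*B)] for r in A]

def vec(A, x):   # matrix-vector product
    return [sum(a * v for a, v in zip(r, x)) for r in A]

n, lam = 8, 0.1
L = [[0.0] * n for _ in range(n)]            # 1-D second-difference operator
for i in range(n):
    L[i][i] = 2.0 if 0 < i < n - 1 else 1.0
    if i > 0:
        L[i][i - 1] = -1.0
    if i < n - 1:
        L[i][i + 1] = -1.0
K = [[float(i == j) + lam * L[i][j] for j in range(n)] for i in range(n)]

W = haar(n)                                  # rows: wavelet receptive fields
Wt = [list(c) for c in zip(*W)]
Kw = mat(mat(W, K), Wt)                      # Eq. (5): K in the wavelet basis
dinv = [1.0 / Kw[i][i] for i in range(n)]    # Eq. (6): keep the diagonal

D = [float(i % 3) for i in range(n)]         # arbitrary data vector
U = [0.0] * n
for _ in range(5):                           # Eq. (10)-style iteration
    r = [d - ku for d, ku in zip(D, vec(K, U))]          # residual D^t
    c = [di * ci for di, ci in zip(dinv, vec(W, r))]     # diagonal solve
    U = [u + du for u, du in zip(U, vec(Wt, c))]
resid = sum((d - ku) ** 2 for d, ku in zip(D, vec(K, U))) ** 0.5
```

With this K, a handful of iterations already cuts the residual well below its initial value, in line with the three-to-five iterations reported above.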
An Example. Figure 2(a) shows the height measurements input to a 64 x 64 node inter-
polation problem (zero-valued nodes have no data); the vertical axis is height. These data
were generated using a sparse (10%) random sampling of the function z = 100[sin(kx)+
sin(ky)]. Figure 2(b) shows the resulting interpolated surface. In this example Equation
10 converged to within 1% of its true equilibrium state with a single iteration. Execution
time was approximately 1 second on a Sun 4/330.
Fig. 2. A surface interpolation problem; solution after one iteration (1 second on a Sun 4/330).
2.1 Summary
I have described a method for surface interpolation that uses orthogonal wavelets to
obtain good interpolations with only a very few iterations. The method has a simple bio-
logical implementation, and its performance was illustrated with wavelets that accurately
model human spatial frequency sensitivity.
References
1. Alpert, B., Beylkin, G., Coifman, R., Rokhlin, V. (1990) Wavelets for the Fast Solution of
Second-Kind Integral Equations. Yale Research Report DCS.RR.837, December 1990.
2. Daubechies, I. (1988) Orthonormal Bases of Compactly Supported Wavelets. Communica-
tions on Pure and Applied Mathematics, XLI:909-996, 1988.
3. Kohonen, T., (1982) Self-organized formation of topologically correct feature maps, Biol.
Cyber., 43, pp. 59-69.
4. Linsker, R. (1986) From basic network principles to neural architecture, Proc. Nat. Acad.
Sci, U.S.A., 83, pp. 7508-7512, 8390-8394, 8779-8783.
5. Mallat, S. G., (1989) A theory for multiresolution signal decomposition: the wavelet repre-
sentation, IEEE Trans. PAMI, 11(7):674-693, 1989
6. Pentland, A., (1991) Cue integration and surface completion, Invest. Ophthal. and Visual
Science 32(4):1197, March 1991.
7. Poggio, T., Torre, V., and Koch, C., (1985) Computational vision and regularization theory,
Nature, 317:314-319, Sept. 26, 1985.
8. Simoncelli, E., and Adelson, E., (1990) Non-Separable Extensions of Quadrature Mirror
Filters to Multiple Dimensions, Proceedings of the IEEE, 78(4):652-664, April 1990
9. Terzopoulos, D., (1988) The Computation of visible surface representations, IEEE Trans.
PAMI, 10(4):417-439, 1988.
10. Wilson, H., and Gelb, G., (1984) Modified line-element theory for spatial-frequency and
width discrimination, J. Opt. Soc. Am. A 1(1):124-131, Jan. 1984.
This article was processed using the LaTeX macro package with ECCV92 style
Smoothing and Matching of 3-D Space Curves *
1 Introduction
Physicians are frequently confronted with the very practical problem of registering 3D
medical images: for example, when two images provided by complementary imaging
modalities (such as X-ray scanner, magnetic resonance imaging, nuclear medicine, or
ultrasound images) must be compared, or when two images of the same type but acquired
at different times and/or in different positions must be superimposed.
A methodology exploited by researchers in the Epidaure Project at Inria, Paris, con-
sists of first extracting highly structured descriptions from 3D images, and then using
those descriptions for matching [1]. Characteristic curves describe either topological sin-
gularities such as surface borders, hole borders, and simple or multiple junctions, etc.
(see [10]), or differential structures, such as ridges, parabolic lines, and umbilic points [11].
* This work was financed in part by a grant from Digital Equipment Corporation. General
Electric-CGR partially supported the research that provided ridge extraction software.
The characteristic curves are stable with respect to rigid transformations, and can tolerate
partial occlusion due to their local nature. They are typically extracted as a connected
set of discrete voxels, which provides a much more compact description than the original
3D images (involving a few hundred points compared to several million). Fig. 1 shows
an example of ridges extracted from the surface of a skull [12]. These curves can be used
to serve as a reference identifying positions and features of the skull and to establish
landmarks to match skulls between different individuals, yielding a standard approach
for complex skull modeling [4].
Fig. 1. Extraction of characteristic curves (crest lines) from the surface of a skull (using two
different X-ray Scanner images)
The problem we address in this paper is the use of these curves to identify and
accurately locate 3D objects. Our approach consists of introducing a new algorithm
to approximate a discrete curve by a sufficiently smooth continuous one (a spline) in
order to compute intrinsic differential features of second and third order (curvature and
torsion). Given two curves, we then wish to find, through a matching algorithm, the
longest common portion, up to a rigid transformation. From three possible approaches,
namely prediction-verification, accumulation and geometric hashing, we retained the
third, whose complexity is sublinear in the number of models. We call it an indexation
method, and introduce logical extensions of the work of [9, 15, 3]. Our work is also closely
related to the work of [2, 6, 16] on the identification and positioning of 3D objects.
In Section 2, we discuss approaches to fitting curves to collections of voxels (points) in
3D imagery. In Section 3, we implement a matching system based on the indexation (geo-
metric hashing), whose complexity is sublinear in the number of models in the database.
Certain modifications are required for use with the differentiable spline curve represen-
tation, and other enhancements are suggested, in order to make the method robust to
partial occlusion of the curves (potentially in multiple sections). We finally introduce
alternative invariants for hashing. In sum, we considerably extend previous indexation-
based curve-matching methods. In Section 4, we provide experimental results obtained
using real data.
extensive literature on B-splines. We provide a very brief introduction, using the notation
of [3, 15]. Given a sequence of n+1 points P_i(x_i, y_i, z_i), i = 0..n in 3-space, a C^{K-2}
approximating B-spline consists of the following components:

1. A control polygon of m+1 points V_j(X_j, Y_j, Z_j), j = 0..m;
2. m+1 real-valued piecewise polynomial functions B_{j,K}(ū), the basis splines, which
   are functions of the real variable ū, consist of polynomials of degree K-1, and are
   globally of class C^{K-2}. The location in 3-space of the approximating curve for a
   given parameter value ū is given by: Q(ū) = Σ_{j=0}^{m} V_j B_{j,K}(ū).
3. The knots must also be specified, and consist of m+K real values {ū_j}, with ū_1 = 0
   and ū_{m+K} = L, partitioning the interval [0, L] into m+K-1 intervals. Here, L is the
   length of the polygon joining the P_i's. If the intervals are uniform, then we say that
   the approximation is a uniform B-spline.
We use the global parameter ū along the interval [0, L], and denote by u the relative
distance between knots, defined by u = (ū - ū_j)/(ū_{j+1} - ū_j). The basis spline functions
are defined recursively. The basis splines of order 1 are simply the characteristic functions
of the intervals:

   B_{j,1}(ū) = 1 if ū_j ≤ ū < ū_{j+1}, and 0 otherwise.
Successively higher-order splines are formed by blending lower-order splines:

   B_{j,K}(ū) = ((ū - ū_j)/(ū_{j+K-1} - ū_j)) B_{j,K-1}(ū) + ((ū_{j+K} - ū)/(ū_{j+K} - ū_{j+1})) B_{j+1,K-1}(ū).

Thus the quadratic splines B_{j,3} are C^1, the cubic splines B_{j,4} are C^2, etc. Because of
this simple formula, we may incorporate constraints on the derivatives in our measure of
the quality of an approximation, for the process of finding the best control points and
knots, and we will also be able to easily make use of differential measures of the curve
for matching purposes.
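The recursion above (the Cox-de Boor formula) is easy to state in code. The following sketch is our own illustration, not the authors' implementation; the function names and the example knot vector are assumptions:

```python
import numpy as np

def bspline_basis(j, K, u, knots):
    """Cox-de Boor recursion: basis spline B_{j,K} of order K (degree K-1)."""
    if K == 1:
        return 1.0 if knots[j] <= u < knots[j + 1] else 0.0
    left = right = 0.0
    if knots[j + K - 1] > knots[j]:                      # guard repeated knots
        left = (u - knots[j]) / (knots[j + K - 1] - knots[j]) * bspline_basis(j, K - 1, u, knots)
    if knots[j + K] > knots[j + 1]:
        right = (knots[j + K] - u) / (knots[j + K] - knots[j + 1]) * bspline_basis(j + 1, K - 1, u, knots)
    return left + right

def eval_curve(control, K, knots, u):
    """Q(u) = sum_j V_j B_{j,K}(u) for 3-D control points V_j."""
    return sum(bspline_basis(j, K, u, knots) * np.asarray(v, dtype=float)
               for j, v in enumerate(control))
```

On a uniform knot vector the basis splines sum to one over the interior of the parameter range, which is what makes Q(ū) an affine combination of the control vertices.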
2.1 A Previous Approximation Scheme
We next recall a classic approximation scheme due to Barsky [3]. This scheme has been
used by Saint-Marc and Médioni [15] for curve matching. Our emphasis is on the short-
comings of the approach for our objectives, and on proposed modifications.
Given n+1 data points P_i(x_i, y_i, z_i), i = 0..n, we seek the m+1 control vertices V_j,
j = 0..m and the m+K corresponding knots ū_j, j = 1..m+K minimizing the sum of squared
distances between the B-spline Q(ū) of degree K-1 and the data P_i. The notion of
distance between a spline Q(ū) and a data point P_i is based on the parameter value ū_i
where the curve Q(ū) comes closest to P_i. Thus, the criterion to minimize is:

   A_1 = Σ_{i=0}^{n} ||Q(ū_i) - P_i||²

The calculation of the ū_i values is critical, since ||Q(ū_i) - P_i|| is supposed to represent the
Euclidean distance of the point P_i to the curve. On the other hand, an exact calculation
of the values ū_i is difficult, since they depend implicitly on the solution curve Q(ū). As
an expedient, Barsky suggests using for ū_i the current total length of the polygonal curve
from P_0 to P_i. Thus, as an estimate, we can use ū_i = Σ_{k=0}^{i-1} ||P_{k+1} - P_k||. If B is the m+1
by n+1 matrix of the B_{j,K}(ū_i), X the m+1 by 3 matrix of control vertices and z the n+1 by
3 matrix of data point coordinates, A_1 can be written as ||B^T X - z||². Differentiating
with respect to X leads to:

   B B^T X - B z = 0, or A X = B z.

Because X^T B B^T X = ||B^T X||², we know that A and B have the same rank. Thus, if m ≤ n,
A is positive definite up to numerical error. If the approximating curve is not a closed
curve, then A is a band matrix of bandwidth K, and X can be determined in time linear
in m with a Cholesky decomposition.
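The chord-length parameterization and the least-squares solve can be sketched as follows. This is our own illustration of the scheme recalled above, not the paper's code: `fit_bspline`, the clamped uniform knot choice, and the use of a dense solver (rather than a banded Cholesky factorization) are assumptions made for brevity:

```python
import numpy as np

def basis(j, K, u, t):
    """Cox-de Boor recursion for the basis spline B_{j,K} (order K, degree K-1)."""
    if K == 1:
        return 1.0 if t[j] <= u < t[j + 1] else 0.0
    a = b = 0.0
    if t[j + K - 1] > t[j]:
        a = (u - t[j]) / (t[j + K - 1] - t[j]) * basis(j, K - 1, u, t)
    if t[j + K] > t[j + 1]:
        b = (t[j + K] - u) / (t[j + K] - t[j + 1]) * basis(j + 1, K - 1, u, t)
    return a + b

def fit_bspline(points, m, K=4):
    """Least-squares fit of m+1 control vertices to the data points,
    using chord-length estimates of the parameters u_i."""
    P = np.asarray(points, dtype=float)                    # (n+1, 3) data P_i
    # chord-length estimate: u_i = cumulative polygon length from P_0 to P_i
    u = np.r_[0.0, np.cumsum(np.linalg.norm(np.diff(P, axis=0), axis=1))]
    L = u[-1]
    # clamped knot vector with uniform interior knots (m + K + 1 values)
    t = np.r_[np.zeros(K - 1), np.linspace(0.0, L, m - K + 3), np.full(K - 1, L)]
    # design matrix of basis values; clip u so the last point stays in range
    B = np.array([[basis(j, K, min(ui, L * (1 - 1e-12)), t) for j in range(m + 1)]
                  for ui in u])
    # solves the normal equations B^T B X = B^T P in the least-squares sense
    X, *_ = np.linalg.lstsq(B, P, rcond=None)
    return X, t, u
```

Because the system is banded in practice, a production implementation would exploit the band structure as the text describes; `lstsq` is used here only to keep the sketch short.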
In working with this method, we have observed that m+1, the number of control
points, must be quite large in order to obtain a good visual fit to the data points. Worse,
small-amplitude oscillations often appear, corrupting the derivative information and
making derivative-based matching methods unworkable. For example, using the synthetic
data of a noisy helix (Fig. 2a), we reconstruct Fig. 2b using the Barsky method for
spline approximation. It can be seen that curvature and torsion measurements along the
approximation curve will be unstable. In the next section, we explain how the results
shown in Figs. 2c and 2d are obtained.
2.2 Improvements
The parameters ū_i are now computed more precisely by solving, for each data point,
F_i(ū) = ∂||Q(ū) - P_i||²/∂ū = 0. The improved criterion A_2 augments A_1 with two
additional terms. The first is proportional to Var(||Q(ū_i) - P_i||),
and Var designates the observed variance of the argument values over the index i. The
second term is related to the bending energy of the spline. Since the second derivative
values are linear in terms of control vertices, A2 is again quadratic, the construction
and complexity are as before, and the result is a spline. Fig. 2d illustrates results of
minimizing A2.
Fig. 2. a. Top left: Noise is added to a helix, and points are sampled with a limitation on the
distance between successive points. The curvature and torsion are plotted in the top and the
right panels of the cube, as a function of arclength. In a perfect reconstruction, the curvature
and torsion would be constant.
b. Top right: In the reconstruction method as suggested by Barsky, curvature and (especially)
torsion values are extremely noisy, despite the quality of the reconstruction (in terms of position)
of the original curve.
c. Bottom left: A more precise estimate of model-data distances improves the estimation of
curvature and torsion.
d. Bottom right: The constraint on the second derivative also improves the estimation.
onal representation of the curves, and thus vote for a model and a displacement length,
representing the difference between the arclength locations of the point s_i and the candi-
date matching point m_{i,j}, measured relative to some reference point along each curve's
representation. Since our representation of the curves includes a differentiable structure,
and thus Frenet frames, we may include the explicit calculation of the entire rigid trans-
formation as part of the recognition process. The advantage of our method is that, whereas
the arclength parametrization can suffer from inaccuracies and accumulated errors,
the six-parameter rigid transformation suffers only from local representation error. An-
other advantage of voting for rigid transformations is that we may use a statistical method
to compute a distance between two such transformations, and incorporate this into the
voting process and the indexation table [7].
3.1 Enhancements to the Indexation Method
every point P on the (sampled) curve. This computation is repeated for every model
curve, and for every extremal-curvature basis point B along the curve. In this way, the in-
formation about the model curves is stored in a three-dimensional table, indexed by
(c, θ_t, u_t). In each bin of this table, entries consist of a model curve, a point on that curve
(together with the corresponding Frenet frames), and a basis point B, also on the curve.

For the recognition algorithm, a basis point is selected on an unknown curve, and
transformations are computed from that basis point to other points along the curve. For
each such computation, the parameters (c, θ_t, u_t) map to a bin in the three-dimensional
table, which gives rise to votes for model/basis pairs, as before. This procedure also
applies to curves in multiple sections (features are exclusively local), and, lastly, to
scattered points associated with curvature information and a local reference frame.
Experimental results are reported in the next section.
4 Results
Using two views (A and B) of a skull from real X-ray scanner data, we used existing
software (see [12]) to find ridge points, and then fed these points into the curve smoothing
algorithm of Section 2. For each view, the sub-mandibular rim, the sub-orbital ridges,
the nose contour and other curves were identified.
Using the indexation algorithm of Section 3.1, we preprocessed all curves from A,
also in the reverse orientation if necessary (to cope with orientation problems), and built
the indexation table, based on measurements of (c, θ_t, u_t) along the curves. Applying the
indexation-based recognition algorithm, curves from B were successfully identified. The
resulting transformations were applied to all curves from B, and superimposed matches
(model, scene) appear in Figs. 3b to 3d. We next ran our algorithm on the chin and
right orbit curves considered as one single curve (Fig. 3e), and finally on all curves simul-
taneously (Fig. 3f). CPU times on a DEC workstation (in seconds) for recognition and
positioning are summarized in the following table. It confirms the linear-time hypothesis.
| scene curve | nose contour | right orbit | left orbit | chin  | chin + orbit | all curves from B |
| CPU time    | 1.085        | 0.964       | 1.183      | 2.577 | 3.562        | 9.515             |
Note that incorporating more curves increases the likelihood of the match. We thus
start from a local curve match and end up with one global rigid transformation.
We then experimented with matching using scattered points (several hundred) on the
surface of the object, selected for their high curvature value on the surface and associated
with a surface frame [12] (Fig. 3g). Lastly, we registered the entire skull by simply applying
the transformation that superimposed the two sub-mandibular curves. Incorporating the
match of the orbital ridge curves, we improved the overall rigid transformation estimate,
resulting in a more precise correspondence (Fig. 4).
References
1. N. Ayache, J.D. Boissonnat, L. Cohen, B. Geiger, J. Levy-Vehel, O. Monga, and P. Sander.
Steps toward the automatic interpretation of 3-d images. In H. Fuchs K. Hohne and
S. Pizer, editors, 8D Imaging in Medicine, pages 107-120. NATO ASI Series, Springer-
Verlag, 1990.
2. N. Ayache and O.D. Faugeras. Hyper: A new approach for the recognition and positioning
of two-dimensional objects. IEEE Transactions on Pattern Analysis and Machine Intelli-
gence, 8(1):44-54, January 1986.
Fig. 3. a. Top left: The successful matching of the two sub-mandibular curves, superimposed.
(Note that the occlusion and translation of the second view are handled automatically.)
b. Top middle: Nose contours matched. c. Top right: Right orbits matched.
d. Middle left: Left orbits matched. e. Middle: Chin and orbit matched simultaneously.
f. Middle right: All curves matched simultaneously.
g. Bottom: The matching algorithm is successfully applied (bottom right) to scattered points
associated to A (bottom left) and B (bottom middle), represented here together with their
reference frames. There is a scale factor on the x and y axes, due to the evaluation in image
coordinates (as compared to real coordinates in the previous maps).
3. R. Bartels, J. Beatty, and B. Barsky. An introduction to splines for use in computer graph-
ics and geometric modeling. Morgan Kaufmann Publishers, 1987.
4. Court B. Cutting. Applications of computer graphics to the evaluation and treatment of
major craniofacial malformation. In Jayaram K. Udupa and Gabor T. Herman, editors, 3D
Imaging in Medicine. CRC Press, 1989.
5. W. Eric L. Grimson and Daniel P. Huttenlocher. On the verification of hypothesized
matches in model-based recognition. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 13(12):1201-1213, December 1991.
6. W.E.L. Grimson and T. Lozano-Pérez. Model-based recognition and localization from
sparse range or tactile data. International Journal of Robotics Research, 3(3):3-35, 1984.
Fig. 4. Registering the ridges: the top row shows the ridges extracted from a skull scanned in
position A (top left) and position B (top right). The figure in the bottom left shows the superposition
of the ridge points, obtained after transforming the points of the second view according to the
transformation discovered by matching the sub-mandibular curves. The best correspondences are
along the chin points. The figure in the bottom right shows the improved transformation obtained
with the addition of the left sub-orbital curves.
7. A. Guéziec and N. Ayache. Smoothing and matching of 3d-space curves. Technical Report
1544, Inria, 1991.
8. J.C. Holladay. Smoothest curve approximation. Math. Tables Aids Computation, 11:233-
243, 1957.
9. E. Kishon, T. Hastie, and H. Wolfson. 3-d curve matching using splines. Technical report,
AT&T, November 1989.
10. G. Malandain, G. Bertrand, and Nicholas Ayache. Topological segmentation of discrete
surface structures. In Proc. International Conference on Computer Vision and Pattern
Recognition, Hawaii, USA, June 1991.
11. O. Monga, N. Ayache, and P. Sander. From voxels to curvature. In Proc. International
Conference on Computer Vision and Pattern Recognition, Hawaii, USA, June 1991.
12. Olivier Monga, Serge Benayoun, and Olivier D. Faugeras. Using third order derivatives to
extract ridge lines in 3d images. Submitted to IEEE Conference on Computer Vision and
Pattern Recognition, Urbana-Champaign, June 1992.
13. T. Pavlidis. Structural Pattern Recognition. Springer-Verlag, 1977.
14. M. Plass and M. Stone. Curve fitting with piecewise parametric cubics. In SIGGRAPH, pages
229-239, July 1983.
15. P. Saint-Marc and G. Medioni. B-spline contour representation and symmetry detection.
In First European Conference on Computer Vision (ECCV), Antibes, April 1990.
16. F. Stein. Structural hashing: Efficient 3-d object recognition. In Proc. International Con-
ference on Computer Vision and Pattern Recognition, Hawaii, USA, June 1991.
17. D. W. Thompson and J. L. Mundy. 3-d model matching from an unconstrained viewpoint.
In Proc. International Conference on Robotics and Automation, pages 208-220, 1987.
Shape from Texture for Smooth Curved Surfaces
Jonas Gårding
Computational Vision and Active Perception Laboratory (CVAP)
Department of Numerical Analysis and Computing Science
Royal Institute of Technology, S-100 44 Stockholm, Sweden
Email: jonasg@bion.kth.se
1 Introduction
Fig. 1. This image of a slanting plane covered with circles illustrates several forms of projective
distortion that can be used to estimate surface shape and orientation.
The fact that projective texture distortion can be a cue to three-dimensional surface
shape was first pointed out by Gibson [5]. His observations were mostly of a qualitative
nature, but during the four decades which have passed since the appearance of Gibson's
seminal work, many interesting and useful methods for the quantitative recovery of sur-
face orientation from projective distortion have been proposed; see e.g. [3] for a review.
Furthermore, psychophysical studies (e.g. [1, 2]) have verified that texture distortion
does indeed play an important role in human perception of three-dimensional surfaces.
However, in our view there are two important issues which have not received enough
attention in previous work.
Firstly, most of the proposed mechanisms are based on the assumption that the
surface is planar. As pointed out by many authors, real-world physical surfaces are rarely
perfectly planar, so the planarity assumption can at best be justified locally. However,
limiting the size of the analyzed region is generally not enough. We show in this paper
that even for infinitesimally small surface patches, there is only a very restricted class
of texture distortion measures which are invariant with respect to surface curvature.
Gibson's gradient of texture density, for example, does not belong to this class.
Secondly, the possibility of using projective texture distortion as a direct cue to sur-
face properties has not been fully exploited in the past. For example, a local estimate
of a suitably chosen texture gradient can be used directly to estimate the surface ori-
entation. Although this view of texture distortion as a direct cue predominates in the
psychophysical literature, most previous work in computational vision has proposed in-
direct approaches (e.g. backprojection) where a more or less complete representation of
the image pattern is used in a search procedure.
The main purpose of the present work is to analyze the use of projective distortion as a
direct and local cue to three-dimensional surface shape and orientation. We concentrate
on those aspects that depend on the surface and imaging geometry, and not on the
properties of the surface texture. Whereas most previous work has assumed that the
scene is planar, and sometimes also that the projection is orthographic, we study the
more general case of a smooth curved surface viewed in perspective projection. Early
work in the same spirit was done by Stevens [7], who discussed the general feasibility
of computing shape from texture, and derived several formulas for the case of a planar
surface.
A more detailed account of the work presented here can be found in [4].
Fig. 2. a) Local surface geometry and imaging model. The tangent planes to the viewsphere Σ
at p and to the surface S at F(p) are seen edge-on but are indicated by the tangent vectors
t and T. The tangent vectors b and B are not shown; they are perpendicular to the plane of
the drawing, pointing into the drawing. b) The derivative map F_* can be visualized by an image
ellipse which corresponds to a unit circle in the surface.
direction t is usually called the tilt direction. The angle σ between the viewing direction
p and the surface normal N is called the slant of the surface. Together, slant and tilt
specify the surface orientation uniquely.
We also define an orthogonal basis (T, B) for the tangent plane to the surface S at
F(p) as the normalized images under F_* of t and b respectively.
2.1 First-Order Distortion: Foreshortening
Starting with Gibson [5], much of the literature on shape from texture has been concerned
with texture gradients, i.e., the spatial variation of the distortion of the projected pattern.
However, an important fact which is sometimes overlooked is that texture gradients are
not necessary for slant perception; there is often sufficient information in the local first-
order projective distortion (F_*) alone.
F_* specifies to first order how the image pattern should be "deformed" to fit the
corresponding surface pattern. For a frontoparallel surface, F_* is simply a scaling by the
distance, but for a slanted and tilted surface it will contain a shear as well. It can be
shown that in the bases (t, b) and (T, B), we have the very simple expression

   F_* = diag(1/m, 1/M) = diag(r/cos σ, r)    (1)
where r = ||F(p)|| is the distance along the visual ray from the center of projection to
the surface. The characteristic lengths (m, M) have been introduced to simplify later
expressions and because of their geometric significance: F_* can be visualized by an image
ellipse corresponding to a unit circle in the surface (Fig. 2b). The minor axis of the ellipse
is aligned with t and has the length 2m, and the major axis has the length 2M.
The ratio m/M is called the foreshortening of the pattern. We see that the magnitude
and direction of foreshortening determine the slant σ uniquely, and the tilt t up to sign.
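The recovery of slant and tilt from a measured distortion ellipse can be sketched in a few lines. This is our own illustration, not the paper's implementation: the function name and interface are assumptions, and it uses the standard relation cos σ = m/M between foreshortening and slant:

```python
import numpy as np

def slant_tilt_from_ellipse(minor_axis_vec, m, M):
    """Recover surface orientation from the image ellipse of a unit surface
    circle: slant sigma = arccos(m/M); the tilt is the minor-axis direction,
    determined only up to sign."""
    sigma = np.arccos(np.clip(m / M, 0.0, 1.0))   # clip guards measurement noise
    t = np.asarray(minor_axis_vec, dtype=float)
    t = t / np.linalg.norm(t)
    return sigma, t, -t                            # slant and the two tilt candidates
```

The two returned tilt candidates are exactly the sign ambiguity that the global analysis of Sect. 4 (the "phantom surface") sets out to resolve.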
We are now prepared to take a closer look at the information content of texture gradients,
i.e., various measures of the rate of change of projective texture distortion. Gibson [5]
suggested the gradient of texture density as a main cue. Many other texture gradients
have subsequently been considered in the literature, see e.g. Stevens [7] or Cutting and
Millard [2]. These authors have restricted the analysis to the case of a planar surface.
In this section we reexamine the concept of texture gradients for the more general
case of a smooth curved surface. The analysis of texture gradients can be divided into
two relatively independent subproblems; firstly, gradient measurement, and secondly,
gradient interpretation. Here we concentrate on the interpretation task, but one specific
measurement technique is described in Sect. 5.
3.1 Distortion Gradients
The most obvious way of defining the rate of change of the projective distortion is by
the derivatives of the characteristic lengths M and m defined by (1). This definition
encompasses most of the texture gradients that have been considered in the literature,
e.g. the compression gradient ξ₁∇m, the perspective gradient ξ₂∇M, the foreshortening
gradient ∇e = (ξ₁/ξ₂)∇(m/M), the area gradient ∇A = ξ₁ξ₂∇(mM), and the density
gradient ∇ρ = ρ_s∇(1/(mM)), where ξ₁, ξ₂ and ρ_s are unknown scale factors.
We will refer collectively to such gradients as distortion gradients. They do not all
provide independent information, since by the chain rule the gradient of any function
f(M, m) is simply a linear combination of the basis gradients ∇M and ∇m.
In practice it makes more sense to consider the normalized gradients (∇M)/M and
(∇m)/m, since these expressions are free of scale factors depending on the distance to
the surface and the absolute size of the surface markings. Explicit expressions for these
gradients are given by the following proposition:
where r is the distance from the viewer, σ is the slant of the surface, κ_t is the normal
curvature of the surface in the tilt direction, and τ is the geodesic torsion, or "twist", of
the surface in the tilt direction.
From Proposition 1 it is straightforward to derive explicit expressions for gradients of
any function of m and M. For example, we obtain the normalized foreshortening gradient

   (∇e)/e = (1/cos σ) ( sin σ + r κ_t tan σ , r τ sin σ )    (4)
Proposition 1 and equations (4-5) reveal several interesting facts about texture gradi-
ents. Firstly, the minor gradient ∇m depends on the curvature parameters κ_t and τ,
whereas the major gradient ∇M is independent of surface curvature. This is important
because it means that (∇M)/M can be used to estimate the local surface orientation,
and hence to corroborate the estimate obtained from foreshortening. Furthermore, unlike
foreshortening, (∇M)/M yields an estimate which has no tilt ambiguity.
Secondly, the direction of any texture gradient which depends on m, such as the
foreshortening gradient or the density gradient, is aligned with the tilt direction if and
only if the twist τ vanishes, i.e., if the tilt direction happens to be a principal direction of
the surface. This is of course always true for a planar surface, but for a general curved
surface the only distortion gradient guaranteed to be aligned with tilt is (∇M)/M.
Thirdly, the complete local second-order shape (i.e. curvature) of S cannot be estimated
from distortion gradients. The reason is that it takes three parameters to specify the surface
curvature, e.g. the normal curvatures κ_t, κ_b in the T and B directions and the twist τ.
The Gaussian curvature, for example, is given by K = κ_t κ_b - τ². However, the basis
gradients (2) and (3) are independent of the normal curvature κ_b.
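The three-parameter count above is easy to make concrete. The following snippet is our own illustration (the function name is an assumption); the Gaussian curvature formula is the one in the text, and the mean curvature line uses the standard fact that (T, B) is an orthogonal tangent basis, so H is half the trace of the shape operator:

```python
def second_order_shape(kappa_t, kappa_b, tau):
    """Local second-order shape from the three curvature parameters:
    Gaussian curvature K = kappa_t * kappa_b - tau^2 (as in the text),
    mean curvature H = (kappa_t + kappa_b) / 2 (standard trace formula)."""
    K = kappa_t * kappa_b - tau ** 2
    H = 0.5 * (kappa_t + kappa_b)
    return K, H
```

Since distortion gradients constrain only κ_t and τ, the missing κ_b leaves both K and H undetermined, which is precisely the limitation the text points out.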
3.2 Length Gradients
The concept of distortion gradients can be generalized. Distortion gradients are defined
as the rate of change of some function of the characteristic lengths m and M, everywhere
measured relative to the tilt direction, which may vary in the image. An alternative
procedure is to measure the rate of change, in a fixed direction in the image, of the
projected length in some direction w. It is a non-trivial fact that when w coincides with
the tilt or the perpendicular direction, this measure is equivalent to the corresponding
distortion gradient. However, a gradient can be computed for projected lengths in any
direction, not just t and b. In this way we can obtain information about the surface shape
which cannot be provided by any distortion gradient.
A particularly useful example is the derivative in some direction of the projected
length measured in the same direction. This derivative could e.g. be estimated in a given
direction by measuring the rate of change of the distances between the intersections of
a reference line with projected surface contours. We have shown that the normalized
directional derivative computed this way in the direction w = αt + βb is given by

   -α tan σ ( 2 + (cos² σ)/(α² + β² cos² σ) )    (6)
4 Global Analysis
Obtaining local estimates of surface shape is only the first step in the estimation of shape
from texture. The local estimates must then be combined to obtain a global surface
description, and in this process ambiguities existing at the local level can sometimes be
resolved. In the next subsection we will examine one such possibility in more detail.
4.1 The Phantom Surface
Consider the common situation that local foreshortening is used to estimate surface ori-
entation. We pointed out in Sect. 2 that foreshortening only determines tilt up to sign,
leading to two mutually exclusive estimates of the local surface normal. By integrating
these two sets of surface normals we can in principle obtain two global surface descrip-
tions. We can then use any a priori knowledge we might have about the surface shape
(e.g. that it is approximately planar) to decide which of the two surfaces is more likely
to be correct.
The latter possibility has generally been overlooked in the literature, most likely
because the relation between the two surfaces is trivial if orthographic projection is
assumed. In this case the two surface normals are related by a reflection in the optical
axis, which corresponds to a reflection of the two surfaces in a plane perpendicular to
the optical axis. Hence, both surfaces will have the same qualitative shape.
In perspective projection, however, the relation is much more interesting and useful.
The sign ambiguity in the tilt direction now corresponds to a reflection of the surface
normal in the line of sight, which varies with position in the visual field. For example, if
the true surface has a constant surface normal (i.e., it is planar), then the other set of
surface normals will not be constant, i.e., it will indicate a curved surface. We will call
this surface the "phantom surface" corresponding to the actual surface. Strictly speaking,
we must first show that the surface normals obtained this way are integrahle, so that the
phantom surface actually exists. It turns out that this is indeed the case, and that there
is a very simple relation between the true surface and the phantom surface:
where K is an arbitrary positive constant. The phantom surface S has everywhere the
same slant but reversed tilt direction with respect to the true surface S.
An interesting observation is that the phantom surface and the true surface are equivalent
if and only if r(p) is constant, i.e., if the eye is looking at the inside of a sphere from the
center of that sphere.
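The two local interpretations, and the inversion relating the two surfaces, can be sketched as follows. This is our own illustration (the function names are hypothetical): the normal of the phantom interpretation is the reflection of the true normal in the line of sight, and the phantom radial function follows the inversion-in-the-sphere relation stated in Sect. 5 of the experiments:

```python
import numpy as np

def phantom_normal(n, p):
    """Reflect the unit surface normal n in the unit viewing direction p:
    this is the second interpretation permitted by local foreshortening.
    The reflection preserves the slant (the angle between n and p)."""
    n = np.asarray(n, dtype=float)
    p = np.asarray(p, dtype=float)
    return 2.0 * np.dot(n, p) * p - n

def phantom_radius(r, K=1.0):
    """Radial function of the phantom surface under inversion in the
    viewsphere: r~(p) = K / r(p), with K an arbitrary positive constant."""
    return K / r
```

Note that for constant r (the eye at the center of a sphere) the phantom radial function is again constant, reproducing the equivalence observed in the text.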
Fig. 3. This drawing shows the intersection of the XZ-plane with the plane Z cos σ - X sin σ = δ
and the corresponding phantom surface (a sphere) (X + K sin σ)² + Y² + (Z - K cos σ)² = K².
K is an arbitrary scaling constant which in this example has been set to δ/2.
(Figure panels, top row: Synthetic image; Estimated local distortion; Rescaled local distortion.
Bottom row: Using foreshortening (1); Using foreshortening (2); Using the major gradient.)
Fig. 4. Estimation of surface orientation from first- and second-order projective distortion.
The top row shows the original image (left), the estimated local distortion F̂_i repre-
sented as ellipses on a superimposed 8 x 8 grid (middle), and a rescaled representation
of F̂_i where all the ellipses have been given the same size to allow a better assessment of
their shapes (right).
The bottom row shows the estimated local shape, obtained by first transforming each
local F̂_i to a plane perpendicular to the line of sight and then applying (8). The left image
in this row shows the first estimate of surface orientation obtained from foreshortening,
the middle image shows the second estimate obtained by reflecting the first surface nor-
mal in the line of sight, and the third image shows an independent estimate of surface
orientation, obtained from the normalized major gradient.
Note that the second estimate from foreshortening indicates a curved surface, as a
result of the varying angle between the line of sight and the surface normal. As was
shown in Sect. 4, this "phantom surface" is obtained from the true surface by inversion
in the sphere. The estimate from the major gradient is significantly more noisy than the
other estimates, which was to be expected since it is based on the spatial derivatives
of an estimated quantity. Nevertheless, it is obviously stable enough to resolve the tilt
ambiguity in the estimate from foreshortening.
6 Conclusion
References
1. D. Buckley, J.P. Frisby, and E. Spivey, "Stereo and texture cue combination in ground planes:
an investigation using the table stereometer", Perception, vol. 20, p. 91, 1991.
2. J.E. Cutting and R.T. Millard, "Three gradients and the perception of flat and curved
surfaces", J. of Experimental Psychology: General, vol. 113(2), pp. 198-216, 1984.
3. J. Gårding, Shape from surface markings. PhD thesis, Dept. of Numerical Analysis and
Computing Science, Royal Institute of Technology, Stockholm, May 1991.
4. J. Gårding, "Shape from texture for smooth curved surfaces in perspective projection", Tech.
Rep. TRITA-NA-P9203, Dept. of Numerical Analysis and Computing Science, Royal Insti-
tute of Technology, Stockholm, Jan. 1992.
5. J. Gibson, The Perception of the Visual World. Houghton Mifflin, Boston, 1950.
6. B. O'Neill, Elementary Differential Geometry. Academic Press, Orlando, Florida, 1966.
7. K.A. Stevens, "The information content of texture gradients", Biological Cybernetics, vol. 42,
pp. 95-105, 1981.
This article was processed using the LaTeX macro package with the ECCV92 style
Recognising rotationally symmetric surfaces from their outlines*
1 Introduction
There has been a history of interest in recognising curved surfaces from their outlines.
Freeman and Shapira [7], and later Malik [9] investigated extending line labelling to
curved outlines. Brooks [2] studied using constraint-based modelling techniques to recog-
nise generalised cylinders. Koenderink [8] has pioneered work on the ways in which the
topology of a surface's outline changes as it is viewed from different directions, and has
studied the way in which the curvature of a surface affects the curvature of its out-
line. Ponce [12] studied the relationships between sections of contour in the image of
a straight homogeneous generalised cylinder. Dhome [4] studied recognising rotationally
invariant objects by computing pose from the image of their ending contours.
Terzopoulos et al. [17] compute three-dimensional surface approximations from im-
age data, based around a symmetry seeking model which implicitly assumes that "the
axis of the object is not severely inclined away from the image plane" (p. 119). These
approximations can not, as a result, be used for recognition when perspective effects are
* DAF acknowledges the support of Magdalen College, Oxford, of the University of Iowa, and
of GE. JLM acknowledges the support of the GE Coolidge Fellowship. AZ acknowledges
the support of SERC. CAR acknowledges the support of GE. The GE CRD laboratory is
supported in part by the following: DARPA contract DACA-76-86-C-007, AFOSR contract
F49620-89-C-003.
significant. Despite this work, it has been hard to produce robust, working model based
vision systems for curved surfaces.
Ponce and Kriegman [13] show that elimination theory can be used to predict, in sym-
bolic form, the outline of an algebraic surface viewed from an arbitrary viewing position.
For a given surface a viewing position is then chosen using an iterative technique, to give
a curve most like that observed. The object is then recognized by searching a database,
and selecting the member that gives the best fit to the observed outline. This work shows
that outline curves strongly constrain the viewed surface, but has the disadvantage that
it cannot recover surface parameters without solving a large optimization problem, so
that for a big model base, each model may have to be tested in turn against the image
outline.
A number of recent papers have shown how indexing functions can be used to avoid
searching a model base (e.g. [5, 18, 14]). Indexing functions are descriptions of an object
that are unaffected by the position and intrinsic parameters of the camera, and are usually
constructed using the techniques of invariant theory. As a result, these functions have
the same value for any view of a given object, and so can be used to index into a model
base without search. Indexing functions and systems that use indexing are extensively
described in [10, 11], and [16] displays the general architecture used in such systems.
To date, indexing functions have been demonstrated only for plane and polyhedral
objects. Constructing indexing functions for curved surfaces is more challenging, because
the indexing function must compute a description of the surface's shape from a single
outline. It is clear that it is impossible to recover global measures of surface shape from
a single outline if the surfaces involved are unrestricted. For example, we can disturb any
such measure by adding a bump to the side of the surface that is hidden from the viewer.
An important and unresolved question is how little structure is required for points to
yield indexing functions.
In this paper, we make progress on this question by demonstrating useful
indexing functions for surfaces which have a rotational symmetry, or are within a 3D
projectivity of a surface with a rotational symmetry. This is a large and useful class of
surfaces.
In this section, we show that lines bitangent to an image contour yield a set of indexing
functions for the surface, when the surface is either rotationally symmetric, or projectively
equivalent to a rotationally symmetric surface. This follows from a study of the properties
of the outline in a perspective image.
2.1 Geometrical properties of the outline
The outline of a surface in an image is given by a system of rays through the camera
focal point that are tangent to the surface. The points of tangency of these rays with the
surface form a space curve, called the contour generator. The geometry is illustrated in
figure 1.
Points on the contour generator are distinguished, because the plane tangent to the
surface at such points passes through the focal point (this is an alternative definition of
the contour generator). As a result, we have:
Fig. 1. The cone of rays, through the focal point and tangent to the object surface, that forms
the image outline, shown for a simple object.
Lemma: Except where the image outline cusps, a plane tangent to the surface
at a point on the contour generator (by definition, such a plane passes through
the focal point), projects to a line tangent to the surface outline, and conversely,
a line tangent to the outline is the image of a plane tangent to the surface at the
corresponding point on the contour generator.
As a corollary, we have:
This yields useful relationships between outline properties and surface properties. For
example:
The lemma and both corollaries follow immediately from considering figure 2. Generic
surfaces admit one-parameter systems of bitangent planes, so we can expect to observe
and exploit intersections between these planes.
One case in which the intersections are directly informative occurs when the surface
is rotationally symmetric. The envelope of the bitangent planes must be either a right
circular cone, or a cylinder with circular cross-section (this is a right circular cone whose
vertex happens to be at infinity). We shall draw no distinction between vertices at infinity
and more accessible vertices, and refer to these envelopes as bitangent cones. These
comments lead to the following
Key result: The vertices of these bitangent cones must lie on the axis (by
symmetry), and so are collinear. Assuming the focal point lies outside the surface,
as figure 2 shows, the vertices of the bitangent cones can be observed in an image.
The vertices appear as the intersection of a pair of lines bitangent to the outline.
Fig. 2. A rotationally symmetric object, and the planes bitangent to the object and passing
through the focal point, are shown. It is clear from the figure that the intersection of these
planes is a line, also passing through the focal point. Each plane appears as a line in the image;
the intersection of the planes appears as a point, which is the image of the vertex of the bitangent
cone. Note in particular that the image outline has no symmetry. This is the generic case.
As a result, if the surface has four or more bitangent cones, the vertices yield a system
of four or more collinear points, lying on the axis of the surface. These points project to
points that are collinear, and lie on the image of the axis of symmetry. These points can
be measured in the image. This fact yields two important applications:
- Cross-ratios of the image points, defined below, yield indexing functions for the sur-
face, which can be determined from the outline alone.
- The image points can be used to construct the image of the axis of a rotationally
symmetric surface from its outline.
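Numerically, the construction is elementary: in homogeneous coordinates the line through two points and the intersection of two lines are both cross products, and collinearity of the recovered vertices is a determinant test. The sketch below uses hand-chosen bitangent lines standing in for lines fitted to a real outline:

```python
import numpy as np

def line_through(p, q):
    """Homogeneous line through two image points (x, y)."""
    return np.cross([p[0], p[1], 1.0], [q[0], q[1], 1.0])

def intersect(l1, l2):
    """Intersection of two homogeneous lines, as an inhomogeneous point."""
    v = np.cross(l1, l2)
    return v[:2] / v[2]

def collinear(a, b, c, tol=1e-9):
    """True if three image points lie on a common line."""
    m = np.array([[a[0], a[1], 1.0], [b[0], b[1], 1.0], [c[0], c[1], 1.0]])
    return abs(np.linalg.det(m)) < tol

# Hypothetical bitangent pairs; each pair meets at the image of a
# bitangent-cone vertex.  The tangency points are chosen so that the
# vertices fall on a common line, the image of the axis of symmetry.
pairs = [
    (line_through((0.0, 0.0), (2.0, 4.0)), line_through((4.0, 0.0), (2.0, 4.0))),
    (line_through((0.5, 1.0), (2.0, 7.0)), line_through((3.5, 1.0), (2.0, 7.0))),
    (line_through((0.8, 2.0), (2.0, 10.0)), line_through((3.2, 2.0), (2.0, 10.0))),
]
vertices = [intersect(l1, l2) for l1, l2 in pairs]
```

Fitting a line through the recovered vertices then gives the image of the axis directly.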
The second point can be used to extend work such as that of Brady and Asada [1] on
symmetries of frontally viewed plane curves to considering surface outlines.
We concentrate on the first point in this paper. The map taking these points to their
corresponding image points, is a projection of the line, and so the projective invariants of
a system of points yield indexing functions. A set of four collinear points A, B, C, D has
a projective invariant known as its cross ratio, given by:
(AC)(BD) / (AD)(BC)
where AB denotes the linear distance from A to B. The cross ratio is well known to be
invariant to projection, and is discussed in greater detail in [10]. The cross ratio depends
on the order in which the points are labeled. If the labels of the four points are permuted,
a different value of the cross ratio results. Of the 24 different labeling possibilities, only
6 yield distinct values. A symmetric function of cross ratios, known as a j-invariant, is
invariant to the permutation of its arguments as well as to projections [11]. Since a
change in camera parameters simply changes the details of the projection of the points,
but does not change the fact that the map is a projection, these cross-ratios are invariant
to changes in the camera parameters.
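As an illustrative check (not the authors' implementation), the cross ratio can be computed from 1-D coordinates along the line, and both its invariance under a projectivity of the line and the six distinct values under relabelling verified directly:

```python
from itertools import permutations

def cross_ratio(a, b, c, d):
    """Cross ratio (AC)(BD)/((AD)(BC)) of four collinear points, given by
    their 1-D coordinates along the line."""
    return ((c - a) * (d - b)) / ((d - a) * (c - b))

def project(x, h):
    """A projectivity of the line, x -> (h0*x + h1)/(h2*x + h3)."""
    return (h[0] * x + h[1]) / (h[2] * x + h[3])

pts = [0.0, 1.0, 3.0, 7.0]
cr_before = cross_ratio(*pts)
cr_after = cross_ratio(*(project(x, (2.0, 1.0, 0.5, 3.0)) for x in pts))
assert abs(cr_before - cr_after) < 1e-9      # invariant under projection

# The 24 labelings of the points give only 6 distinct cross-ratio values:
distinct = {round(cross_ratio(*p), 9) for p in permutations(pts)}
assert len(distinct) == 6
```

A j-invariant is then any symmetric function of these six values, which is why it is unaffected by the labeling of the points.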
A further result follows from the symmetries in figure 2. It is possible to show that,
although the outline does not, in general, have a symmetry, it splits into two components,
which are within a plane projectivity of one another. This means that the techniques for
computing projective invariants of general plane curves described in [15], can be used to
group corresponding outline segments within an image.
2.2 Indexing functions from images of real objects
We demonstrate that the values of the indexing functions we have described are stable
under a change in viewing position, and are different for different objects. In figure 3, we
show two images each of two lampstands, taken from different viewpoints. These images
demonstrate that the outline of a rotationally symmetric object can be substantially
affected by a change in viewpoint. Three images of each lampstand were taken in total,
including these images. For each series of images of each lampstand, bitangents were
constructed by hand, and the bitangents are shown overlaid on the images. The graph
in figure 4 shows the cross-ratios computed from the vertices in each of three images of
each of two lampstands. The values of the cross ratio are computed for only one ordering
of the points, to prevent confusion. As predicted in [5], the variance of the larger cross-
ratio is larger; this effect is discussed in [5], and is caused by the way the measurements
are combined in the cross-ratio. The results are easily good enough to distinguish between
the lampstands, from their outlines alone.
This approach can be generalized in two ways. Firstly, there are other sources of vertices
than bitangent lines. Secondly, the geometrical construction described works for a wider
range of surfaces than the rotationally symmetric surfaces. We will demonstrate a range
of other sources of vertices assuming that the surface is rotationally symmetric, and then
generalize all the constructions to a wider range of surfaces in one step.
Fig. 3. This figure shows two views each of two different lampstands. Bitangents, computed by
hand from the outlines, are overlaid.
3.1 Other sources of vertices
- The tangents at a crease or an ending in the outline: We assume that we can
distinguish between a crease in the outline, which arises from a crease in the surface,
and a double point of outline, which is a generic event that may look like a crease.
In this case, these tangents are the projections of planes tangent to the surface, at a
crease in the surface.
- A tangent that passes through an ending in the outline: These are projections
of planes that are tangent to the surface, and pass through an ending in the surface.
- Inflections of the outline: These are projections of planes which have three-point
contact with the surface.
In each case, there is a clear relationship between the tangent to the outline and a plane
tangent to the surface, and the envelope of the system of planes tangent to the surface
and having the required property, is a cone with a vertex along the axis. These sources
of information are demonstrated in figure 5. These results can be established by a simple
modification of the argument used in the bitangent case.
3.2 Generalizing to a wider class of surfaces
Fig. 4. A graph showing one value of the cross-ratio of the vertex points for three different
images, taken from differing viewpoints, each of two different vases. This figure clearly shows
that the values of the indexing functions computed are stable under change of view, and change
for different objects, and so are useful descriptors of shape. As expected (from the discussion
in [5]), the variance in a measurement of the cross ratio increases as its absolute value increases.
to the new points of tangency. The other properties are preserved because projectivities
preserve incidence and multiplicity of contact.
These results mean that if we take a rotationally symmetric surface, and apply a
projective mapping, the cones and vertices we obtain from the new surface are just the
images of the cones and vertices constructed from the old surface. Since a projective
mapping takes a set of collinear points to another set of collinear points, we can still
construct indexing functions from these points. This means that, for our constructions
to work, the surface need only be projectively equivalent to a rotationally symmetric
surface. One example of such a surface would be obtained by squashing a rotationally
symmetric surface so that its cross section was an ellipse, rather than a circle. This result
substantially increases the class of surfaces for which we have indexing functions that
can be determined from image information.
A further generalisation is possible. The cross-ratio is a remarkable invariant that
applies to sets of points lying on a wide range of algebraic curves. If a curve supports a
one-to-one parametrisation using rational functions, a cross-ratio can be computed for a
set of four points on that curve. This follows because one can use the parametrisation
to compute those points on the line that map to the points distinguished on the curve,
and then take the cross-ratio of the points on the line. Curves that can be parametrised
are also known as curves with genus zero. There is a wide range of such curves; some
examples in the plane include a plane conic, a cubic with one double point and a quartic
with either one triple point or three double points. In space, examples include the twisted
Fig. 5. The known cases that produce usable coaxial vertices. Note that although this figure
appears to have a reflectional symmetry, this is not a generic property of the outline of a
rotationally symmetric object.
cubic, and curves of the form (t, p(t), q(t), 1), where p and q are polynomials (the points
at infinity are easily supplied).
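The idea can be illustrated on the simplest genus-zero curve, a conic. Under the rational parametrisation t -> (t, t^2) of a parabola, the cross-ratio of four points is the cross-ratio of their parameters, and by Chasles' theorem it also equals the cross-ratio of the pencil of chords from any fifth point of the conic. The sketch below illustrates the principle only, not the recognition system itself:

```python
def cross_ratio(a, b, c, d):
    """Cross ratio of four values of a rational parametrisation."""
    return ((c - a) * (d - b)) / ((d - a) * (c - b))

def chord_slope(s, t):
    """Slope of the chord of the parabola y = x^2 between parameters s and t:
    (t^2 - s^2) / (t - s) = s + t."""
    return s + t

ts = [-1.0, 0.5, 2.0, 3.0]
cr = cross_ratio(*ts)

# From any fifth point s on the conic, the four chords form a pencil of
# lines; the cross-ratio of that pencil (computed here from the chord
# slopes) equals the cross-ratio of the parameters, independent of s:
for s in (-3.0, 1.25, 4.0):
    assert abs(cross_ratio(*(chord_slope(s, t) for t in ts)) - cr) < 1e-9
```

The same parameter cross-ratio survives projection of the curve into an image, which is exactly the property the recognition argument relies on.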
Remarkably, the resulting cross-ratio is invariant to projectivities and to projection
from space onto the plane (in fact, to any birational mapping). This means that, for ex-
ample, if we were to construct a surface for which the bitangent vertices or other similarly
identifiable points lie on such a curve, that surface could easily be recognised from its out-
line in a single image, because the cross-ratio is defined, and is preserved by projection.
Recognition would proceed by identifying the image of the bitangent vertices, identifying
the image of the projected curve that passes through these points, and computing the
cross-ratio of these points on that curve, which would be an invariant.
Since there is a rich range of curves with genus zero, this offers real promise as
a modelling technique which has the specific intent of producing surface models that
are both convincing models of an interesting range of objects, and intrinsically easy to
recognise.
4 Conclusions
We have constructed indexing functions, which rely only on image information, for a
useful class of curved surfaces. We have shown these functions to be useful in identifying
curved objects in perspective images of real scenes.
This work has further ramifications. It is possible to use these techniques to determine
whether an outline is the outline of a rotationally symmetric object, and to determine
the image of the axis of the object. As a result, it is possible to take existing investiga-
tions of the symmetry properties of image outlines, and extend them to consider surface
properties, measured in a single image, in a principled way.
As we have shown, image information has deep geometric structure that can be ex-
ploited for recognition. In fact, image outlines are so rich that recent work at Iowa [6] has
shown that a generic algebraic surface of any degree can be recovered up to a projective
mapping, from its outline in a single image.
References
1. Brady, J.M. and Asada, H., "Smoothed Local Symmetries and their implementation,"
IJRR-3, 3, 1984.
2. Brooks, R. A., "Model-Based Three-Dimensional Interpretations of Two Dimensional Im-
ages," IEEE PAMI, Vol. 5, No. 2, p. 140, 1983.
3. Canny J.F. "Finding Edges and Lines in Images," TR 720, MIT AI Lab, 1983.
4. Dhome, M., LaPreste, J.T, Rives, G., and Richetin, M. "Spatial localisation of modelled
objects in monocular perspective vision," Proc. First European Conference on Computer
Vision, 1990.
5. D.A. Forsyth, J.L. Mundy, A.P. Zisserman, A. Heller, C. Coehlo and C.A. Rothwell (1991),
"Invariant Descriptors for 3D Recognition and Pose," IEEE Trans. Patt. Anal. and Mach.
Intelligence, 13, 10.
6. Forsyth, D.A., "Recognising an algebraic surface by its outline," Technical report, Univer-
sity of Iowa Department of Computer Science, 1992.
7. H. Freeman and R. Shapira, "Computer Recognition of Bodies Bounded by Quadric Sur-
faces from a set of Imperfect Projections," IEEE Trans. Computers, C27, 9, 819-854, 1978.
8. Koenderink, J.J. Solid Shape, MIT Press, 1990.
9. Malik, J., "Interpreting line drawings of curved objects," IJCV, 1, 1987.
10. J.L. Mundy and A.P. Zisserman, "Introduction," in J.L. Mundy and A.P. Zisserman (ed.s)
Geometric lnvariance in Computer Vision, MIT Press, 1992.
11. J.L. Mundy and A.P. Zisserman, "Appendix" in J.L. Mundy and A.P. Zisserman (ed.s)
Geometric Invariance in Computer Vision, MIT Press, 1992.
12. Ponce, J. "Invariant properties of straight homogeneous generalized cylinders," IEEE Trans.
Patt. Anal. Mach. Intelligence, 11, 9, 951-965, 1989.
13. J. Ponce and D.J. Kriegman (1989), "On recognising and positioning curved 3-dimensional
objects from image contours," Proc. DARPA IU Workshop, pp. 461-470.
14. Rothwell, C.A., Zisserman, A.P., Forsyth, D.A. and Mundy, J.L., "Using Projective In-
variants for constant time library indexing in model based vision," Proc. British Machine
Vision Conference, 1991.
15. Rothwell, C.A., Zisserman, A.P., Forsyth, D.A. and Mundy, J.L., "Canonical frames for
planar object recognition," Proc. 2nd European Conference on Computer Vision, Springer
Lecture Notes in Computer Science, 1992.
16. Rothwell, C.A., Zisserman, A.P., Forsyth, D.A. and Mundy, J.L., "Fast Recognition using
Algebraic Invariants," in J.L. Mundy and A.P. Zisserman (ed.s) Geometric Invariance in
Computer Vision, MIT Press, 1992.
17. Terzopoulos, D., Witkin, A. and Kass, M. "Constraints on Deformable Models: Recovering
3D Shape and Nonrigid Motion," Artificial Intelligence, 36, 91-123, 1988.
18. Wayner, P.C. "Efficiently Using Invariant Theory for Model-based Matching," Proceedings
CVPR, p.473-478, 1991.
This article was processed using the LaTeX macro package with ECCV92 style
Using D e f o r m a b l e Surfaces to S e g m e n t 3-D Images
and Infer Differential Structures*
Abstract
In this paper, we generalize the deformable model [4, 7] to a 3-D model, which evolves in
3-D images, under the action of internal forces (describing some elasticity properties of
the surface), and external forces attracting the surface toward some detected edgels. Our
formalism leads to the minimization of an energy which is expressed as a functional. We
use a variational approach and a finite element method to actually express the surface in a
discrete basis of continuous functions. This leads to a reduced computational complexity
and a better numerical stability.
The power of the present approach to segment 3-D images is demonstrated by a set
of experimental results on various complex medical 3-D images.
Another contribution of this approach is the possibility to infer easily the differential
structure of the segmented surface. As we end-up with an analytical description of the
surface, this allows to compute for instance its first and second fundamental forms. From
this, one can extract a curvature primal sketch of the surface, including some intrinsic
features which can be used as landmarks for 3-D image interpretation.
measures the regularity or the smoothness of the surface v. Minimizing E_smooth(v) con-
strains the surface v to be smooth. The parameters w_ij represent the mechanical proper-
ties of the surface. They determine its elasticity (w10, w01), rigidity (w20, w02) and twist
(w11). These parameters act on the shape of the function v, and are also called the
regularization parameters.
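The standard membrane-plus-thin-plate form of such an energy can be sketched with finite differences. This assumes the usual functional with these weights, which is not reproduced in this excerpt, and the paper itself uses a finite element discretization rather than the simple differences below:

```python
import numpy as np

def smoothness_energy(v, w10, w01, w11, w20, w02):
    """Finite-difference estimate of E_smooth for a discrete surface
    v[i, j] in R^3 (unit grid spacing): a weighted sum of squared first
    derivatives (elasticity), twist, and second derivatives (rigidity)."""
    vu = np.gradient(v, axis=0)
    vv = np.gradient(v, axis=1)
    vuu = np.gradient(vu, axis=0)
    vvv = np.gradient(vv, axis=1)
    vuv = np.gradient(vu, axis=1)
    sq = lambda a: float(np.sum(a * a))
    return (w10 * sq(vu) + w01 * sq(vv) + 2.0 * w11 * sq(vuv)
            + w20 * sq(vuu) + w02 * sq(vvv))

# A plane has zero twist and rigidity terms; adding a bump raises the energy.
ii, jj = np.meshgrid(np.arange(8.0), np.arange(8.0), indexing="ij")
plane = np.stack([ii, jj, 0.3 * ii + 0.1 * jj], axis=-1)
bumped = plane.copy()
bumped[4, 4, 2] += 1.0
```

Minimizing this energy (together with the external edge-attraction term) by gradient descent drives the surface toward a smooth configuration near the detected edgels.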
* This work was partially supported by Digital Equipment Corporation.
References
1. Isaac Cohen, Laurent D. Cohen, and Nicholas Ayache. Using deformable surfaces to seg-
ment 3-D images and infer differential structures. Computer Vision, Graphics, and Image
Processing: Image Understanding, 1992. In press.
2. Laurent D. Cohen and Isaac Cohen. Finite element methods for active contour models
and balloons from 2-D to 3-D. Technical Report 9124, CEREMADE, U.R.A. CNRS 749,
Université Paris IX - Dauphine, November 1991. Cahiers de Mathématiques de la Décision.
3. A. Guéziec and N. Ayache. Smoothing and matching of 3D-space curves. In Proceedings of
the Second European Conference on Computer Vision 1992, Santa Margherita Ligure, Italy,
May 1992.
4. Michael Kass, Andrew Witkin, and Demetri Terzopoulos. Snakes: Active contour models.
In Proceedings of the First International Conference on Computer Vision, pages 259-268,
London, June 1987.
5. F. Leitner and P. Cinquin. Dynamic segmentation: Detecting complex topology 3D-object.
In Proceedings of International Conference of the IEEE Engineering in Medicine and Biology
Society, pages 295-296, Orlando, Florida, November 1991.
6. O. Monga, N. Ayache, and P. Sander. From voxel to curvature. In Proc. Computer Vision
and Pattern Recognition, pages 644-649. IEEE Computer Society Conference, June 1991.
Lahaina, Maui, Hawaii.
7. Demetri Terzopoulos, Andrew Witkin, and Michael Kass. Constraints on deformable models:
recovering 3-D shape and nonrigid motion. AI Journal, 36:91-123, 1988.
4 Experimental Results
Fig. 1. In this example we use a deformable surface constrained by boundary conditions (cylin-
der type) to segment the inside cavity of the left ventricle. Overlays of some cross sections (in
grey) of the initial estimation (top) and the obtained surface (bottom) and a 3-D representation
of the inside cavity of the left ventricle.
Fig. 2. We have applied the 3-D deformable model to a Magnetic Resonance image of the head,
to segment the face. This figure represents some cross sections of the initial estimate given by
the user.
Fig. 3. Here we represent the surface, once we have reached a minimum of the energy E. Some
vertical and horizontal cross sections of the surface are given. They show an accurate localization
of the surface at the edge points.
1 Introduction
We advocate the view that the purpose of machine vision is not to reconstruct the scene
in its entirety, but rather to search for specific features that enter, via data aggregation,
a symbolic description of the scene necessary to achieve the specific task. Unfortunately,
the high degree of variability and unpredictability that is inherent in a visual signal makes
it impossible to design precise methods for detecting low-level features. Thus, almost any
output of early processing has to be treated only as a hypothesis for further processing.
In this paper we investigate a method for extracting simple geometric structures from
edge images in terms of parametric models, namely straight lines, parabolas, and ellipses.
These models satisfy the criteria for the selection of geometric representations (invari-
ance, stability, accessibility, 3-D interpretation, perceptual significance). We would like
to emphasize that our main objective is to develop a novel control strategy, by combin-
ing several existing techniques, that achieves a reliable and efficient recovery of geometric
parametric models and can serve as a powerful early vision tool for signal-to-symbol trans-
formation. The method consists of two intertwined procedures, namely model-recovery
and model-selection. The first procedure systematically recovers the models in an edge
image, creating a redundant set of possible descriptions, while the model-selection proce-
dure searches among them to produce the simplest description in terms of the criterion
function.
Due to space limitations we present only a summary of our algorithm. For a more
complete view on the procedure, experimental results, and a discussion of the related
work, the reader is referred to [3].
* The research described in this paper was supported in part by: The Ministry for Science
and Technology of The Republic of Slovenia, Project P2-1122; Navy Grant N0014-88-K-
0630, AFOSR Grants 88-0244, AFOSR 88-0296; Army/DAAL 03-89-C-0031PRI; NSF Grants
CISE/CDA 88-22719, IRI 89-06770, and ASC 91 0813; and Du Pont Corporation.
2 Model Recovery
[Figure: the model-recovery loop, alternating model extrapolation with a goodness-of-fit test.]
a model m_i consists of three terms which are subsequently passed to the model-selection
a This is due to the limited amount of reliable information that can be gathered in an initially
local area. Only a small number of parameters can be estimated in order to avoid numerically
ill-conditioned cases.
procedure:
1. the set of edge elements that belong to the model,
2. the type of the parametric model and the corresponding set of parameters of the
model, and
3. the goodness-of-fit value which describes the conformity between the data and the
model.
While this description is general, specific procedures designed to operate on individual
types of models differ significantly. This is primarily due to the increased complexity of
the nonlinear fitting process for parabolas and ellipses, which is directly related to the
choice of the Euclidean distance as the error metric. The use of the Euclidean distance
results in a goodness-of-fit measure which has a straightforward interpretation, and the
recovered models extrapolate accurately to the vicinity of the end-points. The reader is
referred to [3] where we outline the model-recovery procedures for the three types of the
models and show how to switch between them.
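For the simplest (straight-line) model, the three-term description (supporting edgels, model parameters, and a Euclidean goodness-of-fit) can be sketched as a total-least-squares fit. This is an illustration on assumed data, not the authors' recovery procedure, which also fits parabolas and ellipses nonlinearly:

```python
import numpy as np

def fit_line_tls(pts):
    """Fit a line to 2-D edge points by minimizing the summed squared
    Euclidean (orthogonal) distances: total least squares via SVD."""
    c = pts.mean(axis=0)
    _, _, vt = np.linalg.svd(pts - c)
    d = vt[0]                                    # line direction
    resid = (pts - c) @ np.array([-d[1], d[0]])  # signed orthogonal distances
    return c, d, float(np.sqrt(np.mean(resid ** 2)))  # RMS goodness of fit

# Hypothetical edge data near the line y = 2x + 1, with small noise:
rng = np.random.default_rng(0)
t = np.linspace(0.0, 5.0, 50)
pts = np.stack([t, 2.0 * t + 1.0], axis=1) + rng.normal(0.0, 0.01, (50, 2))
c, d, rms = fit_line_tls(pts)
```

The returned triple (supporting points, parameters, RMS residual) is exactly the kind of description the model-selection procedure consumes; the orthogonal-distance metric is what makes the goodness-of-fit value directly interpretable.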
3 Model Selection
where m^T = [m1, m2, ..., mM] and [c_ij], with rows (c11, ..., c1M) through (cM1, ..., cMM), is the M x M matrix entering the criterion function. (1)
5 Experimental Results
We tested our method on a variety of synthetic data as well as on real images. The two
images presented here (Figs. 2 (a) and 3 (a)), together with their respective edge images
(Figs. 2 (b) and 3 (b)), obtained with the Canny edge detector, were kindly supplied by
Dr. Etemadi from the University of Surrey. Fig. 2 (c) shows the initial curve segments
Fig. 2. (a) Original image, (b) Edge-image, (c) Seed image, and (d) Reconstructed image
(seeds). Note that they are not placed on or near the intersections or junctions of the
edges. Besides, they do not appear in the areas with a high density of edge elements
(twisted cord). The size of the initial windows determines the scale (and the resolution)
on which the elements are anticipated to appear. If two lines fall into the same window, a
consistent initial estimate will not be found. One of the solutions would be to decrease the
size of the windows or to resort to orientation dependent windows. However, a missing
seed seldom poses a serious problem since usually only a few seeds are sufficient to
properly recover the complete curve. Of course, curves which are not initiated by any
seed at all will not appear in the final description. The final result is shown in Fig. 2 (d).
We observe that the procedure is robust with respect to noise (minor edge elements
scattered in the image). A standard approach utilizing a blind linking phase to classify
data points without support from models would encounter numerous problems. Besides,
the procedure determines its domain of applicability since it does not describe heavily
textured areas (densely distributed curves) in the image. Due to the redundancy present
in the scheme, the method degrades gracefully if the assumptions made by the choice
of the primitives are not met. A similar situation arises when the estimation about the
anticipated scale or resolution is not correct. Numerous small segments signal that a
different kind of models should be invoked or that the scale should be changed (the dial
in Fig. 2).
Fig. 3. (a) Original image, (b) Edge-image, (c) Seed image, and (d) Reconstructed image
In Fig. 3 (c) we show the seeds. Some of the seeds along the parallel lines are missing
due to the grid placement. Nevertheless, the lines are properly recovered, as shown in
Fig. 3 (d).
6 Conclusions
The method for extracting parametric geometric structures is a tool that has already
proven useful to other tasks in computer vision [4]. It offers several possible extensions by
using other types of models. Moreover, the same principle can be extended to operate on a
hierarchy of different models which would lead to the recovery of more and more abstract
structures. Besides, the scheme is inherently parallel and can easily be implemented on
a massively parallel machine.
References
1. Besl, P. J.: Surfaces in Range Image Understanding. Springer-Verlag, (1988)
2. Chen, D. S.: A data-driven intermediate level feature extraction algorithm. IEEE Transac-
tion on Pattern Analysis and Machine Intelligence. 11 (1989) 749-758
3. Leonardis, A.: A search for parametric curves in an image. Technical Report LRV-91-7.
Computer Vision Laboratory, University of Ljubljana, (1991)
4. Leonardis, A., Gupta, A., and Bajcsy, R.: Segmentation as the search for the best description
of the image in terms of primitives. In The Third International Conference on Computer
Vision. Osaka, Japan, (1990) 121-125
This article was processed using the LaTeX macro package with ECCV92 style
Determining Three-Dimensional Shape from
Orientation and Spatial Frequency Disparities *
1 McGill University, Dept. of Electrical Engineering, Montr6al, PQ, Canada H3A 2A7
2 University of California, Berkeley, Computer Science Division, Berkeley, CA USA 94720
1 Introduction
Stereopsis has traditionally been viewed as a source of depth information. In two views
of a three-dimensional scene, small positional disparities between corresponding points in
the two images give information about the relative distances to those points in the scene.
Viewing geometry, when it is known, provides the calibration function relating disparity
to absolute depth. To describe three-dimensional shape, the surface normal, n(x, y), can
then be computed by differentiating the interpolated surface z(x, y). In practice, any
inaccuracies present in disparity estimates will be compounded by taking derivatives.
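To see the effect, one can sketch the computation of normals from a reconstructed depth map and its sensitivity to noise. Grid spacing, slant, and noise level below are illustrative assumptions:

```python
import numpy as np

def normals_from_depth(z):
    """Unit surface normals n = (-z_x, -z_y, 1)/|n| from a depth map z[y, x]
    sampled on a unit grid, obtained by differentiating the depth."""
    zy, zx = np.gradient(z)
    n = np.stack([-zx, -zy, np.ones_like(z)], axis=-1)
    return n / np.linalg.norm(n, axis=-1, keepdims=True)

# Differentiation compounds noise: compare the normals of a clean slanted
# plane with those of the same plane after adding small depth noise.
yy, xx = np.meshgrid(np.arange(32.0), np.arange(32.0), indexing="ij")
z = 0.5 * xx                       # plane slanted about the vertical axis
rng = np.random.default_rng(1)
n_clean = normals_from_depth(z)
n_noisy = normals_from_depth(z + rng.normal(0.0, 0.05, z.shape))
err = np.degrees(np.arccos(np.clip((n_clean * n_noisy).sum(-1), -1.0, 1.0)))
```

Even depth noise that is small relative to the slant produces normal errors of several degrees, which is the motivation for cues that measure surface orientation directly.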
However, there are other cues available under binocular viewing that can provide di-
rect information about surface orientation. When a surface is not fronto-parallel, surface
markings or textures will be imaged with slightly different orientations and degrees of
foreshortening in the two views (Fig. 1). These orientation and spatial frequency dispar-
ities are systematically related to the local three-dimensional surface orientation. It has
been demonstrated that humans are able to exploit these cues, when present, to more
* This work has been supported by a grant to DJ from the Natural Sciences and Engineering
Research Council of Canada (OGP0105912) and by a National Science Foundation PYI award
(IRI-8957274) to JM.
accurately determine surface orientation (Rogers and Cagenello, 1989). In stimuli consist-
ing of uncorrelated dynamic visual noise, filtered to contain a certain spatial frequency
band, the introduction of a spatial frequency disparity or orientation disparity leads to
the perception of slant, despite the absence of any systematic positional disparity cue
(Tyler and Sutter, 1979; von der Heydt et al., 1981). In much the same way that random
dot stereograms confirmed the existence of mechanisms that make use of horizontal dis-
parities (Julesz, 1960), these experiments provide strong evidence that the human visual
system possesses a mechanism that can and does make use of orientation and spatial
frequency disparities in the two retinal images to aid in the perception of surface shape.
Fig. 1. Stereo pair of a planar surface tilted in depth. Careful comparison of the two views
reveals slightly different orientation and spacing of corresponding grid lines.
There has been very little work investigating the use of these cues in computational
vision. In fact, it is quite common in computational stereo vision to simply ignore the
orientation and spatial frequency differences, or image distortions, that occur when view-
ing surfaces tilted in depth. These differences are then a source of error in computational
schemes which try to find matches on the assumption that corresponding patches (or
edges) must be identical or very nearly so. Some approaches acknowledge the existence
of these image distortions, but still treat them as noise to be tolerated, as opposed to an
additional signal that may be exploited (Arnold and Binford, 1980; Kass, 1983; Kass, 1987).
A few approaches seek to cope using an iterative framework, starting from an initial
assumption that disparity is locally constant, and then guessing at the parameters of the
image distortion to locally transform and compensate so that image regions can again be
compared under the assumption that corresponding regions are merely translated copies
of one another (Mori et al., 1973; Quam, 1984; Witkin et al., 1987). The reliance of this
procedure on convergence from inappropriate initial assumptions and the costly repeated
"warping" of the input images make this an unsatisfactory computational approach and
an unlikely mechanism for human stereopsis.
This paper describes a novel computational method for directly recovering surface
orientation by exploiting these orientation and spatial disparity cues. Our work is in
the framework of a filter-based model for computational stereopsis (Jones, 1991; Jones
and Malik, 1992) where the outputs of a set of linear filters at a point are used for
matching. The key idea is to model the transformation from one image to the other
locally as an affine transformation with two significant parameters, Hx, Hy, the gradient
of horizontal disparity. Previous work has sought to recover the deformation component
instead (Koenderink and van Doorn, 1976).
For the special case of orientation disparity, Wildes (1991) has an alternative approach
based on determining surface orientation from measurements on three nearby pairs of
corresponding line elements (Canny edges). Our approach has the advantage that it
treats both orientation and spatial frequency disparities. Another benefit, similar to least
squares fitting, is that it makes use of all the data. While measurements on three pairs may be
Consider the appearance of a small planar surface patch, ruled with a series of evenly
spaced parallel lines (Fig. 2A). The results obtained will apply when considering orienta-
tion and spatial frequencies of general texture patterns. To describe the parameters of an
arbitrarily oriented plane, start with a unit vector pointing along the x-axis. A rotation
φz around the z-axis describes the orientation of the surface texture. Rotations of φx
around the x-axis, followed by φy around the y-axis, combine to allow any orientation of
the surface itself. The three-dimensional vector v resulting from these transformations
indicates the orientation of the lines ruled on the surface and can be written concisely:

v = ( cos φy cos φz + sin φy sin φx sin φz ,  cos φx sin φz ,  −sin φy cos φz + cos φy sin φx sin φz )
In order to consider orientation and spatial frequency disparities, this vector must be
projected onto the left and right image planes. In what follows, orthographic projection
will be used, since it provides a very close approximation to perspective projection,
especially for the small surface patches under consideration and when line spacing is
small relative to the viewing distance. The projection of v onto the left image plane is
achieved by replacing φy with φy + Δφy (where Δφy = tan⁻¹(b/2d)), and then discarding
the z component to give the two-dimensional image vector vl. Similarly, replacing φy with
φy − Δφy gives vr, the projection of v onto the right image plane.
Fig. 2. Differences in two views of a tilted surface. A. A planar surface is viewed at a distance
d, from two vantage points separated by a distance b. Three-dimensional vectors lie parallel (v)
and perpendicular (w) to a generic surface texture (parallel lines). Arbitrary configurations are
achieved by rotations φz, φx, and φy, in that order. Different viewpoints are handled by adding
an additional rotation ±Δφy, where Δφy = tan⁻¹(b/2d). B. Resulting two-dimensional image
textures are described by orientation, θ, and spacing, λ. Orientation disparity, θr − θl, and spatial
frequency disparity, λl/λr, are systematically related to surface orientation, φx, φy.
Let θl and θr be the angles the image vectors vl and vr make with the x-axis (Fig. 2B).
These can be easily expressed in terms of the components of the image vectors.

tan θl = cos φx tan φz / ( sin φx sin(φy + Δφy) tan φz + cos(φy + Δφy) )

with tan θr given by the same expression with Δφy replaced by −Δφy. This enables us to
determine the orientation disparity, θr − θl, given the pattern orientation φz, the 3-D
surface orientation characterized by φx, φy, and the view angle Δφy.
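As a sanity check on these relations, the rotation sequence and orthographic projection can be carried out numerically. The sketch below (plain Python; the pose values and the 6.5 cm baseline / 50 cm viewing distance are illustrative choices of ours, not the paper's) builds the line direction by the three rotations, projects it into each eye, and reads off the orientation disparity:

```python
import math

def rotate_xy(v, phi_x, phi_y):
    # Rotate v by phi_x about the x-axis, then by phi_y about the y-axis.
    x, y, z = v
    y, z = (y * math.cos(phi_x) - z * math.sin(phi_x),
            y * math.sin(phi_x) + z * math.cos(phi_x))
    x, z = (x * math.cos(phi_y) + z * math.sin(phi_y),
            -x * math.sin(phi_y) + z * math.cos(phi_y))
    return x, y, z

def image_angle(phi_z, phi_x, phi_y):
    # Line direction: unit x-vector rotated by phi_z about z, then as above.
    # Orthographic projection drops z; theta is measured from the x-axis.
    vx, vy, _ = rotate_xy((math.cos(phi_z), math.sin(phi_z), 0.0),
                          phi_x, phi_y)
    return math.atan2(vy, vx)

# Illustrative pose and viewing geometry (b = 6.5 cm baseline, d = 50 cm).
phi_z, phi_x, phi_y = 0.5, 0.3, 0.2
dphi = math.atan(0.065 / (2 * 0.5))
theta_l = image_angle(phi_z, phi_x, phi_y + dphi)
theta_r = image_angle(phi_z, phi_x, phi_y - dphi)
orientation_disparity = theta_r - theta_l
```

The projected angle agrees with the closed-form tan θl expression above, which provides a direct numerical check of the reconstruction of v.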
Let λl, λr be the spacings, and fl = 1/λl, fr = 1/λr be the spatial frequencies, of
the lines in the left and right images (Fig. 2B). Since spatial frequency is measured
perpendicular to the lines in the image, a new unit vector w, perpendicular to v, is
introduced to indicate the spacing between the lines. An expression for w can be obtained
from the expression for v by replacing φz with φz + 90°. When these three-dimensional
vectors, v and w, are projected onto an image plane, they generally do not remain
perpendicular (e.g., vl and wl in Fig. 2B). If we let vl⊥ = (−vly, vlx), then ul = vl⊥/||vl||
is a unit vector perpendicular to vl. The length of the component of wl parallel
to ul is equal to λl, the line spacing in the left image.

λl = wl · vl⊥ / ||vl||
Substituting expressions for vl and wl gives an expression for the numerator, and a simple
expression for the denominator can be found in terms of θl.

wl · vl⊥ = cos φx cos(φy + Δφy) ;   ||vl|| = |cos φx sin φz| / sin θl

Combining these with similar expressions for λr gives a concise expression for spatial
frequency disparity.

λl/λr = fr/fl = (wl · vl⊥ / ||vl||) · (||vr|| / wr · vr⊥) = |cos(φy + Δφy) sin θl| / |cos(φy − Δφy) sin θr|
To determine spatial frequency disparity from a given pattern orientation φz, surface
orientation φx, φy, and view angle Δφy, this equation and the previous ones determining
θl, θr are all that are needed.
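The spacing computation can be sketched numerically as well. The block below (our own minimal illustration, with made-up pose values) builds v and w by the same rotations, projects them, computes λ = w · v⊥ / ||v|| in each view, and takes the ratio:

```python
import math

def project(phi_z, phi_x, phi_y):
    # Rotate a unit vector at angle phi_z in the x-y plane by phi_x about x,
    # then by phi_y about y; keep only the image-plane (x, y) components.
    x, y, z = math.cos(phi_z), math.sin(phi_z), 0.0
    y, z = (y * math.cos(phi_x) - z * math.sin(phi_x),
            y * math.sin(phi_x) + z * math.cos(phi_x))
    x = x * math.cos(phi_y) + z * math.sin(phi_y)
    return x, y

def spacing_and_angle(phi_z, phi_x, phi_y):
    vx, vy = project(phi_z, phi_x, phi_y)                 # along the lines
    wx, wy = project(phi_z + math.pi / 2, phi_x, phi_y)   # across the lines
    norm = math.hypot(vx, vy)
    lam = (wx * -vy + wy * vx) / norm   # lambda = w . v_perp / ||v||
    return lam, math.atan2(vy, vx)

phi_z, phi_x, phi_y, dphi = 0.4, 0.25, 0.3, 0.065
lam_l, th_l = spacing_and_angle(phi_z, phi_x, phi_y + dphi)
lam_r, th_r = spacing_and_angle(phi_z, phi_x, phi_y - dphi)
ratio = lam_l / lam_r   # spatial frequency disparity f_r / f_l
```

The computed ratio matches the closed-form |cos(φy + Δφy) sin θl| / |cos(φy − Δφy) sin θr| above, confirming the reconstructed numerator and denominator term by term.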
For solving the inverse problem (i.e., determining surface orientation), it has been
shown that from the orientations θl, θr and θ′l, θ′r of two corresponding line elements (or
θl, θr and λl, λr for parallel lines), the three-dimensional surface normal can be recovered
(Jones, 1991). If more observations are available, they can be exploited using a least
squares algorithm. This is based on the following expression:

tan φx = [ cos φy cos(Δφy) (tan θri − tan θli) + sin φy sin(Δφy) (tan θri + tan θli) ] / [ sin(2Δφy) tan θri tan θli ]
       = ai cos φy + bi sin φy

This has the convenient interpretation that for a given surface orientation, all the ob-
servations (ai, bi) should lie along a straight line whose orientation gives φy and whose
perpendicular distance from the origin is tan φx. Details of the derivation and experi-
mental results may be found in (Jones and Malik, 1991).
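A minimal numerical illustration of this inverse step, using two synthetic line elements generated from a known pose (all values are ours; note that the line fit determines φy only modulo π, and the atan2 branch below suits this particular configuration):

```python
import math

def image_theta(phi_z, phi_x, phi_y):
    # Orthographic image angle of a line element (the tan(theta) formula
    # given in the text).
    num = math.cos(phi_x) * math.tan(phi_z)
    den = (math.sin(phi_x) * math.sin(phi_y) * math.tan(phi_z)
           + math.cos(phi_y))
    return math.atan2(num, den)

def observation(phi_z, phi_x, phi_y, dphi):
    # One (a_i, b_i) pair from the left/right angles of a line element.
    tl = math.tan(image_theta(phi_z, phi_x, phi_y + dphi))
    tr = math.tan(image_theta(phi_z, phi_x, phi_y - dphi))
    s = math.sin(2 * dphi) * tr * tl
    return (math.cos(dphi) * (tr - tl) / s,
            math.sin(dphi) * (tr + tl) / s)

phi_x, phi_y, dphi = 0.3, 0.2, 0.065
a1, b1 = observation(0.5, phi_x, phi_y, dphi)   # two different
a2, b2 = observation(1.1, phi_x, phi_y, dphi)   # texture orientations
# Both points lie on the line a*cos(phi_y) + b*sin(phi_y) = tan(phi_x);
# two of them determine it.
phi_y_hat = math.atan2(-(a1 - a2), b1 - b2)
phi_x_hat = math.atan(a1 * math.cos(phi_y_hat) + b1 * math.sin(phi_y_hat))
```

With more than two observations, the same line can instead be fitted in the (a, b) plane by least squares, as the text describes.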
In Section 4 we present an alternative solution which does not depend on the identifi-
cation of corresponding line elements, but simply on the output of a set of linear spatial
filters. To develop a solution in a filter-based framework, the next section first re-casts
the information present in orientation and spatial frequency disparities in terms of the
disparity gradient.
Consider a region of a surface visible from two viewpoints. Let P = (x, y) be the coor-
dinates of a point within the projection of this region in one image, and P′ = (x′, y′) be
the corresponding point in the other image. If this surface is fronto-parallel, then P and
P′ differ only by horizontal and vertical offsets H, V throughout this region: the image
patch in one view is merely a translated version of its corresponding patch in the other
view. If the surface is tilted or curved in depth then the corresponding image patches will
not only be translated, but will also be distorted. For this discussion, it will be assumed
that this distortion is well-approximated by an affine transformation.
Hx, Hy, Vx, Vy specify the linear approximation to the distortion and are zero when the
surface is fronto-parallel. For planar surfaces under orthogonal projection, the transfor-
mation between corresponding image patches is correctly described by this affine trans-
formation. For curved surfaces under perspective projection, this provides the best linear
approximation. The image patch over which this needs to be a good approximation is
the spatial extent of the filters used.
The vertical disparity V is relatively small under most circumstances and the vertical
components of the image distortion are even smaller in practice. For this reason, it will be
assumed that Vx, Vy = 0, leaving Hx, which corresponds to a horizontal compression or
expansion, and Hy, which corresponds to a vertical skew. In both cases, texture elements
oriented near vertical are most affected. It should also be noted that the use of Hx, Hy
differs from the familiar Burt-Julesz (1980) definition of disparity gradient, which is with
respect to a cyclopean coordinate system.
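The paper's explicit expressions for the distortion are not reproduced in this excerpt; the sketch below only illustrates the assumed affine form x′ = (1 + Hx)·x + Hy·y + H, y′ = y, in which Hx acts as a horizontal compression or expansion and Hy as a skew of near-vertical structure:

```python
def warp(x, y, H, Hx, Hy):
    # Assumed affine form of the left-to-right patch map: horizontal offset H
    # plus the horizontal disparity gradient (Hx, Hy); vertical is unchanged.
    return ((1 + Hx) * x + Hy * y + H, y)

# Hx != 0 compresses/expands horizontally; Hy != 0 shears vertical texture.
compressed = warp(2.0, 0.0, 0.0, -0.1, 0.0)
sheared = warp(0.0, 2.0, 0.0, 0.0, 0.25)
```

A purely horizontal texture edge (y varying, x fixed) is displaced by Hy·y, while a vertical extent is rescaled by (1 + Hx), matching the qualitative description in the text.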
Setting aside positional correspondence for the moment, since it has to do with relative
distance to the surface and not its orientation, this leaves the following:
If we are interested in how a surface, or how the tangent plane to the surface, is tilted
in depth, then the critical parameters are Hx and Hy. If they could be measured, then
the surface orientation could be estimated, up to some factor related to the angular
separation of the eyes. For a planar surface with orientation φx, φy, the image distortion
is given by:
These are the parameters for moving from the left view to the right view. To go in
the other direction requires the inverse transformation. This can be computed either by
changing the sign of Δφy in the above equations to interchange the roles of the two
viewpoints, or equivalently, the inverse of the transformation matrix can be computed
directly. Compression and skew depend on the angular separation 2Δφy of the viewpoints
and are reduced as this angle decreases, since this is the angle subtended by the view-
points, relative to a point on the surface. More distant surfaces lead to a smaller angle,
making it more difficult to judge their inclination.
4 Surface Shape from Differences in Spatial Filter Outputs
We have been developing a new stereo algorithm based on the outputs of linear spatial
filters at a range of orientations and scales. The collection of filter responses at a position
in the image (the filter response vector, v = FᵀI), provides a very rich description of
the local image patch and can be used as the basis for establishing stereo correspondence
(Jones and Malik, 1992). For slanted surfaces, however, even corresponding filter response
vectors will differ, but in a way related to surface orientation. Such differences would
normally be treated as noise in other stereo models.
From filter responses in the right image, we could, in principle, reconstruct the image
patch using a linear transformation, namely the pseudo-inverse (for details and exam-
ples, Jones and Malik, 1992). For a particular surface slant, specified by Hx, Hy, we could
predict what the image should look like in the other view, using another linear trans-
formation - - the affine transformation discussed earlier. A third linear transformation
would predict the filter responses in the other view (Fig. 3).
v′L = Fᵀ · T_{Hx,Hy} · (Fᵀ)⁻¹ · vR = M_{Hx,Hy} · vR
Here the notation (Fᵀ)⁻¹ denotes a pseudo-inverse. This sequence of transformations
can, of course, be collapsed into a single one, M_{Hx,Hy}, that maps filter responses from
one view directly to a prediction for filter responses in the other view. These M matrices
depend on Hx and Hy but not on the input images, and can be pre-computed once, ahead
of time. A biologically plausible implementation of this model would be based on units
coarsely tuned in positional disparity, as well as the two parameters of surface slant.
Fig. 3. Filter response vectors of the two views of a surface are related by a single linear
map: v′L = M vR.
This provides a simple procedure for estimating the disparity gradient (surface orien-
tation) directly from vR and vL, the outputs of linear spatial filters. For a variety of choices
of Hx, Hy, compare v′L = M_{Hx,Hy} · vR, the filter responses predicted for the left view, with
vL, the filter responses actually measured for the left view. The choice of Hx, Hy which
minimizes the difference between v′L and vL is the best estimate of the disparity gradient.
The sum of the absolute differences between corresponding filter responses serves as an
efficient and robust method for computing the difference between these two vectors, or
an error-measure for each candidate Hx, Hy.
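The whole pipeline can be sketched in one dimension. In the toy below, an orthonormal DCT filter bank stands in for the paper's oriented multi-scale filters (so the pseudo-inverse is just the transpose, and responses are v = F·I with filters as rows), the distortion is restricted to a single Hx-like scale factor s, M_s = F·T_s·Fᵀ is pre-computed, and the warp is recovered by the sum-of-absolute-differences search; everything here is an illustration of ours, not the paper's 2-D implementation:

```python
import math

N = 8  # patch length (1-D stand-in for an image patch)

# Orthonormal DCT-II filter bank: row k is one "linear spatial filter".
F = [[(math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N))
      * math.cos(math.pi * (2 * x + 1) * k / (2 * N))
      for x in range(N)] for k in range(N)]

def warp_matrix(s):
    # T_s resamples a patch by the horizontal scaling x -> s*(x - c) + c
    # (linear interpolation, clamped at the borders).
    c = (N - 1) / 2.0
    T = [[0.0] * N for _ in range(N)]
    for x in range(N):
        src = min(max(s * (x - c) + c, 0.0), N - 1.0)
        x0 = int(src)
        x1 = min(x0 + 1, N - 1)
        a = src - x0
        T[x][x0] += 1.0 - a
        T[x][x1] += a
    return T

def matvec(A, v):
    return [sum(a * b for a, b in zip(row, v)) for row in A]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

transpose = lambda A: [list(col) for col in zip(*A)]

# Synthetic right patch, and a left patch that is its warped copy.
I_R = [math.sin(0.7 * x) + 0.3 * math.cos(1.9 * x) for x in range(N)]
I_L = matvec(warp_matrix(0.9), I_R)
v_R, v_L = matvec(F, I_R), matvec(F, I_L)

# Pre-computed M_s = F T_s F^T maps right responses to predicted left ones;
# the SAD-minimizing candidate s is the estimated distortion.
best_s = min([0.8, 0.85, 0.9, 0.95, 1.0, 1.05, 1.1],
             key=lambda s: sum(abs(p - q) for p, q in
                               zip(matvec(matmul(matmul(F, warp_matrix(s)),
                                                 transpose(F)), v_R), v_L)))
```

At the true warp the prediction is exact (the basis is complete), so the SAD error dips to numerical zero there, mirroring the behavior of the 2-D model on the (Hx, Hy) grid.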
This approach was tested quantitatively using randomly generated stereo pairs with
known surface orientations (Fig. 4). For each of 49 test surface orientations (Hx, Hy ∈
{0.0, ±0.1, ±0.2, ±0.4}), 50 stereo pairs were created and the values of Hx, Hy were re-
covered using the method described in this paper. The recovered surface orientations
(Fig. 5) are quite accurate, especially for small slants. For larger slants, the spread in
the recovered surface orientation increases, similar to some psychophysical results. Small
systematic errors, such as those for large rotations around the vertical axis, are likely not
an inherent feature of the model, but an artifact of this particular implementation where
surface orientation was computed from coarse estimates using parabolic interpolation.
Fig. 4. Stereo pair of surfaces tilted in depth. A white square marked on each makes the hor-
izontal compression/expansion, when Hx ≠ 0, and vertical skew, when Hy ≠ 0, quite apparent.
Fig. 5. Disparity gradient estimates. For various test surface orientations (open circles), the
mean (black dot) and standard deviation (ellipse) of the recovered disparity gradient are shown.
Because the test surfaces are marked with random textures, the orientation and spatial
frequency disparities at a single position encode surface orientation to varying degrees,
and on some trials would provide only very limited cues. Horizontal stripes, for example,
provide no information about a rotation around the vertical axis. For large planar sur-
faces, or smooth surfaces in general, estimates could be substantially refined by pooling
over a local neighborhood, trading off spatial resolution for increased accuracy.
The approach for recovering three-dimensional surface orientation developed here makes
use of the fact that it is the identical textured surface patch that is seen in the two
views. It is this assumption of correspondence that allows an accurate recovery of the
parameters of the deformation between the two retinal images. However, orientation and
spatial frequency disparities lead to the perception of a tilted surface, even in the absence
of any systematic correspondence (Tyler and Sutter, 1979; von der Heydt et al., 1981).
One interpretation of those results might suppose the existence of stereo mechanisms
which make use of orientation or spatial frequency disparities independent of positional
disparities or correspondence. Such mechanisms would seem to be quite different from
the approach suggested here. On the other hand, it is not immediately apparent how the
present approach would perform in the absence of correspondence.
Given a pair of images, the implementation used in the previous experiment deter-
mines the best estimate of surface orientation, even if it is nonsense. This allows us to
examine how it performs when the assumption of correspondence is false. Stereo pairs
were created by filtering random, uncorrelated one-dimensional noise to have a band-
width of 1.2 octaves and either an orientation disparity (Fig. 6A) or spatial frequency
disparity. Since a different random seed is used for each image, there is no consistent
correspondence or phase relationship. A sequence of 100 such pairs was created and for
each, using the same implementation of the model used in the previous experiment, the
parameters of surface orientation, or the disparity gradient, Hx, Hy, were estimated.
There is a fair bit of scatter in these estimates (Fig. 6B), but if the image pairs were
presented rapidly, one after the other, one might expect the perceived surface slant to
be near the centroid. In this case, Hx = 0 and Hy is positive, which corresponds to a
surface rotated around the horizontal axis, in agreement with psychophysical results
(von der Heydt et al., 1981). In fact, the centroid lies close to where it should be based
on the 10° orientation disparity (Hx = 0.0, Hy = 0.175), despite the absence of corre-
spondence. The same procedure was repeated for several different orientation disparities
and for a considerable range, the recovered slant (Hy) increases with orientation dispar-
ity. Similar results were found for stereo pairs with spatial frequency disparities, but no
systematic correspondence.
7 Conclusion
In this paper, a simple stereopsis mechanism, based on using the outputs of a set of
linear spatial filters at a range of orientations and scales, has been proposed for the direct
recovery of local surface orientation. Tests have shown it is applicable even for curved
surfaces, and that interpolation between coarsely sampled candidate surface orientations
can provide quite accurate results. Estimates of surface orientation are more accurate for
surfaces near fronto-parallel, and less accurate for increasing surface slants.
There is also good agreement with human performance on artificial stereo pairs in
which systematic positional correspondence has been eliminated. This suggests that the
psychophysical results involving the perception of slant in the absence of correspondence
may be viewed, not as an oddity, but as a simple consequence of a reasonable mechanism
for making use of positional, orientation, and spatial frequency disparities to perceive
three-dimensional shape.
References
Arnold RD, Binford TO (1980) Geometric constraints on stereo vision. Proc SPIE
238:281-292
Burt P, Julesz B (1980) A disparity gradient limit for binocular fusion. Science
208:651-657
Jones DG (1991) Computational models of binocular vision. PhD Thesis, Stanford Univ
Jones DG, Malik J (1991) Determining three-dimensional shape from orientation and
spatial frequency disparities I - - using corresponding line elements. Technical Report
UCB-CSD 91-656, University of California, Berkeley
Jones DG, Malik J (1992) A computational framework for determining stereo
correspondence from a set of linear spatial filters. Proc ECCV Genova
Julesz B (1960) Binocular depth perception of computer generated patterns. Bell
Syst Tech J 39:1125-1162
Julesz B (1971) Foundations of cyclopean perception. University of Chicago Press:Chicago
Kass M (1983) Computing visual correspondence. DARPA IU Workshop 54-60
Kass M (1988) Linear image features in stereopsis. Int J Computer Vision 357-368
Koenderink JJ, van Doorn AJ (1976) Geometry of binocular vision and a model for
stereopsis. Biol Cybern 21:29-35
Mori K, Kidode M, Asada H (1973) An iterative prediction and correction method for
automatic stereo comparison. Computer Graphics and Image Processing 2:393-401
Quam LH (1984) Hierarchical warp stereo. Proc Image Understanding Workshop.
Rogers BJ, Cagenello RB (1989) Orientation and curvature disparities in the perception of
3-D surfaces. Invest Ophth and Vis Science (suppl) 30:262
Tyler CW, Sutter EE (1979) Depth from spatial frequency difference: an old kind of
stereopsis? Vision Research 19:859-865
von der Heydt R, Hänny P, Dürsteler MR (1981) The role of orientation disparity in
stereoscopic perception and the development of binocular correspondence. In: Advances
in Physiological Science 16:461-470 Grastyán E, Molnár P (eds) Oxford: Pergamon
Wildes RP (1991) Direct recovery of three-dimensional scene geometry from binocular
stereo disparity. IEEE Trans PAMI 13(8):761-774
Witkin AP, Terzopoulos D, Kass M (1987) Signal matching through scale space. Int J
Computer Vision 1(2):133-144
Using Force Fields Derived from 3D Distance Maps
for Inferring the Attitude of a 3D Rigid Object
1 Introduction
One of the most basic abilities of any human or artificial intelligence is the inference
of knowledge by matching various pieces of information [1]. When only a few data are
available, one can introduce a priori knowledge to compensate for the lack of information
and match it with the data. In this framework, one of the most classical problems is
the inference of the attitude of a 3D object from sensorial data (2D projections or sparse
3D coordinates).
This problem can be formulated as follows: assume that we know a 3D description
(model) or some features of an object in a first 3D attitude (location and orientation).
We acquire various sensorial data describing this object in another (unknown) attitude,
and we then attempt to estimate, from the model of the object and this new data,
this unknown attitude. This generally implies the determination of 6 parameters: three
components of translation (location) and three components of rotation (orientation).
In this paper, we will suppose the segmentation of the sensorial data achieved and
focus on the interpretation of the scene described by the segmented images.
In spite of a considerable amount of literature (see [2] for a review of related works),
no general algorithm has been published yet. This paper presents a new geometric method,
oriented toward complex objects, based on the pre-computation of a force field derived from
3D distance maps. Experimental results, in the field of computer-assisted surgery, are
proposed.
set of 3D points distributed on the surface of the object and defining our model of the
object. Such a model can be extracted from any 3D initial representation.
The problem is to estimate the transformation T between Refsensor (the reference
system of the sensorial data) and Ref3D (reference system in which the 3D model of
the object is defined). After the sensor calibration (N-planes spline method [3]), in
3D/2D matching every pixel of each projection is associated with a 3-D line, Li,
called the matching line, whose representation is known in Refsensor.
In 3D/2D matching, when the 3D object is in its final attitude, T, every line Li is
tangent to the surface S. In the same way, when matching the 3D model with a set of
sparse 3D control points, the latter are in contact with S. For sufficiently complex ob-
jects (i.e., without strong symmetries), T is the only attitude leading to such a geometric
configuration. Our algorithm is based on this observation:
1. We first define the 3-D unsigned distance between a point r and the surface S,
dE(r, S), as the minimum Euclidean distance between r and all the points of S.
We use this distance function to define a force field at any point of the 3D space.
Every point r is associated with a force vector F(r) = w − r, where w is the point of
S closest to r. We therefore have:
|F(r)| = dE(r, S)   (1)
2. In 3D/2D matching, an attraction force FL(Li) is associated with any matching line
Li by:
(a) if Li does not cross S, FL(Li) = F(Mi), where Mi is the point of Li closest
to S;
(b) otherwise, FL(Li) = F(Ni), where Ni is the point of Li inside the surface S farthest
from S (see Fig. 1).
A simple way to compute FL is to consider a signed distance, d, with the same modulus
as dE but negative inside S, and to choose the point of Li with minimum signed modulus.
3. Lemma (not proved here): the potential energy of the force field F at a point r with
respect to the surface S, i.e., the energy necessary to bring r into contact with S, is
PE(r) = (1/2) |F(r)|² + o(|F(r)|²)   (2)
For a set of Nq 3D control points, ri, to take into account the reliability of the data,
we introduce the variance of the noise of the measurement dE(ri, S), σi² (see section
4), to weight the energy of a control point and consider the energy E:

E(p) = Σ_{i=1..Nq} (1/σi²) [dE(ri, S)]²   (3)
4. In the same way, the potential energy of a matching line Li, i.e., the work necessary
to bring the line into contact with S, is equal to the potential energy of the point
where the attraction force is applied (Mi or Ni). As previously, to take into account
the reliability of the data on the matching lines, we weight the potential energy of
each matching line by the variance of the noise of the measurement d(li(p), S), σi²,
and consider the energy E:

E(p) = Σ_{i=1..Nl} (1/σi²) [d(li(p), S)]²   (4)
5. As shown above, when the object is in its final attitude, every line (every control
point) is in contact with S and the energy of the attitude is therefore zero, the lowest
possible energy. If the object is sufficiently complex, the minimum of the energy
function is reached only once, in the final attitude, and the energy function is convex
in a large neighborhood of this attitude. A minimization procedure for convex functions
can therefore be performed (see section 4).
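A 2-D stand-in for the force field and the control-point energy (the surface is a sampled circle rather than a 3-D mesh, and the brute-force nearest-point search replaces the distance map described below; names are ours):

```python
import math

# Surface S sampled as points on the unit circle (2-D stand-in for a mesh).
S = [(math.cos(2 * math.pi * k / 360), math.sin(2 * math.pi * k / 360))
     for k in range(360)]

def force(r):
    # F(r) = w - r, where w is the point of S closest to r (eq. 1).
    w = min(S, key=lambda p: (p[0] - r[0]) ** 2 + (p[1] - r[1]) ** 2)
    return (w[0] - r[0], w[1] - r[1])

def d_E(r):
    fx, fy = force(r)
    return math.hypot(fx, fy)          # |F(r)| = d_E(r, S)

def energy(points, sigmas):
    # Weighted energy of a set of control points, as in eq. (3).
    return sum(d_E(r) ** 2 / s ** 2 for r, s in zip(points, sigmas))
```

A control point at (2, 0) is attracted toward the nearest surface point (1, 0), and its energy contribution is its squared distance weighted by 1/σ².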
The method described in the previous section relies on the fast computation of the
distances dE and d. If the surface S is discretized into n² points, the computation of
the distance dE is an O(n²) process. Similarly, if a line li(p) is discretized into m points,
the computation of the distance d is an O(mn²) process. To speed up this process, we
precompute a 3-D distance map, which is a function that gives the signed minimum
distance to S from any point q inside a bounding volume V that encloses S.
More precisely, let G be a regular grid of N³ points bounding V. We first compute and
store the distance d for each point q of G. Then d(q, S) can be computed for any point
q using a trilinear interpolation of the 8 corner values dijk of the cube that contains the
point q. If (u, v, w) ∈ [0,1] × [0,1] × [0,1] are the normalized coordinates of q in the
cube,

d(q, S) = Σ_{i=0..1} Σ_{j=0..1} Σ_{k=0..1} bi(u) bj(v) bk(w) dijk   with   bl(t) = l·t + (1 − l)(1 − t).   (5)
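Equation (5) is straightforward to implement; the sketch below uses the basis b0(t) = 1 − t, b1(t) = t:

```python
def b(l, t):
    # Interpolation basis of eq. (5): b_0(t) = 1 - t, b_1(t) = t.
    return l * t + (1 - l) * (1 - t)

def trilinear(d_corner, u, v, w):
    # d_corner[i][j][k]: signed distances at the 8 cube corners;
    # (u, v, w) in [0,1]^3: normalized coordinates inside the cube.
    return sum(b(i, u) * b(j, v) * b(k, w) * d_corner[i][j][k]
               for i in (0, 1) for j in (0, 1) for k in (0, 1))
```

Because the basis functions sum to one along each axis, constant corner values are reproduced exactly, and the interpolant is linear in u, v, and w separately.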
We can compute the gradient ∇d(q, S) of the signed distance function by simply
differentiating (5) with respect to u, v, and w. Because d is only C⁰, ∇d(q, S) is discon-
tinuous on cube faces. However, these gradient discontinuities are relatively small and do
not seem to affect the convergence of our iterative minimization algorithm.
In looking for an improved trade-off between memory space, accuracy, speed of com-
putation, and speed of construction, we have developed a new kind of distance map
which we call the octree spline. The intuitive idea behind this geometrical representation
is to have more detailed information (i.e., more accuracy) near the surface than far away
from it. We start with the classical octree representation associated with the surface S
and then extend it to represent a continuous 3-D function that approximates the signed
Euclidean distance to the surface. This representation combines advantages of adaptive
spline functions and hierarchical data structures. For more details on the concept of
octree-splines, see [2].
This section describes the nonlinear least squares minimization of the energy or error
function E(p) defined in eq. 4 and eq. 3.
Least squares techniques work well when we have many uncorrelated noisy measure-
ments with a normal (Gaussian) distribution³. To begin with, we will make this assump-
tion, even though noise actually comes from calibration errors, 2-D and 3-D segmentation
errors, the approximation of the Euclidean distance by octree spline distance maps, and
non-rigid displacement of the surface between Ref3D and Refsensor.
To perform the nonlinear least squares minimization, we use the Levenberg-Marquardt
algorithm because of its good convergence properties [4]. An important point of this
method is that in both equations 4 and 3, E(p) can be easily differentiated, which allows
us to exhibit simple analytical forms for the gradient and Hessian of E(p), used in the
minimization algorithm.
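As an illustration of the damped normal-equation iteration behind Levenberg-Marquardt, the toy below estimates a 2-D rigid transform (tx, ty, θ) from known point correspondences with a numeric Jacobian; it is a simplified stand-in for the paper's 6-parameter 3-D problem, and every name and constant is ours:

```python
import math

def residuals(p, data, model):
    # Stacked coordinate residuals of the transformed data points.
    tx, ty, th = p
    c, s = math.cos(th), math.sin(th)
    e = []
    for (x, y), (mx, my) in zip(data, model):
        e.append(c * x - s * y + tx - mx)
        e.append(s * x + c * y + ty - my)
    return e

def solve3(A, rhs):
    # Cramer's rule for a 3x3 linear system.
    def det(M):
        return (M[0][0] * (M[1][1] * M[2][2] - M[1][2] * M[2][1])
              - M[0][1] * (M[1][0] * M[2][2] - M[1][2] * M[2][0])
              + M[0][2] * (M[1][0] * M[2][1] - M[1][1] * M[2][0]))
    D = det(A)
    out = []
    for k in range(3):
        Ak = [row[:] for row in A]
        for i in range(3):
            Ak[i][k] = rhs[i]
        out.append(det(Ak) / D)
    return out

def levenberg_marquardt(data, model, p=(0.0, 0.0, 0.0), lam=1e-3, iters=25):
    p = list(p)
    h = 1e-6
    for _ in range(iters):
        e = residuals(p, data, model)
        # Numeric Jacobian, one column per parameter.
        J = []
        for j in range(3):
            q = p[:]
            q[j] += h
            ej = residuals(q, data, model)
            J.append([(a - b) / h for a, b in zip(ej, e)])
        # Damped normal equations: (J^T J + lam I) delta = -J^T e.
        A = [[sum(J[i][k] * J[j][k] for k in range(len(e)))
              + (lam if i == j else 0.0) for j in range(3)] for i in range(3)]
        g = [-sum(J[i][k] * e[k] for k in range(len(e))) for i in range(3)]
        p = [pi + di for pi, di in zip(p, solve3(A, g))]
    return p

# Synthetic problem with known pose (tx, ty, theta) = (0.5, -0.2, 0.3).
data = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 2.0)]
c, s = math.cos(0.3), math.sin(0.3)
model = [(c * x - s * y + 0.5, s * x + c * y - 0.2) for x, y in data]
p_hat = levenberg_marquardt(data, model)
```

In the paper's setting the residuals are the (weighted) distances of equations 3 and 4, evaluated through the octree-spline distance map, and the analytical gradient and Hessian replace this numeric Jacobian.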
At the end of the iterative minimization process, we compute a robust estimate of
the parameter p by throwing out the measurements where ei(p) >> 0.2 and performing
some more iterations [5]. This process removes the influence of outliers which are likely
to occur in the automatic 2-D and 3-D segmentation processes (for instance, a partially
superimposed object on X-ray projections can lead to false contours).
Using a gradient descent technique such as Levenberg-Marquardt we might expect
that the minimization would fail because of local minima in the 6-dimensional parameter
space. However, for the experiments we have conducted, false local minima were few and
always far away from the solution. So, with a correct initial estimate of the parameters,
these other minima are unlikely to be reached.
Finally, at the end of the iterative minimization procedure, we estimate the uncer-
tainty in the parameters (covariance matrix) to compute the distribution of errors after
minimization in order to check that it is Gaussian.
5 Experimental results
We have performed tests on both real anatomical surfaces and on simulated surfaces. In
3D/2D matching, the projection curves of these surfaces were obtained by simulation in
order to know the parameters p* for which the correct pose is reached. Figures 2 and 3
show an example of convergence for an anatomical surface (VIM of the brain; surface
S1) in 3D/2D matching. The state of the iterative minimization algorithm is displayed
after 0, 2, and 6 iterations. Figure 2 shows the relative positions of the projection lines
and the surface seen from a general viewpoint. Figure 3 shows the same state seen from
the viewpoints of the two cameras (computation times expressed below are given for a
DECstation 5000/200). Experiments have also been conducted to test this method for
3D/3D matching by simulating a complex transformation on a vertebra (surface S2) (see
fig. 4 for the convergence).
6 Discussion
In comparison with existing methods, the experiments we ran showed the method pre-
sented in this paper had five main advantages.
a Under these assumptions, the least squares criterion is equivalent to maximum likelihood
estimation.
Fig. 2. Convergence of the algorithm observed from a general viewpoint (surface S1 is represented
by a set of points). Two sets of projection lines evolve in the 3D potential field associated with
the surface until each line is tangent to S1: (a) initial configuration, (b) after 2 iterations, (c)
after 6 iterations. For this case, the matching is performed in 1.8 s using 77 projection lines, and in
0.9 s using 40 projection lines.
First, the matching process works for any free-form smooth surface. Second, we
achieve the best accuracy possible for the estimation of the 6 parameters in p, because
the octree spline representation we use approximates the true 3-D Euclidean distance
with an error smaller than the segmentation errors in the input data. Third, we provide
an estimate of the uncertainties of the 6 parameters. Fourth, we perform the matching
process very rapidly. Fifth, in our method, only a few pixels on the contours are needed,
which makes it possible to estimate the attitude of the object even if it is partially occluded.
Moreover, reliability factors can be introduced to weight the contribution of uncertain data
(for instance, the variance of the segmentation can be taken into account).
This method could also be used for recognition problems, where the purpose is to
match some contour projections with a finite set of 3-D objects { O i } .
Research is currently under way to adapt this algorithm to non-segmented gray-level
images by selecting potential matching lines, assigning credibility factors to them,
and maximizing a matching energy.
References
Fig. 3. Convergence of the algorithm for surface S1 observed from the 2 projection viewpoints. The
external contours of the projected surface end up fitting the real contours: (a) initial configuration,
(b) after 2 iterations, (c) after 6 iterations.
1 INRIA Sophia-Antipolis, 2004 Route des Lucioles, 06565 Valbonne Cedex, France
2 SRI International, 333 Ravenswood Avenue, Menlo Park, CA 94025, USA
Abstract.
We propose an approach for building surfaces from an unsegmented set
of 3D points. Local surface patches are estimated and their differential prop-
erties are used iteratively to smooth the points while eliminating spurious
data, and to group them into more global surfaces. We present results on
complex natural scenes using stereo data as our source of 3D information.
1 Introduction
Deriving object surfaces from a set of 3D points produced, for example, by laser rangefinders,
stereo, or 3D scanners is a difficult task because the points form potentially noisy
"clouds" of data instead of the surfaces one expects. In particular, several surfaces can
overlap, so the 2 1/2 D hypothesis required by simple interpolation schemes is not
necessarily valid. Furthermore, the raw 3D points are unsegmented, and it is well known that
segmentation is hard when the data is noisy and originates from multiple objects in a
scene. Most existing approaches to the problem of determining surfaces from a set of
points in space assume that all data points belong to a single object to which a model
can be fit [11,7,2].
To overcome these problems, we propose fitting a local quadric surface patch in the
neighborhood of each 3D point and using the estimated surfaces to iteratively smooth the
raw data. We then use these local surfaces to define binary relationships between points:
points whose local surfaces are "consistent" are considered as sampled from the same
underlying surface. Given this relation, we can impose a graph structure upon our data
and define the surfaces we are looking for as sets of points forming connected components
of the graph. The surfaces can then be interpolated using simple techniques such as
Delaunay triangulation. In effect, we are both segmenting the data set and reconstructing
the 3D surfaces. Note that closely related methods have also been applied to magnetic
resonance imagery [1{3] and laser rangefinder images [3].
Regrettably, space limitations force us to omit many details; the interested reader is
referred to [6].
2 Local Surfaces
We iteratively fit local surfaces by first fitting a quadric patch around each data point,
and then moving the point by projecting it back onto the surface patch. For our algorithm
to be effective with real data, it must be both orientation independent and insensitive to
outliers.
* Support for this research was partially provided by ESPRIT P2502 (VOILA) and ESPRIT
BRA 3001 (INSIGHT) and a Defense Advanced Research Projects Agency contract.
z = q(x, y) = ax² + bxy + cy² + dx + ey + f

where the (x_i, y_i, z_i), 1 ≤ i ≤ n, are the n neighbors in a spherical neighborhood of P0 and the
w_i are associated weights. We then transform P0 as follows:
expressed in the local reference frame, and we repeat the fitting process with the
transformed points.
In effect, we are approximating the 2nd-order Taylor expansion of the surface around
each point P. Since we iterate the estimates, this procedure is a form of relaxation [6]. For
typical stereo data sets (≈ 100,000 points), the algorithm converges within five to ten
iterations.
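The fit-and-project step can be sketched as follows. This is a minimal illustration in numpy; the helper names are ours, and the neighbor coordinates are assumed to be already expressed in the local reference frame of P0:

```python
import numpy as np

def fit_quadric(neighbors, weights):
    """Weighted least-squares fit of z = a x^2 + b x y + c y^2 + d x + e y + f
    over the neighbors (x_i, y_i, z_i) of a point, in its local frame."""
    x, y, z = neighbors.T
    A = np.column_stack([x*x, x*y, y*y, x, y, np.ones_like(x)])
    w = np.sqrt(weights)                      # fold weights into the system
    coeffs, *_ = np.linalg.lstsq(A * w[:, None], z * w, rcond=None)
    return coeffs                             # (a, b, c, d, e, f)

def project_onto_patch(coeffs):
    """Move the centre point P0 = (0,0,0) of the local frame onto the fitted
    patch: its new position is (0, 0, q(0,0)) = (0, 0, f)."""
    return np.array([0.0, 0.0, coeffs[5]])
```

Iterating these two steps over all points, with the weights of the next section, gives the relaxation described in the text.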
Fig. 1. (a) Two spatially proximate and noisy hemispheres. (b) Smoothed spheres after sev-
eral iterations. (c) Resampled points. (d) The two largest connected components of
the corresponding graph for a given value of the parameter mq of Eq.(4), one shown
as a wireframe and the other as a shaded surface.
Outliers. To deal with outliers (to which least-squares techniques are notoriously sensitive),
we define a metric dq that measures whether or not two points appear to belong
to the same surface: dq is zero when the two points belong to the same local surface q_i
and increases as their respective local surfaces become inconsistent. It can therefore be
used to discount outliers by computing the weighting factor w_i of Eq.(1) at iteration t as

    w_i = 1 / (1 + (d_i/σ)²),   with d_i = dq(P0, P_i, q_0^(t-1), q_i^(t-1)),

where the (q_i^(t-1)), 0 ≤ i ≤ n, are the quadrics computed at the previous iteration
and σ is an estimate of the variance of the process generating the data points. Note
that for processes such as stereo computation or laser range finding, σ can actually be
estimated.
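The weighting formula above amounts to the following small function; note that it never sets a weight exactly to zero, it merely makes gross outliers negligible:

```python
def robust_weight(d, sigma):
    """Weight w_i = 1 / (1 + (d_i / sigma)^2): points whose local surfaces are
    consistent (d near 0) keep full weight; inconsistent ones are discounted."""
    return 1.0 / (1.0 + (d / sigma) ** 2)
```

At d = σ the weight has dropped to one half, so σ directly controls how aggressively inconsistent points are discounted.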
In this manner, as the algorithm progresses, the points that are on the same surface as
P0 gain influence while the others are increasingly discounted. We illustrate this behaviour
in Fig. 1(b), where the two noisy hemispheres are smoothed without being merged and
the points between them are left as outliers; such a result would be difficult to achieve
with a simple 2 1/2D interpolation scheme.
3 Resampling
When the smoothing is done, as shown in Figure 1(b), the data points still form an
irregular sampling of the underlying surfaces that is ill-suited for the generation of a
map. In order to produce meaningful triangulations, we need a more regularly spaced set
of vertices, and we use the local surfaces to compute them [6]. In effect, we are replacing a
large number of irregularly spaced 3D points by a smaller set of regularly spaced ones and
their local surfaces: we are achieving both data organization and compression. This turns
out to be an effective way to merge data points coming from several views or sensors.
4 Clustering
To cluster the isolated 3D points into more global entities, we again use the metric of
Eq.(2) to define a "same surface" relationship R between points P1, P2 as follows:

    P1 R P2  ⟺  dq(P1, P2, q1, q2) < mq  and  de(P1, P2) < md,

where de is the Euclidean distance and md and mq are two thresholds. In other words,
two points are assumed to belong to the same surface if their local fits are consistent
with one another. The data set equipped with the relationship R can now be viewed as
a graph whose connected components are the surfaces we are looking for.
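Once the relation R is available as a pairwise predicate, extracting the connected components is straightforward. The sketch below uses union-find; the predicate `same_surface` stands in for the dq/de threshold tests, which are not reproduced here:

```python
def cluster_points(points, same_surface):
    """Group points into surfaces as the connected components of the graph
    whose edges join index pairs satisfying the 'same surface' relation R."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if same_surface(i, j):
                parent[find(i)] = find(j)  # union the two components

    comps = {}
    for i in range(n):
        comps.setdefault(find(i), []).append(i)
    return list(comps.values())
```

In practice the O(n²) pair loop would be restricted to spatial neighbours (e.g. via a k-d tree), since distant points fail the de threshold anyway.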
5 Triangulation
The clusters we have generated so far are collections of points and their associated local
surfaces. For many applications, such as robotics or graphics, it is important to be able
to unambiguously interpolate the surfaces. Delaunay triangulation [1] is an excellent way
of doing this [6] and, furthermore, lets us compute shaded models of our data sets.
Fig. 2. (a) (b) A stereo pair showing some rocks. (c) Another image taken from a completely
different viewpoint. (d) The stereo map derived by matching (a) and (b). The black
areas correspond to textureless areas for which no depth was computed, and lighter
colors correspond to increasing depth. (e) Wireframe representation. (f) Shaded
representation. Note that only the parts of the rocks visible in both (a) and (b) are
correctly reconstructed.
Fig. 3. (a) Triangulated ground surface for the rocks of Figure 2. (b) (c) Shaded views. Note
that the backs of the foreground rocks of Figure 2(a) are now clearly visible.
6 Results

In this section, we show how our technique can be used to reconstruct 3D surfaces by
fusing stereo depth maps computed from several viewpoints.
Our stereo algorithm [4] produces semi-dense maps that are 2 1/2D representations
of the world. They are necessarily incomplete and, in particular, cannot account for
occluded parts of the scene, such as the back of the rocks in Figure 2(a).
To reconstruct the scene more completely, we have used 5 sets of stereo pairs
corresponding to camera positions between those of Figures 2(a) and 2(c). After having
registered the stereo results with one another [12], we can use our technique to generate
clusters of triangulated 3D points. The largest one, depicted in Figure 3, accounts for all
the large rocks. Erroneous data points have been discarded as outliers.
7 Conclusion
The experiments shown in this paper used depth data acquired from stereopsis by a mobile
robot, although we have successfully tested much of the same code on 3D biomedical
scanner images as well. It must be noted that when the quality of the data degrades, the
critical step becomes that of grouping, and it will be necessary to develop more
sophisticated methods. In fact, generating the triangulations of Section 5 could be recast as the
problem of finding the best description of the data in terms of a set of triangles with normals and
curvatures known at every vertex. This problem can be handled within the framework
provided by minimum description length encoding [9,8,5], and such a framework should
provide us with a sound theoretical basis for future work.
References
This article was processed using the LaTeX macro package with ECCV92 style
Finding the Pose of an Object of Revolution
R. GLACHET, M. DHOME, J.T. LAPRESTE.
Electronics Laboratory, URA 830 of the CNRS
Blaise Pascal University of Clermont-Ferrand 63177 AUBIERE CEDEX (FRANCE)
tel: 73.40.72.28 ; fax: 73.40.72.62, Email: dhome@le-eva.univ-bpclermont.fr
Abstract: An algorithm able to locate an object of revolution from its CAD model and a single perspective
image is proposed. Geometric properties of objects of revolution are used in order to simplify
the localization problem. The axis projection is first computed by a prediction-verification scheme. It
makes it possible to compute a virtual image in which the contours are symmetric. A rough localization is done
in this virtual image and then improved by an iterative process. Experiments with real images prove
its robustness and its capability to deal with partially occluded contours.
1 INTRODUCTION
Among the visual features that can be extracted from an image, contours are a major source of information
about the shape and the attitude of an object. Retrieving the object location from the contours detected in
a single perspective image is not usually an easy task. The main difficulty consists in finding perspective
invariants that make it possible to match model elements to image primitives.
In the case of polyhedral scenes, matching is greatly simplified, as straight lines are projected onto straight
lines in the image. Thus linear ridges can be successfully used as invariants ([LOW-85], [DHO-89]).
The problem is more difficult when dealing with a world including curved objects, mainly because the
primitives used to describe the model and the image curves are not necessarily of the same nature
([MAR-82]). Alignment of points ([HUT-87]) makes it possible to deduce the location of an object as soon as three
pairs of corresponding model and image points are found. But finding such pairs is not trivial.
In the case of Straight Homogeneous Generalized Cylinders (S.H.G.C.) ([MAR-82]), invariants like zeros
of curvature can provide matching pairs ([PON-89], [ULU-90], [RIC-91]), but the resulting methods are
very sensitive to occlusion.
When no model is available, the location problem is under-constrained; meridians can then be used to
avoid ambiguity ([ULU-90]), but the extraction of meridians is not straightforward.
In this paper, we have chosen to deal with objects of revolution whose model is described by a generating
curve. The matching difficulty is avoided by taking advantage of local symmetries (cf also [PON-89]).
Our localization method consists of three stages that make full use of the symmetry properties:
Algorithm (A1) is devoted to the localization of the axis projection and to the creation of a virtual image
in which the contours are symmetrical [GLA-91]. This step simplifies the
following computations.
Algorithm (A2) deals with the determination, by any available means, of a rough starting localization
[DHO-90] (see section 3).
Algorithm (A3) uses the contour equations (section 2) in order to improve iteratively the location found
with (A2) (see section 4).
This work can be seen as an enhancement of a method presented at the last ECCV ([DHO-90]). In [DHO-90]
the concept of the virtual image was already used to simplify (A2), but its determination was not very accurate
because it was only based on the knowledge of two image points resulting from the same object section; these
chosen points were double points of the limb projection, and their detection is often biased. In the case of noisy or
partially occluded contours, such a method fails. On the contrary, the process (A1) is quite robust to noise
(due to the fact that it avoids brute accumulation schemes) and thus provides a virtual image known with
good accuracy.
Algorithm (A2) has recently been improved; the reader can find in [LAV-90] the details of its new
implementation. An iterative improving scheme, (A3), has been added to the previous algorithm to improve accuracy.
Its goal is similar to Kriegman & Ponce's work [KRI-90]. The respective advantages of the two methods are
detailed in paragraph (4-2).
Implementation details and experimental results are given in section 5 (see also [GLA-92]).
2 CONTOUR EQUATIONS
In this paragraph, we derive the equations of the limb projections of a solid of revolution from the attitude
parameters. To obtain simple equations, we only consider the attitudes for which the limb projections are
symmetrical with respect to the vertical axis through the image center. Using algorithm (A1) it is always
possible, from any brightness image of a solid of revolution, to compute a virtual image having this property
(see [GLA-91]).
A limb point of a curved object is a surface point for which the tangent plane passes through the optical
center of the camera.
Let R be the model of an object of revolution whose axis initially lies along (O, k), described by its generating
curve r(z).
Let Q0 be a point of the model surface and N0 the normal to the surface at Q0; under the attitude
transformation, Q0 is mapped to Q and N0 to N.
If f is the focal length and u, v the coordinates of the perspective projection q of Q, we deduce from the limb
equation (N · OQ = 0):

    u = f·r·cosθ / (r·sinα·sinθ + z0·cosα + c)
    v = f·(r·cosα·sinθ − z0·sinα + b) / (r·sinα·sinθ + z0·cosα + c)          (1)

where sinθ = ((c·cosα − b·sinα + z0)·r' − r) / (b·cosα + c·sinα) for limb points.

Then, for a given section z0 of the model, the perspective projection of limb points can easily be computed.
3 ROUGH LOCALISATION IN THE VIRTUAL IMAGE (A2)
Many processes are available to find an approximate value of the attitude parameter vector p, especially
when the contour detected in the image presents a reflectional symmetry. The weak perspective view
assumption can be sufficient to treat the problem.
Moments, smoothness and compactness may help to reach an initial value of p, but such features are not
reliable when a part of the contour is occluded.
When an object of revolution has a flat base, a part of an ellipse is seen in the image and can be used to
find an approximate location of the object ([DHO-90]); extraction of the ellipse among contour points is
often difficult, but reliable algorithms exist ([RIC-85]).
A prediction-verification method using zero-curvature points ([RIC-91]), recently generalized to any
points of a symmetrical contour ([DHO-90], [LAV-90]), can also lead to an estimation of the localization,
useful to initialize the iterative improvement described in the next section.
4 IMPROVEMENT OF THE LOCALISATION IN THE VIRTUAL IMAGE (A3)
With a rough localization at hand, we can implement an iterative process that will improve it.
4.1 Description of the method
At this stage, we know an approximate location p0 of the solid R, in a frame such that the perspective
projections of the object limbs are symmetrical with respect to the vertical axis through the image center.
Let C0 be the contour calculated for this initial value p0 (see section 2) and I be the contour observed
in the virtual image.
The problem is now to decide on an association rule between the points belonging respectively to I
and C0, under the constraint that if p0 is the correct attitude, the rule must give the exact pairing.
Once this rule of association is chosen, the distance between paired points can be computed (as well as
partial derivatives, if the association is not too complicated) and a minimization process can be run.
The perfect rule would be to associate each point qC0(z0) of C0 with the point qI(z0) of I corresponding to the
same section z0 of the model; unfortunately, matching points that way is not trivial (except for zero-curvature
points).
Another possible association is to connect each point qCk to the point qI whose normal passes through
qCk (see [PON-89b]), but here the preliminary computation of a virtual symmetrical image makes it possible to
choose a very simple matching rule:
For each section z0 of the model, we compute u(z0) and v(z0) using equation (1) of section 2, and
we look for the nearest contour point on the image line v = v(z0), i.e. the association is made horizontally.
Figure 2: Notations. a) Our distance; b) following an image point between two iterations.
The image contour I is supposed given by a function U(v). Ck is the calculated contour at iteration k.
We must find the vector p minimizing

    D = Σ_z0 F²(z0) = Σ_z0 | U(v(z0)) − u(z0) |²    (see figure 2b).
The section of the model which is projected onto a given image line at iteration k can change at
iteration k+1. This variation of section can be calculated and taken into account in the equations
involved in a Newton-Raphson minimization.
Let us consider two successive iterations k, k+1, providing an unknown variation Δp of the parameter
vector, such that the perspective projection Ak(z0) of a limb point of section z0 becomes Ak+1(z0) at the
next iteration.
It is possible to find the section z0 + Δz0, and consequently the point Bk(z0 + Δz0), which will be
transformed at iteration k+1 into a point Bk+1(z0 + Δz0) having the same ordinate v as point Ak(z0). In
fact we can write:

    Δu = Σ_{j=1..3} (∂u/∂pj)·Δpj + (∂u/∂z0)·Δz0
    Δv = Σ_{j=1..3} (∂v/∂pj)·Δpj + (∂v/∂z0)·Δz0          (2)

and as we imposed Δv = 0 for the measure of F(z0) between two iterations, we can deduce

    Δz0 = − ( Σ_{j=1..3} (∂v/∂pj)·Δpj ) / (∂v/∂z0).

Then (2) becomes:

    Δu = Σ_{j=1..3} [ ∂u/∂pj − (∂v/∂pj)·(∂u/∂z0)/(∂v/∂z0) ] Δpj.

Now we solve for Fk+1 = 0 (the distance between Ck and Ck+1 tends to zero):

    Σ_{j=1..3} [ ∂u/∂pj − (∂v/∂pj)·(∂u/∂z0)/(∂v/∂z0) ] Δpj = Fk(z0).          (3)
5 IMPLEMENTATION
5.1 Process
The contours issued from an image of an object of revolution are first extracted. The axis projection
is computed by procedure (A1); it makes it possible to compute a rotation R0 that must be applied to the camera
frame to obtain a virtual image in which the contours are symmetrical.
A rough estimation of the parameter vector p is done by procedure (A2) in this virtual image; this
vector is used to compute the initial contour C0 of our iterative scheme.
We think that least-squares B-Spline curve fitting provides a compact and faithful representation
for smooth contours (see also [LAU-87]). Thus the right part of the image contour I (relative to the axis
projection) is approximated by a B-Spline curve (the left part could have been used to reconstruct
occluded parts or to enhance accuracy).
The calculated contour Ck at iteration k is computed using the generating curve equation and equation
(1) of section 2; internal loops of this contour are pruned by keeping, for a given image line, the farthest
point from the axis projection. This operation avoids ambiguous associations in computing the distance
F(z0).
Ck is then sampled to use only 20% of the contour points; for each of these points the derivatives
involved in equation (3) and the corresponding Fk(z0) are computed (Fk(z0) is obtained by B-Spline
interpolation). The resulting system is then solved by classic Gauss elimination, which gives the
correction vector Δp; p + Δp becomes the initial parameter vector for the next iteration.
Er = D / (number of points found) is used as the stopping error criterion.
Once convergence is reached, the inverse rotation R0ᵀ is applied to obtain the localization in the original frame.
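Schematically, the whole (A3) loop reduces to a small Gauss-Newton iteration. The sketch below uses a numerical Jacobian and a least-squares solve in place of the analytic derivatives of equation (3) and the Gauss elimination of the text; `contour` and `U_image` are placeholder callables standing in for equation (1) and the B-Spline contour:

```python
import numpy as np

def refine_pose(contour, U_image, p0, sections, n_iter=15, eps=1e-6):
    """Iterative refinement (A3): compute the contour C_k for the current p,
    match each sample horizontally to the observed contour, and take a
    Gauss-Newton step on D = sum_z0 F(z0)^2.
    contour(p, z0) -> (u, v) stands in for equation (1);
    U_image(v) -> u stands in for the B-Spline contour I."""
    p = np.asarray(p0, dtype=float).copy()

    def F(q):  # horizontal residuals U(v(z0)) - u(z0) over the sampled sections
        uv = np.array([contour(q, z) for z in sections])
        return np.array([U_image(v) for v in uv[:, 1]]) - uv[:, 0]

    for _ in range(n_iter):
        r = F(p)
        J = np.empty((r.size, p.size))
        for j in range(p.size):          # numerical Jacobian dF/dp
            dp = np.zeros_like(p); dp[j] = eps
            J[:, j] = (F(p + dp) - r) / eps
        step, *_ = np.linalg.lstsq(J, -r, rcond=None)
        p += step
        if np.linalg.norm(step) < 1e-10:
            break
    return p
```

The stopping test on the step norm plays the role of the Er criterion of the text.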
Figure 3: Visualization of the different steps of the process:
a: the brightness image; b: the found axis projection (A1);
c: the initial location in the virtual image (A2); d: successive iterations C0...C10;
e: the found location in the virtual image with (A3); f: localization in the original frame.
6 CONCLUSION
We have provided a complete method able to find the pose of objects of revolution with great accuracy.
The symmetry properties that result both from the geometric characteristics of such an object and from the
perspective view assumption have given useful cues for the localisation problem.
Positioning an object of revolution requires finding five localization parameters. Computing the axis projection
is an elegant way to get rid of two of them. A rough estimation of the three remaining ones can easily be
done, using one among the many algorithms dedicated to this task. An iterative process can then be applied
to improve these parameters.
The iterative process presented in section 4 is both simple and robust; it is able to deal with B-Spline
curve representations. Elimination theory and implicit contour equations (cf. the comparison with Kriegman &
Ponce's work) are avoided.
A contour point observed at a given iteration results from the projection of a given section. For a fixed
image line, this section changes between two iterations. We have taken this variation of section into account
in our way of pairing points between two successive iterations; this operation avoids the main stability
problems met in similar processes.
This precise localisation can be used as a preliminary step in modelling objects of revolution, as soon as
a partial model is known (see [LAV-90]).
An extension of our algorithm to S.H.G.C.s is planned and will soon be available.
References:
[DHO-89] M. Dhome, M. Richetin, J.T. Lapresté & G. Rives, "Determination of the attitude of 3-D objects
    from a single perspective view", IEEE Trans. Pattern Anal. Machine Intell., vol. PAMI-11, no. 12,
    pp. 1265-1278, December 1989.
[DHO-90] M. Dhome, J.T. Lapresté, G. Rives & M. Richetin, "Spatial localization of modelled objects of
    revolution in monocular perspective vision", in Proc. European Conf. Comput. Vision, pp. 475-485,
    Antibes, France, April 1990.
[GLA-91] R. Glachet, M. Dhome, J.T. Lapresté, "Finding the perspective projection of an axis of revolution",
    Pattern Recognition Letters, vol. 12, pp. 693-700, October 1991.
[GLA-92] R. Glachet, M. Dhome, J.T. Lapresté, "Finding the Pose of an Object of Revolution", technical
    report R.92, January 1992.
[KRI-90] D.J. Kriegman & J. Ponce, "On recognizing and positioning curved 3D objects from image contours",
    IEEE Trans. Pattern Anal. Machine Intell., vol. PAMI-12, no. 12, pp. 1127-1137, December 1990.
[HUT-87] D.P. Huttenlocher & S. Ullman, "Object recognition using alignment", in Proc. Int. Conf. Comput.
    Vision, pp. 102-111, London, England, June 1987.
[LAU-87] P.J. Laurent, "Courbes ouvertes ou fermées par B-Splines régularisées", rapport de recherche RR
    652-M, Informatique et Mathématiques Appliquées de Grenoble, France, March 1987. (In French)
[LAV-90] J.M. Lavest, R. Glachet, M. Dhome, J.T. Lapresté, "Objects of revolution: Reconstruction using
    monocular vision", submitted to IEEE Trans. Pattern Anal. Machine Intell., 1991.
[LOW-85] D.G. Lowe, Perceptual Organization and Visual Recognition, MA: Kluwer, ch. 7, 1985.
[MAR-82] D. Marr, Vision, ed. Freeman, pp. 215-233, 1982.
[PON-89] J. Ponce, D. Chellberg & W. Mann, "Invariant properties of straight homogeneous generalized
    cylinders and their contours", IEEE Trans. Pattern Anal. Machine Intell., vol. PAMI-11, no. 9, pp.
    951-966, September 1989.
[PON-89b] J. Ponce & D.J. Kriegman, "On recognizing and positioning curved 3D objects from image contours",
    in Proc. IEEE Workshop on Interpretation of 3D Scenes, pp. 61-67, Austin, Texas, November 1989.
[RIC-85] M. Richetin, J.T. Lapresté & M. Dhome, "Recognition of conics in contours using their geometrical
    properties", in Proc. of Conf. Comput. Vision Pattern Recognition, pp. 464-469, San Francisco,
    California, June 1985.
[RIC-91] M. Richetin, M. Dhome, J.T. Lapresté & G. Rives, "Inverse perspective transform using zero-
    curvature contour points: Application to the localization of some generalized cylinders from a single
    view", IEEE Trans. Pattern Anal. Machine Intell., vol. PAMI-13, no. 2, pp. 185-192, February 1991.
[ULU-90] F. Ulupinar & R. Nevatia, "Shape from contour: Straight homogeneous generalized cones", in Proc.
    Int. Conf. Comput. Vision, pp. 582-586, Osaka, Japan, December 1990.
Extraction of Line Drawings from Gray Value Images
by Non-Local Analysis of Edge Element Structures
1 Introduction
Edge element detection is a basic task in computer vision. The results of a subsequent
chaining of neighbouring edge elements are used for a symbolic description. Since the
performance of high-level vision tasks based on edge pictures depends on the quality of
those edge pictures, it is important to detect good edge pictures. This means minimizing
the number of edge elements due to noise while detecting edge elements along grey-level
transitions even in low-contrast image regions.
To obtain good edge pictures one can enhance the original picture and choose an
appropriate edge detector or enhance the detected edge picture.
If the edge detection and enhancement process is part of a real-time application, it
must be guaranteed that the output of the edge detection and enhancement process has
a constant delay with respect to the input data. The approach developed in this article
satisfies this real-time constraint.
The input of our algorithm is a picture of edge elements represented by their location,
gradient magnitude, and orientation. A brief review of edge detection approaches is followed
by a discussion of the thresholding problem and of other approaches concerned with
the enhancement of detected edge element pictures based on locally observed properties.
Many authors have been concerned with developing a robust edge detector. [Johnson 90]
expects a better response of the edge detector in image regions of low contrast
by contrast-dependent edge detection. [Canny 86] suggests smoothing the original
image with Gaussians whose scale parameters depend on the signal-to-noise ratio. [Korn 88]
uses normalized first directional derivatives of a Gaussian to compute the gradient
magnitude and orientation. Edge elements are selected if they correspond to a magnitude
maximum in the gradient direction.
Good noise reduction and high positional accuracy can be achieved by coarse-to-fine
tracking of edge elements in scale-space approaches (e.g. [Witkin 83], [Bergholm 87]
and [Williams & Shah 90]). The scale-space technique presented in [Perona & Malik 90]
should prevent edge shifting by using different coefficients to encourage intraregion
smoothing in preference to interregion smoothing.
To reduce the number of noisy edge elements, many approaches employ gradient magnitude
thresholding. This often has the consequence that edge elements in regions of low
contrast can no longer be detected.
[Abdou & Pratt 79] suggest computing a threshold by applying the Bayes rule.
[Zuniga & Haralick 88] and [Haddon 88] assume Gaussian noise with known parameters
and determine a suitable threshold by evaluating the mean and standard deviation of the
noise distribution. The hysteresis thresholding of [Canny 86] uses two thresholds. Edge
elements with gradient magnitude greater than the higher threshold value are considered
to be true edge elements. All connected edge elements which contain at least one true
edge element and whose edge elements lie above the lower threshold are also taken to
represent true edge elements. [Kundu & Pal 86] select different thresholds in image regions
of different brightness.
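As a sketch, hysteresis thresholding on a gradient-magnitude image can be written as a simple breadth-first flood fill from the strong pixels. This is an illustrative re-implementation, not Canny's original code; 8-connectivity is assumed:

```python
from collections import deque
import numpy as np

def hysteresis_threshold(mag, low, high):
    """Keep pixels above `high` plus any pixel above `low` that is connected
    (8-neighbourhood) to one of them; everything else is discarded."""
    strong = mag > high
    candidate = mag > low
    keep = np.zeros_like(strong)
    keep[strong] = True
    q = deque(zip(*np.nonzero(strong)))     # seed the flood fill at strong pixels
    H, W = mag.shape
    while q:
        y, x = q.popleft()
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                ny, nx = y + dy, x + dx
                if 0 <= ny < H and 0 <= nx < W and candidate[ny, nx] and not keep[ny, nx]:
                    keep[ny, nx] = True
                    q.append((ny, nx))
    return keep
```

Isolated low-contrast responses are suppressed, while low-contrast continuations of strong edges survive, which is exactly the trade-off a single uniform threshold cannot make.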
Figure 5 shows two parallelepipeds and the edge picture containing all maxima of
the gradient magnitude in gradient direction. One can recognize the edge elements of
the objects, but the entire edge picture is swamped by edge elements due to noise.
Thresholding the gradient magnitude at a value of 4 (third image of Figure 5) results in
an almost noise-free edge picture but also in breaking up contours.
The goal of simultaneously having a minimum number of edge elements due to noise
and unbroken contours cannot be achieved with one uniform threshold value.
The publications discussed in the following deal with the enhancement of detected
edge pictures. [Zucker et al. 77] and [Hancock & Kittler 90] use relaxation algorithms,
which are iterative in nature and therefore not suitable for real-time applications. An
overview of relaxation algorithms is given in [Kittler & Illingworth 85].
In the article of [Sakai et al. 69], the enhancement of edge pictures is effected by
considering orientation information. Their first step consists in thresholding the
gradient magnitude and evaluating the direction of steepest descent in order to preserve
edge elements in local environments around open edge segments: the gradient magnitudes
of the neighbouring pixels are added up depending on their gradient direction and the
expected direction of the new line segment. If this sum exceeds another threshold, the
current pixel is considered to represent the location of an edge element. Since the local
environment examined here consists only of 3x3 pixels and all pixels passing the second
threshold test are marked, new edge segments may have widths of more than one pixel.
A heuristic approach is proposed by [McKee & Aggarwal 75]. An open edge segment
is extended by up to eight edge elements in the direction of the end of the open segment. If
this does not close a gap, the extended edge segment is shortened by twelve edge elements
and a new attempt at an extension is started. The extension/shortening process is
finished when a gap is closed or the entire segment vanishes.
[Chen & Siy 87] increase the contrast of local image regions around open edge segments
in an iterative manner to make the edge detection process more sensitive. [Haralick & Lee 90]
compute the probability of an edge element for each pixel. An edge element
evaluation function with feedback is used to estimate the parameters for an adaptive
edge detection process.
[Hayden et al. 87] try to close gaps by observing image sequences. Gaps can
only be closed by their approach if a corresponding closed segment can be found in the
previous and in the following image.
In the article of [Canning et al. 88] each pixel of the edge picture is observed within
a 3x3 environment with different gradient magnitude thresholds. This leads to sets of
edge element configurations for each pixel. Neighbouring pixels must have overlapping
masks of the resulting edge element configurations to close gaps. The problems with such
an approach are that the assignment between neighbouring edge elements is not unique,
that the thinning does not guarantee a width of one pixel, and that the resulting
combinatorial explosion makes the computational cost high.
[Deriche et al. 88] suggest thresholding with a high gradient magnitude to suppress all
noisy edge elements and closing the gaps afterwards. The edge picture is scanned until
an open edge segment is found. If an open segment is recognized the filling algorithm
starts. To build a closing path they consider the three topological neighbours of the
open end. The new candidates for edge elements are successively extended by their three
topological neighbours. This approach results in a tree with the open end edge element
of the original segment as a root and whose branches represent all possible closing paths.
This recursive algorithm stops if the length of the path exceeds a maximum length or the
path encounters an edge element. If closing paths exist then the best path is computed
based on an evaluation of length and gradient magnitudes along it. This approach has
the disadvantage that the time to compute all closing paths grows exponentially in the
gap length. Moreover, closing paths may be circular and cross themselves.
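A minimal sketch of such a closing-path tree search follows; the neighbour ordering, the scoring function and all names are our assumptions, not the code of [Deriche et al. 88]:

```python
DIRS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]

def forward_neighbours(p, d):
    # The three topological neighbours of pixel p ahead of direction d:
    # the straight continuation and its two 45-degree companions.
    i = DIRS.index(d)
    return [((p[0] + DIRS[j][0], p[1] + DIRS[j][1]), DIRS[j])
            for j in ((i - 1) % 8, i, (i + 1) % 8)]

def best_closing_path(start, d, edges, grad, max_len=4):
    # Grow a tree of candidate paths from the open end `start`; a branch
    # terminates when it meets an existing edge element or exceeds max_len.
    # Note the exponential (3**max_len) growth criticised in the text.
    paths = []
    def grow(path, d):
        if len(path) - 1 >= max_len:
            return
        for q, dq in forward_neighbours(path[-1], d):
            if q in edges:
                paths.append(path + [q])     # closing path found
            else:
                grow(path + [q], dq)
    grow([start], d)
    # Evaluate paths by average gradient magnitude along them (assumed score).
    score = lambda p: sum(grad.get(q, 0) for q in p[1:-1]) / max(len(p) - 1, 1)
    return max(paths, key=score) if paths else None
```

Nothing in the sketch prevents a branch from revisiting pixels, which illustrates the remark above that closing paths may be circular and cross themselves.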
The described enhancement approaches for detected edge pictures almost always
incur high computational costs. Their results are not always convincing; moreover,
unwanted new edge segments may arise.
A real-time enhancement of edge pictures is described in [Otte 90]. A window of size
5x5 is moved line by line over the entire edge picture. The edge or non-edge element in the
center of the window is changed or left unchanged depending on the edge element
configuration of this local environment. Each point of the window represents an edge or
non-edge element, which implies 2^25 ≈ 34 million different combinatorial possibilities. To
be able to describe all different configurations, [Otte 90] employs the predicate calculus
for low level vision applications. By means of this tool it is possible to give a formal model
which divides all combinations uniquely and completely into four equivalence classes. One
class leaves an edge element in the center unchanged, the second class eliminates an edge
element of the center, the third class adds an edge element in the center and the last
class leaves a non-edge element unchanged. This approach makes it possible to fill gaps
up to two pixels long in fragmented lines and vertices and to thin edge segments whose
width exceeds one.
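A single enhancement pass of this kind can be sketched as below. The `classify` callback stands in for the predicate-calculus model that assigns each of the 2^25 configurations to one of the four classes; the toy gap-filling classifier in the test is ours, not [Otte 90]'s:

```python
KEEP_EDGE, ERASE_EDGE, ADD_EDGE, KEEP_NON_EDGE = range(4)

def enhance(edge_img, classify):
    # Slide a 5x5 window over the binary edge picture and let `classify`
    # map each configuration (a 25-tuple) to one of the four equivalence
    # classes; only the center element of the window may change.
    h, w = len(edge_img), len(edge_img[0])
    out = [row[:] for row in edge_img]
    for r in range(2, h - 2):
        for c in range(2, w - 2):
            window = tuple(edge_img[r + i][c + j]
                           for i in range(-2, 3) for j in range(-2, 3))
            action = classify(window)
            if action == ERASE_EDGE:
                out[r][c] = 0
            elif action == ADD_EDGE:
                out[r][c] = 1
    return out
```

Because all decisions are read from the input image and written to a copy, the pass behaves like a parallel application of the classification, as a table lookup over the window configuration would.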
Many authors have worked on thinning algorithms, e.g., [Wang & Zhang 89], [Zhang &
Suen 84] or [Topa & Schalkoff 89]. Their thinning algorithms transform binary patterns
several pixels wide into skeletons. In edge pictures with edge elements defined as gradient
magnitude maxima in gradient direction, the width of edge segments lies in the range of
one to three. Therefore, the thinning of edge pictures can be done on the basis of edge
element configurations within a 5x5 environment.
The thinning process can be improved significantly if the gradient magnitude is taken
into account additionally as done by [Otte & Nagel 91a] who, moreover, describe a real-
time process for filling gaps of straight and curved lines which considers the gradient
magnitude and orientation within a 7x7-environment. An open edge segment is extended
by comparing the gradient orientation and the estimated normal direction of an expected
structure.
The algorithm described in Section 3 considers all edge elements without thresholding
the gradient magnitude. Edge elements thinned by the approach reported in [Otte &
Nagel 91a] are chained to obtain edge element chains. The properties of these edge
element chains allow one to distinguish between edge element chains due to noise and
those due to structures of the original image. The combination of observing edge element
chains instead of single edge elements with strictly local decision rules makes it possible
to take much more global aspects into account without losing real-time execution capability.
This section gives a brief overview of the work in [Otte 90] and [Otte & Nagel 91a].
To describe an edge enhancement process based on local structures we have to explain
what we call a local structure. A local structure is a straight or curved line, a corner
or a vertex. Structures in the form of lines are, e.g., straight line, ellipse, or circle segments.
Straight lines of Figure l(a) and (b) are called major straight lines, lines like (c) are
named minor straight lines. Structures in the form of (d) and (e) are called general lines
or simply a line. Each point of a major or minor straight line and (general) lines with
the exception of their endpoint(s) has exactly two neighbouring points.
Figure 1 also shows some examples of corners, where we distinguish type (f)
as an acute corner, (g) represents an orthogonal and (h) an obtuse corner, and (i) shows a
rounded corner, which simultaneously obeys the line criterion. Vertices can be simple
vertices (j), crossings (k) and (l), or multiple vertices (m). Crossings may be considered
as a combination of two simple vertices, whereas multiple vertices are characterised as
two or more vertices appearing in a close neighbourhood of each other. The structure
"endpoint" is an open end of a line.
Fig. 1. Local structures divided into major straight lines (a) and (b), minor straight lines
like (c) and (general) lines (d) and (e), acute corner (f), orthogonal corner (g), obtuse
corner (h) and rounded corner (i), vertices in form of T-vertices (j), crossings built from
two T-vertices (k) or from Y- and W-vertices (l) and multiple vertices (m).
The structures presented above - including analogies obtained by rotation and
reflection - are the basis of our edge picture enhancement approach. The goal of the
enhancement is the preservation and restoration of disturbed structures and the
elimination of the influence of noise.
2.1 Equivalence classes
Fig. 2. Layers of the model (layers 0 to 6).

The enhancement of edge pictures in [Otte 90] is based on an edge element
configuration within a window of 5x5 pixels which is shifted over the edge picture. It
yields 2^25 ≈ 34 million different edge element configurations. In order to obtain a
unique, complete and consistent description it is necessary to divide all configurations
into four equivalence classes. One class leaves an edge element in the center unchanged,
the second class eliminates edge elements, the third class adds an edge element in the
center and the last class leaves a non-edge element unchanged. [Otte 90] introduced
predicate calculus for low level vision applications in order to provide a tool whereby
each of the 34 million masks of edge element configurations is assigned to exactly one
equivalence class.

The construction of the presented structures is subdivided into seven layers - see
Figure 2. First of all, each point of a 5x5-mask needs to be labeled. Within a mask,
a point represents either an edge or a non-edge
element: we call this fact the state of a point. Single mask points together with their states
are called objects. On layer 0, the mask points do not have any relationships between
them, but they are the basis for the construction of composite objects by introducing
relationships, for example neighbourhoods, in the form of rules. Objects of the lowest
layer are also called basic objects.
Objects of layer n > 1 are composed of objects from lower layers, containing at least
one object of layer n - 1. Therefore, objects of layer n > 1 consist of at least two basic
objects. Objects with similar properties are combined into sets, for example the set of
orthogonal edges.
The model of layers in Figure 2 shows the sets which are used to build structures up to
the four equivalence classes, where objects at the different layers represent substructures
and disturbances. A detailed explanation is given in [Otte 90]. Here we can give only a
coarse overview.
Layer 0 contains the basic object set G of all edge elements and non-edge elements
within a 5x5 mask. This set is subdivided into regional sets Qw, Qn, Qo and Qs which
contain edge elements in the western, northern, eastern³ and southern parts of the mask.
The set Mp contains all edge elements and NMp all non-edge elements.
Layer 1 contains connections from the center to the boundary of the mask in the form
of straight connections in the set Rs, curved connections Ra, orthogonal connections Ro
and long connections in the set Rl. The sets Nh, Nv, Ns and Nf contain horizontal,
vertical, diagonally increasing and diagonally decreasing neighbours, respectively.
With layer 2, the first structures are described. The sets Hgh, Hgv, Hgf, and Hgs
represent major straight lines and the sets Nghf, Nghs, Ngvf, and Ngvs minor straight
lines, respectively, based on their orientation. The set E combines orthogonal corners and
the set Np all neighbours of a given mask point p.
On layer 3, the first disturbed edge element structures are explicitly modeled. For
example, the set Iso contains objects of isolated edge elements which represent an edge
element with no neighbouring edge element within the 5x5 mask. The sets Ph and Pn
consist of potential major and minor straight lines which can be related to straight lines
by filling gaps. Due to lack of space, we cannot explain all sets.
To show the efficacy of the predicate calculus, the set E of orthogonal corners is
explained in more detail:

    E := {x ∈ (Rs ⊕ Rs) | e(x)}
    e(x) :⇔ (∃y ∈ (Rs ∩ Nh) ∃z ∈ (Rs ∩ Nv) : x = y ⊕ z) ∨
            (∃y ∈ (Rs ∩ Nf) ∃z ∈ (Rs ∩ Ns) : x = y ⊕ z)
An object x is called an orthogonal corner if it satisfies the predicate e(x). e(x) is true
if x consists (expressed as x = y ⊕ z) of two straight connections y, z ∈ Rs. The objects y
and z must either lie in the sets of horizontal and vertical or in the sets of diagonally
increasing and decreasing neighbours. The exact definitions of the symbols, sets, and predicates used
here are given in [Otte 90].
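Rendered procedurally, the predicate reads as below. This is a loose transcription under our own simplifying assumption that a straight connection is a run of edge elements from the mask center to its border; the set names follow [Otte 90]'s notation but the code is ours:

```python
def straight_connection(mask, step):
    # True if the center of the 5x5 boolean mask is connected to the
    # boundary by edge elements along the unit direction `step`
    # (a simplified stand-in for membership of Rs).
    r = c = 2
    for _ in range(2):                  # two steps reach the mask border
        r, c = r + step[0], c + step[1]
        if not mask[r][c]:
            return False
    return True

def orthogonal_corner(mask):
    # Predicate e(x): the center joins one horizontal and one vertical
    # straight connection (Rs ∩ Nh with Rs ∩ Nv), or one diagonally
    # increasing and one diagonally decreasing connection (Rs ∩ Ns/Nf).
    if not mask[2][2]:
        return False
    horiz = any(straight_connection(mask, s) for s in ((0, 1), (0, -1)))
    vert  = any(straight_connection(mask, s) for s in ((1, 0), (-1, 0)))
    inc   = any(straight_connection(mask, s) for s in ((1, 1), (-1, -1)))
    dec   = any(straight_connection(mask, s) for s in ((1, -1), (-1, 1)))
    return (horiz and vert) or (inc and dec)
```

A straight horizontal line through the mask thus fails the predicate (no vertical leg), while an L-shaped configuration satisfies it.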
With the four equivalence classes of layer 6 it is possible to fill gaps of up to two
pixels as well as to thin lines of more than one pixel width and to reduce the number of
noisy edge elements.
One disadvantage of the thinning process applied to binary edge pictures is the fact
that it cannot be decided which edge element of lines with more than one pixel width to
choose in order to obtain edge elements located as closely as possible to the expected
edge line. This problem is solved in [Otte & Nagel 91a] by taking the gradient magnitude
into account. If an edge element configuration of a 5x5 window corresponds to an object of
the equivalence class "erase edge element", a subtask is activated which decides whether
the edge element should be preserved or not. The edge element in the center of the
window is preserved if the following conditions are satisfied:
1. There exists a north-south connection and an edge element to the right of the center
or there exists a west-east connection and an edge element below the center.
2. There exists a north-south connection and an edge element to the right of the center
³ Symbols refer to abbreviations of German terms, e.g. the index "o" in Qo stands for "Ost".
To distinguish between noise and structural edge element chains, the length, the sum of
gradient magnitudes and the orientational standard deviation have to be compared with
desired values. This leads to the following definition:
Definition 3.1 (Validity of edge element chains) Let the observed edge element
chain be k with length lk. Sk is the sum of gradient magnitudes, Ek(dk) is the
orientational mean and sk is the orientational standard deviation. τ1 is the desired
minimum value of the average edge element chain gradient magnitude, τ2 is the desired
minimum length of edge element chains and τ3 is the desired maximum orientational
standard deviation. The observed edge element chain is valid if the following conditions
are fulfilled:

    lk = 1:        Sk ≥ τ1                                      (1)

    2 ≤ lk ≤ 4:    Sk / lk ≥ τ1   and   sk ≤ τ3                 (2)

    lk ≥ 5:        ( Sk / (τ1 τ2) )² ≥ sk / τ3                  (3)
Condition 1 is needed for edge element chains containing exactly one edge element. They
have an adjacent vertex, because isolated edge elements are removed by the thinning
process of [Otte & Nagel 91a]. For those candidates, the traditional gradient magnitude
threshold must be used.
It is necessary to distinguish between edge element chains with up to four edge
elements and chains with more than four edge elements, because otherwise the computation
of the orientational mean and standard deviation of short edge element chains may take
too few values into account. Condition 2 preserves those edge element chains with average
gradient magnitude greater than or equal to the desired value and with an orientational
standard deviation below a given threshold.
In Condition 3 the relation between average gradient magnitude and desired minimum
value has more influence due to the quadratic term, which is the quotient of the sum of
gradient magnitudes divided by the product of the desired minimum values for length and
average gradient magnitude. The product τ1 τ2 gives a desired minimum value for the sum
of gradient magnitudes. The left part of Inequality 3 is a measure for the difference
between a desired and an observed chain. If the left part is greater than or equal to one,
then a greater standard deviation will be allowed and vice versa. The quadratic term
prefers shorter edge element chains of high contrast over longer edge element chains
of lower contrast. Condition 3 considers more than one parameter with different
weights. Therefore it is better to speak of desired values in a control theoretic sense
instead of thresholds.
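Definition 3.1 translates directly into code. The sketch below simplifies the orientational statistics to a plain (non-circular) mean and standard deviation, and the τ defaults are the values quoted in Section 4 (with τ3 read as π); both simplifications are our assumptions:

```python
import math

def chain_valid(mags, dirs, tau1=6.0, tau2=23.0, tau3=math.pi):
    # `mags`: gradient magnitudes of the chain's edge elements;
    # `dirs`: their gradient orientations in radians (non-circular
    # statistics used here for simplicity).
    lk = len(mags)
    sk_sum = sum(mags)                      # Sk
    if lk == 1:
        return sk_sum >= tau1               # condition (1)
    mean = sum(dirs) / lk
    std = math.sqrt(sum((d - mean) ** 2 for d in dirs) / lk)  # sk
    if lk <= 4:
        return sk_sum / lk >= tau1 and std <= tau3            # condition (2)
    return (sk_sum / (tau1 * tau2)) ** 2 >= std / tau3        # condition (3)
```

For long chains the quadratic term rewards high total contrast: a chain whose magnitude sum exceeds τ1 τ2 tolerates a proportionally larger orientational scatter.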
Definition 3.2 (Validity of vertices) A vertex v is valid if the vertex is connected
to at least one valid adjacent edge element chain or it is linked to a neighbouring vertex
marked as valid.
4 Results
This section illustrates results of the validation process of edge element chains as defined
in the last section. The next four pictures are part of an image sequence taken from
[Koller et al. 91]. For the current version, the desired value of average gradient magnitude
is τ1 = 6, the desired length is τ2 = 23 and the desired orientational standard deviation
is set to τ3 = π. The results are compared with the corresponding edge pictures with
gradient magnitude threshold equal to τ1.
Comparing the two edge images of Figure 4, one can see that parts of the curb below
the entrance building and the right border of the road in front of the barrier are better
preserved by applying the new approach. Furthermore we observe fewer edge elements due
to noise with the chain based edge enhancement process than by thresholding (e.g. the
roof of the entrance building or the road surface).
Fig. 5. Two parallelepipeds and the corresponding edge pictures without thresholding and
with the gradient magnitude thresholded at 4, and the edge picture of the new approach.
The contours of the two parallelepipeds are much better preserved by the new al-
gorithm with a simultaneous weakening of double edge lines due to signal overshooting
of the video camera. But it has not yet been possible to preserve the entire top right
horizontal edge segment of the left parallelepiped.
We have deliberately shown this last example in order to demonstrate both the
possibilities and the limits of the current version of our approach. Based on the
experience accumulated throughout the investigations which yielded the results presented
here, we are confident that we can improve this approach further.
Acknowledgement
This work was supported in part by the Basic Research Action INSIGHT of the European
Community. We thank D. Koller and V. Gengenbach for providing us with the grey-level
images appearing in Figures 4 and 5, respectively. We also thank K. Daniilidis for his
comments on a draft version of this contribution.
References
[Abdou & Pratt 79] I.E. Abdou, W.K. Pratt, Qualitative design and evaluation of enhance-
ment/thresholding edge detectors, Proceedings of the IEEE 67 (1979) 753-763.
[Bergholm 87] F. Bergholm, Edge focusing, IEEE Trans. Pattern Analysis and Machine Intel-
ligence PAMI-9 (1987) 726-741.
[Canning et al. 88] J. Canning, J.J. Kim, N. Netanyahu, A. Rosenfeld, Symbolic pixel labeling
for curvilinear feature detection, Pattern Recognition Letters 8 (1988) 299-310.
[Canny 86] J. Canny, A computational approach to edge detection, IEEE Trans. Pattern Anal-
ysis and Machine Intelligence PAMI-8 (1986) 679-698.
[Chen & Siy 87] B.D. Chen, P. Siy, Forward/backward contour tracing with feedback, IEEE
Trans. Pattern Analysis and Machine Intelligence PAMI-9 (1987) 438-446.
[Deriche et al. 88] R. Deriche, J.P. Cocquerez, G. Almouzny, An efficient method to build early
image description, Proc. Int. Conf. on Pattern Recognition, Rome, Italy, Nov. 14-17, 1988,
pp. 588-590.
[Haddon 88] J. Haddon, Generalized threshold selection for edge detection, Pattern Recognition
21 (1988) 195-203.
[Hancock & Kittler 90] E.R. Hancock, J. Kittler, Edge labeling using dictionary-based relax-
ation, IEEE Trans. Pattern Analysis and Machine Intelligence PAMI-12 (1990) 165-181.
[Haralick & Lee 90] R.M. Haralick, J. Lee, Context dependent edge detection and evaluation,
Pattern Recognition 23 (1990) 1-19.
[Hayden et al. 87] C.H. Hayden, R.C. Gonzales, A. Ploysongsang, A temporal edge-based image
segmentor, Pattern Recognition 20 (1987) 281-290.
[Johnson 90] R.P. Johnson, Contrast based edge detection, Pattern Recognition 23 (1990) 311-
318.
[Kittler & Illingworth 85] J. Kittler, J. Illingworth, Relaxation labelling algorithms - a review,
Image and Vision Computing 3 (1985) 206-216.
[Koller et al. 91] D. Koller, N. Heinze, H.-H. Nagel, Algorithmic characterization of vehicle tra-
jectories from image sequences by motion verbs, Proc. IEEE Conf. Computer Vision and
Pattern Recognition, Lahaina, Maui, Hawaii, June 3-6, 1991, pp. 90-95.
[Korn 88] A.F. Korn, Toward a Symbolic Representation of Intensity Changes in Images, IEEE
Trans. Pattern Analysis and Machine Intelligence PAMI-10 (1988) 610-625.
[Kundu & Pal 86] M.K. Kundu, S.K. Pal, Thresholding for Edge Detection Using Human Psy-
chovisual Phenomena, Pattern Recognition Letters 4 (1986) 433-441.
[McKee & Aggarwal 75] J.W. McKee, J.K. Aggarwal, Finding edges of the surface of 3-D curved
objects by computer, Pattern Recognition 7 (1975) 25-52.
[Otte 90] M. Otte, Entwicklung eines Verfahrens zur schnellen Verarbeitung von Kanten-
element-Ketten, Diplomarbeit, Institut für Algorithmen und Kognitive Systeme, Fakultät
für Informatik der Universität Karlsruhe (TH), Karlsruhe, Deutschland, Mai 1990.
[Otte & Nagel 91a] M. Otte, H.-H. Nagel, Prädikatenlogik als Grundlage für eine videoschnelle
Kantenverbesserung, Interner Bericht, Institut für Algorithmen und Kognitive Systeme,
Fakultät für Informatik der Universität Karlsruhe (TH), Karlsruhe, Deutschland, August
1991.
[Otte & Nagel 91b] M. Otte, H.-H. Nagel, Extraktion von Strukturen aus Kantenelementbildern
durch Auswertung von Kantenelementketten, Interner Bericht, Institut für Algorithmen und
Kognitive Systeme, Fakultät für Informatik der Universität Karlsruhe (TH), Karlsruhe,
Deutschland, September 1991.
[Perona & Malik 90] P. Perona, J. Malik, Scale-space and edge detection using anisotropic
diffusion, IEEE Trans. Pattern Analysis and Machine Intelligence PAMI-12 (1990) 629-639.
[Sakai et al. 69] T. Sakai, M. Nagao, S. Fujibayashi, Line extraction and pattern detection in a
photograph, Pattern Recognition 1 (1969) 233-248.
[Topa & Schalkoff 89] L.C. Topa, R.J. Schalkoff, Edge Detection and Thinning in Time-Varying
Image Sequences Using Spatio-Temporal Templates, Pattern Recognition 22 (1989) 143-
154.
[Wang & Zhang 89] P.S.P. Wang, Y.Y. Zhang, A Fast and Flexible Thinning Algorithm, IEEE
Transactions on Computers 38 (1989) 741-745.
[Williams & Shah 90] D.J. Williams, M. Shah, Edge contours using multiple scales, Computer
Vision, Graphics, and Image Processing 51 (1990) 256-274.
[Witkin 83] A.P. Witkin, Scale-space filtering, International Joint Conf. Artificial Intelligence,
Karlsruhe, Germany, Aug. 8-12, 1983, pp. 1019-1021.
[Zhang & Suen 84] T.Y. Zhang, C.Y. Suen, A Fast Parallel Algorithm for Thinning Digital
Patterns, Communications of the ACM 27 (1984) 236-239.
[Zucker et al. 77] S.W. Zucker, R.A. Hummel, A. Rosenfeld, An application of relaxation label-
ing to line and curve enhancement, IEEE Trans. on Computers C-26 (1977) 394-403.
[Zuniga & Haralick 88] O. Zuniga, R. Haralick, Gradient threshold selection using the facet
model, Pattern Recognition 21 (1988) 493-503.
A method for the 3D reconstruction of indoor scenes
from monocular images
Fig. 1. The recovery of a line-drawing. A: an image of 512x512 pixels acquired with a Panasonic
camera and digitalised with a FG100 Imaging Technology board. B: the segments obtained
with a polygonal approximation of the edges extracted with a Canny filter. C: the line-drawing
with labelled junctions (L, T, Y and X). The thresholds used to merge the segments are: ±5°
(collinearity), 8 pixels (adjacency), 50 pixels (distance). D: the final line-drawing after the recursive
deletion of unconnected segments. The L junctions are detected if two vertices are closer than
7 pixels.
2 Extraction of polygons
Using the line-drawing of Fig. 1D, it is possible to extract maximal simple polygons (see
[S1]), which are the perspective projection on the image of planar surfaces with similar
attitude in space. Each simple polygon may be labelled with a different orientation; this
depends on the attitude in space of the projected planar surfaces. Simple polygons in
images of scenes belonging to Legoland can have, at most, three different orientations.
Fig. 3B shows the polygons extracted from the line-drawing obtained from the image of
Fig. 3A; the three different textures correspond to horizontal, vertical and planar surfaces,
while white regions correspond to complex polygons.
3 Detection of the dimensions of the corridor
The 3D structure of viewed scenes is described by simply using 3D boxes, the largest
box corresponding to the empty corridor and other boxes representing different objects
or obstacles. The algorithm able to extract this information can be divided into three
main steps:
1. identification, on the image, of the bottom end of the corridor (see Fig. 3C).
2. identification, on the image, of the lines separating the floor and walls, and those
separating the ceiling and walls (see Fig. 3D).
3. validation of the consistency of the first two steps.
By assuming that the distance of the optical center of the viewing camera from the
floor is known, it is possible to make an absolute estimate of the size of the box in Figs.
3E and F. The image of Fig. 3A was acquired with an objective having a focal length
of 8 mm and the T.V. camera placed at 115 cm from the floor. The estimate of 195 cm
for the width of the corridor (the true value is 200 cm) can be obtained by using simple
trigonometry.
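The trigonometry is the standard pinhole ground-plane back-projection. The sketch below uses generic pinhole quantities (focal length in pixels, image coordinates relative to the principal point, horizontal optical axis) and is our reconstruction, not the authors' calibration code:

```python
def floor_point_3d(x_px, y_px, f_px, cam_height):
    # Back-project an image point known to lie on the floor; y_px is
    # positive downwards from the principal point, cam_height is the
    # camera's height above the floor in metres.
    assert y_px > 0, "floor points project below the principal point"
    depth = f_px * cam_height / y_px      # similar triangles: Z = f * h / y
    lateral = x_px * depth / f_px         # X = x * Z / f
    return lateral, depth

def corridor_width(x_left, x_right, y_px, f_px, cam_height):
    # Width from the image positions of the two floor/wall lines,
    # measured at the same image row y_px.
    xl, _ = floor_point_3d(x_left, y_px, f_px, cam_height)
    xr, _ = floor_point_3d(x_right, y_px, f_px, cam_height)
    return xr - xl
```

With a camera 1.15 m above the floor, the two floor/wall lines observed 170 pixels either side of the image center on a row 200 pixels below it (assumed focal length 800 pixels) yield a width close to the 1.95 m quoted above.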
4 Detection of obstacles
When the largest 3D box corresponding to the empty corridor has been detected it is
useful to detect and localize other objects or obstacles, such as filing cabinets and drawers.
The algorithm for the detection of these boxes is divided into four steps:
1. detection of polygons which are good candidates for the frontal panel of an obstacle,
for example polygons a, b, c and d in Fig. 3B.
2. validation of the candidates.
3. a 3D box is associated with each validated polygon using a procedure which is very
similar to that used in constructing the 3D box associated with the empty corridor.
4. the consistency of the global 3D structure of the scene is checked, that is to say all
obstacles must be inside the corridor.
Figs. 3E and F reproduce two views of the 3D structure of the scene of image 3A. It
is evident that the global 3D structure of viewed corridors is well described by the boxes
illustrated.
Conclusion
The algorithm described in this paper seems to be efficient for the recovery of the 3D
structure of indoor scenes from one image or a sequence of images. The proposed algorithm
produced good results for different corridors under a variety of lighting conditions and
scene complexities.
Similar procedures can be used in order to determine the presence of, and locate, other
rectangular objects, such as cabinets, boxes, tables, ... Therefore, when a sequence of
many images is available it is possible to obtain an accurate and robust 3D description of
the scene by exploiting geometrical properties of Legoland and by using a simple Kalman
filter.
Acknowledgements
We wish to thank Dr. M. Campani, Dr. E. De Micheli and Dr. A. Verri for helpful
suggestions on the manuscript. Cristina Rosati typed the manuscript and Clive Prestt
checked the English. This work was partially supported by grants from the EEC (ESPRIT
II VOILA), E.B.R.A. Insight Project 3001, EEC BRAIN Project No. 88300446/JU1,
Progetto Finalizzato Trasporti PROMETHEUS, Progetto Finalizzato Robotica, Agenzia
Spaziale Italiana (ASI).
References
[B2] Barrow, H.G, Tenenbaum, J.M.: Interpreting line-drawings as three-dimensional surfaces.
Artif. Intell. 17 (1981) 75-116
[C1] Coelho C., Straforini M., Campani M.: A fast and precise method to extract vanishing
points, SPIE's International Symposia on Applications in Optical Science and Engineer-
ing, Boston 1990.
[H1] Haralick, R.M.: Using perspective transformation in scene analysis. Comput. Graphics
Image Process 13 (1980) 191-221
IS1] Straforini, M., Coelho, C., Campani, M., Torre V.: The recovery and understanding of
a line drawing from indoor scenes. PAMI in the press (1991)
Active Detection and Classification of Junctions
by Foveation with a Head-Eye System
Guided by the Scale-Space Primal Sketch *
A prevalent view of low-level visual processing is that it should provide a rich but sparse
representation of the image data. Typical features in such representations are edges, lines,
bars, endpoints, blobs and junctions. There is a wealth of techniques for deriving such
features, some based on firm theoretical grounds, others heuristically motivated. Never-
theless, one may infer from the never-ending interest in e.g. edge detection and junction
and corner detection, that current methods still do not supply the representations needed
for further processing. The argument we present in this paper is that in an active system,
which can focus its attention, these problems become rather simplified and therefore
allow for robust solutions. In particular, simulated foveation¹ can be used to avoid
the difficulties that arise from multiple responses in processing standard pictures, which
are fairly wide-angled and usually of an overview nature.
We shall demonstrate this principle in the case of detection and classification of
junctions. Junctions and corners provide important cues to object and scene structure
(occlusions), but in general cannot be handled by edge detectors, since there will be
no unique gradient direction where two or more edges/lines meet. Of course, a number
of dedicated junction detectors have been proposed, see e.g. Moravec [15], Dreschler,
Nagel [4], Kitchen, Rosenfeld [9], Förstner, Gülch [6], Koenderink, Richards [10], Deriche,
Giraudon [3] and ter Haar et al [7]. The approach reported here should not be contrasted
to that work. What we suggest is that an active approach using focus-of-attention and
foveation allows for both simple and stable detection, localization and classification, and
in fact algorithms like those cited above can be used selectively in this process.
In earlier work [1] we have demonstrated that a reliable classification of junctions can
be performed by analysing the modalities of local intensity and directional histograms
during an active focusing process. Here we extend that work in the following ways:
- The candidate junction points are detected in regions and at scale levels determined
by the local image structure. This forms the bottom-up attentional mechanism.
* This work was partially performed under the ESPRIT-BRA project INSIGHT. The support
from the Swedish National Board for Industrial and Technical Development, NUTEK, is
gratefully acknowledged. We would also like to thank Kourosh Pahlavan, Akihiro Horii and
Thomas Uhlin for valuable help when using the robot head.
¹ By foveation we mean active acquisition of image data with a locally highly increased resolu-
tion. Lacking a foveated sensor, we simulate this process on our camera head.
- The analysis is integrated with a head-eye system allowing the algorithm to actually
take a closer look by zooming in to interesting structures.
- The loop is further closed, including an automatic classification. In fact, by using the
active visual capabilities of our head we can acquire additional cues to decide about
the physical nature of the junction.
In this way we obtain a three-step procedure consisting of (i) selection of areas of interest,
(ii) foveation and (iii) determination of the local image structure.
The motivation for this scheme is that, for example, in the neighbourhood of a point
where three edges join, there will generically be three dominant intensity peaks corre-
sponding to the three surfaces. If that point is a 3-junction (an arrow-junction or a Y-
junction) then the edge direction histogram will (generically) contain three main peaks,
while for a T-junction the number of directional peaks will be two etc. Of course, the
result from this type of histogram analysis cannot be regarded as a final classification
(since the spatial information is lost in the histogram accumulation), but must be treated
as a hypothesis to be verified in some way, e.g. by backprojection into the original data.
Therefore, this algorithm is embedded in a classification cycle. More information about
the procedure is given in [1].
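The modality analysis can be sketched as a peak count on the directional histogram. The peak criterion (a circular local maximum holding at least a fraction of the total mass) and the hypothesis table below are our assumptions standing in for the procedure of [1]:

```python
def count_modes(hist, min_frac=0.1):
    # Count dominant peaks in a circular edge-direction histogram: a bin
    # is a mode if it is a local maximum carrying at least `min_frac`
    # of the total mass (assumed criterion).
    n, total = len(hist), sum(hist)
    modes = 0
    for i, v in enumerate(hist):
        left, right = hist[(i - 1) % n], hist[(i + 1) % n]
        if v > left and v >= right and v >= min_frac * total:
            modes += 1
    return modes

def junction_hypothesis(direction_hist):
    # Map histogram modality to a junction-type hypothesis, to be verified
    # later, e.g. by backprojection into the original data.
    m = count_modes(direction_hist)
    return {2: "T-junction", 3: "3-junction (Y or arrow)"}.get(m, "unclassified")
```

As the text stresses, the spatial arrangement of the edges is lost in the accumulation, so the returned label is only a hypothesis entering the classification cycle.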
1.1 Context Information Required for the Focusing Procedure
Taking such local histogram properties as the basis for a classification scheme leads to
two obvious questions: Where should the window be located and how large should it be²?
We believe that the output from a representation called the scale-space primal sketch
[11, 12] can provide valuable clues for both these tasks. Here we will use it for two main
purposes. The first is to coarsely determine regions of interest constituting hypotheses
about the existence of objects or parts of objects in the scene and to select scale levels
for further analysis. The second is for detecting candidate junction points in curvature
data and to provide information about window sizes for the focusing procedure.
In order to estimate the number of peaks in the histogram, some minimum number
of samples will be required. With a precise model for the imaging process as well as the
2 This is a special case of the more general problem concerning how a visual system should be
able to determine where to start the analysis and at what scales the analysis should be carried
out, see also [13].
noise characteristics, one could conceive of deriving bounds on the resolution, at least in
some simple cases. Of course, directly setting a single window size immediately valid
for correct classification seems to be a very difficult or even impossible task: if
the window is too large, then structures other than the actual corner region around the
point of interest might be included in the window, and the histogram modalities would
be affected. Conversely, if it is too small, then the histograms, in particular the directional
histogram, could be severely biased and deviate far from the ideal appearance in case the
physical corner is slightly rounded -- a scale phenomenon that seems to occur commonly
in realistic scenes³.
Therefore, what we make use of instead is the process of focusing. Focusing means
that the resolution is increased locally in a continuous manner (even though we still have
to sample at discrete resolutions). The method is based on the assumption that stable
responses will occur for the models that best fit the data. This relates closely to the
systematic parameter variation principle described in [11], comprising three steps.
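As an illustration of this stability idea, one can pick out the model answer (here, a peak count) that persists over the longest run of window sizes. This is a minimal sketch under our own naming; the actual selection criteria are those described in [11] and [1].

```python
def stable_peak_count(counts_by_size):
    """Return the peak count that stays constant over the longest run of
    successive window sizes -- stable responses are assumed to indicate
    the best-fitting model."""
    best_count, best_len = None, 0
    run_val, run_len = None, 0
    for c in counts_by_size:
        if c == run_val:
            run_len += 1
        else:
            run_val, run_len = c, 1
        if run_len > best_len:
            best_count, best_len = run_val, run_len
    return best_count
```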
Several different types of corner detectors have been proposed in the literature. A problem
that has not received much attention, however, is at what scale(s) the
junctions should be detected. Corners are usually treated as pointwise properties and are
thereby regarded as very fine scale features.
In this treatment we will take a somewhat unusual approach and detect corners at
a coarse scale using blob detection on curvature data as described in [11, 13]. Realistic
corners from man-made environments are usually rounded. This means that small size
operators will have problems in detecting those from the original image.
Another motivation for this approach is that we would like to detect the interest points
at a coarser scale in order to simplify the detection and matching problems.
2.1 Curvature of Level Curves
Since we are to detect corners at a coarse scale, it is desirable to have an interest point
operator with good behaviour in scale-space. A quantity with reasonable such properties
is the rescaled level curve curvature, given by

κ̃ = L_x²L_yy + L_y²L_xx − 2L_xL_yL_xy.

This expression is basically equal to the curvature of a level curve multiplied by the
gradient magnitude⁴ so as to give a stronger response where the gradient is high. The
motivation behind this approach is that corners basically can be characterized by two
properties: (i) high curvature in the grey-level landscape and (ii) high intensity gradient.
Different versions of this operator have been used by several authors; see e.g. Kitchen and
Rosenfeld [9], Koenderink and Richards [10], Noble [16], Deriche and Giraudon [3], and
Florack, ter Haar Romeny et al. [5, 7].
3 This effect does not occur for an ideal (sharp) corner, for which the inner scale is zero.
4 Raised to the power of 3 (to avoid the division operation).
Figure l(c) shows an example of applying this operation to a toy block image at a
scale given by a significant blob from the scale-space primal sketch. We observe that the
operator gives strong response in the neighbourhood of corner points.
2.2 Regions of Interest -- Curvature Blobs
The curvature information is, however, still implicit in the data. Simple thresholding on
magnitude will in general not be sufficient for detecting candidate junctions. Therefore,
in order to extract interest points from this output we perform blob detection on the
curvature information using the scale-space primal sketch. Figure l(d) shows the result
Fig. 1. Illustration of the result of applying the (rescaled) level curve curvature operator at
a coarse scale. (a) Original grey-level image. (b) A significant dark scale-space blob extracted
from the scale-space primal sketch (marked in black). (c) The absolute value of the rescaled
level curve curvature computed at a scale given by the previous scale-space blob (this curvature
data is intended to be valid only in a region around the scale-space blob invoking the analysis).
(d) Boundaries of the 50 most significant curvature blobs (detected by applying the scale-space
primal sketch to the curvature data). (From Lindeberg [11, 13]).
of applying this operation to the data in Figure l(c). Note that a set of regions is extracted
corresponding to the major corners of the toy block. Note also that the support regions
of the blobs serve as natural descriptors for a characteristic size of a region around the
candidate junction. This information is used for setting (coarse) upper and lower bounds
on the range of window sizes for the focusing procedure.
A trade-off with this approach is that the estimate of the location of the corner will
in general be affected by the smoothing operation. Let us therefore point out that we
are here mainly interested in detecting candidate junctions, at the possible cost of poor
localization. A coarse estimate of the position of the candidate corner can be obtained
from the (unique) local maximum associated with the blob. Then, if improved localization
is needed, it can be obtained from a separate process using, for example, information from
the focusing procedure combined with finer scale curvature and edge information.
The discrete implementation of the level curve curvature is based on the scale-space for
discrete signals and the discrete N-jet representation developed in [11, 14]. The smoothing
is implemented by convolution with the discrete analogue of the Gaussian kernel. From
this data, low-order difference operators are applied directly to the smoothed grey-level
data, implying that only nearest-neighbour processing is necessary when computing the
derivative approximations. Finally, the (rescaled) level curve curvature is computed as a
polynomial expression in these derivative approximations.
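The computation just described might be sketched as follows. This is an illustrative fragment only: we substitute scipy's sampled Gaussian for the discrete analogue of the Gaussian kernel used in the paper, and the helper name is our own.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def rescaled_curvature(image, sigma):
    """Rescaled level curve curvature  Lx^2*Lyy + Ly^2*Lxx - 2*Lx*Ly*Lxy,
    computed with nearest-neighbour central differences on the smoothed
    grey-level data (sampled Gaussian used here as a stand-in for the
    discrete Gaussian analogue)."""
    L = gaussian_filter(image.astype(float), sigma)
    Lx  = (np.roll(L, -1, 1) - np.roll(L, 1, 1)) / 2.0   # central difference in x
    Ly  = (np.roll(L, -1, 0) - np.roll(L, 1, 0)) / 2.0   # central difference in y
    Lxx = np.roll(L, -1, 1) - 2 * L + np.roll(L, 1, 1)   # second differences
    Lyy = np.roll(L, -1, 0) - 2 * L + np.roll(L, 1, 0)
    Lxy = (np.roll(Lx, -1, 0) - np.roll(Lx, 1, 0)) / 2.0
    return Lx**2 * Lyy + Ly**2 * Lxx - 2 * Lx * Ly * Lxy
```

On a linear intensity ramp the response is zero in the interior, as expected, since all second derivatives vanish there.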
The procedure has been integrated with a head-eye system (see Figure 2 and Pahlavan, Ek-
lundh [17]) allowing for algorithmic control of the image acquisition.
Fig. 2. The KTH Head used for acquiring the image data for the experiments. The head-eye
system consists of two cameras mounted on a neck and has a total of 13 degrees of freedom. It
allows for computer-controlled positioning, zoom and focus of both the cameras independently
of each other.
The method we currently use for verifying the classification hypothesis (generated
from the generic cases in the table in Section 1, given that a certain number of peaks,
stable to variations in window size, have been found in the grey-level and directional
histogram respectively) is by partitioning a window (chosen as representative for the
focusing procedure [1, 2]) around the interest point in two different ways: (i) by back-
projecting the peaks from the grey-level histogram into the original image (as displayed
in the middle left column of Figure 5) and (ii) by using the directional information
from the most prominent peaks in the edge directional histograms for forming a simple
idealized model of the junction, which is then fitted to the data (see the right column
of Figure 5). From these two partitionings first and second order statistics of the image
data are estimated. Then, a statistical hypothesis test is used for determining whether
the data from the two partitionings are consistent (see [2] for further details).
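The consistency check can be illustrated by a much-simplified stand-in: a z-test on the difference of the means of corresponding regions. This simplification and all names are ours; the actual test is described in [2].

```python
import numpy as np

def regions_consistent(samples_a, samples_b, z_thresh=2.0):
    """Crude consistency check between corresponding regions of the two
    partitionings, using first- and second-order statistics of the data."""
    a = np.asarray(samples_a, dtype=float)
    b = np.asarray(samples_b, dtype=float)
    se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    if se == 0.0:
        return bool(a.mean() == b.mean())  # constant regions: compare directly
    z = abs(a.mean() - b.mean()) / se
    return bool(z < z_thresh)
```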
We will now describe some experimental results of applying the suggested methodology
to a scene with a set of toy blocks. An overview of the setup is shown in Figure 3(a). The
toy blocks are made out of wood with textured surfaces and rounded corners.
Fig. 3. (a) Overview image of the scene under study. (b) Boundaries of the 20 most significant
dark blobs extracted by the scale-space primal sketch. (c) The 20 most significant bright blobs.
Figures 3(b)-(c) illustrate the result of extracting dark and bright blobs from the
overview image using the scale-space primal sketch. The boundaries of the 20 most signif-
icant blobs have been displayed. This generates a set of regions of interest corresponding
to objects in the scene, faces of objects and illumination phenomena.
706
Fig. 4. Zooming in to a region of interest obtained from a dark blob extracted by the scale-space
primal sketch. (a) A window around the region of interest, set from the location and the size of
the blob. (b) The rescaled level curve curvature computed at the scale given by the scale-space
blob (inverted). (c) The boundaries of the 20 most significant curvature blobs obtained by
extracting dark blobs from the previous curvature data.
Fig. 5. Classification results for different junction candidates corresponding to the upper left,
the central and the lower left corner of the toy block in Figure 4 as well as a point along the
left edge. The left column shows the maximum window size for the focusing procedure, the
middle left column displays back projected peaks from the grey-level histogram for the window
size selected as representative for the focusing process, the middle right column presents line
segments computed from the directional histograms and the right column gives a schematic
illustration of the classification result, the abstraction, in which a simple (ideal) corner model
has been adjusted to data. (The grey-level images have been stretched to increase the contrast).
In Figure 4 we have zoomed in to one of the dark blobs from the scale-space primal
sketch corresponding to the central dark toy block. Figure 4(a) displays a window around
that blob, indicating the current region of interest. The size of this window has been set
from the size of the blob. Figure 4(b) shows the rescaled level curve curvature computed at
the scale given by the blob, and Figure 4(c) the boundaries of the 20 most significant
curvature blobs extracted from the curvature data.
In Figure 5(a) we have zoomed in further to one of the curvature blobs (corresponding
to the upper left corner of the dark toy block in Figure 4(c)) and initiated a classification
procedure. Figures 5(b)-(d) illustrate a few output results from that procedure, which
classified the point as being a 3-junction. Figures 5(e)-(l) show similar examples for two
other junction candidates (the central and the lower left corners) from the same toy
block. The interest point in Figure 5(e) was classified as a 3-junction, while the point in
Figure 5(i) was classified as an L-junction. Note the weak contrast between the two front
faces of the central corner in the original image. Finally, Figures 5(m)-(p) in the bottom
row indicate the ability to suppress "false alarms" by showing the results of applying the
classification procedure to a point along the left edge.
Fig. 6. Illustration of the effect of varying the focal distance at two T-junctions corresponding to
a depth discontinuity and a surface marking respectively. In the upper left image the camera was
focused on the left part of the approximately horizontal edge, while in the upper middle image
the camera was focused on the lower part of the vertical edge. In both cases the accommodation
distance was determined from an auto-focusing procedure, developed by Horii [8], maximizing
a simple measure on image sharpness. The graphs on the upper right display how this mea-
sure varies as function of the focal distance. The lower row shows corresponding results for a
T-junction due to a surface marking. We observe that in the first case the two curves attain
their maxima at clearly distinct positions (indicating the presence of a depth discontinuity),
while in the second case the two curves attain their maxima at approximately the same position
(indicating that the T-junction is due to a surface marking).
relative depth between the two edges, which in turn can be related to absolute depth
values by a calibration of the camera system. For completeness, we give corresponding
results for a T-junction due to surface markings, see Figure 6(d)-(e). In this case the two
graphs attain their maxima at approximately the same position, indicating that there is
no depth discontinuity at this point. (Note that this depth discrimination effect is more
distinct at a small depth-of-focus, as obtained at high zoom rates).
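The decision rule just described (comparing where the two sharpness curves attain their maxima) can be written as a small predicate; the function name and tolerance parameter are our own assumptions, not the paper's.

```python
import numpy as np

def depth_discontinuity_from_focus(sharp1, sharp2, focal_positions, tol=1.0):
    """Hypothesise a depth discontinuity at a T-junction when the two
    sharpness-vs-focal-distance curves peak at clearly distinct positions."""
    p1 = focal_positions[int(np.argmax(sharp1))]
    p2 = focal_positions[int(np.argmax(sharp2))]
    return abs(p1 - p2) > tol
```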
In Figure 7 we demonstrate how the vergence capabilities of the head-eye system can
provide similar clues for depth discrimination. As could be expected, the discrimination
task can be simplified by letting the cameras verge towards the point of interest. The
vergence algorithm, described in Pahlavan et al [18], matches the central window of one
camera with an epipolar band of the other camera by minimizing the sum of the squares
of the differences between the grey-level data from two (central) windows.
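A minimal version of such an SSD matching step might look as follows. This is illustrative only: window and band extraction, epipolar rectification, and all names are assumptions on top of the description of [18].

```python
import numpy as np

def match_along_baseline(window, band, step=1):
    """Slide the central window of one camera along an epipolar band of the
    other camera, returning the sum-of-squared-differences error as a
    function of the baseline coordinate."""
    h, w = window.shape
    errors = []
    for x0 in range(0, band.shape[1] - w + 1, step):
        patch = band[:h, x0:x0 + w]
        errors.append(float(np.sum((window - patch) ** 2)))
    return errors  # the minimising position relates to the disparity
```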
Fig. 7. (a)-(b) Stereo pair for a T-junction corresponding to a depth discontinuity. (c) Graph
showing the matching error as function of the baseline coordinate for two different epipolar
planes; one along the approximately horizontal line of the T-junction and one perpendicular to
the vertical line. (d)-(e) Stereo pair for a T-junction corresponding to a surface marking. (f)
Similar graph showing the matching error for the stereo pair in (d)-(e). Note that in the first
case the curves attain their minima at different positions indicating the presence of a depth
discontinuity (the distance between these points is related to the disparity), while in the second
case the curves attain their minima at approximately the same positions indicating that there
is no depth discontinuity at this point.
Let us finally emphasize that a necessary prerequisite for these classification methods
is the ability of the visual system to foveate. The system must have a mechanism for
focusing the attention, including means of taking a closer look if needed, that is, acquiring
new images.
6 Summary and Discussion
The main theme in this paper has been to demonstrate that feature detection and classi-
fication can be performed robustly and by simple algorithms in an active vision system.
Traditional methods based on prerecorded overview pictures may provide theoretical
foundations for the limits of what can be detected, but applied to real imagery they
will generally give far too many responses to be useful for further processing. We argue
that it is more natural to include attention mechanisms for finding regions of interest
709
and follow up by a step taking "a closer look" similar to foveation. Moreover, by looking
at the world rather than at prerecorded images we avoid a loss of information, which is
rather artificial if the aim is to develop "seeing systems".
The particular visual task we have considered in order to demonstrate these principles is
junction detection and junction classification. Concerning this specific problem, some of
the technical contributions are:
References
1. Brunnström K., Eklundh J.-O., Lindeberg T.P. (1990) "Scale and Resolution in Active
Analysis of Local Image Structure", Image & Vision Comp., 8:4, 289-296.
2. Brunnström K., Eklundh J.-O., Lindeberg T.P. (1991) "Active Detection and Classification
of Junctions by Foveation with a Head-Eye System Guided by the Scale-Space Primal
Sketch", Tech. Rep., ISRN KTH/NA/P-91/31-SE, Royal Inst. Tech., S-100 44 Stockholm.
3. Deriche R., Giraudon G. (1990) "Accurate Corner Detection: An Analytical Study", 3rd
ICCV, Osaka, 66-70.
4. Dreschler L., Nagel H.-H. (1982) "Volumetric Model and 3D-Trajectory of a Moving Car
Derived from Monocular TV-Frame Sequences of a Street Scene", CVGIP, 20:3, 199-228.
5. Florack L.M.J., ter Haar Romeny B.M., Koenderink J.J., Viergever M.A. (1991) "General
Intensity Transformations and Second Order Invariants", 7th SCIA, Aalborg, 338-345.
6. Förstner W., Gülch E. (1987) "A Fast Operator for Detection and Precise Location of Dis-
tinct Points, Corners and Centers of Circular Features", ISPRS Intercommission Workshop.
7. ter Haar Romeny B.M., Florack L.M.J., Koenderink J.J., Viergever M.A. (1991) "Invariant
Third Order Detection of Isophotes: T-junction Detection", 7th SCIA, Aalborg, 346-353.
8. Horii A. (1992) "Focusing Mechanism in the KTH Head-Eye System", In preparation.
9. Kitchen L., Rosenfeld A. (1982) "Gray-Level Corner Detection", PRL, 1:2, 95-102.
10. Koenderink J.J., Richards W. (1988) "Two-Dimensional Curvature Operators", J. Opt.
Soc. Am., 5:7, 1136-1141.
11. Lindeberg T.P. (1991) Discrete Scale-Space Theory and the Scale-Space Primal Sketch,
Ph.D. thesis, ISRN KTH/NA/P-91/8-SE, Royal Inst. Tech., S-100 44 Stockholm.
12. Lindeberg T.P., Eklundh J.-O. (1991) "On the Computation of a Scale-Space Primal
Sketch", J. Visual Comm. Image Repr., 2:1, 55-78.
13. Lindeberg T.P. (1991) "Guiding Early Visual Processing with Qualitative Scale and Region
Information", Submitted.
14. Lindeberg T.P. (1992) "Discrete Derivative Approximations with Scale-Space Properties",
In preparation.
15. Moravec H.P. (1977) "Obstacle Avoidance and Navigation in the Real World by a Seeing
Robot Rover", Stanford AIM-340.
16. Noble J.A. (1988) "Finding Corners", Image & Vision Computing, 6:2, 121-128.
17. Pahlavan K., Eklundh J.-O. (1992) "A Head-Eye System for Active, Purposive Computer
Vision", To appear in CVGIP-IU.
18. Pahlavan K., Eklundh J.-O., Uhlin T. (1992) "Integrating Primary Ocular Processes", 2nd
ECCV, Santa Margherita Ligure.
19. Witkin A.P. (1983) "Scale-Space Filtering", 8th IJCAI, Karlsruhe, 1019-1022.
A New Topological Classification of Points in 3D Images
1 ESIEE, Labo IAAI, Cité Descartes, 2 bd Blaise Pascal, 93162 Noisy-le-Grand Cedex, France
2 INRIA, project Epidaure, Domaine de Voluceau-Rocquencourt, 78153 Le Chesnay Cedex,
France, e-mail: malandain@bora.inria.fr
Abstract.
We propose, in this paper, a new topological classification of points in
3D images. This classification is based on two connected-components num-
bers computed on the neighborhood of each point. These numbers allow us
to classify a point as an interior, isolated, border, curve or surface point, or
as different kinds of junctions.
The main result is that the new border point type corresponds exactly
to a simple point. This allows the detection of simple points in a 3D image
by counting only connected components in a neighborhood. Furthermore,
other types of points are better characterized.
This classification allows the extraction of features in a 3D image. For
example, the different kinds of junction points may be used for characterizing
a 3D object. An example of such an approach for the analysis of medical
images is presented.
1 Introduction
Image analysis deals more and more with three-dimensional (3D) images. They may come
from several fields, the most popular being medical imagery. 3D images need specific
tools for their processing and interpretation. This interpretation task often involves
a matching stage, between two 3D images or between a 3D image and a model.
Before this matching stage, it is necessary to extract useful information from the image
and to organize it into a high-level structure. This can be done by extracting the 3D edges
of the image (see [6]) and then by searching for particular qualitative features on these
edges. These features are geometrical (see [5]) or topological (see [4]). In both cases, they
are: intrinsic to the 3D object, stable under rigid transformations, and locally defined.
In this paper, we propose a new topological classification which improves the one
proposed in [4]. After recalling some basic definitions of 3D digital topology (section 2),
we give the principle of the topological classification (section 3.1) and we present its
advantages (section 3.3). It is defined by computing two connected components numbers.
The main result is that we can characterize simple points with these numbers without
any Euler number (genus) computation. An example of application in medical imagery
is given (section 5).
2 Basic Definitions
We recall some basic definitions of digital topology (see [1] and [2]).
A 3D digital image is a subset of Z³. A point x ∈ Z³ is defined by (x₁, x₂, x₃)
with xᵢ ∈ Z. We can use the following distances defined in Rⁿ with their associated
neighborhoods:
A binary image consists of one object X and its complementary set X̄, called the back-
ground. In order to avoid any connectivity paradox, one commonly uses the 26-connectivity
for the object X and the 6-connectivity for the background X̄. These are the connectivities
used in this paper.
3.1 Principle
Let us consider an object X in the real space R³, let x ∈ X, and let V(x) be an
arbitrarily small neighborhood of x. Let us consider the numbers C and C̄, which
are respectively the numbers of connected components of X ∩ (V(x) \ {x}) and of
X̄ ∩ (V(x) \ {x}) adjacent to x. These numbers may be used as topological descriptors
of x. For example, a point of a surface is such that we can choose a small neighborhood
V(x) such that C = 1 and C̄ = 2.
Such numbers are commonly used in thinning algorithms and for characterizing
simple points in 3D. The crucial point of their adaptation to digital topology is the
choice of the small neighborhood V(x).
The distance associated with the 26-connectivity is D∞; it is then natural to choose
V(x) = N₂₆(x), which is the smallest neighborhood associated with D∞. Usually, the same
neighborhood is chosen when using other connectivities. But the distance associated with
the 6-connectivity is D₁, and the smallest neighborhood associated with D₁ is N₆(x).
The trouble is that N₆(x) is not 6-connected. The next D₁-neighborhood then seems to be
the good choice. In this neighborhood, some points have only one neighbor and hence no
topological interest; by removing them we obtain the 18-neighborhood N₁₈(x).
1. Each point is labeled with a topological type using the computation of two connected-
components numbers in a small neighborhood.
2. Because some points (junction points) are not detected with the two numbers, a less
local approach is used for extracting them.
Type A - interior point: C̄ = 0
Type B - isolated point: C = 0
Type C - border point: C̄ = 1, C = 1
Type D - curve point: C̄ = 1, C = 2
Type E - curves junction: C̄ = 1, C > 2
Type F - surface point: C̄ = 2, C = 1
Type G - surface-curve junction: C̄ = 2, C ≥ 2
Type H - surfaces junction: C̄ > 2, C = 1
Type I - surfaces-curves junction: C̄ > 2, C ≥ 2
We then obtain a first local topological classification of each point of the object using
these two numbers (see Table 1).
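Under this reading of Table 1 (C: object components, C̄: background components), the labeling rule can be written directly; a sketch with our own naming:

```python
def classify(c, cbar):
    """Local topological type from the two connected-components numbers:
    c = components of the object X, cbar = components of the background
    adjacent to the point (mapping as in Table 1)."""
    if cbar == 0:
        return "A: interior"
    if c == 0:
        return "B: isolated"
    if cbar == 1:
        return {1: "C: border", 2: "D: curve"}.get(c, "E: curves junction")
    if cbar == 2:
        return "F: surface" if c == 1 else "G: surface-curve junction"
    return "H: surfaces junction" if c == 1 else "I: surfaces-curves junction"
```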
However, this classification depends only on the 26-neighborhood of each point, and
some junction points, belonging to a set of junction points which is not of unit width, are
not detected. We propose the following procedures for extracting such points:
For curves: we only need to count the number of neighbors of each curve point (type
D); if this number is greater than two, the point is a missed curves-junction point
(type E).
For surfaces: we use the notion of simple surface introduced in [4]. If a point of type
F or G is adjacent to more than one simple surface in a 5×5×5 neighborhood, it is
considered as a missed point of type H or I.
3.3 Advantages
The main difference of our new classification is that we count the connected components
of the background X̄ in an 18-neighborhood N₁₈(x) instead of in a 26-neighborhood N₂₆(x)
as in [4]. By using a smaller neighborhood, we are able to see finer details of the object.
The main result due to this difference is that the border point type corresponds
exactly to the characterization of simple points (see [7] and [1]):

C = NC₂₆[X ∩ N₂₆*(x)] = 1 (1)
C̄ = NC₆[X̄ ∩ N₁₈*(x)] = 1 (2)
Proof. The complete proof of this proposition cannot be given here for lack of space
(see [3] for details).
This new characterization of simple points needs only two conditions (instead of
three as usual, see [1]), and these two conditions only need the computation of numbers
of connected components. The computation of the genus, which requires quite a lot of
computational effort, is no longer necessary.
There exist optimal algorithms for searching and labeling the k-connected com-
ponents in a binary image (see [8]). These algorithms need only one scan of the picture
by one half of the k-neighborhood, and use a table of labels for managing the conflicts
when a point belongs to several connected components already labeled.
We could use the same algorithm in our small neighborhoods, but it has a high compu-
tational cost. In these neighborhoods, we have a priori knowledge about the possible
adjacencies. We can store this knowledge in a table and use it in a propagation algorithm.
For that, we scan the neighborhood; if we find an object point which is not labeled,
we assign a new label to it and propagate this new label to the whole connected
component which contains the point. Using this knowledge, the propagation algorithm
is faster than the classical one.
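The propagation algorithm can be sketched generically as follows (our own minimal version: the precomputed adjacency table is passed in as a dictionary, and only the component count is returned):

```python
def count_components(points, adjacency):
    """Count connected components by label propagation: scan the points,
    and when an unlabeled point is found, propagate a new label through
    its whole component using the precomputed adjacency table."""
    labels = {}
    n_components = 0
    for p in points:
        if p in labels:
            continue
        n_components += 1          # new, unlabeled component found
        labels[p] = n_components
        stack = [p]
        while stack:
            q = stack.pop()
            for r in adjacency[q]:
                if r in points and r not in labels:
                    labels[r] = n_components
                    stack.append(r)
    return n_components
```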
5 Results
We consider two NMR 3D images of a skull scanned in two different positions (see Fig-
ure 1).
We apply a thinning algorithm (derived from our characterization of simple points)
to the 3D image containing a skull. The 3D image contains 256×256×151 quasi-isotropic
voxels of 0.8×0.8×1 mm³.
We then obtain the skeleton of the skull. We apply our classification algorithm to
label each point. Projections of the labeled skeleton are shown in Figure 2. It is easy
to check the astonishing likeness between the two results, in spite of the noise due to the
scan and the skeletonization. This will be used with profit in a forthcoming 3D matching
algorithm.
6 Conclusion
A new topological classification of points in a 3D image has been proposed. This classifi-
cation allows the characterization of a point as an interior, isolated, border, curve or surface
point, or as different kinds of junctions. This classification also allows the detection of
simple points (and applications like thinning or shrinking). This is done by computing
two connected-components numbers. The Euler number, which requires a lot of compu-
tational effort, does not need to be evaluated. Furthermore, the method for computing
connected components in a small neighborhood enables fast computation of the two
numbers.
References
1. T.Y. Kong and A. Rosenfeld. Digital topology: introduction and survey. Computer Vision,
Graphics and Image Processing, 48:357-393, 1989.
2. V.A. Kovalevsky. Finite topology as applied to image analysis. Computer Vision, Graphics,
and Image Processing, 46:141-161, 1989.
3. G. Malandain and G. Bertrand. A new topological segmentation of discrete surfaces. Tech-
nical report, I.N.R.I.A., Rocquencourt, 78153 Le Chesnay Cedex, France, 1992.
4. G. Malandain, G. Bertrand, and N. Ayache. Topological segmentation of discrete surfaces.
In IEEE Computer Vision and Pattern Recognition, June 3-6 1991. Hawaii.
714
5. O. Monga, N. Ayache, and P. Sander. From voxel to curvature. In IEEE Computer Vision
and Pattern Recognition, June 3-6 1991. Hawaii.
6. O. Monga, R. Deriche, G. Malandain, and J.P. Cocquerez. Recursive filtering and edge clos-
ing: two primary tools for 3D edge detection. In First European Conference on Computer
Vision (ECCV), April 1990, Nice, France, 1990. Also Research Report INRIA 1103.
7. D.G. Morgenthaler. Three-dimensional digital topology: the genus. Tr-980, Computer Sci-
ence Center, University of Maryland, College Park, MD 20742, U.S.A., November 1980.
8. C.M. Park and A. Rosenfeld. Connectivity and genus in three dimensions. Tr-156, Com-
puter Science Center, University of Maryland, College Park, MD 20742, U.S.A., May 1971.
Fig. 2. Projection of the topological characterization of the skeleton of the skull: border points
are in black, surface points in light grey and surface junctions in grey.
A Theory of 3D Reconstruction of Heterogeneous
Edge Primitives from Two Perspective Views *
INRIA Sophia Antipolis, 2004 Route des Lucioles, 06561 Valbonne, France.
1 Introduction
3D computer vision is concerned with recovering the 3D structure of the observed scene
from 2D projective image data. One major problem of 3D reconstruction is the precision
of the obtained 3D data (see [1] and [2]). A promising direction of research is to combine or
fuse 3D data obtained from different observations or by different sensors. However, simply
adopting the fusion approach is not enough: an additional effort needs to be made
at the stage of the 3D reconstruction by adopting a new strategy. For example, we think
that a strategy of 3D reconstruction of heterogeneous primitives would be an interesting
direction of research. The main reason behind this idea is that a real scene composed of
natural or man-made objects would be characterized efficiently by a set of heterogeneous
primitives, instead of uniquely using a set of 3D points or a set of 3D line segments.
Therefore, the design of a 3D vision system must incorporate the processing of a set of
heterogeneous primitives as a central element. In order to implement the strategy above,
we must know at the stage of 3D reconstruction what kind of primitives will be recovered
and how to perform such a 3D reconstruction of the primitives selected beforehand. In
fact, we are interested in the 3D reconstruction of primitives relative to the boundaries
of objects, i.e., the edge primitives. For simplicity, we can roughly classify
such primitives into four types: contour points, line segments, quadratic curves
and closed curves. Suppose now that a moving camera or moving stereo cameras observe
a natural scene to furnish some perspective views. Then, a relevant question will be:
Given two perspective views with the relative geometry known, how do we re-
cover the 3D information from the matched 2D primitives such as contour points,
line segments, quadratic curves and closed curves?
2 Camera Modelling
outside the camera). Similarly, we associate a coordinate system oxy with the image plane,
with the origin at the intersection point between the OZ axis and the image plane; the ox
and oy axes are respectively parallel to OX and OY. If we denote by P = (X, Y, Z) a point
in OXYZ and by p = (x, y) the corresponding image point in oxy, then, using a perspective
projection model of the camera, we have the following relationship:

x = f X/Z,  y = f Y/Z  (1)

where f is the focal length of the camera. Without loss of generality, we can set f = 1.
The two perspective views in question may be furnished either by a moving camera
at two consecutive instants or by a pair of stereo cameras. Thus, it is natural to
represent the relative geometry between the two perspective views by a rotation matrix R
and a translation vector T. In the following, we shall denote by (Rv1v2, Tv1v2) the relative
geometry between the perspective view v1 and the perspective view v2. Now, if we denote
by Pv1 = (Xv1, Yv1, Zv1) a 3D point in the camera coordinate system of the perspective view
v1 and by Pv2 = (Xv2, Yv2, Zv2) the same 3D point in the camera coordinate system of the
perspective view v2, then the following relation holds:

Pv1 = Rv1v2 Pv2 + Tv1v2
4 Solutions of 3D Reconstruction
4.1 3D Reconstruction of Contour Points
In an edge map, the contour points are the basic primitives. A contour chain that cannot
be described analytically can be considered as a set of linked contour points, so the
3D reconstruction of non-describable contour chains is equivalent to that of contour
points. Given a pair of matched contour points (p_v1, p_v2), we first determine
the projecting line which passes through the point p_v2 and the origin of the camera
coordinate system of the perspective view v2. Then, we transform this projecting line
into the camera coordinate system of the perspective view v1. Finally, the coordinates of
the corresponding 3D point are determined by inversely projecting the contour point
p_v1 onto the transformed line. Our solution for recovering 3D contour points can
therefore be formulated by the following theorem:
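Numerically, the back-project / transform / intersect procedure of this section can be sketched as follows. The helper below is hypothetical and assumes the convention P_v1 = R P_v2 + T for the relative geometry; since matched rays rarely meet exactly under noise, it returns the least-squares midpoint of closest approach.

```python
import numpy as np

def triangulate(p_v1, p_v2, R, T):
    """Recover the 3D point (in frame v1) from matched image points."""
    d1 = np.array([p_v1[0], p_v1[1], 1.0])      # ray through the v1 centre
    d2 = R @ np.array([p_v2[0], p_v2[1], 1.0])  # v2 ray mapped into frame v1
    # Solve min || s*d1 - (T + t*d2) || for the two depths s, t.
    A = np.stack([d1, -d2], axis=1)
    s, t = np.linalg.lstsq(A, T, rcond=None)[0]
    return 0.5 * (s * d1 + (T + t * d2))        # midpoint of closest approach

# Synthetic check: a point seen from two cameras related by a pure translation.
P = np.array([1.0, 2.0, 4.0])                   # point in frame v1
R, T = np.eye(3), np.array([1.0, 0.0, 0.0])     # P_v1 = R @ P_v2 + T
P2 = R.T @ (P - T)                              # same point in frame v2
rec = triangulate(P[:2] / P[2], P2[:2] / P2[2], R, T)
print(np.allclose(rec, P))  # -> True
```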
4.2 3D Reconstruction of Line Segments
The problem of 3D reconstruction of line segments has been addressed by several re-
searchers (see [3] and [4]). In this paper, we develop a simpler solution with
respect to the camera-centered coordinate system, given two perspective views. The
basic idea is first to determine the projecting plane of a line segment in the second per-
spective view, then to transform this projecting plane into the first perspective view and
finally to determine the 3D endpoints of the line segment by inversely projecting the
corresponding 2D endpoints (in the image plane) to the transformed projecting plane in
the first perspective view. In this way, we can derive a solution for the 3D reconstruction
of line segments. This solution can be stated as follows:
where L_v2 = (a_v2, b_v2, c_v2) and (x_v1, y_v1) is the known projection of (X_v1, Y_v1, Z_v1) in
the image plane of the first perspective view.
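A minimal numerical sketch of this construction, under the assumed convention P_v1 = R P_v2 + T: the image line a x + b y + c = 0 in view v2 back-projects to the plane through the v2 centre with normal L_v2 = (a, b, c); in the v1 frame that plane is n · P_v1 = n · T with n = R L_v2, so an endpoint (x, y) in view v1 has depth Z = (n · T) / (n · (x, y, 1)). The helper names are hypothetical.

```python
import numpy as np

def endpoint_3d(xy_v1, L_v2, R, T):
    """3D endpoint from its v1 image point and the segment's v2 image line."""
    n = R @ np.asarray(L_v2, dtype=float)       # projecting-plane normal in v1
    ray = np.array([xy_v1[0], xy_v1[1], 1.0])
    Z = (n @ T) / (n @ ray)                     # depth along the v1 ray
    return Z * ray

# Check with a synthetic 3D segment endpoint.
P = np.array([2.0, 1.0, 5.0])
R, T = np.eye(3), np.array([1.0, 0.0, 0.0])
P2 = R.T @ (P - T)                              # endpoint in frame v2
x2 = P2[0] / P2[2]
# Any 2D line through the v2 projection works as the segment's image line:
L_v2 = np.array([1.0, 0.0, -x2])                # vertical line x = x2
rec = endpoint_3d((P[0] / P[2], P[1] / P[2]), L_v2, R, T)
print(np.allclose(rec, P))  # -> True
```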
4.3 3D Reconstruction of Quadratic Curves
In this section, we shall show that an analytic solution exists for the 3D reconstruction
of quadratic curves from two perspective views. By quadratic curve, we mean the curves
whose projection onto an image plane can be described by an equation of quadratic form.
To determine the 3D points belonging to a 3D curve, the basic idea is first to determine
the projecting surface of a 3D curve observed in the second perspective view, then to
transform this projecting surface to the first perspective view and finally to determine
the 3D points belonging to the 3D curve by inversely projecting the corresponding 2D
points (in the image plane) to the transformed projecting surface in the first perspective
view. If we denote by p_v = (x_v, y_v, 1) the homogeneous coordinates of an image point in the
perspective view v, we can formulate our solution for the 3D reconstruction of quadratic
curves by the following theorem:
Theorem 3. A 3D curve is observed from two perspective views: the perspective view v1
and the perspective view v2. In these two views, the corresponding projected 2D curves
(in image planes) can be described by equations of quadratic form. The description of
the 2D curve in the second perspective view is given by: a_v2 x_v2^2 + b_v2 y_v2^2 + c_v2 x_v2 y_v2 +
e_v2 x_v2 + f_v2 y_v2 + g_v2 = 0. If the relative geometry between the two perspective views is
known and is represented by (3), then given a point p_v1 = (x_v1, y_v1, 1) on the 2D curve
in the first perspective view, the corresponding 3D point (X_v1, Y_v1, Z_v1) on the 3D curve
is determined by the following equations:
X_v1 = x_v1 (-B ± √(B² - 4AC)) / (2A)
Y_v1 = y_v1 (-B ± √(B² - 4AC)) / (2A)        (6)
Z_v1 = (-B ± √(B² - 4AC)) / (2A)

where:

A = p_v1^T · R_v1v2^T · Q_v2 · R_v1v2 · p_v1

and Q_v2 is the symmetric matrix of the quadratic form:

Q_v2 = ( a_v2      c_v2/2    e_v2/2
         c_v2/2    b_v2      f_v2/2
         e_v2/2    f_v2/2    g_v2 )
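The quadratic in Z_v1 behind these formulas can be sketched numerically. Assuming the convention P_v1 = R P_v2 + T, the image conic p^T Q p = 0 in view v2 back-projects to the cone P_v2^T Q P_v2 = 0, i.e. (P - T)^T M (P - T) = 0 with M = R Q R^T in the v1 frame; substituting P = Z m with m = (x_v1, y_v1, 1) yields A Z² + B Z + C = 0. Helper names are hypothetical.

```python
import numpy as np

def depths_on_conic_cone(m, Q, R, T):
    """Depths Z where the v1 ray through m meets the back-projected cone."""
    M = R @ Q @ R.T
    A = m @ M @ m                    # quadratic coefficient
    B = -2.0 * (m @ M @ T)           # linear coefficient
    C = T @ M @ T                    # constant coefficient
    disc = B * B - 4.0 * A * C
    if disc < 0:
        return []                    # ray misses the cone
    r = np.sqrt(disc)
    return sorted([(-B - r) / (2 * A), (-B + r) / (2 * A)])

# Synthetic check: the unit circle x^2 + y^2 - 1 = 0 in view v2,
# with R = I and the v2 centre at T = (0, 0, 1) in the v1 frame.
Q = np.diag([1.0, 1.0, -1.0])
R, T = np.eye(3), np.array([0.0, 0.0, 1.0])
print(depths_on_conic_cone(np.array([0.5, 0.0, 1.0]), Q, R, T))
```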
be more than three points on a closed curve. As for the closed curve C, if we define:
A_{n×3} = [ x_v1^i  y_v1^i  1 ]_{i=1,...,n} ;   B_{n×1} = [ 1/Z_v1^i ]_{i=1,...,n} ;   W_{3×1} = (a, b, c)^T.
then a linear system will be established as follows:
A.W=B. (10)
To estimate the unknown vector W, we use a least-squares technique. So, the solution
for (a, b, c) can be obtained by the following calculation:
W = (A^T · A)^{-1} · (A^T · B).   (11)
Knowing the supporting plane determined by (a, b, c), the 3D points of the closed
curve C can be calculated as follows (by combining (1) and (8)):
X_v1^i = x_v1^i / (a x_v1^i + b y_v1^i + c)
Y_v1^i = y_v1^i / (a x_v1^i + b y_v1^i + c)       i = 1, 2, ..., n.   (12)
Z_v1^i = 1 / (a x_v1^i + b y_v1^i + c)
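The least-squares step (10)-(11) and the recovery (12) can be sketched as follows (hypothetical helper; the supporting plane is a X + b Y + c Z = 1, so each image point satisfies a x_i + b y_i + c = 1/Z_i):

```python
import numpy as np

def reconstruct_closed_curve(xy, inv_depths):
    """Fit the supporting plane (a, b, c) and recover the 3D curve points."""
    A = np.column_stack([xy[:, 0], xy[:, 1], np.ones(len(xy))])
    W, *_ = np.linalg.lstsq(A, inv_depths, rcond=None)   # (A^T A)^-1 A^T B
    a, b, c = W
    Z = 1.0 / (a * xy[:, 0] + b * xy[:, 1] + c)          # equation (12)
    return np.column_stack([xy[:, 0] * Z, xy[:, 1] * Z, Z]), W

# Synthetic check: points on the plane 0.1*X + 0.2*Y + 0.5*Z = 1,
# for which 1/Z = 0.1*x + 0.2*y + 0.5 in normalized image coordinates.
rng = np.random.default_rng(0)
P3 = []
for _ in range(8):
    X, Y = rng.uniform(-1, 1, size=2)
    P3.append((X, Y, (1 - 0.1 * X - 0.2 * Y) / 0.5))
P3 = np.array(P3)
xy = P3[:, :2] / P3[:, 2:3]
pts, W = reconstruct_closed_curve(xy, 1.0 / P3[:, 2])
print(np.allclose(pts, P3), np.allclose(W, [0.1, 0.2, 0.5]))  # -> True True
```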
5 Conclusions
References
[1] BLOSTEIN, S. D. and HUANG, T. S.: Error Analysis in Stereo Determination of 3D
Point Positions. IEEE PAMI, Vol. 9, No. 6, (1987).
[2] RODRIGUEZ, J. J. and AGGARWAL, J. K.: Stochastic Analysis of Stereo Quantization
Error. IEEE PAMI, Vol. 12, No. 5, (1990).
[3] KROTKOV, E., HENRIKSEN, K. and KORIES, R.: Stereo Ranging with Verging Cameras.
IEEE PAMI, Vol. 12, No. 12, (1990).
[4] AYACHE, N. and LUSTMAN, F.: Trinocular Stereo Vision for Robotics. IEEE PAMI,
Vol. 13, No. 1, (1991).
This article was processed using the LaTeX macro package with ECCV92 style
Detecting 3-D Parallel Lines for Perceptual
Organization*
Xavier Lebègue and J. K. Aggarwal
Computer and Vision Research Center, Dept. of Electrical and Computer Engr., ENS 520,
The University of Texas at Austin, Austin, Texas 78712-1084, U.S.A.
Abstract.
This paper describes a new algorithm to simultaneously detect and clas-
sify straight lines according to their orientation in 3-D. The fundamental
assumption is that the most "interesting" lines in a 3-D scene have orien-
tations which fall into a few precisely defined categories. The algorithm we
propose uses this assumption to extract the projection of straight edges from
the image and to determine the most likely corresponding orientation in the
3-D scene. The extracted 2-D line segments are therefore "perceptually"
grouped according to their orientation in 3-D. Instead of extracting all the
line segments from the image before grouping them by orientation, we use
the orientation data at the lowest image processing level, and detect seg-
ments separately for each predefined 3-D orientation. A strong emphasis is
placed on real-world applications and very fast processing with conventional
hardware.
1 Introduction
This paper presents a new algorithm for the detection and organization of line segments
in images of complex scenes. The algorithm extracts line segments of particular 3-D
orientations from intensity images. The knowledge of the orientation of edges in the
3-D scene allows the detection of important relations between the segments, such as
parallelism or perpendicularity.
The role of perceptual organization [5] is to highlight non-accidental relations between
features. In this paper, we extend the results of perceptual organization for 2-D scenes
to the interpretation of images of 3-D scenes with any perspective distortion. For this,
we assume a priori knowledge of prominent orientations in the 3-D scene. Unlike other
approaches to space inference using vanishing points [1], we use the information about
3-D orientations at the lowest image-processing level for maximum efficiency.
The problem of line detection without first computing a free-form edge map was
addressed by Burns et al. [2]. Their algorithm first computes the intensity gradient ori-
entation for all pixels in the image. Next, the neighboring pixels with similar gradient
orientation are grouped into "line-support regions" by a process involving coarse ori-
entation "buckets." Finally, a line segment is fit to the large line-support regions by a
least-squares procedure. An optimized version of this algorithm was presented in [3].
The algorithm described in this paper is designed not only to extract 2-D line segments
from an intensity image, but also to indicate what are the most probable orientations
for the corresponding 3-D segments in the scene. Section 2 explains the geometry of
* This research was supported in part by the DoD Joint Services Electronics Program through
the Air Force Office of Scientific Research (AFSC) Contract F49620-89-C-0044, and in part
by the Army Research Office under contract DAAL03-91-G-0050.
projecting segments of known 3-D orientation. Section 3 describes a very fast algorithm
to extract the line segments from a single image and to simultaneously estimate their
3-D orientation. Finally, Sect. 4 provides experimental results obtained with images of
indoor scenes acquired by a mobile robot.
We chose to concentrate on objects which have parallel lines with known 3-D orientations
in a world coordinate system. For example, in indoor scenes, rooms and hallways usually
have a rectangular structure, and there are three prominent orientations for 3-D line
segments: one vertical and two horizontal orientations perpendicular to each other. In
this paper, any 3-D orientation is permitted, as long as it is given to the algorithm.
Therefore, more complex environments, such as polygonal buildings with angles other
than 90 degrees, are handled as well if these angles are known. It is important to note that
human vision also relies on prominent 3-D orientations. Humans feel strongly disoriented
when placed in a tilted environment.
Vertical lines constitute an interesting special case for two reasons: they are especially
common in man-made scenes, and their 3-D orientation can easily be known in the 3-D
camera coordinate system by measuring the direction of gravity. If a 2-axis inclinometer
is mounted on the camera and properly calibrated, a 3-D vertical vector can be expressed
in the 3-D coordinate system aligned with the 2-D image coordinate system. Inexpensive
commercial inclinometers have a precision better than 0.01 degree. Humans also sense
the direction of gravity by organs in their inner ears. In our experiments, we estimate the
third angular degree of freedom of the camera relative to the scene from the odometer
readings of our mobile robot. Provided that the odometer is constantly corrected by
vision [4], the odometer does not drift without bounds.
We can infer the likely 3-D orientation of the line segments from their 2-D projections
in the image plane. With a pinhole perspective projection model, lines parallel to each
other in the 3-D scene will converge to a vanishing point in the 2-D projection. In partic-
ular, if the orientation of the camera relative to the scene is known, a vanishing point can
be computed for each given 3-D orientation before the image is processed. All the lines
that have a given orientation in 3-D must pass through the associated vanishing point
when projected. Conversely, if a line does not pass through a vanishing point, it cannot
have the 3-D orientation associated with that vanishing point. In practice, if a line does
pass through a vanishing point when projected, it is likely to have the associated 3-D
orientation.
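This vanishing-point computation can be sketched numerically (hypothetical helper, pinhole model): the vanishing point of a 3-D direction d given in the world frame is the image of R d, where R rotates world into camera coordinates, and the projections of parallel 3-D lines approach it.

```python
import numpy as np

def vanishing_point(d_world, R):
    """Image vanishing point of a 3-D direction, or None if at infinity."""
    dc = R @ np.asarray(d_world, dtype=float)
    if abs(dc[2]) < 1e-12:
        return None              # direction parallel to the image plane
    return dc[:2] / dc[2]

# Two parallel 3-D lines with direction d project to 2-D lines meeting at v.
R = np.eye(3)
d = np.array([1.0, 0.0, 1.0])
v = vanishing_point(d, R)        # v = (1, 0)
for base in (np.array([0.0, 0.0, 2.0]), np.array([0.0, 1.0, 3.0])):
    far = base + 100.0 * d       # a point far along the line
    p1 = far[:2] / far[2]        # its projection approaches v
    print(np.allclose(p1, v, atol=0.05))  # -> True for both lines
```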
To summarize, the line detection algorithm of Sect. 3 knows in each point of the
image plane the orientation that a projected line segment would have if it had one of
the predefined 3-D orientations. Therefore, the basic idea is to detect the 2-D segments
with one of the possible orientations, and mark them with the associated 3-D orientation
hypothesis.
The coordinate systems are W (the World coordinate system, with a vertical z-axis),
R (the Robot coordinate system, in which we obtain the inclinometer and odometer
readings), C (the Camera coordinate system), and P (the coordinate system used for
defines another point of the estimated 2-D line. A 2-D vector d in the image plane pointing
to the vanishing point from the current point is then collinear to [u′ − u, v′ − v]^T.
Algebraic manipulations lead to [d_u, d_v]^T = [a_x − a_z u, a_y − a_z v]^T.
The current pixel is retained for the 3-D direction under consideration if the angle
between d and the local gradient g is 90 degrees plus or minus an angular threshold γ.
This can be expressed by
‖d × g‖ / (‖d‖ ‖g‖) > cos γ

or equivalently:

(d_x g_y − d_y g_x)² > Γ (d_x² + d_y²)(g_x² + g_y²)

with Γ = (cos γ)² computed once for all. Using this formulation, the entire line support
extraction is reduced to 8 additions and 11 multiplications per pixel and per 3-D
orientation. If an even greater speedup is desired, (g_x² + g_y²) may be computed first and
thresholded. Pixels with a very low gradient magnitude may then be rejected before
having to compute d.
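The per-pixel test can be sketched directly from this formulation (hypothetical helper): a pixel supports a 3-D orientation when (d_x g_y − d_y g_x)² > Γ (d_x² + d_y²)(g_x² + g_y²), with Γ = (cos γ)² precomputed once.

```python
import math

def pixel_supports_orientation(d, g, GAMMA):
    """True when the angle between d and g is within gamma of 90 degrees."""
    cross = d[0] * g[1] - d[1] * g[0]
    return cross * cross > GAMMA * (d[0]**2 + d[1]**2) * (g[0]**2 + g[1]**2)

GAMMA = math.cos(math.radians(10.0)) ** 2   # gamma = 10 degrees, squared once
print(pixel_supports_orientation((1.0, 0.0), (0.0, 1.0), GAMMA))  # exactly 90 deg: True
print(pixel_supports_orientation((1.0, 0.0), (1.0, 1.0), GAMMA))  # 45 deg off: False
```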
4 Results
The algorithm was implemented in C on an IBM RS 6000 Model 530 workstation, and
tested on hundreds of indoor images obtained by our mobile robot. The predefined 3-D
orientations are the vertical and the two horizontal orientations perpendicular to each
other and aligned with the axes of our building. Figures 1 and 2 show the results of
line extraction for one image in a sequence. The processing time is only 2.2 seconds for
each 512 by 480 image. Preliminary timing results on an HP 730 desktop workstation
approach one second of processing, from the intensity image to the list of categorized
segments. The fast speed can be explained partly by the absence of multi-cycle floating-
point instructions from the line orientation equations, when properly expressed.
The lines are not broken up easily by a noisy gradient orientation, because the ori-
entation "buckets" are wide and centered on the noiseless gradient orientation for each
3-D orientation category. The output quality does not degrade abruptly with high im-
age noise, provided that the thresholds for local gradient orientations are loosened. The
sensitivity to different thresholds is similar to that of the Burns algorithm: a single set
of parameters can be used for most images. A few misclassifications occur in some parts
of the images, but are marked as ambiguities.
We have compared the real and computed 3-D orientation of 1439 detected segments
from eight images in three different environments. The presence of people in some scenes,
as well as noise in the radio transmission of images, did not seem to generate many mis-
classifications. The most frequent ambiguities occurred with horizontal segments parallel
to the optical axis: 1.1% of them were classified as possibly vertical in 3-D.
5 Conclusion
We have presented a new algorithm for detecting line segments in an image of a 3-D
scene with known prominent orientations. The output of the algorithm is particularly
well suited for further processing using perceptual organization techniques. In partic-
ular, angular relationships between segments in the 3-D scene, such as parallelism or
perpendicularity, are easily verified. Knowledge of the 3-D orientation of segments is a
considerable advantage over the traditional 2-D perceptual organization approach. The
orientation thresholds of the 2-D perceptual organization systems cannot handle a sig-
nificant perspective distortion (such as the third orientation category in Fig. 2). The
independence from the perspective distortion brings more formal angular thresholds to
Fig. 1. (a) The input intensity image, and (b) the 2-D segments
Fig. 2. The line segments associated with each 3-D orientation
the perceptual organization process. By using the 3-D orientation at the lowest image
processing level, both the quality and speed of the algorithm were improved. The ultimate
benefits of this approach were demonstrated on real images in real situations.
References
This article was processed using the LaTeX macro package with ECCV92 style
Integrated Skeleton and Boundary Shape Representation for
Medical Image Interpretation*
1 Introduction
We are currently developing an improved shape representation for use in the Guy's
Computer vision system for medical image interpretation [RC1]. The requirement is for
an efficient method of shape representation which can be used to store information about
the expected anatomical structure in the model, and also represent information about the
shape of features present in the image. In this paper we present an integrated approach
to shape representation which also addresses the problem of grouping dot pattern and
disconnected edge sections to form perceptual objects. The method of shape
representation is based on the dual of the Voronoi diagram and the Delaunay
triangulation of a set of points. For each object, the boundary and the skeleton are
represented hierarchically. This reduces sensitivity to small changes along the object
boundary and also facilitates coarse to fine matching of image features to model entities.
2 Previous work
Many approaches to shape representation have been proposed, and more extensive
reviews can be found in reference [Mal]. Boundary representations of the shape of objects,
such as those described in references [Frl] and [Ayl], tend to be sensitive to small changes
along the object boundaries; hierarchical representation is often difficult, as is the
subdivision of objects into their sub-parts. The hierarchical approach of the curvature primal
* The research described in this paper has been supported by the SERC grant ISIS
Fairfield [Fal], like ourselves, is concerned with both the detection of the boundary of
objects from dots, and also the segmenting of these objects into their sub-parts. He uses
the Voronoi diagram to detect areas of internal concavity and replaces Voronoi diagram
sides with the corresponding Delaunay triangulation sides to produce both the object
boundary and sub-parts. This work is dependent on a user defined threshold and does not
differentiate between object boundaries and the sub-part boundaries.
Ogniewicz et al [OI1] use the Voronoi diagram of a set of points to produce a medial axis
description of objects. This method requires that the points making up the boundary have
known connectivity, and a threshold is used to prune the skeleton description.
The method we propose does not require connected boundaries as input, merely a set of
points (dots) which are believed to be edge points of objects. Our method produces
distinct objects from these potential edge points, and concurrently generates both a
skeleton and a boundary representation of the shape of these objects.
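The Delaunay triangulation at the heart of this representation can be illustrated from its defining property, the empty circumcircle: a triangle on the input dots belongs to the triangulation if and only if its circumcircle contains no other input point. The brute-force sketch below (hypothetical, general position assumed) is only workable for small dot patterns; a production system would use an incremental or divide-and-conquer algorithm.

```python
from itertools import combinations

def circumcircle(a, b, c):
    """Circumcenter and squared radius of the triangle (a, b, c)."""
    (ax, ay), (bx, by), (cx, cy) = a, b, c
    d = 2.0 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    ux = ((ax**2 + ay**2) * (by - cy) + (bx**2 + by**2) * (cy - ay)
          + (cx**2 + cy**2) * (ay - by)) / d
    uy = ((ax**2 + ay**2) * (cx - bx) + (bx**2 + by**2) * (ax - cx)
          + (cx**2 + cy**2) * (bx - ax)) / d
    return (ux, uy), (ax - ux) ** 2 + (ay - uy) ** 2

def delaunay(points):
    """All triangles whose circumcircle is empty of other input points."""
    tris = []
    for a, b, c in combinations(points, 3):
        try:
            (ux, uy), r2 = circumcircle(a, b, c)
        except ZeroDivisionError:
            continue                      # collinear triple
        if all((p[0] - ux) ** 2 + (p[1] - uy) ** 2 >= r2 - 1e-9
               for p in points if p not in (a, b, c)):
            tris.append((a, b, c))
    return tris

pts = [(0.0, 0.0), (2.0, 0.0), (1.0, 1.5), (1.0, -1.5)]
print(len(delaunay(pts)))  # -> 2 (two triangles sharing the edge (0,0)-(2,0))
```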
Objects can now be defined by stretches of unbroken and possibly branching skeletons.
Each branch in a skeleton has associated with it two properties. Firstly, the mean
direction of the skeleton branch, and secondly the area of the object corresponding to
that branch. Fig. 1a shows an example set of points corresponding to the bodies of the
lateral cerebral ventricles, extracted via a DOG from a transverse MR image. These points
are shown as a series of crosses which are unfortunately drawn so close that they partially
overlap. Fig. 1b-c show the Delaunay triangulation and Voronoi diagram of these points
respectively. Fig. 1d shows the result of the proximity-based selection criterion.
The unified nature of the representation that comes from the duality of the Delaunay
triangulation and the Voronoi diagram allows simple changes in the data structure to
change the perceived number of objects. For example, considering the lateral ventricles
in figs 1-2, we may wish to further divide the object into left and right ventricles. This can
be easily achieved by simply forcing the connection between the two bodies. This requires
only a local change in the data structure, but generates two new objects. Fig. 4b shows the
effect of this simple change.
The converse of this can be just as easily achieved (merging two objects into one) by
forcing a boundary section to become a skeleton section.
6 Concluding Remarks
We have defined a hierarchical, object-centred shape description. The algorithm for
computing this description works on both connected and disconnected edge points. The
technique is based on a scale invariant proximity measure and so requires no user defined
thresholds.
We are extending our technique to make use of criteria other than proximity, for example
gradient magnitude at edge points, and directional continuity of edge sections.
Fig. 1. a) Input points; b) Delaunay triangulation; c) Voronoi diagram; d) proximity-based selection.
Fig. 2. a) Boundary and sub part of lateral bodies; b) low-level in object hierarchy; c) fine
detail of the object hierarchy. Dotted lines are Delaunays forming the virtual boundaries.
Fig. 3. a)-c) same features as fig. 2 for the small area indicated in fig. 2a.
Fig. 4. a) Lateral bodies of figs. 2-3; b) result of splitting the object in two.
References
Thomas Buchanan
Eberstadt, Troyesstr. 64, D-6100 Darmstadt, Germany
1 Introduction
p_ij = det( x_i  x_j ; y_i  y_j ) = x_i y_j − x_j y_i

That the p_ij do indeed have properties (a), (b) and (c) is shown in [11] for example.
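The determinantal definition can be checked numerically (a sketch; x and y are homogeneous coordinates of two points spanning the line, and the quadratic Plücker relation p01 p23 − p02 p13 + p03 p12 = 0, which every line's coordinates must satisfy, is verified on the result):

```python
from itertools import combinations

def pluecker(x, y):
    """Line coordinates p_ij = x_i*y_j - x_j*y_i of the line through x, y."""
    return {(i, j): x[i] * y[j] - x[j] * y[i]
            for i, j in combinations(range(4), 2)}

x = (1.0, 0.0, 0.0, 1.0)          # the affine point (0, 0, 1)
y = (1.0, 2.0, 3.0, 4.0)          # the affine point (2, 3, 4)
p = pluecker(x, y)
omega = (p[(0, 1)] * p[(2, 3)] - p[(0, 2)] * p[(1, 3)]
         + p[(0, 3)] * p[(1, 2)])
print(abs(omega) < 1e-12)  # -> True: the Pluecker relation holds
```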
An algebraic set is defined to be a set which is defined by a set of polynomial equa-
tions. In line geometry these equations involve the line coordinates Pij as unknowns. An
algebraic set is called reducible if it can be written as the union of two nonempty proper
algebraic subsets. For example, in the cartesian plane the set of points satisfying xy = 0,
which consists of the coordinate axes, is a reducible algebraic set, because the set is the
union of the y-axis (x = 0) and the x-axis (y = 0). On the other hand, the x-axis de-
scribed by y = 0 is irreducible, because the only proper algebraic subsets of the x-axis
are finite sets of points of the form (x, 0). An irreducible algebraic set is called a variety.
It can be shown that any algebraic set can be described as the finite union of varieties.
A variety V has a well-defined dimension, which is the number of parameters required
to parametrize smooth open subsets of V. We can think of the dimension of V as the
number of degrees of freedom in V. For example, the plane has dimension 2, 3-space has
dimension 3, etc.
Line varieties Λ are subvarieties of 𝓛. Since dim 𝓛 = 4, there are four possibilities for
dim Λ when Λ is not all of 𝓛. If dim Λ = 0, then Λ is a single element of 𝓛, i.e., a line.
Line varieties of dimension 1, 2 and 3 are called a ruled surface, a (line) congruence and a
(line) complex respectively. The unfortunate choice of terminology for line varieties goes
back to the 19th century. The terms have so thoroughly established themselves in the
literature, however, that it would be futile to try to introduce new names for the line
varieties.
Note that a ruled surface as defined above is a 1-parameter family of lines, not a set
of points. For example, the hyperboloid of one sheet contains a 1-parameter family of
lines--a ruled surface.
A particularly simple ruled surface is a pencil defined to be the set of lines passing
through a given point P and lying in a given plane ~r.
An important descriptor for a line complex Γ is its order. The order of Γ is defined to
be the number of lines Γ has in common with a general pencil. It is important to count
not only lines in real space but also to properly count lines in the space over the complex
numbers. For any point P in 3-space we may consider all lines of Γ which pass through P. This
subset of Γ is called the complex cone at P. The order of Γ could equivalently be defined
as the order of a general complex cone of Γ. If a general complex cone has as its base a
plane curve of degree d, then d is the order of the cone and the order of Γ.
A theorem of Felix Klein states that in the space over the complex numbers a line
complex can be described by a single homogeneous polynomial equation; a linear complex,
in particular, is described by an equation of the form

a_01 p_23 − a_02 p_13 + a_03 p_12 + a_23 p_01 − a_13 p_02 + a_12 p_03 = 0    (2)

This polynomial has degree 1, so the order of the complex is 1. The polynomial in a
and p = (p_01, p_02, p_03, p_12, p_13, p_23) is denoted by Ω_ap. Equation (1), which we are always
tacitly assuming, can be expressed by the equation Ω_pp = 0.
For a given complex Γ it may happen that Γ contains all lines through some special
point P. In this case P is called a total point of Γ.
Given a line congruence Ψ, only a finite number of lines pass through a given point
in general. Again we count not only lines in real space but lines in the space over the
complex numbers. The number of such lines is constant for almost all points of 3-space;
this number is defined to be the order of Ψ. Analogously, a general plane π in 3-space
contains only a finite number of lines of Ψ. This number is defined to be the class of Ψ.
Points lying on an infinite number of lines of Ψ and planes containing an infinite number
of lines of Ψ are called singular.
Given a congruence Ψ and a line l in 3-space not in Ψ, we may consider the subset
of Ψ consisting of elements of Ψ which meet l. This set can then be described by the
equations which define Ψ together with an additional linear equation of the form of (2).
If this set is irreducible, it is a ruled surface.
In general, there exist a finite number of points P on l with the property that l
together with two elements of Ψ through P lie in a plane. This number is the same for
almost all l and is defined to be the rank of Ψ. A congruence of order n, class m and rank
r is referred to as an (n, m, r)-congruence.
Given a point P all lines through P form a (1, 0, 0)-congruence called the star at P.
A ruled surface ρ can be considered to be an algebraic space curve in 5-dimensional
projective space, which is the space coordinatized by the six homogeneous line coordinates
p_ij (0 ≤ i < j ≤ 3). The curve lies on the variety defined by (1). The order of ρ is
defined to be the number of lines of ρ which meet a general given line, where again lines are
counted properly in the space of complex numbers. For example, ruled surfaces lying on
a hyperboloid have order 2.
If a (space) curve in complex projective space is smooth, it is topologically equivalent
(homeomorphic) to a surface (a so-called Riemann surface), which is either a sphere, a
torus or a surface having a finite number of handles.
The notion of genus is applicable to ruled surfaces, since these can be regarded as space curves.
Given a congruence Ψ, the sectional genus of Ψ is defined to be the genus of a general
ruled surface ρ consisting of the elements of Ψ which meet a given line l not lying in Ψ.
In this section we assume three cameras are set up in general position with centers
O1, O2, O3. The image planes are denoted by I1, I2, I3. The imaging process defines col-
lineations γi : star(Oi) → Ii (i = 1, 2, 3), which we assume to be entirely general.
To consider the critical set, we consider another three centers Ō1, Ō2, Ō3, which are in
general position with respect to each other and the first set of centers O1, O2, O3. The
symbols with bars denote an alternative reconstruction of the scene and the camera po-
sitions. The stars at the Ōi's project to the same image planes defining collineations
γ̄i : star(Ōi) → Ii, also of general type. The compositions αi = γi ∘ γ̄i⁻¹ define collinea-
tions between the lines and the planes through Oi and Ōi.
We shall describe what we mean by "general position" after stating our main result.
Theorem 3.1 With respect to images from three cameras the general critical set Ψ for
the reconstruction problem using lines is a (3,6,5)-congruence. The sectional genus of Ψ
is 5. Ψ contains 10 singular points, 3 of which are located at the camera centers. The
singular cones have order 3 and genus 1. Ψ has no singular planes.
The proof of this theorem is given in [1]. Essentially, the proof determines Ψ's order
and class and Ψ's singular points and planes. These invariants suffice to identify Ψ in
the classification of congruences of order 3 given in [3]. In this classification the other
properties of Ψ can be found.
Just as a ruled surface can be considered to be a curve in 5-space, a congruence can
be considered to be a surface in 5-space.
According to [3, p. 72] Ψ is a surface of order 9 in 5-space. This surface has a plane
representation: the hyperplane sections of Ψ, i.e., the intersections of Ψ with complexes of
order 1, correspond to the system of curves of order 7 which have nodes at 10 given base
points. The plane cubic curves which pass through 9 of the 10 base points correspond to
the singular cones of Ψ.
Let us now describe what is meant by "general position".
First, we assume the centers of projection O1, O2, O3 and Ō1, Ō2, Ō3 are not collinear.
Let π denote the plane spanned by O1, O2, O3 and π̄ denote the plane spanned by
Ō1, Ō2, Ō3.
Next, we assume that the images of π under the various αi intersect in a single point
P̄ = π^{α1} ∩ π^{α2} ∩ π^{α3}. Analogously, we assume the images of π̄ under α1⁻¹, α2⁻¹, α3⁻¹
intersect in a single point P = π̄^{α1⁻¹} ∩ π̄^{α2⁻¹} ∩ π̄^{α3⁻¹}.
Each pair of centers Oi, Oj and collineations αi, αj (i ≠ j = 1, 2, 3) determines a point
locus Qij, which is critical for 3D reconstruction using points. In the general projective
setting Qij is a quadric surface passing through Oi and Oj. We assume each Qij is
a proper quadric and each pair of quadrics Qij, Qik ({i, j, k} = {1, 2, 3}) intersects in an
irreducible curve of order 4. Moreover, we assume that all three quadrics intersect in
8 distinct points. The analogous assumptions are assumed to hold for the quadrics Q̄ij
determined by the centers Ōi, Ōj.
Finally, we assume that for each fixed i = 1, 2, 3 the two lines (OiOj)^{αj} (j = 1, 2, 3,
j ≠ i) are skew. Here OiOj denotes the line joining Oi and Oj.
The algorithm proposed in [7] sets about to determine the rotational components of the
camera orientations with respect to one another in its first step. We shall only concern
ourselves with this step in what follows.
If three cameras are oriented in a manner that they differ only by a translation, we
can define a collineation between the lines and planes through each center of projection
Oi (i = 1, 2, 3) by simply translating the line or the plane from one center to the other.
This collineation coincides with the collineation at the Oi induced by the images, namely
where the points Pi and Pj in the i-th and j-th image correspond when they have the
same image coordinates (i, j = 1, 2, 3).
Regardless of camera orientation, introducing coordinates in the images preemptively
determines collineations between the images and as a result between the corresponding
lines and planes through the centers of projection. We call such lines and planes homol-
ogous, i.e., the images of homologous elements have the same coordinates in the various
images.
In the case where the cameras are simply translated, homologous elements in the stars
at Oi are parallel. Projectively speaking, this means that homologous rays intersect in
the plane at infinity and homologous planes are coaxial with the plane at infinity.
A generalization of the translational situation arises when the collineations between
the lines and planes through the centers are induced by perspectivities with a common
axial plane π, i.e., a ray ri through Oi corresponds to rj through Oj when rj = (ri ∩
π)Oj (i, j = 1, 2, 3). Here (ri ∩ π)Oj denotes the line joining the point ri ∩ π and Oj.
Note that the projections of points X on π give rise to homologous rays OiX, which
per definition have the same coordinates in the images. Let l be a line in 3-space and
li (i = 1, 2, 3) denote the images of l. If l meets π in X, the points pi corresponding to
the projection of X in the images have the same coordinates. (In the translation case X
corresponds to the vanishing point of l.) Thus if the li are drawn in a single plane using
the common coordinate system of the images, they are concurrent, because the pi ∈ li
all have the same coordinates. In the translational case, this point is the vanishing point
of the parallel class of l.
The idea behind the first step in the algorithm of [7] is to find the rotational com-
ponents of the camera orientation by collinearly rearranging two of the images so that
all corresponding lines in all three images are simultaneously concurrent with respect to
a given coordinate system. If

Σ_{i=0}^{2} u_i x_i = 0,   Σ_{i=0}^{2} v_i x_i = 0,   Σ_{i=0}^{2} w_i x_i = 0

are the equations of the projections of a line l, we look for rotations, i.e., 3 × 3 orthogonal
matrices, or more generally simply 3 × 3 invertible matrices M1, M2 such that u =
(u0, u1, u2), M1 v = M1 (v0, v1, v2) and M2 w = M2 (w0, w1, w2) are linearly dependent,
the linear dependency being equivalent to concurrency. This means we look for M1, M2
such that

det(u, M1 v, M2 w) = 0
for all triples of corresponding lines in the images. The algorithm would like to infer that
after applying M1 and M2, the cameras are now oriented so that they are translates
of each other, or in the projective case that the images are perspectively related by a
common axial plane.
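The concurrency condition is easy to evaluate numerically. The following sketch is illustrative only (it is not the implementation of [7]); the function name and the test vectors are hypothetical, and the lines u, v, w are given in homogeneous image-line coordinates:

```python
import numpy as np

def concurrency_residual(u, v, w, M1, M2):
    """Residual of the concurrency condition det(u, M1 v, M2 w) = 0
    for one triple of corresponding image lines (homogeneous coordinates)."""
    return np.linalg.det(np.column_stack([u, M1 @ v, M2 @ w]))

# For translated cameras (M1 = M2 = I) the images of one space line are
# concurrent, so the three line vectors are linearly dependent and the
# residual vanishes.
u = np.array([1.0, 2.0, -3.0])
v = np.array([2.0, -1.0, 1.0])
w = u + v                      # a linearly dependent triple
I = np.eye(3)
assert abs(concurrency_residual(u, v, w, I, I)) < 1e-12
```

In practice one would minimize the sum of squared residuals over all observed line triples to estimate M1, M2.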
Consider the cameras with general orientations, where again homologous rays through
the centers correspond to points in the images having the same coordinates. If a line l in
space meets 3 homologous rays r1, r2, r3, then the projections of l are concurrent, the
point of concurrency being the point corresponding to r1, r2 and r3.
The set of all lines which meet the rays r1, r2 and r3 when the homologous rays are
skew is a ruled surface of order 2 denoted by [r1, r2, r3]. Let Γ = ∪r1,r2,r3 [r1, r2, r3] be
the set of all lines of 3-space meeting triples of homologous rays. If all the lines in the
scene lie in Γ, then their projections have the property that they are concurrent. But
since the cameras were in general position, they are not translates of each other. Thus Γ
defeats the algorithm.
To find the equation for Γ let q1, q2, q3 denote the line coordinates of 3 rays through
O1, not all in a plane. Then q1, q2, q3 form a frame of reference for rays through O1; the
coordinates of any ray through O1 can be written as a nonzero linear combination
λ1q1 + λ2q2 + λ3q3.   (3)
Let Ω denote the bilinear form on line coordinates whose vanishing expresses that two
lines meet, and let s1, s2, s3 and t1, t2, t3 denote the rays through O2 and O3 homologous
to q1, q2, q3. Thus a line l with coordinates p intersects this homologous triple if and only if
0 = Ω(p, λ1q1 + λ2q2 + λ3q3) = Σi λi Ω(p, qi),
0 = Ω(p, λ1s1 + λ2s2 + λ3s3) = Σi λi Ω(p, si),
0 = Ω(p, λ1t1 + λ2t2 + λ3t3) = Σi λi Ω(p, ti).
In general l with line coordinates p lies in Γ if there exist (λ1, λ2, λ3), not all zero, such
that p satisfies the equations above. This will be the case when
det( Ω(p, qi), Ω(p, si), Ω(p, ti) )i=1,2,3 = 0.   (4)
Thus (4) is the equation for Γ; the left-hand side of (4) is a homogeneous polynomial
in p = (p01, p02, p03, p12, p13, p23) of degree 3. We have the following theorem.
Theorem 4.1 The set Γ which defeats the Liu-Huang algorithm is in general a line com-
plex of order 3 given by (4), where q1, q2, q3; s1, s2, s3 and t1, t2, t3 denote line coordinates
of rays through O1, O2 and O3 respectively. The centers are total points of Γ.
To prove the assertion about the total points note that if, say, O1 ∈ l then Ω(p, qi) = 0
for i = 1, 2, 3; hence p satisfies (4). □
In the Euclidean case (4) takes on the special form in which the triples q1, q2, q3; s1, s2, s3
and t1, t2, t3 are line coordinates for an orthogonal triple of lines through O1, O2 and O3
respectively.
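The incidence test behind (4) is concrete once lines are written in Plücker coordinates. The sketch below is an illustrative rendering, not code from the paper: it assumes the standard bilinear incidence form for Ω, and the helper names are hypothetical. It also exhibits the total-point property of Theorem 4.1 — a line through O1 annihilates the first column of the determinant:

```python
import numpy as np

def pluecker(x, y):
    """Pluecker coordinates (p01, p02, p03, p12, p13, p23) of the line
    through homogeneous points x, y of P^3."""
    m = np.outer(x, y) - np.outer(y, x)
    return np.array([m[0, 1], m[0, 2], m[0, 3], m[1, 2], m[1, 3], m[2, 3]])

def omega(p, q):
    """Bilinear incidence form: omega(p, q) = 0 iff the two lines meet."""
    return (p[0]*q[5] - p[1]*q[4] + p[2]*q[3]
            + p[5]*q[0] - p[4]*q[1] + p[3]*q[2])

def lies_in_complex(p, q, s, t, tol=1e-9):
    """Equation (4): p lies in Gamma iff det(omega(p,qi), omega(p,si),
    omega(p,ti)) vanishes, where q, s, t are homologous frames of rays
    through O1, O2, O3."""
    M = np.array([[omega(p, q[i]), omega(p, s[i]), omega(p, t[i])]
                  for i in range(3)])
    return abs(np.linalg.det(M)) < tol

# Total-point check: rays through each center in the coordinate directions,
# and a test line p through O1 (which must lie in the complex).
dirs = np.eye(4)[:3]
O1, O2, O3 = np.array([0., 0, 0, 1]), np.array([1., 0, 0, 1]), np.array([0., 1, 0, 1])
q = [pluecker(O1, d) for d in dirs]
s = [pluecker(O2, d) for d in dirs]
t = [pluecker(O3, d) for d in dirs]
p = pluecker(O1, np.array([1., 2, 3, 0]))   # a line through O1
```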
Definition: Γ is called the complex of common transversals of homologous rays. The
essential properties of Γ were first noted in [10, p. 106] in the context of constructive
geometry. The projective geometry of Γ has also been studied in [5] and [13, IV, pp. 134
ff.].
Before going into the relation between Γ and Ψ we state some properties of still another
congruence.
The Roccella congruence A is a (3,3,2)-congruence of sectional genus 2 which consists
of all common transversals of 3 homographically related plane pencils in general position.
If we restrict the collineations of three stars to a plane pencil, we obtain A as a subset
of the complex of common transversals determined by collinear stars. (Cf. [9], [2, pp.
152-157].)
Let us return to the situation used in defining Ψ. Here O1, O2, O3 and Ō1, Ō2, Ō3
denote the locations of the cameras for two essentially different 3D reconstructions and
αi denote the collineations between the stars at Oi and Ōi which are induced by the images.
Any plane μ in 3-space not meeting O1, O2, O3 determines perspectivities between the
stars at Oi via ri ↦ (ri ∩ μ)Oj (i, j = 1, 2, 3). Since star(Oi) and star(Ōi) are collinear
via αi, these perspectivities also induce collineations between the stars at Ōi. Hence μ
also gives rise to a complex Γμ of common transversals of homologous rays, as explained
in the previous section.
Proposition 5.1 If μ1, μ2 are two distinct planes in 3-space in general position, then
Γμ1 ∩ Γμ2 = Ψ ∪ A ∪ ∪i=1,2,3 star(Oi)
where A denotes the Roccella congruence induced by the pencils at Oi in the planes
(Oi(μ1 ∩ μ2))^αi (i = 1, 2, 3).
References
1. Buchanan, T.: On the critical set for photogrammetric reconstruction using line tokens in
P3(C). To appear.
2. Fano, G.: Studio di alcuni sistemi di rette considerati come superficie dello spazio a cinque
dimensioni. Ann. Mat., Ser. 2, 21, 141-192 (1893).
3. Fano, G.: Nuove ricerche sulle congruenze di rette del 3° ordine prive di linea singolare.
Mem. R. Accad. Sci. Torino, Ser. 2, 51, 1-79 (1902).
4. Hartshorne, R.: Algebraic Geometry. Berlin-Heidelberg-New York: Springer 1977.
5. Kliem, F.: Über Örter von Treffgeraden entsprechender Strahlen in eindeutig und linear ver-
wandten Strahlengebilden erster bis vierter Stufe. Dissertation. Borna-Leipzig: Buchdruck-
erei Robert Noske 1909.
6. Krames, J.: Über die bei der Hauptaufgabe der Luftphotogrammetrie auftretenden
„gefährlichen" Flächen. Bildmessung und Luftbildwesen (Beilage zur Allg. Vermessungs-
Nachr.) 17, Heft 1/2, 1-18 (1942).
7. Liu, Y., Huang, T.S.: Estimation of rigid body motion using straight line correspondences:
further results. In: Proc. 8th Internat. Conf. Pattern Recognition (Paris 1986). Vol. I, pp.
306-309. Los Angeles, CA: IEEE Computer Society 1986.
8. Rinner, K., Burkhardt, R.: Photogrammetrie. In: Handbuch der Vermessungskunde (Hrsg.
Jordan, Eggert, Kneissel), Band III a/3. Stuttgart: J.B. Metzlersche Verlagsbuchhandlung
1972.
9. Roccella, D.: Sugli enti geometrici dello spazio di rette generati dalle intersezioni de' com-
plessi corrispondenti in due o più fasci proiettivi di complessi lineari. Piazza Armerina:
Stabilimento Tipografico Pansini 1882.
10. Schmid, T.: Über trilinear verwandte Felder als Raumbilder. Monatsh. Math. Phys. 6, 99-
106 (1895).
11. Semple, J.G., Kneebone, G.T.: Algebraic Projective Geometry. Oxford: Clarendon Press
1952, reprinted 1979.
12. Severi, F.: Vorlesungen über Algebraische Geometrie. Geometrie auf einer Kurve, Rie-
mannsche Flächen, Abelsche Integrale. Deutsche Übersetzung von E. Löffler. Leipzig-Berlin:
Teubner 1921.
13. Sturm, R.: Die Lehre von den geometrischen Verwandtschaften. Leipzig-Berlin: B.G.
Teubner 1909.
14. Walker, R.J.: Algebraic Curves. Princeton: Princeton University Press 1950. Reprint: New
York: Dover 1962.
15. Zindler, K.: Algebraische Liniengeometrie. In: Encyklopädie der Mathematischen Wis-
senschaften. Leipzig: B.G. Teubner 1928. Band II, Teil 2, 2. Hälfte, Teilband A, pp. 973-1228.
This article was processed using the LaTeX macro package with the ECCV92 style
Intrinsic Surface Properties from Surface
Triangulation
1 Introduction
Intrinsic surface properties are those properties which are not affected by the choice of the
coordinate system, the position of the viewer relative to the surface, or the particular
parameterization of the surface. In [2], Besl and Jain have argued the importance of
the surface curvatures as such intrinsic properties for describing the surface. But such
intrinsic properties are useful only when they can be stably computed. Most of the
techniques proposed so far for computing surface curvatures can only be applied to range
data represented in image form (see [5] and the references therein). In practice, however,
it is not always possible to represent the sampled data in this form, as in the case of
closed surfaces, so other representations must be used.
Surface triangulation refers to a computational structure imposed on a set of 3D
points sampled from a surface to make explicit the proximity relationships between these
points [1]. Such a structure has been used to solve many problems [1]. One question con-
cerning such a structure is what properties of the underlying surface can be computed from
it. It is obvious that some geometric properties, such as area, volume, axes of inertia,
and surface normals at the vertices, can be easily estimated [1]. But it is less clear how to
compute some other intrinsic surface properties. In [8], a method for computing the min-
imal (geodesic) distance on a triangulated surface has been proposed. Lin and Perry [6]
have discussed the use of surface triangulation to compute the Gaussian curvature and
the genus of a surface. In this paper, we propose a scheme for computing the principal
curvatures at the vertices of a triangulated surface.
The basic recipe of the computation is based on the Meusnier and Euler theorems.
We first describe how to use them to compute the principal curvatures. Then the
concrete application of the idea to surface triangulation is presented. Throughout we take
||·|| to be the L2 norm, and ⟨·,·⟩ and ∧ the inner and cross products, respectively.
2.1 Computing Principal Curvatures by the Meusnier and Euler Theorems
Let N be the unit normal to a surface S at a point P. Given a unit vector T in the tangent
plane to S at P, we can pass through P a curve C ⊂ S which has T as its tangent vector
at P. Now let κ be the curvature of C at P, and cos θ = ⟨n, N⟩, where n is the normal
vector to C at P (see Fig. 1). The number
κT = κ cos θ   (1)
is called the normal curvature of C at P. Note that the sign of the normal curvature of
C changes with the choice of the orientation of the surface normal N. The Meusnier theorem
states that all curves lying on S and having the same tangent vector T at P have at
this point the same normal curvature [4]. Among all these curves, a particular one is
the normal section of S at P along T, which is obtained by intersecting S with a plane
containing T and N (see Fig. 1). For this curve, its normal n is aligned with N, with
the same or the opposite orientation. Thus from equation (1), its curvature satisfies
κ = |κT|.
Fig. 1. Local surface geometry around point P. Fig. 2. Choice of vertex triples.
If we let the unit vector T rotate around N, we can define an infinite number of
normal sections, each of which is associated with a normal curvature κT. Among them,
there are two sections, occurring in orthogonal directions, whose normal curvatures
attain the maximum and minimum, respectively [4]. These two normal curvatures are the
principal curvatures κ1 and κ2; their associated directions are the principal directions T1
and T2. The Euler theorem gives the relation between the normal curvature κT of an
arbitrary normal section T and κ1, κ2 as follows [4]:
κT = κ1 cos²φ + κ2 sin²φ   (2)
where φ is the angle between T and T1. Let
ξ = cos φ / |κT|^(1/2),   η = sin φ / |κT|^(1/2).   (3)
Then relation (2) becomes
κ1 ξ² + κ2 η² = ±1,   (4)
where the sign of the right-hand side depends on the choice of orientation of the normal
N at P. Equation (4) defines what is known as the Dupin indicatrix of the surface S
at P. We see that the Dupin indicatrix is a conic defined by κ1 and κ2 in the tangent
plane to S at P. If P is an elliptic point, the Dupin indicatrix is an ellipse (κ1 and κ2
have the same sign). If P is a hyperbolic point, κ1 and κ2 have opposite signs, and the
Dupin indicatrix is made up of two hyperbolas. If axes other than those in the directions
of principal curvature are used, the Dupin indicatrix takes the following general
form:
A ξ² + 2B ξη + C η² = ±1.   (5)
Given these two theorems, a possible scheme to calculate the principal curvatures is
as follows. Let n plane curves (not necessarily normal sections) passing through the point
P be given. For each of them, we can compute its curvature and tangent vector (and thus
its normal vector) at P. From the n computed tangent vectors, the surface normal at P
can be determined by applying a vector product to two of these tangent vectors. Using
equation (1), the normal curvatures along the n tangent directions are computed. Having
chosen two orthogonal axes in the tangent plane to S at P, we use equation (3) to
compute a pair of coordinates (ξ, η) for each direction (note that this time φ is the angle
between the tangent vector and one of the chosen axes), and thus obtain an equation (5).
With n (n ≥ 3) such equations, the three unknowns A, B, and C can be solved for. Finally,
the principal curvatures κ1 and κ2 are
κ1,2 = (A + C ± ((A − C)² + 4B²)^(1/2)) / 2.   (6)
The principal directions are determined by performing a rotation of the two orthogonal
axes, in the tangent plane, by an angle φ where tan 2φ = 2B/(A − C).
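The fitting step of this scheme can be sketched in a few lines. This is an illustrative reconstruction, not the authors' code: it assumes an elliptic point (so the right-hand side of (5) is +1), and the function name and test angles are hypothetical:

```python
import numpy as np

def principal_curvatures(kappas, phis):
    """Least-squares fit of the Dupin indicatrix A xi^2 + 2B xi eta + C eta^2 = 1
    from n >= 3 normal curvatures kappas[l] along tangent directions at angles
    phis[l] to a chosen axis in the tangent plane; returns (kappa1, kappa2).
    Assumes an elliptic point with positive normal curvatures."""
    xi = np.cos(phis) / np.sqrt(np.abs(kappas))     # equation (3)
    eta = np.sin(phis) / np.sqrt(np.abs(kappas))
    D = np.column_stack([xi**2, 2 * xi * eta, eta**2])
    A, B, C = np.linalg.lstsq(D, np.ones(len(kappas)), rcond=None)[0]
    disc = np.sqrt((A - C)**2 + 4 * B**2)           # principal curvatures are
    return (A + C + disc) / 2, (A + C - disc) / 2   # eigenvalues of [[A,B],[B,C]]

# Sanity check on a sphere of radius 2: every normal curvature is 1/2,
# so both principal curvatures should come out as 0.5.
phis = np.array([0.0, 0.4, 0.9, 1.3, 2.0])
kappas = np.full(5, 0.5)
k1, k2 = principal_curvatures(kappas, phis)
```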
2.2 Principal Curvatures from Surface Triangulation
Suppose that a surface triangulation has been obtained. This triangulation connects
each vertex P to a set N_P of vertices, which are the surface neighbors of P in different
directions. It is to be noted that such a neighborhood relationship is not only a topological
one, but also a geometric one. Our goal is to calculate the principal curvatures of the
underlying surface at P using this neighborhood relationship.
We see from the above section that if a set of surface curves passing through P can be
defined, the calculation of the principal curvatures and directions is accomplished
by simply invoking the Meusnier and Euler theorems. So the problem is reduced to
defining such a set of surface curves. As the neighborhood of P defined by the triangulation
reflects the local surface geometry around P, it is natural to define the surface curves
from the vertices in this neighborhood. A simple and direct way is to form n vertex
triples {Tl = (P, Pi, Pj) | Pi, Pj ∈ N_P, 1 ≤ l ≤ n}, and to consider curves interpolating
each triple of vertices as the surface curves. Two issues arise here: one is how to choose
two neighbor vertices Pi and Pj to form with P a vertex triple; the other is which kind of
curve will be used to interpolate each triple of vertices.
To choose the neighbor vertices Pi and Pj, we have to take into account the fact that
the vertices in the triangulation are sampling points of the real surface which are corrupted
by noise. Since in equation (1) the cosine function is nonlinear, it is better to use
surface curves which are as close to a normal section as possible. In this way, the angle θ
between the normal vector n of the surface curve and the surface normal N at P is close to
0 or π, depending on the orientation of the surface normal, which falls in the low-variation
range of the cosine function, thus limiting the effect of the computation error in the angle
θ. On the other hand, the plane defined by P and two geometrically opposite vertices (with
respect to P) is usually closer to a normal section than that defined by other
combinations of vertices. These considerations lead to a simple strategy: we first define a
quantity M to measure the geometric oppositeness of two neighbor vertices Pi and Pj as
(see Fig. 2): M = ⟨P − Pi, Pj − P⟩. We then calculate this quantity for all combinations
of neighbor vertices, and sort them in nonincreasing order. The first n combinations
of vertices are used to form n vertex triples with P. This strategy guarantees that n
can always be greater than or equal to 3, which is the necessary condition for computing
the principal curvatures by the Meusnier and Euler theorems.
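The selection strategy above amounts to scoring every neighbor pair and keeping the top n. A minimal sketch, with hypothetical function name and toy neighbor coordinates (the four neighbors form two diametrically opposite pairs):

```python
import numpy as np
from itertools import combinations

def choose_triples(P, neighbors, n):
    """Pick the n most geometrically opposite neighbor pairs of vertex P,
    using M = <P - Pi, Pj - P> as the oppositeness measure."""
    scored = []
    for i, j in combinations(range(len(neighbors)), 2):
        M = np.dot(P - neighbors[i], neighbors[j] - P)
        scored.append((M, i, j))
    scored.sort(key=lambda s: -s[0])            # nonincreasing in M
    return [(i, j) for _, i, j in scored[:n]]

P = np.array([0.0, 0.0, 0.0])
nbrs = [np.array(v, float) for v in
        [(1, 0, 0.1), (-1, 0, 0.1), (0, 1, 0.1), (0, -1, 0.1)]]
triples = choose_triples(P, nbrs, 2)
# The two diametrically opposite pairs (0,1) and (2,3) score highest.
```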
Having chosen the vertex triples, the next question is which kind of curve is to be
used for their local interpolation. In our case, the only available information being P and
its two neighbor vertices, their circumcircle appears to be the most stable curve which
can be computed from these three vertices. Such a computation is also very efficient.
So we use the circle passing through each triple of vertices as an approximation of the
surface curve.
Now suppose we have chosen a set of vertex triples {Tl | 1 ≤ l ≤ n}. Each vertex triple
Tl = (P, Pi, Pj) defines a plane intersecting the underlying surface and passing through
P. The center Cl of the circumcircle of these three vertices can be easily computed [4].
Thus, the curvature and the unit normal vector of this circle at P are κl = 1/||Cl − P||
and nl = (Cl − P)/||Cl − P||, respectively. The unit tangent vector tl at P is then
tl = nl ∧ (u ∧ v) / ||nl ∧ (u ∧ v)||,   (7)
where u = Pi − P and v = Pj − P. Hence for each triple Tl, we obtain for its circumcircle
the curvature value κl, the tangent vector tl, and the normal vector nl at P. We can
then compute the surface normal N at P as
N = Σ(m<n) Nmn / ||Σ(m<n) Nmn||,   Nmn = tm ∧ tn.   (8)
Note that the orientation of each Nmn must be chosen to coincide with the choice of the
exterior (or interior) of the object (which can be decided from the surface triangulation).
Now, the normal curvature κtl of the surface along the direction tl can be obtained
by equation (1) as
κtl = κl cos θ,   (9)
where θ is the angle between N and nl. After choosing a coordinate system in the tangent
plane passing through P, we can use equation (3) to compute a pair of coordinates (ξl, ηl)
for each direction tl and obtain an equation (5). Normally, n is greater than 3, so the three
unknowns A, B, and C are overdetermined. We can therefore calculate them by the
least-squares technique, minimizing the function
E(A, B, C) = Σ(l=1..n) (A ξl² + 2B ξl ηl + C ηl² ∓ 1)².
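The per-triple circle quantities are straightforward to compute. The sketch below is illustrative: the closed-form circumcenter expression is the standard one (an assumption on our part, since [4] is not reproduced here), and the function name is hypothetical. It returns the curvature κl, normal nl, and tangent tl of equation (7):

```python
import numpy as np

def circle_curvature_tangent(P, Pi, Pj):
    """Curvature kappa_l, unit normal n_l, and unit tangent t_l at P of the
    circumcircle of the vertex triple (P, Pi, Pj); cf. equation (7)."""
    u, v = Pi - P, Pj - P
    w = np.cross(u, v)
    # Standard closed form for the circumcenter of three points in 3-space.
    C = P + np.cross(np.dot(u, u) * v - np.dot(v, v) * u, w) / (2 * np.dot(w, w))
    r = C - P
    kappa = 1.0 / np.linalg.norm(r)     # kappa_l = 1 / ||C_l - P||
    n = r / np.linalg.norm(r)           # n_l = (C_l - P) / ||C_l - P||
    t = np.cross(n, w)                  # t_l along n_l ^ (u ^ v)
    return kappa, n, t / np.linalg.norm(t)

# Three points on the unit circle in the xy-plane: kappa = 1, n points to
# the center, and the tangent at P = (1,0,0) is along the y-axis.
kappa, n, t = circle_curvature_tangent(np.array([1.0, 0, 0]),
                                       np.array([0.0, 1, 0]),
                                       np.array([-1.0, 0, 0]))
```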
3 Experimental Results
In order to characterize the performance of the method proposed above (TRI), we com-
pare it with the method proposed by Besl and Jain (OP) [2], in which second-order
orthogonal polynomials are used to approximate the range data. This comparison is carried
out on a number of synthetic data sets. It is well known that, even with only the trun-
cation error inherent in data quantization, the curvature computation is severely
degraded [2,5]. A common way to improve this is to first smooth the data with an
appropriate Gaussian filter and then retain the results in floating-point form [5]. But the
problem with such smoothing is that it is performed in the image coordinate
system, which is not intrinsic to the surface in the image. The result is that the surface is
slightly modified, which leads to an incorrect curvature computation. So we will not
use the smoothed version of the image in our comparison. We generate three synthetic
range images of planar, spherical, and cylindrical surfaces with the following parameters:
Plane: 0.0125x − 0.0125y + 0.1 f(x, y) = 1, −30 ≤ x, y ≤ 30.
Sphere: x² + y² + f(x, y)² = 100, −10 ≤ x, y ≤ 10, f(x, y) > 0.
References
[1] Boissonnat, J.D.: Geometric structures for three-dimensional shape representation. ACM
Trans. on Graphics, vol. 3, no. 4, pp. 266-286, 1984.
[2] Besl, P.J., Jain, R.C.: Invariant surface characteristics for 3D object recognition in range
images. Comput. Vision, Graphics, Image Processing, vol. 33, pp. 33-80, 1986.
[3] Chen, X., Schmitt, F.: Intrinsic surface properties from surface triangulation. Internal
report, Télécom Paris, Dec. 1991.
[4] Faux, I.D., Pratt, M.J.: Computational Geometry for Design and Manufacture. Ellis Hor-
wood Publishers, 1979.
[5] Flynn, P.J., Jain, A.K.: On reliable curvature estimation. In Proc. IEEE Conf. Computer
Vision and Pattern Recognition, June 1989, pp. 110-116.
[6] Lin, C., Perry, M.J.: Shape description using surface triangularization. In Proc. IEEE
Workshop on Computer Vision: Representation and Control, 1982, pp. 38-43.
[7] Schmitt, F., Chen, X.: Fast segmentation of range images into planar regions. In Proc.
IEEE Conf. Computer Vision and Pattern Recognition, June 1991, pp. 710-711.
[8] Wolfson, E., Schwartz, E.L.: Computing minimal distances on polyhedral surfaces. IEEE
Trans. Pattern Anal. Machine Intell., vol. 11, no. 9, pp. 1001-1005, 1989.
Edge Classification and Depth Reconstruction by
Fusion of Range and Intensity Edge Data *
1 Introduction
Fusion of intensity and range data can provide a fuller, more accurate scene description,
improving segmentation of range data and allowing semantic classification of edge labels.
In a previous paper [3], we presented an approach for the classification of edge labels at a
single site by combining the range and intensity edge data with a simplified Lambertian
shading model. This paper extends the approach to improve the results by incorporating
the constraints of edge continuity and surface smoothness in a relaxation process. The
whole fusion process is summarized in Figure 1(a).
Representing a pixel image as a set of sites on a square lattice, the edges are located
halfway between vertical and horizontal pixel pairs, as shown in Figure 1(b). Fusion
of range and intensity data can be simplified by the assumption of spatial registration
between the two sources, achieved either by acquisition from the same viewpoint or by
geometric transformation. A study of an edge shading model [3] showed the different
appearances in the two sources of intensity and range data for various types of edge
labels. A complete classification of edge labels is: {blade, extremal, fold, mark, shadow,
specular, no_edge}.
An informal basis for the classification of edge labels is derived in [3]. In order to
obtain a quantitative estimate of the edge labels, maximum likelihood estimation is
employed on a reduced set of edge labels, {blade, fold, mark, no_edge}. Extremal edges
are not distinguished from blade edges, nor are specular and shadow edges distinguished
from mark edges. This may be accomplished by separate analysis of the surface curvature
adjacent to the edge and by variation of the lighting parameters.
The initial edge labelling is based solely on the filtered range and intensity data at
single sites, without consideration of the neighbourhood context of either adjacent edge
* The work has been supported by a TC scholarship from the Chinese Education Commis-
sion and the British Council, and by the LAIRD project which is led by BAe and funded
by the SERC/IED (GR/F38327:1551). The LAIRD project is a collaboration between BAe,
BAeSema Ltd, NEL, Heriot Watt University and the Universities of Edinburgh and Surrey.
Fig. 1. (a) The diagram of the fusion process. (b) The dual lattice representation, edge and
depth neighbourhoods, and three basic edge cliques.
or depth sites. In order to improve this initial estimate we apply the well-established
process of relaxation, incorporating general constraints based on edge continuity and
surface smoothness.
The conditional probability of obtaining the estimate (R, L) of depth and edge labels
from (r, Δi) is expressed by:
P(R, L | r, Δi) = p(r, Δi | R, L) P(R, L) / Σ(R,L) p(r, Δi | R, L) P(R, L)   (1)
where (r, Δi) are the depth observation and the filtered intensity discontinuity respec-
tively. To obtain the estimate (R̂, L̂) which maximizes the a-posteriori probability, we
make three assumptions similar to [1].
(a): the filtered intensity discontinuity Δi and range observation r are conditionally
independent of each other given the object geometry R and the edge labels L, i.e. p(r, Δi |
R, L) = p(Δi | R, L) p(r | R, L).
(b): given the edge label Lu at edge site u, the filtered intensity discontinuity Δiu is
independent of the range data R, i.e. p(Δiu | R, Lu) = p(Δiu | Lu).
(c): given the range value Rm at the depth site m, the measurement rm is independent
of the edge labels L, i.e. p(rm | Rm, L) = p(rm | Rm), which is the observation model
of the range sensor and is assumed to be a Gaussian function.
Simplifying (1) with these assumptions, we have
P(R, L | r, Δi) = t1 p(r, Δi | L, R) P(R, L) = t1 p(Δi | L) p(r | R) P(R, L)   (2)
where t1 is a constant, and P(R, L) describes the prior knowledge about the interaction
between depth and edge sites. At edge sites, P(R, L) is expressed by P(R, L) = p(R |
L) P(L). As edge labels are only associated with changes in the range data, p(R | L) is
expressed by p(ΔR | L) p(ΔN | L) (ΔR and ΔN are conditionally independent given the
underlying edge label L). At depth sites P(R, L) = P(L | R) p(R); the first term expresses
the consistency of the depth estimates with the current edge configuration, and the second
is the probability density function of the range value R, which is assumed to be uniform.
Expressing (2) at depth and edge sites separately with the above expressions, we have
P(R, L | r, Δi) = { p(r | R) P(L | R) p(R)   at depth sites
                  { p(Δi | L) p(ΔR | L) p(ΔN | L) P(L)   at edge sites   (3)
It is neither feasible nor desirable to consider the whole image context with reference to
a single depth or edge site within the dual lattice. Consequently, we reduce the scope
of the analysis from the whole image to a local neighbourhood by application of the
MRF model. Using the Markovian property, the constraints of global edge labelling L
and surface smoothness are expressed locally by
P(L) = Πu [P(Ne(u) | Lu) P(Lu)]   and   P(R, L) = Πm [P(L, Nd(m) | Rm) p(Rm)]   (4)
where P(Lu) is the prior probability of an edge label, and Ne(u), Nd(m) are the neigh-
bourhoods of edge and depth sites respectively.
The interaction between depth and edge sites under the surface smoothness and edge
continuity constraints can also be expressed by a Gibbs energy function from the Clifford-
Hammersley theorem with a temperature parameter T [1]. The use of temperature in-
troduces an extra degree of freedom by analogy with temperature control of a material
lattice. In the context of refinement of the edge labels and reconstruction of the depth
data, T intuitively reflects whether the source data or prior knowledge of context is more
important. Putting (4) into (3) and taking negative logarithms to convert into an energy
form, we have
E(R, L | r, Δi) = { Σm [T γm (Rm − rm)²/2σr² + E(L, Nd(m) | Rm) + t2]   at depth sites
                  { Σu [T g(Δu | Lu) + C(P(Lu)) + E(Ne(u) | Lu)]   at edge sites
(5)
where t2 is a constant, σr is the standard deviation of the Gaussian noise in the range
data, and γm is 0 if there is no depth observation at site m and 1 otherwise. Δu is a
discontinuity vector at edge site u representing the changes of range, surface orientation
and reflectance [3].
A widely used smoothness model is the thin membrane model, which is a small-deflection
approximation of the surface area,
E = (1/2) ∫∫ [(∂R/∂x)² + (∂R/∂y)²] dx dy   (7)
The derivative is represented simply by the difference of range values at adjacent sites.
If the edge label between depth sites m and n is mark or no_edge, the depth sites are on
the same surface and a penalty term β(Rm − Rn)² is applied for any violation of surface
smoothness; otherwise a penalty is applied by the edge process for the creation of an edge.
The energy E(L, Nd(m) | Rm) is the summation of penalties over the neighbourhood.
Using the thin membrane model, we obtain a quadratic energy function at depth site
m and an analytic solution
R̂m = [T γm rm/σr² + 2β Σ(n∈Nd(m)) Lmn Rn] / [T γm/σr² + 2β Σ(n∈Nd(m)) Lmn]   (8)
where Lmn is 0 if the edge label between depth sites m and n is blade or fold, and 1
otherwise. This shows that the best depth estimate R̂m is a weighted sum of the neighbours
and the raw observation at this site.
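This closed-form update is cheap to apply per site. The sketch below is illustrative rather than the authors' code: the weights are those obtained by minimizing the stated quadratic energy (data term plus thin-membrane penalties gated by Lmn), the function name is hypothetical, and the default parameter values are the ones quoted in Section 7 (T = 1.0, β = 2.0, σr = 2.0):

```python
def depth_update(r_m, gamma_m, neighbors, labels, T=1.0, beta=2.0, sigma_r=2.0):
    """One closed-form depth update: a weighted mean of the raw observation
    and those neighbors not separated from site m by a blade or fold edge
    (label 1 means 'same surface', label 0 means blade/fold)."""
    data_w = T * gamma_m / sigma_r**2
    num = data_w * r_m + 2 * beta * sum(L * R for R, L in zip(neighbors, labels))
    den = data_w + 2 * beta * sum(labels)
    return num / den

# A blade edge (label 0) excludes the outlier neighbor from the smoothing:
R_hat = depth_update(10.0, 1, neighbors=[10.2, 9.8, 50.0], labels=[1, 1, 0])
```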
Local HCF [2], developed from HCF [1], updates the site within an edge neighbourhood
that would produce the largest energy decrease if replaced by a new value. The interaction
between the edge and depth processes illustrates the importance of the updating order, as
early-updated sites have influence over the other sites. The updating is switched between
the depth and edge processes depending on which one reduces the energy the most.
Once a site is updated, its energy decrease is zero, i.e. there is no better estimate, and
less stable sites can then be updated. The changes of edge labels and range values are
propagated to other sites.
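The greedy, energy-ordered control structure can be sketched abstractly. This toy loop is illustrative only, not the scheme of [1, 2]: the names and the sum-of-squares demo energy are hypothetical, and a real implementation would track per-site energy deltas incrementally rather than recomputing them:

```python
def local_hcf(sites, energy_delta, update, max_iter=100):
    """Greedy Local-HCF-style sweep: repeatedly commit the single site
    update with the largest energy decrease, whether it lies in the depth
    or the edge process, until no update lowers the energy."""
    for _ in range(max_iter):
        best = max(sites, key=energy_delta)
        if energy_delta(best) <= 0:
            break                      # no site improves: local minimum
        update(best)                   # changes propagate to neighbors

# Toy demo: halving a site's value always lowers a sum-of-squares energy,
# so repeated greedy updates drive the values toward zero.
state = {"a": 4.0, "b": 1.0}
delta = lambda s: 0.75 * state[s] ** 2          # decrease from halving state[s]
halve = lambda s: state.__setitem__(s, state[s] / 2)
local_hcf(["a", "b"], delta, halve, max_iter=60)
```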
7 Experiments and discussion
So far we have experimented with the fusion algorithm on both synthetic and real data,
but only the results on real data are shown in Figure 2. We interpolate the gaps between
stripes by a linear function along each row, as the laser stripes are projected onto the scene
vertically. The standard deviations of the noise are assumed to be σr = σi = 2.0. The
parameters used are T = 1.0, β = 2.0, and the number of iterations is 10.
Dense depth data are obtained by reconstruction from sparse range data. Even
though fold edges are sensitive to noise, due to the application of a second-order derivative
filter, the results are greatly improved by the fusion algorithm. The homogeneity criteria
defined locally have limitations for extracting reliable edge labels. Some edges are near to
each other, but outside the scope of the edge neighbourhood. Although more iterations
may improve the results, it may be more productive to further process the data using
some global grouping. For example, the Hough transformation can be used to extract
reliable space lines and arcs. Surface patches may also be extracted from the labeled edges
and the reconstructed depth to generate an intermediate-level object description.
References
1. P. B. Chou and C. M. Brown. The theory and practice of Bayesian image labeling. Int. J.
of Comput. Vision, 4:185-210, 1990.
2. M. J. Swain, L. E. Wixson, and P. B. Chou. Efficient parallel estimation for Markov random
fields. In L. N. Kanal et al. (editors), Uncertainty in Artificial Intelligence 5, pages 407-419.
1990.
3. G. Zhang and A. M. Wallace. Edge labelling by fusion of intensity and range data. In Proc.
British Machine Vision Conf., pages 412-415, 1991.
4. G. Zhang and A. M. Wallace. Semantic boundary description from range and intensity data.
to appear, IEE Int. Conf. on Image Processing and its Applications, 1992.
Fig. 2. Results on real data (widget). From left to right, top row: original intensity, range data
and depth reconstruction; bottom row: classified blade, fold and mark edges.
Image Compression and Reconstruction Using a 1-D Feature
Catalogue
Brian Y.K. Aw¹, Robyn A. Owens¹ and John Ross²
¹ Department of Computer Science, University of Western Australia.
² Department of Psychology, University of Western Australia.
1. Introduction
The image compression ratio achieved by early image coding techniques based on information
theory, operating on natural images, saturated at a value of 10:1 in the early eighties [KIKI].
Later techniques that code an image in terms of its feature map managed to obtain higher
compression ratios, but at a sacrifice of image quality, namely the loss of the original local
luminance form (i.e. feature profiles) at feature points. The technique described in this paper is
able to correct these defects yet maintain compression ratios of around 20:1.
An image can be decomposed into two parts: a feature map and a featureless portion.
In other existing techniques (see the review article [KIKI]), the feature map is thresholded and
only the locations of feature points are coded. Consequently, all information about the original
luminance profiles that give rise to those feature points is lost in the reconstruction phase,
where artificial graded step profiles are used instead. To recover this lost information, our
technique makes use of a common feature catalogue that consists of a number of 1-
dimensional feature templates. Each template describes a feature profile in terms of
normalised mean luminance values and standard deviations at the various pixel locations of the
profile. In [AORI], it has been shown that a catalogue whose templates closely approximate
most feature luminance profiles in many natural images can be derived from some appropriate
sample images. In the coding phase of our technique, both the locations of feature points and
pointers indicating feature types are retained. Each pointer points to the feature template
in the catalogue that best approximates the original luminance profile in the neighbourhood of
the indexing feature point. Subsequently, the information encoded in the pointers is used to
recover the luminance profile of features at the various locations in the image. The same
feature catalogue is used for all images. In our technique, we also encode the featureless
portion of an image in terms of a small number of Fourier coefficients, which are used in a later
stage to recover the background shading in the reconstructed image.
from 0 (lowest point) to 255 (highest point). The feature is located at pixel 3. The mean
luminance values of each template are marked by horizontal white bars. A Gaussian function is
plotted as shades along the vertical axis for each pixel location of a feature profile. The wider
the spread of the shading, the larger the standard deviation value.
z = \frac{1}{N} \sum_{i=1}^{N} \exp\left[ -\frac{(x_i - \mu_i)^2}{\sigma_i^2} \right]    (1)
The template that produces the highest z value is used to represent the feature profile
at the feature point. A pointer from this feature point is then set up and coded with the best-
matched template number and the necessary rescaling parameters (a d.c. shift and a multiplier).
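As a sketch of this matching step (the catalogue contents and the assumption that the profile has already been normalised for d.c. shift and scale are illustrative, not the paper's actual data), the z score of Eq. (1) and the best-template search might look like:

```python
import numpy as np

# Hypothetical catalogue: K templates over N-pixel profiles, each with
# normalised mean luminances (mu) and standard deviations (sigma).
rng = np.random.default_rng(0)
N, K = 7, 16
catalogue = [{"mu": rng.random(N), "sigma": 0.05 + 0.2 * rng.random(N)}
             for _ in range(K)]

def match_score(profile, template):
    # z of Eq. (1): mean of Gaussian-weighted squared residuals
    mu, sigma = template["mu"], template["sigma"]
    return float(np.mean(np.exp(-((profile - mu) ** 2) / sigma ** 2)))

def best_template(profile):
    # the template producing the highest z represents the feature profile
    scores = [match_score(profile, t) for t in catalogue]
    k = int(np.argmax(scores))
    return k, scores[k]

# A profile identical to template 3 (after normalisation) scores exactly
# z = 1.0, since every term in the sum is exp(0).
idx, z = best_template(catalogue[3]["mu"].copy())
```

The pointer stored for the feature point would then hold `idx` together with the d.c. shift and multiplier needed to undo the normalisation.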
For example, the feature point map of an original image "Baby" (Fig. 2(a)) is shown
in Fig. 2(b). This map combines the feature points found in the horizontal and vertical
directions. Centred at each (black dot) location, the scaled 1-D profile of the best matched
feature type from the catalogue is shown in Fig. 2(c). Some features are in the horizontal
direction and some are in the vertical direction. Visually, it is evident that the information
retained in Fig. 2(c), i.e. location plus local form, is richer than just the locational information
itself represented in Fig. 2(b).
Besides coding the feature portion of an image, we also code a low-pass version of its
featureless portion in terms of the coefficients of its low frequency harmonics. For a 256×256-
pixel image, we retain the lowest 10×10 2-dimensional complex coefficients of its FFT (Fig.
2(d)).
We can attain a compression ratio of 20:1 if we (a) assume around 2% of the original
image pixels are feature points either in the horizontal and/or vertical direction, and (b) use the
following code words for the various messages of an image:
an average of 4.5 bits per feature point to code the positional information in
Huffman code;
a 5-bit code word to code the d.c. parameter;
a 4-bit code word for the multiplying (scaling) parameter;
a 3-bit code word to code the feature template number;
a 1-bit code word to indicate the 1-D feature direction (horizontal or
vertical);
a 16-bit code word for the complex FFT coefficients.
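A quick arithmetic check of the claimed ratio under the stated assumptions (2% feature points on a 256×256 8-bit image, and taking the 16-bit word to apply per retained complex coefficient):

```python
# Back-of-the-envelope check of the ~20:1 claim.
side = 256
original_bits = side * side * 8                  # 8-bit grey-scale image
n_features = int(0.02 * side * side)             # ~2% of pixels are features

# position + d.c. + scale + template number + direction, per feature point
bits_per_feature = 4.5 + 5 + 4 + 3 + 1
feature_bits = n_features * bits_per_feature
fft_bits = 10 * 10 * 16                          # lowest 10x10 complex coefficients

ratio = original_bits / (feature_bits + fft_bits)
# ratio comes out a little above 20:1
```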
Figure 2. (a) Original image "Baby". (b) The locations of feature points (shown as black
dots). (c) The template profile of the best matched feature template from the catalogue for
each feature point in the image is superimposed at the location of that point. (d) The low-
passed version of "Baby". Only the lowest 10×10 2-dimensional complex coefficients of the
FFT of "Baby" are retained.
Figure 3. (a) Initial stage of reconstruction of the image "Baby". Feature profiles and low-
pass FFT coefficients are retrieved from the compressed data. (b) Result after 10 iterations.
(c) Result after 50 iterations.
reconstruction process is shown in Fig. 3. The first image (a) is the initial result when the
coded feature local forms and the low-pass data are retrieved. This is followed by the local
averaging operation interleaved with the reinforcement of the coded low-pass Fourier coefficients.
The results after 10 and 50 rounds of iterations are shown in (b) and (c) respectively.
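The alternating loop can be sketched as follows; the 3×3 kernel, the iteration structure, and the way the low-frequency block is re-imposed are assumptions for illustration, not details given in the paper:

```python
import numpy as np

def reconstruct(features, mask, full_fft, iters=10, k=4):
    """Alternate local averaging with re-imposing the decoded feature
    profiles and the retained low-frequency FFT coefficients."""
    img = features.astype(float).copy()
    h, w = img.shape
    for _ in range(iters):
        # 3x3 local averaging
        p = np.pad(img, 1, mode="edge")
        img = sum(p[i:i + h, j:j + w] for i in range(3) for j in range(3)) / 9.0
        # re-impose the decoded feature luminance values
        img[mask] = features[mask]
        # reinforce retained low harmonics (only the [:k, :k] corner block
        # here -- a simplification of keeping the lowest k x k coefficients)
        F = np.fft.fft2(img)
        F[:k, :k] = full_fft[:k, :k]
        img = np.real(np.fft.ifft2(F))
    return img

# Toy run: a smooth ramp image, with "decoded" features along one column.
true = 100.0 * np.add.outer(np.linspace(0, 1, 16), np.linspace(0, 1, 16))
mask = np.zeros_like(true, dtype=bool)
mask[4:12, 8] = True
features = np.where(mask, true, 0.0)
recon = reconstruct(features, mask, np.fft.fft2(true), iters=10, k=4)
# the d.c. term is re-imposed last, so the mean luminance is recovered
```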
Figure 4. Test images. The original images at different scales are in the upper rows and the
reconstructed ones are in the lower rows. The images are named "Baby" (top-left), "Animal"
(top-right), "Machine" (bottom-left) and "X-Ray" (bottom-right). The three different image
sizes are 256×256, 128×128 and 64×64 pixels.
An error percentage per pixel on the basis of a 256 level grey scale is computed
(Table 1) as an indication of the difference between the original image and its reconstructed
version. The error is calculated as the average of the absolute difference in luminance values
between corresponding pixel locations in the two images.
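The measure amounts to a mean absolute difference expressed on the 256-level scale; a minimal sketch:

```python
import numpy as np

def error_percentage(original, reconstructed):
    # average absolute luminance difference per pixel, as a percentage
    # of the 256-level grey scale (the measure used for Table 1)
    diff = np.abs(original.astype(float) - reconstructed.astype(float))
    return float(diff.mean() / 256.0 * 100.0)

a = np.full((4, 4), 100.0)
b = np.full((4, 4), 110.0)
err = error_percentage(a, b)   # 10 grey levels -> 10/256 of the scale
```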
Table 1. Average percentage difference per pixel between the original and reconstructed
images.
6. Conclusions
It was noted that recent image compression and reconstruction techniques achieved high
compression ratios at the expense of losing information at feature points. This is due to the
fact that a more fundamental question was yet to find its solution, that is: What is a feature? In
a previous paper [AOR1], it is shown that features can assume a wide variety of local
luminance forms. In our analysis, a catalogue of 16 templates was sufficient to accommodate
most feature profiles encountered in a number of natural images. By setting up pointers to this
catalogue, we have shown that the local luminance forms of features can be preserved in a
compressed format. Subsequently, local feature forms are reproduced faithfully in the
reconstruction process.
It is also demonstrated that a 2-dimensional averaging algorithm, bolstered by the
incorporation of the FFT coefficients of lower harmonics of the original image, is able to
reconstruct a good quality image from its compressed feature map.
In terms of efficiency, this scheme achieves a compression ratio of around 20:1 with
little sacrifice in image quality. A higher compression ratio is within reach by incorporating
more efficient techniques in the encoding of feature maps. One possible way is to first obtain a
closed contour by using standard region-growing methods in the local energy maps.
It might also be possible to make further gains by taking advantage of the fact that
similar features recur at different scales (see also [AOR1]). Reconstruction might be possible
in stages, beginning at the coarsest scale and then moving up the scale pyramid, introducing at
finer scales only features that first occur at those scales. The scheme of transmitting image
data in stages according to scales of interest makes good sense in terms of the use of
transmission media.
References
[AOR1] Aw, B.Y.K., Owens, R.A., Ross, J.: A catalogue of 1-D features in natural images.
Manuscript submitted to Neural Computation (1991)
[KIK1] Kunt, M., Ikonomopoulos, A., Kocher, M.: Second-generation image-coding
techniques. Proc. of the IEEE 73 (1985) 549-574
[MO1] Morrone, M.C., Owens, R.A.: Feature detection from local energy. Pattern Recognition
Letters 6 (1987) 303-313
Canonical Frames for Planar Object Recognition*
1 Introduction
There has been considerable recent success in using projective invariants of plane alge-
braic curves as index functions for recognition in model based vision [10, 15, 23, 28]. Less
attention has been given to invariants for smooth non-algebraic curves. In this paper we
present a novel and simple method of constructing a family of invariants for non-convex
smooth curves.
Lamdan et al. [18] proposed and implemented a canonical frame construction. We
improve on this in two important ways:
1. The transformation here is projective not affine. Central projection between two
planes is a type of projective transformation, not subject to the limitations on viewing
distance required for affine approximation to hold. The affine transformation is only
valid if the object depth variation is small compared to the camera viewing distance.
Of course, projection includes the case that the transformation might actually be
affine, because the affine group is a sub-group of the projective group.
2. Recognition is entirely via index functions based on projective invariants. In [18]
recognition was a mixture of indexing and Hough style voting.
As has been argued elsewhere [10, 23], there is considerable benefit in using invariants
to imaging transformations as indexing functions for generating recognition hypotheses.
In particular, such functions only involve image measurements and avoid comparison
* CAR acknowledges the support of GE. AZ acknowledges the support of the SERC. DAF
acknowledges the support of Magdalen College, Oxford and of GE. JLM acknowledges the
support of the GE Coolidge Fellowship. The GE CRD laboratory is supported in part by the
following: DARPA contract DACA-76-86-C-007, AFOSR contract F49620-89-C-003.
against each object in the model library. Complexity is then O(i^k) rather than O(A i^k m^k)
for the pose based comparison, where i is the number of image features, A the number
of models, m the number of features per model, and k the number of features needed
to compute invariants, or determine transformations where invariants are not used⁴.
Recognition hypotheses are verified in both cases by back-projection from models to
images, and determining overlap of projected model with image curves.
This paper largely follows the path suggested by Lamdan et al. [18], where a very
good discussion is given of reasonable requirements for curve representation in order to
facilitate recognition tolerant to occlusion and clutter. Briefly, indices should be local
and have some redundancy (i.e. several per outline), so if one index is occluded there
is a good chance recognition can proceed on other visible parts; they should be stable,
so small perturbations in the curve (due to image noise) do not cause large fluctuations
in index value; and they should have sufficient discriminatory power over models in the
library (so all models do not have similar index values). All of these requirements are
satisfied by the bitangent construction described here. Provided the object outline is
sufficiently rich in structure there will be several such constructions for each object, and
thus redundancy in the representation giving partial immunity to occlusion.
We briefly review previous methods for curve recognition under distorting imaging
transformations in section 2. The canonical frame construction and invariant measures
are described in section 3. We apply these techniques to model based recognition for a
library of planar objects of arbitrary (but non-convex) shape. They are recognised from
single perspective views (no affine approximation is assumed) in scenes in which there
may be partial occlusion by other known objects, or unknown clutter. The process does
not require camera calibration. This is described in section 4. Finally, in section 5 we
show how these measures can be used to reassemble a jigsaw.
2 Background
1. Semi-differential invariants:
An ingenious method, proposed and implemented independently by Van Gool et
al. [27] and Barrett et al. [3], is to trade derivatives at a point for more points. They
demonstrate that at a combinatorial cost (some "reference" points must be matched)
projective differential invariants can be derived requiring only first or second deriva-
tives.
4 This complexity analysis is not for the asymptotic case as we assume that the library, imple-
mented as a hash table, is sparse. Should this not be the case we increase the dimension of
the library by using further invariants.
5 Recall that the familiar Euclidean curvature has this invariance:
\kappa = (\dot{x}(t)\ddot{y}(t) - \dot{y}(t)\ddot{x}(t)) / (\dot{x}(t)^2 + \dot{y}(t)^2)^{3/2}
irrespective of the parameterisation t. That is, t can be replaced by f(t) without affecting
the value of \kappa.
2. Representation by algebraic curves:
Since invariants for algebraic curves are so well established it is natural to try and
exploit them by "attaching" algebraic curves to smooth curves. The algebraic invari-
ants of these attached curves are then used to characterise the non-algebraic curve.
This is the approach taken in [12] for affine invariance and [9, 17] for projective in-
variance. The problem here is that such methods tend to be global. Consequently,
the associated algebraic curves and their invariants change if part of the curve is
occluded.
3. Distinguished points:
A common method is to determine distinguished points on the curve, such as in-
flections and corners, which can be located before and after projection. Such points
then effectively represent the curve - either to determine the transformation (e.g.
alignment[14]) or to form algebraic invariants. The disadvantage is curve information
between these points is effectively wasted.
4. Distinguished frame:
The goal is to get to some distinguished frame from any starting point; usually the
frame corresponding to the plane of the object. A typical method is to maximise
a function over all possible transformations - the transformed frame producing the
function maximum determines the distinguished frame. Brady and Yuille considered a
function measuring compactness over orthography [7, 11]; Witkin and others texture
isotropy (over orthography) [5, 16, 30]; Marinos and Blake texture homogeneity (over
perspectivities) [20]; and more recently Blake and Sinclair [6] with compactness over
projectivities. Once in the distinguished frame any measurements act as invariants
(because the measurements are independent of the original frame and transforma-
tion). Again this is a global approach and degrades with occlusion. There are also
problems of uniqueness if the cost function is not convex, i.e. there are many local
maxima.
5. Canonical frame:
Distinguished points are used to transform a portion of the object curve to a canonical
frame [18]. As for the distinguished frame, any measurement made in this frame is
an invariant. However, the canonical frame does not carry over the disadvantages:
i) it is semi-local (depends on more than a single point) but is not global; ii) the
transformation to the canonical frame is unique.
3.1 Projective Transformations
transformations cannot be uniquely recovered from the single 3×3 matrix, since there
are 6 unknown pose parameters, and 4 unknown camera parameters (camera centre, focal
length and aspect ratio). We therefore have 10 unknowns with 8 constraints.
The mapping of four points between the planes is sufficient to determine the trans-
formation matrix T (each point provides two constraints, therefore 4 independent points
provide 4 × 2 = 8 constraints). Corresponding points (x_i, y_i) and (X_i, Y_i) are represented
by homogeneous 3-vectors (x_i, y_i, 1)^T and (X_i, Y_i, 1)^T. The projective transformation
is x_i = T X_i, with i ∈ {1, .., 4}. These equations are straightforward to solve, for example
by Gaussian elimination.
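The four-point solve can be sketched as follows (a standard DLT-style formulation, not the authors' code; fixing T[2,2] = 1 is an assumption of this sketch that holds whenever that entry is non-zero):

```python
import numpy as np

def projectivity_from_points(src, dst):
    """Solve x = TX for the 3x3 matrix T from four point correspondences,
    fixing T[2,2] = 1 so the eight remaining entries match the
    4 x 2 = 8 constraints."""
    A, b = [], []
    for (X, Y), (x, y) in zip(src, dst):
        A.append([X, Y, 1, 0, 0, 0, -x * X, -x * Y]); b.append(x)
        A.append([0, 0, 0, X, Y, 1, -y * X, -y * Y]); b.append(y)
    t = np.linalg.solve(np.array(A, float), np.array(b, float))
    return np.append(t, 1.0).reshape(3, 3)

def apply_projectivity(T, pts):
    # map inhomogeneous 2-D points through T and renormalise
    p = np.hstack([pts, np.ones((len(pts), 1))]) @ T.T
    return p[:, :2] / p[:, 2:3]

# Map an imaged quadrilateral onto the corners of the unit square, as in
# the canonical frame construction of section 3.2.
quad = np.array([[10, 20], [200, 40], [220, 230], [30, 200]], float)
square = np.array([[0, 0], [1, 0], [1, 1], [0, 1]], float)
T = projectivity_from_points(quad, square)
```

The same T then maps every point of the concavity curve into the canonical frame.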
Projectivities form a group, so every action has an inverse, and the composition of two
projectivities is also a projectivity. Consequently two images, from different viewpoints,
of the same object are related by a projectivity. This result is used in the verification
stage of matching.
3.2 Obtaining Four Distinguished Points
The aim here is to exploit a construction that is preserved under projection. Certain
properties, such as tangency and point of tangency are preserved by projection [26]. We
use tangency to select 4 distinguished points on the curve (see figure 1) and then de-
termine the projection that maps these to the corners of a unit square in the canonical
frame. This projectivity is then used to map the curve into this frame. Figure 2 demon-
strates this process for one concavity of a spanner. The object curve, and any projective
view of it, are mapped into the same curve. Consequently, any (metric) measurements
made in this frame are invariant descriptors and hence may be used as index functions to
recognise the object. For example the location of any point in the frame is an invariant;
it is not necessary to use Euclidean invariants such as curvature.
Lamdan et al. [18] used bitangents to obtain two of three points to define a canonical
frame under affine transformations. The third point was obtained by introducing a line
parallel to the bitangent line in contact with the apex of the concavity. Since parallelism is
not preserved under projective transformations, we use tangency conditions to define our
third and fourth points. The selection of the corners of a unit square as the corresponding
points in the canonical frame is arbitrary - any four points, no three of which are collinear,
will do.
Alternative constructions are possible using other projectively preserved properties.
For example, inflections can be used in two ways: i) to define a distinguished point on
the curve; and ii) to define a line which is tangent at the inflection (3 point contact with
the curve). If a concavity contains an inflection (and therefore it will necessarily have at
least two inflections), then the bitangent contact points and inflections can be used as
the four correspondence points. We believe a construction based on inflections will not be
Fig. 1. (a) Construction of the four points necessary to define the canonical frame for a concavity.
The first two points (A D) are points of bitangency that mark the entrance to the concavity.
Two further distinguished points, (B C), are obtained from rays cast from the bitangent contact
points and tangent to the curve segment within the concavity. These four points are used to
map the curve to the canonical frame. (b) Curve in canonical frame. A projection is constructed
that transforms the four points in (a) to the corner of the unit square. The same projection
transforms the curve into this frame.
1. Curves in the canonical frame for differing views of the same object should be very
"similar".
2. Curves in the canonical frame from different objects should "differ" from each other.
3.3 Stability Over Views
The stability of the canonical frame representation is illustrated by figures 3a-d, which
show three different views of the same spanner with extracted concavity and four reference
points. The marked edge data is then mapped into the canonical frame. The curves in the
canonical frame are almost identical. Representative images of other objects, a second
6 One interesting case of a further application of the bitangent construction is in forming in-
variants for curves with double points. This construction uses the dual space representation of
a curve where the curve tangents (which are lines) are represented as homogeneous points in
the plane. Then, a double point maps to a bitangent in the dual space of the curve (tangent
space), and so invariants can be formed in the dual space.
Fig. 2. Canonical frame transformation for a spanner concavity. (a) Original image. (b) Bitan-
gent and tangents (see figure 1). These are determined via a bitangent detector acting on image
edge data. (c) Four distinguished points and concavity curve. (d) Projected curve in canonical
frame. The curve passes through the corners of the unit square which are the projections of the
four distinguished points. Note, the spanner has four external bitangents, four internal bitan-
gents, and also bitangents which cross the boundary. Each of these can generate a curve in the
canonical frame. Consequently, considerable redundancy is possible in the representation.
spanner and a pair of scissors, are shown in figures 4 and 5 with their corresponding
canonical curves (again from three views). Note t h a t the jagged portion (A), of the curve
in figure 5, varies over viewing position, but t h a t the smooth portions (B), are consistent
for all views. This is because (B) is produced by the plastic handles of the scissors, which
are coplanar with the four reference points. The metal hinge is not coplanar with the
reference point and so (A) is not positioned in a projectively invariant manner. This p a r t
of the curve must be excised. The variation emphasises the fact t h a t the canonical frame
construction is defined only on p l a n a r structures.
3.4 Index Functions and Discrimination
Since any measurements made in the canonical frame are invariant signatures for the
curve, the question is what is the optimum set for discrimination over objects in the
library? Clear criteria are that the number of measures should be reasonably small, but
that there should be enough to discriminate objects from clutter, and that each one
should be useful.
It appears that the most naive measurements, area moments, are stable and efficient
discriminators. We use the area bounded by the x-axis and the curve. The moments
Fig. 3. (a) - (c) Three views of a spanner with extracted concavity curves and distinguished
points marked. Note the very different appearance due to perspective effects. (d) Canonical
frame curves for the three different views of the spanner. The curves are almost identical demon-
strating the stability of the method. Of course the same curve would result from a projective
transformation between the object and canonical frame.
Fig. 4. (a) A second spanner with extracted concavity curves and distinguished points marked.
(b) Canonical frame curves for this image and the same spanner from two other viewpoints.
Again, the curves are very similar.
F i g . 5. (a) A pair of scissors with extracted concavity curves and distinguished points marked.
(b) Canonical frame concavity curves from three views of the pair of scissors. The smooth end
portions of the curve (B) correspond to regions of the concavity coplanar with the four reference
points. These match well between images. The jagged portions (A) do not match as well because
these are formed by edges non-coplanar with the reference points.
computed for the three views of the first spanner are given in table 1, and for all the
objects in table 2. It is clear that in practice this construction gives very good results. For
example, area enclosed by the curve is constant to 5% over views (viewpoint invariance),
whilst differing by more than 30% between the spanner and scissors (discrimination).
However, area alone could not reliably distinguish the two spanners (their areas differ
by only 7%).
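A sketch of such area-moment measures for a sampled curve in the canonical frame (the trapezoidal integration scheme and the exact moment definitions are assumptions of this sketch; the paper does not spell out its formulas):

```python
import numpy as np

def _trapz(f, x):
    # trapezoidal integration of samples f over abscissae x
    return float(np.sum((f[1:] + f[:-1]) * np.diff(x)) / 2.0)

def curve_moments(x, y):
    """Area between the curve and the x-axis, plus normalised first and
    second moments about the y- and x-axes."""
    area = _trapz(y, x)
    mx = _trapz(x * y, x) / area
    mx2 = _trapz(x ** 2 * y, x) / area
    my = _trapz(y ** 2 / 2.0, x) / area
    my2 = _trapz(y ** 3 / 3.0, x) / area
    return area, mx, mx2, my, my2

# A symmetric bump over [0, 1]: its x-centroid must be 0.5, mirroring
# the Mx values of roughly 0.5 reported for all three objects in table 2.
x = np.linspace(0.0, 1.0, 201)
area, mx, mx2, my, my2 = curve_moments(x, np.sin(np.pi * x))
```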
Table 1. Moments computed for the spanner concavity when in the canonical frame (figure 3).
The moments are about the x and y axes. Both the first and second moments are computed.
The values are constant over change in the viewing position and so can be used as invariant
measures to index into a library.
view       Area  Mx     Mx^2   My     My^2
spanner 1  1.35  0.516  0.341  0.720  0.686
scissors   1.99  0.506  0.318  1.107  1.694
spanner 2  1.26  0.507  0.334  0.665  0.584
Table 2. Moments for the three different objects are sufficiently different that they can be
used for model discrimination. Note for example, that the measures for Mx^2 appear to be very
similar, but because the values of Mx^2 are very stable the small differences provide sufficient
discrimination.
1. Given the model base we may perform a principal axis analysis to determine the
dominant features (ie. the image shapes corresponding to the largest eigenvalues of
the library covariance matrix) that provide the best discrimination between different
models. One obvious problem with this approach is that the eigenvectors will be of
high dimension, and so may not be realistically computable.
2. We may transform the data in the canonical frame to a set of orthogonal functions,
for example using a Walsh or cosine transform, and use the transform coefficients as
indexing values.
We are currently investigating different choices of index function, but in the demon-
stration of object recognition given in the next section we simply use area based mea-
surements.
This closely follows the system described in [23] where more details are given. There are
two stages:
Pre-processing:
1. Model Acquisition:
Models are extracted directly from a single image of the unoccluded object. The
edgel list is stored for later use in the verification process. Segmentation is carried
out as described below to delimit concavities. No measurements are needed on
the actual object, nor are pose or camera intrinsic parameters required.
2. Add to model library:
Invariant vectors of measures are calculated as described in section 3.4. Each
component of this vector is an invariant measure that may be used as an index
to the object. These vectors are entered into a library which will be accessed as
a hash table.
Recognition:
1. Extract concavities:
Feature extraction and segmentation is carried out as below to delimit concavities.
2. Compute indices for each concavity:
As described in section 3.4.
3. Index into library:
If the index key corresponds to a table entry this is used to generate a recognition
hypothesis.
4. Hypothesis Verification:
Verification proceeds in two phases (both based on the verification procedure
of [23]):
- Check that the measured and expected model curves in the canonical frame
are similar (that is, lie close to each other).
- Project the edgel data from an acquisition image onto the current image.
If sufficient projected edges overlap the target image edgels, the match is
accepted. Note that the projective transformation between acquisition image
and target image is computed directly from the correspondence of the four
points used in the canonical frame construction.
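The hash-table indexing above can be sketched as follows (the quantisation scheme and cell size are illustrative assumptions, not details from the paper): invariant vectors are quantised so that a noisy re-measurement still lands in its model's bucket.

```python
CELL = 0.05   # quantisation step; should exceed the measurement noise

def key(invariants, cell=CELL):
    return tuple(round(v / cell) for v in invariants)

library = {}

def add_model(name, invariants):
    library.setdefault(key(invariants), []).append(name)

def index(invariants):
    # recognition hypotheses for a measured invariant vector
    return library.get(key(invariants), [])

# toy index vectors (Area, Mx, My) taken from table 2
add_model("spanner 1", (1.35, 0.516, 0.720))
add_model("scissors",  (1.99, 0.506, 1.107))
add_model("spanner 2", (1.26, 0.507, 0.665))
hypotheses = index((1.36, 0.515, 0.722))   # slightly perturbed measurement
```

Values falling near a cell boundary can quantise into the adjacent bucket, so a practical scheme would also probe neighbouring cells before declaring a miss.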
4.1 Segmentation
A local implementation of Canny's edge detector [8] is used to find edges to sub-pixel
accuracy. These edge chains are linked, extrapolating over any small gaps. Concavities
are detected by finding bitangent lines.
Bitangent lines are found by computing approximations to the tangents of the curve,
and representing these as points in the space of lines on the plane. Pairs of points that
are close together in this space are found by a coarse search. These pairs represent
approximate bitangents, which are refined through a convex hull construction.
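The coarse search in the space of lines might be sketched as follows (the tangent estimation, line parameterisation and tolerances are assumptions of this sketch, and the convex-hull refinement step is omitted):

```python
import numpy as np

def tangent_lines(pts):
    """Central-difference tangents of a closed sampled curve, written as
    lines x*cos(t) + y*sin(t) = d, i.e. points (t, d) in line space."""
    diff = np.roll(pts, -1, axis=0) - np.roll(pts, 1, axis=0)
    t = np.arctan2(diff[:, 0], -diff[:, 1])      # normal angle
    d = pts[:, 0] * np.cos(t) + pts[:, 1] * np.sin(t)
    return t, d

def same_line(t1, d1, t2, d2, ang_tol, dist_tol):
    # (t, d) and (t + pi, -d) denote the same line
    if abs(t1 - t2) < ang_tol and abs(d1 - d2) < dist_tol:
        return True
    return abs(abs(t1 - t2) - np.pi) < ang_tol and abs(d1 + d2) < dist_tol

def approximate_bitangents(pts, ang_tol=0.05, dist_tol=0.05, sep=10):
    """Pairs of curve samples whose tangent lines nearly coincide in
    line space; each pair is an approximate bitangent."""
    t, d = tangent_lines(pts)
    n, pairs = len(pts), []
    for i in range(n):
        for j in range(i + 1, n):
            if min(j - i, n - (j - i)) < sep:    # skip near-neighbours
                continue
            if same_line(t[i], d[i], t[j], d[j], ang_tol, dist_tol):
                pairs.append((i, j))
    return pairs

phi = np.linspace(0, 2 * np.pi, 100, endpoint=False)
circle = np.stack([np.cos(phi), np.sin(phi)], axis=1)
# a convex curve has no bitangents, so the search returns an empty list
```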
Some initial results of the system are demonstrated in figures 6 and 7. At present the
model base consists of 5 objects: 2 spanners, scissors, pliers and a hacksaw. The figures
demonstrate recognition under perspective of two models from this library despite the
presence of partial occlusion and other objects (clutter) not in the library. Note there is
a two-fold ambiguity in the matching of curve tangent points to the canonical frame. The
matching depends solely on the ordering around the curve. To overcome this problem
indexes and curves for both orderings are stored. Any problems with local symmetry
in the concavity giving rise to ambiguous matches will be detected by back projection
during the verification process.
We have selected the problem of assembling a jigsaw puzzle to illustrate the shape dis-
criminating power of the canonical frame approach. Jigsaw assembly has also long been
considered a challenging vision task [13, 31]. The idea is that matching pieces will have
the same invariant signatures for the tab and slot curves. We image a jumble of (unoc-
cluded) puzzle pieces under significant perspective distortion and then "assemble" the
puzzle by matching the pieces using canonical frame matching. The pieces are assumed to
be planar but not necessarily coplanar with each other (in practice the pieces are not even
taken from a single image). The assembly process is carried out by mapping the pieces to
a common canonical frame and then aligning the matching curves. The texture patterns
on the pieces are not used in the matching process but we warp the image of each piece
to portray the assembly in a single plane. This experiment illustrates that the invariants
calculated from the canonical frame can be used to compare unknown objects in a single
image as well as classify objects from a library of model curves. Such comparison tasks
are not readily tackled with conventional model-based recognition systems.
Fig. 6. (a) Spanner almost entirely occluded by keys. The keys are not in the library, and are
clutter in this scene. (b) Detected concavities, highlighted in white, which are used to compute
indexes. (c) The spanner, which is the only model in the scene contained in the library, is
recognised from the end slot concavity. The projected outline used for verification is highlighted
in white.
5.1 Matching details
Edges are extracted from unoccluded views of the pieces using a Canny [8] edge
detector. Each piece is assumed to have four sides and to connect to at most four
other pieces⁷. Each of the four sides of a piece therefore either represents an individual shape
descriptor that matches another piece within the jigsaw or is an edge piece. Each of the side
curves is then classified as either a straight side piece, a tab, or a slot, depending on the
general shape of the curve in relation to the rest of the piece of which it is part. Each
curve classed as a tab or a slot contains at least one significant concavity that corresponds
to the tab or slot. We map each of the concavities into the canonical frame and search
for the unique matching side. Once this is found the pieces can be joined together.
5.2 Reconstruction
The first corner piece found is used as the bottom left hand piece of the completed puzzle
(a corner piece has two straight sides). This piece is used as the base unit square in the
canonical frame, on which the puzzle is built. The piece immediately to its right can be
mapped into the corner piece's image frame using the image to canonical frame mappings
of the interlocking tabs and slots. Once this transformation is known the grey level values
7 This restriction is for implementation purposes only, and does not reduce the value of the
demonstration.
Fig. 7. (a) Image of various planar objects. (b) Concavities, highlighted in white, which are
used to compute indexes. (c) The pliers, which are the only model in the scene contained in
the library, are recognised and verified by projecting the edgels from an acquisition image, and
checking overlap with edgels in this image.
of pixels within the bounds of the new piece can be projected into the frame of the corner
piece, and so effect the joining of the two pieces. A similar process can be performed to
map the piece directly above the corner piece into the corner image frame.
The piece diagonally above and to the right of the corner interlocks both the pieces
above and to the right of the corner. We therefore do a least squares fit to determine the
projectivity from the eight correspondences, and again render by projecting image grey
values. A similar process is applied to the rest of the pieces.
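Under the same parameterisation as in section 3.1 (fixing T[2,2] = 1 is an assumption of this sketch), the eight-correspondence fit reduces to a linear least-squares problem:

```python
import numpy as np

def projectivity_lstsq(src, dst):
    """Least-squares fit of the 3x3 projectivity x = TX from any number
    of point correspondences -- here the eight interlocking tab/slot
    points of two adjacent jigsaw pieces."""
    A, b = [], []
    for (X, Y), (x, y) in zip(src, dst):
        A.append([X, Y, 1, 0, 0, 0, -x * X, -x * Y]); b.append(x)
        A.append([0, 0, 0, X, Y, 1, -y * X, -y * Y]); b.append(y)
    t, *_ = np.linalg.lstsq(np.array(A, float), np.array(b, float),
                            rcond=None)
    return np.append(t, 1.0).reshape(3, 3)

# synthetic check: project eight points through a known T, then recover it
T_true = np.array([[1.2, 0.1, 3.0], [0.05, 0.9, -2.0], [0.001, 0.002, 1.0]])
src = np.array([[0, 0], [4, 0], [4, 4], [0, 4],
                [1, 2], [3, 1], [2, 3], [1, 1]], float)
h = np.hstack([src, np.ones((8, 1))]) @ T_true.T
dst = h[:, :2] / h[:, 2:3]
T = projectivity_lstsq(src, dst)
```

With noisy image measurements the fit averages the error over all eight correspondences instead of trusting any four of them exactly.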
Two examples of this are shown in figures 8 and 9. Both show the original pieces, and
the final assembled and rendered puzzle.
This is an O(n²) algorithm (with n the number of pieces). Extracting indexes in the
canonical frame, building a hash table, and using these indices for matching as in the
recognition system, would reduce the complexity to O(n), but here n is small and the
time taken in computing matches is negligible.
Some extensions are obvious:
1. The final reconstruction in the canonical frame should be mapped to a rectangle with
the correct aspect ratio for the assembled jigsaw to remove any projective effects, or
at least to a frame in which corners have right angles.
2. There are gaps in the assembled pattern arising because jigsaw pieces are mapped
by a transformation determined by only a small part of the outline. This can be
improved by determining the projectivity from all distinguished points around each
piece using least squares.
Fig. 8. (a) Two pieces of a jigsaw, with (b) the assembled and rendered solution. The puzzle
is solved and rendered using only information from this image. No camera intrinsic parameters
or pose information is needed. Note the large perspective distortion of the pieces in the original
image (a), which are not in the same plane (the right hand piece lies in a plane at about 45° to
the plane of the other piece).
Fig. 9. (a) A six piece jigsaw, with (b) the assembled and rendered solution. The puzzle is solved
and rendered using only information from this image. No camera intrinsic parameters or pose
information is needed.
1. At present there are five objects in the library. We are currently including more
objects. Efficient development will require attention to the measures used for index
functions.
2. We have assumed that the uncalibrated imaging process may be modeled by a pro-
jectivity. This is exact for a pin-hole camera, but corrections must be applied if
radial-distortion is present. We are currently evaluating this correction [4, 25].
3. We can observe the reliability of an invariant measure by perturbing the distinguished
points and recomputing the invariant values. This will be of benefit both during
library construction and during the recognition process: i) We only use invariant
indexes that are affected little by the bitangent locations, and ii) during verification
confidence in a match is weighted by the stability of the invariant measure.
4. At present we use only concavities (exterior bitangents). This does not exploit the
full structure of the curve. We wish to limit the bitangents used to those that do not
cross the curve, but this does not prevent the use of internal bitangents. Using these
will further improve immunity to occlusion.
5. There are obvious extensions for computing canonical frames for non-smooth curves.
If a tangency discontinuity is observed we can use the two tangents immediately
either side of the discontinuity as reference lines. We then find two more points or
lines and uniquely determine the map to the canonical frame.
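The perturbation analysis of item 3 can be illustrated on any projective invariant; the sketch below uses the cross-ratio of four collinear points (parametrised by position along the line) rather than the paper's bitangent-based indexes.

```python
import random
import statistics

def cross_ratio(t1, t2, t3, t4):
    # Cross-ratio of four collinear points: a projective invariant.
    return ((t1 - t3) * (t2 - t4)) / ((t1 - t4) * (t2 - t3))

def invariant_stability(invariant, points, sigma=0.01, trials=200, seed=0):
    # Standard deviation of the invariant under Gaussian perturbation of
    # the distinguished points: large values flag unreliable indexes.
    rng = random.Random(seed)
    samples = [invariant(*[p + rng.gauss(0.0, sigma) for p in points])
               for _ in range(trials)]
    return statistics.pstdev(samples)
```

Well-separated points give a stable invariant; when two distinguished points nearly coincide the invariant varies wildly, so such an index would be given little weight during verification.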
Acknowledgements
We are very grateful to both Professor Christopher Longuet-Higgins and Margaret Fleck
for suggesting the jigsaw puzzle problem. We are also grateful to Mike Brady and Andrew
Blake for useful discussions.
References
1. Asada, H. and Brady, M. "The Curvature Primal Sketch," PAMI-8, No. 1, p.2-14, January
1986.
2. Ayache, N. and Faugeras, O.D. "HYPER: A New Approach for the Recognition and Posi-
tioning of Two-Dimensional Objects," PAMI-8, No. 1, p.44-54, January 1986.
3. Barrett, E.B., Payton, P.M. and Brill, M.H. "Contributions to the Theory of Projective
Invariants for Curves in Two and Three Dimensions," Proceedings First DARPA-ESPRIT
Workshop on Invariance, p.387-425, March 1991.
4. Beardsley, P.A. "Correction of radial distortion," OUEL internal report 1896/91, 1991.
5. Blake, A. and Marinos, C. "Shape from Texture: Estimation, Isotropy and Moments,"
Artificial Intelligence, Vol. 45, p.232-380, 1990.
6. Blake, A. and Sinclair, D. On the projective normalisation of planar shape, TR OUEL, in
preparation, 1992.
7. Brady, M. and Yuille, A. "An Extremum Principle for Shape from Contour," PAMI-6, No.
3, p.288-301, May 1984.
8. Canny, J.F. "A Computational Approach to Edge Detection," PAMI-8, No. 6, p.679-698,
1986.
9. Carlsson, S. "Projectively Invariant Decomposition of Planar Shapes," Geometric Invari-
ance in Computer Vision, Mundy, J.L. and Zisserman, A., editors, MIT Press, 1992.
10. Clemens, D.T. and Jacobs, D.W. "Model Group Indexing for Recognition," Proceedings
CVPR, p.4-9, 1991, and PAMI-13, No. 10, p.1007-1017, October 1991.
11. Duda, R.O. and Hart P.E. Pattern Classification and Scene Analysis, Wiley, 1973.
12. Forsyth, D.A., Mundy, J.L., Zisserman, A.P. and Brown, C.M. "Projectively Invariant
Representations Using Implicit Algebraic Curves," Proceedings ECCV1, Springer-Verlag,
p.427-436, 1990.
13. Freeman, H. and Garder, L. "Apictorial Jigsaw Puzzles: The Computer Solution of a
Problem in Pattern Recognition," IEEE Transactions on Electronic Computers, Vol. 13, No.
2, p.118-127, April 1964.
14. Huttenlocher, D.P. and Ullman, S. "Object Recognition Using Alignment," Proceedings
ICCV1, p.102-111, 1987.
15. Huttenlocher D.P. "Fast Affine Point Matching: An Output-Sensitive Method," Proceedings
CVPR, p.263-268, 1991.
16. Kanatani, K. and Chou, T-C. "Shape From Texture: General Principle," Artificial Intelli-
gence, Vol. 38, No. 1, p.1-48, February 1989.
17. Kapur, D. and Mundy, J.L. "Fitting Affine Invariant Conics to Curves," Geometric Invari-
ance in Computer Vision, Mundy, J.L. and Zisserman, A., editors, MIT Press, to appear
1992.
18. Lamdan, Y., Schwartz, J.T. and Wolfson, H.J. "Object Recognition by Affine Invariant
Matching," Proceedings CVPR, p.335-344, 1988.
19. Lane, E.P. A Treatise on Projective Differential Geometry, University of Chicago press,
1941.
20. Marinos, C. and Blake, A. "Shape from Texture: the Homogeneity Hypothesis," Proceedings
ICCV3, p.350-354, 1990.
21. Mohr, R. and Morin, L. "Relative Positioning from Geometric Invariants," Proceedings
CVPR, p.139-144, 1991.
22. Mundy, J.L. and Heller, A.J. "The Evolution and Testing of a Model-Based Object Recog-
nition System," Proceedings ICCV3, p.268-282, 1990.
23. Rothwell, C.A., Zisserman, A., Forsyth, D.A. and Mundy, J.L. "Fast Recognition using Alge-
braic Invariants," Geometric Invariance in Computer Vision, Mundy, J.L. and Zisserman,
A., editors, MIT Press, 1992.
24. Semple, J.G. and Kneebone, G.T. Algebraic Projective Geometry, Oxford University Press,
1952.
25. Slama, C.C., editor, Manual of Photogrammetry, 4th edition, American Society of Pho-
togrammetry, Falls Church VA, 1980.
26. Springer, C.E. Geometry and Analysis of Projective Spaces, Freeman, 1964.
27. Van Gool, L. Kempenaers, P. and Oosterlinck, A. "Recognition and Semi-Differential In-
variants," Proceedings CVPR, p.454-460, 1991.
28. Wayner, P.C. "Efficiently Using Invariant Theory for Model-based Matching," Proceedings
CVPR, p.473-478, 1991.
29. Weiss, I. "Projective Invariants of Shapes," Proceedings DARPA Image Understanding
Workshop, p.1125-1134, April 1988.
30. Witkin, A.P., "Recovering Surface Shape and Orientation from Texture" Artificial Intelli-
gence, 17, p.17-45, 1981.
31. Wolfson, H., Schonberg, E., Kalvin, A. and Lamdan, Y. "Solving Jigsaw Puzzles by Com-
puter," Annals of Operations Research, Vol. 12, p.51-64, 1988.
This article was processed using the LaTeX macro package with ECCV92 style
Measuring the Quality of Hypotheses in Model-Based Recognition*
1 Introduction
A number of different model-based techniques have been used to hypothesize instances
of a given model in an image, including searching for possible correspondences of model
and image features (e.g., [4]), the generalized Hough transform (e.g., [1]), alignment of a
model with an image (e.g., [5]), and analysis of the space of model transformation parameters
[2]. While these methods differ substantially, they all measure the quality of a given
hypothesis as a function of the number of geometric features of the model that are
consistent with geometric features extracted from the image. The larger the consistent
set of model and image features, the better an hypothesis is judged to be.
In this paper we analyze some of the most common methods for assessing the quality
of an hypothesis in model-based recognition. We focus on the case in which objects
are modeled as a collection of 'atomic' features (such as points). In other words, each
individual model feature is either paired with an image feature or not (there is no partial
matching of individual features). The three quality measures that we investigate are: (i)
the number of pairs of model and image features that are consistent with an hypothesis,
(ii) the maximum bipartite matching of such features, and (iii) the number of distinct such
features. We find that the first measure, although widely used, often greatly overestimates
the quality of a match. The second method is more expensive to compute, but is also
much more conservative. In practice we find that the third scoring method is the best.
This final method also turns out to be closely related to some recent theoretical work
* This report describes research done in part at the Artificial Intelligence Laboratory of the
Massachusetts Institute of Technology. Support for the laboratory's research is provided in
part by an ONR URI grant under contract N00014-86-K-0685, and in part by DARPA under
Army contract number DACA76-85-C-0010 and under ONR contract N00014-85-K-0124. DPH
is supported at Cornell University in part by NSF grant IRI-9057928 and matching funds from
General Electric and Kodak, and in part by AFOSR under contract AFOSR-91-0328.
on using Hausdorff distances for recognition [6]. Finally, we show that if restrictions are
placed on the spacing of the model features, then the third measure becomes equivalent
to the size of the maximal matching. This provides a more precise meaning to the notion
that matching is more difficult when the features of a model are too close together.
Our results have two important implications for model-based recognition systems:
1. For atomic features (such as points), the quality of an hypothesis should be measured
based on the number of distinct model and image features that are geometrically
consistent with a given hypothesis, not on the number of consistent model and image
feature pairs.
2. Object models should be constructed such that no two features are close together,
where closeness is a function of the degree of sensory uncertainty.
Existing model-based recognition systems, where appropriate, can be modified with min-
imal effort to take advantage of these results. Using the number of distinct features as a
quality measure simply requires keeping track of which features are accounted for by a
given set of feature pairs. Constructing the object models requires that the sensor error
estimates used by the recognition method be available when the models are made.
Fig. 1. The number of consistent feature pairs overestimates the quality of a match: a) a super-
imposed set of model and image features, b) the corresponding bipartite graph.
Thus whenever a group of model features or a group of image features are close
together, the size of a consistent set will count the same model or image feature multiple
times. For example, Figure la shows an alignment of a five point model with an image
for which the corresponding consistent set contains eight feature pairs. The model points
are shown as crosses and the image points are shown as dots. Any model point mi that
lies within e of an image point sj defines a pair of features (mi, sj) that are consistent
with this position of the model in the image. On the other hand, only three of the five
model features are paired with distinct image features. In this case the difference in the
size of the consistent set (eight) and the number of model features accounted for (three)
occurs because a single image feature is paired with more than one model feature and
vice versa. Such situations are not merely of theoretical interest, as they occur frequently
for reasonable ranges of sensory uncertainty and images containing just tens of features.
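The consistent-set construction can be sketched as a direct ε-test on point features; the coordinates in the usage below are an illustrative configuration with the same counts as the discussion, not the actual data of Figure 1.

```python
def consistent_pairs(model_pts, image_pts, eps):
    # All (model, image) pairs whose points lie within eps of each other
    # for the hypothesized position of the model in the image.
    return [(m, s) for m, (mx, my) in model_pts.items()
                   for s, (sx, sy) in image_pts.items()
                   if (mx - sx) ** 2 + (my - sy) ** 2 <= eps ** 2]
```

With two model points near one image point and two model points near clusters of three image points, the consistent set contains eight pairs even though far fewer distinct features are involved.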
3 Bipartite Matching
Maximal bipartite matching can be used to rule out the 'multiple counting' that occurred
in the above example, by requiring that each model feature only be paired with a single
image feature and vice versa. A consistent set of feature pairs, C, defines a bipartite graph
G = (U, V, E) where for each feature pair (mi, sj) ∈ C there is a vertex ui ∈ U corresponding
to the model feature mi, a vertex vj ∈ V corresponding to the image feature sj, and an
edge eij ∈ E connecting ui to vj. Each edge is incident on one vertex of U and one
vertex of V, so the graph is bipartite. For example, the set of consistent feature pairs corre-
sponding to Figure 1a is {(A, 1), (B, 1), (C, 2), (C, 3), (C, 4), (D, 5), (D, 6), (D, 7)} (where
model features are denoted by letters and image features by numbers). This set defines
the bipartite graph shown in Figure 1b.
A matching in a bipartite graph, G, is a subset of the edges, F ⊆ E, such that each
vertex of G has at most one edge incident on it. The size of F is the number of edges
that it contains, |F|. For instance, a trivial matching is a single edge of a bipartite graph.
A maximal matching is one such that there are no larger matchings in the graph. For
our problem, a maximal matching corresponds to the largest set of consistent model
and image feature pairs that can be formed without using any model or image feature
more than once. For the graph in Figure 1b the set {(A, 1), (C, 2), (D, 5)} is a maximal
matching. Note that in general there can be more than one maximal matching.
Methods for finding a maximal matching in a bipartite graph require O(|V|^(1/2) · |E|)
time, or O(n^2.5) where n = |V|. Methods that are straightforward to implement require
O(min(|V|, |U|) · |E|) time [7]. In contrast, simply counting the number of pairs in
a consistent set (i.e., the number of edges in E) only requires time O(|E|). As model-
based recognition methods already require substantial amounts of running time, we are
concerned with how to estimate the size of the maximal matching, |F|, in O(|E|) time.
From the bipartite graph representation of a geometrically consistent set, we can
identify several possible measures of the quality of the corresponding hypothesis:
1. The number of feature pairs in the consistent set (i.e., |C| = |E|).
2. The number of distinct features accounted for by the consistent set (e.g., |U|, |V|, or
their minimum, maximum, or sum).
3. The size of a maximal matching in the bipartite graph defined by C (i.e., |F| where
F ⊆ E is a maximal matching).
Given the way we have constructed a bipartite graph from C, the first of these three
quantities is the largest, and the last is the smallest, that is, |E| ≥ min(|U|, |V|) ≥ |F|.
Clearly the first inequality holds because the number of edges in the graph must be at
least as large as the number of vertices of each type. The second inequality holds because
in a matching each vertex has at most one edge incident on it.
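The three measures, and the inequality between them, can be checked directly; the matching below is found with a simple augmenting-path (Kuhn) algorithm, a standard method rather than one prescribed by the paper.

```python
def quality_measures(pairs):
    # Returns (|E|, min(|U|,|V|), |F|) for a consistent set of feature pairs.
    U = {m for m, _ in pairs}
    V = {s for _, s in pairs}
    adj = {}
    for m, s in pairs:
        adj.setdefault(m, []).append(s)

    match = {}  # image feature -> model feature in the current matching

    def augment(m, seen):
        # Try to give model feature m an image feature, re-routing earlier
        # assignments along an augmenting path if necessary.
        for s in adj.get(m, []):
            if s not in seen:
                seen.add(s)
                if s not in match or augment(match[s], seen):
                    match[s] = m
                    return True
        return False

    F = sum(augment(m, set()) for m in U)
    return len(pairs), min(len(U), len(V)), F
```

For the consistent set of Figure 1a the three measures are 8, 4 and 3, confirming that the pair count is the largest and the matching size the smallest.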
Whereas most recognition systems use the first of these measures, the last is the most
conservative measure. Some recognition systems do use bipartite matchings, but these
are quite expensive to compute compared with counting the number of consistent feature
pairs. Thus we propose using the number of distinct model and image features,
as measured by the quantity min(|U|, |V|). This is cheap to compute, and measures the
minimum number of distinct model or image features accounted for.
The measure min(|U|, |V|) only overestimates the size of a maximal matching, |F|,
when there is branching on both sides of the bipartite graph. This corresponds to a
situation in which there are several neighboring model features that match a single image
feature, and vice versa. If a bipartite graph is guaranteed to only have branching at the
vertices on one side of the graph, then situations such as this cannot occur. In that case,
it is trivial to compute the size of a maximal matching - simply count the number of
vertices on the side of the graph where the branching occurs. For example if the branching
occurs only for vertices in U, then each edge e E E is incident on a unique vertex v E V
(otherwise there would be branching for some vertex in V). Thus the number of vertices
in U determines the size of a maximal matching.
While we cannot in general control the spacing of features in an image, we can do so
for the features in a model. More formally, in order for no two model features to match
a given image feature, it must be that for each pair of model features, mi, mk ∈ M,
the volumes produced by intersection with the same sensor feature sj are disjoint, that
is, ∀ mi, mk ∈ M : V(mi, sj) ∩ V(mk, sj) = ∅. Another way to view this is that no two model
features can be close enough together that when mapped into the image coordinate
frame they overlap the uncertainty region around the same image feature. For a rigid-
body transformation this can be accomplished by surrounding each model feature with
a 'buffer area' based on the positional uncertainty value e. As long as no two buffer areas
overlap, no pair of model features can match the same image feature.
As an example, consider the case of points in the plane, where the sensory uncertainty
region is a circle of radius e. If each model point is surrounded by a circle of radius e, and
the model points are placed such that no two circles intersect, then no two model points
can match the same image point. Having done this, the number of distinct features is
equal to the size of the maximal bipartite matching, but is much cheaper to compute. In
practice, we have found this method to be much better than simply counting the number
of feature pairs.
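For point features in the plane, the spacing condition reduces to a pairwise distance test; this sketch uses the 2ε threshold that follows from the non-intersecting ε-circles described above.

```python
def well_spaced(model_pts, eps):
    # True if every pair of model points is more than 2*eps apart, so their
    # eps-radius uncertainty circles cannot intersect and no two model
    # points can match the same image point.
    pts = list(model_pts)
    return all((ax - bx) ** 2 + (ay - by) ** 2 > (2 * eps) ** 2
               for i, (ax, ay) in enumerate(pts)
               for bx, by in pts[i + 1:])
```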
In summary, a good estimate of the size of the maximal bipartite matching
is provided by the number of distinct model and image features that are consistent with
a given hypothesis. Moreover, if models are constructed such that no two model features
are close together (as a function of the degree of sensory uncertainty) then the number of
distinct features is the same as the size of the maximal bipartite matching. This provides
a formal meaning to the intuition that matching is harder when the model features are
'too close together to be resolved by the sensor'.
References
1. D. H. Ballard, 1981. Generalizing the Hough transform to detect arbitrary shapes. Pattern
Recognition 13:111.
2. Cass, T.A., 1990, "Feature Matching for Object Localization in the Presence of Uncertainty",
MIT Artificial Intelligence Laboratory Memo no. 1133.
3. Grimson, W.E.L. and D.P. Huttenlocher, 1990, "On the Verification of Hypothesized Matches
in Model-Based Recognition", Proceedings of the First European Conference on Computer
Vision, Lecture Notes in Computer Science No. 427, pp. 489-498, Springer-Verlag.
4. Grimson, W.E.L. & T. Lozano-Pérez, 1987, "Localizing overlapping parts by searching the
interpretation tree," IEEE Trans. PAMI 9(4), pp. 469-482.
5. Huttenlocher, D.P. and S. Ullman, 1990, "Recognizing Solid Objects by Alignment with an
Image", Intl. Journal of Computer Vision, vol. 5, no. 2, pp. 195-212.
6. Huttenlocher, D.P. and Kedem, K., 1990, "Efficiently Computing the Hausdorff Distance for
Point Sets under Translation", Proceedings of the Sixth ACM Symposium on Computational Ge-
ometry, pp. 340-349.
7. Papadimitriou, C.H. and K. Steiglitz, 1982, Combinatorial Optimization: Algorithms and
Complexity, Prentice-Hall.
Using Automatically Constructed View-Independent
Relational Model in 3D Object Recognition
S. Zhang, G. D. Sullivan and K. D. Baker
Intelligent Systems Group
University of Reading, RG6 2AY, UK
and ∪i ei = X. If |ei| = k, then ei is called an order-k hyperedge of the hypergraph. In particular,
ei is called an edge if |ei| = 2. In this work we consider only edges and order-3 hyperedges.
in which sik represents the projection of model feature ok in sample i. Each sik is either a line
segment in the 2D image plane, represented by its endpoint coordinates [[x1 y1] [x2 y2]], or an
empty set if ok is occluded from the given viewpoint.
2.2. Building Nodes and Node Attributes of the VIRM
The model primitives are 1-D line segments which provide only very poor constraints for object
recognition. These 1-D line segments are therefore grouped into 2-D feature complexes which
form invariant patterns, for example, a planar quadrilateral in the 3D world reliably projects to a
2-D quadrilateral in the image. The 2-D complexes can therefore be used as "focus features"
(Bolles & Horaud [2]) providing a starting point for searching for consistent cliques.
A subset of model features is grouped into a 2D complex if they satisfy the following:
1. For a given viewpoint, if any one feature is visible, then all are visible,
2. The features form connected sets on the 3D model and hence the image,
3. They conform to a known class of shapes (quadrilateral or U-shape curve).
For example, the windows of the car satisfy the above conditions, but the bonnet of the car does
not because the line at the bottom of the windscreen may be occluded in views from the rear. For
the hatchback car, this process groups all 6 windows and the 4 wheel arches into 2D complexes,
represented as quadrilaterals and U-shape curves respectively. Complexes of model features are
used in the same way as single model features, and in the following discussion, we use O2d to
express the set of 2-D complexes and O1d to represent the remaining (1-D) model features.
Fig.3. Definition of parallel ratio
is within both segments (Fig. 3(b)), the parallel ratio is defined as zero. Otherwise (e.g. Fig. 3(c)),
the parallel ratio is defined as

    p(ab, cd) = (min{la, lb, lc, ld} / max{la, lb, lc, ld}) · cosθ

in which la is the distance between a and its orthogonal projection onto cd, etc., and θ is the angle
between the two line segments. The parallel ratio between two line segments lies in [0, 1], with 1
indicating exact parallelism. If the mean value of p is high (>0.75) and the standard deviation is
small (<0.25) then the parallel relation is accepted.
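The parallel ratio can be computed directly from the definition above; this is a sketch, and the special case of crossing segments (defined as zero) is omitted.

```python
import math

def dist_to_line(p, a, b):
    # Distance from point p to the infinite line through a and b.
    ux, uy = b[0] - a[0], b[1] - a[1]
    return abs(ux * (p[1] - a[1]) - uy * (p[0] - a[0])) / math.hypot(ux, uy)

def parallel_ratio(a, b, c, d):
    # p(ab, cd) = (min / max of the four point-to-line distances) * cos(theta)
    ls = [dist_to_line(a, c, d), dist_to_line(b, c, d),
          dist_to_line(c, a, b), dist_to_line(d, a, b)]
    theta = math.atan2(b[1] - a[1], b[0] - a[0]) - \
            math.atan2(d[1] - c[1], d[0] - c[0])
    cos_t = abs(math.cos(theta))
    if max(ls) == 0.0:          # collinear segments: treat as fully parallel
        return cos_t
    return (min(ls) / max(ls)) * cos_t
```

Two parallel segments at a constant offset have four equal distances and θ = 0, giving a ratio of exactly 1.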
Fig.4. Parallel ratio between feature pairs
If one of the features is two dimensional (e.g. a quadrilateral or U-shape curve) and the other
is a line segment, we record the parallel ratio between the line segment and each of the lines of
the 2-D feature. If both are two dimensional, we record the parallel ratio of all pairs.
The pdfs of the parallel ratios (δ(p)) for 4 pairs of features are shown in Fig. 4: (a) nearside
roof and nearside sill, (b) offside roof and offside sill, (c) nearside sill and the bottom line of the
nearside front window, (d) offside upright (windscreen pillar) and nearside sill. The solid curves
show the distribution function δ(p) of the variable. The dotted curves show the cumulative
distributions. The first three show a good parallel ratio for the pair of model features, with most
values in the region of [0.8, 1]. Therefore the constraints that these feature pairs are parallel are
accepted and are coded into the model. In Fig. 4(d), the parallel ratio is evenly distributed, and
no parallel constraint exists between the pair.
2.4.2. Colinear Ratio
Colinearity between object features is preserved by the perspective transformation. Given two
line segments ab and cd (assuming that ab is longer), as shown in Fig.5, we construct a minimal
rectangle whose long axis is parallel to ab and encloses ab and cd. Let w be the length of the side
of the rectangle parallel to ab, h be the length of the perpendicular side of the rectangle, and 0 be
the angle between the two line segments. The colinear ratio is defined as

    c(ab, cd) = 0                   if h > w
    c(ab, cd) = (1 − h/w) · cosθ    if h ≤ w
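A sketch of the colinear ratio follows; the piecewise form used here, c = 0 for h > w and (1 − h/w)cosθ otherwise, is a reconstruction from the surrounding text rather than a verbatim transcription of the original formula.

```python
import math

def colinear_ratio(ab, cd):
    # Enclose both segments in the minimal rectangle whose long axis is
    # parallel to ab; w = side parallel to ab, h = perpendicular side.
    (ax, ay), (bx, by) = ab
    (cx, cy), (dx, dy) = cd
    L = math.hypot(bx - ax, by - ay)
    ux, uy = (bx - ax) / L, (by - ay) / L          # unit vector along ab
    pts = [(ax, ay), (bx, by), (cx, cy), (dx, dy)]
    s = [(px - ax) * ux + (py - ay) * uy for px, py in pts]   # along ab
    t = [(py - ay) * ux - (px - ax) * uy for px, py in pts]   # across ab
    w = max(s) - min(s)
    h = max(t) - min(t)
    theta = math.atan2(dy - cy, dx - cx) - math.atan2(by - ay, bx - ax)
    return 0.0 if h > w else (1 - h / w) * abs(math.cos(theta))
```

Exactly collinear segments give h = 0 and θ = 0, hence a ratio of 1; a segment lying well off the axis of ab gives h > w and a ratio of 0.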
Fig.6. Variation of a quadrilateral viewed from different directions
Relative size is only used in feature pairs including at least one 2-D complex. Under
perspective transformations its longest observed edge (the pseudo-height of the feature) in the
image is expected to remain stable to some degree. The relative size of two features (r) is defined
as the ratio of the distance between the features in the image to the pseudo-height of the 2D
feature (if both features are 2D, the larger pseudo-height is used). The distance between two
features is defined as the shortest distance from the vertices in one feature to the vertices in the
other. If both features are 2D, their pseudo-heights are also compared and stored as a separate
constraint. These ratios vary considerably, and therefore only provide weak bounds for the
distributions. The acceptance criterion is similar to that of parallelism.
2.4.5. Procedural Constraints
As a result of the statistical analysis, each edge is associated with a set of constraints
Cij = {pij, cij, lij, rij}
representing relations between the corresponding pair of features {oi, oj}. If a relation is not
applicable to this pair, the entry is open. We combine the hypergraph and the constraints to obtain
the VIRM, with the selected constraints being compiled as procedures associated with edges of
the hypergraph. The compiled constraints are Boolean-valued procedures with two inputs (the
polyline representations of the two image features concerned). All of the constraints are defined
in terms of the individual line segments constituting the corresponding features. Vertices in both
2D model features and 2D image features are labelled in a counter-clockwise order beginning
with the bottom left vertex.
3. Application of the VIRM in the Generation of Pose Hypotheses
The VIRM is used to generate hypotheses of the class and pose of the object of interest. Fig.7
illustrates the different stages of the hypothesis generation process. The Canny [5] edge detector
is first applied to the original image to get edgelets (Fig. 7(b)). These edgelets are grouped into
2D features (Fig. 7(c)). Each image feature is associated with all permissible model features, for
example, quadrilateral image features are associated with any of the windows of the car. The
image features are then matched with the VIRM by a depth-first search, restricted to only those
cases with high co-visibility as recorded in the hypergraph. Typically around 10 consistent
hypotheses are generated which reflect the inherent symmetries of the car (examples are shown
in Fig. 7(d), (g) & (j)).
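The depth-first matching can be sketched as a consistent-labelling search in which the compiled constraints of Section 2.4.5 appear as Boolean procedures attached to pairs of model labels. The feature representation and the constraint below are illustrative, and the sketch labels every image feature, whereas the real system also restricts the search by co-visibility and tolerates unmatched features.

```python
def consistent_labellings(image_feats, candidates, constraints):
    # candidates[f]: permissible model labels for image feature f.
    # constraints[(mi, mj)]: Boolean procedures applied to the image
    # features labelled mi and mj, in that argument order.
    feats = list(image_feats)
    results = []

    def ok(label, feat, assigned):
        for other, l2 in assigned.items():
            for check in constraints.get((label, l2), []):
                if not check(image_feats[feat], image_feats[other]):
                    return False
            for check in constraints.get((l2, label), []):
                if not check(image_feats[other], image_feats[feat]):
                    return False
        return True

    def extend(i, assigned):
        if i == len(feats):
            results.append(dict(assigned))
            return
        f = feats[i]
        for label in candidates[f]:
            if label not in assigned.values() and ok(label, f, assigned):
                assigned[f] = label
                extend(i + 1, assigned)
                del assigned[f]

    extend(0, {})
    return results
```

Because every partial labelling is checked against all pairwise constraints before being extended, inconsistent branches of the interpretation tree are pruned early.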
The hypotheses are used to estimate the pose using a quantitative method described by Du
[12] in which two labelled non-parallel, non-colinear, co-planar lines are used to estimate the
position and orientation of the camera, by means of pre-compiled look-up tables. Each of the 10
extended features groups identified by the VIRM gives rise to a number of pose hypotheses,
based on the labelled 2-D features in the extended group. Fig. 7 shows three labelled feature
groups ((d), (g) & (j)), each containing two 2-D features, so that each gives two pose hypotheses.
Where these pose hypotheses are not consistent with each other, the labelling is rejected (Fig. 7
Fig. 7. (d) Correct labellings, (e) pose based on ofw in (d), (f) pose based on orw in (d);
(g) incorrect but consistent labellings, (h) pose based on nfw in (g), (i) pose based on nrw in (g);
(j) inconsistent labellings, (k) pose based on ws in (j), (l) pose based on ofw in (j)
(k), (l)). In the case shown here only two of the 10 hypotheses are retained by this requirement (Fig. 7
(d) & (g)), giving two pairs of very similar possible poses. These accepted hypotheses must be
subjected to further evaluation using view specific methods which are not discussed here. Details
of the pose verification process can be found in Brisdon [4], and Worrall [10].
Fig. 8 shows further examples of hypotheses superimposed on the images, which in these
examples have been selected manually from the few candidates. It should be noted that, as
expected, the pose recovered is only approximate. The model can be used in the recognition of
occluded objects (Fig.8(f)), as well as scenes containing multiple objects (Fig.8(c)). Table 1
shows the number of hypotheses generated against the size of the combinatorial search space. To
make the comparison realistic, the search space quoted represents the number of possible triples
of feature labellings, containing at least one 2-D feature - these could form a comparable basis
for viewpoint inversion and subsequent view-dependent reasoning. It can be seen that the VIRM
is very effective in identifying the very few labellings in the interpretation tree which are
mutually consistent with a single view of the vehicle model.
(a) Selected from 12 poses (b) Selected from 10 poses (c) Selected from 19 poses
(d) Selected from 14 poses (e) Selected from 7 poses (f) Selected from 7 poses
Fig.8. Correct instances superimposed on a representative set of original images
Table 1: Number of hypotheses against the search space

    Image | Number of Hypotheses | Number of Quadrilaterals | Number of U-shape Curves | Number of Line Segments | Size of Search Space
4. Results and Discussion
We have built VIRMs for the three vehicles shown in Fig. 1. Table 2 summarises the result of the
model building process. The results are obtained from 500 samples of the view with the camera
upright and its position limited to within 0° and 60° of the model's ground plane. In representing
the fastback car, an extra shape feature, a triangle, was introduced. The data change slightly each
time the model is built because viewpoints are selected randomly, but we have found that this
change is small and has no appreciable influence in the later object recognition process. The
number of constraints generated in the model building process depends on the thresholds selected
for the acceptance criteria. Such thresholds are inherent in any recognition problem, and must be
determined by experience. However, in our experiments the effects of the thresholds on final
recognition performance appear not to be dramatic, mainly affecting the time used in recognition.
The time used to construct the relational model is high, since all the relations among the
component parts of the object need to be assessed. At the present state of development the code
runs in pop11 and we have made no attempt to make the code efficient. Model generation takes
about one and a half hours on a Sun Sparc 2 with 24 MB memory. However, storage of the
eventual VIRM is very efficient. We need an m by m matrix to represent the pairwise co-visibility
of the object and a similar m2d by m2d by m matrix (m2d is the number of 2-D complexes) to store
co-visibility of feature clusters, and a set of procedures (typically 100 because a procedure may
include more than one geometrical constraint) to represent geometrical relations among the
component parts of the object.
Table 2: VIRMs for different types of vehicles
    Number of                   Hatchback   Fastback   Estate
    1-D Features                    22         26        28
    2-D Features                    10         12        12
    Pairwise Co-visibility         107        134       152
    Triple Co-visibility           300        424       468
    Parallel Constraints            43         58        67
    Colinear Constraints            21         25        31
    Side Relation Constraints       72         85        85
    Relative Size Constraints       35         42        49
5. Summary
A method has been described for creating a view-independent relational model of an object, used
in object recognition to aggregate features related to a pose hypothesis. A match is accepted as a
hypothesis, and is therefore further evaluated, only when its relational support passes a certain
threshold. The model is created off-line and its use in object recognition requires no non-linear
calculations.
References
1. Berge, C., Graphs and Hypergraphs, New York: North-Holland, 1973.
2. Bolles, R. C., Horaud, P., "3DPO: A Three Dimensional Part Orientation System", The International Journal of
Robotics Research, Vol. 5, No. 3, Fall 1986.
3. Bray, A. J., "Recognising and Tracking Polyhedral Objects", Ph.D. Dissertation, University of Sussex, Oct., 1990.
4. Brisdon, K., Sullivan, G. D., Baker, K. D., "Feature Aggregation in Iconic Model Evaluation", Proc. of AVC-88,
Manchester, Sept., 1988.
5. Canny, J. F., "A Computational Approach to Edge Detection", IEEE Transactions on Pattern Analysis and Machine
Intelligence, PAMI-8, No. 6, pp. 679-698, Nov., 1986.
6. Gigus, Z., Malik, J., "Computing the Aspect Graph for Line Drawings of Polyhedral Objects", IEEE Transactions
on Pattern Analysis and Machine Intelligence, PAMI-12, No. 2, pp. 113-122, Feb., 1990.
7. Goad, C., "Special Purpose Automatic Programming for 3D Model-Based Vision", Proceedings Image
Understanding Workshop, Virginia, USA, pp. 94-104, 1983.
8. Koenderink, J. J., van Doorn, A. J., "The Internal Representation of Solid Shape with Respect to Vision", Biological
Cybernetics, Vol. 32, pp. 211-216, 1979.
9. Sullivan, G. D., "Alvey MMI-007 Vehicle Exemplar: Performance & Limitations", Proc. AVC-87, Cambridge,
England, Sep., 1987.
10. Worrall, A. D., Baker, K. D. and Sullivan, G. D., "Model Based Perspective Inversion", Proc. AVC-88, Manchester,
Aug., 1988.
11. Zhang, S., Du, L., Sullivan, G. D. and Baker, K. D., "Model-Based 3D Grouping by Using 2D Cues", Proc.
BMVC90, Oxford, Sept., 1990.
12. Zhang, S., Sullivan, G. D., and Baker, K. D., "Relational Model Construction and 3D Object Recognition", Proc.
BMVC91, Glasgow, Sept., 1991.
Learning to Recognize Faces from Examples
Shimon Edelman,1 Daniel Reisfeld,2 Yechezkel Yeshurun2
1 Dept. of Applied Mathematics and Computer Science, The Weizmann Institute of Science,
Rehovot 76100, Israel (edelman@wisdom.weizmann.ac.il)
2 Dept. of Computer Science, Tel Aviv University, Tel Aviv 69978, Israel (reisfeld@math.tau.ac.il)
Classifying the image of a face as a picture of a given individual is probably the most
difficult recognition task that humans carry out on a routine basis with nearly perfect
success rate. It is not too surprising, therefore, that advances in face recognition by
computer fail to match recent progress in the recognition of general 3D objects. The
major problem in face recognition appears to be the design of a representation that,
on one hand, would be sufficiently informative to allow discrimination among inputs
that are all basically similar to each other, and, on the other hand, would be efficiently
computable. One way around this problem is to learn the required representations, e.g.,
by examining and remembering several instances of the input.
How can such a simple scheme generalize recognition to novel instances? In a standard
formulation of pattern recognition, a characteristic function is defined over a multidimen-
sional space, so that its value is close to 1 over the region corresponding to instances of
the pattern to be recognized, and is close to 0 elsewhere [2]. If the characteristic function
is smooth, recognition may be generalized to novel patterns of the same class by interpo-
lating the characteristic function, e.g., using splines. An efficient scheme for interpolating
(or approximating) smooth functions was proposed recently under the name of HyperBF
networks [9,6]. Within the HyperBF scheme, a multivariate function is expanded in terms
of basis functions, with parameter values that are learned from the data. For a scalar-valued
function, the expansion has the form $f(\mathbf{x}) = \sum_{\alpha=1}^{n} c_\alpha G(\|\mathbf{x} - \mathbf{t}_\alpha\|^2)$, where the
parameters $\mathbf{t}_\alpha$ that correspond to the centers of the basis functions and the coefficients
$c_\alpha$ are unknown, and are in general much fewer than the data points ($n < N$). The parameters
$c, \mathbf{t}$ are searched for during learning by minimizing the error functional defined
as $H[f] = H_{c,\mathbf{t}} = \sum_{i=1}^{N} (\Delta_i)^2$, where $\Delta_i = y_i - f(\mathbf{x}_i) = y_i - \sum_{\alpha=1}^{n} c_\alpha G(\|\mathbf{x}_i - \mathbf{t}_\alpha\|^2)$. If
the centers $\mathbf{t}_\alpha$ are fixed (e.g., are a subset of the training examples), the coefficients $c_\alpha$
can be found by pseudo-inverting a matrix composed of center responses to the training
vectors [9] (other, iterative, methods such as gradient descent or stochastic search can
be used for the minimization of $H$). HyperBF interpolation has been previously applied
with success to 3D object recognition [7,3,1].
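As an illustration, the fixed-centers variant of this scheme can be sketched in a few lines. This is a hypothetical toy example, not the authors' implementation; the Gaussian width sigma and the data are invented for the sketch.

```python
import numpy as np

def train_rbf(X, y, centers, sigma=1.0):
    """Fit f(x) = sum_a c_a * G(||x - t_a||^2) with fixed centers t_a by
    pseudo-inverting the matrix of center responses to the training set."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    G = np.exp(-d2 / (2.0 * sigma ** 2))      # Gaussian basis responses
    return np.linalg.pinv(G) @ y              # least-squares coefficients c

def rbf_predict(x, centers, c, sigma=1.0):
    d2 = ((x[None, :] - centers) ** 2).sum(axis=1)
    return float(np.exp(-d2 / (2.0 * sigma ** 2)) @ c)

# Toy usage: the centers here are the training examples themselves
# (in general only a subset is used, so that n < N).
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 0.0, 0.0, 1.0])
centers = X.copy()
c = train_rbf(X, y, centers)
```

With the centers fixed, training reduces to a linear least-squares problem, which is exactly the pseudo-inverse route mentioned above.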
2 Learning Face Recognition
2.1 Preprocessing
Three-dimensional objects change their appearance when viewed from different direc-
tions and when the illumination conditions vary. We used alignment [13] to remove the
variability in the input images due to changing viewpoint. Our program starts with the
identification of anchor points: image features that are both relatively viewpoint-invariant
and well-localized. Good candidates for such features in face images are the eyes and the
mouth. The input image is then subjected to a 2D affine transformation that normalizes
its shape and size, so that the two eyes and the mouth are situated at fixed locations. The
parameters of the transformation are computed from the desired and the actual locations
of the anchor points in the image. We remark that the central assumption behind the
choice of 2D affine transform as the normalizing operation is that faces are, to a first
approximation, two-dimensional.
Our method of detecting the eyes and the mouth in face images is based on the
observation that the prominent facial features are highly symmetrical, compared to the
rest of the face [10]. We proposed in [11] a low-level operator that captures the intuitive
notion of such symmetries and produces a "symmetry map" of the image. This map is
then subjected to clustering. Geometrical relationships among the clusters, together with
the location of the midline (as defined by a cross-correlation between two halves of that
portion of the image that presumably contains a face), allow us to infer the position of
the face, and of the eyes and the mouth in it. These positions are then used as anchor
points for affine normalization.
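The normalization step can be illustrated as follows: three point correspondences (the two eyes and the mouth, mapped to fixed canonical locations) determine the six parameters of a 2D affine transform. The anchor coordinates below are invented for the example.

```python
import numpy as np

def affine_from_anchors(src, dst):
    """Six-parameter 2D affine transform mapping the three detected
    anchor points (eyes, mouth) onto their canonical positions.
    src, dst: 3x2 arrays of (x, y) points."""
    # Each correspondence contributes two linear equations in (a,b,tx,c,d,ty).
    A, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        A.append([x, y, 1, 0, 0, 0]); b.append(u)
        A.append([0, 0, 0, x, y, 1]); b.append(v)
    p = np.linalg.solve(np.array(A, dtype=float), np.array(b, dtype=float))
    return p.reshape(2, 3)               # [[a, b, tx], [c, d, ty]]

# Hypothetical detected anchors -> fixed canonical locations.
detected  = np.array([[110.0, 95.0], [170.0, 92.0], [142.0, 160.0]])
canonical = np.array([[40.0, 50.0], [88.0, 50.0], [64.0, 110.0]])
M = affine_from_anchors(detected, canonical)
```

Applying `M` (in homogeneous form) to the detected anchor points maps them exactly onto the canonical locations; any non-collinear triple of anchors determines the transform uniquely.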
After normalization, the input is a standard-size array of (8-bit) pixels, in which the
value of each pixel is determined both by the geometry of the face and by the direction
of the illumination. We next reduce the influence of illumination, by the usual method
of taking a directional derivative of the intensity distribution at each pixel. The input
is then subjected to dimensionality reduction, to increase both the efficiency and the
effectiveness of the HyperBF classifier.
A well-known statistical method for dimensionality reduction, principal component
analysis, has been applied recently to face recognition with some success [5,12]. In the
present work we chose to explore a considerably simpler method, based on the neuro-
biological notion of receptive field (RF), defined as that portion of the retinal visual
field whose stimulation affects the response of the neuron. Assuming that the neuron
performs spatial integration over its RF, its output is a (possibly nonlinear) function
of $\int\!\!\int_{RF} K(x, y)\, I(x, y)\, dx\, dy$, where $I(x, y)$ is the input, and $K(x, y)$ is a weighting kernel
that we took to be Gaussian (cf. [8]). As noted in [4], pattern classification requires that
dimensionality reduction facilitate discrimination between classes, rather than faithful
representation of the data. Indeed, the vector of RF activities proved to be adequate for
representing face images for recognition, although it would be impossible to recover from
it the original structure of the image.
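A minimal sketch of this reduction, assuming Gaussian kernels placed on a regular grid of hypothetical RF centres (the grid layout, image size and kernel width are invented):

```python
import numpy as np

def rf_activities(image, centers, sigma):
    """Reduce an image to a vector of receptive-field activities: each RF
    integrates K(x, y) * I(x, y) over the image, with K a Gaussian
    centred at the RF location."""
    h, w = image.shape
    ys, xs = np.mgrid[0:h, 0:w]
    acts = []
    for (cy, cx) in centers:
        K = np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2.0 * sigma ** 2))
        acts.append(float((K * image).sum()))
    return np.array(acts)

# Hypothetical layout: a coarse 4x4 grid of RFs over a 32x32 image.
img = np.random.default_rng(0).random((32, 32))
grid = [(y, x) for y in range(4, 32, 8) for x in range(4, 32, 8)]
v = rf_activities(img, grid, sigma=4.0)
```

The resulting 16-dimensional activity vector preserves coarse spatial structure while discarding the information needed to reconstruct the image, in line with the discussion above.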
A different recognizer was created for each person, and was trained to output 1 for the
images of that person in the training set.
The performance of the individual recognizers was assessed by computing a 16 x 16
confusion table, in which the entries along the diagonal signified mean miss rates and
the off-diagonal entries signified mean false alarm rates. The table (see Figure 1, bottom) was
computed row by row, as follows. First, the recognizer for the person whose name appears at
the head of the row was trained. Second, the recognition threshold was set to the mean
output of the recognizer over the training set less two standard deviations. Third, the
performance of the recognizer on the test images of the same person was computed and
the miss rate entered on the diagonal of the table. The above choice of threshold resulted
in a mean miss rate of about 10%. Finally, the false alarm rates for the recognizer on
the images of the other 15 persons were computed and entered under the appropriate
columns of the table.
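The threshold rule used above (mean training output less two standard deviations) is simple to state in code; the recognizer outputs below are invented for illustration.

```python
import numpy as np

def recognition_threshold(train_outputs, k=2.0):
    """Per-person threshold: mean recognizer output over the training
    set, less k standard deviations (k = 2 in the experiment)."""
    out = np.asarray(train_outputs, dtype=float)
    return float(out.mean() - k * out.std())

# Hypothetical recognizer outputs on one person's training images.
outs = [0.95, 0.90, 0.97, 0.88]
thr = recognition_threshold(outs)

def accept(response):
    """A test image is attributed to the person iff the recognizer's
    output reaches that person's threshold."""
    return response >= thr
```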
Our second experiment used no thresholds. Instead, recognition was declared for that
person whose recognizer was the most active among the sixteen. The performance of this
winner-take-all scheme is shown in Figure 2 (left).
An examination of the confusion table reveals that some of the individuals tended to
be confused with almost any other person in the database. To take advantage of this
"ensemble phenomenon", we trained another HyperBF module to accept vectors of in-
dividual recognizer activities and to produce vectors of the same length in which the
value corresponding to the activity of the correct recognizer was 1, and all other values
were 0 (see Figure 1, right top). The training set for the second-stage HyperBF module
was obtained by pooling the training sets of all 16 first-stage recognizers. The outcome
of the recognition of a test image was determined by finding the coordinate in the output
vector whose value was the closest to 1. The performance of the two-stage scheme was
considerably better than that of the individual recognizer stage alone (9% error rate,
compared to 22%), demonstrating the importance of ensemble knowledge for recognition
(Figure 2, right).
3 Summary
The approach to face recognition described in this paper was made possible by recent
advances in model-based object recognition [13], in automatic detection of spatial features
[10,11], and in applications of learning and of function approximation to recognition and
other visual functions [7,3,8]. The architecture of our system (in particular, its reliance on
receptive fields for dimensionality reduction and for classification) has been inspired by
the realization that receptive fields are the basic computational mechanism in biological
vision. The system's performance, which at present stands at about 5-9% generalization
error rate under changes of orientation, size and lighting, compares favorably with the
state of the art in face recognition [12]. These results have the potential of contributing to
the evaluation of a recently proposed theory of brain function [6], and of making practical
impact in machine vision.
[Figure 1 appears here: a diagram of the two-stage scheme (input image, receptive fields, dimensionality reduction, individual classification by RBFs, ensemble-based classification, output) and, below it, the confusion table of the first stage; see the caption below.]
Fig. 1. Left top: a face image from the database we used, courtesy of Turk and Pentland [12],
before preprocessing. Left middle: a HyperBF network. Basis function centers $t_\alpha$ (points in the
multidimensional input space) are prototypes for which the desired response is known. The
output of the network is a linear superposition of the activities of all the basis function units.
In the limit case, when the bases are delta functions, the network becomes equivalent to a
look-up table holding the examples. Right top: the entire two-stage recognition scheme (see
text for explanation). Bottom: a confusion table representation of the performance of the first
stage. Entries along the diagonal correspond to "miss" error rates; off-diagonal entries signify
"false-alarm" error rates (zeros omitted for clarity).
Fig. 2. Left: performance of the one-stage recognition scheme. Right: performance of the two-
stage scheme that uses ensemble knowledge.
References
1. R. Brunelli and T. Poggio. HyperBF networks for real object recognition. In Proceedings
IJCAI, pages 1278-1284, Sydney, Australia, 1991.
2. R. O. Duda and P. E. Hart. Pattern classification and scene analysis. Wiley, New York,
1973.
3. S. Edelman and T. Poggio. Bringing the Grandmother back into the picture: a memory-
based view of object recognition. A.I. Memo No. 1181, AI Lab, MIT, 1990. To appear in
Int. J. Pattern Recog. Artif. Intell.
4. N. Intrator, J. I. Gold, H. H. Bülthoff, and S. Edelman. Three-dimensional object recog-
nition using an unsupervised neural network: understanding the distinguishing features.
In D. Touretzky, editor, Neural Information Processing Systems, volume 4. Morgan Kauf-
mann, San Mateo, CA, 1992. To appear.
5. M. Kirby and L. Sirovich. Application of the Karhunen-Loève procedure for characteriza-
tion of human faces. IEEE Transactions on Pattern Analysis and Machine Intelligence,
12(1):103-108, 1990.
6. T. Poggio. A theory of how the brain might work. Cold Spring Harbor Symposia on
Quantitative Biology, LV:899-910, 1990.
7. T. Poggio and S. Edelman. A network that learns to recognize three-dimensional objects.
Nature, 343:263-266, 1990.
8. T. Poggio, M. Fahle, and S. Edelman. Synthesis of visual modules from examples: learning
hyperacuity. A.I. Memo No. 1271, AI Lab, MIT, 1991. To appear in CVGIP B, 1992.
9. T. Poggio and F. Girosi. Regularization algorithms for learning that are equivalent to
multilayer networks. Science, 247:978-982, 1990.
10. D. Reisfeld, H. Wolfson, and Y. Yeshurun. Detection of interest points using symmetry. In
Proceedings of the 3rd International Conference on Computer Vision, pages 62-65, Tokyo,
1990. IEEE, Washington, DC.
11. D. Reisfeld and Y. Yeshurun. Robust detection of facial features by generalized symmetry.
1991. In preparation.
12. M. Turk and A. Pentland. Eigenfaces for recognition. J. of Cognitive Neuroscience, 3:71-
86, 1991.
13. S. Ullman. Aligning pictorial descriptions: an approach to object recognition. Cognition,
32:193-254, 1989.
Face Recognition through Geometrical Features
1 Introduction
The problem of face recognition, one of the most remarkable abilities of human vision,
was considered in the early stages of computer vision and is now undergoing a revival.
Different specific techniques were proposed or reproposed recently. Among those, one may
cite neural nets [9], elastic template matching [5, 23], Karhunen-Loève expansion [20], al-
gebraic moments [11] and isodensity lines [16]. Typically, the relation of these techniques
with standard approaches and their relative performance has not been characterized well
or at all. Even absolute performance has rarely been measured with statistical significance
on meaningful databases. Psychological studies of human face recognition suggest
that virtually every type of available information is used [22]. Broadly speaking we can
distinguish two ways [19] to get a one-to-one correspondence between the stimulus (face
to be recognized) and the stored representation (face in the database):
In order to investigate the first of the above mentioned approaches we have developed a
set of algorithms and tested it on a database of 47 different people.
2 Experimental setup
The database we used for the comparison of the different strategies is composed of 188
images, four for each of 47 people. Of the four pictures available, the first two were
taken in the same session (a time interval of a few minutes) while the other pictures
were taken at intervals of some weeks (2 to 4). The pictures were acquired with a CCD
camera at a resolution of 512 x 512 pixels as frontal views. The subjects were asked
to look into the camera but no particular efforts were made to ensure perfectly frontal
images. The illumination was partially controlled: the same powerful light was used but
the environment where the pictures were acquired was exposed to sunlight through
windows. The pictures were taken randomly during the daytime. The distance of the
subject from the camera was fixed only approximately, so that scale variations of as much
as 30 percent were possible.
As we have mentioned already, the very fact that face recognition is possible even at coarse
resolution, when the single facial features are hardly resolved in detail, implies that the
overall geometrical configuration of the face features is sufficient for discrimination. The
overall configuration can be described by a vector of numerical data representing the
position and size of the main facial features: eyes and eyebrows, nose and mouth. This
information can be supplemented by the shape of the face outline. As put forward by
Kaya and Kobayashi [14] the set of features should satisfy the following requisites:
The first three requirements are satisfied by the set of features we have adopted, while
their information content is characterized by the experiments described later.
The first attempts at automatic recognition of faces by using a vector of geometri-
cal features are probably due to Kanade [13] in 1973. Using a robust feature detector
(built from simple modules used within a backtracking strategy) a set of 16 features was
computed. Analysis of the inter- and intra-class variances revealed some of the parame-
ters to be ineffective, yielding a vector of reduced dimensionality (13). Kanade's system
achieved a peak performance of 75% correct identification on a database of 20 different
people using two images per person, one for reference and one for testing.
The computer procedures we implemented are loosely based on Kanade's work and
will be detailed in the next sections. The database used is, however, more meaningful (in
the sense of being larger) both in the number of classes to be recognized, and in the
number of instances of the same person to be recognized.
face is then stored as a set of distinct(ive) smaller templates [1]. A rather different approach
is based on the technique of elastic templates [6, 5, 23].
3.1 Normalization
One of the most critical points when using a vector of geometrical features is that of proper
scale normalization. The extracted features must be somehow normalized in order to be
independent of position, scale and rotation of the face in the image plane. Translation
dependency can be eliminated once the origin of coordinates is set to a point which can
be detected with good accuracy in each image. The approach we have followed achieves
scale and rotation invariance by setting the interocular distance and the direction of the
eye-to-eye axis. We will describe the steps of the normalization procedure in some detail
since they are themselves of some interest.
The first step in our technique resembles that of Baron [1] and is based on template
matching by means of a normalized cross-correlation coefficient, defined by:
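A sketch of a normalized cross-correlation coefficient in its standard zero-mean form; since the paper's exact expression is not reproduced above, this particular form is an assumption.

```python
import numpy as np

def ncc(patch, template):
    """Normalized cross-correlation coefficient between an image patch
    and a template (zero-mean, unit-norm form).  Ranges over [-1, 1],
    with 1 for a perfect match up to brightness offset and gain."""
    p = patch - patch.mean()
    t = template - template.mean()
    denom = np.sqrt((p ** 2).sum() * (t ** 2).sum())
    return float((p * t).sum() / denom) if denom > 0 else 0.0

# Usage: an identical patch and template correlate perfectly.
patch = np.array([[1.0, 2.0], [3.0, 4.0]])
score = ncc(patch, patch)
```

In a template-matching loop, `ncc` is evaluated at every candidate position and the peaks of the resulting correlation map locate the feature.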
3.2 Feature Extraction
Face recognition, while difficult, presents interesting constraints which can be exploited
in the recovery of facial features. An important set of constraints derives from the fact
that almost every face has two eyes, one nose, one mouth with a very similar layout.
While this may make the task of face classification more difficult, it can ease the task of
feature extraction: average anthropometric measures can be used to focus the search of a
particular facial feature and to validate results obtained through simple image processing
techniques [3, 4].
A very useful technique for the extraction of facial features is that of integral pro-
jections. Let $I(x, y)$ be our image. The vertical integral projection of $I(x, y)$ over the
domain $[x_1, x_2] \times [y_1, y_2]$ is defined as:

$$V(x) = \sum_{y=y_1}^{y_2} I(x, y) \qquad (2)$$
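The projections of Eq. (2) are one-line reductions of the image array; the sketch below also shows how a dark horizontal structure (such as the line between the lips, discussed later) appears as a valley of the horizontal intensity projection. The toy image is invented.

```python
import numpy as np

def vertical_projection(I, x1, x2, y1, y2):
    """V(x) = sum over y in [y1, y2] of I(x, y); I is indexed as I[y, x]."""
    return I[y1:y2 + 1, x1:x2 + 1].sum(axis=0)

def horizontal_projection(I, x1, x2, y1, y2):
    """H(y) = sum over x in [x1, x2] of I(x, y)."""
    return I[y1:y2 + 1, x1:x2 + 1].sum(axis=1)

# A dark horizontal bar shows up as a valley of the horizontal projection.
img = np.ones((9, 9))
img[5, :] = 0.0                       # dark row at y = 5
H = horizontal_projection(img, 0, 8, 0, 8)
row = int(np.argmin(H))               # position of the valley
V = vertical_projection(img, 0, 8, 0, 8)
```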
This technique was successfully used by Takeo Kanade in his pioneering work [13] on
recognition of human faces. Projections can be extremely effective in determining the
position of features provided the window on which they act is suitably located to avoid
misleading interferences. In the original work of Kanade the projection analysis was
performed on a binary picture obtained by applying a Laplacian operator (a discretization
of $\partial_{xx} I + \partial_{yy} I$) to the grey-level picture and by thresholding the result at a proper level.
The use of a Laplacian operator, however, does not provide information on edge (that
is, gradient) directions. We have chosen therefore to perform edge projection analysis by
partitioning the edge map in terms of edge directions. There are two main directions in
our constrained face pictures: horizontal and vertical.⁴
Horizontal gradients are useful to detect the left and right boundaries of face and nose,
while vertical gradients are useful to detect the head top, eyes, nose base and mouth.
Once eyes have been located using template matching, the search for the other features
can take advantage of the knowledge of their average layout.
Mouth and nose are located using similar strategies. The vertical position is guessed
using anthropometric standards. A first, refined estimate of their real position is obtained
4 A pixel is considered to be in the vertical edge map if the magnitude of the vertical component
of the gradient at that pixel is greater than the horizontal one. The gradient is computed using
a Gaussian regularization of the image. Only points where the gradient intensity is above an
automatically selected threshold are considered [21, 3].
Fig. 2. LEFT: Horizontal and vertical nose restriction. RIGHT: Horizontal mouth restriction
looking for peaks of the horizontal projection of the vertical gradient for the nose, and
for valleys of the horizontal projection of the intensity for the mouth (the line between
the lips is the darkest structure in the area, due to its configuration). The peaks (and
valleys) are then rated using their prominence and distance from the expected location
(height and depth are weighted by a gaussian factor). The ones with the highest rating
are taken to be the vertical position of nose and mouth. Having established the vertical
position, search is limited to smaller windows.
The nose is delimited horizontally searching for peaks (in the vertical projection of
horizontal edge map) whose height is above the average value in the searched window.
The nose boundaries are estimated from the leftmost and rightmost peaks. Mouth height
is computed using the same technique but applied to the vertical gradient component.
The use of directional information is quite effective at this stage, cleaning much of the
noise which would otherwise impair the feature extraction process. Mouth width is finally
computed thresholding the vertical projection of the horizontal edge map at the average
value (see Fig. 2).
The eyebrow position and thickness can be found through a similar analysis. The search
is once again limited to a focussed window, just above the eyes, and the eyebrows are
found using the vertical gradient map. Our eyebrow detector looks for pairs of peaks
of gradient intensity with opposite direction. Pairs from one eye are compared to those
of the other one: the most similar pair (in terms of the distance from the eye center and
thickness) is selected as the correct one.
We used a different approach for the detection of the face outline. Again we have
attempted to exploit the natural constraints of faces. As the face outline is essentially el-
liptical, dynamic programming has been used to follow the outline on a gradient intensity
map of an elliptical projection of the face image. The reason for using an elliptical coor-
dinate system is that a typical face outline is approximately represented by a line. The
computation of the cost function to be minimized (deviation from the assumed shape,
an ellipse represented as a line) is simplified, resulting in a serial dynamic problem which
can be efficiently solved [4].
In summary, the resulting set of 22 geometrical features that are extracted automatically
in our system and that are used for recognition (see Fig. 3), is the following:
3.3 Recognition Performance
Detection of the features listed above associates to each face a twenty-two-dimensional
numerical vector. Recognition is then performed with a Nearest Neighbor classifier, with a
suitably defined metric. Our main experiment aims to characterize the performance of the
feature-based technique as a function of the number of classes to be discriminated. Other
experiments try to assess performance when the possibility of rejection is introduced. In
all of the recognition experiments the learning set had an empty intersection with the
testing set.
The first observation is that the vectors of geometrical features extracted by our
system have low stability, i.e. the intra-class variance of the different features is of the
same order of magnitude as the inter-class variance (from three to two times smaller).
This is reflected by the superior performance we have been able to achieve using the
centroid of the available examples (either 1 or 2 or 3) to model the frontal view of each
individual (see Fig. 4).
An important step in the use of metric classification using a Nearest Neighbor classifier
is the choice of the metric which must take into account both the interclass variance and
the reliability of the extracted data. Knowledge of the feature detectors and of the face
configuration allows us to establish, heuristically, different weights (reliabilities) for the
single features. Let $\{x_i\}$ be the feature vector, $\{\sigma_i\}$ be the inter-class dispersion vector
and $\{w_i\}$ the weight (reliability) vector. The distance of two feature vectors $\{x_i\}$, $\{x'_i\}$ is
then expressed as:

$$D_r(\mathbf{x}, \mathbf{x}') = \sum_{i=1}^{n} w_i \left| \frac{x_i - x'_i}{\sigma_i} \right|^r \qquad (4)$$
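A sketch of the resulting classifier, using the weighted metric in the form reconstructed here as Eq. (4); the feature values, weights and dispersions below are invented for illustration.

```python
import numpy as np

def weighted_distance(x, xp, w, sigma, r=1.2):
    """Weighted metric: each feature difference is scaled by its
    inter-class dispersion sigma_i, raised to the exponent r, and
    weighted by a reliability w_i (r = 1.2 gave the best results
    in the experiments described in the text)."""
    x, xp = np.asarray(x, dtype=float), np.asarray(xp, dtype=float)
    return float(np.sum(w * np.abs((x - xp) / sigma) ** r))

def nearest_neighbour(x, models, w, sigma, r=1.2):
    """Return the label of the closest stored model (centroid)."""
    return min(models, key=lambda k: weighted_distance(x, models[k], w, sigma, r))

# Toy 3-feature example with hypothetical dispersions and weights.
w = np.array([1.0, 0.5, 1.0])
sigma = np.array([2.0, 1.0, 3.0])
models = {"A": [0.0, 0.0, 0.0], "B": [4.0, 4.0, 4.0]}
label = nearest_neighbour([0.5, 0.2, 1.0], models, w, sigma)
```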
A useful indication of the robustness of the classification is given by an estimate of the class
separation. This can be obtained using the so-called MIN/MAX ratio [17, 18], hereafter $R_{mM}$,
which is defined as the minimum distance on a wrong correspondence over the distance
from the correct correspondence. The performance of the classifier at different values of
$r$ has also been investigated. The value of $r$ giving the best performance is $r = 1.2$, while
the robustness of the classification decreases with increasing $r$. This result, if generally
true, may be extremely interesting for hardware implementations, since absolute values
are much easier to compute in silicon than squares. The underlying reason for the good
performance of $r$ values close to 1 is probably related to properties of robust statistics
[12].⁵ Once the metric has been set, the dependency of the performance on the number
of classes can be investigated. To obtain these data, a number of recognition experiments
have been conducted on randomly chosen subsets of classes at the different required
cardinalities. The average values over round-robin rotation experiments on the available
sets are reported. The plots in Fig. 4 report both recognition performance and the $R_{mM}$
ratio. As expected, both quantities exhibit a monotonically decreasing trend for increasing
cardinality.
A possible way to enhance the robustness of classification is the introduction of a
rejection threshold. The classifier can then suspend classification if the input is not suf-
ficiently similar to any of the available models. Rejection could trigger the action of
a different classifier or the use of a different recognition strategy (such as voice iden-
tification). Rejection can be introduced, in a metric classifier, by means of a rejection
threshold: if the distance of a given input vector from all of the stored models exceeds
the rejection threshold the vector is rejected. A possible figure of merit of a classifier
with rejection is given by the recognition performance with no errors (vectors are either
correctly recognized or rejected). The average performance of our classifier as a function
of the rejection threshold is given in Fig. 5.⁶
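The rejection mechanism itself is independent of the particular metric; below is a minimal sketch with a plain Euclidean distance standing in for the weighted metric, and with invented models and threshold.

```python
import math

def classify_with_rejection(x, models, threshold):
    """Metric classifier with a rejection threshold: if the distance of
    the input from every stored model exceeds the threshold, reject
    (returns None); otherwise return the nearest model's label."""
    def dist(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    best = min(models, key=lambda k: dist(x, models[k]))
    return best if dist(x, models[best]) <= threshold else None

# Hypothetical two-class example.
models = {"A": (0.0, 0.0), "B": (5.0, 5.0)}
accepted = classify_with_rejection((0.4, 0.0), models, 1.0)
rejected = classify_with_rejection((2.5, 2.9), models, 1.0)
```

A rejected input could then trigger a different classifier or recognition strategy, as suggested in the text.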
4 Conclusion
A set of algorithms has been developed to assess the feasibility of recognition using a
vector of geometrical features, such as nose width and length, mouth position and chin
shape. The advantages of this strategy over techniques based on template matching are
essentially:
The dependency of recognition performance using a Nearest Neighbor classifier has been
reported for several parameters such as:
5 We could have chosen other classifiers instead of Nearest Neighbor. The HyperBF classifier,
used in previous experiments of 3D object recognition, allows the automatic choice of the
appropriate metric, which is still, however, a weighted Euclidean metric.
6 Experiments by Lee on an OCR problem [15] suggest that a HyperBF classifier would be
significantly better than a NN classifier in the presence of rejection thresholds.
[Figures 4 and 5 appear here: recognition performance and the $R_{mM}$ ratio plotted against the number of classes, and recognition performance plotted against the rejection threshold (scale ×10⁻³).]
Acknowledgements
The authors thank Dr. L. Stringa for helpful suggestions and stimulating discussions.
One of the authors (R.B) thanks Dr. M. Dallaserra for providing the image data base.
Thanks are also due to Dr. C. Furlanello for comments on an earlier draft of this paper.
References
This article was processed using the LaTeX macro package with ECCV92 style
Fusion through Interpretation
1 Introduction
surface with information from a new description of the same patch. If the existing de-
scription is incomplete because of occlusion, the new description may supply information
about the 'missing' parts. There is thus a requirement to be able to combine the infor-
mation from two incomplete descriptions.
Underlying everything is the problem of uncertainty: how to make estimates and take
decisions in the presence of sensor noise. Much attention has been given to this subject
in recent years and stochastic methods have become the most popular way of handling
uncertainty. With these methods, when large numbers of estimates and/or decisions are
required, the computational burden can be quite substantial and there may be a need to
find ways of improving efficiency. Section 2 below touches on this issue.
A more detailed version of this paper can be found in [8].
2 Uncertainty
The type of uncertainty we are talking about is primarily due to noise in the numerical
data delivered by sensors. Recently, it has become standard practice in robotics and
computer vision [1, 2, 9, 7] to represent uncertainty explicitly by treating parameters
as random variables and specifying the first two moments (mean and variance) of their
probability distributions (generally assumed to be Gaussian). This permits the use of
techniques such as the Extended Kalman Filter for estimation problems, and the Maha-
lanobis Distance test for making decisions.
The Mahalanobis Test is used to decide whether two estimates are likely to refer to
the same underlying quantity. For example, suppose two surface descriptions give area
estimates of $(a, A)$ and $(b, B)$ (the first member of each pair is the mean, the second is
the variance). These estimates can be compared by computing the quantity

$$D_a = \frac{(a - b)^2}{A + B},$$

which has a $\chi^2$ distribution. Thus one can choose an appropriate threshold on $D_a$ to test
the hypothesis that the surface being described is the same in each case.
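For the scalar case, $D_a$ has one degree of freedom, so a simple acceptance test can be sketched as follows; the 3.84 default is the 95% point of $\chi^2_1$, and the area estimates are invented for the example.

```python
def mahalanobis_same_quantity(a, A, b, B, threshold=3.84):
    """Scalar Mahalanobis test: D = (a - b)^2 / (A + B), where (a, A)
    and (b, B) are (mean, variance) pairs.  Under the hypothesis that
    both estimates refer to the same quantity, D is chi-square with
    one degree of freedom; 3.84 is the 95% point of that distribution."""
    D = (a - b) ** 2 / (A + B)
    return D <= threshold

# Two area estimates that plausibly describe the same surface:
same = mahalanobis_same_quantity(10.0, 2.0, 11.0, 2.0)
```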
The same test is applicable in more complicated situations involving binary relations
(between pairs of surfaces) and vector-valued parameters. In general, some relation like

$$g(\mathbf{x}_1, \mathbf{y}_1, \mathbf{x}_2, \mathbf{y}_2) = 0 \qquad (1)$$

will hold between the true values, $\mathbf{x}_1$ and $\mathbf{x}_2$, of parameters describing some aspect of
two image surfaces and the true values, $\mathbf{y}_1$ and $\mathbf{y}_2$, of parameters describing the same
aspect of two model surfaces, though only if the two pairs correspond. If the parameter
estimates are $(\hat{\mathbf{x}}_i, X_i)$ and $(\hat{\mathbf{y}}_i, Y_i)$, $i = 1, 2$, then, to first order, the mean and variance
of $g$ are

$$\bar{g} = g(\hat{\mathbf{x}}_1, \hat{\mathbf{y}}_1, \hat{\mathbf{x}}_2, \hat{\mathbf{y}}_2),
\qquad
G = \sum_{i=1}^{2} \left( \frac{\partial g}{\partial \mathbf{x}_i} X_i \frac{\partial g}{\partial \mathbf{x}_i}^{\!T}
+ \frac{\partial g}{\partial \mathbf{y}_i} Y_i \frac{\partial g}{\partial \mathbf{y}_i}^{\!T} \right),$$

with the Jacobians evaluated at the estimates.
If such measures have to be computed frequently but are usually expected to result in
hypothesis rejections (as in interpretation trees - see Sect. 3), there is an efficient method
for their calculation. We illustrate this for the case of binary relations for the relative
distance of two points ($\mathbf{p}_i$ and $\mathbf{q}_i$) and the relative orientations of two vectors ($\mathbf{u}_i$ and
$\mathbf{v}_i$). The appropriate functions are, respectively,

$$g_d = (\mathbf{p}_1 - \mathbf{p}_2)^T(\mathbf{p}_1 - \mathbf{p}_2) - (\mathbf{q}_1 - \mathbf{q}_2)^T(\mathbf{q}_1 - \mathbf{q}_2),
\qquad
g_o = \mathbf{u}_1^T\mathbf{u}_2 - \mathbf{v}_1^T\mathbf{v}_2.$$
Additive terms of the form $\mathbf{x}^T A \mathbf{x}$, where $A$ is a variance matrix and $\mathbf{x}$ is a vector,
occur in the expressions for the scalar variances $G_d$ and $G_o$. We can use the Rayleigh-
Ritz theorem [6] and the fact that variance matrices are positive definite to bound such
expressions from above by

$$\mathbf{x}^T A \mathbf{x} \le \lambda_{\max}(A)\, \mathbf{x}^T\mathbf{x} \le \mathrm{trace}(A)\, \mathbf{x}^T\mathbf{x}.$$
This leads to cheaply calculated upper bounds on Gd and Go and corresponding lower
bounds on Dd and Do. Since these will usually exceed the thresholds, only in a minority
of cases will it be necessary to resort to the full, and more expensive, calculations of the
variances.
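The bound is cheap because it needs only the trace and a dot product. A quick numerical check of the inequality, on a randomly generated positive semi-definite matrix:

```python
import numpy as np

def quad_upper_bound(x, A):
    """Upper bound on x^T A x via the Rayleigh-Ritz theorem:
    x^T A x <= lambda_max(A) x^T x <= trace(A) x^T x,
    valid for positive (semi-)definite A such as a variance matrix."""
    return float(np.trace(A) * (x @ x))

rng = np.random.default_rng(1)
M = rng.random((3, 3))
A = M @ M.T                    # positive semi-definite by construction
x = rng.random(3)
exact = float(x @ A @ x)
bound = quad_upper_bound(x, A)
```

In the application described above, one would compute the bound first and fall back to the exact quadratic form only when the implied lower bound on the Mahalanobis measure does not already exceed the decision threshold.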
When the relation holding between the parameters (the function in (1)) is vector
valued (as for direct comparisons of infinite plane parameters - Sect. 3) a similar proce-
dure can be used. This avoids the necessity of performing a matrix inverse for every test
through the inequality
D = ḡᵀG⁻¹ḡ ≥ ḡᵀḡ / trace(G).
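The trace-based bound above can be checked numerically. A minimal sketch (the test matrices are arbitrary, invented data): since a variance matrix G is positive definite and λmax(G) ≤ trace(G), the cheap quantity ḡᵀḡ/trace(G) never exceeds the full statistic ḡᵀG⁻¹ḡ, so it is a valid lower bound requiring no matrix inverse:

```python
import numpy as np

def cheap_lower_bound(g, G):
    """Lower bound on D = g' G^-1 g via g'g / trace(G), valid because
    G is positive definite and lambda_max(G) <= trace(G)
    (Rayleigh-Ritz). No inverse or eigendecomposition needed."""
    g = np.asarray(g, dtype=float)
    G = np.asarray(G, dtype=float)
    return float(g @ g) / float(np.trace(G))

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
G = A @ A.T + 3.0 * np.eye(3)        # a positive definite variance matrix
g = rng.standard_normal(3)
D_full = float(g @ np.linalg.solve(G, g))   # the full statistic
D_bound = cheap_lower_bound(g, G)           # the cheap lower bound
```

If the cheap bound already exceeds the rejection threshold, the full statistic must too, so the hypothesis can be rejected without ever inverting G.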
Popular methods for solving correspondence problems include constrained search [5] and
generate-and-test [4]. We have adopted a hybrid approach similar to [3] where an inter-
pretation tree searches for consistent correspondences between groups of three surfaces,
the correspondences are used to estimate the image-to-model transform, the transform
is used to predict the location of all image surfaces in the model, and the prediction is
used to test the plausibility of the original three correspondences.
To constrain the search we use a unary relation on surface area, a binary relation
on relative orientation of surface normals and a binary relation on relative distance of
mid-points (see Sect. 2). For surface patches with occluded boundaries it is necessary to
make the variances on area and mid-point position appropriately large. In the case of
mid-point position, efficiency is maximised by increasing the uncertainty only in the plane
of the surface (so that one of the eigenvalues of the variance matrix, with an eigenvector
parallel to the surface normal, is small compared to the other two).
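A minimal sketch of this constrained triple search (the surface dictionaries, tolerances, and one-to-one pairing policy are our assumptions for illustration, not the paper's implementation):

```python
import itertools
import numpy as np

def consistent_triples(model, image, area_tol=0.3, dist_tol=0.2):
    """Enumerate triples of (model, image) surface pairings passing a
    unary area test and a pairwise midpoint-distance test. Surfaces
    are dicts with 'area' (scalar) and 'mid' (3-vector) keys."""
    def unary_ok(m, s):
        return abs(m['area'] - s['area']) <= area_tol * m['area']

    def binary_ok(m1, m2, s1, s2):
        dm = np.linalg.norm(np.subtract(m1['mid'], m2['mid']))
        ds = np.linalg.norm(np.subtract(s1['mid'], s2['mid']))
        return abs(dm - ds) <= dist_tol * max(dm, 1e-9)

    pairs = [(i, j) for i, m in enumerate(model)
             for j, s in enumerate(image) if unary_ok(m, s)]
    for (a, b, c) in itertools.combinations(pairs, 3):
        if len({a[0], b[0], c[0]}) < 3 or len({a[1], b[1], c[1]}) < 3:
            continue  # enforce one-to-one correspondences
        if all(binary_ok(model[p[0]], model[q[0]], image[p[1]], image[q[1]])
               for p, q in itertools.combinations((a, b, c), 2)):
            yield (a, b, c)

model = [{'area': 1.0, 'mid': (0, 0, 0)},
         {'area': 2.0, 'mid': (1, 0, 0)},
         {'area': 3.0, 'mid': (0, 2, 0)}]
image = model  # identical scene: the identity pairing must survive
triples = list(consistent_triples(model, image))
```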
Transforms are estimated from the infinite plane parameters n and d (for any point
x in the plane n T x = d where n is the surface normal). The measurement equation used
in the Extended Kalman Filter is
0 = [ R ni − nm ;  di + (R ni)ᵀ t − dm ]
where [nmᵀ dm]ᵀ and [niᵀ di]ᵀ are the parameter vectors for corresponding model and
image planes, t is the translation and R is the rotation matrix, parameterised by a three
component vector equal to the product of the rotation angle and the rotation axis [10].
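The plane-parameter relation underlying this measurement equation can be verified numerically. A sketch under our reading of the relation (nm = R ni, dm = di + nmᵀt, with an invented rotation and translation): any point on the image plane, once transformed, must satisfy the model-plane equation.

```python
import numpy as np

def plane_to_model(n_i, d_i, R, t):
    """Map image-plane parameters (n, d), with n.x = d, through the
    image-to-model transform x -> R x + t. This is our reconstruction
    of the relation: n_m = R n_i, d_m = d_i + n_m . t."""
    n_m = R @ n_i
    d_m = d_i + n_m @ t
    return n_m, d_m

# An example rotation about z and an arbitrary translation.
theta = 0.3
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0, 0.0, 1.0]])
t = np.array([1.0, -2.0, 0.5])

n_i = np.array([0.0, 0.0, 1.0]); d_i = 2.0   # image plane z = 2
x = np.array([3.0, 4.0, 2.0])                # a point on that plane
n_m, d_m = plane_to_model(n_i, d_i, R, t)
x_m = R @ x + t                              # the point in model coords
```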
The transform estimated for each group of three correspondences is used to transform
all the image surfaces into model coordinates allowing a direct comparison of positions
and orientations. Assuming there is at least one group which leads to a sufficiently large
number of further correspondences (hits) to add to the original three, the group with
the most is chosen as the correct interpretation. If there is no overlap between image
and model, none of the groups will develop more hits than the number expected on the
basis of random coincidence (which can be calculated and depends on the noise levels).
Moving objects in the scene result in multiple consistent groups with distinct transform
estimates.
The time required to find all the consistent triples and calculate the number of hits
for each is proportional to a fourth order polynomial in M and N - the number of, re-
spectively, model and image surfaces [8]. The number of consistent triples is proportional
to a third order polynomial in M and N, all but one of them (in a static scene) coming
about by random coincidence. Both also depend on noise levels: as uncertainty increases
the search constraints become less efficient, more time is spent searching the interpreta-
tion tree and more consistent groups are generated by coincidence. In practice, for noise
levels of a few percent and sizes of M > 10³ and N > 10, the process is intractable.
A dramatic change can be made by partitioning the environment model into parts and
searching for correspondences between the image and each part separately (instead of
between the image and the whole model). If there are P parts, the search time is reduced
by a factor of P³ and the number of spurious solutions by P². Such a partition is sensible
because it is unlikely that the robot will be able to simultaneously view surfaces of two
different rooms in the same building. The perceptual organization can be carried out as
a background recognition process with a generic model of what constitutes a part (e.g. a
room model).
Updating the infinite plane parameters of a model surface after a correspondence has
been found between it and an image surface is relatively straightforward using an Ex-
tended Kalman Filter. However, updating the boundary or shape information cannot be
achieved in the same manner because it is impossible to describe the boundary with a
single random variable. Moreover, because of the possibility of occlusion, the shapes of
corresponding surfaces may not be similar at all. The problem has some similarity with
the problem of matching strings which have common substrings.
The method we have adopted is again based on finding correspondences but only
between small data sets with efficient search constraints, so there is not a combinatorial
explosion problem. The features to be matched are the vertices and edges making up
the poly-line boundaries of the two surface patches, there being typically about 10 edge
features in each. If the two boundary descriptions relate to the same coordinate frame
the matching criteria may include position and orientation information as well as vertex
angles, edge lengths and edge labels (occluded or unoccluded). In practice, because of
the possibility of residual errors in the model-image transform estimate (see Sect. 3),
we exclude position and orientation information from the matching, calculate a new
transform estimate from the matched boundary features and use the old estimate to
check for consistency.
The search procedure is seeded by choosing a pair of compatible vertices (similar ver-
tex angles) with unoccluded joining edges, so it relies on there being at least one common
visible vertex. A new boundary is then traced out by following both boundaries around.
When neither edge is occluded both edges are followed; when one edge is occluded the
other is followed; when both edges are occluded the outermost is followed. Uncertainty
and over- or under-segmentation of the boundaries may give rise to different possible
feature matches (handled by an interpretation tree) but the ordering of features around
each boundary greatly constrains the combinatorics. If two unoccluded edges don't over-
lap, if an occluded edge lies outside an unoccluded one or if the transform estimate is
incompatible with the previously estimated image-model transform then the seed vertex
match is abandoned and a new one tried. If a vertex match is found which allows the
boundary to be followed round right back to the initial vertices, the followed boundary
becomes the new boundary of the updated surface. Otherwise, the two boundaries must
represent disjoint parts of the same surface and the updated surface acquires both.
5 Conclusions
Constrained search (interpretation trees) with stochastic techniques for handling un-
certainty can be used to solve both the image-model correspondence problem and the
boundary-boundary correspondence problem in order to fuse together multiple range
images into a surface-based environment model. The combinatorics of the image-model
problem are such that environment models must be divided into small parts if the solution
method is to be tractable while the combinatorics of the boundary-boundary problem
are inherently well behaved.
References
1. N. Ayache and O.D. Faugeras. Maintaining representations of the environment of a mobile
robot. In Robotics Research 4, pages 337-350. MIT Press, USA, 1988.
2. Y. Bar-Shalom and T.E. Fortmann. Tracking and Data Association. Academic Press, UK,
1988.
3. T.J. Fan, G. Medioni, and R. Nevatia. Recognizing 3-D objects using surface descriptions.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(11):1140-1157, 1989.
4. O.D. Faugeras and M. Hebert. The representation, recognition, and locating of 3d shapes
from range data. International Journal of Robotics Research, 5(3):27-52, 1986.
5. W.E.L. Grimson. Object Recognition by Computer: the Role of Geometric Constraints.
MIT Press, USA, 1990.
6. R.A. Horn and C.R. Johnson. Matrix Analysis. Cambridge University Press, USA, 1985.
7. M.J.L. Orr, R.B. Fisher, and J. Hallam. Uncertain reasoning: Intervals versus probabilities.
In British Machine Vision Conference, pages 351-354. Springer-Verlag, 1991.
8. M.J.L. Orr, J. Hallam, and R.B. Fisher. Fusion through interpretation. Research Paper
572, Dept. of Artificial Intelligence, Edinburgh University, 1992.
9. J. Porrill. Fitting ellipses and predicting confidence using a bias corrected Kalman Filter.
Image and Vision Computing, 8(1):37-41, 1990.
10. Z. Zhang and O.D. Faugeras. Determining motion from 3d line segment matches: a com-
parative study. Image and Vision Computing, 9(1):10-19, 1991.
3-D Object Recognition using Passively Sensed
Range Data*
The extraction of a 3-D model can be performed in a series of steps: (1) computing
depth using camera motion, for the significant intensity discontinuities, (2) interpolating
range data between the significant intensity discontinuities, (3) smoothing of the resultant
depth map and (4) deriving a 3-D model from the depth map.
One solution to this dilemma is provided by the computation of depth using camera
motion. The technique employed in this paper (see [2,3,4] for details) uses nine images
taken from different positions on an arc around a fixation point. The instantaneous
optic flow, representing the apparent motion of zero-crossings in each successive image,
is computed from the time derivative of the Laplacian of Gaussian of each image. The
global optic flow is computed by exploiting the instantaneous optic flow to track the
motion of each zero-crossing point throughout the complete sequence. This provides a
vector field representing the correspondence between zero-crossing points in the initial
and final image in the sequence, i.e. over an extended base-line of camera displacement.
The depth of each zero-crossing is then computed by triangulation, using the start point
and the end point of each global optic flow vector.
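A generic stand-in for this triangulation step (midpoint triangulation between two rays; the camera geometry below is invented for illustration, it is not the paper's fixation setup):

```python
import numpy as np

def triangulate(c1, r1, c2, r2):
    """Midpoint triangulation: the 3-D point closest to both rays
    x = c + s r (camera centres c, ray directions r), solved in
    least squares. The rays here come from the two ends of a
    global optic flow vector."""
    r1 = r1 / np.linalg.norm(r1)
    r2 = r2 / np.linalg.norm(r2)
    # Solve [r1 -r2] [s1 s2]' = c2 - c1 for the ray parameters.
    A = np.stack([r1, -r2], axis=1)
    s, *_ = np.linalg.lstsq(A, c2 - c1, rcond=None)
    p1 = c1 + s[0] * r1
    p2 = c2 + s[1] * r2
    return 0.5 * (p1 + p2)

# Two views of a point 600 mm away, camera centres 20 mm apart.
P = np.array([0.0, 0.0, 600.0])
c1 = np.array([-10.0, 0.0, 0.0])
c2 = np.array([10.0, 0.0, 0.0])
X = triangulate(c1, P - c1, c2, P - c2)
```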
An example is shown in Figure 1 in which the third image in a sequence of nine
images of a book is shown, along with its significant intensity discontinuities, the optical
flow determined between the third and eighth images in the sequence, and finally the
depth map which results after interpolation and smoothing. The nine images were taken
with an angular disparity of approximately 2~ between successive camera positions and
a fixation distance of 600mm.
The result of the previous algorithm is a sparse depth map, where depth values are known
only at locations corresponding to the significant zero-crossings which were successfully
tracked throughout the sequence of images. However, the purpose of this research is to
investigate the potential for recognising 3-D structure derived from passively sensed data,
and hence we must interpolate between the available depth information.
The majority of interpolation techniques attempt to fit a continuous surface to the
available depth information (e.g. [6]). This requires that range data be segmented into
likely surfaces prior to the application of the interpolation technique, or alternatively
that only a single surface be presented. We employ a simpler technique involving planar
interpolation to ensure that the surfaces are correct for polyhedral objects.
The interpolation method defines a depth value for each undefined point in the depth
map by probing in five directions (to both the East and West, where East and West are
parallel to the direction of motion) from that point in order to find defined depth values.
A measure of those defined depth values (based also on the orientations of the features
with which the depth values are associated, and the distances from the undefined point)
is then employed to define the unknown depth value at the point (x, y).
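A hedged sketch of such an East/West probing scheme (the paper's actual measure also uses feature orientations; the simple inverse-distance weighting below is our guess at the flavour of the computation, restricted to one scan line for brevity):

```python
import numpy as np

def interpolate_row(row):
    """Fill undefined (NaN) entries of one scan line by probing East
    and West for the nearest defined depths and inverse-distance
    weighting them. The orientation term of the paper's measure is
    omitted in this sketch."""
    out = row.copy()
    defined = np.where(~np.isnan(row))[0]
    for x in np.where(np.isnan(row))[0]:
        west = defined[defined < x]
        east = defined[defined > x]
        if len(west) and len(east):
            w, e = west[-1], east[0]
            dw, de = x - w, e - x
            out[x] = (row[w] / dw + row[e] / de) / (1.0 / dw + 1.0 / de)
    return out

row = np.array([600.0, np.nan, np.nan, 630.0])
filled = interpolate_row(row)   # linear ramp between the two probes
```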
1.3 Filtering range data
The depth map which results from the interpolation can be quite noisy; that is,
there can be local regions of depth values which differ significantly from those in a
larger area around them. In order both to overcome this and to define values for isolated
undefined points, a smoothing filter was applied to the data. A smoothing filter which
simply averages all depth values within a mask was not appropriate, as resultant values
would be affected by local regions of noise. This restricted the potential choice of filter
considerably, and only two types of filter, median and modal, were considered. It was
found, experimentally, that a reasonably large (e.g. 11×11) modal filter produced the best
results, in terms of the resultant depth map, and hence 3-D structure.
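A minimal modal-filter sketch (using a 3×3 window on toy data rather than the 11×11 of the paper; the binning-by-unique-values approach is our simplification):

```python
import numpy as np
from scipy.ndimage import generic_filter

def modal_filter(depth, size=11):
    """Replace each depth by the most frequent value in a size x size
    window. Unlike a mean filter, an isolated noisy patch cannot drag
    the values of its neighbours towards it."""
    def mode(values):
        vals, counts = np.unique(values, return_counts=True)
        return vals[np.argmax(counts)]
    return generic_filter(depth, mode, size=size, mode='nearest')

depth = np.full((7, 7), 600.0)
depth[3, 3] = 900.0                      # one spike of local noise
smoothed = modal_filter(depth, size=3)   # the spike is voted away
```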
1.4 Building 3-D models
Finally, having obtained a reasonably smooth depth map, it is still necessary to convert
from a viewer-centered description to an object-centered surface model. This can be
done by first employing the relevant camera model, to convert from image coordinates
(i, j, depth) to Cartesian coordinates (x, y, z), and then deriving '3-point seed' surfaces
[7] (i.e. surfaces are instantiated between any three points which are within the sampling
distance of each other).
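The first of these steps, back-projecting (i, j, depth) into Cartesian coordinates, can be sketched under a simple pinhole camera model (the focal length and principal point below are illustrative assumptions, not the paper's calibration):

```python
import numpy as np

def to_cartesian(i, j, depth, f, ci, cj):
    """Back-project pixel (i, j) with measured depth into camera-frame
    Cartesian coordinates (x, y, z) under a pinhole model with focal
    length f (in pixels) and principal point (ci, cj)."""
    z = depth
    x = (j - cj) * z / f
    y = (i - ci) * z / f
    return np.array([x, y, z])

# The principal point maps onto the optical axis; off-axis pixels
# spread out in proportion to depth.
p = to_cartesian(i=240, j=320, depth=600.0, f=800.0, ci=240, cj=320)
q = to_cartesian(i=240, j=400, depth=600.0, f=800.0, ci=240, cj=320)
```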
The basic problem of rigid object recognition is to establish the correspondence between
a viewed object model and a particular known object model, together with the com-
putation of the associated pose. The majority of object recognition techniques match
known object models with viewed instances of objects through the comparison of model
primitives (e.g. edges). However, it is extremely unlikely that the model primitives ex-
tracted will be identical to those of a model which is known a priori. In order to overcome
that problem it is possible to employ secondary representations, such as the Extended
Gaussian Image (or EGI) [8], although using the EGI has proved difficult [9].
The technique of implicit model matching introduces a new, more powerful, but sim-
ilar idea. It employs several secondary representations which allow the problem to be
considered in terms of sub-problems. Initially, orientations of known objects which may
potentially match the viewed object are identified through the comparison of visible sur-
face normals, for each possible orientation of each known object (where 'each possible
orientation' is defined by the surface normals of a tessellated sphere). Potential orienta-
tions are then fine tuned by correlating 1-D histograms of specific components of surface
orientations (known as directional histograms). The object position is estimated using ap-
proximate knowledge of the physical configuration of the camera system, and fine tuned
using a template matching technique between needle diagrams derived from the known
and viewed models. Finally, using normalised correlation, each hypothesis is evaluated
through the comparison of the needle diagrams.
At each stage in the generation, tuning and verification of hypotheses only compar-
isons of the various secondary representations are employed. The central concept behind
implicit model matching is, then, that 3-D object models may be reliably compared
through the use of secondary representations, rather than (or, more properly, as well as)
by comparison of their component primitives. Additionally it is important to note that
object pose may be determined to an arbitrarily high degree of accuracy (through the
fine-tuning stages), although initially only a limited number of views are considered.
2.1 Approximating Object Orientation
Fig. 2. Example yaw directional histograms. These two yaw histograms of two views of a
garage-like object are a simple example of how directional histograms work. The visible sur-
face areas of the views of the object are mapped to the histograms at their respective yaw
angles (defined with respect to the focal axes of the camera). Notice the shift in the histograms,
which is due to the slightly different values of yaw of the two viewpoints.
Fig. 3. Example tilt directional histograms. These two tilt histograms are derived from the
two views of the garage-like object shown in Figure 2. Notice how, for the first view, the two
orientations result in the same value of tilt, and in the second view how the values change.
2.2 Fine-tuning Object Orientation
The potentially matching orientations computed can only be guaranteed to be as accurate
as the quantisation of the sampled sphere. Increasing the resolution of the sphere to
an arbitrarily high level, however, would obviously cause a significant increase in the
computational overhead required in determining potential orientations. Alternatively, it
is possible to fine-tune the orientations using directional histograms (of roll, pitch and
yaw) in a similar fashion to the method used for the approximate determination of object
roll. Pitch, yaw and roll directional histograms are derived from the view of the known
object and compared with histograms derived from the viewed model. The differences
between the directional histograms indicate the amount by which the orientation may
best be tuned (e.g. see Figure 2). The various directional histograms are derived and
compared sequentially and iteratively until the tuning required falls below the required
accuracy of orientation or until the total tuning on any component of orientation exceeds
the range allowed (which is defined by the quantisation of the tessellated sphere).
This stage allows the accuracy of potentially matching orientations to be determined
to an arbitrarily high level (limited only by the resolution of the directional histograms).
Hence, although only a limited number of possible viewpoints of any known object are
considered, the orientation of the object may be determined to a high level of accuracy.
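The histogram-comparison step can be sketched as a circular cross-correlation: the shift maximising the correlation between the known and viewed directional histograms indicates the tuning required (the bin counts below are invented toy data):

```python
import numpy as np

def histogram_shift(h_known, h_viewed):
    """Estimate the rotational offset between two directional (e.g.
    yaw) histograms as the circular shift of the known histogram
    that maximises its correlation with the viewed one."""
    n = len(h_known)
    scores = [np.dot(np.roll(h_known, s), h_viewed) for s in range(n)]
    return int(np.argmax(scores))

yaw_known = np.array([0., 0., 5., 9., 5., 0., 0., 0.])
yaw_viewed = np.roll(yaw_known, 2)   # same object, yawed by two bins
shift = histogram_shift(yaw_known, yaw_viewed)
```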
2.3 Approximating Object Position
Turning now to the approximate determination of object position, it is relatively straight-
forward to employ the position of the viewed model with respect to its viewing camera.
The imaged centroid of the viewed model, and an approximate measure of the distance
of the viewed model from the viewing camera are both easily computed. The position
of the camera which views the known model may then be approximated by placing the
camera in a position relative to the known model's 3-D centroid, such that the centroid
is at the correct approximate distance from the camera and is viewed by the camera in
the same position as the viewed model's imaged centroid.
2.4 Fine-tuning Object Position
Fine tuning object position may be considered in terms of two operations: tuning position
in a direction orthogonal to the focal axis of the viewing device, and tuning of the
distance of the object from the same viewing device (i.e. the depth). This separates the
3 degrees of freedom inherent in the determination of object position.
NV(m, n) = [ Σi Σj f(viewed(i, j)) · (π − angle(viewed(i, j), known(i − m, j − n))) ] / [ π · Σi Σj f(viewed(i, j)) ]   (2)
where viewed(i, j) and known(i, j) are the 3-D orientation vectors from the viewed
and known needle diagrams respectively, f(vector) is 1 if the vector is defined and 0
otherwise, and angle(vector1, vector2) is the angle between the two vectors (or 0 if they
are undefined).
In order to make this template matching operation more efficient, the needle diagrams
are first compared at lower resolutions, using a somewhat simpler measure-of-fit.
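Equation (2), as we read it, can be implemented directly (NaN rows standing in for undefined needles, and wrap-around indexing for out-of-bounds offsets are our choices for this sketch): two identical needle diagrams at zero offset should score exactly 1.

```python
import numpy as np

def needle_fit(viewed, known, m, n):
    """Measure-of-fit NV(m, n) between needle diagrams of unit surface
    normals (H x W x 3 arrays; NaN marks undefined needles). Angles
    near 0 score 1, angles near pi score 0."""
    H, W, _ = viewed.shape
    num = den = 0.0
    for i in range(H):
        for j in range(W):
            v = viewed[i, j]
            if np.isnan(v).any():
                continue
            den += np.pi
            k = known[(i - m) % H, (j - n) % W]
            if np.isnan(k).any():
                num += np.pi   # angle taken as 0 when undefined
                continue
            ang = np.arccos(np.clip(np.dot(v, k), -1.0, 1.0))
            num += np.pi - ang
    return num / den if den else 0.0

up = np.array([0.0, 0.0, 1.0])
needles = np.tile(up, (4, 4, 1))          # a flat patch facing the camera
score = needle_fit(needles, needles, m=0, n=0)
```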
Tuning Object Depth. Fine tuning the distance between the known model and its
viewing camera is done through direct comparison of the depth maps generated from
both the viewed model and the known model (in its determined pose). The Depth Change
Required (or DCR) is defined as follows:
DCR = [ Σi Σj f(viewed(i, j)) · f(known(i, j)) · (viewed(i, j) − known(i, j)) ] / [ Σi Σj f(viewed(i, j)) · f(known(i, j)) ]   (3)
where viewed(i, j) and known(i, j) are the depths from the viewed and known depth
maps respectively, and f(depth) = 1 if the depth is defined and 0 otherwise. The DCR
is directly applied as a translation to the pose of the camera which views the known
model, in a direction defined by the focal axis of the camera. Due to perspective effects
this operation will affect the depth map rendered, and so it is applied iteratively
until the DCR falls below an acceptable level.
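Equation (3) reduces to a masked mean of signed depth differences, which can be sketched as follows (the toy depth maps, and the use of NaN for undefined depths, are our illustrative choices):

```python
import numpy as np

def depth_change_required(viewed, known):
    """DCR of Eq. (3): mean signed depth difference over pixels where
    both depth maps are defined (NaN marks an undefined depth)."""
    both = ~np.isnan(viewed) & ~np.isnan(known)
    if not both.any():
        return 0.0
    return float(np.mean(viewed[both] - known[both]))

viewed = np.array([[600.0, np.nan],
                   [610.0, 620.0]])
known = viewed - 15.0          # known model rendered 15 units too near
dcr = depth_change_required(viewed, known)
```

In the iterative loop described above, this value would be applied as a translation of the camera along its focal axis, the known model re-rendered, and the DCR recomputed.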
2.5 Verifying Hypotheses
Having hypothesised and fine tuned poses of known objects it is necessary to determine
some measure of fit for the hypothesis so that it may be accepted (subject to no better
hypothesis being determined), or rejected. The normalised correlation of local surface
orientations (i.e. needle diagrams) between the viewed model and the known model in a
determined pose, as used when fine tuning object position (see section 2.4) gives a degree-
of-fit which represents all aspects of object position and orientation. This degree-of-fit,
then, provides a powerful hypothesis verification measure.
The known model, in the computed position and orientation, which gives the best
degree of fit with the viewed model (i.e. maximal correlation between derived needle
diagrams), and which exceeds some predefined threshold, is taken to be the best match;
the viewed model is assumed to be the corresponding object, in the pose of the known
model.
2.6 An Exception
There is, however, one situation in which this technique, implicit model matching, will
fail and that is when only a single surface/orientation is visible. Computation of object
roll in this instance is impossible using directional histograms (as there is an inherent
ambiguity with respect to roll around the orientation vector of the surface).
This situation can be detected by considering the standard deviation of the visible
orientations as mapped to an EGI (as if the standard deviation is less than a small angle
then it may be taken that only one surface is visible). The problem must then be regarded
as one of shape recognition, although it should be noted that it is possible to adapt the
technique of implicit model matching to cope with this situation (see [10]).
The intention of the testing detailed herein is to investigate the robustness of the recog-
nition technique, and to demonstrate the potential for recognising 3-D models derived
from passively sensed data. The objects employed were all of simple rigid geometric
structure, and scenes contained only one object. The rationale for these choices is that
the segmentation and identification of occlusion problems, etc., still require much further
research.
As an example of recognition, consider the book shown in Figure 1. The model deter-
mined is quite accurate, the main exception being the title. Regardless of these
errors, however, enough of the model is computed correctly to allow reliable identifica-
tion of the book (see Figure 5) from the database of objects (see Figure 4).
Fig. 5. Recognition of the model derived from the depth map shown in Figure 1.
The discrimination between the various recognition hypotheses is not that significant
however, and as more complex objects are considered (See Table 1) the discriminatory
ability gets progressively worse resulting, eventually, in mistaken recognition. The testing
allows a number of conclusions to be drawn, which follow, and satisfies both of the stated
intentions of this research.
2. Implicit model matching was found to degrade reasonably gracefully in the presence
of noisy and incorrect data. However, its performance with models derived from
actively sensed range data [1] was significantly more reliable.
3. Finally, the limitations on the techniques presented for the development of three
dimensional models are emphasised. The visual processing performed in this research
quite obviously deals in a trivial way with many important issues, and these issues
remain for future research.
Further details of all aspects of the system described in this paper are given in [4].
Scene       Fig.   Cube    Cone    Book    Mug     Can     Globe   Tape    Result
Cube               0.8229  0.6504  0.7220  0.6810  0.5270  0.6344  0.3656  Cube
Cone               0.2508  0.7311  0.5130  0.2324  0.1962  0.3777  0.1332  Cone
Book        5      0.2763  0.5432  0.7534  0.2958  0.2189  0.3412  0.1883  Book
Mug                0.6872  0.6061  0.5770  0.6975  0.5892  0.6580  0.3276  Mug
Pepsi Can          0.6643  0.5829  0.7099  0.6705  0.6852  0.5150  0.3658  Book *
Globe              0.4492  0.5620  0.5789  0.4036  0.3305  0.5725  0.2897  Book *
Sellotape          0.5706  0.6270  0.6689  0.4979  0.5039  0.4295  0.3735  Book *
Table 1. The complete table of the degrees-of-fit determined between viewed instances of objects
and the known models.
References
1. Dawson, K., Vernon, D.: Model-Based 3-D Object Recognition Using Scalar Transform
Descriptors. Proceedings of the conference on Model-Based Vision Development and Tools,
Vol. 1609, SPIE - The International Society for Optical Engineering (November 1991)
2. Sandini, G., Tistarelli, M.: Active Tracking Strategy for Monocular Depth Inference Over
Multiple Frames. IEEE PAMI, Vol.12, No.1 (January 1990) 13-27
3. Vernon, D., Tistarelli, M.: Using Camera Motion to Estimate Range for Robotic Part
Manipulation. IEEE Robotics and Automation, Vol.6, No.5 (October 1990) 509-521
4. Vernon, D., Sandini, G. (editors): Parallel Computer Vision - The VIS a VIS System. Ellis
Horwood (to appear)
5. Horn, B., Schunck, B.: Determining Optical Flow. Artificial Intelligence, Vol.17, No.1 (1981)
185-204
6. Grimson, W.: From Images to Surfaces: A Computational Study of the Human Early Visual
System. MIT Press, Cambridge, Massachusetts (1981)
7. Faugeras, O., Hebert, M.: The representation, recognition and locating of 3-D objects.
International Journal of Robotics Research, Vol. 5, No. 3 (Fall 1986) 27-52
8. Horn, B.: Extended Gaussian Images. Proceedings of the IEEE, Vol.72, No.12 (December
1984) 1671-1686
9. Brou, P.: Using the Gaussian Image to Find Orientation of Objects. The International
Journal of Robotics Research, Vol.3, No.4 (Winter 1984) 89-125
10. Dawson, K.: Three-Dimensional Object Recognition through Implicit Model Matching.
Ph.D. thesis, Dept. of Computer Science, Trinity College, Dublin 2, Ireland (1991)
This article was processed using the LaTeX macro package with the ECCV92 style.
Interpretation of Remotely Sensed Images
in a Context of Multisensor Fusion*
1 Introduction
An extensive literature has grown since the beginning of the decade on the problem
of scene interpretation, especially for aerial and satellite images [NMS0,Mat90] [RH89]
[RIHR84] [MWAW89] [HN88] [Fua88] [GG90]. One of the main difficulties of these appli-
cations is the knowledge representation of the objects, of the scene, and of the interpretation
strategy. The previously mentioned systems use various kinds of knowledge, such as object
geometry, mapping, sensor specifications, spatial relations, etc.
On the other hand, there is growing interest in the use of multiple sensors to increase
both the availability and the capabilities of intelligent systems [MWAW89,Mat90] [LK89]
[RH89]. However, while multi-sensor fusion is a way to increase the number of measurements
of the world through complementary or redundant sensors, the problems of controlling the
data flow, of object detection strategies, and of modeling objects and sensors also increase.
This paper presents a scene interpretation system in a context of multi-sensor fusion.
We propose to perform fusion at the intermediate level because it is the most adaptive
and the most general for different applications of scene analysis. First, we present how the
real world and the interpreted scene are modeled; knowledge about sensors and multiple
views notion (shot) are taken into account. Then we give an overview of the architecture
of the system. Finally, some results are shown from an application to SAR/SPOT image
interpretation.
2 Modeling
Consistency of information is one of the relevant problems of multi-sensor fusion systems;
in fact, various models must be used to express the a priori knowledge. This knowledge
can be divided into knowledge about the real world and knowledge about the interpre-
tation.
2.1 Real World Modeling
For an interpretation system, a priori knowledge of the scene to be observed is necessary:
for example, the description of objects which might be present in the scene. Moreover,
* This work is in part supported by AEROSPATIALE, Department E/ETRI, F-78114 Magny-
les-hameaux, and by ORASIS contract, PRC/Communication Homme-Machine.
Sensors: Some sensors are sensitive to object reflectance, others to position or to
shape... Radiometric features mainly come from the materials the objects are com-
posed of, and more precisely from features of these materials such as cold, homogeneous,
rough, textured, smooth... The response to each aspect is quite different depending on
the sensor.
Therefore sensors are modeled in our system using the sensitivity to aspects of various
materials, the sensitivity to geometry of objects, the sensitivity to orientation of objects,
the band width described by minimum and maximum wave length, and the type (active
or passive). Note that the quality of the detection (good, medium, or bad) has been
dissociated from the aspect in the image (light, grey, dark).
Due to their properties, some objects will be well detected by one sensor, and not by
another one; other objects will be well detected by various sensors. To be able to detect
easily and correctly an object, we have to choose the image(s), i.e. the sensor(s), in which
it is best represented. For that, our system uses the sensitivities of the sensors, and the
material composition of the objects.
Knowing the position of the sensor, and its resolution is also important to be able to
determine whether an object could be well detected. We call shot the whole information:
the description of the sensor, the conditions of acquisition including the point of view,
and the corresponding image.
2.2 Interpretation
The main problem is how to represent the scene being interpreted. First of all, we shall
make precise what we call an interpreted scene, and which information must be present
in an interpretation. Our goal is not to classify each pixel of the image; it is to build a
semantic model of the real observed scene. This model must include: the precise location
of each detected object, its characteristics (such as shape, color, function...), and its
relations with other objects present in the scene. To capture such information, it is
necessary to have a spatial representation of the scene; in the 2D-case, this can be done
using a location matrix. This representation allows us to focus attention on precise areas
using location operators such as surrounded by, near..., and to detect location conflicts.
Location conflicts occur when areas of different objects overlap. Three different kinds
of conflicts can be cited: conflicts among superposed objects (in fact, they are not real
conflicts: a bridge over a road); conflicts among adjacent objects (some common pixels;
such a conflict is due to low level algorithms, digitalization...); conflicts arising because
of ambiguous interpretation between different sorts of objects (this kind of conflict can
be elucidated only by using relational knowledge).
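A minimal sketch of conflict detection over such a location matrix (representing each detected object by a binary mask on the scene grid; the object names and masks are invented for illustration):

```python
import numpy as np

def location_conflicts(loc):
    """Given one binary location mask per detected object, report the
    pairs of objects whose areas overlap, i.e. candidate location
    conflicts to be resolved (superposition, segmentation error, or
    ambiguous interpretation)."""
    names = list(loc)
    return [(a, b) for i, a in enumerate(names) for b in names[i + 1:]
            if np.any(loc[a] & loc[b])]

grid = np.zeros((4, 4), dtype=bool)
road = grid.copy();   road[2, :] = True     # a horizontal road
bridge = grid.copy(); bridge[:, 1] = True   # a bridge crossing over it
lake = grid.copy();   lake[0, 3] = True     # an isolated lake
conflicts = location_conflicts({'road': road,
                                'bridge': bridge,
                                'lake': lake})
```

The road/bridge overlap is flagged as a conflict, which relational knowledge (a bridge may pass over a road) would then classify as a legitimate superposition rather than an error.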
3 Implementation
Our goal was to develop a general framework to interpret various kinds of images such as
aerial images, or satellite images. It has been designed as a shell to develop interpretation
systems. Two main knowledge representations are used: frames and production rules. The
system has been implemented using the SMECI expert system generator [II91], and the
NMS multi-specialist shell [CAN90]; it is based on the blackboard and specialist concepts
[HR83]. This approach has been widely used in computer vision [HR87,Mat90], and in
multi-sensor fusion [SST86]. We have simplified the blackboard structure presented by
Hayes-Roth, and we have built a centralized architecture with three types of specialists:
the generic specialists (application-independent), the semantic object specialists
(application-dependent), and the low level specialists (dependent on image processing
and feature description). They work at different levels of representation, are independent,
and work only on a strategy level request; so the system is generic and incremental. The
detection strategy is based on the fundamental notion of spatial context linking the
objects in the scene, and the notion of salient object.
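The three-specialist blackboard organization can be sketched as follows; the class and method names are our own illustration and do not reflect the SMECI or NMS interfaces:

```python
# Minimal sketch of a centralized blackboard with independent specialists that
# act only on a strategy-level request; names are illustrative, not the system's.
class Blackboard:
    def __init__(self):
        self.hypotheses = []          # shared data at all representation levels
    def post(self, level, item):
        self.hypotheses.append((level, item))
    def at(self, level):
        return [i for l, i in self.hypotheses if l == level]

class Specialist:
    def activate(self, blackboard):   # runs only when the strategy requests it
        raise NotImplementedError

class EdgeDetector(Specialist):       # low-level specialist (image processing)
    def activate(self, bb):
        bb.post("feature", "linear-structure")

class RiverSpecialist(Specialist):    # semantic-object specialist (application)
    def activate(self, bb):
        if "linear-structure" in bb.at("feature"):
            bb.post("object", "river")

bb = Blackboard()
for s in (EdgeDetector(), RiverSpecialist()):   # strategy triggers specialists
    s.activate(bb)
print(bb.at("object"))  # ['river']
```

Because specialists only read and write the blackboard, adding a new semantic object (a new class) does not disturb the existing ones, which matches the incremental design goal stated above.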
To demonstrate the reliability of our approach, we have implemented an application
for the interpretation of SAR images registered with SPOT images, a set of sensors
which are complementary. Five sensors (the SIR-B Synthetic Aperture Radar, and the
panchromatic, XS1 [blue], XS2 [green], XS3 [near infra-red] SPOT sensors), ten materials
(water, metal, asphalt, cement, vegetation, soil, sand, rock, snow, and marsh), and five
kinds of semantic objects (rivers, lakes, roads, urban areas, and bridges) are modeled in
this application. Figure 1 presents an example of the result (fig. 1.c) we obtained using
three images: SAR (fig. 1.a), SPOT XS1, and SPOT XS3 (fig. 1.b). Closed contours point
out urban areas. Filled regions indicate lakes. Thin lines represent bridges, roads, and
the river. More details about this application can be found in [CGH92], while low-level
algorithms are described in [HG91].
Fig. 1. Top: Sensor images used for scene interpretation: (a) SIR-B Radar image; (b) near
infra-red SPOT XS3 image. Bottom: (c) Objects detected in the scene after interpretation.
Closed contours point out urban areas. Filled regions indicate lakes. Thin lines represent bridges,
roads, and the river.
4 Conclusion
We have proposed a way to model the real world and the interpreted scene in the context of
multi-sensor fusion. A priori knowledge description includes the characteristics of the sensors,
and a semantic object description independent of the sensor characteristics. This archi-
tecture meets our requirements of highly modular structure allowing easy incorporation
of new knowledge and new specialists. A remote sensing application with SAR/SPOT
sensors aiming at detecting bridges, roads, lakes, rivers, and urban areas demonstrates
the efficiency of our approach.
Acknowledgments: The authors would like to thank O. Corby for providing useful
suggestions during this study, and J.M. Pelissou and F. Sandakly for their contribution to
the implementation of the system.
References
[CAN90] O. Corby, F. Allez, and B. Neveu. A multi-expert system for pavement diagnosis
and rehabilitation. Transportation Research Journal, 24A(1), 1990.
[CGH92] V. Clément, G. Giraudon, and S. Houzelle. A knowledge-based interpretation sys-
tem for fusion of SAR and SPOT images. In Proc. of IGARSS, Houston, Texas, May
1992.
[Fua88] P. Fua. Extracting features from aerial imagery using model-based objective func-
tions. PAMI, 1988.
[GG90] P. Garnesson and G. Giraudon. An image analysis system, application for aerial
imagery interpretation. In Proc. of ICPR, Atlantic City, June 1990.
[HG91] S. Houzelle and G. Giraudon. Automatic feature extraction using data fusion in re-
mote sensing. In SPIE Proceedings, Vol. 1611, Sensor Fusion IV: Control Paradigms
and Data Structures, Boston, November 1991.
[HN88] A. Huertas and R. Nevatia. Detecting buildings in aerial images. Comp. Vision
Graphics and Image Proc., 41(2):131-152, February 1988.
[HR83] B. Hayes-Roth. The blackboard architecture: A general framework for problem
solving? Stanford University, Report HPP-83-30, 1983.
[HR87] A. Hanson and E. Riseman. The VISIONS image-understanding system. In C.M.
Brown, editor, Advances in Computer Vision, pages 1-114. Erlbaum Assoc., 1987.
[II91] Ilog and INRIA. Smeci 1.54: Le manuel de référence. Gentilly, 1991.
[LK89] R. Luo and M. Kay. Multisensor integration and fusion in intelligent systems.
IEEE Trans. on Sys. Man and Cyber., 19(5):901-931, October 1989.
[Mat90] T. Matsuyama. SIGMA, a Knowledge-Based Aerial Image Understanding System.
Advances in Computer Vision and Machine Intelligence. Plenum, New York, 1990.
[MHW89] D.M. McKeown, Jr., W.A. Harvey, and L.E. Wixson. Automating knowl-
edge acquisition for aerial image interpretation. Comp. Vision Graphics and Image
Proc., 46:37-81, 1989.
[NM80] M. Nagao and T. Matsuyama. A Structural Analysis of Complex Aerial Pho-
tographs. Plenum, New York, 1980.
[RH89] E. M. Riseman and A. R. Hanson. Computer vision research at the university of
massachusetts, themes and progress. Special Issue of Int. Journal of Computer
Vision, 2:199-207, 1989.
[RIHR84] G. Reynolds, N. Irwin, A. Hanson, and E. Riseman. Hierarchical knowledge-
directed object extraction using a combined region and line representation. In
Proc. of Work. on Comp. Vision Repres. and Cont., pages 238-247. Silver Spring,
1984.
[SST86] S. Shafer, A. Stentz, and C. Thorpe. An architecture for sensor fusion in a mobile
robot. In Int. Conf. on Robotics and Automation, pages 2002-2011, San Francisco,
June 1986.
Limitations of Non Model-Based Recognition
Schemes
1 Introduction
1 A similar result has been independently proved by Burns et al. 1990 and Clemens & Jacobs
1990.
exists a non-trivial consistent function for objects from the scheme's scope. The function
can have in this case arbitrary values for images of objects that do not belong to the
class. The existence of a nontrivial consistent function for a specific class of objects
depends on the particular class in question. In Section (4) we discuss the existence of
consistent recognition functions with respect to viewing position for specific classes of
objects. In Section (4.1) we give an example of a class of objects for which every consistent
function is still a constant function. In Section (4.2) we define the notion of the function
discrimination power. The function discrimination power determines the set of objects
that can be discriminated by a recognition scheme. We show that, given a class of objects,
it is possible to determine an upper bound for the discrimination power of any consistent
function for that class. We use as an example the class of symmetric objects (Section 4.3).
Finally, we consider grey level images of objects that consist of n small surface patches
in space (this can be thought of as sampling an object at n different points). We show that
every consistent function with respect to illumination conditions and viewing position
defined on points of the grey level image is also a constant function.
We conclude that every consistent recognition scheme for 3-D objects must depend
strongly on the set of objects learned by the system. That is, a general consistent recog-
nition scheme (a scheme that is not limited to a specific class of objects) must be model-
based. In particular, the invariant approach cannot be applied to arbitrary 3-D objects
viewed from arbitrary viewing positions. However, a consistent recognition function can
be defined for non model-based schemes restricted to specific class of objects. An up-
per bound for the discrimination power of any consistent recognition function can be
determined for every class of objects.
It is worth noting here that the existence of features invariant to viewing position
(such as parallel lines) and of invariant recognition functions for 2-D objects (see the review
by Forsyth et al. 1991) is not at odds with our results, since the invariant features can
be regarded as a model-based recognition function, and the recognition of 2-D objects is a
recognition scheme for a specific class of objects (see Section 4).
We begin with the general case of a universally consistent recognition function with
respect to viewing position, i.e. a function invariant to viewing position of all possible
objects. The function is assumed to be defined on the orthographic projection of objects
that consist of points in space.
Claim 1: Every recognition function that is universally consistent with respect to viewing position is a constant function.

Proof. A function that is invariant to viewing position by definition yields the same
value for all images of a given object. Clearly, if two objects have a common orthographic
projection, then the function must have the same value for all images of these two objects.
We define a reachable sequence to be a sequence of objects such that each two succes-
sive objects in the sequence have a common orthographic projection. The function must
have the same value for all images of objects in a reachable sequence. A reachable object
from a given object is defined to be an object such that there exists a reachable sequence
starting at the given object and ending at the reachable object. Clearly, the value of the
function is identical for all images of objects that are reachable from a single object.
Every image is an orthographic projection of some 3-D object. In order to prove that
the function is constant on all possible images, all that is left to show is that every two
objects are reachable from one another. This is shown in Appendix 1. □
We have shown that any universal and consistent recognition function is a constant
function. Any non model-based recognition scheme with a universal scope is subject to
the same limitation, since such a scheme is required to be consistent on all the objects in
its scope. Hence, any non model-based recognition scheme with a universal scope cannot
discriminate between any two objects.
Up to now, we have assumed that the recognition function must be entirely consistent.
That is, it must have exactly the same value for all possible images of the same objects.
However, a recognition scheme may be allowed to make errors. We turn next to examine
recognition functions that are less than perfect. In Section 3.1 we consider consistent
functions with respect to viewing position that can have errors on a significant subset
of images. In Section 3.2 we discuss functions that are almost consistent with respect to
viewing position, in the sense that the function values for images of the same object are
not necessarily identical, but only lie within a certain range of values.
The human visual system may fail in some cases to identify correctly a given object
when viewed from certain viewing positions. For example, it might identify a cube from
a certain viewing angle as a 2-D hexagon. The recognition function used by the human
visual system is inconsistent for some images of the cube. The question is whether there
exists a nontrivial universally consistent function, when the requirements are relaxed:
for each object the recognition function is allowed to make errors (some arbitrary values
that are different from the unique value common to all the other views) on a subset of
views. The set should not be large, otherwise the recognition process will fail too often.
Given a function f, for every object x let E_f(x) denote the set of viewing directions
for which f is incorrect (E_f(x) is defined on the unit sphere). The object x is taken to
be a point in R^n. We also assume that objects that are very similar to each other have
similar sets of "bad" viewing directions. More specifically, let us define for each object
x the value Φ(x, ε) to be the measure (on the unit sphere) of all the viewing directions
for which f is incorrect on at least one object in the neighborhood of radius ε around x.
That is, Φ(x_0, ε) is the measure of the set ⋃_{x ∈ B(x_0, ε)} E_f(x). We can now show that even
if Φ(x, ε) is rather substantial (i.e. f makes errors on a significant number of views), f
is still the trivial (constant) function. Specifically, assuming that for every x there exists
an ε such that Φ(x, ε) < D (where D is about 14% of the possible viewing directions),
then f is a constant function. The proof of this claim can be found in Moses & Ullman
(1991).
A threshold function is usually used to determine whether the value indicates a given
object.
Let an object neighborhood be the range to which a given object is mapped by such an
"almost consistent" function. Clearly, if the neighborhood of an object does not intersect
the neighborhoods of other objects, then the function can be extended to be a consistent
function by a simple composition of the threshold function with the almost consistent
function. In this case, the result of the general case (Claim 1) still holds, and the function
must be the trivial function.
If the neighborhoods of two objects, a and b, intersect, then the scheme cannot dis-
criminate between these two objects on the basis of images that are mapped to the
intersection. In this case the images mapped to the intersection constitute a set of im-
ages for which f is inconsistent. If the assumption from the previous section holds, then
f must be again the trivial function.
We have shown that an imperfect universal recognition function is still a constant
function. It follows that any non model-based recognition scheme with a universal scope
cannot discriminate between objects, even if it is allowed to make errors on a significant
number of images.
4 Consistent recognition functions for a class of objects
So far we have assumed that the scope of the recognition scheme was universal. That
is, the recognition scheme could get as its input any set of (pointwise) 3-D objects.
The recognition functions under consideration were therefore universally consistent with
respect to viewing position. Clearly, this is a strong requirement. In the following sections
we consider recognition schemes that are specific to classes of objects. The recognition
function, in this case must still be consistent with respect to viewing position, but only
for objects that belong to the class in question. That is, the function must be invariant
to viewing position for images of objects that belong to a given class of objects, but can
have arbitrary values for images of objects that do not belong to this class.
The possible existence of a nontrivial consistent recognition function for an object
class depends on the particular class in question. In Section (4.1) we consider a simple
class for which a nontrivial consistent function (with respect to viewing position) still
does not exist. In Section (4.2) we discuss the existence of consistent functions for certain
infinite classes of objects. We show that when a nontrivial consistent function exists, an
upper bound on its discrimination power can be determined. Finally, we use
the class of symmetric objects (Section 4.3) in order to demonstrate the existence of
consistent function for an infinite class of objects and its discrimination power.
4.1 The class of a prototypical object
In this section, we consider the class of objects that are defined by a generic object.
The class is defined to consist of all the objects that are sufficiently close to a given
prototypicM object. For example, it is reasonable to assume that all faces are within a
certain distance from some prototypicM face. The class of prototypical objects composed
of n points in space, can be thought of as a sphere in R 3n around the prototypicM object.
The results established for the unrestricted case hold for such classes of objects. That
is, every consistent recognition function with respect to viewing position of all the objects
that belong to a class of a given prototypical object is a constant function. The proof for
this case is similar to the proof of the general case in Claim 1.
4.2 Discrimination power
Clearly, some class invariants exist. A simple example is the class of eight-point objects
with the points lying on the corners of some rectangular prism, together with the class
of all three-point objects (since at least four points of the eight-point object will always
be visible). In this example the function is consistent for the class: all the views of
a given object will be mapped to the same value. However, the function has a limited
discrimination power; it can only distinguish between two subclasses of objects. In this
section we examine further the discrimination power of a recognition function.
Given a class of objects, we first define a reachability partition into equivalence sub-
classes. Two objects are within the same equivalence subclass if and only if they are
reachable from each other. Reachability is clearly an equivalence relation and therefore
it divides the class into equivalence subclasses. Every function f induces a partition of
its domain into equivalence subclasses. That is, two objects, a and b, belong to the same
equivalence subclass if and only if f(a) = f(b). Every consistent recognition function must
have an identical value for all objects in the same equivalence subclass defined by the reach-
ability partition (the proof is the same as in Claim 1). However, the function can have
different values for images of objects from different subclasses. Therefore, the reachability
partition is a refinement of any partition induced by a consistent recognition function.
That is, no consistent recognition function can discriminate between objects within
the same reachability partition subclass.
The reachability subclasses in a given class of objects determine the upper bound
on the discrimination power of any consistent recognition function for that class. If the
number of reachability subclasses in a given class is finite, then it is the upper bound for
the number of values in the range of any consistent recognition function for this class.
In particular, it is the upper bound for the number of objects that can be discriminated
by any consistent recognition function for this class. Note that the notion of reachability
and, consequently, the number of equivalence classes, is independent of the particular
recognition function. If the function discrimination power is low, the function is not very
helpful for recognition but can be used for classification, the classification being into the
equivalence subclasses.
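When the scope is finite, the reachability partition can be computed explicitly: union-find over pairs of objects that share a projection yields the equivalence subclasses, and their count is exactly the upper bound on the number of discriminable objects. A toy sketch of this idea (the parity predicate stands in for a real shared-projection test; none of this is from the paper):

```python
# Sketch: reachability is the transitive closure of "share an orthographic
# projection"; union-find groups a finite scope into equivalence subclasses.
def equivalence_subclasses(objects, shares_projection):
    objects = list(objects)
    parent = {o: o for o in objects}
    def find(o):                      # find root with path halving
        while parent[o] != o:
            parent[o] = parent[parent[o]]
            o = parent[o]
        return o
    for a in objects:                 # union every directly-reachable pair
        for b in objects:
            if a != b and shares_projection(a, b):
                parent[find(a)] = find(b)
    classes = {}
    for o in objects:
        classes.setdefault(find(o), []).append(o)
    return list(classes.values())

# Toy scope: integers; pretend two objects share a projection iff same parity.
subclasses = equivalence_subclasses(range(6), lambda a, b: (a - b) % 2 == 0)
print(len(subclasses))  # 2: a consistent function takes at most 2 values here
```

The subclass count is independent of any particular recognition function, mirroring the remark above that reachability depends only on the class.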
In a non model-based recognition scheme, a consistent function must assign the same
value to every two objects that are reachable within the scope of the scheme. In contrast,
a recognition function in a model-based scheme is required to assign the same value to
every two objects that are reachable within the set of objects that the function must
in fact recognize. Two objects can be unreachable within a given set of objects but
reachable within the scheme's full scope. A recognition function can therefore discriminate
between two such objects in a model-based scheme, but not in a non model-based scheme.
4.3 The class of symmetric objects
The class of symmetric objects is a natural class to examine. For example, schemes for
identifying faces, cars, tables, etc., all deal with symmetric (or approximately symmetric)
objects. Every recognition scheme for identifying objects belonging to one of these classes,
should be consistent only for symmetric objects.
In the section below we examine the class of bilaterally symmetric objects. We will
determine the reachability subclasses of this class, and derive explicitly a recognition
function with the optimal discrimination power. We consider images such that for every
point in the image, its symmetric point appears in the image.
Claim 2: Every two symmetric objects a and b are reachable if and only if h(a) = h(b).
(The proof of this claim can be found in Moses & Ullman (1991).)
It follows from this Claim that a consistent recognition function with respect to view-
ing position defined for all symmetric objects, can only discriminate between objects that
differ in the relative distance of symmetric points.
objects regardless of the illumination condition and viewing position must be model-
based.
6 Conclusion
Appendix 1
In this Appendix we prove that in the general case every two objects are reachable from
one another.
First note that the projection of two points, when viewed from the direction of the
vector that connects the two points, is a single point. It follows that for every object with
n - 1 points there is an object with n points such that the two objects have a common
orthographic projection. Hence, it is sufficient to prove the following claim:
Claim 4: Any two objects that consist of the same number of points in space are
reachable from one another.
Proof. Consider two arbitrary rigid objects, a and b, with n points. We have to show
that b is reachable from a. That is, there exists a sequence of objects such that every two
successive objects have a common orthographic projection.
Let the first object in the sequence be a_1 = a = (p_1^a, p_2^a, ..., p_n^a) and the last object
be b_1 = b = (p_1^b, p_2^b, ..., p_n^b). We take the rest of the sequence, a_2, ..., a_n, to be the objects
a_i = (p_1^b, p_2^b, ..., p_{i-1}^b, p_i^a, ..., p_n^a). All that is left to show is that for every two successive
objects in the sequence there exists a direction such that the two objects project to the
same image. By the sequence construction, every two successive objects differ by only
one point. The two non-identical points project to the same image point on the plane
perpendicular to the vector that connects them. Clearly, all the identical points project to
the same image independent of the projection direction. Therefore, the direction in which
the two objects project to the same image is the vector defined by the two non-identical
points of the successive objects. □
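The construction in the proof can be checked numerically. The sketch below (our own illustration, not from the paper) builds the interpolating sequence for two toy three-point objects and verifies that each successive pair projects to the same image along the vector joining their differing points:

```python
# Numeric sketch of the Claim 4 construction: interpolate from object a to
# object b one point at a time. Successive objects differ in a single point,
# so they share an orthographic projection along the vector joining it.
def sequence(a, b):
    """a, b: lists of n 3D points; returns a = a_1, ..., a_{n+1} = b."""
    return [b[:i] + a[i:] for i in range(len(a) + 1)]

def project(points, v):
    """Orthographic projection onto the plane perpendicular to direction v."""
    vv = sum(c * c for c in v)
    return [tuple(p[k] - (sum(p[j] * v[j] for j in range(3)) / vv) * v[k]
                  for k in range(3)) for p in points]

def close(P, Q, tol=1e-9):
    return all(abs(x - y) < tol for p, q in zip(P, Q) for x, y in zip(p, q))

a = [(0, 0, 0), (1, 0, 0), (0, 1, 0)]
b = [(0, 0, 1), (2, 0, 0), (0, 3, 5)]
seq = sequence(a, b)
for prev, nxt in zip(seq, seq[1:]):
    i = next(k for k in range(len(a)) if prev[k] != nxt[k])   # differing point
    v = tuple(prev[i][k] - nxt[i][k] for k in range(3))       # projection dir
    assert close(project(prev, v), project(nxt, v))
print(len(seq))  # 4: a, two intermediate objects, b
```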
References
1. Bolles, R.C. and Cain, R.A. 1982. Recognizing and locating partially visible objects: The
local-features-focus method. Int. J. Robotics Research, 1(3), 57-82.
2. Brooks, R.A. 1981. Symbolic reasoning around 3-D models and 2-D images. Artificial
Intelligence J., 17, 285-348.
3. Burns, J.B., Weiss, R. and Riseman, E.M. 1990. View variation of point set and line
segment features. Proc. Image Understanding Workshop, Sep., 650-659.
4. Cannon, S.R., Jones, G.W., Campbell, R. and Morgan, N.W. 1986. A computer vision
system for identification of individuals. Proc. IECON 86, Milwaukee, WI, 1, 347-351.
5. Clemens, D.J. and Jacobs, D.W. 1990. Model-group indexing for recognition. Proc. Image
Understanding Workshop, Sep., 604-613.
6. Forsyth, D., Mundy, J., Zisserman, A., Coelho, C., Heller, A. and Rothwell, C. 1991.
Invariant descriptors for 3-D object recognition and pose. IEEE Trans. on PAMI, 13(10),
971-991.
7. Grimson, W.E.L. and Lozano-Pérez, T. 1984. Model-based recognition and localization
from sparse data. Int. J. Robotics Research, 3(3), 3-35.
8. Grimson, W.E.L. and Lozano-Pérez, T. 1987. Localizing overlapping parts by searching
the interpretation tree. IEEE Trans. on PAMI, 9(4), 469-482.
9. Horn, B.K.P. 1977. Understanding image intensities. Artificial Intelligence J., 8(2), 201-
231.
10. Huttenlocher, D.P. and Ullman, S. 1987. Object recognition using alignment. Proceedings
of ICCV Conf., London, 102-111.
11. Kanade, T. 1977. Computer Recognition of Human Faces. Birkhäuser Verlag, Basel and
Stuttgart.
12. Lowe, D.G. 1985. Three dimensional object recognition from single two-dimensional im-
ages. Robotics Research Technical Report 202, Courant Inst. of Math. Sciences, N.Y.
University.
13. Moses, Y. and Ullman, S. 1991. Limitations of non model-based recognition schemes. AI
Memo No. 1301, The Artificial Intelligence Lab., M.I.T.
14. Phong, B.T. 1975. Illumination for computer generated pictures. Communications of the
ACM, 18(6), 311-317.
15. Poggio, T. and Edelman, S. 1990. A network that learns to recognize three dimensional
objects. Nature, 343, 263-266.
16. Ullman, S. 1977. Transformability and object identity. Perception and Psychophysics,
22(4), 414-415.
17. Ullman, S. 1989. Aligning pictorial descriptions: an approach to object recognition. Cog-
nition, 32(3), 193-254.
18. Wong, K.H., Law, H.H.M. and Tsang, P.W.M. 1989. A system for recognizing human faces.
Proc. ICASSP, 1638-1642.
This article was processed using the LaTeX macro package with ECCV92 style
Constraints for Recognizing and Locating Curved
3D Objects from Monocular Image Features *
1 Center for Systems Science, Dept. of Electrical Engineering, Yale University, New Haven, CT
06520-1968, USA
2 Beckman Institute, Dept. of Computer Science, University of Illinois, Urbana, IL 61801, USA
1 Introduction
* This work was supported by the National Science Foundation under Grant IRI-9015749.
Fig. 1. a. Some viewpoint dependent image features for piecewise smooth objects. b. A t-junction
and the associated geometry.
3.1 T-junctions
First, consider the hypothesis that an observed t-junction is the projection of two limb
points x_1, x_2 as shown in figure 1.b, which provides the following geometric constraints:

    f_i(x_i) = 0
    (x_1 - x_2) · N_i = 0                    (4)
    N_1 · N_2 = cos θ_12,

where i = 1, 2, N_i denotes the unit surface normals, and cos θ_12 is the observed angle
between the image normals. In other words, we have five equations, one observable cos θ_12,
and six unknowns (x_1, x_2). In addition, the viewing direction is given by v = x_1 - x_2.
When another t-junction is found, we obtain another set of five equations in six unknowns
x_3, x_4, plus an additional vector equation: (x_1 - x_2) × (x_3 - x_4) = 0, where only two
of the scalar equations are independent. This simply expresses the fact that the viewing
direction should be the same for both t-junctions. Two observed t-junctions and the
corresponding hypotheses (i.e., "t-junction one corresponds to patch one and patch two",
and "t-junction two corresponds to patch three and patch four") provide us with 12
equations in 12 unknowns. Such a system admits a finite number of solutions in general.
For each solution, the viewing direction can be computed, and the other parameters of
the viewing transformation are easily found by applying eq. (3). Similar constraints are
obtained for t-junctions that arise from the projection of edge points by noting that the
3D curve tangent, given by t = ∇f × ∇g, projects to the tangent of the image contour.
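A quick numeric illustration of the t-junction constraints (4), with the two surface equations f_i(x_i) = 0 omitted: for a configuration built to satisfy the limb geometry, the remaining residuals vanish. The values are arbitrary test data of ours, not from the paper.

```python
# Numeric sketch of constraints (4): pick a viewing direction v, two limb
# points with x1 - x2 parallel to v, and unit normals perpendicular to v;
# the geometric residuals then vanish. Pure-Python vector helpers.
import math

def dot(a, b): return sum(x * y for x, y in zip(a, b))

v  = (0.0, 0.0, 1.0)                      # viewing direction
x1 = (1.0, 2.0, 3.0)
x2 = (1.0, 2.0, 0.0)                      # x1 - x2 is parallel to v
N1 = (1.0, 0.0, 0.0)                      # unit normals, both orthogonal to v
N2 = (math.cos(0.7), math.sin(0.7), 0.0)
d  = tuple(p - q for p, q in zip(x1, x2))

residuals = [
    dot(d, N1),                  # (x1 - x2) . N1 = 0
    dot(d, N2),                  # (x1 - x2) . N2 = 0
    dot(N1, N2) - math.cos(0.7)  # N1 . N2 = cos(theta_12), the observed angle
]
assert all(abs(r) < 1e-12 for r in residuals)
print("constraints satisfied")
```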
3.2 Curvature-L and Three-tangent Junctions
For a piecewise smooth object, curvature-L or three-tangent junctions are observed when
a limb terminates at an edge, and they meet with a common tangent; observe the top
and bottom of a coffee cup, or consider figure 1. Both feature types have the same local
geometry; however, one of the edge branches is occluded at a curvature-L junction.
Fig. 2. The image plane geometry for pose estimation from three-tangent and curvature-L junc-
tions: The curved branch represents the edge while the straight branch represents a limb.
Consider the two edge points x_i, i = 1, 2 formed by the surfaces f_i, g_i that project to
these junctions; x_i is also an occluding contour point for one of the surfaces, say f_i. Note
the image measurements (angles α, β_1 and β_2) shown in figure 2. Since x_i is a limb point
of f_i, the surface normal n_i is aligned with the measured image normal ñ_i. Thus, the angle
α between ñ_1 and ñ_2 equals the angle between n_1 and n_2, or cos α = n_1 · n_2 / (|n_1||n_2|).
Now, define the two vectors Δ = x_1 - x_2 and δ̃ = x̃_1 - x̃_2. Clearly the angle between
ñ_i and δ̃ must equal the angle between n_i and the projection of Δ onto the image plane,
δ, which is given by δ = Δ - (Δ · v̂)v̂, where v̂ = n_1 × n_2 / |n_1 × n_2| is the normalized
viewing direction. Noting that n_i · v̂ = 0, we have |n_i||δ| cos β_i = n_i · Δ. However, δ is of
relatively high degree, and a lower degree equation is obtained by taking the ratio of the
cos β_i equations and using the equation for cos α. After squaring and rearrangement, these
equations

    (n_1 · n_1)(n_2 · n_2) cos²α - (n_1 · n_2)² = 0,
    cos β_1 (n_2 · Δ)(n_1 · n_2) - cos α cos β_2 (n_2 · n_2)(n_1 · Δ) = 0,     (5)
along with the edge equations (2) form a system of six polynomial equations in six
unknowns whose roots can be found; the pose is then determined from (3).
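As a sanity check of the algebra (not the pose solver), one can verify numerically that the relations in (5), under our reading of the garbled originals, follow from the definitions of v̂, δ, α, and β_i: for random normals and edge points both residuals vanish.

```python
# Numeric check: for random unnormalized normals n1, n2 and points x1, x2,
# compute v = n1 x n2 / |n1 x n2|, the projected vector delta, and the angle
# cosines as defined in the text; both polynomial residuals are then zero.
import random, math

def dot(a, b): return sum(x * y for x, y in zip(a, b))
def cross(a, b):
    return (a[1]*b[2] - a[2]*b[1], a[2]*b[0] - a[0]*b[2], a[0]*b[1] - a[1]*b[0])
def norm(a): return math.sqrt(dot(a, a))

random.seed(0)
rnd = lambda: tuple(random.uniform(-1, 1) for _ in range(3))
n1, n2 = rnd(), rnd()                      # unnormalized surface normals
x1, x2 = rnd(), rnd()
D = tuple(p - q for p, q in zip(x1, x2))   # Delta = x1 - x2

c = cross(n1, n2)
v = tuple(x / norm(c) for x in c)          # normalized viewing direction
delta = tuple(d - dot(D, v) * vi for d, vi in zip(D, v))  # projection of Delta

cos_a  = dot(n1, n2) / (norm(n1) * norm(n2))
cos_b1 = dot(n1, D) / (norm(n1) * norm(delta))
cos_b2 = dot(n2, D) / (norm(n2) * norm(delta))

r1 = dot(n1, n1) * dot(n2, n2) * cos_a**2 - dot(n1, n2)**2
r2 = cos_b1 * dot(n2, D) * dot(n1, n2) - cos_a * cos_b2 * dot(n2, n2) * dot(n1, D)
assert abs(r1) < 1e-9 and abs(r2) < 1e-9
print("relations hold")
```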
3.3 Inflections
Inflections (zeros of curvature) of an image contour can arise from either limbs or edges.
In both cases, observing two such points is sufficient for determining object pose.
As Koenderink has shown, a limb inflection is the projection of a point on a parabolic
line (zero Gaussian curvature) [5], and for a surface defined implicitly, this is
where the subscripts indicate partial derivatives. Since both points xl, x2 are limbs,
equation (6) and the surface equation for each point can be added to (5) for measured
values of α, β_1 and β_2 as depicted in figure 2. This system of six equations in x_1, x_2 can
be solved to yield a set of points, and consequently the viewing parameters.
In the case of edges, an image contour inflection corresponds to the projection of an
inflection of the space curve itself or a point where the viewing direction is orthogonal to
the binormal. Space curve inflections typically occur when the curve is actually planar,
and can be treated like viewpoint independent features (vertices). When inflections arise
from the binormal b_i being orthogonal to the viewing direction, as in figure 1, two mea-
sured inflections are sufficient for determining pose. It can be shown that the projection
of b_1 is the image contour normal, and for surfaces defined implicitly, the binormal is
given by b = [tᵀH(g)t]∇f - [tᵀH(f)t]∇g, where H(f) is the Hessian of f. By including
the curve equations (2) with (5) after replacing n_i by b_i, a system of six equations in
x_1, x_2 is obtained. After solving this system, the pose can be readily determined.
3.4 Cusps
Like the other features, observing two cusps in an image is sufficient for determining
object pose. It is well known that cusps occur when the viewing direction is an asymptotic
direction at a limb point, which can be expressed as vᵀH(x_i)v = 0, where the viewing
direction is v = ∇f_1(x_1) × ∇f_2(x_2). While the image contour tangent is not strictly
defined at a cusp (which is after all a singular point), the left and right limits of the
tangent as the cusp is approached will be in opposite directions and are orthogonal to
the surface normal. Thus, the cusp and surface equations can be added to the system (5),
which is readily solved for x_1 and x_2 followed by pose calculation.
Fig. 3.a shows an image of a cylinder with a cylindrical notch and two inflection points
found by applying the Canny edge detector and fitting cubic splines. The edge constraints
of section 3.3 lead to a system of six polynomial equations with 1920 roots. However, only
two roots are unique, and figs. 3.b and 3.c show the corresponding poses. Clearly the pose
in fig. 3.c could be easily discounted with additional image information. As in [6], elim-
ination theory can be used to construct an implicit equation of the image contours of the
intersection curve parameterized by the pose. By fitting this equation to all detected
edgels on the intersection curve using the previously estimated pose as initial conditions
for nonlinear minimization, the pose is further refined as shown in fig. 3.d. Using contin-
uation to solve the system of equations required nearly 20 hours on a SPARCstation
1, though a recently developed parallel implementation running on a network of SPARC-
stations or transputers should be significantly faster. However, since there are only a few
real roots, another effective method is to construct a table offline of α, β_i as a function
of the two edge points. Using table entries as initial conditions to Newton's method, the
same poses are found in only two minutes. Additional examples are presented in [7].
Fig. 3. Pose estimation from two inflection points. Note the scale difference in c.
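The offline-table idea above can be sketched generically: precompute coarse solutions offline, then let the nearest table entry seed Newton's method online. The one-equation system below is only a stand-in for the six-polynomial pose system, and all names are our own illustration.

```python
# Sketch of table-seeded Newton iteration: an offline table maps coarse
# starting values to rough roots; online, a table entry seeds Newton, which
# converges in a few steps instead of requiring expensive continuation.
def newton(f, df, x0, tol=1e-12, max_iter=50):
    x = x0
    for _ in range(max_iter):
        step = f(x) / df(x)
        x -= step
        if abs(step) < tol:
            break
    return x

f  = lambda x: x**3 - 2.0          # "pose equation" stand-in; root = 2**(1/3)
df = lambda x: 3.0 * x**2

# Offline: coarse table of precomputed solutions indexed by starting value.
table = {round(x0, 1): newton(f, df, x0) for x0 in (0.5, 1.0, 1.5, 2.0)}
# Online: a table entry seeds Newton on the actual system.
root = newton(f, df, table[1.0])
assert abs(root**3 - 2.0) < 1e-10
print(round(root, 6))  # 1.259921
```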
Acknowledgments:
Many thanks to Darrell Stam for his distributed implementation of continuation.
References
1. O. Faugeras and M. Hebert. The representation, recognition, and locating of 3-D objects.
Int. J. Robot. Res., 5(3):27-52, Fall 1986.
2. W. E. L. Grimson. Object Recognition by Computer: The Role of Geometric Constraints.
MIT Press, 1990.
3. R. Horaud. New methods for matching 3-D objects with single perspective views. IEEE
Trans. Pattern Anal. Mach. Intelligence, 9(3):401-412, 1987.
4. D. Huttenlocher and S. Ullman. Object recognition using alignment. In International
Conference on Computer Vision, pages 102-111, London, U.K., June 1987.
5. J. Koenderink. Solid Shape. MIT Press, Cambridge, MA, 1990.
6. D. Kriegman and J. Ponce. On recognizing and positioning curved 3D objects from image
contours. IEEE Trans. Pattern Anal. Mach. Intelligence, 12(12):1127-1137, 1990.
7. D. Kriegman, B. Vijayakumar, and J. Ponce. Strategies and constraints for recognizing
and locating curved 3D objects from monocular image features. Technical Report 9201,
Yale Center for Systems Science, 1992.
8. D. G. Lowe. The viewpoint consistency constraint. Int. J. Computer Vision, 1(1), 1987.
9. J. Malik. Interpreting line drawings of curved objects. Int. J. Computer Vision, 1(1), 1987.
10. A. Morgan. Solving Polynomial Systems using Continuation for Engineering and Scientific
Problems. Prentice Hall, Englewood Cliffs, 1987.
11. J. Ponce, S. Petitjean, and D. Kriegman. Computing exact aspect graphs of curved objects:
Algebraic surfaces. In European Conference on Computer Vision, 1991.
12. T. W. Sederberg, D. Anderson, and R. N. Goldman. Implicit representation of parametric
curves and surfaces. Comp. Vision, Graphics, and Image Proces., 28:72-84, 1984.
Polynomial-Time Object Recognition in the
Presence of Clutter, Occlusion, and Uncertainty*
Todd A. Cass
1 Introduction
The task considered here is model-based recognition using local geometric features, e.g.
points and lines, to represent object models and sensory data. The problem is formulated
as matching model features and data features to determine the position and orientation
of an instance of the model. This problem is hard because there are spurious and missing
features, as well as sensor uncertainty. This paper presents improvements and exten-
sions to earlier work[7] describing robust, complete, and provably correct methods for
polynomial-time object recognition in the presence of clutter, occlusion, and sensor un-
certainty.
We assume the uncertainty in the sensor measurements of the data features is bounded.
A model pose 1 is considered feasible for a given model and data feature match if at that
pose the two matched features are aligned modulo uncertainty, that is, if the image of
the transformed model feature falls within the uncertainty bounds of the data feature.
We show that, given a set of model and data features and assuming bounded sensor
uncertainty, there are only a polynomial number of qualitatively distinct poses matching
the model to the data. Two different poses are qualitatively distinct if the sets of feature
matches aligned (modulo uncertainty) by them are different.
The idea is that uncertainty constraints impose constraints on feasible model trans-
formations. Using Baird's formulation for uncertainty constraints[2] we show the feature
This report describes research done at the Artificial Intelligence Laboratory of the Mas-
sachusetts Institute of Technology, and was funded in part by an ONR URI grant under
contract N00014-86-K-0685, and in part by DARPA under Army contract DACA76-85-C-
0010, and under ONR contract N00014-85-K-0124.
1 The pose of the model is its position and orientation, which is equivalent to the transformation
producing it. In this paper pose and transformation will be used interchangeably.
matching problem can be formulated as the geometric problem of analyzing the arrange-
ment of linear constraints in transformation space. We call this approach pose equivalence
analysis. A previous paper [7] introduced the idea of pose equivalence analysis; this paper contributes a simpler explanation of the approach based on linear constraints and transformations, outlines the general approach for the case of 3D and 2D models with 2D data, and discusses the particular case of 2D models and planar transformations to illustrate how the structure of the matching problem can be exploited to develop efficient matching algorithms. This work provides a simple and clean mathematical framework within which to analyze the feature matching problem in the presence of bounded geometric uncertainty, providing insight into the fundamental nature of this type of feature matching problem.
1.1 Robust Geometric Feature Matching
2 Pose verification may use a richer representation of the model and data to evaluate and verify
an hypothesis [3, 10, 15].
are found, including the correct ones. The tractability requirement simply means that a polynomial-time, and hopefully efficient, algorithm exists for the matching procedure. Except for our previous work [6, 7] and recent work by Breuel [4], among existing methods that accurately account for uncertainty, none can both guarantee that all feasible object poses will be found and do so in polynomial time. Those that do account for error and guarantee completeness have expected-case exponential complexity [12].
data, by aligning individual model and data features. Aligning a model feature and a data feature consists of transforming the model feature such that the transformed model feature falls within the geometric uncertainty region for the data feature. We can think of the data as a set of points and uncertainty regions {(p_dj, U_j)} in the plane, where each measured data position is surrounded by some positional uncertainty region U_j. A model feature with position p_mi and a data feature with position p_dj are aligned via a transformation T if T[p_mi] ∈ U_j. Intuitively, the whole problem is then to find single transformations simultaneously aligning, in this sense, a large number of pairs of model and image features.
One of the main contributions of this work, and the key insight of this approach, is the idea that under the bounded uncertainty model there are only a polynomial number of qualitatively different transformations or poses aligning subsets of a given model feature set with subsets of a given data feature set. Finding these equivalence classes of transformations is equivalent to finding all qualitatively different sets of feature correspondences. Thus we need not search through an exponential number of sets of possible feature correspondences as previous systems have, nor consider an infinite set of possible transformations.
In the 2D case the transformations consist of a planar rotation, scaling, and translation. We've said that a model feature m_i and a data feature d_j are aligned by a transformation T iff T[m_i] ∈ U_j. Two transformations are qualitatively similar if and only if they align, in this sense, exactly the same set of feature matches. All transformations which align the same set of feature matches are equivalent; thus there are equivalence classes of transformations. More formally, let Ω be the transformation parameter space, and let T ∈ Ω be a transform. Define φ(T) = {(m_i, d_j) | T[m_i] ∈ U_j} to be the set of matches aligned by the transformation T. The function φ(T) partitions Ω, forming equivalence classes of transformations E_k, where Ω = ∪_k E_k and T ≡ T' ⇔ φ(T) = φ(T'). The entire recognition approach developed in this paper is based upon computing these equivalence classes of transformations, and the set of feature matches associated with each of them.
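For illustration, φ(T) can be evaluated directly for 2D point features. The circular uncertainty regions and the explicit (s1, s2, t1, t2) parameterization below are assumptions of this sketch, not the paper's implementation.

```python
# Minimal sketch of phi(T) for 2D point features under a scaled rotation plus
# translation T = (s1, s2, t1, t2), with circular uncertainty regions of
# radius eps around each data point (assumed here for simplicity).

def apply_T(T, p):
    s1, s2, t1, t2 = T
    x, y = p
    return (s1 * x - s2 * y + t1, s2 * x + s1 * y + t2)

def phi(T, model, data, eps):
    """Set of matches (i, j) aligned by T modulo uncertainty eps."""
    aligned = set()
    for i, pm in enumerate(model):
        qx, qy = apply_T(T, pm)
        for j, pd in enumerate(data):
            if (qx - pd[0]) ** 2 + (qy - pd[1]) ** 2 <= eps ** 2:
                aligned.add((i, j))
    return frozenset(aligned)

model = [(0.0, 0.0), (1.0, 0.0)]
data = [(2.0, 1.0), (3.0, 1.0)]
T_good = (1.0, 0.0, 2.0, 1.0)   # identity rotation, translate by (2, 1)
matches = phi(T_good, model, data, eps=0.1)
```

Two transformations T and T' with phi(T) == phi(T') lie in the same equivalence class E_k under this definition.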
2.1 Relating Pose Space and Correspondence Space
If a model feature m_i and a data feature d_j are to correspond to one another, the set of transformations on the model feature which are feasible can be defined as the set of transformations V_{m_i,d_j} = {T ∈ Ω | T[m_i] ∈ U_j}. Let M = {(m_i, d_j)} be some match set.3 A match set is called geometrically consistent iff ∩_{(m_i,d_j)∈M} V_{m_i,d_j} ≠ ∅, that is, iff there exists some transformation which is feasible for all (m_i, d_j) ∈ M.
The match set given by φ(T) for some transformation T is called a maximal geometrically consistent match set. A match set M is a maximal geometrically consistent match set (or, a maximal match set) if it is the largest geometrically consistent match set at some transformation T. Thus by definition the match set given by φ(T) is a maximal match set. The function φ(T) is a mapping from transformation space to correspondence space, φ : Ω → 2^({m_i} × {d_j}), and there is a one-to-one correspondence between the pose equivalence classes and the maximal match sets given by φ(T): M_k = φ(T)
3 This is also sometimes called a correspondence, or a matching. To clarify terms, we will define
a match as a pair of a model feature and a data feature, and a match set as a set of matches.
The term matching implies a match set in which the model and data features are in one-to-one
correspondence.
iff T ∈ E_k. The function φ(T) partitions the infinite set of possible object poses into a polynomial-sized set of pose equivalence classes, and identifies a polynomial-sized subset of the exponential-sized set of possible match sets.
The important point is that the pose equivalence classes and their associated maxi-
mal match sets are the only objects of interest: all poses within a pose equivalence class
are qualitatively the same; and the maximal geometrically consistent match sets are es-
sentially the only sets of feature correspondences that need be considered because they
correspond to the pose equivalence classes. Note that this implies we do not need to
consider all consistent match sets, or search for one-to-one feature matchings, because
they are simply subsets of some maximal match set, and provide no new pose equiva-
lence classes. However, given a match set we can easily construct a maximal, one-to-one
matching between data and model features[14].
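A generic way to extract such a maximal one-to-one matching from a match set is standard augmenting-path bipartite matching; this sketch is a stand-in for the procedure of [14], not taken from it.

```python
# Extract a maximum one-to-one matching from a (possibly many-to-many)
# match set using classical augmenting-path bipartite matching.

def max_bipartite_matching(match_set, n_model, n_data):
    adj = [[] for _ in range(n_model)]
    for i, j in match_set:
        adj[i].append(j)
    match_of_data = [-1] * n_data   # data feature j -> matched model feature

    def augment(i, seen):
        for j in adj[i]:
            if j not in seen:
                seen.add(j)
                # j is free, or its current partner can be re-routed
                if match_of_data[j] == -1 or augment(match_of_data[j], seen):
                    match_of_data[j] = i
                    return True
        return False

    size = sum(augment(i, set()) for i in range(n_model))
    return size, match_of_data

# Model features 0 and 1 both fall in data feature 0's region; feature 1
# also aligns with data feature 1, so a one-to-one matching of size 2 exists.
size, assignment = max_bipartite_matching({(0, 0), (1, 0), (1, 1)}, 2, 2)
```

This runs in O(V·E) time, which is negligible next to the cost of enumerating the equivalence classes themselves.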
One distinction between this approach, which works in transformation space, and
robust and complete correspondence space tree searches[13, 2] is that for each maximal
geometrically consistent match set (or, equivalently, for each equivalence class of transformations) there is an exponential-sized set (in terms of the cardinality of the match set) of different subsets of feature correspondences which all specify the same set of feasible transformations. Thus the straightforward pruned tree search does too much (exponentially more) work. This is part of the reason why these correspondence-space search techniques had exponential expected-case performance, yet our approach is polynomial.
2.2 Feature Matching Requires Only Polynomial Time
Formalizing the localization problem in terms of bounded uncertainty regions and transformation equivalence classes allows us to show that it can be solved in time polynomial in the size of the feature sets. Cass [6] originally demonstrated this using quadratic uncertainty constraints. This idea can be easily illustrated using the linear vector space of 2D scaled rotations and translations, and the linear constraint formulation used by Baird [2] and recently by Breuel4 [4]. In the 2D case the transformations consist of a planar rotation, scaling, and translation. Any vector s = [s1, s2]^T = [σ cos θ, σ sin θ]^T is equivalent to a linear operator S performing a rigid rotation by an orthogonal matrix R ∈ SO(2) and a scaling by a positive factor σ, where S = σR = σ [cos θ  -sin θ ; sin θ  cos θ] = [s1  -s2 ; s2  s1]. We denote the group of all transformations by Ω, with translations given by t = [t1, t2]^T and scaled rotations given by s = [s1, s2]^T, so a point x is transformed by T[x] = Sx + t, and a transformation T can be represented by a vector T ↔ [s1, s2, t1, t2]^T ∈ ℝ^4.
By assuming k-sided polygonal uncertainty regions U_j and following the formulation of Baird, the uncertainty regions U_j can be described by the set of points x satisfying inequalities (1): (x - p_dj)^T n̂_l ≤ e_l for l = 1, ..., k, and thus by substitution the set of feasible transformations for a feature match (m_i, d_j) is constrained by inequalities (2): (S p_mi + t - p_dj)^T n̂_l ≤ e_l for l = 1, ..., k, which can be rewritten as constraints on the transformation vector [s1, s2, t1, t2]^T as s1 α1 + s2 α2 + t1 n̂_l^x + t2 n̂_l^y ≤ p_dj^T n̂_l + e_l for l = 1, ..., k, with α1 = (p_mi^x n̂_l^x + p_mi^y n̂_l^y) and α2 = (p_mi^x n̂_l^y - p_mi^y n̂_l^x), and where n̂_l is the unit normal vector and e_l the scalar distance describing each linear constraint for l = 1, ..., k. The first set of linear inequalities, (1), delineates the polygonal uncertainty regions U_j by the intersection of k halfplanes, and the second set of inequalities, (2),
4 Thomas Breuel pointed out the value of Baird's linear formulation of the 2D transformation and his use of linear uncertainty constraints.
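As a concrete (hypothetical) sketch, the inequalities (2) for one match reduce to a feasibility test that is linear in the transformation vector [s1, s2, t1, t2]; the axis-aligned square region below is an assumption chosen for illustration.

```python
# Build the linear constraint coefficients of inequalities (2) for one
# feature match and a k-sided polygonal uncertainty region.

def constraint_coeffs(p_m, p_d, normal, e):
    """Return (c, b) such that the constraint reads c . [s1,s2,t1,t2] <= b."""
    (mx, my), (nx, ny) = p_m, normal
    a1 = mx * nx + my * ny          # coefficient of s1
    a2 = mx * ny - my * nx          # coefficient of s2
    b = p_d[0] * nx + p_d[1] * ny + e
    return (a1, a2, nx, ny), b

def feasible(T, p_m, p_d, sides):
    """True iff T aligns model point p_m into the polygon around p_d."""
    return all(
        sum(c * t for c, t in zip(coeffs, T)) <= b
        for coeffs, b in (constraint_coeffs(p_m, p_d, n, e) for n, e in sides)
    )

# Axis-aligned square region of half-width 0.5 around p_d (k = 4 sides):
square = [((1, 0), 0.5), ((-1, 0), 0.5), ((0, 1), 0.5), ((0, -1), 0.5)]
T = (1.0, 0.0, 2.0, 1.0)            # identity rotation, translate (2, 1)
ok = feasible(T, (1.0, 0.0), (3.2, 1.1), square)
```

Collecting these inequalities over all kmn match-side combinations yields the hyperplane arrangement in transformation space that the text analyzes.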
2.3 The Case of 3D Models and 2D Data
Of particular interest is the localization of 3D objects from 2D image data. We'll consider the case where the transformation consists of rigid 3D motion, scaling, and orthographic projection, where a model point p_mi is transformed by T[p_mi] = S p_mi + t with
5 Arrangement is the computational-geometry term for the topological configuration of geometric objects like linear surfaces and polytopes.
6 As was shown in Cass [6] these same ideas apply to cases using non-linear uncertainty constraints, such as circles. The basic idea is the same; however, in these cases we must analyze an arrangement of quadratic surfaces, which is computationally more difficult.
7 To measure the quality of a match set we approximate the size of the largest one-to-one matching contained in a match set by the minimum of the number of distinct image features and distinct model features [14].
S = [s11  s12  s13 ; s21  s22  s23] = σPR  and  P = [1  0  0 ; 0  1  0],
with R ∈ SO(3) and σ > 0 ∈ ℝ. In the case of planar 3D objects this corresponds to the transformation S = [s11  s12 ; s21  s22], s_ij ∈ ℝ, describing the projection of all rotations and scalings of a planar 3D object. The linear constraint formulation applies to any affine transformation [2, 4]. To exploit linear uncertainty constraints on the feasible transformations as before, we must have the property that the transformations form a linear vector space. There are O(k^6 m^6 n^6) elements in the arrangement, and so, analogous to the 2D case, there are O(k^6 m^6 n^6) transformation equivalence classes. Note that the special case of planar rotation, translation, and scaling discussed in the previous section is a restriction of this transformation to those 2 × 2 matrices S satisfying SS^T = σ²I.
To handle the case of 3D non-planar objects we follow this strategy: we compute equivalence classes in an extended transformation space Ω̂, which is a vector space containing the space Ω. After computing transformation equivalence classes we then restrict them back to the non-linear transformation space we are interested in. So consider the vector space Ŝ of 2 × 3 matrices S = [s11  s12  s13 ; s21  s22  s23] ∈ Ŝ, where s_ij ∈ ℝ, and define Ω̂ to be the set of transformations (S, t) ∈ Ω̂ where T[p_mi] = S p_mi + t as before. The set Ω̂ is isomorphic to ℝ^8. Again expressing the uncertainty regions in the form of linear constraints we have (S p_mi + t - p_dj)^T n̂_l ≤ e_l for l = 1, ..., k. These describe k constraint hyperplanes in the linear, 8-dimensional transformation space Ω̂. The kmn hyperplanes due to all feature matches again form an arrangement in Ω̂. Analogous to the 2D case there are O(k^8 m^8 n^8) elements in this arrangement, and O(k^8 m^8 n^8) transformation equivalence classes E_k for this extended transformation.
To consider the case where the general 2 × 3 linear transformation S is restricted to the case S̄ of true 3D motion, scaling, and projection to the image plane, we make the following observations. Each element of this arrangement in ℝ^8 is associated with a maximal match set, so there are O(k^8 m^8 n^8) maximal match sets. To restrict to rigid 3D motion and orthographic projection we intersect the hyperplanar arrangement with the quadratic surface described by the constraint SS^T = σ²I. We still have O(k^8 m^8 n^8) maximal match sets with the restricted transformation, although the equivalence classes are more complicated.
We see from the previous analysis that there are only a polynomial number of qualitatively different model poses aligning the model features with the data features. A simple algorithm for constructing pose hypotheses consists of constructing the transformation equivalence classes, or representative points of them (such as vertices [6]), from
Fig. 1. A real image, the edges, and the correct hypothesis. The dots on the contours are the
feature points used in feature matching. Images of this level of clutter and occlusion are typical.
Due to space limitations we can only outline the algorithm. The interested reader is referred to Cass [8] for a more complete description. The approach taken is to decompose the transformation into the rotational component, represented by the s1-s2 plane, and the translational component, represented by the t1-t2 plane. Equivalence classes of rotation are constructed in the s1-s2 plane, in each of which the same set of match sets are feasible, and these equivalence classes are explored looking for large maximal match sets. For any given rotational equivalence class the match sets can be derived by analyzing translational equivalence classes in the t1-t2 plane.
We explore the transformation space locally by sequentially choosing a base feature match and analyzing the set of other matches consistent with the base match by partially exploring their mutual constraint arrangement. We used angle constraints to eliminate impossible match combinations, but only used position constraints to construct equivalence classes. Empirically, for m model features and n data features we found that analyzing all maximal match sets took ≈ m²n² time. We can get an expected-time speedup by randomly choosing the base matches [10]. This leads to a very good approximate algorithm in which we expect to do ≈ m²n work until a correct base match is found, along with an associated large consistent match set. Experiments8 show empirically that for practical problem instances the computational complexity is quite reasonable, and much lower than the theoretical upper bound.
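The randomized base-match loop can be sketched as follows; `score` is a hypothetical stand-in for the arrangement analysis built around a base match, in the spirit of random sample consensus [10].

```python
# Hedged sketch of the randomized speedup: sample base matches at random and
# stop once one yields a large consistent match set.

import random

def randomized_search(matches, score, threshold, max_trials=1000, seed=0):
    rng = random.Random(seed)
    pool = list(matches)
    for _ in range(max_trials):
        base = rng.choice(pool)
        match_set = score(base)           # maximal match set built on 'base'
        if len(match_set) >= threshold:
            return base, match_set        # large consistent set found
    return None, set()

# Toy example: only base match (0, 0) expands to a large set.
def toy_score(base):
    return {(0, 0), (1, 1), (2, 2)} if base == (0, 0) else {base}

base, found = randomized_search({(0, 0), (5, 3), (7, 2)}, toy_score, 3)
```

Since a correct base match succeeds immediately, the expected number of trials is inversely proportional to the fraction of correct matches in the pool.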
References
1. Alt, H. & K. Mehlhorn & H. Wagener & E. Welzl, 1988, "Congruence, Similarity, and Symmetries of Geometric Objects", In Discrete and Computational Geometry, Springer-Verlag, New York, 3:237-256.
2. Baird, H.S., 1985, Model-Based Image Matching Using Location, MIT Press, Cambridge,
MA.
3. Bolles, R.C & R.A. Cain, 1982, "Recognizing and Locating Partially Visible Objects: The
Local-feature-focus Method", International Journal of Robotics Research, 1(3):57-82.
4. Breuel, T. M., 1990, "An Efficient Correspondence Based Algorithm for 2D and 3D Model Based Recognition", MIT AI Lab Memo 1259.
5. Cass, Todd A., 1988, "A Robust Implementation of 2D Model-Based Recognition", Pro-
ceedings IEEE Conf. on Computer Vision and Pattern Recognition, Ann Arbor, Michigan.
6. Cass, Todd A., 1990, "Feature Matching for Object Localization in the Presence of Uncer-
tainty", MIT AI Lab Memo 1133.
7. Cass, Todd A., 1990, "Feature Matching for Object Localization in the Presence of Uncer-
tainty", Proceedings of the International Conference on Computer Vision, Osaka, Japan.
8. Cass, Todd A., 1991, "Polynomial-Time Object Recognition in the Presence of Clutter,
Occlusion, and Uncertainty", MIT AI Lab Memo No. 1302.
9. Edelsbrunner, H., 1987, Algorithms in Combinatorial Geometry, Springer-Verlag.
10. Fischler,M.A. & R.C. Bolles, 1981, "Random Sample Consensus: A Paradigm for Model
Fitting with Applications to Image Analysis and Automated Cartography", Communica-
tions of the ACM 24(6):381-395.
11. Ellis, R.E., 1989, "Uncertainty Estimates for Polyhedral Object Recognition", IEEE
Int. Conf. Rob. Aut., pp. 348-353.
12. Grimson, W.E.L., 1990, "The combinatorics of object recognition in cluttered environments using constrained search", Artificial Intelligence, 44:121-166.
13. Grimson, W.E.L. & T. Lozano-Perez, 1987, "Localizing Overlapping Parts by Searching the Interpretation Tree", IEEE Trans. on Pat. Anal. & Mach. Intel., 9(4):469-482.
14. Huttenlocher, D.P. & T. Cass, 1992, "Measuring the Quality of Hypotheses in Model-Based
Recognition", Proceedings of the European Conference on Computer Vision, Genova, Italy.
15. Huttenlocher, D.P. & S. Ullman, 1990, "Recognizing Solid Objects by Alignment with an Image," Inter. Journ. Comp. Vision 5(2):195-212.
16. Jacobs, D., 1991, "Optimal Matching of Planar Models in 3D Scenes," IEEE Conf. Comp.
Vis. and Patt. Recog. pp. 269-274.
17. Lowe, D.G., 1986, Perceptual Organization and Visual Recognition, Kluwer Academic Pub-
lishers, Boston, MA.
18. Stockman, G. & S. Kopstein & S. Bennet, 1982, "Matching Images to Models for Registration and Object Detection via Clustering", IEEE Trans. on Pat. Anal. & Mach. Intel. 4(3).
19. Thompson, D. & J.L. Mundy, 1987, "Three-Dimensional Model Matching From an Uncon-
strained Viewpoint", Proc. IEEE Conf. Rob. Aut. pp. 280.
This article was processed using the LaTeX macro package with ECCV92 style
8 Experiments were run with m ∈ [10,100] and n ∈ [0,500]. The uncertainty assumed was ε = 8 pixels and δ = ~.
Hierarchical Shape Recognition Based on 3-D Multiresolution Analysis
1 Introduction
Recent progress in range-finding technology has made it possible to perform direct measurement of 3-D coordinates from object surfaces. Such a measurement provides 3-D surface data as an enormous set of discrete points. Since the raw data are unsuitable for direct interpretation, they must be translated into an appropriate representation. In the recognition of curved objects from depth data, the surface data are usually divided into patches grouped by their normal vector direction. The segmentation, however, often results in failure if the density of the measurement is insufficient to represent a complex object shape. Other segmentation algorithms based on differential geometry [RA1] [PR1] are noise-sensitive, because the calculation of curvature is an intrinsically local computation.
One approach to these problems is to analyze an object as a hierarchy of shape primitives from the coarse level to the fine level. D. Marr stated the importance of multiresolution analysis [De1]. A. P. Witkin introduced the scale-space technique, which generates multiresolution signals by convolving the original signal with Gaussian kernels [Wi1]. Lifchitz applied multiresolution analysis to image processing [LP1].
Another approach, using the structure of geometrical regions and characteristic lines of a curved surface [TL1], seems attractive. However, the relationships between this attractive approach and multiresolution representation have not been well understood.
Our approach to the issue is essentially based on multiresolution analysis. We have introduced a hierarchical symbolic description of contours using scale-space filtering [MK1]. We extend the scale-space approach to 3-D discrete surface data. After a hierarchy of surface regions is computed by scale-space filtering, the representation is matched with 3-D object models.
In section 2, the scale-space signal analysis method is extended to 3-D surface analysis using difference equation computation. The extended scale-space, unfortunately, does not show the monotonicity required for generating a hierarchy, since zero-crossing contours of a 3-D surface often vanish or merge as the scale increases. The resulting scale-space cannot be described by a simple tree as reported in the 2-D case [YP1]. In this paper, we regard this 3-D scale-space filtering as a continuous deformation process which drives the surface curvature to a constant. From this viewpoint, we can generate a hierarchical description by marking a non-monotonic deformation as an exceptional case.
The algorithm of the hierarchical recognition is twofold: KH-description generation and hierarchical pattern matching. In section 3, the topological analysis of region deformation is described and an algorithm to create the KH description is illustrated. We use the Gaussian curvature and the mean curvature as viewer-invariant features to segment the surface into primitives. First, we extract the local topological region deformation of zero-crossing lines of the features, and create the KH description from the deformation. Since the information in the KH description is limited to local changes, further global interpretation is required. In section 4, we add auxiliary information by analyzing the global change of regions to translate the KH description into a tree. The tree generated contains a symbolic description of the shape from the coarser level to the finer level, and the pattern matching is performed efficiently with the tree.
In section 5, examples are shown. We apply the algorithm to several sample data sets from a range finder.
2.1 Definitions of Curvature
The curvatures used in this article are defined as follows.
Suppose a parametric form of a surface X(u, v) = (x(u, v), y(u, v), z(u, v)). A tangential line at X(u, v) is denoted by t(u, v) = du X_u(u, v) + dv X_v(u, v).
The curvature at X along (du, dv) is defined as λ(du, dv) = II(du, dv) / I(du, dv), the ratio of the second fundamental form to the first.
With the directional vectors which maximize and minimize the curvature at the point p as (ξ1, η1) and (ξ2, η2), the maximum curvature κ1, the minimum curvature κ2, the mean curvature H, and the Gaussian curvature K are defined as:
κ1 = λ(ξ1, η1), κ2 = λ(ξ2, η2), H = (κ1 + κ2)/2, and K = κ1 κ2, respectively.
Characteristic contours which satisfy H = (κ1 + κ2)/2 = 0 and K = κ1 κ2 = 0 are called the H0 contour and the K0 contour, respectively.
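Numerically, K and H follow from the coefficients E, F, G of the first fundamental form and L, M, N of the second; the closed forms below are the standard ones and are added here for concreteness, not taken from the paper.

```python
# Principal, mean, and Gaussian curvatures from fundamental-form coefficients.

import math

def curvatures(E, F, G, L, M, N):
    denom = E * G - F * F
    K = (L * N - M * M) / denom                     # Gaussian curvature k1*k2
    H = (E * N - 2 * F * M + G * L) / (2 * denom)   # mean curvature (k1+k2)/2
    root = math.sqrt(max(H * H - K, 0.0))
    return H + root, H - root, H, K                 # k1, k2, H, K

# Unit sphere: E = G = 1, F = 0, L = N = 1, M = 0 gives k1 = k2 = H = K = 1.
k1, k2, H, K = curvatures(1, 0, 1, 1, 0, 1)
```

The signs of K, κ1, and κ2 computed this way are exactly what the KH-element triplet in section 3.4 records.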
∂Φ/∂t = ∂²Φ/∂u² + ∂²Φ/∂v²,  ...(1)
where Φ(u, v) = (x(u, v), y(u, v), z(u, v)) is a parametric representation of a surface. This equation is approximated by a difference equation:
Φ(u, t + Δt) = Φ(u, t) + Δt (Φ(u - Δu, t) - 2Φ(u, t) + Φ(u + Δu, t)) / Δu².  ...(2)
The extended form of the equation for an arbitrary number of points is:
Φ_i0(t + Δt) = Φ_i0(t) + Δt Σ_k (Φ_ik(t) - Φ_i0(t)) / l_ik²,  ...(3)
where Φ_i0, {Φ_ik}, and l_ik are a sample point, its neighbour samples, and the distance between Φ_i0 and Φ_ik, respectively.
Iterating (3), the curvature at each sample point converges to a constant.
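One iteration of this discrete smoothing can be sketched on a point set with a neighbour graph; the inverse-square-distance weights normalized into a weighted average are a stabilizing choice of this sketch, not necessarily the paper's exact scheme.

```python
# One diffusion-style smoothing iteration over 3-D sample points: each point
# moves toward the distance-weighted average of its graph neighbours.

def smooth_step(points, neighbours, dt=0.5):
    new_pts = []
    for i, p in enumerate(points):
        ws, acc = 0.0, [0.0, 0.0, 0.0]
        for k in neighbours[i]:
            q = points[k]
            d2 = sum((a - b) ** 2 for a, b in zip(p, q)) or 1e-12
            w = 1.0 / d2                      # inverse-square-distance weight
            ws += w
            for c in range(3):
                acc[c] += w * q[c]
        avg = [a / ws for a in acc]
        # move toward the weighted neighbour average (stable for dt <= 1)
        new_pts.append(tuple(p[c] + dt * (avg[c] - p[c]) for c in range(3)))
    return new_pts

# A sharp spike on a 3-point polyline flattens as iterations proceed:
pts = [(0.0, 0.0, 0.0), (1.0, 1.0, 0.0), (2.0, 0.0, 0.0)]
nbrs = {0: [1], 1: [0, 2], 2: [1]}
for _ in range(30):
    pts = smooth_step(pts, nbrs)
```

Because each update is a convex combination of the point and its neighbour average, the iteration is bounded and drives local curvature variation toward zero, mirroring the behaviour the text describes.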
2.3 Choosing Geometric Features
The fact that the curvature of the surface converges to a constant value under the filtering indicates that the concave or convex regions ultimately vanish or merge into a single region as the scale parameter t increases. This behaviour gives us a hint as to which contour must be chosen to segment the curved surface hierarchically.
A convex and a concave region are enclosed by K0 contours, and a valley and a ridge by H0 contours. Since each contour alone is insufficient to characterize a complex surface, we choose both K0 and H0 to segment the surface. With this segmentation, continuous deformation of the contours by the Gaussian filtering forms a scale-space-like image (KH-image) of a 3-D shape as shown in figure 1. Actually, the image does not form a tree because of the non-monotonicity of contour deformation.
3.1 Contour Topology and Hierarchy
Consider the topology of curved contours which are deforming continuously; the state can be characterized by the number of contours and the connections among them. The basic topological changes are generation, connection, inner contact, outer contact, intersection, and disappearance, as indicated in figure 2. The inner contact and outer contact are further divided into contact with other contours and self-contact.
The requirement for creating a hierarchical structure is that the number of regions must increase monotonically as the scale decreases1. Since the requirements are not satisfied, we must impose the following two assumptions on the interpretation of the disappearance of an intersection or contact (Fig. 2 [a2]) which breaks the hierarchy.
Assumption 1 If a contour disappears at the smaller scale, the traces of the contour
1 The parameter scale specifies the direction of deformation in this section. Scale decreases toward leaf nodes.
3.2 K0 and H0 as Contours
3.3 Topological Analysis of K0 and H0
In this section, we summarize the topological changes of K0 and H0 with scale decrement. In this case, the numbers of contours and connection points must increase in our interpretation. Given that K0 does not contact H0 except at a singular point, the actual types of contour deformation topology are limited.
Qualitative changes of K0 and H0 contours are generation, inner contact, outer contact, inner self-contact, and outer self-contact, as shown in 3.1. Figures 3 and 4 indicate the topological changes of H0's and K0's, respectively. Actually, some of these changes do not occur because of the smoothness of a surface. For example, two K0's do not contact except when there exist two H0's between them (figure 3(c)).
3.4 Generating a KH-description
K0's and H0's divide the surface of an object into four types of regions according to the signs of the Gaussian curvature and the mean curvature. Each region is called a KH-element. Figure 6 shows their typical shapes. A KH-element is specified by a triplet of the signs of the Gaussian curvature, the maximum curvature, and the minimum curvature: (+, +, -) is such an example.
The topological changes of K0's and H0's with scale increment, such as the generation or contact of regions, correspond to the inclusion or connection of regions. Our coarse-to-fine representation of an object is a sequence of images described by the local topology of KH-elements which are linked between scales. The description derived is called the KH-description. The basic topological changes of the KH-description shown in figure 3(a) and figure 4(a) are symbolized as shown in figure 3(b) and figure 4(b).
A basic flow to generate a KH-description is the following:
1. Filter the discrete data at each scale.
2. Segment the surface by H0's and K0's and label peaks and pits.
3. Analyze topological changes between the current scale and the previous one.
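The three-step flow can be sketched as a pipeline skeleton; `filter_surface`, `segment_by_K0_H0`, and `diff_topology` are hypothetical stand-ins for the paper's operations, injected here as parameters.

```python
# Skeleton of the KH-description generation loop over an increasing scale
# sequence: filter, segment, then diff the topology against the previous scale.

def build_kh_description(samples, neighbours, scales,
                         filter_surface, segment_by_K0_H0, diff_topology):
    description = []
    prev_labels = None
    for t in scales:
        samples = filter_surface(samples, neighbours, t)   # step 1
        labels = segment_by_K0_H0(samples, neighbours)     # step 2
        if prev_labels is not None:
            description.append(diff_topology(prev_labels, labels))  # step 3
        prev_labels = labels
    return description

# Trivial stand-ins: a no-op filter and the region count as "labels".
desc = build_kh_description([(0, 0, 1)], {}, scales=[1.0, 2.0, 4.0],
                            filter_surface=lambda s, n, t: s,
                            segment_by_K0_H0=lambda s, n: len(s),
                            diff_topology=lambda a, b: (a, b))
```

One topology diff is produced per consecutive pair of scales, which is exactly the per-level content of the KH-description.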
Fig. 6. Surface classification. The surface sign is the triplet (sign of K, sign of κ1, sign of κ2), where κ1 and κ2 are the maximum and minimum normal curvatures, K = κ1 κ2 is the Gaussian curvature, and H = (κ1 + κ2)/2 is the mean curvature; K < 0 corresponds to a saddle surface.
Fig. 7. Primitive surface elements.
4 Shape Matching
In this section, an efficient shape matching algorithm using the KH-description is proposed. The KH-description describes the local topological changes of a contour caused by scale change. When the description is parsed, regions surrounded by connected regions (figure 2b (E)(F)) cannot be recognized. To recognize these regions correctly, the contour-based KH-description is translated into a region-based network description. The new description is parsed along the tree structure extracted from it to improve the matching efficiency.
4.1 Detection of Region Loops
The topological changes of a contour are described with the KH-description, but the qualitative changes of a region are not described in it, because it does not contain global topological changes. If a group of regions forms a loop as shown in figure 5, a single region will be divided into two separate regions. Since it is difficult to detect these closed loops from a single-scale KH-image, the loops should be searched for by coarse-to-fine tracing in the KH-description. Region loops found by the search are added to the description. These qualitative changes are shown in Fig. 2: (E) shows the region created by inner contacts and (F) shows that created by outer contacts. Thus the global topological information can be added to the KH-description.
Fig. 8. Filtering and surface classification for face data: (a) a face image (left) derived from a range finder, and its filtered image (right); (b) surface classification for different scales.
Fig. 10. (a) HAND-1 (left) and HAND-2 (right) measured with the range finder; (b) hierarchical description derived from HAND-1.
4.3 Matching
The matching algorithm is essentially based on the PS-tree. Since the PS-tree loses the details of shape information at the leaf level, the KH-description can be compared if needed. The outline of shape matching is the following.
step 1. Add the number and distribution of regions to the KH-description.
step 2. Extract PS-elements from the KH-description to generate a PS-tree.
step 3. Compare the PS-trees from the root node.
step 4. Match the KH-elements corresponding to the PS-elements if detailed examination is required.
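For illustration only, the coarse-to-fine comparison of steps 3 and 4 can be sketched as follows. The PSNode class, its labels, and the ordered-children comparison are assumptions of this sketch, not the authors' data structure:

```python
# Illustrative sketch of the coarse-to-fine matching in steps 3-4.
# Node labels ("bump", "dent", ...) and the tree layout are assumptions;
# the paper's PS-tree carries richer attributes.

class PSNode:
    def __init__(self, label, children=None):
        self.label = label              # PS-element type, e.g. "bump"
        self.children = children or []  # finer-scale sub-elements

def match_ps_trees(a, b):
    """Return True if the two PS-trees agree from the root downward."""
    if a.label != b.label:
        return False                    # coarse-level mismatch: stop early
    if len(a.children) != len(b.children):
        return False
    # Step 4 would refine each matched pair via the KH-elements; here we
    # simply recurse over the children in order.
    return all(match_ps_trees(ca, cb)
               for ca, cb in zip(a.children, b.children))

t1 = PSNode("bump", [PSNode("dent"), PSNode("bump")])
t2 = PSNode("bump", [PSNode("dent"), PSNode("bump")])
t3 = PSNode("bump", [PSNode("dent"), PSNode("dent")])
print(match_ps_trees(t1, t2))  # True
print(match_ps_trees(t1, t3))  # False
```

Because the comparison proceeds from the root, a mismatch at a coarse level prunes the whole subtree, which is the source of the claimed matching efficiency.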
5 Experimental Results
5.1 Example 1: A Face
A set of range data from a laser range finder was used to validate our algorithm. Figure 8 (a) shows the filtering results: the original face data (left) and its filtered image (right). Figure 8 (b) shows labeled KH-images at different scales. From these KH-images, the KH-description is created by adding the number and relation of regions as shown in figure 8 (b). The symbols [A]-[L] in the KH-description correspond to the labels of regions in (a). Circles and filled circles in the KH-description indicate peaks and pits, respectively. Frames enclosing peaks and pits represent H0 contours.
In the example, moustaches are found on the surface of an egg-like shape at the coarser level, and then eyes, nose, and cheeks are found by tracing the KH-description downward to the finer level. The KH-description reflects the shape over the whole range of scales. By extracting PS-elements from the KH-description and by linking them, the description can be translated into a PS-tree.
5.2 Example 2: Hands
In this section, similar objects are analyzed to evaluate the hierarchical similarity of the descriptions. Subjects were requested to imitate the shape of a prototype hand; the hands were measured by the range finder. Figure 10 (a) shows a pair of data sets, HAND-1 (left) and HAND-2 (right). Both data sets are filtered and analyzed to create labeled KH-descriptions. Figure 10 (b) shows KH-images at different scales. Figure 11 (a) and (b) are the KH-descriptions derived from HAND-1 and HAND-2, respectively. The symbols A-H in figures 10 and 11 mark corresponding regions.
Comparing the KH-descriptions and the PS-trees, the KH-element distributions of subparts r1 and r2 in figure 10 (a) evidently differ from those in (b) at the coarser level, although the PS-elements A-I are common at the finer level. This fact indicates that the PS-description is more useful than the KH-description for matching objects. Figure 11 shows the common PS-subtrees extracted from figure 10. The nodes dent, bump, dent_b, and bump_d indicate PS-elements introduced in figure 6.
6 Conclusions
In this paper, we have extended the scale-space approach used in 1-D signal analysis
to 3-D shape analysis. Scale-space filtered surface images of an object are divided into
segments based on curvature analysis and linked between consecutive scales.
References
This article was processed using the LaTeX macro package with ECCV92 style
Object Recognition by Flexible Template Matching
using Genetic Algorithms *
A. Hill, C. J. Taylor and T. Cootes.
Department of Medical Biophysics, University of Manchester,
Oxford Road, Manchester M13 9PT, England.
provide a good explanation (or interpretation) of the image. Our results demonstrate
the feasibility of this approach.
2 Genetic Algorithms
GAs employ mechanisms analogous to those involved in natural selection to conduct
a search through a given parameter space for the global optimum of some objective
function. The main features of the approach are as follows :
- A point in the search space is encoded as a chromosome.
- A population of N chromosomes/search points is maintained.
- New points are generated by probabilistically combining existing solutions.
- Optimal solutions are evolved by iteratively producing new generations of chromosomes using a selective breeding strategy based on the relative values of the objective function for the different members of the population.
A solution x = (x1, x2, ..., xn), where the xi are in our case the model parameters, is encoded as a string of genes to form a chromosome representing an individual. In many applications the gene values are in {0, 1} and the chromosomes are simply bit strings. An objective function, f, is supplied which can decode the chromosome and assign a fitness value to the individual the chromosome represents.
Given a population of chromosomes the genetic operators crossover and mutation
can be applied in order to propagate variation within the population. Crossover takes
two parent chromosomes, cuts them at some random gene/bit position and recombines
the opposing sections to create two children e.g. crossing the chromosomes 010-11010
and 100-00101 at position 3-4 gives 010-00101 and 100-11010. Mutation is a
background operator which selects a gene at random on a given individual and mutates
the value for that gene (for bit strings the bit is complemented).
The search for an optimal solution starts with a randomly generated population of chromosomes; an iterative procedure is used to conduct the search. For each iteration, a process of selection from the current generation of chromosomes is followed by application of the genetic operators and re-evaluation of the resulting chromosomes. Selection allocates a number of trials to each individual according to its relative fitness value f_i / f_avg, where f_avg = (f_1 + f_2 + ... + f_N) / N. The fitter an individual, the more trials it will be allocated, and vice versa. Average individuals are allocated only one trial.
Trials are conducted by applying the genetic operators (in particular crossover) to
selected individuals, thus producing a new generation of chromosomes. The algorithm
progresses by allocating, at each iteration, ever more trials to the high performance areas
of the search space under the assumption that these areas are associated with short
sub-sections of chromosomes which can be recombined using the random cut-and-mix
of crossover to generate even better solutions.
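The selection, crossover, and mutation cycle described above can be sketched as a minimal bit-string GA. This is an illustration under stated assumptions, not the authors' implementation: the fitness function, parameter defaults, and the roulette-wheel selection details are placeholders.

```python
import random

def crossover(p1, p2):
    """One-point crossover: cut both parents and swap the tails."""
    cut = random.randrange(1, len(p1))
    return p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]

def mutate(chrom, rate):
    """Complement each bit independently with probability `rate`."""
    return [b ^ 1 if random.random() < rate else b for b in chrom]

def evolve(fitness, n_bits=16, pop_size=50, c_rate=0.6, m_rate=0.005,
           generations=100):
    pop = [[random.randint(0, 1) for _ in range(n_bits)]
           for _ in range(pop_size)]
    for _ in range(generations):
        scores = [fitness(c) for c in pop]
        total = sum(scores)
        # Fitness-proportionate selection: trials allocated by f_i / f_avg.
        def select():
            r = random.uniform(0, total)
            acc = 0.0
            for c, s in zip(pop, scores):
                acc += s
                if acc >= r:
                    return c
            return pop[-1]
        nxt = []
        while len(nxt) < pop_size:
            a, b = select(), select()
            if random.random() < c_rate:
                a, b = crossover(a, b)
            nxt.append(mutate(a, m_rate))
            nxt.append(mutate(b, m_rate))
        pop = nxt[:pop_size]
    return max(pop, key=fitness)

# Toy objective: maximise the number of 1-bits ("one-max").
best = evolve(fitness=sum)
print(sum(best))  # close to n_bits for this easy problem
```

The population sizes and rates used in the paper's experiments (N = 50, C = 0.6, M = 0.005) are reflected in the defaults above.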
When we use the objective function to evaluate instances of the model we seek to minimise f, thus favouring solutions with strong edges (g large) of equal magnitude (|g_i/g_avg - 1| → 0) located close to the boundary position predicted by the model (p_i → 0).
6 Results
In order to assess the method we employed 2 frames from each of ten echocardiogram
time sequences. The frames showed the LV in its most extended and contracted state
for each sequence and the LV boundary was delineated on each frame by an expert.
The images were interpreted using the model-based approach described above and the
resulting LV boundaries were compared with the expert-generated boundaries using the mean pixel distance between boundaries, A = (1/P) Σ_{i=1..P} sqrt((a_{i,x} - b_{j,x})^2 + (a_{i,y} - b_{j,y})^2), where b_j = (b_{j,x}, b_{j,y}) is the point on the expert boundary closest to the point a_i on the GA-generated boundary. The control parameters for the GA were: population size (N) = 50, crossover rate (C) = 0.6 and mutation rate (M) = 0.005. A limit of 5000 objective function evaluations was employed. For 19 of the 20 images A < 6, and for 14 of the 20 images A <= 4. A typical result is shown in figure 2.
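The accuracy measure A is simply the mean distance from each GA-generated boundary point to its nearest expert point, and can be computed directly (the point lists below are illustrative):

```python
from math import hypot

def mean_boundary_distance(gen, expert):
    """Mean distance from each generated point to its closest expert point.
    `gen` and `expert` are lists of (x, y) pixel coordinates."""
    return sum(min(hypot(ax - bx, ay - by) for bx, by in expert)
               for ax, ay in gen) / len(gen)

gen = [(0, 0), (2, 0)]
expert = [(0, 1), (2, 1)]
print(mean_boundary_distance(gen, expert))  # 1.0
```

Note that the measure is asymmetric: it penalises generated points far from the expert boundary, but not expert points missed by the generated boundary.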
7 Niches and Species
During the course of the above experiments it became obvious that one of the major
problems for the GA search was premature convergence to sub-optimal solutions in the
presence of multiple plausible interpretations. Rather than extract a single solution, we
would like the GA search to extract a handful of strong, ranked candidates for the object
we wish to locate. The problem of locating multiple optima when using GAs is discussed
by Goldberg [2]. The approach adopted is to reduce the number of individuals in
over-crowded areas of the search space by modifying their function values.
In this case the fitness of an individual is weighted by the number of neighbours the
individual has. The more neighbours an individual has the worse its fitness value. The
number of individuals allowed to crowd into the various areas of the search space is
thus proportional to the relative fitness value associated with the different areas of the
search space. In order to maintain stable sub-populations two further modifications are
necessary. The first is simply to increase the size of the population in order to prevent
extinction due to sampling errors. The second is to implement a restricted mating strategy
in order to promote speciation i.e. prefer neighbours to distant individuals for crossover.
The key to the entire procedure is the ability to decide "Who is a neighbour?", i.e. how close two points are in the search space. We define two individuals x_1, x_2 to be neighbours if they lie within a sub-space defined by the global transformation parameters, translation (tx, ty), scaling (s) and rotation (φ):

    |tx_1 - tx_2| <= δtx,  |ty_1 - ty_2| <= δty,  |s_1 - s_2| <= δs,  |φ_1 - φ_2| <= δφ
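The neighbourhood test, and the fitness weighting by crowd size described above, can be written directly. The tolerance values and the divide-by-crowd-size sharing rule are illustrative assumptions; the paper does not give exact constants:

```python
def are_neighbours(a, b, d_tx=5.0, d_ty=5.0, d_s=0.1, d_phi=0.1):
    """Two individuals are neighbours if all four global transformation
    parameters (translation, scale, rotation) lie within tolerance.
    Tolerances here are placeholders."""
    return (abs(a["tx"] - b["tx"]) <= d_tx and
            abs(a["ty"] - b["ty"]) <= d_ty and
            abs(a["s"] - b["s"]) <= d_s and
            abs(a["phi"] - b["phi"]) <= d_phi)

def shared_fitness(raw, individual, population):
    """Fitness sharing: the more neighbours an individual has, the worse
    its fitness value (here: divide the raw value by the crowd size)."""
    crowd = sum(1 for other in population if are_neighbours(individual, other))
    return raw / max(crowd, 1)

x1 = {"tx": 0, "ty": 0, "s": 1.0, "phi": 0.0}
x2 = {"tx": 1, "ty": 2, "s": 1.05, "phi": 0.02}
print(are_neighbours(x1, x2))  # True
```

With this weighting, the number of individuals that can persist in each region of the search space becomes proportional to that region's relative fitness, as the text describes.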
(Figure: result curves for group 1, group 2, group 3, group 4.)
1 Introduction
An ARG is an attributed and weighted graph which we denote by a triple g = (d, r1, r2). In this notation, d = {1, ..., m} represents a set of m nodes; r1 = {r1(i) | i ∈ d} is a set of unary relations defined over d in which r1(i) = [r1^(1)(i), ..., r1^(K1)(i)] is a vector consisting of K1 different types of unary relations; r2 = {r2(i, j) | (i, j) ∈ d^2} is a set of binary (bilateral) relations defined over d^2 = d × d in which r2(i, j) = [r2^(1)(i, j), ..., r2^(K2)(i, j)] is a vector consisting of K2 different types of binary relations.
* This work was supported by IED and SERC, project number IED-1936. The images were
kindly provided by the Defence Research Agency, RSRE, UK.
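The triple g = (d, r1, r2) translates directly into a data structure. As a sketch only, assuming unary and binary relation vectors are stored as plain tuples keyed by node index and node pair:

```python
from dataclasses import dataclass, field

@dataclass
class ARG:
    """Attributed relational graph g = (d, r1, r2)."""
    nodes: list                                  # d = {1, ..., m}
    unary: dict = field(default_factory=dict)    # r1[i]      -> K1-vector
    binary: dict = field(default_factory=dict)   # r2[(i, j)] -> K2-vector

g = ARG(nodes=[1, 2, 3])
g.unary[1] = (0.7,)                   # e.g. a line attribute
g.binary[1, 2] = (0.3, 12.0, 15.5)    # e.g. relative orientation, end-point
                                      # distance, midpoint distance
print(len(g.nodes))  # 3
```

The three example binary attributes mirror the features named in the next paragraph: relative orientation plus two inter-line distances.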
As binary features we use the relative orientation between lines, which is invariant to rotation, translation and scale changes, and the minimum distance between the end points of the lines and the distance between their midpoints, which are invariant to rotation and translation but not to scale changes. Therefore, our graph matching is independent of translation and rotation, but not of scale changes.
In the following discussion of inexact matching from one ARG to another, lower-case
notation refers to instances of ARGs built from scenes while upper-case refers to instances
of ARGs derived from models.
Given the scene ARG and the model ARG, the sought matching is a mapping from scene nodes to model nodes which optimises a certain criterion. We add a 0 node to the model to allow for unmatched lines from the scene. Further, we allow many-to-one matches. We formulate the problem as one of continuous relaxation labelling [1, 2, 6]. The set L of possible matches consists of those matchings for which the difference in orientation between the matched lines is within a certain range. (The orientation of each line is approximately known, either from the plane route or the map.) Associated with match (i, I) is a real number f(i, I) ∈ [0, 1] called the state of the match (i, I). It represents the certainty with which i ∈ d is mapped to I ∈ D. The set of these numbers is called the state of matching or mapping. A constraint on f is

    Σ_{I | (i, I) ∈ L} f(i, I) = 1    ∀ i ∈ d    (1)
The optimal mapping f* maximises a global gain functional E(f), where H ∈ [0, 1) is a bias controlling the excitatory (> 0) / inhibitory (< 0) behaviour of the compatibility terms and the α^(k) > 0 are parameters determined experimentally.
4 The Algorithm
The solution f* is required to be unambiguous: f(i, I) ∈ {0, 1}. Finding the solution is a combinatorial optimisation problem. To perform this, we decided to use the continuous relaxation labelling method [1, 2, 6]. In this method, a global combinatorial solution is achieved through propagation of local information. The algorithm is inherently parallel and distributed, and thus could be efficiently implemented on SIMD architectures.
For continuous relaxation labelling, a final solution is reached by allowing f ∈ [0, 1]. We introduce a time parameter into f such that f = f(t). We are interested in constructing a dynamic system that can locate as the final solution a maximum of E(f).
We require that the energy E should not decrease along its trajectory f(t) as the system evolves, i.e. dE/dt >= 0. The process starts with an initial state f(0) which we set at random. At time t, the gradient function q(t) is computed. The component q(i, I) is the support for the match (i, I) from all other matches. Based on the gradient, the state f(t) is updated to increase the energy using an updating rule. It is meant to update f(t) by an appropriate amount in the appropriate direction. In the adopted algorithm, the computation of the next state f(t+1) from f(t) and q(t) is based on the gradient projection (GP) operation described in [1, 4].
The iteration terminates when the algorithm converges, i.e. when f(t+1) = f(t). The final labelling is in general unambiguous, i.e. f*(i, I) = 1 or 0. This means that each node has a single interpretation. For details of the algorithm and its convergence properties, readers are referred to [1].
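A schematic version of the relaxation iteration can be sketched as projected gradient ascent. Note this is an illustration only: the paper uses the gradient projection operation of [1, 4], for which a simple clip-and-renormalize step stands in here, and the toy support function is an assumption.

```python
import numpy as np

def relax(q_fn, n_scene, n_model, steps=100, lr=0.1):
    """Continuous relaxation labelling by projected gradient ascent.
    f[i, I] is the state of match (i, I); each row is kept on the simplex
    so that sum_I f[i, I] = 1, as required by constraint (1)."""
    rng = np.random.default_rng(0)
    f = rng.random((n_scene, n_model))
    f /= f.sum(axis=1, keepdims=True)        # random initial state f(0)
    for _ in range(steps):
        q = q_fn(f)                          # support q(i, I) for each match
        f = np.clip(f + lr * q, 0.0, None)   # move uphill, stay non-negative
        f /= f.sum(axis=1, keepdims=True)    # re-impose constraint (1)
    return f

# Toy support function: encourage matching scene node i to model node i.
def q_fn(f):
    return np.eye(*f.shape)

f_star = relax(q_fn, 3, 3)
print(f_star.argmax(axis=1))  # [0 1 2]
```

With a constant diagonal support the state converges toward the identity mapping, mimicking how accumulated evidence pulls each scene node toward a single interpretation.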
5 Experimental results
Figure 1a shows one of the images we used and Figure 1b part of the corresponding map. A set of line segments, shown in Figure 1c, is extracted from the map. These are used to build a model ARG. This model ARG is used for all experiments presented in this paper. Shown in Figure 1d are the line segments of the image extracted by a preprocessing procedure [5]. The scale of the scene to the model is about 50:26. We have broken long lines in the model into short ones on purpose. They are broken in such a way that their length is about the same as the average length of those in the scene. This is necessary when distance relations are used as constraints for matching. A scene ARG is extracted from these scene line segments.
Table 1 shows the iterative process of matching the image lines in Figure 1d to the model lines in Figure 1c, with the standard value H = 0.3. At time t = 0, the initial
correspondences are all given nearly equal strength. From t = 1 to 25, all scene nodes
(lines) are maximally associated to node 0 ( N U L L ) of the model. This is because during
the first few iterations, there is not enough supporting evidence for n o n - N U L L matches.
Meaningful matches emerge from t = 26 onwards as evidence is gathered and propagated.
The algorithm converges at t = 30. There are basically no wrong matches in the presented
result. The computation on a SUN-4 workstation takes about 210 seconds of user time and 6 seconds of system time for the matching results shown in the above table.
To test the algorithm, we performed several experiments by adding extra lines to the scene and subtracting others. There were basically no wrong matches in the results. Since precise knowledge of the scale of scenes may not be guaranteed, we also ran experiments with incorrect scale ratios. Results are meaningful and reasonable for ratios ranging from 30:26 to 60:26. However, the results deteriorate rapidly and become disastrous beyond this range.
6 Conclusion
We have developed an approach for road network matching and recognition from line seg-
ments detected from aerial images. The ARG representation is used to describe scenes and
models. The optimal matching optimises a gain functional which measures the goodness
of fit between the model ARG and the scene ARG. Issues we have dealt with include in-
exact matching in the presence of distortions, N U L L matches and many-to-one matches.
The computational algorithm for the matching works in a parallel and distributed way,
and is suited for implementation on SIMD architectures like the Connectionist Machines.
We have presented experiments to demonstrate the performance of the approach.
Our matching is invariant to translation and rotation but not to scale changes. The
results are not apparently affected if both models and scenes are changed by the same
scale factor. However, they will deteriorate drastically when scene scales are changed
relative to models by a relative factor larger than, say, 20%. Our work is aimed at
performing inexact matching under distortions caused by noise, but not at dealing with
rubber-like shape changes. This is because the constraints used are geometric rather than
topological.
References
Fig. 1. The image on top left is the original image. The one on top right is the map that contains the image area. At the bottom left is shown part of the line model extracted from the map. At the bottom right the lines extracted from the image are shown.
This article was processed using the LaTeX macro package with ECCV92 style
Intensity and Edge-Based Symmetry Detection
Applied to Car-Following *
1 Introduction
1. Detecting leading cars. This means repeated visual scanning of the road in front of the car until an object appears which can be identified as another vehicle.
2. Visual tracking of a car's rear while its image position and size may vary greatly.
3. Accurate measuring of the car's dynamic image size needed for speed control.
The methods presented here exploit the symmetry property of the rear view of most vehicles on normal roads. Mirror symmetry with respect to a vertical axis is one of the most striking generic shape features available for object recognition in a car-following situation. Initially, we use an intensity-based symmetry finder to detect image regions that are candidates for a leading car. The vertical axis of symmetry obtained from this step is also an excellent feature for measuring the leading car's relative lateral displacement in consecutive images because it is invariant under (vertical) nodding movements of the camera and under changes of object size. To exactly measure the image size of the car in front, a novel edge detector has been developed which enhances edges that have a symmetric counterpart with respect to a given axis.
* This work has been supported by the German Federal Ministry of Research and Technology
(BMFT) and by the Volkswagen AG (VW). PROMETHEUS PRO-ART Project: ITM8900/2
Lecture Notes in Computer Science, Vol. 588
G. Sandini (Ed.)
Computer Vision - ECCV '92
© Springer-Verlag Berlin Heidelberg 1992
In the following, we briefly review previous work on symmetry detection and finding symmetry axes. There exist optimal solutions for this problem if the data can be mapped onto a set of points in a plane and only accurately symmetric point configurations are searched for. However, for real image data such methods cannot be used because they fail to detect imperfect symmetry. Other algorithms assume that the figure, i.e. the image region for which symmetry axes are sought, can be readily separated from the background. Friedberg [1], for example, shows how to derive axes of skewed symmetry of a figure from its matrix of moments. Marola [2] proposes a method for object recognition using the symmetry axis and a symmetry coefficient of a figure. The method is based on central moments too, but, in contrast to [1], it directly uses intensity values and hence takes into account the internal structure of the object as well. However, these methods either assume that the segmentation problem has been solved or that there is no segmentation problem (e.g. uniform background intensity). Saint-Marc and Medioni [3] propose a B-spline contour representation which facilitates symmetry detection.
For our application we need methods that do not require computationally expensive image preprocessing or segmentation. More importantly, we need algorithms which can be applied locally, since this is the only way real-time performance can be achieved without dedicated hardware. To our knowledge no "low-level" methods for symmetry detection in images have been reported upon in the literature to date. That is, all the methods we found require some kind of object-related preprocessing or a transformation of the whole object region into a higher-level representation.
Any function G(x) can be written as the sum of its even component Ge(x) and its odd component Go(x), provided the origin is at the center of the definition interval. We use d to denote the width of the interval on which G(x) is defined. In an image, d is given by the distance between the intersection points of the scan line with the image
For any fixed pair of values x_s and w, the significance of either E(x, x_s, w) or O(x, x_s, w) may appropriately be expressed by their respective energy contents, the integral over their squared values (Energy[f(x)] := ∫ f(x)^2 dx). However, the problem arises that the mean value of the odd function is always zero whereas the even function in general has some positive mean value. This bias has to be subtracted from the even function in order to render the two energy quantities comparable, in the sense that their dissimilarity indicates symmetry or antisymmetry respectively. We introduce a normalized even function which has a zero mean value:
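The even/odd decomposition and the bias-corrected energy comparison can be sketched for a discrete scan-line window. The exact form of the paper's symmetry measure S is not reproduced here; the normalized contrast below is an assumption that follows the description above:

```python
import numpy as np

def symmetry_contrast(signal):
    """Compare even and odd energy of a 1-D window about its midpoint.
    Returns a value in [-1, 1]: +1 for a perfectly even (symmetric)
    window, -1 for a perfectly odd (antisymmetric) one. The contrast
    form is an illustrative assumption, not the paper's measure S."""
    g = np.asarray(signal, dtype=float)
    ge = 0.5 * (g + g[::-1])      # even component about the centre
    go = 0.5 * (g - g[::-1])      # odd component
    ge_n = ge - ge.mean()         # normalized even part: zero mean
    e, o = np.sum(ge_n**2), np.sum(go**2)
    return (e - o) / (e + o) if e + o > 0 else 0.0

print(symmetry_contrast([1, 3, 5, 3, 1]))    # 1.0  (even / symmetric)
print(symmetry_contrast([-2, -1, 0, 1, 2]))  # -1.0 (odd / antisymmetric)
```

Subtracting the even part's mean is exactly the bias correction motivated above: without it, a constant-intensity window would score as strongly symmetric.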
2.2 The Symmetry Finder
Fig. 2. A plot of S(x_s, w) [a] and SA(x_s, w) [b] for the image row marked in Fig. 1. SA(x_s, w) summed over 54 image rows [c]. Two-dimensional symmetry clearly emerges as a stable feature.
on the degree of symmetry within a given interval and on the interval's relative size with respect to the overall extent of the signal. We define a confidence measure in [0...1] for a symmetry axis originating from an interval of width w about the position x_s as

    SA(x_s, w) = S(x_s, w) * min(w / w_max, 1)    (6)

w_max is the maximal size of the symmetric interval. Considering a global symmetry comprising the entire scan line to be the most significant case would imply w_max = d. However, w_max is better thought of as a limitation of the search interval, meaning that any perfectly symmetric interval of width w_max and wider corresponds to an indisputable symmetry axis.
Two-dimensional symmetry detection requires the combination of many symmetry histograms (accumulated confidence for symmetry axis vs. position of axis). This is easily done by summation of the confidence values for each axis position, provided the symmetry axis (axes) is (are) straight. Figure 2 illustrates the process of intensity-based symmetry detection. The image for which the simulation results are presented here is shown in Fig. 1. The symmetry position x_m is varied in a horizontal range of about half the image width. Figure 2a is a plot of the function S(x_m, w) for one image row across the three cars. Many local symmetry peaks exist for small w's. Figure 2b is a plot of the confidence function for symmetry axes. Large symmetry intervals give rise to steep peaks. After summation of SA(x_m, w) over a number of image rows, clearly distinguishable peaks emerge, indicating the symmetry axes of the three cars (Fig. 2c).
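Summing per-row confidence histograms into a 2-D symmetry estimate is a straightforward accumulation. In this sketch, `row_confidence` is a hypothetical stand-in for the per-row measure SA, using a simple mirror-difference score:

```python
import numpy as np

def symmetry_histogram(image, row_confidence):
    """Accumulate per-row axis-confidence values over all rows.
    `row_confidence(row)` returns one confidence value per candidate axis
    position; summation is valid when the axes are straight and vertical."""
    hist = None
    for row in image:
        c = np.asarray(row_confidence(row), dtype=float)
        hist = c if hist is None else hist + c
    return hist

# Toy stand-in for SA: negative squared mirror difference about column m
# (0 = perfect symmetry); untested boundary columns are excluded.
def row_confidence(row):
    row = np.asarray(row, dtype=float)
    n = len(row)
    conf = np.full(n, -np.inf)
    for m in range(1, n - 1):
        w = min(m, n - 1 - m)                    # half-width of the window
        left = row[m - w:m][::-1]
        right = row[m + 1:m + 1 + w]
        conf[m] = -np.sum((left - right) ** 2)
    return conf

img = np.array([[1, 2, 9, 2, 1],
                [0, 5, 7, 5, 0]])
h = symmetry_histogram(img, row_confidence)
print(int(h.argmax()))  # 2: the central column is the symmetry axis
```

Row-wise accumulation makes the axis estimate robust: spurious per-row peaks cancel, while a true vertical axis reinforces across rows, which is what Fig. 2c illustrates.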
3.1 Analysis of Orientation Symmetry
When curves (e.g. contour segments) are used as basic features for symmetry detection [3], the relationship between mutually symmetric points along two curves can be defined as a function of the directions of the two tangents. However, we may want to do without a segmentation process which has to produce differentiable curve segments, or the image may be such that recognizing symmetric objects becomes much easier when symmetry is detected first. Starting from a lower feature level we use local orientation instead of tangent direction and define a symmetry relationship for it.
Figure 3 depicts two banks of directional filters and all possible pairs of directions between them. Having a vector of eight direction-specific orientation values at each image position is actually common in practice (e.g. when Sobel filter masks are being used for preprocessing). The problem now is to define a symmetry relationship between the two local orientations, each represented by a vector of n directional values. There are n^2/2 different pairs of directions for which it has to be decided whether this combination agrees with our notion of symmetry. It is not always obvious which pairs of directions are to be regarded as symmetric. In Fig. 3 the direction pairs are grouped into six categories.
Ideal symmetry means that the two directions can be mapped onto each other by a reflection about the symmetry axis and a subsequent reversal of direction (the edge polarities at the two points have opposite sign). In case of inverse symmetry, the two directions are the reflection of each other. With antisymmetry we denote cases which are the opposite of what we consider ideal symmetry. Non-symmetry covers all cases where the pair of directions is considered neither symmetric nor antisymmetric, i.e. it is the neutral relation. The category near (inverse) symmetry expresses approximate symmetry.
(Fig. 3: two banks of directional filters and the pairs of directions between them, grouped into six symmetry categories.)
If orientation is measured by a small number of directional filters which may have over-
lapping angular sensitivity, all the above types of symmetry relations can be present con-
currently between two image loci, varying in degree of significance. We have developed
a method which combines evidence for the different categories of orientation symmetry.
It serves as a mechanism for detecting edges that are related to other edges by a certain
degree of symmetry. In the following we describe a detector element which receives in-
put from two image loci through a discrete representation of local orientation. The two
outputs of the detector element indicate edge points dependent on the strength of the
corresponding local orientation features and on the degree of compatibility of the two
orientation inputs according to a criterion for discrete orientation symmetry.
Figure 4 shows the architecture of the detector element. For clarity of the illustration, only four directional inputs are drawn on either side. The basic idea is to view the table of symmetry relations in Fig. 3 as a matrix of weight factors (w_ij) and compute weighted sums of the directional inputs (d_Li, d_Rj), which then serve as inhibitory signals for the corresponding directional inputs (d_Rj, d_Li) on the respective other side of the detector. SEED is a detector element implemented by a feedforward network whose connection weights represent the symmetry conditions.
The weighted sums represent the degrees of accumulated evidence for the compatibil-
ity of a given edge direction on either side of the detector element with the local orienta-
tion present at the respective other side's input. The degree of evidence is thresholded by
a sigmoid function (Θ). The threshold function prevents a strong directional signal from
causing an edge signal when there is only weak symmetry. By taking the products of
the directional inputs and the normalized degrees of symmetry a second layer of discrete
orientation representation is constructed, now with an enhanced sensitivity to orientation
patterns that have a symmetric counterpart across a given axis.
Let d_Li and d_Rj (i, j = 1...n) denote the outputs of the orientation-tuned filters for n discrete directions and let w_ij denote the factors specifying a direction's degree of compatibility with a direction on the opposite side of the detector element. Then the left and right outputs, respectively, of the detector element are defined as
    e_Li = d_Li · Θ(Σ_j w_ij d_Rj, T),    e_Rj = d_Rj · Θ(Σ_i w_ij d_Li, T),    i = 1...4, j = 1...4    (7)
Θ(u, T) is a sigmoid threshold function with upper limit 1 and lower limit 0:

    Θ(u, T) = 1 / (1 + exp(T - u/k))    (8)

Equation (8) contains two parameters that need to be chosen appropriately. k is a scaling factor for normalizing u, such that T can be chosen independently of the weighting factors w_ij. The factor k has to be determined once for a given weight matrix W. We compute k from the product of W and an angular influence matrix A, with a_ij = |cos(2π(i - j)/n)|, (i, j = 1...n). The maximum of the elements of W · A is a good approximation for the "gain" caused by the weighted summation of the directional inputs. The parameter T has to be determined experimentally. For best results, T should be adapted to the average edge contrast in the image. We usually set T to half the value of the average response of the directional filter for the strongest direction at edges.
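One detector element can be sketched as the feedforward computation described above: each side's directional inputs are gated by the thresholded weighted sum of the other side's inputs, using the sigmoid Θ(u, T) of equation (8). The identity weight matrix and the parameter values below are illustrative, not the paper's tuned settings:

```python
import numpy as np

def theta(u, T, k=1.0):
    """Sigmoid threshold function of equation (8)."""
    return 1.0 / (1.0 + np.exp(T - u / k))

def seed_element(d_left, d_right, W, T=0.5, k=1.0):
    """One SEED-style detector element (illustrative form): gate each
    side's directional inputs by the thresholded weighted sum of the
    other side's inputs, so only mutually compatible edges pass."""
    d_left = np.asarray(d_left, dtype=float)
    d_right = np.asarray(d_right, dtype=float)
    out_left = d_left * theta(W @ d_right, T, k)
    out_right = d_right * theta(W.T @ d_left, T, k)
    return out_left, out_right

n = 4
W = np.eye(n)                         # toy symmetry compatibility matrix
left = np.array([0.9, 0.1, 0.0, 0.0])
right = np.array([0.8, 0.0, 0.1, 0.0])
e_l, e_r = seed_element(left, right, W)
print(e_l.argmax(), e_r.argmax())     # strongest mutually compatible direction
```

A weak symmetry (small weighted sum) drives Θ toward zero and suppresses even a strong directional input, which is exactly the suppression behaviour the text attributes to the threshold function.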
Figures 5 and 6 show results. With the intensity-based symmetry finder we determine
the position of the symmetry axis. At the axis position found, SEED is applied. The
weight vector [Z,X,M,O,Y,N] (see Fig. 3) used to process the images in this section is
[-2,2,2,0,1,0]. The edges which do not confirm the assumption of mirror symmetry about
the given axis are suppressed. Edges arising from the symmetric structures of an object
clearly stand out.
Fig. 5. A picture of a household cable drum. Viewed from the front, it is a mirror-symmetric object with some asymmetric internal structures. Applying a Sobel edge filter to this image results in the upper right image. l.l.: The result of applying SEED at the position of the symmetry axis. l.r.: The symmetric edges (binarized) superimposed onto the original image.
Fig. 6. Multiple object detection. In addition to the processing steps demonstrated in Fig. 5,
SEED is applied twice, picking out the edges of the road sign as well.
Our methods for symmetry detection and symmetry filtering have been used to build a vision system (dubbed CARTRACK) for detecting and tracking cars and other vehicles having an approximately mirror-symmetric rear view. Symmetry is used for three different purposes:
1. Initial candidates for vehicle objects on the road are detected by means of the assumption that compact image regions with a high degree of (horizontal) intensity symmetry are likely to be nonincidental.
2. Visual tracking of a car's rear is greatly facilitated by the invariance of the symmetry axis under changes of size and vertical position of the object in an image sequence.
3. When using the symmetry constraint, separating the lateral vehicle edges from the image background becomes feasible, even on a low processing level. From the lateral contours of a vehicle, its image width can be determined accurately.
of an adaptable processing window whose position and size are controlled by a predictive
filter and a number of rules. While the size of the processing window varies depending
on the object size, the amount of image data processed by the symmetry finder is kept
approximately constant by means of a scan line oriented image compression technique.
The result of the symmetry finder is a symmetry histogram. Its highest peak is tentatively taken as the horizontal object position. Then SEED is used to verify and correct the object position by trying to find symmetric edges in addition to the intensity symmetry. Given that the object's symmetry axis has been found with high accuracy, it is relatively easy to correct the vertical object position. Finally, a boundary point is located on either side of the object, providing the start points for a boundary tracking algorithm which uses SEED to "see" only symmetric edge pairs. On both sides of a leading car, a sufficiently long contour segment in the image is extracted and the maximal lateral distance between the two contours is taken as its image width.
The current implementation had to be trimmed for fast computation. The weight
factors representing the symmetry criterion within SEED, for example, had to be chosen
as powers of two or zero respectively. The directional filter kernels are simple Sobel
masks. The threshold function 9 is implemented as a ~hard" threshold. Nonetheless,
the C A R T R A C K system has performed well during tests for an intelligent cruise control
system in an experimental van of Volkswagen (VW). With the current computer system,
based on a single MC68040 microprocessor (25 MHz), we achieve a cycle time of about
300 milliseconds for the output data to the van's control system.
5 Conclusion
References
1. Friedberg, S. A.: Finding Axes of Skewed Symmetry. Computer Vision, Graphics, and Image
Processing 34 (1986) 138-155
2. Marola, G. : Using Symmetry for Detecting and Locating Objects in a Picture. Computer
Vision, Graphics, and Image Processing 46 (1989) 179-195
3. Saint-Marc, P., Medioni, G. : B-Spline Contour Representation and Symmetry Detection,
Proceedings, First European Conf. on Computer Vision, Antibes, France (1990) 604-606
Indexicality and Dynamic Attention Control
in Qualitative Recognition of Assembly Actions
Yasuo Kuniyoshi and Hirochika Inoue
1 Introduction
Intelligent systems in a multi-agent environment must react to other agents' actions. In
such cases, visual recognition of actions is indispensable. The recognition process must
operate in real-time, generating semantic information suitable for reasoning processes or
matching against stored knowledge.
Research on motion understanding has offered various methods to estimate 3D motion
parameters of various objects including human bodies [1, 2]. But it is not clear how to
extract semantics from these parameters. Moreover, bodily motion parameters are neither
sufficient nor of primary importance for action recognition. Early attempts to extract
semantic information from motion pictures dealt with simple animations [3, 4]. Real-time
recognition of general actions in the real world is still an open problem.
Recently, many ill-posed vision problems have been reformulated and solved by as-
suming an active observer [5] and considering behavior context [6, 7]. This approach
should fit well to qualitative action recognition.
In this paper, we propose and test a new method for real-time visual recognition
of simple assembly tasks performed by human workers. The core of the method is twofold:
(1) Define a set of simple "indexical" features which can be quickly calculated
from visual features. They directly point to action concepts under certain conditions. (2)
Employ spatial/temporal/hierarchical attention control and context memory to maintain
the indexicality of extracted features.
Indexical Features of Assembly Actions. Provided that the "attention" and the "context"
are maintained, assembly actions belonging to each assembly motion are discriminated by
the "change". Observable features which directly point to the "changes" can be defined
in terms of temporal change of simple visual features.
The action concept of LocalMotion is stated as follows:
Attention: The hand, the target object and the held object, if any.
Context: The hand is moving slowly near (toward/away) the target object.
Change: Δholding(Hand, X) classifies PICK/PLACE/NO.
Let us define a 3D convex region called "target region" which covers the block to be picked
up or the space in which the held one will be placed. Then, the indexical feature is defined
as a change of the intersecting volume of the target region and any object. This feature
can be detected by the following procedure: (1) Predict the location and size of the target
region. (2) Project the 3D model of the target region onto each field of view of binocular
stereo to define 2D attention regions. (3) For each view, take two gray level (possibly,
color) snapshots of the attention region at temporal segmentation points before and after
the current action. (4) Differentiate the snapshots and threshold the change of area into
three cases, decreased/increased/no-change, which point to PICK/PLACE/NO³. Then
verify the 3D location of the change by triangulating the center of mass of the changed
areas.
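Steps (3) and (4) of this procedure can be sketched as follows (a minimal Python illustration assuming the white-blocks-on-black set-up of the experiment; the threshold value and function names are hypothetical):

```python
import numpy as np

def classify_change(before, after, thresh=10):
    """Steps (3)-(4): compare two snapshots of the 2D attention region taken
    before and after the current action and threshold the change of area.
    Assumes white blocks on a black background, as in the experiment."""
    area_before = int((before > 128).sum())   # bright pixels = object area
    area_after = int((after > 128).sum())
    delta = area_after - area_before
    if delta < -thresh:
        return 'PICK'    # occupied area decreased: block removed
    if delta > thresh:
        return 'PLACE'   # occupied area increased: block placed
    return 'NO'

before = np.zeros((20, 20)); before[5:15, 5:15] = 255   # block present
after = np.zeros((20, 20))                              # block gone
```

The 3D location of the change would then be verified by triangulating the centers of mass of the changed areas, as described above.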
Conceptual organization of an ALIGN operation (see Fig. 1) is as follows:
Fig. 1. ALIGN action (leftmost) and indexical features (in round boxes) of coplanar/no-coplanar
relations.
In the context specified above, simple 2D visual features, namely, contour junction types
shown in Fig. 1, directly point to coplanar/no-coplanar states. By tracking only the edges
E1, E2 and E3 and checking for the junctions, coplanar states are immediately detected. The
"change" feature can be quickly calculated by differentiating the coplanar/no-coplanar
state at start/end points of the current action.
³ To simplify the implementation, blocks are painted white and the background is black in our
experiment.
The indexicality of observable features is based on correctness of the "attention" and the
"context".
Spatial attention control maintains the "attention" part. It detects the moving hand
by temporal differentiation of images (pre-attentive), tracks the hand and the moving
objects/edges to solve correspondence problem, and predicts a target region/edge by
visual search procedure. The visual search starts by setting a stereo-pair of 2D regions
on the "base" of attention maintained by tracking. The regions are progressively moved
in the direction in which the base is moving until they hit an object or the worktable. By
looking up the estimated 3D position in the environment model, a correct target region
is determined. Contextual information is important here, e.g. the environment model is
updated by recognizing previous actions, and whether to set the target region on (in
PICK) or above (in PLACE) the found object depends on which action is expected now.
Temporal attention control maintains the "context" part. When a shift of attention is
detected by a visual search or a pre-attentive routine, current context is no longer valid,
which means that the indexicality of currently selected features breaks. This context
switch directly signals a temporal segmentation of an action. Different visual features are
selected for monitoring, and tracking or visual search is invoked/stopped depending on
the context. This is done by selecting nodes of the action model.
Hierarchical attention control stabilizes the overall recognition process. It extracts dif-
ferent types of visual features at different timings in parallel from superimposed regions.
The regions and the features are organized in a hierarchy which corresponds to the levels
of assembly motions. Very coarse features such as temporal differentiation of the whole
view are always monitored, while fine features such as edge junctions are checked only
in FineMotions. The hierarchical parallelism contributes to the robustness: Even when
the hand suddenly moves off during a FineMotion, the system readily catches up with the
gross motion. This control scheme is implemented as "attention stack" operations.
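The "attention stack" operations can be sketched as follows (a schematic Python illustration; the level names and the operations are our own reading of the scheme, not the authors' implementation):

```python
class AttentionStack:
    """Sketch of hierarchical attention control: coarse attention levels
    stay below finer ones, and popping back to a coarse level lets the
    system catch up with sudden gross motion."""

    def __init__(self):
        # Very coarse features (whole-view temporal difference) are always monitored.
        self._stack = [('whole-view', 'temporal-difference')]

    def push(self, region, features):
        # Entering a finer motion level: monitor finer features in a subregion.
        self._stack.append((region, features))

    def pop_to(self, region):
        # Context switch: the current fine context is no longer valid.
        while len(self._stack) > 1 and self._stack[-1][0] != region:
            self._stack.pop()

    def current(self):
        return self._stack[-1]

att = AttentionStack()
att.push('hand', 'tracking-windows')
att.push('target-region', 'edge-junctions')   # FineMotion level
att.pop_to('hand')                            # hand suddenly moves off
```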
An experimental system [10] was built upon the presented theory. It succeeded in rec-
ognizing human assembly actions in real-time. Figure 2 shows a monitor display during
the recognition of an arch construction task. The worker's hand is tracked by multiple
visual windows and an indexical feature is extracted from the target region. The result of
recognition is displayed on the bottom lines. The elapsed time for the whole task was 2
min. Figure 3 shows the coplanar detector at work. In this experiment, the detector was
run continuously to repeatedly check for coplanar relations. The held block was moved
continuously and the detector reported coplanar/no-coplanar discrimination at a speed
of 1 Hz. Other tasks recognized successfully by the system include: (1) pick and place
with ALIGN operation, (2) tower building, (3) inverted arch balanced on a center pillar,
(4) table with four legs.
Acknowledgement
The authors wish to express their thanks to Dr. Masayuki Inaba for his suggestions
and contribution to the base system, and Mr. Tomohiro Shibata for implementing the
coplanar detector.
Fig. 2. Recognizing the task "Build Arch": (1) Target region defined. (2) Differentiation. (3)
Identified action type "place-on-block" displayed.
References
1. J. O'Rourke and N. J. Badler. Model-based image analysis of human motion using con-
straint propagation. IEEE Trans., PAMI-2(6):522-536, 1980.
2. M. Yamamoto and K. Koshikawa. Human motion analysis based on a robot arm model.
In Proc. of IEEE Conf. CVPR, 664-665, 1991.
3. S. Tsuji, A. Morizono, and S. Kuroda. Understanding a simple cartoon film by a computer
vision system. In Proc. 5th IJCAI, 609-610, 1977.
4. R. Thibadeau. Artificial perception of actions. Cognitive Science, 10(2):117-149, 1986.
5. J. Aloimonos and I. Weiss. Active vision. Int. J. of Computer Vision, 333-356, 1988.
6. D. H. Ballard. Reference frames for animate vision. In Proc. IJCAI, 1635-1641, 1989.
7. S. D. Whitehead and D. H. Ballard. Learning to perceive and act. Technical Report 331,
Computer Science Dept., Univ. of Rochester, Rochester, NY 14627, USA, June 1990.
8. Y. Kuniyoshi, M. Inaba, and H. Inoue. Teaching by showing: Generating robot programs
by visual observation of human performance. In Proc. 20th ISIR, 119-126, 1989.
9. J. R. Hobbs. Granularity. In Proc. IJCAI, 432-435, 1985.
10. Y. Kuniyoshi, M. Inaba, and H. Inoue. Seeing, understanding and doing human task. In
Proc. IEEE Int. Conf. Robotics and Automation, 1992.
This article was processed using the LaTeX macro package with ECCV92 style
Real-time Visual Tracking for Surveillance and Path
Planning
1 Introduction
Over the last few years, significant advances have been made in estimation of surface
shape from visual motion, that is from image sequences obtained from a moving camera.
By combining differential geometry [6] with spatio-temporal analysis of visual motion [9,
10, 7] it has been shown that local surface curvature can be computed from moving images
[8, 3, 4, 11, 1]. The computation is robust with respect to surface shape, configuration
of the surface relative to the camera and the nature of the camera motion. For example
the ability to discriminate qualitatively between rigid features and silhouettes on smooth
surfaces has been demonstrated [3]. Particularly important for collision-free motion, the
"sidedness" of silhouettes is computed, that is, which side is solid surface and which is
free space.
This paper reports on progress in building a robot with active vision that can ma-
nipulate objects in the presence of obstacles. Our Adept robot has a camera and gripper
on board and is able to make exploratory "dithering" movements around its workspace.
As it moves, it monitors the image motion and deformation of contours in real-time us-
ing parallel dynamic contours. Real-time performance depends on appropriate internal
dynamical modelling with adaptive control of scale. As in earlier versions of our system
[3, 4], contour motion is used to interpret occlusion and surface curvature. More recently
we have built on further features: incremental building of a free-space model, incremental
planning of robot motion and search strategies for navigation.
2 Visual Tracking
The curvature analysis described above relies on the ability to track moving curved
image-contours. We have implemented a series of deformable cubic B-spline "dynamic
contours" that run at video rates using a Transputer network. Points along the dynamic
contour are programmed to have an affinity for image features such as brightness or high
intensity-gradient. The contour can now be used to make curvature estimates in a few
seconds. Substantial improvements in tracking performance are r'ealised when Lagrangian
dynamics formalisms are used to model mass distributed along a dynamic contour which
moves in a viscous medium. We have found that use of large Gaussian blur is unnecessary,
a crucial factor in achieving real-time performance.
2.1 Coupled B-spline model
Defining a dynamic contour with inertia [5] implies a model in which the image velocity
of features is assumed uniform. However, stronger assumptions about the feature can
be incorporated when appropriate, to considerable effect. This is done by making the
further assumption that the feature shape will change only slowly. As an illustration,
with no shape assumption, the contour flows around corners (Fig. la). When the shape
assumption is incorporated it moves almost rigidly with the corner feature (Fig. lb).
Fig. 1. Dynamic contours tracking a corner. In (a) the dynamic contour is flexible, and slides
round the corner as the object moves to the left, whereas the "coupled" dynamic contour in (b)
follows the true motion of the corner.
The shape assumption is imposed by using a pair of coupled B-splines. The first can
be trained by taking the original (uncoupled) dynamic contour and allowing it to relax
onto a feature. Its shape is then frozen and becomes the "template" shape (see also Yuille
et al. [12]) in the new model. A second B-spline curve, initially an exact copy of the first,
is then spawned and coupled to the template B-spline. The coupling is defined, naturally,
by a new elastic energy term. In the case of first derivative coupling energy, the dynamic
contour behaves like a one-dimensional membrane or string. Second derivative coupling
produces a contour which acts like a one-dimensional thin plate or rod. Suppose the
template contour has control point vector Q̄; then the equations of motion, derived by
an analysis similar to the uncoupled model [5], are:
where H0 and H1 are constant matrices, simply compositions of B-spline coefficients, and
U is the least squares B-spline approximation to the feature vector. The constants w0,
β0 govern elastic attraction towards the feature and velocity damping respectively. The
constants w1 and β1 govern elastic restoring forces and internal damping respectively.
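The roles of the four constants can be illustrated with a scalar sketch of the coupled dynamics (a minimal reconstruction, not the paper's matrix equations with H0 and H1; the constant values and step size are arbitrary):

```python
import numpy as np

def euler_step(q, v, q_template, u_feature, dt=0.04,
               w0=50.0, b0=10.0, w1=20.0, b1=5.0):
    """One integration step for a coupled dynamic contour (scalar sketch).

    q, v: control point positions and velocities;
    q_template: the frozen template control points;
    u_feature: least-squares B-spline approximation to the feature.
    w0/b0: elastic attraction to the feature and velocity damping;
    w1/b1: elastic restoring force to the template and internal damping.
    """
    accel = (w0 * (u_feature - q)       # attraction towards the image feature
             - b0 * v                   # viscous (velocity) damping
             + w1 * (q_template - q)    # coupling towards the template shape
             - b1 * v)                  # internal damping
    v = v + dt * accel                  # semi-implicit Euler update
    q = q + dt * v
    return q, v

# The contour settles between feature and template, weighted by w0 and w1.
q, v = np.zeros(4), np.zeros(4)
template, feature = np.zeros(4), np.ones(4)
for _ in range(500):
    q, v = euler_step(q, v, template, feature)
```

Larger w1 makes the contour behave more rigidly, which is what prevents it flowing around corners as in Fig. 1b.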
2.2 Parallel implementation
The equations of motion are integrated using an implicit Euler scheme for speed on a
network of transputers. The dynamic contours are allocated to worker transputers a span
at a time. The individual spans of a contour communicate their contribution to V in (1)
to the rest of the contour at each Euler step. With six worker transputers, Euler steps
can be performed at frame rate (25 Hz). Three separate frame grabbers are used, sampling
the same video input. Figure 2 shows a sample of frames
from a multiple contour tracking sequence using coupled dynamic contours.
The dynamic contour can successfully track features whose velocity is such that the
lag caused by viscous drag does not exceed the radius of the tracking window. With a
tracking window radius of approximately 35 mrad (in a field of view of 0.3 rad) the maximum
tracking velocity is about 1.4 rad/sec for our system. Note that varying the tracking
window radius is our mechanism for control of scale. For example, a large window is used
during feature capture, getting smaller as the contour locks on.
3 Path Planning
Following the first version of our manipulation system [2], we have investigated a more
complex version of the manipulation problem in which the workspace is cluttered with
several obstacles. In addition to visual and spatial geometry, we now need to add Arti-
ficial Intelligence search techniques (A* search). Path-planning works in an incremental
fashion, repeating a cycle of exploratory motion, clearing a triangular chunk of freespace
(figure 3), and viewpoint prediction. After several of these cycles, over which the robot
"jostles" for the best view of the goal (box lid), it may find itself jammed between ob-
stacles and the edge of its workspace. In that case it backtracks to an earlier point in
its search of freespace and investigates a new path. So far we have demonstrated these
techniques with up to three different, unmodelled obstacles in the workspace.
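The search component of this incremental cycle can be sketched as a standard A* restricted to space currently known to be free (an illustrative Python sketch; the grid abstraction is ours, whereas the actual system accumulates triangular freespace chunks on the ground plane):

```python
import heapq

def a_star(free, start, goal):
    """A* search over cells currently known to be free.
    free: set of (x, y) cells; returns a shortest path or None, in which
    case the robot must backtrack and explore more freespace."""
    def h(c):  # Manhattan distance heuristic (admissible on a 4-connected grid)
        return abs(c[0] - goal[0]) + abs(c[1] - goal[1])
    open_set = [(h(start), 0, start, [start])]
    seen = set()
    while open_set:
        f, g, cell, path = heapq.heappop(open_set)
        if cell == goal:
            return path
        if cell in seen:
            continue
        seen.add(cell)
        x, y = cell
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if nxt in free and nxt not in seen:
                heapq.heappush(open_set,
                               (g + 1 + h(nxt), g + 1, nxt, path + [nxt]))
    return None   # no path in known freespace: backtrack and explore

# An L-shaped corridor of known-free cells from (0, 0) to (4, 4).
free = {(x, y) for x in range(5) for y in range(2)} | {(4, y) for y in range(5)}
path = a_star(free, (0, 0), (4, 4))
```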
4 Conclusions
We have discussed the principles and practice of visual manipulation of objects in clutter
from several points of view: tracking, dynamics, spatial geometry and geometry of grasp.
We are currently working to extend this in a variety of ways, including:
Fig. 2. Dynamic contours tracking objects in a scene (raster order). All dynamic contours are
"coupled". Note the top left hand contour falling off its feature as the object moves over the
image boundary.
- Developing more powerful internal models for tracking to improve ability to ignore
clutter. This would enable the robot to perform efficiently with obstacles in the
background as well as in the foreground.
- Employing more sophisticated geometric modelling to allow fine-motion planning that
takes account of the shape of the gripped object, and the fact that the workspace is
3D, not 2½D.
Acknowledgments
We acknowledge the financial support of the SERC and of the EEC Esprit programme.
We have benefitted greatly from the help and advice of members of the Robotics Research
Group, University of Oxford, especially Professor Michael Brady, Roberto Cipolla and
Zhiyan Xie.
References
1. E. Arbogast and R. Mohr. 3D structure inference from image sequences. In H.S. Baird,
editor, Proceedings of the Syntactical and Structural Pattern Recognition Workshop, pages
21-37, Murray Hill, NJ, 1990.
2. A. Blake, J.M. Brady, R. Cipolla, Z. Xie, and A. Zisserman. Visual navigation around
curved obstacles. In Proc. IEEE Int. Conf. Robotics and Automation, volume 3, pages
2490-2499, 1991.
Fig. 3. Our visual object manipulation system clears triangular chunks of freespace visually, in
an incremental fashion (freespace shown in grey). Accumulated freespace is represented in a 2D
plane as shown, and is actually a projection of 3D obstacles onto a horizontal plane. The robot
plans paths (black lines), restricted to space currently known to be free. Horizontal paths in the
figure are exploratory "dithering" motions performed deliberately to facilitate structure from
motion computations. The state of freespace is shown after navigation around three obstacles,
towards the goal object at the left of the figure. The robot tried the path on the bottom, reached
a fixed search depth-bound, backtracked, and successfully navigated on the top.
3. A. Blake and R.C. Cipolla. Robust estimation of surface curvature from deformation of
apparent contours. In O. Faugeras, editor, Proc. 1st European Conference on Computer
Vision, pages 465-474. Springer-Verlag, 1990.
4. R. Cipolla and A. Blake. The dynamic analysis of apparent contours. In Proc. 3rd Int.
Conf. on Computer Vision, pages 616-625, 1990.
5. R.M. Curwen, A. Blake, and R. Cipolla. Parallel implementation of lagrangian dynamics
for real-time snakes. Submitted to 2nd British Machine Vision Conference, 1991.
6. M.P. DoCarmo. Differential Geometry of Curves and Surfaces. Prentice-Hall, 1976.
7. O.D. Faugeras. On the motion of 3-d curves and its relationship to optical flow. In Pro-
ceedings of 1st European Conference on Computer Vision, 1990.
8. P. Giblin and R. Weiss. Reconstruction of surfaces from profiles. In Proc. 1st Int. Conf.
on Computer Vision, pages 136-144, London, 1987.
9. H.C. Longuet-Higgins and K. Prazdny. The interpretation of a moving retinal image. Proc.
R. Soc. Lond., B208:385-397, 1980.
10. S.J. Maybank. The angular velocity associated with the optical flow field arising from
motion through a rigid environment. Proc. Royal Society, London, A401:317-326, 1985.
11. R. Vaillant. Using occluding contours for 3d object modelling. In O. Faugeras, editor, Proc.
1st European Conference on Computer Vision, pages 454-464. Springer-Verlag, 1990.
12. A.L. Yuille, D.S. Cohen, and P.W. Hallinan. Feature extraction from faces using deformable
templates. Proc. CVPR, pages 104-109, 1989.
Spatio-temporal Reasoning within a Traffic
Surveillance System
these components. Figure 1 illustrates a simplified functional architecture for these two
main modules of our computer vision system.
Fig. 1. Functional Architecture. Fig. 2. Bremer Stern: Tracking updates
The main exemplar running through this paper will be based around tracking of a
sequence which includes the 3 frames appended to the paper. These 3 frames are from
an image sequence taken at the Bremer Stern roundabout in Germany that we have
used as an exemplar for the VIEWS project. Consider the vehicles where a) the lorry
starts to occlude the saloon car, b) the car is fully occluded by the lorry, and c) the car
starts to re-emerge from behind the lorry. The car then moves off. Processing at both the
perceptual and conceptual levels needs spatio-temporal calculations. At the perceptual
level, the detection, tracking and classification of the visible moving objects is performed.
Some of the updates for the model based tracking of the occlusion frames are displayed
in figure 2. This figure displays vehicles as points projected onto the groundplane of the
roundabout which is a complex environment segmented into road, bicycle and tram lanes
(cutting across the middle of the roundabout). Figure 2 also shows the relationship of
the camera field of view to the groundplane; this is important for occlusion reasoning, as
will be discussed later.
The information extracted by the PC is given as updates at different levels of analysis
and speed of processing (but always as estimates with respect to the ground plane) to the
SAC. The basic form is as < label, position, time > updates. The SAC then processes the
information first, at the level of events such as stopping/starting and entering/exiting re-
gions with an additional local check on the spatio-temporal continuity. Second, the global
consistency checking constructs and maintains the space-time histories using behavioural
constraints in space and time from the context of other vehicles in the locality. Finally,
the SAC can also perform selected behavioral prediction and feedback which, for this full
occlusion, must maintain the occlusion relationships for recognition of the re-emerging
vehicle and give feedback both within the SAC and to the PC so that the relabelling of
the vehicle and constructing a consistent history can be performed.
Any vision based tracking system must incorporate some means of overcoming occlu-
sion problems. Even a simple task like counting the number of vehicles passing through
the scene requires occlusion reasoning so that re-emerging vehicles are not counted twice.
In this paper, we will present results from a case study where the role of the SAC
is primarily to act as a high level, long term memory keeping explicit representations of
total occlusions occurring in dynamic scenes and utilizing behavioural knowledge to aid
relabelling.
2 Analogical Representation
The analogical spatial representation underlying the behavioural models described later
is a flexible, multi-purpose representation which maintains the structure of the world in
a usable manner. The key requirements for the static knowledge in this representation
are: (1) the conversion of the ground plane geometry into meaningful regions; and (2)
the description of the connectivity of these regions. The objective of the surveillance
is to reason about the vehicles in the scene, so we also require a means of representing
dynamically: (3) spatial extent and edges of vehicle ground plane bounding boxes to
determine occupied regions; and (4) velocity and orientation to derive behaviour. When
we also include vehicle interactions, we require (5) inter-vehicle orientation and inter-
vehicle distance.
The analogical representation provides a uniform framework in the SAC, incorporat-
ing the static scene knowledge and dynamic vehicle histories. The static knowledge is
attached to regions. Regions, on the groundplane, are constructed from cell sets. For ex-
ample, these can express: (1) the types of vehicle that use a region (eg roads, cycle-lane,
tram-line); (2) regions of behavioural significance (eg giveway, turning); (3) direction in-
formation (eg lane leading onto roundabout); and (4) the basic connectivity of the leaf
regions.
Cells are also used to represent dynamic vehicle histories. The analogical spatio-
temporal representation is shown in figure 3. Calculation based on relating vehicle histo-
ries can fully capture the semantics of behaviours like 'following', 'overtaking', 'crossing',
'queueing' etc. An example of the dynamic relation 'following' is given in figure 5 (overlap in
spatial history within some time delay). Given the dynamic and static knowledge bases,
we can usually determine what a vehicle is doing.
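The 'following' relation can be sketched directly in terms of cell-set histories (a minimal Python illustration; the delay and overlap thresholds and all names are hypothetical):

```python
def following(hist_a, hist_b, max_delay=5, min_overlap=3):
    """Sketch of the dynamic relation 'following': vehicle b follows
    vehicle a when b's spatial history revisits cells of a's history
    within some time delay.
    hist_*: dict mapping time step -> set of occupied groundplane cells."""
    hits = 0
    for t, cells_b in hist_b.items():
        for d in range(1, max_delay + 1):
            # Overlap in spatial history within the allowed time delay.
            if cells_b & hist_a.get(t - d, set()):
                hits += 1
                break
    return hits >= min_overlap

# car_b passes through the same cells as car_a, two steps later.
hist_a = {t: {(t, 0)} for t in range(10)}
hist_b = {t: {(t - 2, 0)} for t in range(2, 10)}
```

Relations such as 'overtaking' or 'queueing' would be captured by analogous predicates over the same histories.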
3 Occlusion
The Perception Component incorporates a range of vision tracking competencies. Its pri-
mary task is motion focused tracking and classification of vehicles. The specific tracking
competencies are detailed elsewhere ([Sullivan et al. 90]). The VIEWS configuration we
are considering is as follows:
- 'Gated' initialization of tracking: Static areas of the image where vehicles are known
to appear are highlighted for track initialization.
- PC < 1 second vehicle memory: The PC is dealing with video rate input, and once a
vehicle track is lost attempts to relocate the vehicle are swiftly curtailed.
- Motion cued tracking: pixel grouped motion in the image is the primary cue for
processing. Vehicles which halt and remain stationary for extended periods may be
'forgotten'. This problem may be handled as a virtual occlusion by the techniques
presented here.
In the case of tracking being lost by total occlusion, some recovery mechanism is
needed so any new tracks initialized by the PC are correctly labelled as continuations.
In the short term, tracking algorithms attempt recovery by focusing processing, directed
by calculations extrapolating the "lost" vehicle's velocity. Occluded vehicles manoeuvre
unpredictably; this makes simple velocity-based calculation a poor approximation in the
long term.
The SAC maintains occluded-by relationships. The boundary of a lead vehicle in the
image defines the area of potential emergence. The SAC occlusion reasoning aids cor-
rect relabelling after emergence and indicates to the PC which vehicles to monitor for
boundary deformations caused by emerging vehicles. Occlusions are so common that these
techniques are developed to be of low cost.
3.1 T h e e x a m p l e
Examine the example occlusion sequence; frames are shown at the end of the paper and
tracks plotted in figure 2. A saloon travels behind a lorry, is occluded for some time and
finally emerges. When the saloon emerges, it is labelled as a new track by the PC. The
minimum competences for the SAC are to spot the occlusion occurring, maintain the
occlusion relationships, relabel the saloon correctly upon emergence and complete its
spatio-temporal history.
While the reconstruction of the specific < label, position, time > track is impossible,
the form of the spatio-temporal history can be bounded in a useful manner (see figure 4).
3.3 During Occlusion
In the example, partial occlusion of the saloon by the lorry (see first frame) develops
into full occlusion (see frame-2). After the update for frame-96 the SAC generates the
message shown in figure 7 and asserts the relationship occluded-by(096, 5, <8>). The
lorry is 'object5', the saloon 'object8'. The first field is either the time from which the
occlusion is active, or a range indicating the period of time the occlusion is considered to
have lasted. Once an occluded-by relationship is created the SAC must be able to reason
about its maintenance and development.
During a typical occlusion development:
- A vehicle may leave shot never having emerged from occlusion. Once the occluding
vehicle leaves shot, all occluded vehicles must do so.
- When not themselves becoming totally occluded, other vehicles may enter into partial
occlusion relationships with the lead vehicle and then separate. In this case, it is
possible that occluded vehicles move from one vehicle 'shadow' to another.
In the next stage, 'relabelling', it will be argued that behavioral knowledge is required
to help disambiguate labelling problems as vehicles emerge. This is because a total oc-
clusion may be of arbitrary length and over a long time almost any reordering of vehicles
could occur. No processing is expended on reasoning about the development of the rel-
ative positions of vehicles until they emerge. Behavioral relationships established prior
to the occlusion, e.g. following(cara, carb), are stored for consideration during relabelling.
But the consideration of the development of relationships is not useful during the actual
time of occlusion.
A grammar is used to manage the development of occlusions. Here ta : tb is the time range
from ta to tb, and vj :: <list> represents the union of vj and <list>. <list> ⊖ vj represents
<list> with vj removed. ⊕ is an exclusive-or, such that occluded-by(tj, vm, cara) ∧
occluded-by(tj, vl, cara) cannot both be true. Unless otherwise stated, ti is before tj and,
where used, ti < tk < tj.
Existence... Does an occlusion exist?
occluded-by(ti, vl, ∅) → no occlusion
Emergence... What may emerge next?
occluded-by(ti, vl, <list>) → next-emerge(tj, v), v ∈ <list>
If occluded-by(ti, vl, <list>) ∧ emerge-from(tj, vl, vold)
→ occluded-by(tj, vl, <list> ⊖ vold) ∧ occluded-by(ti : tj, vl, vold)
Joining... Has a vehicle joined an existing occlusion?
occluded-by(ti, vl, <list>) ∧ occluded-by(tj, vl, vnew)
→ occluded-by(ti, vl, vnew :: <list>)
Lead Occluded... Has an occluding vehicle been occluded?
occluded-by(ti, vl, <list>l) ∧ occluded-by(tk, vnew, <list>new) ∧ occluded-by(tj, vnew, vl)
→ occluded-by(tj, vnew, vl :: <list>new :: <list>l)
ti and tk are not ordered; tj follows both.
Leave Scene... Anything occluded leaves shot with the lead!
occluded-by(ti, vl, <list>) ∧ left-shot(ti, vl)
→ left-shot(ti, <list>)
Visible Interaction... If shadows meet, the occluder may change!
occluded-by(ti, vl, <list>) ∧ (partially-occludes(tk, vl, vm) ∨ partially-occludes(tk, vm, vl))
→ occluded-by(tj, vl, <list>l) ⊕ occluded-by(tj, vm, <list>m)
where <list>l :: <list>m = <list>
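Two of these rules (joining and emergence) can be sketched as bookkeeping on occluded-by records (an illustrative Python sketch using the labels of the paper's example; the data structures are our own, not those of the SAC):

```python
class OcclusionStore:
    """Sketch of SAC occlusion bookkeeping for the joining and
    emergence rules of the grammar."""

    def __init__(self):
        self.active = {}   # lead vehicle -> (start_time, [occluded labels])
        self.closed = []   # ((t_start, t_end), lead, occluded) records

    def occlude(self, t, lead, vehicle):
        # Joining: occluded-by(t, lead, vnew :: <list>).
        start, lst = self.active.get(lead, (t, []))
        self.active[lead] = (start, [vehicle] + lst)

    def emerge(self, t, lead, vehicle):
        # Emergence: remove v_old from <list>, record occluded-by(ti : tj, ...).
        start, lst = self.active[lead]
        lst.remove(vehicle)
        self.closed.append(((start, t), lead, vehicle))
        if not lst:
            del self.active[lead]      # empty list -> no occlusion
        else:
            self.active[lead] = (t, lst)

sac = OcclusionStore()
sac.occlude(96, 5, 8)    # occluded-by(096, 5, <8>): lorry 5 hides saloon 8
sac.emerge(108, 5, 8)    # emerge-from(108, 5, 8)
```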
The SAC maintains the consistency of the options with constraint reasoning. Updates
from the PC are compared to these possibilities. In our example, the initial relationship
generates expectations:
Later the creation of a new track for 'vehicle-10', matches the suggested emergence of
the occluded saloon (see figure 7). This matches the following rule:
occluded-by(096, 5, <8>) ∧ emerge-from(108, 5, 8)
→ occluded-by(108, 5, ∅) ∧ occluded-by(096 : 108, 5, <8>)
In the example the relabelling is unique (10 = 8) and no additional occlusion relationships
exist (→ ∅). The relabelling and the fact that the lorry no longer occludes anything
are passed to the PC. This relabelling case is the simplest possible. More complex cases
and behavioural evaluation, will be considered in the following section.
3.4 Emergence
The remaining occlusion reasoning in the SAC can be summarised as two capabilities:
(1) Relabel: While the actual moment of emergence is not possible to predict, this
does not mean no useful inferences can be made. When a new track is initialized,
indicated by a new vehicle label in the VIEWS system, it is compared with currently
maintained occlusion relationships. If it matches, the new label is replaced by the
matching previous label for both the PC and SAC.
(2) History Completion: As complete histories are preferable for behavioral evalua-
tion, the SAC needs access to the last and the newest position so the complete
history is 'extruded' as shown in figure 4 to fill in the trajectory of the occluded
vehicle.
Consider the case where two vehicles (cara and carb) are occluded by the same lorry.
When the PC indicates a new vehicle track the following cases, generated by the grammar
given earlier, are all initially plausible, since the 'emergence' may be caused by the vehicles'
motion relative to the camera or each other:
occluded-by(lorry, <cara, carb>) → visible(lorry) ∧
1. next-emerge(lorry, cara) ∧ occluded-by(lorry, <carb>)
2. next-emerge(lorry, cara) ∧ occluded-by(cara, <carb>)
1. To produce consistent histories and alert the PC to any inconsistency in data pro-
vided.
2. To produce ongoing behavioural evaluations; both for the end user and to supply a
source of knowledge for defaults in the production of consistent histories.
5 Conclusion
In summary, we have discussed some of the competences developed for the specific application
of road traffic surveillance. The main competence elaborated here is to deal
with total occlusions, in order to develop consistent spatio-temporal histories for use in
behavioural evaluation, e.g. following, queue formation, crossing and overtaking. Specifically,
we have proposed a grammar for the handling of occlusion relationships that allows
us to infer correct, consistent labels and histories for the vehicles. This was illustrated
with a simple example, although the information made available is sufficient to deduce
that the saloon car, having been occluded by the lorry, overtakes it. In addition we
also described the analogical representation that supports the contextual indexing and
behavioural evaluation in the scene.
Work is continuing on more complex behavioural descriptions and their interaction
with more complex occlusions as well as generating more constrained expectations of
vehicles on emergence. In particular, it is important to look at the trade-offs between
behavioural and occlusion reasoning, decision speed and the accuracy of predicted re-
emergence.
6 Acknowledgements
We would especially like to thank Simon King & Jerome Thomere at Framentec for
much discussion of ideas about the SAC and behavioural reasoning, and Geoff Sullivan
and Anthony Worrall at Reading University for discussion of the interaction with the
PC; we would also like to thank Rabin Ezra for access to his boundless knowledge about
PostScript. In addition, we thank all the VIEWS team for their work in putting
together this project and the ESPRIT II programme for funding.
References
[Buxton and Walker '88] Hilary Buxton and Nick Walker, Query based visual analysis: spatio-temporal
reasoning in computer vision, pages 247-254, Image and Vision Computing
6(4), November 1988.
[Fleck '88a] Margaret M. Fleck, Representing space for practical reasoning, pages 75-86, Image
and Vision Computing, volume 6, number 2, May 1988.
[Howarth and Toal '90] Andrew F. Toal, Richard Howarth, Qualitative Space and Time for Vision,
Qualitative Vision Workshop, AAAI-90, Boston.
[Mohnhaupt and Neumann '90] Michael Mohnhaupt and Bernd Neumann, Understanding object
motion: recognition, learning and spatiotemporal reasoning, FBI-HH-B-145/90,
University of Hamburg, March 1990.
[Nagel '88] H.-H. Nagel, From image sequences towards conceptual descriptions, pages 59-74,
Image and Vision Computing, volume 6, number 2, May 1988.
[Steels '88] Luc Steels, Steps towards common sense, VUB AI lab memo 88-3, Brussels, 1988.
[Thibadeau '86] Robert Thibadeau, Artificial perception of actions, pages 117-149, Cognitive
Science, volume 10, 1986.
[Sullivan et al. '90] G. Sullivan, Z. Hussain, R. Godden, R. Marslin and A. Worrall, Knowledge
Based Image Processing, Technical Report, Esprit-II P2152 'VIEWS', 1990.
[Marslin et al. '91] R. Marslin, G.D. Sullivan, K. Baker, Kalman Filters in Constrained Model
Based Tracking, pp. 371-374, BMVC-91.
[Shu and Buxton '90] C. Shu, H. Buxton, A parallel path planning algorithm for mobile robots,
proceedings: International Conference on Automation, Robotics and Computer Vision,
Singapore, 1990.
[Toal and Buxton '92] A.F. Toal, H. Buxton, Behavioural Evaluation for Traffic Surveillance
using Analogical Reasoning and Prediction, in preparation.
This article was processed using the LaTeX macro package with ECCV92 style
Template Guided Visual Inspection
A. Noble, V.D. Nguyen, C. Marinos,
A.T. Tran, J. Farley, K. Hedengren, J.L. Mundy
GE Corporate Research and Development Center
P.O. Box 8, 1 River Road, Schenectady, NY 12301, USA
1 Introduction
Automatic visual inspection is a major application of machine vision technology. How-
ever, it is very difficult to generalize vision system designs across different inspection
applications because of the special approaches to illumination, part presentation, and
image analysis required to achieve robust performance. As a consequence it is necessary
to develop such systems almost from the beginning for each application. The resulting
development cost prohibits the application of machine vision to inspection tasks which
would provide a high economic payback in labor savings or material efficiency, or to the
detection of critical flaws involving human safety.
The use of Computer Aided Design (CAD) models has been proposed to derive the
necessary information to automatically program visual inspection [16, 2]. The advantage
of this approach is that the geometry of the object to be inspected and the tolerances
of the geometry can be specified by the CAD model. The model can be used to derive
optimum lighting and viewing configurations as well as provide context for the application
of image analysis processes.
On the other hand, the CAD approach has not yet been broadly successful because
images result from complex physical phenomena, such as specular reflection and mutual
illumination. A more significant problem limiting the use of CAD models is that
the actual manufactured parts may differ significantly from the idealized model. During
product development a part design can change rapidly to accommodate the realities of
manufacturing processes and the original CAD representation can quickly become obsolete.
Finally, for curved objects, the derivation of tolerance offset surfaces is quite complex
and requires the solution of high degree polynomial equations [3].
An alternative to CAD models is to use an actual copy of the part itself as a reference.
The immediate objection is that the specific part may not represent the ideal dimensions
or other properties and without any structure it is impossible to know what attributes of
the part are significant. Although the part reference approach has proven highly successful
in the case of VLSI photolithographic mask inspection [5, 14], it is difficult to see how
to extend this simple approach to the inspection of more complex, three dimensional,
manufactured parts without introducing some structure defining various regions and
boundaries of the part geometry. The major problem is the interpretation of differences
between the reference part and the part to be inspected. These differences can arise from
irrelevant variations in intensity caused by illumination non-uniformity or shadows. Even if
the image acquisition process can be controlled, there will be unavoidable part-to-part
variations which naturally arise from the manufacturing process itself, but are irrelevant
to the quality of the part.
In the system to be described here we combine the best features of the CAD model
and part reference approaches by introducing a deformable template which is used to
automatically acquire the significant attributes of the nominal part by adapting to a large
number of parts (e.g. 100). The template provides a number of important functions.
In Section 2 we consider the theoretical concepts which determine the general struc-
ture of a constraint template. Section 3 describes the general design and principal al-
gorithms used in the current prototype inspection system. Experimental results demon-
strating constraint templates applied to X-ray images of industrial parts are given in
Section 4. We conclude in Section 5.
2 Constraint Templates
We have based the design of our inspection system on the definition of a template which
consists of a set of geometric relationships which are expected to be maintained by any
correctly manufactured instance of a specific part. It is important to emphasize that
the template is a generic specification of the entire class of correct instances which can
span a wide range of specific geometric configurations. We accommodate these variations
by solving each time for an instance of the template which satisfies all of the specified
constraints, while at the same time accommodating for the observed image features which
define the actual part geometry. Currently, the system is focused on single 2D views of
a part, such as X-ray projections. However, there is no limitation of the general concept
to a single image, so that multiple 2D views or 3D volume data could be interpreted by
a similar approach.
More specifically, the template is defined in terms of the following primitive geometric
entities: point, conic (ellipse, circle, hyperbola, line), and Bézier curve. These primitive curve
types can be topologically bounded by either one or two endpoints to define a ray or
curve segment. The geometric primitives are placed in the template in the context of a
set of geometric relationships. The set of geometric constraints available are as follows:
Size in Ratio Two primitives have some fixed ratio in scale factor.
Linear Size A set of entities are related by a linearly varying scale factor. This relationship is
often observed in machine parts.
Linear Spacing The distance between a set of entities varies linearly over the set. Again, this
constraint is motivated by typical part geometries.
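As a concrete reading of the linear spacing constraint, the following sketch (our own illustration, not the system's solver) checks that the gaps between successive entity positions vary linearly, i.e. that the second difference of the positions is constant:

```python
# Hedged illustration: a set of collinear positions satisfies "linear
# spacing" when the gaps between neighbours vary linearly over the set,
# i.e. the second difference of the positions is constant.

def satisfies_linear_spacing(positions, tol=1e-6):
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    steps = [b - a for a, b in zip(gaps, gaps[1:])]
    return all(abs(s - steps[0]) <= tol for s in steps)

ok = satisfies_linear_spacing([0.0, 1.0, 3.0, 6.0])    # gaps 1, 2, 3
bad = satisfies_linear_spacing([0.0, 1.0, 3.0, 4.0])   # gaps 1, 2, 1
```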
2.1 The Configuration Concept
We have developed the concept of the configuration which provides a systematic approach
to the symbolic definition of geometric entities and many of the geometric relationships
just defined [12]. The geometric constraints are ultimately represented by a system of
polynomials in the primitive shape variables and the constraint parameters.
Except for scalar measures such as length and cosine, all geometric entities are repre-
sented by configurations, which have parameters for the location, orientation, and size of
the primitive shapes. Symbolically, these slots are represented by 2D vectors of variables.
The location of a shape is described by l^T = (lx, ly). This location is usually the center
or the origin of the local frame of the primitive shape. The orientation of a shape in the
plane is described by an angle θ, or by a unit vector o^T = (ox, oy) = (cos θ, sin θ). The
latter is used to avoid trigonometric functions and to use only polynomial functions of
integer powers. The size of a shape, like for an ellipse, is represented by a vector having 2
scale factors, (kx, ky), along the major and minor axes. To avoid division of polynomials,
the inverse of the size is represented: for example, k^T = (kx, ky) = (a^-1, b^-1) for an
ellipse.
The configuration is an affine transformation matrix representing the translation,
rotation, and scaling from the local coordinate frame (X,Y) of the shape to the image
frame (x, y):
( X )   ( kx  0  ) (  cos θ   sin θ ) ( x - lx )
( Y ) = ( 0   ky ) ( -sin θ   cos θ ) ( y - ly )
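Numerically, the configuration transform can be checked on an ellipse: with inverse scale factors k = (a^-1, b^-1), an image point on the ellipse should map to the unit circle in the local frame. The sketch below is our own illustration of that reading (the function name and test values are ours):

```python
import math

# Hedged numeric sketch: map an image point (x, y) into the local frame
# of an ellipse configuration with location (lx, ly), orientation theta
# and semi-axes a, b (the stored sizes are the inverses 1/a, 1/b).

def to_local(x, y, lx, ly, theta, a, b):
    dx, dy = x - lx, y - ly
    c, s = math.cos(theta), math.sin(theta)
    X = (c * dx + s * dy) / a      # rotate by -theta, scale by 1/a
    Y = (-s * dx + c * dy) / b     # rotate by -theta, scale by 1/b
    return X, Y

# Axis-aligned ellipse centred at (2, 3) with a = 4, b = 2; the image
# point (6, 3) lies on it, so it maps to the unit circle: X^2 + Y^2 = 1.
X, Y = to_local(6.0, 3.0, 2.0, 3.0, 0.0, 4.0, 2.0)
```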
3 System Design
3.1 Philosophy
The inspection system operates in one of two functional modes: inspection template
acquisition mode or part inspection mode (Figure 1). Inspection template acquisition
involves the derivation of a constraint template which encapsulates the expected geometry
of a "good" part. Initially a template is created manually by a user through a graphical
interface with the aid of blueprint specifications or inspection plans. Once a template
is created, the system is run on a suite of images of "good" parts to refine the nominal
template parameters and provide statistical bounds on parameter values. The end result is
a template description which includes correction for inaccurate placement of primitives in
the initial template creation process and which accurately reflects the true part geometry.
Part inspection involves making decisions about whether parts contain defects. For
example, parts must not contain flaws produced by poor drilling and part dimensions
must satisfy geometric tolerance specifications. In terms of image analysis tasks this pro-
cess involves first extracting empirical geometric features from the image data via image
segmentation and local feature parameterization. Global context for decision-making is
provided via the inspection template which is deformed to the empirical primitives by
first registering the template to image features and then applying nonlinear optimization
techniques to produce the "best-fit" of the template to the empirical features. Finally, the
deformed template description is used for verification of part feature dimensions and to
provide the context for the application of specialized algorithms for characterizing local
flaws.
Fig. 2. (a) A simplified example of requirements for an inspection template; and (b) a snapshot
in the process of template construction illustrating the introduction of a constraint after line
primitive creation.
3.2 System Components
Essentially the inspection system can be divided into four functional modules: template
creation; image feature extraction; template refinement; and flaw decision-making.
Template Creation. A simplified example to illustrate the requirements for an inspection
template is shown in Figure 2a. The general template creation process involves first
specifying a set of geometric primitives and then establishing the relationships between
them. In our system this is achieved using a graphical template editing tool which allows
the user to build a template composed of a selection of the 4 types of geometric primitive
specified in section 2, which are related by any of 12 possible constraint types. Figure 2b
illustrates a "snap-shot" view in creating a template.
Image Segmentation. The extraction of geometric primitives is achieved using a
morphology-based region boundary segmentation technique. Details of this algorithm can
be found elsewhere [13]. This algorithm locates, to pixel accuracy, boundary points on
either side of an edge as half boundaries, which are 4-connected pixel chains. A typical
output from the algorithm is shown in Figure 3b where both edges of the regions are
highlighted.
To detect subtle changes in image geometry and to achieve accurate feature param-
eterization we have implemented a subpixel residual crossing localization algorithm. A
morphological residual crossing is defined as the zero-crossing of the signed maz dilation-
erosion residue, fmaxder(f) [13]:
Fig. 3. Drilled hole segmentation: (a) original; (b) region boundary segmentation; (c) subpixel
localization of boundaries. In (b) both region edges have been marked which explains the ap-
pearance of the boundaries as thick edges.
Constraint Solver. The goal of the constraint solver is to solve the problem of finding
an instance of the inspection template which satisfies all of the geometric constraints
defined by the template and, at the same time, minimizes the mean-square error between
the template primitives and the image features. The mean-square error can be expressed
as a convex function of the template parameters and a geometric description of the image
features.
Theoretical details of the approach can be found in [12]. Briefly, the two goals of
finding the global minimum of a convex function, ∇f(x) = 0, and satisfying the constraints,
h(x) = 0, are combined to give a constrained minimization problem. A linear
approximation to this optimization problem is:
∇²f(x) dx = -∇f(x)
∇h(x) dx = -h(x)     (2)
The over-determined system is solved with a weighting factor, which determines the weight
given to satisfying the constraints versus minimizing the cost function. Each iteration of (2)
has a line search that minimizes a least-square-error merit function similar to the objective
of the standard penalty method.
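A toy numeric instance of such a linearized step can make the idea concrete. This is our own construction, not the paper's solver: the cost f(x) = 0.5·||x - (2, 0)||², the constraint h(x) = x0 + x1 - 1, the weight value and all names are assumptions.

```python
# Toy instance (our construction) of a linearized constrained step:
# f(x) = 0.5 * ||x - (2, 0)||^2, so Hess(f) = I, with one constraint
# h(x) = x0 + x1 - 1 = 0, weighted by mu in the stacked system.

def constrained_newton_step(x, mu=100.0):
    g = (x[0] - 2.0, x[1] - 0.0)          # grad f
    h = x[0] + x[1] - 1.0                 # constraint value
    # Stacked rows A dx = b: Newton rows, then the weighted constraint row.
    A = [(1.0, 0.0), (0.0, 1.0), (mu, mu)]
    b = [-g[0], -g[1], -mu * h]
    # Solve the 2x2 normal equations (A^T A) dx = A^T b by Cramer's rule.
    ata = [[sum(r[i] * r[j] for r in A) for j in range(2)] for i in range(2)]
    atb = [sum(r[i] * bi for r, bi in zip(A, b)) for i in range(2)]
    det = ata[0][0] * ata[1][1] - ata[0][1] * ata[1][0]
    dx0 = (atb[0] * ata[1][1] - ata[0][1] * atb[1]) / det
    dx1 = (ata[0][0] * atb[1] - ata[1][0] * atb[0]) / det
    return (x[0] + dx0, x[1] + dx1)

# A large weight drives the step toward the constrained minimum (1.5, -0.5).
x = constrained_newton_step((0.0, 0.0))
```

Since f is quadratic and h linear here, a single step with a large weight already lands near the constrained minimum; in the general nonconvex case the step is repeated with a line search.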
Verification. The output from the constraint solver is a set of deformed primitives which
can be used for one of two purposes: either to further refine the parameter values and tolerances
of the inspection template, or for flaw decision-making. For example, the derived
parameters from the deformed primitives can be compared to the template parameters to
detect geometric flaws such as inaccurate drilled hole diameters. The deformed inspection
template primitives can also provide the context for applying specialized algorithms for
characterizing shape and intensity-based properties of subtle flaws. Although the detec-
tion of flaws is not the focus of this paper, preliminary results of flaw analysis will be
illustrated in the experiments described in the next section.
4 Experiments
In this section, we present results from our current working system in action. This sys-
tem has been designed using object-oriented methodology and implemented on SPARC
workstations using the C++ language and the X-based graphics toolkit InterViews.
Template Creation. First, in Figure 2b, we illustrate the process of template creation.
The set of configurations in the template contains a number of lines and a number of
points. These geometric entities (or rather their counterparts in the image) are subjected
to a number of constraints. The original template and the template after deformation
are shown in figure 4.
Fig. 4. Template creation and solving for the best fit. (a) Template specification; (b) Template
after best fit.
Table 1. Normalized nominal values and tolerances for geometric measurements collected over
a sample set of 10 images using an inspection template.
measurement                  average  s.d.   max    min
length of hole 4             0.997    0.011  1.020  0.974
length of hole 6             0.995    0.062  1.010  0.971
length of hole 8             0.995    0.02   1.018  0.971
length of hole 14            0.998    0.011  1.018  0.984
hole length for sample set   0.998    0.010  1.023  0.971
outer boundary orientation   1.017    0.1    1.220  0.807
sum of hole spacing          0.998    0.002  1.002  0.996
hole separation              1.000    0.018  1.132  0.956
normalized lengths for the 10 parts is shown in Figure 5a. Table 1 also shows statistics
for the sum of hole spacings, hole separation and the orientation of the outer boundary.
These global measurements were specified in the template by linear spacing and linear
size constraints. As the table shows, the agreement between the template model and
image data is very good. This indicates the template accurately represents both critical
local and global geometric parameters.
Fig. 5. Histogram plots of (a) the lengths of holes for a sample set of 10 good parts. Lengths
have been normalized by the template values; and, (b) the updated histogram where the samples
from a part containing a defect have been added to the sample set. The unshaded sample is more
than 2σ from the sample average.
5 Discussion
To summarize, this paper has described progress toward the development of a geometry-
based image analysis system based on the concept of a deformable inspection template.
We have described aspects of our approach, some of the key components of our inte-
grated system and presented results from processing experimental data using our current
implementation.
Our approach differs from elastic 'snake' based techniques [1, 6] and intensity-based
deformable parameterized contours [15] and templates [7] in a number of respects. First,
we use geometric primitives rather than intensity based features as subcomponents to
build the template, although the constraint solving machinery could be modified to handle
this case. Second, a key idea our work addresses is how to use a deformable template for
quantitative interpretation as opposed to feature extraction. Finally, our scheme allows
for the derivation of generic deformable parameterized templates, which is clearly a major
benefit for fast prototyping of new inspection algorithms.
References
1. Burr, D.J.: A Dynamic Model for Image Registration, Computer Vision, Graphics, and
Image Processing, 1981, 15, 102-112.
2. Chen, C., Mulgaonkar, P.: CAD-Based Feature-Utility Measures For Automatic Vision
Programming, Proc. IEEE Workshop Auto. CAD-Based Vision, Lahaina HI, June 1991, 106.
3. Farouki, R.: The approximation of non-degenerate offset surfaces, Computer Aided Geometry
Design, 1986, 3:1, 15-44.
4. Horn, B.K.P.: Robot Vision, McGraw-Hill, New York, 1986.
5. Huang, G.: A robotic alignment and inspection system for semiconductor processing, Int.
Conf. on Robot Vision and Sensory Control, Cambridge MA, 1983, 644-652.
6. Kass, M., Witkin, A., Terzopoulos, D.: Snakes: Active Contour Models, Int. J. of Computer
Vision, 1988, 1:4, 321-331.
7. Lipson, P., et al.: Deformable Templates for Feature Extraction from Medical Images, Proc.
Europ. Conf. on Computer Vision, Antibes France, April 1990, 413-417.
8. Medioni, G., Huertas, A.: Detection of Intensity Changes with Subpixel Accuracy using
Laplacian-Gaussian Masks, IEEE PAMI, September 1986, 8:5, 651-664.
9. Maragos, P., Schafer, R.W.: Morphological Systems for Multidimensional Signal Processing,
Proc. of the IEEE, April 1990, 78:4, 690-710.
10. Nalwa, V.S.: Edge Detector Resolution Improvement by Image Interpolation, IEEE PAMI,
May 1987, 9:3, 446-451.
11. Nelson, G.: Juno, a Constraint-Based Graphics System, ACM Computer Graphics, SIG-
GRAPH '85, San Francisco CA, 1985, 19:3, 235-243.
12. Nguyen, V., Mundy, J.L., Kapur, D.: Modeling Generic Polyhedral Objects with Constraints,
Proc. IEEE Conf. Comput. Vis. & Patt. Recog., Lahaina HI, June 1991, 479-485.
13. Noble, J.A.: Finding Half Boundaries and Junctions in Images, Image and Vision
Computing, (in press 1992).
14. Okamoto, K., et al.: An automatic visual inspection system for LSI photomasks, Proc. Int.
Conf. on Pattern Recognition, Montreal Canada, 1984, 1361-1364.
15. Staib, L.H., Duncan, J.S.: Parametrically Deformable Contour Models, Proc. IEEE Conf.
Comput. Vis. & Patt. Recog., San Diego CA, June 1989, 98-103.
16. West, A., Fernando, T., Dew, P.: CAD-Based Inspection: Using a Vision Cell Demonstrator,
Proc. IEEE Workshop on Auto. CAD-Based Vision, Lahaina HI, June 1991, pp 155.
Hardware Support for Fast Edge-based Stereo
AIVRU (Artificial Intelligence Vision Research Unit) has an evolving vision system,
TINA, which uses edge-based stereo to obtain 3D descriptions of the world. This has been
parallelised to operate on a transputer-based fast vision engine called MARVIN (Mul-
tiprocessor ARchitecture for VisioN) [9] which can currently deliver full frame stereo
geometry from simple scenes in about 10 seconds. Such vision systems cannot yet offer
the throughput required for industrial applications and the challenge is to provide real-
time performance and be able to handle more complex scenes with increased robustness.
To meet this challenge, AIVRU is constructing a new generation vision engine which will
achieve higher performance through the use of T9000 transputers and by committing
certain computationally intensive algorithms to hardware.
This hardware is required to operate at framerate in the front-end digitised video
pathways of the machine with output routed (under software control) to image memories
within the transputer array. The current generation of general purpose DSP (Digital
Signal Processing) devices offers processing throughputs up to 100Mips but many low
level tasks require 100 or more operations per pixel, and at a framerate of 10MHz this
is equivalent to over 1000Mops. Four low level processing tasks have been identified
as candidates for framerate implementation: image rectification, ground plane obstacle
detection (GPOD), convolution and edge detection (see [3] for more detail).
2 Image Rectification
Image rectification is a common procedure in stereo vision systems [1] [8]. Stereo match-
ing of image features is performed along epipolar lines. These are not usually raster lines
but the search may be performed rapidly if the images are transformed so that a raster
in one image is aligned with a raster in the other. This also permits increased spatial par-
allelism since horizontal image slices distributed to different processors need less overlap.
The current method is to rectify image feature positions rather than each pixel, since
there are generally far fewer image features than pixels. However, as image features are
added and more complex scenes analysed, this becomes less practical.
Obstacle detection is required as a major component of the zero order safety compe-
tence for AGVs (Autonomously Guided Vehicles), when it is necessary to detect, but not
necessarily recognise, putative obstacles. This would have to be performed very rapidly
to permit appropriate action to be taken. Mallot [7] pointed out that if an image were
projected down into the ground plane (at an angle to the image plane), it should be
the same as another image of the same ground plane taken from a different viewpoint,
but that if there were some object outside the ground plane, such as an obstacle, the
projected images would be different and subtracting one from the other would reveal
that obstacle. The image plane warping offered by an image rectification module would
support such an operation by using different transform parameters.
At each point in the transformed image, it is necessary to compute the corresponding
position in the input image. It has been shown that this process is equivalent to effecting
a 3 by 3 homogeneous coordinate transform [5]:
(x1, y1, 1)^T ∝ H (x2, y2, 1)^T
from pixel (x2, y2) in the output image to pixel (x1, y1) in the original image. This
computation, solved to yield x1 and y1, is nonlinear and the few commercially available
image manipulation systems are unable to cope with this nonlinearity. A mapping lookup
table could be used, but in a system with variable camera geometry this would be time
consuming to alter, so the transform must be computed on the fly.
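The per-pixel computation can be sketched as follows; the matrix H here is an arbitrary illustrative homography, not calibrated camera geometry, and the function name is ours:

```python
# Hedged sketch of the on-the-fly coordinate transform: push an output
# pixel (x2, y2) through a 3x3 homogeneous matrix and divide by the
# third component -- the division is where the nonlinearity enters.

def warp_coord(H, x2, y2):
    u = H[0][0] * x2 + H[0][1] * y2 + H[0][2]
    v = H[1][0] * x2 + H[1][1] * y2 + H[1][2]
    w = H[2][0] * x2 + H[2][1] * y2 + H[2][2]
    return u / w, v / w            # source pixel (x1, y1)

# Identity plus a small projective term (illustrative values only).
H = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.001, 1.0]]
x1, y1 = warp_coord(H, 100.0, 100.0)   # both coordinates scaled by 1/1.1
```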
The accuracy required of the image rectification process is determined by the accuracy
with which features may be detected in images. We work with just two features at present:
Canny edgels [2]; and Harris corners [6]; both of which offer subpixel location: edgels down
to 0.02 pixels repeatability, falling to about 0.1 pixels on natural images; and corners to
about 1/3 pixel. Propagating this through the transform equation suggests that as many
as 22-24 bits of arithmetic are required for the normal range of operations.
High resolution image processing using subpixel acuity places severe demands on the
sensor. Whereas CCD imaging arrays may be accurately manufactured, lenses of similar
precision are very costly. Reasonable quality lenses give a positional error of 1.25 to
2.5 pixels, though this can be corrected to first order by adjusting the focal length. To
improve upon this, a simple radial distortion model has been used to good effect. It is
necessary to apply distortion correction to the transformed coordinates before they are
used to select pixels in the source image. The most general way to implement this is to
fetch offsets from large lookup tables addressed by the current coordinates.
As exact integer target pixel coordinates are transformed into source pixel coordinates,
the resulting address is likely to include a fractional part and the output pixel will
have to incorporate contributions from a neighbourhood. The simplest approach using
interpolation over a 2x2 window is inaccurate in the case of nonlinear transforms. Our
approach is to try to preserve the 'feature spaces'. Edgels are extracted from gradient
space and corners from second difference space, and if these spaces are preserved, the
features should be invariant under warping. An attempt is made to preserve the gradients
by computing the first and second differences around the point and interpolating using
these. This requires at least a 3x3 neighbourhood to compute second differences. Six
masks are needed in all [5]. Computing each for a given subpixel offset at each point
would be very costly in terms of hardware, but the masks may be combined to form a
single mask (by addition) for any given x and y subpixel offsets. The coefficients of the
desired mask are selected using the fractional part of the transformed pixel address and
the set of 9 pixel data values selected using the integer part. This method was evaluated
by comparing explicitly rectified line intersections with those extracted from the rectified
and anti-aliased image. Tests indicate that subpixel shifts are tracked with a bias of less
than 0.05 pixels and a similar standard deviation, at 0.103 pixels, when using a limited
number (64) of mask sets. Since grey level images are being subtracted during GPOD,
signal-to-noise ratio is important and this is affected by the number of masks used. Tests
indicate that of a maximum of 58dB for an 8 bit image, 57dB can be obtained using 6
mask selection bits (64 masks) on test images.
The overall block diagram of the image rectification module, including lens distortion
correction table and mask table is shown in figure 1.
Fig. 1. Block diagram of Image Rectification Module.
3 Edge Detection
On the MARVIN system, approximately 5 of the 10 second full frame stereo processing
time is taken performing edge extraction. Object tracking at 5Hz has been demonstrated
but for this the software has to resort to cunning but fragile algorithmic short-cuts [9]. If
the edge extraction process could be moved into hardware, full frame stereo throughput
and tracking robustness would be significantly increased.
The current TINA implementation of Canny's edge detection algorithm [2] extracts
edge element strength, orientation and subpixel position by: convolution with a Gaussian
kernel; computation of gradients in x and y; extraction of the edgel orientation from
x and y gradients using arctangent; extraction of the edgel gradient from the x and y
gradients by pythagoras; application of nonmaximal suppression (NMS) based on local
edgel gradient and orientation in a 3x3 neighbourhood, suppressing the edgel if it is not a
maximum across the edgel slope; computation of edgel position to subpixel acuity using
a quadratic fit; and linking of edgels thresholded with hysteresis against a low and high
threshold. These linked edgels are then combined together into straight line segments
and passed on to a matching stage prior to 3D reconstruction. Despite several attempts
by other researchers to build Canny edge detectors, such hardware is still unavailable [3].
Gaussian smoothing may be performed using a convolution module. Convolution is
well covered in standard texts and may be simply performed using commercially available
convolution chips and board level solutions from a number of sources. A prototype dual
channel 8x8 convolution board has been constructed using a pair of Plessey PDSP16488
chips. Since edge detection most commonly uses a σ of 1.0, an 8x8 neighbourhood would
be adequate. Tests with artificial images indicate that at least 12 bits are necessary from the
output of the convolution to ensure good edgel location. Real images gave worse results
and 15-16 bits would be preferred. Note that not all commercial convolution boards are
able to deliver this output range. The x and y gradients may easily be obtained by
buffering a 3x3 neighbourhood in a delay line and subtracting north from south to get
the y gradient dy, and east from west to get the x gradient dx.
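In software the same differencing looks like this; it is a sketch with our own sign conventions, and the hardware obtains the neighbourhood from a delay line rather than an array:

```python
# Hedged sketch: gradients from a 3x3 neighbourhood of smoothed pixels,
# nbhd[row][col], with row 0 = north. Sign conventions here are ours.

def gradients(nbhd):
    dy = nbhd[2][1] - nbhd[0][1]   # south minus north
    dx = nbhd[1][2] - nbhd[1][0]   # east minus west
    return dx, dy

nbhd = [[10, 10, 10],
        [20, 20, 20],
        [30, 30, 30]]
dx, dy = gradients(nbhd)           # horizontal edge: dx = 0, dy = 20
```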
The edgel orientation φ is normally computed from the x and y gradients using the
arctangent. This is quite poorly behaved with respect to the discrete parameters dx and dy
and these have to be carefully manipulated to minimise numerical inaccuracies. This is
achieved by computing the angle over the range 0 to 45 degrees by swapping dx and
dy to ensure that dy is the largest and expanding over the whole range later on. Using
a lookup table for the angle using full range dz and dy would require a 30 bit address
which is beyond the capability of current technology. The m a x i m u m reasonable table
size is 128k or 17 address bits. However the address range may be reduced by a form of
cheap division using single cycle barrel shifting. Since shifting guarantees that the most
significant bit of dy is 1, this is redundant, permitting an extra dx bit to be used in
address generation, thus extending the effective lookup table size by a factor of 2. This
has been shown to result in an error of less than 0.45 degrees, which is smaller than the
quantisation error due to forcing 360 degrees into 8 bits of output [4].
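The octant-folding idea can be sketched in software (a simplified illustration: the table is indexed by a plain ratio here rather than by the barrel-shifted address scheme of the hardware, and the function name is ours):

```python
import math

def angle_deg(dx, dy, table_bits=7):
    """Edgel orientation via a small arctangent lookup table.

    Range reduction as in the text: fold (dx, dy) so the tabulated
    angle atan(min/max) lies in [0, 45] degrees, then unfold using
    the swap and the signs to recover the full 0-360 range.
    """
    if dx == 0 and dy == 0:
        return 0.0
    ax, ay = abs(dx), abs(dy)
    lo, hi = min(ax, ay), max(ax, ay)
    n = 1 << table_bits
    # LUT over the folded ratio r in [0, 1] -> atan(r) in [0, 45] degrees
    table = [math.degrees(math.atan(i / (n - 1))) for i in range(n)]
    a = table[round(lo / hi * (n - 1))]
    base = a if ay <= ax else 90.0 - a        # undo the swap: [0, 90]
    if dx >= 0 and dy >= 0:                   # unfold by quadrant
        ang = base
    elif dx < 0 and dy >= 0:
        ang = 180.0 - base
    elif dx < 0:
        ang = 180.0 + base
    else:
        ang = 360.0 - base
    return ang % 360.0
```

Even with a 128-entry table the lookup stays within a fraction of a degree of the true arctangent, consistent with the sub-0.45-degree error reported in the text.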
For the gradient strength, the Pythagorean root of the sum of squares is quite difficult to
perform in hardware. The traditional hardware approach is to compute the Manhattan
magnitude (i.e. sum of moduli) instead, but this results in a large error of about 41%
when dx = dy. A somewhat simpler technique is to apply a correction factor directly by
dividing the maximum of dx and dy by cos φ to give:

G = max(|dx|, |dy|) · (1 / cos φ)

This factor 1/cos φ may be looked up from the barrel-shifted dx and dy as before, and it
has been shown that this would result in an error in the gradient G of less than 0.8% [4].
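The correction-factor magnitude can be sketched as follows (again an illustration: the table is indexed by a plain ratio rather than barrel-shifted operands, and the function name is ours):

```python
import math

def grad_mag(dx, dy, table_bits=7):
    """Gradient magnitude as max(|dx|, |dy|) times a 1/cos(phi) factor.

    phi = atan(min/max) lies in [0, 45] degrees, so a small lookup
    table of 1/cos values covers every case.
    """
    ax, ay = abs(dx), abs(dy)
    hi, lo = max(ax, ay), min(ax, ay)
    if hi == 0:
        return 0.0
    n = 1 << table_bits
    # LUT: 1/cos(atan(r)) == sqrt(1 + r^2) for the folded ratio r in [0, 1]
    table = [1.0 / math.cos(math.atan(i / (n - 1))) for i in range(n)]
    return hi * table[round(lo / hi * (n - 1))]
```

Unlike the Manhattan magnitude, which overshoots by about 41% on diagonal edges, this stays well inside the sub-1% error quoted in the text.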
Nonmaximal suppression is a fairly simple stage at which the central gradient mag-
nitude is compared with those of a number of its neighbours, selected according to the
edgel orientation. Quadratic fitting is performed to obtain an offset del which may be
computed from the central gradient c and the two neighbours a and b by:
del = (a - b) / (2((a + b) - 2c))
Due to the division, the offset is gradient-scale invariant over a wide range. The three
gradients may be barrel shifted as before. The terms a - b, a + b and thence a + b - 2c
may be computed using simple adders, and the subpixel offset looked up from a single
128k lookup table. The x and y subpixel offsets may then be computed from this value
to give an x offset if the edgel is roughly vertical and a y offset if the edgel is roughly
horizontal. This again may be accomplished with a small lookup table.
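The quadratic-fit offset is simple enough to state directly in code (a minimal sketch; the function name is ours, and the hardware would realise the division as a lookup as described above):

```python
def subpixel_offset(a, c, b):
    """Peak offset of a quadratic fitted through three gradient samples.

    a and b are the neighbours on either side of the central value c
    (taken at positions -1, 0, +1); the fitted parabola peaks at
    del = (a - b) / (2*((a + b) - 2*c)), as in the text.
    """
    denom = 2 * ((a + b) - 2 * c)
    if denom == 0:
        return 0.0          # flat profile: no well-defined peak
    return (a - b) / denom
```

As a sanity check, sampling any parabola with a known peak inside the central pixel recovers that peak exactly, which is why the offset is gradient-scale invariant.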
Other implementations of hysteresis edge linking used an iterative algorithm requiring
5-8 iterations [4]. A simpler solution is to perform thresholding against the low threshold
in hardware but leave the linking and hysteresis to the software. The computational
overhead for hysteresis should be small as it is simply a case of comparing the fetched
gradient with zero (to check that it is valid - and this has to be performed anyway) or
with the high threshold, depending on the branch taken in the linking algorithm. The
hysteresis problem therefore disappears.
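The division of labour just described can be sketched as follows: the hardware zeroes everything below the low threshold, and the software keeps only chains that reach the high threshold somewhere. This is an illustrative breadth-first version, not the linking algorithm of [4]:

```python
from collections import deque

def hysteresis(mag, low, high):
    """Software hysteresis on a low-thresholded gradient map.

    Pixels below `low` are zeroed (the hardware step); surviving
    pixels are kept only if 8-connected, through other survivors,
    to at least one pixel whose gradient reaches `high`.
    """
    h, w = len(mag), len(mag[0])
    g = [[v if v >= low else 0 for v in row] for row in mag]
    keep = [[False] * w for _ in range(h)]
    q = deque((r, c) for r in range(h) for c in range(w) if g[r][c] >= high)
    for r, c in q:                      # seed from strong edgels
        keep[r][c] = True
    while q:                            # grow along surviving chains
        r, c = q.popleft()
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                rr, cc = r + dr, c + dc
                if 0 <= rr < h and 0 <= cc < w and g[rr][cc] and not keep[rr][cc]:
                    keep[rr][cc] = True
                    q.append((rr, cc))
    return keep
```

Per edgel, the software only compares the fetched gradient with zero or with the high threshold, which is why the overhead is small.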
Tests of the Canny hardware design, carried out with synthetic circle images providing
all possible edgel orientations, gave a pixel accuracy of better than 0.01 pixels worst case
and a standard deviation of 0.0025 pixels, even with a contrast as low as 16 grey levels.
This is far superior to the current repeatability of 0.02 pixels.
4 Conclusions
Image rectification is a common, well understood but costly task in computer vision. It is
well suited to hardware implementation. A scheme for correcting arbitrary lens distortions
was proposed. Anti-aliasing may be performed by convolving the 3x3 neighbourhood
with a set of precomputed masks. Using 64 masks in each axis should be large enough to
ensure good line fitting and signal to noise ratio for GPOD. The surface fitting scheme
proved to preserve line intersections with a variance of better than 0.1 pixels and a
bias of less than 0.05 pixels on artificial images. This is rather poorer than the Canny
edge detector is capable of but adequate for later stages and may be improved using
alternative anti-aliasing methods. Experiments with the simulated hardware rectification
on sample images of a simple object produced good 3D geometry. Such a module would
be useful in other domains, notably realtime obstacle detection, where speed is of the
utmost importance and the use of realtime hardware is mandatory. Convolution can
usually be handled with commercially available products. Canny edge detection is a stable
algorithm and requires relatively little additional hardware to obtain framerate performance.
AIVRU's intention is to build the next generation of fast vision engine using T9000
transputers as the network processor and to integrate this with a series of framerate
boards such as the ones just described to obtain full frame stereo in under a second.
5 Acknowledgements
Thanks to Phil McLauchlan and Pete Furness for insightful comments and kind support.
References
1. Ayache, N.: Artificial Vision for Mobile Robots. MIT Press (1991)
2. Canny, J.F.: A Computational Approach to Edge Detection. IEEE PAMI-8 (1986) 679-
698
3. Courtney P.: Evaluation of Opportunities for Framerate Hardware within AIVRU.
AIVRU Research Memo 51 (November 1990)
4. Courtney P.: Canny Post-Processing. AIVRU Internal Report (June 1991)
5. Courtney P., Thacker, N.A. and Brown, C.R.: Hardware Design for Realtime Image
Rectification. AIVRU Research Memo 60 (September 1991)
6. Harris, C. and Stephens, M.: A Combined Corner and Edge Detector. Proc. 4th Alvey
Vision Conference, Manchester, England (1988) 147-151
7. Mallot, H.A., Schulze, E. and Storjohann, K.: Neural Network Strategies for Robot Nav-
igation. Proc. nEuro '88, Paris. G. Dreyfus and L. Personnaz (Eds.) (1988)
8. Mayhew, J.E.W. and Frisby, J.P.: 3D Model Recognition from Stereoscopic Cues. MIT
Press (1990)
9. Rygol, M., Pollard, S.B. and Brown, C.R.: MARVIN and TINA: a Multiprocessor 3-D
Vision System. Concurrency 3(4) (1991) 333-356
This article was processed using the LaTeX macro package with ECCV92 style
Author Index
Aggarwal, J.K., 720 Clément, V., 815
Ahuja, N., 217 Cohen, I., 458,648
Aloimonos, Y., 497 Cohen, L.D., 648
Amat, J., 160 Colchester, A.C.F., 725
Anandan, P., 237 Cootes, T., 852
Ancona, N., 267 Courtney, P., 902
Aoki, Y., 843 Cox, I.J., 72
Arbogast, E., 467 Craw, I., 92
Asada, M., 24 Crowley, J.L., 588
Aw, B.Y.K., 749 Culhane, S.M., 551
Ayache, N., 43, 458,620,648 Curwen, R., 879
Bajcsy, R., 99, 653 Daniilidis, K., 437
Baker, K.D., 277, 778 Davis, L.S., 335
Beardsley, P., 312 Dawson, K.M., 806
Bennett, A., 92 De Floriani, L., 368
Bergen, J.R., 237 DeMenthon, D.F., 335
Berthod, M., 67 Debrunner, C., 217
Bertrand, G., 710 Dhome, M., 681
Black, M. J., 485 Drew, M.S., 124
Blake, A., 187, 879 Durić, Z., 497
Bobet, P., 588 Edelman, S., 787
Bouthemy, P., 476 Eklundh, J.-O., 526, 701
Bowman, C., 272 Embrechts, H., 387
Brady, M., 272 Etoh, M., 24
Brauckmann, M., 865
Breton, P., 135 Farley, J., 893
Brockingston, M., 124 Faugeras, O.D., 203,227, 321,563
Brown, C.M., 542 Ferrie, F.P., 222
Brown, C.R., 902 Fisher, R.B., 801
Brunelli, R., 792 Fleck, M.M., 151
Brunie, L., 670 Florack, L.J., 19
Brunnström, K., 701 Florek, A., 38
Bruzzone, E., 368 Forsyth, D.A., 639, 757
Buchanan, Th., 730 Fua, P., 676
Buurman, J., 363 Funt, B.V., 124
Buxton, H., 884 Gårding, J., 630
Campani, M., 258 Gatti, M., 696
Casadei, S., 174 Geiger, D., 425
Casals, A., 160 Giraudon, G., 67,815
Cass, T.A., 773,834 Glachet, R., 681
Cazzanti, M., 368 Grau, A., 160
Chang, C., 420 Griffin, L.D., 725
Chatterjee, S., 420 Grimson, W.E.L., 291
Chen, X., 739 Grosso, E., 516
Cipolla, R., 187 Grzywacz, N.M., 212
Vol. 588: G. Sandini (Ed.), Computer Vision - ECCV '92. Proceedings. XV, 909 pages. 1992.
Referees
Garibotto G. Italy   Nordström N. Sweden
Amat J. Spain   Giraudon G. France
Andersson M.T. Sweden   Gong S. U.K.   Olofsson G. Sweden
Aubert D. France   Granlund G. Sweden
Ayache N. France   Gros P. France   Pahlavan K. Sweden
Grosso E. Italy   Pampagnin L.H. France
Bårman H. Sweden   Gueziec A. France   Papadopoulo T. France
Bascle B. France   Paternak B. Germany
Bellissant C. France   Haglund L. Sweden   Petrou M. France
Benayoun S. France   Heitz F. France   Puget P. France
Berger M.O. France   Hérault H. France
Bergholm F. Sweden   Herlin I.L. France   Quan L. France
Berroir J.P. France   Hoehne H.H. Germany
Berthod M. France   Hogg D. U.K.   Radig B. Germany
Besañez L. Spain   Horaud R. France   Reid I. U.K.
Betsis D. Sweden   Howarth R. U.K.   Richetin M. France
Beyer H. France   Hugog D. U.K.   Rives G. France
Blake A. U.K.   Hummel R. France   Robert L. France
Boissier O. France
Bouthemy P. France   Inglebert C. France   Sagerer G. Germany
Boyle R. U.K.   Izuel M.J. Spain   Sandini G. Italy
Brady M. U.K.   Sanfeliu A. Spain
Burkhardt H. Germany   Juvin D. France   Schroeder C. Germany
Buxton B. U.K.   Seals B. France
Buxton H. U.K.   Kittler J. U.K.   Simmeth H. Germany
Knutsson H. Sweden   Sinclair D. U.K.
Calean D. France   Koenderink J. The Netherlands   Skordas Th. France
Carlsson S. Sweden   Koller D. Germany   Sommer G. Germany
Casals A. Spain   Sparr G. Sweden
Castan S. France   Lange S. Germany   Sprengel R. Germany
Celaya E. Spain   Lapreste J.T. France   Stein Th. von Germany
Chamley S. France   Levy-Vehel J. France   Stiehl H.S. Germany
Chassery J.M. France   Li M. Sweden
Chehikian A. France   Lindeberg T. Sweden   Thirion J.P. France
Christensen H. France   Lindsey P. U.K.   Thomas B. France
Cinquin Ph. France   Ludwig K.-O. Germany   Thomas F. Spain
Cohen I. France   Luong T. France   Thonnat M. France
Cohen L. France   Lux A. France   Tistarelli M. Italy
Crowley J.L. France   Toal A.F. U.K.
Curwen R. U.K.   Magrassi M. Italy   Torras C. Spain
Malandain G. France   Torre V. Italy
Dagless E. France   Martinez A. Spain   Träven H. Sweden
Daniilidis K. Germany   Maybank S.J. France
De Micheli E. Italy   Mayhew J. U.K.   Uhlin T. Sweden
Demazeau Y. France   Mazer E. France   Usoh M. U.K.
Deriche R. France   McLauchlan P. U.K.
Devillers O. France   Mesrabi M. France   Veillon F. France
Dhome M. France   Milford D. France   Verri A. Italy
Dickmanns E. Germany   Moeller R. Germany   Vieville T. France
Dinten J.M. France   Mohr R. France   Villanueva J.J. Spain
Dreschler-Fischer L. Germany   Monga O. France
Drewniok C. Germany   Montseny E. Spain   Wahl F. Germany
Morgan A. France   Westelius C.J. Sweden
Eklundh J.O. Sweden   Morin L. France   Westin C.F. Sweden
Wieske L. Germany
Faugeras O.D. France   Nagel H.H. Germany   Wiklund J. Sweden
Ferrari F. Italy   Nastar C. France   Winroth H. Sweden
Fossa M. Italy   Navab N. France   Wysocki J. U.K.
Fua P. France   Neumann B. Germany
Neumann H. Germany   Zerubia J. France
Gårding J. Sweden   Nordberg K. Sweden   Zhang Z. France